Llama.cpp
- github: @ggml-org/llama.cpp
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (a rough usage sketch follows below)
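
The hybrid CPU+GPU path works by offloading a chosen number of transformer layers to the GPU and keeping the rest on the CPU. Below is a minimal sketch of what that looks like through the C API in llama.h; the function names (`llama_load_model_from_file`, `llama_new_context_with_model`) and the `n_gpu_layers` field are taken from recent versions of the header, not from this note, and tend to shift between releases.

```c
// Minimal sketch (not from the repo): load a GGUF model with partial GPU offload.
// Names follow recent llama.h but are version-dependent - treat as illustrative.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();

    // Offload the first 20 transformer layers to the GPU; the rest run on the CPU.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 20;

    struct llama_model * model = llama_load_model_from_file("model-Q4_K_M.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048; // context window size in tokens

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize the prompt, call llama_decode, sample tokens ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The same knob is exposed by the bundled CLI tools as a GPU-layers option, so the split between VRAM and system RAM can be tuned per model without recompiling.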