Llama.cpp

Last updated on November 19, 2025

GitHub: @ggml-org/llama.cpp

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

Plain C/C++ implementation without any dependencies

Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks

AVX, AVX2, AVX512 and AMX support for x86 architectures

1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use

Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)

Vulkan and SYCL backend support

CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
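To make the quantization item above more concrete, here is a minimal Python sketch of block-wise 8-bit integer quantization: each block of weights is stored as small integer codes plus one shared scale. The block size of 32, the 16-bit scale, and the function names are assumptions chosen for illustration. This is not llama.cpp's actual code or on-disk format, and the lower-bit schemes in the list are more involved.

```python
import random

BLOCK_SIZE = 32  # assumed block length for this sketch


def quantize_block(values):
    """Map one block of floats to int8-range codes plus a single float scale."""
    amax = max(abs(v) for v in values)
    # One scale per block; guard against an all-zero block.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the block scale."""
    return [scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, code_bits=8, scale_bits=16):
    """Average storage cost per weight, counting the shared scale."""
    return (block_size * code_bits + scale_bits) / block_size


if __name__ == "__main__":
    random.seed(0)
    weights = [random.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE)]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"bits per weight: {bits_per_weight()}")
    print(f"max abs error:   {max_err:.5f}")
```

Under these assumptions each weight costs 8.5 bits instead of 32 for a float32, and the rounding error per weight is bounded by half the block scale. That is the memory-versus-accuracy trade the feature list refers to.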

Notes mentioning this note

Ollama

Open-source, local-first AI that started as a wrapper around [[Llama.cpp]]. Founders Jeffrey Morgan and Michael Chiang (University of...