llama.cpp
The foundational C/C++ inference engine that pioneered consumer-hardware LLM deployment via quantization. Powers Ollama, LM Studio, GPT4All, and KoboldCpp.
llama.cpp is the C/C++ inference engine that made running large language models on consumer hardware practical. Created by Georgi Gerganov in March 2023, it introduced the GGML model format (since succeeded by GGUF) with quantization that compresses weights from 16-bit floats down to 2-4 bit integers, slashing memory requirements by 4-8x while preserving most output quality. Nearly every major local AI tool — Ollama, LM Studio, GPT4All, KoboldCpp — uses llama.cpp as its core inference backend.
Key Features
Aggressive quantization. The GGUF format supports quantization levels from Q2_K through Q8_0, letting you trade quality for memory savings. A 70B parameter model that requires about 140GB in 16-bit precision can run in roughly 40GB with Q4_K_M quantization and minimal perplexity loss.
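The memory arithmetic behind these numbers is straightforward: footprint scales with parameter count times bits per weight. A rough sketch of the estimate, using approximate effective bits-per-weight figures inferred from typical GGUF file sizes (these averages include quantization scales and are assumptions, not an official table):

```python
# Approximate effective bits per weight for a few GGUF quantization
# levels (assumption: averages derived from typical file sizes).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q2_K": 2.6,
}

def weight_gb(n_params: float, quant: str) -> float:
    """Approximate weight footprint in GB.

    Excludes the KV cache and runtime overhead, which add more
    memory on top of the weights themselves.
    """
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("F16", "Q4_K_M", "Q2_K"):
    print(f"70B @ {q}: ~{weight_gb(70e9, q):.0f} GB")
```

This reproduces the figures above: 70B parameters at 16 bits is 140GB, while Q4_K_M lands near 40GB. Actual usage is somewhat higher once the KV cache and compute buffers are allocated.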
Broad hardware support. llama.cpp runs on NVIDIA GPUs via CUDA, AMD GPUs via ROCm and Vulkan, Apple Silicon via Metal, and Intel GPUs via SYCL. It also compiles for ARM processors, making it viable on Android phones, Raspberry Pi boards, and iOS devices. CPU-only inference uses AVX2/AVX-512 SIMD instructions for maximum throughput on x86 hardware.
Built-in HTTP server. The llama-server binary provides an OpenAI-compatible API endpoint with support for completions, chat, embeddings, and multimodal inputs. It includes continuous batching for serving multiple concurrent users from a single model instance.
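Because the server speaks the OpenAI API shape, any standard HTTP client works against it. A minimal sketch using only the Python standard library, assuming a server started with something like `llama-server -m model.gguf` and listening on the default port 8080:

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "http://localhost:8080"):
    """Build a request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send (requires a running llama-server):
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Since the endpoint follows the OpenAI wire format, official OpenAI client libraries pointed at the local base URL work the same way.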
Active development. With thousands of contributors, llama.cpp tracks new model architectures within days of release. Support for Llama, Mistral, Phi, Qwen, Gemma, Command R, and dozens of other architectures ships regularly.
When to Use llama.cpp
Use llama.cpp directly when you need maximum control over inference parameters, want to build from source with specific hardware optimizations, or need to embed LLM inference into a C/C++ application. It is also the right choice for edge deployment on mobile devices and embedded systems.
Ecosystem Role
llama.cpp is the foundation layer of the local AI ecosystem. Higher-level tools like Ollama and LM Studio provide friendlier interfaces on top of it. If you want simplicity, use those tools. If you want control, compile llama.cpp yourself and configure every parameter directly.