ExLlamaV2
Fastest inference engine for consumer NVIDIA GPUs. Custom CUDA kernels and EXL2 quantization format deliver maximum tokens per second on desktop hardware.
ExLlamaV2 is a high-performance inference engine built specifically for consumer NVIDIA GPUs. It achieves some of the fastest token generation speeds on desktop-class hardware through custom CUDA kernels and its own EXL2 quantization format, which supports per-layer mixed-precision quantization. For enthusiasts and developers running LLMs on consumer GPUs like the RTX 3090 or 4090 who want maximum tokens per second, ExLlamaV2 regularly tops single-GPU benchmark comparisons.
Key Features
Custom CUDA kernels. ExLlamaV2 implements hand-written CUDA kernels for matrix multiplication, attention, and quantized operations, tuned for consumer GPU architectures. By fusing dequantization into the matrix multiply itself, these kernels avoid materializing full-precision weights and outperform general-purpose GPU libraries on desktop-grade hardware, where memory bandwidth is the bottleneck.
EXL2 quantization format. The EXL2 format supports mixed-precision quantization at the layer level. Instead of applying uniform 4-bit quantization across all layers, EXL2 assigns more bits to sensitive layers and fewer to robust ones. This produces better quality at the same average bits-per-weight compared to uniform quantization methods like GPTQ.
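To make the trade-off concrete, here is a toy calculation of average bits-per-weight under a mixed-precision assignment. The layer names, parameter counts, and bit widths below are hypothetical illustrations, not the output of any real EXL2 measurement pass:

```python
# Hypothetical per-layer bit assignments: more bits for sensitive
# layers (embeddings, output head), fewer for robust MLP blocks.
layer_bits = {
    "embed_tokens":    6.0,
    "attn_layers_avg": 4.1,
    "mlp_layers_avg":  3.6,
    "lm_head":         6.0,
}
# Hypothetical parameter counts in millions for a 7B-class model.
layer_params = {
    "embed_tokens":    262,
    "attn_layers_avg": 2147,
    "mlp_layers_avg":  4295,
    "lm_head":         262,
}

total_bits = sum(layer_bits[k] * layer_params[k] for k in layer_bits)
total_params = sum(layer_params.values())
avg_bpw = total_bits / total_params  # parameter-weighted average
print(f"average bits per weight: {avg_bpw:.2f}")
```

The parameter-weighted average lands near 4.0 bpw even though no single layer is quantized at exactly 4 bits, which is how EXL2 can hit an arbitrary target average while protecting the layers that matter most.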
Variable bits-per-weight. EXL2 models can target any average bits-per-weight — 2.5, 3.0, 4.0, 5.0, 6.5, or anywhere in between. This flexibility lets you precisely balance quality versus VRAM usage for your specific GPU.
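A back-of-the-envelope sketch of the quality-versus-VRAM trade. The model size and bit rates are illustrative, and the estimate covers the quantized weights only, ignoring KV cache and activation overhead:

```python
def weight_vram_gib(params_billion: float, bpw: float) -> float:
    """Rough size of the quantized weights alone, in GiB.

    Excludes KV cache, activations, and framework overhead,
    so treat the result as a lower bound on required VRAM.
    """
    return params_billion * 1e9 * bpw / 8 / 1024**3

# A 70B-parameter model at two different average bit rates:
tight = weight_vram_gib(70, 2.5)  # ~20.4 GiB: squeezes onto a 24 GB card
roomy = weight_vram_gib(70, 4.0)  # ~32.6 GiB: needs a second GPU
print(f"2.5 bpw: {tight:.1f} GiB, 4.0 bpw: {roomy:.1f} GiB")
```

This is the practical appeal of fractional bit rates: a 2.5 bpw build of a 70B model is the difference between fitting a single RTX 3090/4090 and not.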
Flash Attention integration. ExLlamaV2 integrates Flash Attention for memory-efficient long-context inference. This extends the practical context length achievable within consumer GPU VRAM budgets.
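For context-length budgeting, the KV cache is usually the limiting factor once the weights fit. A rough per-sequence estimate, using hypothetical shapes resembling an 8B-class model with grouped-query attention (the defaults here are assumptions, not any specific model's config):

```python
def kv_cache_gib(seq_len: int,
                 n_layers: int = 32,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache size in GiB.

    Two tensors (K and V) per layer; bytes_per_elem=2 assumes an
    FP16 cache. Quantized cache formats would shrink this further.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem) / 1024**3

print(f"8K context:  {kv_cache_gib(8192):.1f} GiB")
print(f"32K context: {kv_cache_gib(32768):.1f} GiB")
```

The cache grows linearly with context length, so on a 24 GB card the headroom left after the weights directly caps how long a context you can actually run.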
Dynamic batching and paged attention. For multi-user serving scenarios, ExLlamaV2 supports paged attention and dynamic batching through its TabbyAPI server, handling concurrent requests efficiently.
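TabbyAPI exposes an OpenAI-compatible HTTP API, so existing OpenAI client code points at the local server with minimal changes. A minimal sketch of the request shape; the host, port, and model name below are placeholders, not defaults taken from any particular install:

```python
import json

# Placeholder endpoint and model name -- adjust for your TabbyAPI setup.
ENDPOINT = "http://localhost:5000/v1/chat/completions"

payload = {
    "model": "my-exl2-model",  # placeholder: whatever model the server loaded
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
    "stream": False,
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
#   curl $ENDPOINT -H "Content-Type: application/json" -d "$body"
print(body)
```

Because the request shape matches the OpenAI chat completions format, frontends like SillyTavern that speak that protocol connect to TabbyAPI without custom integration work.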
GPTQ compatibility. ExLlamaV2 loads standard GPTQ quantized models in addition to its native EXL2 format, providing access to the large library of existing GPTQ models on Hugging Face.
When to Use ExLlamaV2
Choose ExLlamaV2 when you have an NVIDIA GPU and want the absolute fastest inference speed on consumer hardware. It is the best option for RTX 3090/4090 users who prioritize tokens-per-second, users who want fine-grained control over quantization quality, and developers using text-generation-webui or TabbyAPI as frontends.
Ecosystem Role
ExLlamaV2 targets the enthusiast performance tier between llama.cpp (broad compatibility) and TensorRT-LLM (data center throughput). It is NVIDIA-only and consumer-GPU-focused. It integrates as a backend for text-generation-webui, SillyTavern via TabbyAPI, and other tools. For cross-platform or AMD/Apple support, llama.cpp is the alternative. For enterprise NVIDIA deployments, TensorRT-LLM is the better fit.