GGUF vs GPTQ vs AWQ vs EXL2: Model Quantization Format Comparison

Compare GGUF, GPTQ, AWQ, and EXL2 quantization formats for local LLMs. Quality retention, inference speed, VRAM usage, tooling support, and CPU compatibility analyzed with benchmark data.

Model quantization is what makes running large language models on consumer hardware possible, and the format you choose affects quality, speed, memory usage, and which tools you can use. GGUF, GPTQ, AWQ, and EXL2 are the four dominant quantization formats in the local LLM ecosystem as of 2026, each optimized for different hardware and use cases. This comparison provides the data-driven analysis you need to pick the right format for your models, hardware, and inference stack.

Quick Comparison

| Feature | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Full name | GPT-Generated Unified Format | GPT-Quantization | Activation-aware Weight Quantization | ExLlamaV2 format |
| Developer | Georgi Gerganov (llama.cpp) | IST Austria + community | MIT Han Lab | turboderp |
| Quantization type | Post-training (weight-only) | Post-training (weight-only) | Post-training (weight-only) | Post-training (mixed precision) |
| Bit widths | 2, 3, 4, 5, 6, 8-bit + IQ | 2, 3, 4, 8-bit | 4-bit (primarily) | Flexible (2-8 bit, per-layer) |
| CPU inference | Excellent (SIMD optimized) | No (GPU only) | No (GPU only) | No (GPU only) |
| NVIDIA GPU | Yes (CUDA offloading) | Yes (CUDA kernels) | Yes (CUDA kernels) | Yes (custom CUDA kernels) |
| AMD GPU | Yes (ROCm, Vulkan) | Limited | Limited | No |
| Apple Silicon | Yes (Metal) | No | No | No |
| Hybrid CPU+GPU | Yes (layer offloading) | No | No | No |
| Primary engines | llama.cpp, Ollama | AutoGPTQ, vLLM, TGI | AutoAWQ, vLLM, TGI | ExLlamaV2 |
| Calibration data | Optional (imatrix) | Required | Required | Required |
| Self-contained | Yes (weights + metadata) | Separate config needed | Separate config needed | Separate config needed |
| File structure | Single .gguf file | Model directory (safetensors) | Model directory (safetensors) | Model directory |
| HF availability | Thousands of models | Thousands of models | Thousands of models | Hundreds of models |

Quality / Size / Speed Comparison

The following table compares Llama 3.1 8B across formats at approximately 4-bit quantization, measured on an NVIDIA RTX 4090. Perplexity is measured on WikiText-2 (lower is better). FP16 baseline perplexity for this model is ~6.2.

| Format & Quant | Size (GB) | Perplexity | Generation (tok/s) | Prompt (tok/s) | VRAM (GB) |
|---|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 6.20 | ~75 | ~2,800 | 16.5 |
| GGUF Q4_K_M | 4.9 | 6.38 | ~110 | ~3,200 | 5.8 |
| GGUF Q4_K_S | 4.6 | 6.42 | ~115 | ~3,300 | 5.5 |
| GGUF IQ4_XS | 4.3 | 6.35 | ~105 | ~3,000 | 5.2 |
| GPTQ 4-bit 128g | 4.5 | 6.40 | ~120 | ~3,500 | 5.5 |
| GPTQ 4-bit 32g | 5.0 | 6.32 | ~105 | ~3,200 | 6.0 |
| AWQ 4-bit 128g | 4.5 | 6.36 | ~125 | ~3,800 | 5.4 |
| EXL2 4.0 bpw | 4.2 | 6.33 | ~150 | ~4,000 | 5.3 |
| EXL2 3.5 bpw | 3.7 | 6.55 | ~160 | ~4,200 | 4.8 |
| EXL2 5.0 bpw | 5.2 | 6.26 | ~135 | ~3,600 | 6.2 |
| GGUF Q5_K_M | 5.7 | 6.28 | ~100 | ~2,900 | 6.6 |
| GGUF Q6_K | 6.6 | 6.23 | ~90 | ~2,600 | 7.4 |
| GGUF Q8_0 | 8.5 | 6.21 | ~80 | ~2,200 | 9.3 |

Key observations:

  • EXL2 leads in speed on NVIDIA GPUs due to turboderp’s custom CUDA kernels optimized specifically for mixed-precision inference
  • AWQ is the fastest among standard formats because its activation-aware approach enables more efficient GPU kernels
  • GGUF offers the most size/quality options ranging from Q2 (aggressive compression) through Q8 (near-lossless)
  • Quality is remarkably close across all formats at 4-bit — the perplexity difference is typically under 0.1, which is imperceptible in most practical use
  • GGUF IQ quantizations (imatrix-based) punch above their weight, achieving quality comparable to GPTQ/AWQ at similar sizes

Quality Retention Deep Dive

GGUF

GGUF offers the widest range of quantization levels, from extremely aggressive Q2_K (2-bit) to near-lossless Q8_0 (8-bit). The K-quant variants (Q4_K_M, Q5_K_M, etc.) use mixed precision within the model — more important layers get higher precision, less important layers get lower precision.

The IQ (importance-matrix) quantizations represent the state of the art in GGUF quality. By measuring the importance of each weight using calibration data, imatrix quantizations allocate bits more intelligently. IQ4_XS achieves quality comparable to standard Q4_K_M while being 10-15% smaller.

GGUF’s quality advantage is flexibility: you can choose exactly the quality-size tradeoff you need, from aggressive 2-bit compression for memory-constrained systems to 8-bit for maximum quality.
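The core idea behind importance-matrix quantization can be sketched in a few lines: instead of minimizing plain rounding error, the quantizer minimizes error weighted by how much each weight matters on calibration data. The sketch below is illustrative only — llama.cpp's actual K-quant and IQ schemes are considerably more elaborate, and the importance values here are random stand-ins for real calibration statistics.

```python
import numpy as np

def importance_weighted_scale(weights, importance, n_bits=4):
    """Pick the quantization scale that minimizes *importance-weighted*
    rounding error over a small grid of candidate scales."""
    qmax = 2 ** (n_bits - 1) - 1  # symmetric int4 range: [-8, 7]
    candidates = np.linspace(0.5, 1.5, 21) * (np.abs(weights).max() / qmax)
    best_scale, best_err = candidates[0], np.inf
    for scale in candidates:
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
        err = np.sum(importance * (weights - q * scale) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
imp = rng.random(256).astype(np.float32)  # stand-in for calibration-derived importance
uniform = importance_weighted_scale(w, np.ones_like(w))  # plain quantization
weighted = importance_weighted_scale(w, imp)             # imatrix-style
```

By construction, the importance-weighted scale never does worse than the uniform one on the weighted error metric — which is exactly why imatrix quants preserve quality at smaller sizes.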

GPTQ

GPTQ quantizes weights by solving a layer-wise quantization optimization problem using second-order information (the Hessian matrix). The group size parameter controls the granularity of quantization — smaller groups (32) preserve more quality at a slight size cost, larger groups (128) are more compact.

GPTQ produces consistent, reliable quantizations. The format is well-understood and widely supported. Quality at 4-bit with group size 128 is comparable to GGUF Q4_K_M.
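The size cost of smaller groups is easy to quantify: each group stores its own scale, so the effective bits per weight rise as groups shrink. A back-of-envelope sketch (assuming one fp16 scale per group; real GPTQ checkpoints also carry zero-points and other metadata, so actual files are somewhat larger):

```python
def effective_bpw(n_bits=4, group_size=128, scale_bits=16):
    """Effective bits/weight for group-wise quantization with one
    fp16 scale stored per group (metadata beyond scales ignored)."""
    return n_bits + scale_bits / group_size

bpw_128 = effective_bpw(group_size=128)  # 4 + 16/128 = 4.125 bits/weight
bpw_32 = effective_bpw(group_size=32)    # 4 + 16/32  = 4.5   bits/weight

# For an 8B-parameter model, the group-size choice alone shifts weight data by:
size_128_gb = 8e9 * bpw_128 / 8 / 1e9  # 4.125 GB
size_32_gb = 8e9 * bpw_32 / 8 / 1e9    # 4.5 GB
```

This roughly matches the benchmark table above, where the 32g variant is about half a gigabyte larger than 128g in exchange for slightly lower perplexity.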

AWQ

AWQ (Activation-aware Weight Quantization) identifies weights that are important based on activation patterns in calibration data, and protects those weights during quantization. This activation-aware approach achieves slightly better quality than GPTQ at the same bit width in many benchmarks.

AWQ’s primary advantage is that its activation-aware approach produces models that are slightly easier for GPU kernels to dequantize efficiently, contributing to its speed advantage.
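The key mechanism is per-channel scaling: weight channels with large average activations are scaled up before quantization (so they lose less precision to rounding), and the inverse scale is folded into the preceding operation, leaving the network's output mathematically unchanged before quantization. A simplified sketch — real AWQ searches the scaling exponent per layer against calibration data rather than fixing it:

```python
import numpy as np

def awq_style_scaling(W, act_magnitude, alpha=0.5):
    """Scale input channels of W by activation magnitude ** alpha;
    return the scaled weights and the inverse scales to fold into
    the previous layer. Exact (lossless) before quantization."""
    s = act_magnitude ** alpha
    s = s / s.mean()          # keep overall magnitude stable
    return W * s, 1.0 / s     # scaled weights, inverse input scales

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))               # (out_features, in_features)
act = np.abs(rng.normal(size=16)) + 0.1    # per-input-channel activation stats
W_scaled, inv_s = awq_style_scaling(W, act)

# Before quantization the transform is exact: W @ x == W_scaled @ (inv_s * x)
x = rng.normal(size=16)
```

After this rescaling, uniform 4-bit rounding of `W_scaled` damages the activation-salient channels less than rounding `W` directly would.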

EXL2

EXL2’s per-layer mixed-precision approach is the most flexible. During quantization, EXL2 allocates bits per layer based on measured sensitivity — layers that affect output quality most get more bits, while less sensitive layers are compressed more aggressively. You specify a target bits-per-weight (bpw), and EXL2 distributes bits optimally across layers.

This flexibility means EXL2 can target any bpw between ~2.0 and ~8.0, and the quality at any given average bpw tends to match or exceed other formats because the bit allocation is globally optimized rather than uniform.
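The allocation step can be sketched as a greedy loop: start every layer at the lowest precision, then keep upgrading the most sensitive layers until the average hits the target bpw. This is a toy version — the real EXL2 quantizer measures per-layer quantization error against calibration data rather than using a scalar sensitivity score, and its candidate precisions are finer-grained.

```python
def allocate_bits(sensitivity, target_bpw, choices=(2.5, 3.0, 4.0, 5.0, 6.0)):
    """Greedily raise per-layer precision, most sensitive layers first,
    until the average bits-per-weight reaches target_bpw."""
    n = len(sensitivity)
    idx = [0] * n  # index into `choices` for each layer
    order = sorted(range(n), key=lambda i: -sensitivity[i])
    while sum(choices[i] for i in idx) / n < target_bpw:
        for layer in order:
            if idx[layer] < len(choices) - 1:
                idx[layer] += 1
                break
        else:
            break  # every layer already at maximum precision
    return [choices[i] for i in idx]

# Six layers; a higher score means the layer is more sensitive to quantization
bits = allocate_bits([0.9, 0.2, 0.7, 0.1, 0.5, 0.3], target_bpw=4.0)
```

The result is a non-uniform allocation whose average meets the requested bpw, with the sensitive layers holding the most precision — the same shape of outcome EXL2 produces per model.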

Speed Analysis

On NVIDIA GPU

EXL2 is fastest on NVIDIA GPUs because ExLlamaV2 uses custom CUDA kernels that are purpose-built for mixed-precision dequantization and matrix multiplication. These kernels are optimized for the RTX 3000/4000/5000 series consumer GPUs.

AWQ is second-fastest because its quantization approach produces weight layouts that are efficiently dequantized by GPU kernels. The Marlin and GEMM kernels used for AWQ inference in vLLM are highly optimized.

GPTQ speed depends on the kernel implementation. With the fast Marlin kernel (available in vLLM and newer AutoGPTQ versions), GPTQ matches AWQ speeds. With older kernels, it is 10-20% slower.

GGUF on NVIDIA GPU is competitive but typically 10-30% slower than EXL2/AWQ because llama.cpp’s CUDA kernels are designed for generality (supporting many quantization types) rather than being specialized for a single format.

On CPU

Only GGUF supports efficient CPU inference. llama.cpp’s CPU kernels use SIMD instructions (AVX2, AVX-512 on x86; NEON on ARM) for fast dequantization and matrix multiplication. No other format has comparable CPU support.

On Apple Silicon

Only GGUF supports Apple Silicon inference through Metal compute shaders. MLX uses its own format. GPTQ, AWQ, and EXL2 do not work on Apple Silicon.

VRAM Usage

VRAM usage is primarily determined by model size after quantization, plus overhead for the KV cache and engine runtime.

At 4-bit quantization, all formats produce similarly sized models (within 10-15% of each other for the same model). The differences come from:

  • Group size: Smaller groups (32 vs 128) add overhead for per-group parameters
  • Mixed precision: EXL2 and GGUF K-quants use varying precision per layer, so VRAM depends on the specific bit allocation
  • Engine overhead: ExLlamaV2 has minimal overhead (~200 MB). vLLM has more overhead (~500-800 MB) due to PagedAttention data structures. llama.cpp overhead varies by configuration.
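These components can be combined into a rough VRAM estimate: quantized weight bytes, plus KV cache (2 tensors per layer, K and V, sized by KV heads, head dimension, and context length), plus engine overhead. The sketch below plugs in Llama 3.1 8B's published architecture (32 layers, 8 KV heads via GQA, head dimension 128); the 0.5 GB overhead is a guess that varies by engine, per the bullet above.

```python
def estimate_vram_gb(params_b, bpw, n_layers, n_kv_heads, head_dim,
                     ctx_len, kv_bytes=2, overhead_gb=0.5):
    """Rough VRAM estimate in GB: quantized weights + fp16 KV cache
    + engine overhead. kv_bytes=2 assumes fp16 K/V entries."""
    weights_gb = params_b * 1e9 * bpw / 8 / 1e9
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

# ~4.5 effective bits/weight on an 8B model with an 8K fp16 KV cache:
total = estimate_vram_gb(params_b=8, bpw=4.5, n_layers=32,
                         n_kv_heads=8, head_dim=128, ctx_len=8192)
# weights ≈ 4.5 GB, KV cache ≈ 1.07 GB, total ≈ 6.1 GB
```

Doubling the context length doubles only the KV-cache term, which is why long-context runs can blow past the per-model VRAM figures in the benchmark table.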

For VRAM-constrained scenarios (fitting the largest possible model in limited VRAM), EXL2 at a low bpw (3.0-3.5) or GGUF Q3_K types offer the most aggressive compression while maintaining usable quality.

Tooling Support

| Tool | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Ollama | Yes | No | No | No |
| LM Studio | Yes | No | No | No |
| llama.cpp | Yes | No | No | No |
| vLLM | No | Yes | Yes | No |
| ExLlamaV2 | No | Yes | No | Yes |
| Text Generation Inference | No | Yes | Yes | No |
| Hugging Face Transformers | No | Yes | Yes | No |
| Jan | Yes | No | No | No |
| GPT4All | Yes | No | No | No |
| Kobold.cpp | Yes | No | No | No |
| TabbyAPI | No | Yes | No | Yes |
| Mullama | Yes | No | No | No |

GGUF dominates the consumer/desktop tooling ecosystem. GPTQ and AWQ dominate the GPU server ecosystem. EXL2 is supported by a smaller but dedicated set of tools.

CPU Compatibility

This dimension is decisive for many users.

GGUF is the only format with production-quality CPU inference. llama.cpp’s CPU backend supports:

  • x86-64: AVX2 (most modern CPUs), AVX-512 (server CPUs, recent Intel desktop)
  • ARM: NEON (Apple Silicon, Raspberry Pi 4/5, ARM servers)
  • Hybrid CPU+GPU: offload specific layers to GPU while running others on CPU

This makes GGUF the only option for users without a dedicated GPU, users with limited VRAM who need CPU+GPU hybrid inference, and Apple Silicon Macs.

GPTQ, AWQ, and EXL2 are GPU-only formats. They rely on GPU-specific dequantization kernels and do not have optimized CPU implementations. While some libraries can technically run them on CPU, performance is impractical for interactive use.

When to Choose Each Format

Choose GGUF if you:

  • Use Ollama, LM Studio, llama.cpp, or any desktop AI tool
  • Need CPU inference or CPU+GPU hybrid inference
  • Run on Apple Silicon
  • Want the widest range of quantization options
  • Need a single self-contained model file
  • Prioritize tooling compatibility over raw GPU speed

Choose GPTQ if you:

  • Deploy on NVIDIA GPU servers using vLLM or TGI
  • Want a well-established, widely-supported GPU format
  • Need 4-bit or 8-bit quantization with good quality
  • Are already in the Hugging Face Transformers ecosystem

Choose AWQ if you:

  • Deploy on NVIDIA GPUs and want the best speed among standard formats
  • Use vLLM or TGI for serving
  • Want efficient GPU kernels (Marlin)
  • Prioritize inference speed alongside quality

Choose EXL2 if you:

  • Have an NVIDIA consumer GPU and want maximum speed
  • Want fine-grained control over bits-per-weight
  • Use ExLlamaV2 or TabbyAPI
  • Need the best quality-per-bit through mixed-precision allocation

The Bottom Line

The quantization format landscape is divided by hardware. If you run inference on CPU, Apple Silicon, or need maximum tool compatibility, GGUF is the only practical choice. If you run inference on NVIDIA GPUs, all four formats work, and the choice comes down to your inference engine (Ollama demands GGUF, vLLM prefers AWQ/GPTQ, ExLlamaV2 demands EXL2). Quality differences between formats at the same bit width are small enough that tool compatibility and inference speed should drive your decision, not quality concerns.

Frequently Asked Questions

Which quantization format has the best quality?

At the same bit width, EXL2 and GGUF (with importance-matrix quantization) generally retain the highest quality. EXL2's per-layer mixed quantization optimizes bit allocation based on each layer's sensitivity. GGUF's IQ quantizations (imatrix-based) achieve similar quality optimization. GPTQ and AWQ are close behind at 4-bit. The practical quality difference between all four at 4-bit is often negligible for most use cases.

Can I run GPTQ or AWQ models on CPU?

GPTQ and AWQ are designed for GPU inference and rely on GPU-specific dequantization kernels. Running them on CPU is either unsupported or extremely slow. If you need CPU inference, GGUF is the only practical choice — it was specifically designed for efficient CPU execution with SIMD optimizations (AVX2, AVX-512, ARM NEON).

Which format should I use with Ollama?

GGUF. Ollama exclusively uses the GGUF format. When you run 'ollama pull llama3.2', you are downloading a GGUF-quantized model. If you have a GPTQ, AWQ, or EXL2 model, you would need to convert it to GGUF (or find a GGUF version) to use it with Ollama.