Local AI Glossary: 80+ Terms Explained

A comprehensive glossary of local AI terminology — from attention mechanisms to zero-shot learning. The definitive reference for anyone deploying AI locally.

This glossary defines over 80 key terms used in the local AI ecosystem — covering model architecture, quantization formats, inference concepts, training methods, hardware terminology, file formats, and retrieval techniques. Terms are organized alphabetically for quick lookup, and each definition is written to be self-contained while linking to related concepts within the glossary.

Whether you are setting up your first Ollama model or optimizing a multi-GPU vLLM deployment, this glossary will help you understand the terminology you encounter in documentation, community discussions, and research papers.


A

Activation — The output produced by a neuron or layer in a neural network after applying a non-linear function to its inputs. Activations flow forward through the network during inference. Some quantization methods (like AWQ) analyze activations to determine which weights are most important.

Adapter — A small trainable module inserted into a pre-trained model to modify its behavior without changing the original weights. LoRA adapters are the most common type, typically adding only 0.1-1% additional parameters. Adapters can be swapped in and out to give a single base model different capabilities.

AGI (Artificial General Intelligence) — A hypothetical AI system capable of understanding and learning any intellectual task that a human can. Current LLMs, including the best local models, are narrow AI — they excel at language tasks but lack general reasoning, embodiment, and true understanding.

Alignment — The process of training a model to behave in accordance with human intentions, values, and instructions. Alignment techniques include RLHF, DPO, and supervised fine-tuning on instruction-following datasets. Aligned models are more helpful, less harmful, and more likely to follow user instructions.

Attention mechanism — The core innovation of the Transformer architecture. Attention allows each token in a sequence to “attend to” (consider the relevance of) every other token, creating a weighted representation of context. Self-attention computes query, key, and value matrices for each token, enabling the model to focus on the most relevant parts of the input.
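The mechanism described above can be sketched in a few lines of NumPy. This is a single-head toy with random weights, not any particular model's implementation:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (toy sketch)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of every token to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # one context-aware vector per token: (4, 8)
```

Real models run many such heads in parallel (multi-head attention) and add causal masking so tokens cannot attend to the future.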

AWQ (Activation-Aware Weight Quantization) — A quantization method that identifies and preserves the most important weights based on their impact on model activations. AWQ achieves strong quality retention at 4-bit precision and is optimized for fast inference on modern NVIDIA GPUs (Ampere and newer). Commonly used with vLLM and TGI.

B

Batch size — The number of input sequences processed simultaneously during inference or training. Larger batch sizes increase throughput (total tokens per second across all requests) but require more memory. In local AI serving, batch size determines how many concurrent users the system can handle efficiently.

Beam search — A text generation strategy that explores multiple candidate sequences simultaneously, keeping the top-k most probable sequences at each step. Produces more coherent and higher-quality output than greedy decoding but is slower and uses more memory. Rarely used in interactive local chat but useful for translation and summarization.

BF16 (BFloat16) — A 16-bit floating-point format developed by Google Brain with the same exponent range as FP32 but fewer mantissa bits than FP16. BF16 handles a wider range of values than FP16, making it better for training. Functionally equivalent to FP16 for inference quality.

Bits per weight (bpw) — A measure of quantization precision indicating the average number of bits used to store each model weight. FP16 uses 16 bpw, Q4_K_M uses approximately 4.8 bpw, and Q2_K uses approximately 2.7 bpw. Lower bpw means smaller model files and lower memory usage, but reduced quality.
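The file-size arithmetic is simple enough to sketch directly (decimal gigabytes; the bpw figures are the approximate values quoted above):

```python
def model_file_gb(n_params_billion, bpw):
    """Approximate model file size in GB from parameter count and bits per weight."""
    bits = n_params_billion * 1e9 * bpw
    return bits / 8 / 1e9  # bits -> bytes -> decimal GB

for fmt, bpw in [("FP16", 16), ("Q4_K_M", 4.8), ("Q2_K", 2.7)]:
    print(f"7B at {fmt}: ~{model_file_gb(7, bpw):.1f} GB")
# 7B at FP16: ~14.0 GB
# 7B at Q4_K_M: ~4.2 GB
# 7B at Q2_K: ~2.4 GB
```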

C

Calibration data — A representative dataset used during quantization (in GPTQ, AWQ, and EXL2) to measure the impact of weight compression on model outputs. The quantization algorithm uses this data to minimize the difference between the full-precision and quantized model. Typically a few hundred examples from a general text corpus.

Chat template — A formatting specification that defines how user messages, assistant responses, and system prompts are structured for a particular model. Different model families use different templates (e.g., Llama uses <|begin_of_text|> tokens, Mistral uses [INST] markers). Using the wrong template causes poor-quality outputs.
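As an illustrative sketch, a simplified Mistral-style formatter might look like the following. Real templates ship in each model's tokenizer_config.json, and in practice you would use the Transformers library's apply_chat_template rather than hand-rolling the format:

```python
def mistral_prompt(messages):
    """Format a chat as a simplified Mistral-style [INST] prompt.
    This is a sketch of the pattern, not the exact official template."""
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST] {msg['content']} [/INST]"
        elif msg["role"] == "assistant":
            out += f" {msg['content']}</s>"
    return out

print(mistral_prompt([{"role": "user", "content": "Hello!"}]))
# <s>[INST] Hello! [/INST]
```

Feeding this string to a Llama model (which expects <|begin_of_text|>-style tokens) is exactly the kind of template mismatch that produces poor-quality outputs.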

Checkpoint — A saved snapshot of a model’s weights, optimizer state, and training progress at a particular point during training. Checkpoints allow training to be resumed after interruption. In the context of local AI, “checkpoint” often refers to a raw model file before conversion to an inference-optimized format.

Chunking — The process of splitting documents into smaller segments (chunks) for use in RAG pipelines. Effective chunking strategies balance chunk size (too small loses context, too large dilutes relevance) with overlap (adjacent chunks share some text to avoid splitting relevant information). Common chunk sizes range from 256 to 1024 tokens.
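A minimal chunker over token IDs, assuming example values of 512-token chunks with 64 tokens of overlap:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping chunks for a RAG pipeline."""
    step = chunk_size - overlap          # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # last chunk reached the end of the document
    return chunks

tokens = list(range(1200))               # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])          # [512, 512, 304]
```

The tail of each chunk repeats as the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.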

Context length (context window) — The maximum number of tokens a model can consider at once, including both the input prompt and the generated output. Common context lengths range from 4,096 to 128,000 tokens. Longer context allows processing larger documents but requires more memory for the KV-cache.

Continuous batching — An inference optimization used by engines like vLLM where new requests are added to the processing batch as soon as existing requests complete, rather than waiting for the entire batch to finish. This maximizes GPU utilization and throughput.

CUDA — NVIDIA’s parallel computing platform and programming model for GPU-accelerated computing. CUDA is the foundation of most GPU-accelerated AI inference. Tools like llama.cpp, vLLM, and PyTorch use CUDA to run computations on NVIDIA GPUs. CUDA is exclusive to NVIDIA hardware.

D

Dense model — A model architecture where every parameter is used for every input token, as opposed to mixture-of-experts (MoE) models where only a subset of parameters is active per token. Llama, Gemma, and Phi are dense models. Dense models have predictable resource usage — a 7B dense model always uses all 7B parameters.

DPO (Direct Preference Optimization) — An alignment training technique that directly optimizes the model to prefer human-preferred outputs over rejected outputs, without needing a separate reward model (unlike RLHF). DPO is simpler and more stable than RLHF, making it popular for fine-tuning local models.

E

Embedding — A dense numerical vector representation of text (or images, audio, etc.) in a high-dimensional space where semantically similar items are close together. Embedding models convert text into fixed-size vectors used for semantic search, clustering, and RAG. Common embedding models for local use include nomic-embed-text and bge-large.

Embedding model — A model specifically designed to produce embedding vectors from input text, as opposed to generative language models that produce text output. Embedding models are typically small (100M-500M parameters) and fast. They are used in RAG pipelines to convert both documents and queries into vectors for similarity search.
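The similarity computation underneath is a sketchable one-liner. The tiny 4-dimensional vectors below are made up for illustration; a real model like nomic-embed-text outputs hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_ram  = np.array([0.9, 0.1, 0.0, 0.2])   # toy embedding of a RAM doc
doc_cook = np.array([0.0, 0.9, 0.8, 0.1])   # toy embedding of a cooking doc
query    = np.array([0.85, 0.15, 0.05, 0.25])

print(round(cosine_similarity(query, doc_ram), 3))   # high: related topics
print(round(cosine_similarity(query, doc_cook), 3))  # low: unrelated topics
```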

Epoch — One complete pass through the entire training dataset during model training or fine-tuning. Fine-tuning typically uses 1-5 epochs. Too many epochs risk overfitting (the model memorizes training data instead of generalizing).

EXL2 — A quantization format used by ExLlamaV2 that supports variable bitrate across model layers. EXL2 assigns more bits to important layers and fewer to less important ones, achieving better quality at the same average bit rate compared to uniform quantization methods. NVIDIA GPU only.

ExLlamaV2 — A high-performance CUDA-only inference engine for running quantized LLMs on NVIDIA GPUs. Known for some of the fastest token generation speeds, particularly with EXL2 and GPTQ quantized models. Does not support CPU inference or non-NVIDIA hardware.

F

FFN (Feed-Forward Network) — A component of each Transformer layer consisting of two linear transformations with a non-linear activation function in between. The FFN processes each token independently (unlike attention, which processes relationships between tokens) and is where much of the model’s factual knowledge is stored.

Fine-tuning — The process of continuing to train a pre-trained model on a specific dataset to adapt it for a particular task, domain, or behavior. Fine-tuning modifies the model’s weights. Common fine-tuning methods include full fine-tuning (modifying all weights), LoRA (modifying small adapter weights), and QLoRA (LoRA with a quantized base model).

FP16 (Float16) — A 16-bit floating-point number format that can represent values with approximately 3.3 decimal digits of precision. The standard format for storing and distributing LLM weights. Each weight occupies 2 bytes, so a 7B model at FP16 requires approximately 14 GB.

FP32 (Float32) — A 32-bit floating-point format with approximately 7.2 decimal digits of precision. Used internally during some computations but rarely for storing LLM weights because it doubles memory usage with negligible quality improvement over FP16.

G

GGML — The predecessor to GGUF, created by Georgi Gerganov for the llama.cpp project. GGML was the original binary format for quantized models. It has been fully replaced by GGUF, which adds metadata, tokenizer information, and better extensibility. Legacy format — do not use for new deployments.

GGUF (GPT-Generated Unified Format) — The standard quantization format for llama.cpp and the broader local AI ecosystem. GGUF files are self-contained (including weights, tokenizer, and model configuration), support quantization levels from Q2 to Q8, and can run on CPU, NVIDIA GPU, AMD GPU, and Apple Silicon. The most universally supported format for local AI.

GPTQ — A GPU-optimized post-training quantization method (from the paper “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”) that uses calibration data and second-order error correction to minimize quality loss during quantization. GPTQ models are stored in SafeTensors format and run on NVIDIA GPUs via ExLlamaV2, vLLM, or AutoGPTQ. Typically used for 4-bit quantization.

Greedy decoding — A text generation strategy that always selects the most probable next token at each step. Fast and deterministic but often produces repetitive, unimaginative text. Temperature and top-p sampling are preferred for more natural and varied outputs.

Group size — In GPTQ and AWQ quantization, the number of weights that share a single set of scaling parameters. A group size of 128 means every 128 consecutive weights use the same scale and zero-point. Smaller group sizes (64, 32) yield better quality but larger files. 128 is the standard default.

H

Hallucination — When a model generates information that is factually incorrect, nonsensical, or fabricated while presenting it as if it were true. Hallucination is a fundamental limitation of language models — they generate plausible text, not verified facts. RAG can reduce hallucination by grounding outputs in retrieved documents.

Hugging Face — The central platform for sharing open-weight AI models, datasets, and tools. Most models available for local AI are hosted on Hugging Face (huggingface.co). The Transformers library by Hugging Face is the standard Python library for working with AI models.

I

Importance matrix (imatrix) — A data structure that records which model weights have the greatest impact on output quality, generated by running calibration data through the model. Used by GGUF’s IQ quantization variants to allocate more bits to important weights and fewer to unimportant ones, improving quality at low bit rates.

Inference — The process of running a trained model to generate predictions or outputs from new inputs. In the context of LLMs, inference means processing a prompt and generating a text response. Inference is what happens when you chat with a model — as opposed to training, which is how the model learns.

Inference engine — Software that loads a model into memory and executes inference. Examples include llama.cpp, Ollama, vLLM, ExLlamaV2, MLX, and TGI. The inference engine handles tokenization, attention computation, KV-cache management, and token sampling.

Instruct model — A model variant that has been fine-tuned on instruction-following datasets (conversations where a user asks a question and an assistant provides a helpful answer). Instruct models are designed for interactive use. Always use instruct variants for chat — base models generate continuations of text rather than answering questions.

INT4 / INT8 — Integer formats using 4 or 8 bits respectively. Used in quantization to represent model weights with lower precision than floating-point formats. INT8 is approximately 2x smaller than FP16; INT4 is approximately 4x smaller.

K

K-quants — The quantization system used in GGUF format that assigns different bit-widths to different parts of the model based on their importance. K-quant levels (Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K) use mixed precision to achieve better quality than uniform quantization at the same average bit rate.

KV-cache (Key-Value cache) — A memory structure that stores the key and value tensors computed during attention for all previously processed tokens. This cache avoids recomputing attention over the entire context at each generation step, making autoregressive generation computationally feasible. The KV-cache grows with context length and is a significant consumer of memory.
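The cache size is straightforward to estimate: keys plus values, for every layer, at every position. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and an FP16 cache:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV-cache size: 2 tensors (K and V) per layer, one entry per position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"{kv_cache_gb(32, 32, 128, 4096):.1f} GB at 4K context")    # ~2.1 GB
print(f"{kv_cache_gb(32, 32, 128, 32768):.1f} GB at 32K context")  # ~17.2 GB
```

The linear growth with context length is why long-context sessions can exhaust VRAM even when the weights themselves fit comfortably; grouped-query attention reduces n_kv_heads to shrink this cache.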

L

Learning rate — A hyperparameter that controls how much the model’s weights are adjusted during each training step. Too high a learning rate causes unstable training; too low causes slow convergence. Typical learning rates for fine-tuning are 1e-5 to 5e-4.

llama.cpp — The foundational open-source inference engine for local AI, written in C/C++ by Georgi Gerganov. Supports GGUF models on CPU, NVIDIA GPU (CUDA), AMD GPU (ROCm), Apple GPU (Metal), and Vulkan. The engine that makes local AI portable and efficient across virtually any hardware.

LLM (Large Language Model) — A neural network with billions of parameters trained on vast text corpora to generate, understand, and manipulate human language. Examples include Llama, Mistral, GPT-4, and Claude. “Large” typically refers to models with 1 billion or more parameters.

LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that freezes the original model weights and trains small, low-rank adapter matrices that modify the model’s behavior. LoRA typically adds 0.1-1% additional parameters, making fine-tuning possible on consumer hardware. LoRA adapters can be merged into the base model or loaded dynamically.
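The core idea fits in a few lines of NumPy. The sizes here are toy values; the trainable fraction shrinks as the hidden size grows, which is how real models reach the 0.1-1% quoted above:

```python
import numpy as np

d, r = 256, 4                          # hidden size and LoRA rank (toy sizes)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))            # frozen base weight
A = rng.normal(size=(r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, zero-initialized
                                       # so training starts from the unmodified model

W_eff = W + B @ A                      # merged weight seen at inference time

trainable = A.size + B.size
print(f"trainable: {trainable} vs frozen: {W.size} ({trainable / W.size:.1%})")
```

Because B starts at zero, B @ A is zero and the adapted model initially behaves exactly like the base model; training only ever adjusts the small A and B matrices.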

M

Memory bandwidth — The rate at which data can be read from or written to memory, measured in GB/s. Memory bandwidth is the primary bottleneck for LLM token generation speed. NVIDIA GPUs offer 300-1,000+ GB/s, Apple Silicon offers 100-800 GB/s, and DDR5 RAM offers 50-100 GB/s. Higher bandwidth means faster tok/s.
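A back-of-the-envelope sketch of that bound: generating one token requires reading every weight once, so bandwidth divided by model size gives a speed ceiling. The bandwidth figures and the ~4 GB Q4 7B model size below are illustrative assumptions:

```python
def est_tok_per_s(model_gb, bandwidth_gb_s):
    """Rough ceiling on generation speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb

# Illustrative bandwidth figures for a ~4 GB quantized 7B model
for hw, bw in [("DDR5 CPU", 80), ("Apple Silicon", 400), ("High-end GPU", 1000)]:
    print(f"{hw}: ~{est_tok_per_s(4, bw):.0f} tok/s upper bound")
```

Real throughput lands below this ceiling, but the ratio explains why quantizing a model (smaller reads) speeds up generation on the same hardware.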

Metal — Apple’s GPU programming framework, analogous to NVIDIA’s CUDA. Metal provides GPU acceleration for AI inference on Apple Silicon (M1/M2/M3/M4) devices. llama.cpp and MLX use Metal for accelerated inference on Mac hardware.

MLX — Apple’s machine learning framework, designed specifically for Apple Silicon. MLX provides a NumPy-like API with automatic use of unified memory and Metal GPU acceleration. Offers excellent performance for LLM inference on Mac hardware and is the native framework for Apple Silicon AI deployment.

MoE (Mixture of Experts) — A model architecture where the network contains multiple “expert” sub-networks, and a routing mechanism selects a subset of experts to process each token. MoE models have a large total parameter count but only activate a fraction per token, giving larger-model quality at smaller-model inference cost. Examples: Mixtral 8x7B (47B total, 13B active), DeepSeek-V2.

Model merging — The practice of combining weights from multiple fine-tuned models to create a new model that blends their capabilities. Merging methods include SLERP (spherical linear interpolation), TIES (TrIm, Elect Sign & merge), and DARE (Drop And REscale). Popular in the community for creating models that combine different strengths.

N

NF4 (Normal Float 4-bit) — A 4-bit quantization format designed by Tim Dettmers (used in QLoRA) that maps weight values to a 4-bit representation optimized for the normal (Gaussian) distribution that neural network weights typically follow. NF4 achieves better quality than standard INT4 for normally distributed weights.

NVLink — A high-bandwidth interconnect technology by NVIDIA that enables direct GPU-to-GPU communication at up to 600 GB/s (NVLink 3) or 900 GB/s (NVLink 4). NVLink enables efficient multi-GPU inference by allowing GPUs to share data much faster than through the PCIe bus. Available on the consumer RTX 3090 and on professional GPUs such as the A100 and H100; dropped from the RTX 40-series consumer cards.

O

Ollama — A user-friendly inference platform built on llama.cpp that provides a simple CLI, a model registry, and an OpenAI-compatible API. Ollama is the most popular tool for getting started with local AI — one command to install, one command to run a model. Supports macOS, Linux, and Windows.

ONNX (Open Neural Network Exchange) — An open format for representing machine learning models, enabling interoperability between different frameworks. ONNX models can run on various hardware through ONNX Runtime. Less commonly used for LLM inference than GGUF or GPTQ, though supported in some deployment scenarios.

Open-weight model — A model whose trained weights are publicly available for download, inspection, and use, as opposed to closed/proprietary models accessible only through APIs. Open-weight does not necessarily mean open-source — the training code, data, and methodology may not be public. Examples: Llama, Mistral, Gemma, Qwen.

P

PagedAttention — A memory management technique used by vLLM that stores the KV-cache in non-contiguous memory blocks (pages), similar to how operating systems manage virtual memory. PagedAttention reduces memory waste from fragmentation and enables more efficient memory utilization for concurrent requests.

Parameter — A single learned numerical weight in a neural network. A “7B model” has approximately 7 billion parameters. Each parameter stores a floating-point number (or a quantized integer). The total number of parameters determines the model’s capacity for learning patterns and storing knowledge.

Perplexity — A metric that measures how well a language model predicts a test dataset. Lower perplexity means better prediction (the model is less “perplexed” by the text). Perplexity is the standard metric for comparing quantization quality — quantized models have higher (worse) perplexity than their full-precision counterparts.
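Given the probabilities a model assigned to the true next tokens in a test text, perplexity is easy to compute; the probability values below are made up for illustration:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# Probability the model assigned to each actual next token in a test text
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.15]

print(round(perplexity(confident), 2))   # low: the text was easy to predict
print(round(perplexity(uncertain), 2))   # high: the model was "perplexed"
```

Intuitively, a perplexity of N means the model was, on average, as uncertain as if it were choosing uniformly among N tokens; a model that assigns every token probability 0.5 has perplexity exactly 2.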

Pipeline parallelism — A model parallelism strategy where different layers of the model are assigned to different GPUs, creating a pipeline. For example, GPU 1 processes layers 1-40 and GPU 2 processes layers 41-80. Simpler than tensor parallelism but less efficient due to “pipeline bubbles” where GPUs wait for previous stages.

Prefill — The first phase of LLM inference where the model processes the entire input prompt in parallel, computing the initial KV-cache. Prefill is compute-intensive and benefits from GPU parallel processing power. Prefill time determines the time-to-first-token (TTFT) latency.

Prompt — The text input provided to a language model that it uses as context for generating a response. A prompt can include a system prompt (setting behavior), user messages, examples (few-shot), and instructions. Prompt engineering is the practice of crafting effective prompts to get desired outputs.

Q

QLoRA (Quantized LoRA) — A fine-tuning method that combines LoRA with a quantized (typically NF4) base model. QLoRA allows fine-tuning models on consumer GPUs by keeping the base model at 4-bit precision while training small full-precision adapter weights. A 7B model can be fine-tuned with QLoRA on a GPU with as little as 6 GB VRAM.

Quantization — The process of reducing the numerical precision of model weights to decrease memory usage and increase inference speed. Quantization converts weights from high-precision formats (FP16, 16 bits per weight) to lower-precision formats (INT8, INT4, or lower). See the quantization guide for comprehensive coverage.
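A minimal sketch of symmetric per-tensor INT8 quantization; real methods like GPTQ and AWQ are considerably more sophisticated, but the round-trip below shows the basic compression-versus-error trade-off:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0        # map the largest weight to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale       # dequantize for use in compute

err = np.abs(w - w_hat).max()
print(f"max round-trip error: {err:.4f} (one scale step = {scale:.4f})")
```

Each weight now occupies 1 byte instead of 2 (FP16) or 4 (FP32), at the cost of rounding every value to the nearest step of the scale.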

R

RAG (Retrieval-Augmented Generation) — A technique that enhances LLM responses by first retrieving relevant documents from a knowledge base and then providing those documents as context to the model. RAG reduces hallucination, enables the model to answer questions about private or recent data, and is a cornerstone of enterprise local AI deployments.

Repetition penalty — A sampling parameter that reduces the probability of generating tokens that have already appeared in the output, preventing the model from getting stuck in loops. Values above 1.0 discourage repetition; typical values range from 1.05 to 1.3.

RLHF (Reinforcement Learning from Human Feedback) — An alignment technique where a separate reward model (trained on human preference data) provides feedback signals that guide the language model to produce more helpful, harmless, and honest outputs. Used to align models like ChatGPT and Llama-Instruct.

ROCm (Radeon Open Compute) — AMD’s open-source platform for GPU computing, analogous to NVIDIA’s CUDA. ROCm enables AI inference on AMD GPUs. Support has improved significantly in llama.cpp and vLLM, though the ecosystem is less mature than CUDA. Best supported on Linux.

RoPE (Rotary Position Embedding) — A position encoding method used in modern Transformer models (including Llama, Mistral, and Qwen) that encodes token position information using rotation matrices applied to query and key vectors. RoPE enables efficient handling of sequences and supports context length extension through techniques like NTK-aware scaling.

S

SafeTensors — A file format for storing model weights safely and efficiently, developed by Hugging Face. SafeTensors prevents arbitrary code execution (unlike Python’s pickle format), supports memory-mapped loading for fast access, and is the standard format for distributing full-precision and GPTQ/AWQ quantized models.

Sampling — The process of selecting the next token from the model’s predicted probability distribution during text generation. Sampling strategies include greedy (always pick the most probable), temperature scaling (adjust randomness), top-p (nucleus sampling), and top-k (limit to the k most probable tokens).

Semantic search — Finding documents based on meaning rather than keyword matching. Semantic search uses embedding models to convert text into vectors and then finds the most similar vectors. It is the core retrieval mechanism in RAG pipelines, enabling queries like “explain memory management” to find documents about “RAM allocation” and “VRAM usage.”

SFT (Supervised Fine-Tuning) — Training a model on labeled examples (input-output pairs) to learn specific behaviors. SFT is typically the first step of alignment after pre-training, teaching the model to follow instructions and engage in conversation. Distinct from RLHF and DPO, which further refine behavior based on human preferences.

SLM (Small Language Model) — A language model with a relatively small parameter count (typically 1B-7B), designed to run efficiently on resource-constrained hardware. Microsoft’s Phi series pioneered the SLM category, demonstrating that careful data curation can compensate for fewer parameters.

Speculative decoding — An inference optimization where a smaller, faster “draft” model generates candidate tokens that are then verified by the larger “target” model in parallel. Tokens that the target model agrees with are accepted immediately, speeding up generation without any quality loss. Effective when the draft model is a good approximator of the target.
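A toy greedy-match sketch of one speculative round. The lambda "models" are stand-ins, and real systems verify the whole proposal in a single parallel forward pass using a probabilistic acceptance rule that exactly preserves the target model's output distribution:

```python
def speculative_step(draft, target, context, k=4):
    """One round: the draft proposes k tokens, the target accepts the longest
    prefix it agrees with, then contributes one corrected token."""
    proposal, ctx = [], list(context)
    for _ in range(k):                    # cheap draft model runs k times
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(context)
    for t in proposal:                    # target verifies (in parallel in practice)
        if target(ctx) == t:              # target would have produced the same token
            accepted.append(t)
            ctx.append(t)
        else:                             # first disagreement: take the target's token
            accepted.append(target(ctx))
            break
    return accepted

# Toy "models": next token = last token + 1; the draft drifts after token 3
target_model = lambda ctx: ctx[-1] + 1
draft_model  = lambda ctx: ctx[-1] + (1 if ctx[-1] < 3 else 2)

print(speculative_step(draft_model, target_model, [0], k=4))  # [1, 2, 3, 4]
```

Here one verification round yields four tokens, which is the source of the speedup: the expensive target model is consulted once per batch of draft tokens rather than once per token.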

Stop token — A special token that signals the model to stop generating text. Different model families use different stop tokens (e.g., <|eot_id|> for Llama 3, </s> for Mistral). The inference engine must be configured with the correct stop tokens to prevent the model from generating unwanted extra text.

System prompt — An initial instruction provided to the model that sets its behavior, personality, and constraints for the conversation. System prompts are processed once at the beginning and influence all subsequent responses. Local AI allows full control over system prompts with no hidden overrides from the provider.

T

Temperature — A sampling parameter that controls the randomness of text generation. Temperature 0 (or near 0) produces deterministic, focused outputs. Temperature 1.0 produces varied, creative outputs. Values above 1.0 increase randomness further. Typical values: 0.1-0.3 for factual tasks, 0.7-1.0 for creative writing.

Tensor parallelism — A model parallelism strategy where individual layers are split across multiple GPUs, with each GPU computing a portion of every layer. More communication-intensive than pipeline parallelism but eliminates pipeline bubbles. Requires fast inter-GPU connections (NVLink or high-bandwidth PCIe). Supported by vLLM.

TGI (Text Generation Inference) — Hugging Face’s production-grade inference server for LLMs. TGI supports SafeTensors, GPTQ, and AWQ models with features like continuous batching, tensor parallelism, and token streaming. Commonly used for deploying local AI in production environments.

Token — The fundamental unit of text processing for LLMs. Tokenizers split text into tokens, which may be whole words, subwords, or individual characters. A typical English word is 1-2 tokens. The model operates on token IDs (integers), not raw text. Tokenizers vary between model families (Llama uses a different tokenizer than Mistral).

Tokens per second (tok/s) — The standard measure of LLM inference speed, indicating how many tokens the model generates per second. Human reading speed is approximately 4-5 words per second (~5-7 tok/s), so 20+ tok/s is considered comfortable for interactive chat, and 50+ tok/s delivers responses faster than most users can read them.

Top-k sampling — A text generation strategy that restricts the candidate set to the k most probable next tokens at each step, then samples from that reduced set. Top-k=40 means only the 40 most likely tokens are considered. Prevents the model from selecting highly improbable tokens while maintaining variety.

Top-p (nucleus) sampling — A text generation strategy that selects from the smallest set of tokens whose cumulative probability exceeds the threshold p. Top-p=0.9 considers the top tokens that collectively account for 90% of the probability mass. More adaptive than top-k because it adjusts the candidate set size based on the probability distribution.
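The common sampling chain (temperature, then top-k, then top-p) can be sketched as follows; the logits are made up, and production engines apply the same filters over vocabularies of ~32,000+ tokens:

```python
import numpy as np

def sample(logits, temperature=0.8, top_k=40, top_p=0.9, rng=None):
    """Apply temperature, then top-k, then top-p filtering, and sample one token."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids, most probable first
    order = order[:top_k]                      # top-k: keep only the k best
    cum = np.cumsum(probs[order])
    cut = np.searchsorted(cum, top_p) + 1      # smallest prefix covering top_p mass
    keep = order[:max(1, cut)]

    p = probs[keep] / probs[keep].sum()        # renormalize over survivors
    return int(rng.choice(keep, p=p))

logits = [2.0, 1.0, 0.5, -1.0, -3.0]           # toy distribution over 5 tokens
print(sample(logits, temperature=0.7, rng=np.random.default_rng(0)))
```

Lowering the temperature sharpens the distribution toward greedy decoding, while top-k and top-p prune the improbable tail so the occasional random draw cannot derail the output.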

Transformer — The neural network architecture underlying virtually all modern LLMs, introduced in the 2017 paper “Attention Is All You Need.” Transformers process sequences using self-attention mechanisms and feed-forward networks, enabling parallel processing and long-range dependency modeling. All major local models (Llama, Mistral, Gemma, Qwen) are Transformers.

U

Unified memory — A memory architecture where the CPU, GPU, and other processors share the same physical memory pool. Apple Silicon uses unified memory, meaning all system memory is available to both CPU and GPU for AI inference. This eliminates the distinction between RAM and VRAM, simplifying model deployment and enabling larger models on consumer hardware.

V

Vector database — A specialized database designed to store, index, and search high-dimensional vectors (embeddings) efficiently. Used in RAG pipelines to find documents semantically similar to a query. Popular vector databases for local use include ChromaDB, Qdrant, Milvus, FAISS, and Weaviate.

vLLM — A high-performance inference engine optimized for GPU throughput, featuring PagedAttention for efficient memory management, continuous batching for high concurrency, and tensor parallelism for multi-GPU deployment. The standard choice for production local AI serving when maximum throughput is needed.

VRAM (Video RAM) — Dedicated high-bandwidth memory on a GPU, used to store model weights, KV-cache, and intermediate computations during inference. Consumer GPUs offer 6-24 GB VRAM; professional GPUs offer 24-80 GB. VRAM capacity determines the largest model that can run entirely on the GPU.

Vulkan — A cross-platform graphics and compute API that provides GPU acceleration on a wide range of hardware, including NVIDIA, AMD, and Intel GPUs. llama.cpp supports Vulkan as a backend, enabling GPU-accelerated inference on hardware that does not support CUDA, Metal, or ROCm. Performance is generally lower than CUDA or Metal but offers broad compatibility.

W

Weight — A single numerical parameter in a neural network that is learned during training. The totality of weights in a model defines its knowledge and capabilities. When we say a model has “7 billion parameters,” we mean it has 7 billion individual weights. Weights are what quantization compresses.

Weight merging — See Model merging.

Z

Zero-shot learning — The ability of a model to perform a task it was not specifically trained for, based solely on its general training and the instructions in the prompt. For example, asking a general-purpose model to classify sentiment without providing any examples. Zero-shot capability improves with model size and alignment quality.


This glossary is a living reference. For in-depth guides on the topics covered here, see What Is Local AI? for a comprehensive introduction, Understanding Quantization for detailed quantization coverage, and the Hardware Requirements Guide for hardware terminology in context.

Frequently Asked Questions

What is the difference between VRAM and RAM for local AI?

VRAM (Video RAM) is dedicated memory on a GPU, offering 300-1000+ GB/s bandwidth for fast inference. RAM (system memory) is general-purpose memory used by the CPU, offering 50-100 GB/s bandwidth. Models run faster in VRAM, but system RAM is cheaper per gigabyte and enables CPU-only inference.

What does quantization mean in the context of local LLMs?

Quantization is the process of reducing the numerical precision of model weights — from 16-bit floating point to 8-bit, 4-bit, or lower integers — to reduce memory usage and increase inference speed with a controlled trade-off in output quality.

What is the difference between a parameter and a token?

A parameter is a learned numerical weight stored inside the model — a 7B model has 7 billion parameters. A token is a unit of text (a word, subword, or character) processed during inference. Parameters define the model; tokens are what the model reads and generates.