Choosing the right local LLM depends on three factors: what task you need it for, what hardware you have, and what quality level you require. This guide provides a structured decision framework with concrete model recommendations, quantization advice, and VRAM-based selection tables so you can pick the best model for your specific situation without wasting time on trial and error.
The Decision Framework
Selecting a local LLM isn’t about finding the “best” model. It’s about finding the best model for your constraints. Work through these three questions in order:
Question 1: What Is Your Primary Task?
Different models excel at different tasks. A model trained heavily on code will outperform a general model at coding, even if the general model is larger.
| Task | Best Model Families | Recommended Size |
|---|---|---|
| General chat | Llama 3.1, Qwen 2.5, Gemma 2 | 7B-32B |
| Code generation | Qwen 2.5 Coder, DeepSeek Coder V2, CodeLlama | 7B-32B |
| Reasoning/math | Qwen 2.5, DeepSeek R1 (distilled), Llama 3.1 | 14B-70B |
| Creative writing | Llama 3.1, Mistral, Qwen 2.5 | 8B-32B |
| RAG/document Q&A | Llama 3.1, Qwen 2.5, Mistral | 7B-14B |
| Summarization | Llama 3.1, Qwen 2.5 | 7B-14B |
| Translation | Qwen 2.5, NLLB, Llama 3.1 | 7B-14B |
| Instruction following | Llama 3.1 Instruct, Qwen 2.5 Instruct | 7B-32B |
| Function calling | Llama 3.1, Qwen 2.5, Mistral | 7B-32B |
| Vision/multimodal | LLaVA, Llama 3.2 Vision, Qwen2-VL | 7B-13B |
Question 2: What Hardware Do You Have?
Your available VRAM (or RAM for CPU inference) determines the maximum model size you can run. Use this table to find your ceiling:
| Available VRAM | Max Model (Q4_K_M) | Recommended Models |
|---|---|---|
| No GPU (8 GB RAM) | 3B | Llama 3.2 3B, Phi-3 Mini |
| No GPU (16 GB RAM) | 7B | Llama 3.1 8B, Qwen 2.5 7B |
| No GPU (32 GB RAM) | 14B | Qwen 2.5 14B Q4 |
| 6 GB VRAM | 3-7B (partial offload) | Phi-3 Mini, Llama 3.2 3B |
| 8 GB VRAM | 7B | Llama 3.1 8B, Mistral 7B |
| 10-12 GB VRAM | 7-14B | Qwen 2.5 14B Q4, Llama 3.1 8B Q6 |
| 16 GB VRAM | 13-14B | Qwen 2.5 14B Q5, Llama 3.1 8B Q8 |
| 24 GB VRAM | 30-34B | Qwen 2.5 32B Q4, DeepSeek Coder 33B |
| 48 GB VRAM | 70B | Llama 3.1 70B Q4, Qwen 2.5 72B Q4 |
| Apple M1/M2 16 GB | 7-8B | Llama 3.1 8B, Qwen 2.5 7B |
| Apple M1-M4 32 GB | 14-32B | Qwen 2.5 32B Q4 |
| Apple M1-M4 64 GB | 70B | Llama 3.1 70B Q4 |
| Apple M1-M4 128+ GB | 70B+ high quant | Llama 3.1 70B Q8, Qwen 2.5 72B Q6 |
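As a quick sanity check against the table above, you can compare a quantized model's file size with your memory budget. A sketch, where `fits` is a hypothetical helper and the ~2 GB reserve for KV cache and runtime overhead is an assumed margin, not something any runtime enforces:

```shell
# fits VRAM_GB MODEL_FILE_GB — rough fit check.
# Reserves ~2 GB for KV cache and runtime overhead (an assumed margin).
fits() {
  awk -v vram="$1" -v model="$2" 'BEGIN {
    if (model + 2 <= vram)  print "fits"
    else if (model <= vram) print "tight"
    else                    print "offload"
  }'
}
fits 8 4.4    # Llama 3.1 8B Q4_K_M (~4.4 GB) on an 8 GB card -> fits
fits 8 7.5    # a ~7.5 GB Q8_0 file barely squeezes in -> tight
```

Anything that lands in "offload" needs partial CPU offload, a smaller quantization, or a smaller model.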
Question 3: What Quality Level Do You Need?
Not every task requires maximum quality. For a quick code suggestion, a 7B model is fast and good enough. For drafting a legal document, you want the largest model you can run.
Speed-priority tasks (favor smaller models):
- Autocomplete and inline suggestions
- Quick Q&A lookups
- Text classification
- Simple transformations
Quality-priority tasks (favor larger models):
- Complex reasoning and analysis
- Long-form writing
- Nuanced code generation
- Multi-step problem solving
Model Families Deep Dive
Meta Llama 3 / 3.1 / 3.2
The most widely supported model family. Llama 3.1 was a watershed moment for open models, with the 405B version matching GPT-4 on many benchmarks.
Sizes available: 1B, 3B, 8B, 70B, 405B
Strengths: Broad capability, excellent instruction following, strong reasoning, massive community support
Weaknesses: Not the absolute best at any single task
License: Llama 3.1 Community License (commercial use allowed under 700M monthly active users)
# General purpose
ollama run llama3.1:8b
# Larger, more capable
ollama run llama3.1:70b
# Multimodal (vision)
ollama run llama3.2-vision:11b
Best for: Users who want a reliable, well-supported all-rounder.
Qwen 2.5 (Alibaba)
Arguably the strongest open model family as of early 2026. Excellent at every size tier, with specialized variants for coding and math.
Sizes available: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Strengths: Strong reasoning, excellent multilingual support, specialized code/math variants, competitive at every size
Weaknesses: Some users report slightly less natural English prose compared to Llama
License: Apache 2.0 (most sizes) or Qwen License (72B)
# General purpose
ollama run qwen2.5:7b
# Coding specialist
ollama run qwen2.5-coder:7b
# Sweet spot for quality
ollama run qwen2.5:32b
Best for: Users who want top-tier performance, especially for coding or multilingual tasks.
Mistral / Mixtral
French AI lab known for efficient architectures. Mixtral popularized Mixture of Experts (MoE) in the open model world: only a subset of parameters is active for each token, so inference is faster than the total parameter count suggests.
Sizes available: 7B, 8x7B and 8x22B (Mixtral), Mistral Small/Medium/Large
Strengths: Efficient, fast, strong at structured output, good function calling
Weaknesses: Fewer community fine-tunes, some sizes behind proprietary releases
License: Apache 2.0 (most models)
# Fast and efficient
ollama run mistral:7b
# Mixture of Experts
ollama run mixtral:8x7b
Best for: Users who want efficiency and speed, or need strong structured output/function calling.
Google Gemma 2
Google’s open model family. The 9B and 27B sizes offer excellent performance for their parameter counts.
Sizes available: 2B, 9B, 27B
Strengths: Efficient architecture, strong benchmark scores for size, good at instruction following
Weaknesses: Smaller ecosystem of fine-tunes, some users find it less creative
License: Gemma Terms of Use (free for most uses)
ollama run gemma2:9b
ollama run gemma2:27b
Best for: Users who want Google-quality models with moderate hardware.
Microsoft Phi-3 / Phi-4
Small but powerful. Phi models are trained on high-quality synthetic and curated data, making them punch far above their weight class.
Sizes available: 3.8B (Mini), 7B, 14B
Strengths: Exceptional quality for size, runs on minimal hardware, fast inference
Weaknesses: Can be less robust with unusual prompts, smaller context window on some versions
License: MIT
ollama run phi3:mini
ollama run phi4:14b
Best for: Users with limited hardware who still want good results.
DeepSeek
Chinese AI lab with strong coding and reasoning models. DeepSeek R1’s distilled versions brought chain-of-thought reasoning to smaller models.
Sizes available: 1.5B, 7B, 8B, 14B, 32B, 67B, 70B (various model lines)
Strengths: Excellent coding, strong mathematical reasoning, innovative architectures
Weaknesses: Primarily English/Chinese, some models have restrictive licenses
# Reasoning model
ollama run deepseek-r1:8b
# Coding model
ollama run deepseek-coder-v2:16b
Best for: Users focused on coding or mathematical reasoning.
Task-Specific Recommendations
Best Models for Code Generation
Coding is the strongest use case for local LLMs. Specialized code models often match or exceed cloud models for common programming tasks.
| Budget VRAM | Model | Why |
|---|---|---|
| 4-6 GB | Qwen 2.5 Coder 3B | Best small code model |
| 8 GB | Qwen 2.5 Coder 7B | Excellent code quality |
| 12-16 GB | Qwen 2.5 Coder 14B | Near cloud-quality code |
| 24 GB | DeepSeek Coder 33B Q4 | Top-tier local coding |
| 48 GB | Qwen 2.5 Coder 32B Q6 | Maximum local code quality |
Setup with VS Code and Continue:
# Install model
ollama run qwen2.5-coder:7b
# In VS Code, install Continue extension, configure:
# ~/.continue/config.json
{
"models": [{
"title": "Qwen 2.5 Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}]
}
Best Models for General Chat and Reasoning
| Budget VRAM | Model | Why |
|---|---|---|
| 4-6 GB | Phi-3 Mini 3.8B | Best quality at tiny size |
| 8 GB | Llama 3.1 8B | Best all-around at 7B tier |
| 12-16 GB | Qwen 2.5 14B | Significant quality jump |
| 24 GB | Qwen 2.5 32B Q4 | Approaches cloud quality |
| 48 GB | Llama 3.1 70B Q4 | Competes with GPT-4 class |
Best Models for RAG (Retrieval-Augmented Generation)
RAG models need to handle long contexts and follow instructions about provided documents accurately.
| Budget VRAM | Model | Context Length | Why |
|---|---|---|---|
| 8 GB | Llama 3.1 8B | 128K | Excellent instruction following |
| 12-16 GB | Qwen 2.5 14B | 128K | Better comprehension |
| 24 GB | Qwen 2.5 32B Q4 | 128K | Best local RAG quality |
You also need an embedding model for RAG. These are separate, smaller models:
# Popular embedding models
ollama pull nomic-embed-text # 274MB, good general purpose
ollama pull mxbai-embed-large # 670MB, higher quality
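Embedding models are queried through Ollama's HTTP API rather than an interactive chat session. A minimal sketch, assuming Ollama is running on its default port 11434:

```shell
# Request an embedding vector for one piece of text.
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is retrieval-augmented generation?"}'
```

The response is a JSON object with an `embedding` array, which your RAG pipeline stores in a vector database for similarity search.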
Best Models for Creative Writing
Creative writing benefits from larger models with more diverse training data.
| Budget VRAM | Model | Why |
|---|---|---|
| 8 GB | Llama 3.1 8B | Good vocabulary, natural prose |
| 12-16 GB | Mistral 7B (fine-tuned) | Many creative fine-tunes available |
| 24 GB | Qwen 2.5 32B Q4 | Rich, nuanced writing |
| 48 GB | Llama 3.1 70B Q4 | Best local creative quality |
Tip: For creative writing, increase the temperature parameter to 0.8-1.0 and raise top_p to 0.95 for more diverse, interesting output:
# In Ollama Modelfile
FROM llama3.1:8b
PARAMETER temperature 0.85
PARAMETER top_p 0.95
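To apply these parameters, save the block above to a file named `Modelfile` and build a local model from it; `creative-llama` below is just an arbitrary name for the new model:

```shell
# Build a named model from the Modelfile in the current directory,
# then chat with it as usual.
ollama create creative-llama -f ./Modelfile
ollama run creative-llama
```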
Quantization Deep Dive
Understanding quantization helps you make informed trade-offs between quality and performance.
Quantization Levels Compared
| Level | Bits | Size (7B) | Quality | Speed | When to Use |
|---|---|---|---|---|---|
| FP16 | 16 | ~14 GB | 100% | Baseline | Research, when memory allows |
| Q8_0 | 8 | ~7.5 GB | ~99% | Fast | When you have plenty of VRAM |
| Q6_K | 6 | ~5.5 GB | ~98% | Fast | Best quality-per-GB |
| Q5_K_M | 5 | ~5.0 GB | ~96% | Fast | Good sweet spot |
| Q4_K_M | 4 | ~4.4 GB | ~93% | Fast | Recommended default |
| Q4_0 | 4 | ~4.0 GB | ~90% | Fastest | Maximum speed priority |
| Q3_K_M | 3 | ~3.3 GB | ~85% | Fast | Squeezing models into tight memory |
| Q2_K | 2 | ~2.8 GB | ~70% | Fast | Last resort |
| IQ4_XS | 4 | ~4.0 GB | ~92% | Medium | Tight fit with importance-weighted quant |
| IQ2_XXS | 2 | ~2.3 GB | ~65% | Medium | Extreme compression |
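The sizes in the table follow from a simple rule of thumb: parameters × bits-per-weight ÷ 8, plus roughly 10% for embeddings and metadata. A sketch (`est_size_gb` is a hypothetical helper; the bits-per-weight value and the 10% overhead are approximations):

```shell
# est_size_gb PARAMS_BILLIONS BITS_PER_WEIGHT — approximate GGUF file size.
# Q4_K_M averages about 4.85 bits per weight across layers.
est_size_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 * 1.1 }'
}
est_size_gb 7 4.85    # a 7B at Q4_K_M -> roughly 4.7 GB
est_size_gb 70 4.85   # a 70B at Q4_K_M -> roughly 46.7 GB, hence the 48 GB row
```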
The Sweet Spots
You have plenty of VRAM: Use Q6_K. Nearly indistinguishable from full precision.
You want the best balance: Use Q4_K_M. This is what most people should use. The quality loss is minimal for most tasks.
Your model barely fits: Use Q3_K_M or IQ4_XS. Some quality loss but still very usable.
Bigger model at a lower quant, or smaller model at a higher quant? Generally, a bigger model at Q4 beats a smaller model at Q8 for the same VRAM budget. For example, Qwen 2.5 14B Q4 (~9 GB) typically outperforms Qwen 2.5 7B Q8 (~7.5 GB).
How to Choose Quantization in Practice
With Ollama, quantization is selected via model tags:
# Default (usually Q4_K_M)
ollama run llama3.1:8b
# Specific quantization
ollama run llama3.1:8b-instruct-q6_K
ollama run llama3.1:8b-instruct-q8_0
With LM Studio or manual GGUF downloads from Hugging Face, you select the quantization level when downloading the file.
Building a Recommendation Matrix
Here’s how to use everything above to make your decision:
Step 1: Identify Your VRAM Budget
- No GPU → total RAM minus ~4 GB for the OS
- NVIDIA GPU → VRAM amount
- Apple Silicon → total unified memory minus 4-6 GB for the OS
- Mixed (partial offload) → VRAM plus some RAM, but slower
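Each of these numbers can be read from the command line. A sketch using standard tools (nvidia-smi on NVIDIA systems, sysctl on macOS, /proc/meminfo on Linux); `mem_budget` is a hypothetical helper:

```shell
# Print the raw number behind your memory budget, depending on platform.
mem_budget() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Total VRAM per GPU, e.g. "24576 MiB"
    nvidia-smi --query-gpu=memory.total --format=csv,noheader
  elif [ "$(uname)" = "Darwin" ]; then
    # Total unified memory on Apple Silicon
    sysctl -n hw.memsize | awk '{printf "%d GB unified memory\n", $1 / 1073741824}'
  else
    # Total system RAM for CPU inference
    awk '/MemTotal/ {printf "%d GB system RAM\n", $2 / 1048576}' /proc/meminfo
  fi
}
mem_budget
```

Subtract the OS reserve described above from whatever this prints.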
Step 2: Pick Your Task Category
- General chat → Llama 3.1 or Qwen 2.5 (general)
- Coding → Qwen 2.5 Coder or DeepSeek Coder
- Reasoning → Qwen 2.5 or DeepSeek R1 (distilled)
- Creative → Llama 3.1 or community fine-tunes
- RAG → Llama 3.1 or Qwen 2.5 (long context)
- Vision → Llama 3.2 Vision or Qwen2-VL
Step 3: Find Your Size Tier
- < 4 GB available → 1-3B models
- 4-6 GB available → 3-7B models
- 8-12 GB available → 7-14B models
- 16-24 GB available → 14-32B models
- 32-48 GB available → 32-70B models
- 64+ GB available → 70B+ models
Step 4: Select Quantization
- Model fits easily → Q6_K or Q5_K_M
- Model fits with room → Q4_K_M (default)
- Model barely fits → Q3_K_M or IQ4_XS
- Model doesn't fit → drop to smaller size
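Steps 1 through 3 collapse into a single lookup. A sketch (`tier` is a hypothetical helper; the exact boundaries between the published tiers are interpolated assumptions):

```shell
# tier BUDGET_GB — map a memory budget (whole GB) to a model size tier.
tier() {
  gb=$1
  if   [ "$gb" -lt 4 ];  then echo "1-3B"
  elif [ "$gb" -lt 8 ];  then echo "3-7B"
  elif [ "$gb" -lt 16 ]; then echo "7-14B"
  elif [ "$gb" -lt 32 ]; then echo "14-32B"
  elif [ "$gb" -lt 64 ]; then echo "32-70B"
  else                        echo "70B+"
  fi
}
tier 24   # an RTX 4090 budget -> prints "14-32B"
```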
Testing and Evaluating Models
Once you’ve narrowed your choices, test them with prompts representative of your actual use case.
Quick Comparison Method
# Create a test prompt file
cat > /tmp/test-prompt.txt << 'EOF'
Write a Python function that takes a list of dictionaries
and returns a new list sorted by a specified key, handling
missing keys gracefully. Include type hints and docstring.
EOF
# Test multiple models
for model in llama3.1:8b qwen2.5:7b mistral:7b; do
echo "=== $model ==="
time ollama run $model < /tmp/test-prompt.txt
echo ""
done
Key Metrics to Compare
Tokens per second: How fast the model generates text. Check with ollama run --verbose.
Quality: Does the output match your expectations? Test with 5-10 representative prompts.
Instruction following: Does the model do what you ask? Test with specific format requirements.
Context handling: Give it a long document and ask questions. Does it stay accurate?
Benchmarks vs. Real-World Use
Model benchmarks (MMLU, HumanEval, GSM8K) are useful for rough comparisons but don’t always predict real-world performance. A model that scores higher on HumanEval might not write better code for your specific framework or language. Always test with your own use cases.
Practical Examples
Example 1: Developer on a Budget
Situation: 16 GB RAM laptop, no GPU, need code assistance
Recommendation: Qwen 2.5 Coder 7B Q4_K_M
Why: Best code model at the 7B tier, runs on CPU with 16 GB RAM
ollama run qwen2.5-coder:7b
Example 2: Privacy-Focused Professional
Situation: M3 MacBook Pro 36 GB, need document Q&A for sensitive files
Recommendation: Qwen 2.5 32B Q4_K_M + nomic-embed-text
Why: Strong comprehension, fits in unified memory, excellent for RAG
ollama run qwen2.5:32b
ollama pull nomic-embed-text
Example 3: Home Lab Enthusiast
Situation: RTX 4090 (24 GB VRAM), want maximum quality for everything
Recommendation: Qwen 2.5 32B Q5_K_M for daily use, swap to specialized models as needed
ollama run qwen2.5:32b # General use
ollama run qwen2.5-coder:32b # Coding sessions
ollama run deepseek-r1:32b # Complex reasoning
Example 4: Small Team Deployment
Situation: Server with 2x RTX A6000 (2x 48 GB VRAM), 5 concurrent users
Recommendation: Llama 3.1 70B Q4 via vLLM for throughput
# Using vLLM for multi-user serving
pip install vllm
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
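vLLM exposes an OpenAI-compatible API, so team members can point any OpenAI client at the server. A minimal check, assuming the default port 8000:

```shell
# Send one chat request to the vLLM server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```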
Staying Current
The local LLM landscape evolves rapidly. New models drop weekly. Here’s how to stay informed:
- Hugging Face Open LLM Leaderboard: Standardized benchmarks for new models
- r/LocalLLaMA: Active Reddit community with real-world testing
- local-llm.net: Our guides and comparisons, updated regularly
- Ollama Library: Check for newly supported models
When a new model comes out, ask: does it beat my current model at my specific task, on my hardware? If not, there’s no need to switch. The best model is the one that works for you.