How to Choose the Right Local LLM for Your Use Case

A decision framework for selecting the best local LLM based on your task, hardware, and requirements. Includes model comparisons, quantization guide, and VRAM recommendations.

Choosing the right local LLM depends on three factors: what task you need it for, what hardware you have, and what quality level you require. This guide provides a structured decision framework with concrete model recommendations, quantization advice, and VRAM-based selection tables so you can pick the best model for your specific situation without wasting time on trial and error.

The Decision Framework

Selecting a local LLM isn’t about finding the “best” model. It’s about finding the best model for your constraints. Work through these three questions in order:

Question 1: What Is Your Primary Task?

Different models excel at different tasks. A model trained heavily on code will outperform a general model at coding, even if the general model is larger.

Task | Best Model Families | Recommended Size
General chat | Llama 3.1, Qwen 2.5, Gemma 2 | 7B-32B
Code generation | Qwen 2.5 Coder, DeepSeek Coder V2, CodeLlama | 7B-32B
Reasoning/math | Qwen 2.5, DeepSeek R1 (distilled), Llama 3.1 | 14B-70B
Creative writing | Llama 3.1, Mistral, Qwen 2.5 | 8B-32B
RAG/document Q&A | Llama 3.1, Qwen 2.5, Mistral | 7B-14B
Summarization | Llama 3.1, Qwen 2.5 | 7B-14B
Translation | Qwen 2.5, NLLB, Llama 3.1 | 7B-14B
Instruction following | Llama 3.1 Instruct, Qwen 2.5 Instruct | 7B-32B
Function calling | Llama 3.1, Qwen 2.5, Mistral | 7B-32B
Vision/multimodal | LLaVA, Llama 3.2 Vision, Qwen2-VL | 7B-13B

Question 2: What Hardware Do You Have?

Your available VRAM (or RAM for CPU inference) determines the maximum model size you can run. Use this table to find your ceiling:

Available VRAM | Max Model (Q4_K_M) | Recommended Models
No GPU (8 GB RAM) | 3B | Llama 3.2 3B, Phi-3 Mini
No GPU (16 GB RAM) | 7B | Llama 3.1 8B, Qwen 2.5 7B
No GPU (32 GB RAM) | 13B | Qwen 2.5 14B Q4
6 GB VRAM | 3-7B (partial offload) | Phi-3 Mini, Llama 3.2 3B
8 GB VRAM | 7B | Llama 3.1 8B, Mistral 7B
10-12 GB VRAM | 7-13B | Qwen 2.5 14B Q4, Llama 3.1 8B Q6
16 GB VRAM | 13-14B | Qwen 2.5 14B Q5, Llama 3.1 8B FP16
24 GB VRAM | 30-34B | Qwen 2.5 32B Q4, DeepSeek Coder 33B
48 GB VRAM | 70B | Llama 3.1 70B Q4, Qwen 2.5 72B Q4
Apple M1/M2 16 GB | 7-8B | Llama 3.1 8B, Qwen 2.5 7B
Apple M1-M4 32 GB | 14-32B | Qwen 2.5 32B Q4
Apple M1-M4 64 GB | 70B | Llama 3.1 70B Q4
Apple M1-M4 128+ GB | 70B+ at higher quant | Llama 3.1 70B Q8, Qwen 2.5 72B Q6
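
These ceilings follow from simple arithmetic: a model's weight footprint is roughly parameter count times bits-per-weight divided by 8, plus overhead for the KV cache and runtime buffers. A minimal sketch of that estimate (the flat 1.5 GB overhead and the ~4.8 effective bits for Q4_K_M are rough assumptions; real usage grows with context length):

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 1.5) -> float:
    """Approximate memory needed to run a model: weight bytes plus a
    flat allowance for KV cache and runtime buffers (an assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1024**3 + overhead_gb

# An 8B model at Q4_K_M (~4.8 effective bits per weight):
print(round(estimate_memory_gb(8, 4.8), 1))   # → 6.0
# The same model at FP16:
print(round(estimate_memory_gb(8, 16), 1))    # → 16.4
```

This is why an 8B model at Q4 fits comfortably in 8 GB of VRAM while the FP16 version needs a 16 GB card.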

Question 3: What Quality Level Do You Need?

Not every task requires maximum quality. For a quick code suggestion, a 7B model is fast and good enough. For drafting a legal document, you want the largest model you can run.

Speed-priority tasks (favor smaller models):

  • Autocomplete and inline suggestions
  • Quick Q&A lookups
  • Text classification
  • Simple transformations

Quality-priority tasks (favor larger models):

  • Complex reasoning and analysis
  • Long-form writing
  • Nuanced code generation
  • Multi-step problem solving

Model Families Deep Dive

Meta Llama 3 / 3.1 / 3.2

The most widely supported model family. Llama 3.1 was a watershed moment for open models, with the 405B version matching GPT-4 on many benchmarks.

Sizes available: 1B, 3B, 8B, 70B, 405B
Strengths: Broad capability, excellent instruction following, strong reasoning, massive community support
Weaknesses: Not the absolute best at any single task
License: Llama 3.1 Community License (commercial use allowed under 700M monthly active users)

# General purpose
ollama run llama3.1:8b

# Larger, more capable
ollama run llama3.1:70b

# Multimodal (vision)
ollama run llama3.2-vision:11b

Best for: Users who want a reliable, well-supported all-rounder.

Qwen 2.5 (Alibaba)

Arguably the strongest open model family as of early 2026. Excellent at every size tier, with specialized variants for coding and math.

Sizes available: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Strengths: Strong reasoning, excellent multilingual support, specialized code/math variants, competitive at every size
Weaknesses: Some users report slightly less natural English prose compared to Llama
License: Apache 2.0 (most sizes) or Qwen License (72B)

# General purpose
ollama run qwen2.5:7b

# Coding specialist
ollama run qwen2.5-coder:7b

# Sweet spot for quality
ollama run qwen2.5:32b

Best for: Users who want top-tier performance, especially for coding or multilingual tasks.

Mistral / Mixtral

A French AI lab known for efficient architectures. Mixtral brought Mixture of Experts (MoE) architectures into the open-model mainstream: only a subset of parameters is active for each token, so inference is faster than the total parameter count suggests.

Sizes available: 7B, 8x7B and 8x22B (Mixtral), Mistral Small/Medium/Large
Strengths: Efficient, fast, strong at structured output, good function calling
Weaknesses: Fewer community fine-tunes, some sizes behind proprietary releases
License: Apache 2.0 (most models)

# Fast and efficient
ollama run mistral:7b

# Mixture of Experts
ollama run mixtral:8x7b

Best for: Users who want efficiency and speed, or need strong structured output/function calling.

Google Gemma 2

Google’s open model family. The 9B and 27B sizes offer excellent performance for their parameter counts.

Sizes available: 2B, 9B, 27B
Strengths: Efficient architecture, strong benchmark scores for size, good at instruction following
Weaknesses: Smaller ecosystem of fine-tunes, some users find it less creative
License: Gemma Terms of Use (free for most uses)

ollama run gemma2:9b
ollama run gemma2:27b

Best for: Users who want Google-quality models with moderate hardware.

Microsoft Phi-3 / Phi-4

Small but powerful. Phi models are trained on high-quality synthetic and curated data, making them punch far above their weight class.

Sizes available: 3.8B (Mini), 7B, 14B
Strengths: Exceptional quality for size, runs on minimal hardware, fast inference
Weaknesses: Can be less robust with unusual prompts, smaller context window on some versions
License: MIT

ollama run phi3:mini
ollama run phi4:14b

Best for: Users with limited hardware who still want good results.

DeepSeek

Chinese AI lab with strong coding and reasoning models. DeepSeek R1’s distilled versions brought chain-of-thought reasoning to smaller models.

Sizes available: 1.5B, 7B, 8B, 14B, 32B, 67B, 70B (various model lines)
Strengths: Excellent coding, strong mathematical reasoning, innovative architectures
Weaknesses: Primarily English/Chinese, some models have restrictive licenses

# Reasoning model
ollama run deepseek-r1:8b

# Coding model
ollama run deepseek-coder-v2:16b

Best for: Users focused on coding or mathematical reasoning.

Task-Specific Recommendations

Best Models for Code Generation

Coding is the strongest use case for local LLMs. Specialized code models often match or exceed cloud models for common programming tasks.

Budget VRAM | Model | Why
4-6 GB | Qwen 2.5 Coder 3B | Best small code model
8 GB | Qwen 2.5 Coder 7B | Excellent code quality
12-16 GB | Qwen 2.5 Coder 14B | Near cloud-quality code
24 GB | DeepSeek Coder 33B Q4 | Top-tier local coding
48 GB | Qwen 2.5 Coder 32B Q6 | Maximum local code quality

Setup with VS Code and Continue:

# Install model
ollama run qwen2.5-coder:7b

# In VS Code, install Continue extension, configure:
# ~/.continue/config.json
{
  "models": [{
    "title": "Qwen 2.5 Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }]
}

Best Models for General Chat and Reasoning

Budget VRAM | Model | Why
4-6 GB | Phi-3 Mini 3.8B | Best quality at tiny size
8 GB | Llama 3.1 8B | Best all-around at the 7-8B tier
12-16 GB | Qwen 2.5 14B | Significant quality jump
24 GB | Qwen 2.5 32B Q4 | Approaches cloud quality
48 GB | Llama 3.1 70B Q4 | Competes with GPT-4 class

Best Models for RAG (Retrieval-Augmented Generation)

RAG models need to handle long contexts and follow instructions about provided documents accurately.

Budget VRAM | Model | Context Length | Why
8 GB | Llama 3.1 8B | 128K | Excellent instruction following
12-16 GB | Qwen 2.5 14B | 128K | Better comprehension
24 GB | Qwen 2.5 32B Q4 | 128K | Best local RAG quality

You also need an embedding model for RAG. These are separate, smaller models:

# Popular embedding models
ollama pull nomic-embed-text    # 274MB, good general purpose
ollama pull mxbai-embed-large   # 670MB, higher quality
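
Under the hood, RAG retrieval compares the query's embedding against each document's embedding, most commonly by cosine similarity, and feeds the top matches to the chat model. A toy sketch of that retrieval step (the 2-dimensional vectors here are placeholders; in practice they would come from an embedding model such as nomic-embed-text):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine_similarity(query, docs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" standing in for real model output:
docs = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], docs, k=2))   # → [1, 2]
```

The chat model from the table above then answers using only the retrieved chunks, which is why its instruction following matters as much as its raw knowledge.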

Best Models for Creative Writing

Creative writing benefits from larger models with more diverse training data.

Budget VRAM | Model | Why
8 GB | Llama 3.1 8B | Good vocabulary, natural prose
12-16 GB | Mistral 7B (fine-tuned) | Many creative fine-tunes available
24 GB | Qwen 2.5 32B Q4 | Rich, nuanced writing
48 GB | Llama 3.1 70B Q4 | Best local creative quality

Tip: For creative writing, increase the temperature parameter to 0.8-1.0 and raise top_p to 0.95 for more diverse, interesting output:

# In Ollama Modelfile
FROM llama3.1:8b
PARAMETER temperature 0.85
PARAMETER top_p 0.95
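
Why does raising temperature help creativity? Temperature rescales the model's output distribution before sampling: values below 1 sharpen it toward the most likely token, values above flatten it so unlikely tokens get picked more often. A minimal illustration with made-up logits:

```python
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over temperature-scaled logits, as used in LLM sampling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                     # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print([round(p, 2) for p in apply_temperature(logits, 0.5)])  # → [0.86, 0.12, 0.02]
print([round(p, 2) for p in apply_temperature(logits, 1.0)])  # → [0.66, 0.24, 0.1]
```

At temperature 0.5 the top token dominates; at 1.0 the alternatives get a real chance, which is the effect the Modelfile settings above exploit.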

Quantization Deep Dive

Understanding quantization helps you make informed trade-offs between quality and performance.

Quantization Levels Compared

Level | Bits | Size (7B) | Quality | Speed | When to Use
FP16 | 16 | ~14 GB | 100% | Baseline | Research, when memory allows
Q8_0 | 8 | ~7.5 GB | ~99% | Fast | When you have plenty of VRAM
Q6_K | 6 | ~5.5 GB | ~98% | Fast | Best quality-per-GB
Q5_K_M | 5 | ~5.0 GB | ~96% | Fast | Good sweet spot
Q4_K_M | 4 | ~4.4 GB | ~93% | Fast | Recommended default
Q4_0 | 4 | ~4.0 GB | ~90% | Fastest | Maximum speed priority
Q3_K_M | 3 | ~3.3 GB | ~85% | Fast | Squeezing models into tight memory
Q2_K | 2 | ~2.8 GB | ~70% | Fast | Last resort
IQ4_XS | 4 | ~4.0 GB | ~92% | Medium | Tight fit with importance-weighted quant
IQ2_XXS | 2 | ~2.3 GB | ~65% | Medium | Extreme compression

The Sweet Spots

You have plenty of VRAM: Use Q6_K. Nearly indistinguishable from full precision.

You want the best balance: Use Q4_K_M. This is what most people should use. The quality loss is minimal for most tasks.

Your model barely fits: Use Q3_K_M or IQ4_XS. Some quality loss but still very usable.

Bigger model at lower quant, or smaller model at higher quant? Generally, a bigger model at Q4 beats a smaller model at Q8 for the same VRAM budget. For example, Qwen 2.5 14B Q4 (~8 GB) typically outperforms Qwen 2.5 7B Q8 (~7.5 GB).

How to Choose Quantization in Practice

With Ollama, quantization is selected via model tags:

# Default (usually Q4_K_M)
ollama run llama3.1:8b

# Specific quantization
ollama run llama3.1:8b-instruct-q6_K
ollama run llama3.1:8b-instruct-q8_0

With LM Studio or manual GGUF downloads from Hugging Face, you select the quantization level when downloading the file.

Building a Recommendation Matrix

Here’s how to use everything above to make your decision:

Step 1: Identify Your VRAM Budget

No GPU → use RAM total minus 4 GB for OS
NVIDIA GPU → use VRAM amount
Apple Silicon → use total unified memory minus 4-6 GB for OS
Mixed (partial offload) → VRAM + some RAM, but slower

Step 2: Pick Your Task Category

General chat → Llama 3.1 or Qwen 2.5 (general)
Coding → Qwen 2.5 Coder or DeepSeek Coder
Reasoning → Qwen 2.5 or DeepSeek R1 (distilled)
Creative → Llama 3.1 or community fine-tunes
RAG → Llama 3.1 or Qwen 2.5 (long context)
Vision → Llama 3.2 Vision or Qwen2-VL

Step 3: Find Your Size Tier

< 4 GB available → 1-3B models
4-6 GB available → 3-7B models
8-12 GB available → 7-14B models
16-24 GB available → 14-32B models
32-48 GB available → 32-70B models
64+ GB available → 70B+ models

Step 4: Select Quantization

Model fits easily → Q6_K or Q5_K_M
Model fits with room → Q4_K_M (default)
Model barely fits → Q3_K_M or IQ4_XS
Model doesn't fit → drop to smaller size
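
The four steps above can be sketched as two small helpers. The thresholds mirror the tiers listed here but are illustrative assumptions, not hard limits; always verify the actual GGUF file fits before committing to a download:

```python
def pick_size_tier(available_gb: float) -> str:
    """Step 3: map a memory budget (GB) to a model size tier."""
    tiers = [(4, "1-3B"), (6, "3-7B"), (12, "7-14B"),
             (24, "14-32B"), (48, "32-70B")]
    for limit, tier in tiers:
        if available_gb <= limit:
            return tier
    return "70B+"

def pick_quant(model_q4_gb: float, budget_gb: float) -> str:
    """Step 4: choose quantization from how comfortably the Q4 build fits.
    The 1.4x / 1.05x / 0.85x ratios are rough rules of thumb."""
    if budget_gb >= model_q4_gb * 1.4:
        return "Q6_K"                  # fits easily
    if budget_gb >= model_q4_gb * 1.05:
        return "Q4_K_M"                # fits with room (default)
    if budget_gb >= model_q4_gb * 0.85:
        return "Q3_K_M or IQ4_XS"      # barely fits
    return "drop to a smaller size"

print(pick_size_tier(10))       # → 7-14B
print(pick_quant(8.4, 10.0))    # → Q4_K_M
```

For a 10 GB budget this lands on a 7-14B model at the default Q4_K_M, matching the tables earlier in this guide.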

Testing and Evaluating Models

Once you’ve narrowed your choices, test them with prompts representative of your actual use case.

Quick Comparison Method

# Create a test prompt file
cat > /tmp/test-prompt.txt << 'EOF'
Write a Python function that takes a list of dictionaries 
and returns a new list sorted by a specified key, handling 
missing keys gracefully. Include type hints and docstring.
EOF

# Test multiple models
for model in llama3.1:8b qwen2.5:7b mistral:7b; do
  echo "=== $model ==="
  time ollama run $model < /tmp/test-prompt.txt
  echo ""
done

Key Metrics to Compare

Tokens per second: How fast the model generates text. Check with ollama run --verbose.
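
If you query Ollama's HTTP API instead, the /api/generate response reports generation statistics directly: eval_count (tokens generated) and eval_duration (in nanoseconds), so throughput is just their ratio. A small helper, assuming those two fields from the response:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's reported generation statistics."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 512 tokens generated over 16 seconds:
print(round(tokens_per_second(512, 16_000_000_000), 1))   # → 32.0
```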

Quality: Does the output match your expectations? Test with 5-10 representative prompts.

Instruction following: Does the model do what you ask? Test with specific format requirements.

Context handling: Give it a long document and ask questions. Does it stay accurate?

Benchmarks vs. Real-World Use

Model benchmarks (MMLU, HumanEval, GSM8K) are useful for rough comparisons but don’t always predict real-world performance. A model that scores higher on HumanEval might not write better code for your specific framework or language. Always test with your own use cases.

Practical Examples

Example 1: Developer on a Budget

Situation: 16 GB RAM laptop, no GPU, need code assistance
Recommendation: Qwen 2.5 Coder 7B Q4_K_M
Why: Best code model at the 7B tier, runs on CPU with 16 GB RAM

ollama run qwen2.5-coder:7b

Example 2: Privacy-Focused Professional

Situation: M3 MacBook Pro 36 GB, need document Q&A for sensitive files
Recommendation: Qwen 2.5 32B Q4_K_M + nomic-embed-text
Why: Strong comprehension, fits in unified memory, excellent for RAG

ollama run qwen2.5:32b
ollama pull nomic-embed-text

Example 3: Home Lab Enthusiast

Situation: RTX 4090 (24 GB VRAM), want maximum quality for everything
Recommendation: Qwen 2.5 32B Q5_K_M for daily use, swapping to specialized models as needed

ollama run qwen2.5:32b         # General use
ollama run qwen2.5-coder:32b   # Coding sessions
ollama run deepseek-r1:32b     # Complex reasoning

Example 4: Small Team Deployment

Situation: Server with 2x RTX A6000 (2x 48 GB VRAM), 5 concurrent users
Recommendation: Llama 3.1 70B Q4 via vLLM for throughput

# Using vLLM for multi-user serving
pip install vllm
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

Staying Current

The local LLM landscape evolves rapidly. New models drop weekly. Here’s how to stay informed:

  • Hugging Face Open LLM Leaderboard: Standardized benchmarks for new models
  • r/LocalLLaMA: Active Reddit community with real-world testing
  • local-llm.net: Our guides and comparisons, updated regularly
  • Ollama Library: Check for newly supported models

When a new model comes out, ask: does it beat my current model at my specific task, on my hardware? If not, there’s no need to switch. The best model is the one that works for you.

Frequently Asked Questions

What is the best local LLM for general use in 2026?

For general-purpose chat and reasoning, Llama 3.1 8B and Qwen 2.5 7B offer the best balance of quality and hardware requirements. If you have 24 GB VRAM, Qwen 2.5 32B is a significant step up. For users with Apple Silicon Macs (32+ GB unified memory), Llama 3.1 70B Q4 is an excellent all-around choice.

Does a bigger model always mean better results?

Not always. A smaller model fine-tuned for a specific task (like coding) often outperforms a larger general-purpose model at that task. Qwen 2.5 Coder 7B can beat Llama 3.1 70B on coding benchmarks, for example. Choose the right model for your task, not just the biggest one that fits in memory.

What quantization level should I use?

Q4_K_M is the recommended default. It reduces model size by roughly 75% with only a moderate quality loss. If your model barely fits in memory at Q4, try Q3_K_M. If you have plenty of headroom, Q5_K_M or Q6_K give better quality. Avoid Q2_K unless you have no other option, as quality degrades significantly.

Can I run multiple models at the same time?

Yes, but each loaded model consumes memory. Ollama keeps models loaded for 5 minutes by default (adjustable with the OLLAMA_KEEP_ALIVE environment variable) and can load multiple models simultaneously if you have enough RAM/VRAM. Concurrency is controlled with the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables.