Running AI locally requires hardware that can load model weights into memory and process them fast enough for interactive use — and the single most important factor is how much memory (RAM or VRAM) your system has. A 7B parameter model at 4-bit quantization needs roughly 4-6 GB of memory, while a 70B model needs 40+ GB. Beyond memory capacity, your inference speed depends on memory bandwidth, compute performance, and storage speed. This guide covers every hardware component that matters, with specific recommendations for every budget from $0 to $2,000+.
The good news: you do not need expensive hardware to get started. If you own a laptop or desktop made after 2020 with 16 GB of RAM, you can already run capable 7B-8B models using CPU inference. If you have a discrete GPU with 8+ GB VRAM or an Apple Silicon Mac, you can run even larger models at comfortable speeds. This guide will help you understand exactly what your current hardware can handle and what to buy if you want more.
What hardware do you need to run AI locally?
Local AI inference has four hardware requirements, in order of importance:
- Memory (VRAM or RAM): The model weights must fit in memory. This is the hard constraint — if you do not have enough memory, the model simply will not load.
- Memory bandwidth: Token generation speed is directly proportional to how fast you can read model weights from memory. More bandwidth means faster responses.
- Compute (CPU/GPU cores): Important for prompt processing (the “prefill” phase). More compute means faster time-to-first-token for long prompts.
- Storage: Models are large files (2-50+ GB). Fast storage means faster load times when switching models.
Understanding these four factors will help you make informed decisions about hardware purchases and model selection.
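The bandwidth factor above can be turned into a useful back-of-envelope estimate: generating one token requires reading essentially every model weight from memory once, so memory bandwidth divided by model size gives a theoretical ceiling on tokens per second. This is a simplified sketch (it ignores compute limits, caching, and KV-cache reads), but real-world numbers land at a consistent fraction of this ceiling:

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Theoretical upper bound on token generation speed.

    Each generated token reads every model weight from memory once,
    so the ceiling is bandwidth / model size. Real throughput is lower.
    """
    return bandwidth_gb_s / model_size_gb

# A 7B model at Q4 has roughly 4.4 GB of weights:
print(max_tokens_per_second(4.4, 936))  # RTX 3090 (936 GB/s): ~213 tok/s ceiling
print(max_tokens_per_second(4.4, 50))   # DDR4-3200 (~50 GB/s): ~11 tok/s ceiling
```

Note how the DDR4 ceiling (~11 tok/s) lines up with the 8-15 tok/s CPU figures quoted elsewhere in this guide: CPU inference really is bandwidth-bound.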
How much VRAM or RAM do you need for each model size?
The amount of memory required depends on two things: the model’s parameter count and the quantization level. Here is a comprehensive reference table:
| Model Size | FP16 (Full Precision) | Q8_0 (8-bit) | Q5_K_M (5-bit) | Q4_K_M (4-bit) | Q3_K_M (3-bit) | Q2_K (2-bit) |
|---|---|---|---|---|---|---|
| 1B | 2 GB | 1.1 GB | 0.8 GB | 0.7 GB | 0.6 GB | 0.5 GB |
| 3B | 6 GB | 3.3 GB | 2.4 GB | 2.1 GB | 1.7 GB | 1.4 GB |
| 7B | 14 GB | 7.7 GB | 5.3 GB | 4.4 GB | 3.6 GB | 3.0 GB |
| 8B | 16 GB | 8.5 GB | 5.8 GB | 4.9 GB | 4.0 GB | 3.3 GB |
| 13B | 26 GB | 14 GB | 9.5 GB | 7.9 GB | 6.4 GB | 5.3 GB |
| 14B | 28 GB | 15 GB | 10 GB | 8.4 GB | 6.9 GB | 5.7 GB |
| 30B-34B | 60-68 GB | 33-37 GB | 23-25 GB | 19-21 GB | 15-17 GB | 12-14 GB |
| 70B | 140 GB | 77 GB | 53 GB | 44 GB | 36 GB | 29 GB |
| 405B | 810 GB | 440 GB | 300 GB | 250 GB | 200 GB | 165 GB |
Important: These numbers represent the model weights only. You also need additional memory for:
- KV-cache: The key-value cache used during inference grows with context length. At 4K context, expect 0.5-2 GB extra. At 32K context, expect 2-8 GB extra, depending on the model architecture.
- Operating system and applications: Reserve at least 2-4 GB for your OS.
- Inference engine overhead: The inference engine itself uses some memory for buffers and processing.
Rule of thumb: Add 2-4 GB to the model weight size for a comfortable estimate of total memory needed. For long-context use (16K+ tokens), add 4-8 GB.
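The rule of thumb can be sketched as a small calculator. One assumption to flag: "4-bit" quants like Q4_K_M actually average closer to 5 effective bits per weight, because some tensors are kept at higher precision — which is why the table above lists 7B Q4_K_M at 4.4 GB rather than 3.5 GB. The overhead constants here are just the midpoints of the ranges given above:

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       long_context: bool = False) -> float:
    """Rough total-memory estimate: weights plus overhead.

    Overhead covers OS, KV-cache, and engine buffers — midpoint of the
    2-4 GB rule of thumb, or of the 4-8 GB range for 16K+ contexts.
    """
    weights_gb = params_billions * bits_per_weight / 8
    overhead_gb = 6.0 if long_context else 3.0
    return weights_gb + overhead_gb

# Q4_K_M averages ~5 effective bits per weight:
print(estimate_memory_gb(7, 5.0))                     # 7B: ~7.4 GB total
print(estimate_memory_gb(70, 5.0, long_context=True)) # 70B, long context: ~49.8 GB
```

This is an estimation aid, not a guarantee — actual usage varies by model architecture and inference engine.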
Which GPU should you buy for local AI?
The GPU is the most impactful hardware upgrade for local AI. GPUs offer 5-20x higher memory bandwidth than system RAM, which translates directly to faster token generation. Here are specific recommendations organized by budget.
Budget: $0 — Use what you already have
If you already have a modern computer, you can start running local AI today at zero cost:
- Any discrete NVIDIA GPU with 6+ GB VRAM: Even an older GTX 1660 Ti (6 GB) or RTX 2060 (6 GB) can run 7B Q4 models.
- Any Apple Silicon Mac: M1, M2, M3, or M4 with 8 GB or more unified memory handles 7B models well.
- Any CPU with 16+ GB RAM: CPU-only inference with llama.cpp or Ollama runs 7B Q4 models at 8-15 tok/s. Slower than GPU, but functional.
- Any AMD GPU with 8+ GB VRAM: ROCm support in llama.cpp enables AMD GPUs, though setup is less polished than NVIDIA’s CUDA.
You do not need to buy anything to try local AI. Install Ollama, download a 7B model, and see what your current hardware can do.
Budget: $300-$800 — The sweet spot for beginners
This range gets you into serious local AI with models up to 13B-14B running comfortably and 30B models at reduced quantization.
| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4060 Ti 16 GB | 16 GB | $400-450 | New card, great for 7B-14B models, energy efficient |
| RTX 3060 12 GB | 12 GB | $250-300 (used) | Entry-level GPU inference, runs 7B-13B at Q4 |
| RTX 3070 8 GB | 8 GB | $250-300 (used) | Fast 7B inference, limited by 8 GB VRAM |
| RTX 3080 10 GB | 10 GB | $350-400 (used) | Strong performance, 7B-13B at Q4 comfortably |
| RX 7800 XT | 16 GB | $450-500 | AMD option, 16 GB VRAM, good ROCm support |
| Used RTX 3090 | 24 GB | $700-900 | The value king — 24 GB handles up to 30B at Q4 |
Recommendation: The used RTX 3090 is the single best value for local AI. Its 24 GB VRAM handles the vast majority of models people actually want to run. At $700-900, it delivers VRAM capacity that matches the $1,800 RTX 4090. The RTX 3090 is slower in raw throughput but the VRAM is what matters most for model compatibility.
If you want a new card and do not need 24 GB, the RTX 4060 Ti 16 GB is an excellent choice. Its 16 GB VRAM runs 7B-14B models with room for generous context windows, and its Ada Lovelace architecture is power-efficient (160W TDP vs 350W for the RTX 3090).
Budget: $800-$2,000 — High-performance local AI
This range unlocks 70B models and near-frontier quality.
| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4090 | 24 GB | $1,600-2,000 | The best single consumer GPU. Fastest 7B-30B inference. 70B at Q2-Q3. |
| RTX 4080 Super | 16 GB | $950-1,100 | Fast inference for 7B-14B. Less VRAM than 4090 but strong compute. |
| Used RTX A6000 | 48 GB | $1,500-2,000 | Professional GPU, 48 GB VRAM runs 70B at Q4. No NVLink needed. |
| Mac Mini M4 Pro (24 GB) | 24 GB unified | $1,600 | Complete system, quiet, energy efficient, excellent for 7B-14B. |
| Mac Mini M4 Pro (48 GB) | 48 GB unified | $2,000 | Runs 30B-34B models comfortably. 70B at Q3-Q4 with reduced context. |
Recommendation: The RTX 4090 is the fastest consumer GPU for inference. Its combination of 24 GB VRAM, 1 TB/s bandwidth, and massive compute makes it the gold standard for enthusiasts. If you need more VRAM, a used RTX A6000 (48 GB) lets you run 70B models at Q4 without multi-GPU complexity.
For a complete, quiet, energy-efficient system, a Mac Mini M4 Pro with 48 GB is hard to beat. It draws 30W under load (vs 350-450W for a desktop with an RTX 4090), runs silently, and handles 30B models with excellent performance thanks to Apple’s unified memory architecture.
Budget: $2,000+ — Maximum capability
At this tier, you can run 70B+ models at full quality and serve multiple users.
| Setup | VRAM/Memory | Approx. Price | Best For |
|---|---|---|---|
| Dual RTX 3090 | 48 GB total | $1,800-2,200 | 70B at Q4 with good context. Requires NVLink or tensor parallel. |
| Dual RTX 4090 | 48 GB total | $3,600-4,000 | Fastest 70B inference on consumer hardware. |
| Mac Studio M2 Ultra (192 GB) | 192 GB unified | $5,000-6,000 | Runs 120B+ models. 405B at low quant. Silent, efficient. |
| Mac Studio M4 Ultra (256 GB) | 256 GB unified | $7,000-9,000 | Runs 405B at Q4. The most capable single-machine setup. |
| 4x RTX 4090 server | 96 GB total | $8,000-12,000 | High-throughput serving for teams. Multiple concurrent users. |
Recommendation: For individual use, a Mac Studio M2/M4 Ultra with maximum memory provides the simplest path to running 70B+ models. No driver issues, no multi-GPU configuration, no cooling challenges — just plug in and run. For team serving and maximum throughput, a multi-GPU NVIDIA setup with vLLM provides the best tokens-per-second-per-dollar.
How does Apple Silicon compare for local AI?
Apple Silicon deserves special attention because its unified memory architecture gives it a unique advantage for local AI.
What is unified memory?
On a traditional PC, the CPU uses system RAM (DDR4/DDR5) and the GPU uses its own dedicated VRAM (GDDR6X). These are separate memory pools connected by the PCIe bus. If a model does not fit in VRAM, the GPU must stream data over the slower PCIe connection, dramatically reducing speed.
On Apple Silicon, the CPU, GPU, and Neural Engine all share the same pool of unified memory. When you buy a Mac with 32 GB of memory, all 32 GB is available for model weights. There is no separate VRAM and RAM — it is all one fast, shared pool.
Apple Silicon performance comparison
| Chip | Max Memory | Memory Bandwidth | 7B Q4 (tok/s) | 13B Q4 (tok/s) | 70B Q4 Feasible? |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | No |
| M1 Pro | 32 GB | 200 GB/s | 25-30 | 15-20 | No |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 20-28 | Yes (Q2-Q3) |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 30-38 | Yes (Q4) |
| M2 | 24 GB | 100 GB/s | 18-22 | 10-14 | No |
| M2 Pro | 32 GB | 200 GB/s | 28-35 | 16-22 | No |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 22-30 | Yes (Q3-Q4) |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 32-40 | Yes (Q4-Q5) |
| M3 | 24 GB | 100 GB/s | 20-25 | 11-15 | No |
| M3 Pro | 36 GB | 150 GB/s | 25-32 | 15-20 | No |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 24-32 | Yes (Q3-Q4) |
| M4 | 32 GB | 120 GB/s | 22-28 | 12-16 | No |
| M4 Pro | 48 GB | 273 GB/s | 32-40 | 20-26 | Tight (Q2-Q3) |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 30-38 | Yes (Q4) |
| M4 Ultra | 256 GB | 819 GB/s | 55-65 | 35-45 | Yes (Q5+) |
When to choose Apple Silicon vs NVIDIA
Choose Apple Silicon when:
- You want a quiet, energy-efficient, all-in-one system
- You need to run models larger than 24 GB (the VRAM limit of consumer NVIDIA GPUs)
- You value the macOS ecosystem and do not want to deal with Linux/Windows GPU driver issues
- You want to run 70B+ models on a single machine without multi-GPU complexity
- Energy cost and noise are concerns (Apple Silicon draws 30-60W vs 300-450W for high-end GPUs)
Choose NVIDIA GPUs when:
- You need maximum tokens per second for a given model size (NVIDIA GPUs have higher bandwidth per dollar at the low and mid range)
- You want to use CUDA-optimized tools like vLLM, ExLlamaV2, or GPTQ/AWQ quantization
- You need multi-GPU scaling for serving multiple concurrent users
- You are building on Linux and want the deepest ecosystem support
- You plan to do fine-tuning or training (NVIDIA’s CUDA ecosystem for training is unmatched)
Can you run AI on CPU only?
Yes. CPU-only inference is a legitimate option, especially for smaller models and use cases where speed is not critical.
How CPU inference works
When running on CPU, the model weights are loaded into system RAM (DDR4 or DDR5) and all computation happens on the CPU. The inference engine (typically llama.cpp) uses optimized SIMD instructions (AVX2, AVX-512, ARM NEON) to process matrix multiplications on the CPU.
CPU inference performance
| CPU | RAM Type | 7B Q4 (tok/s) | 13B Q4 (tok/s) | Notes |
|---|---|---|---|---|
| Intel Core i7-13700K | DDR5-5600 | 12-18 | 6-10 | Good AVX-512 support |
| Intel Core i5-12400 | DDR5-4800 | 8-12 | 4-7 | Budget option, usable |
| AMD Ryzen 7 7800X3D | DDR5-5200 | 10-15 | 5-9 | 3D V-Cache does not help AI workloads directly |
| AMD Ryzen 9 7950X | DDR5-5600 | 14-20 | 7-12 | 16 cores help with prefill |
| Apple M1 (CPU only) | LPDDR5 | 15-20 | 8-12 | Unified memory architecture helps |
| Intel Core Ultra 200S | DDR5-6400 | 14-18 | 7-11 | Latest gen, decent bandwidth |
When CPU-only makes sense
- You are just getting started and want to try local AI before investing in a GPU
- Your workload is light — a few queries per day, short prompts, short responses
- You are running small models (1B-7B) for specific tasks like classification, extraction, or embeddings
- You have a high-core-count server CPU with large amounts of DDR5 RAM for batch processing
- You are on a laptop without a discrete GPU and want basic chat capabilities
CPU inference at 10-15 tok/s is actually faster than typical reading speed (roughly 4-5 words per second, with a token averaging about three-quarters of a word), so it is perfectly usable for interactive chat. For batch processing where latency does not matter, CPU inference is cost-effective since system RAM is much cheaper than VRAM per gigabyte.
How much system RAM do you need?
System RAM serves two purposes: running the operating system and applications, and (if using CPU inference or GPU offloading) holding model weights.
| Use Case | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| GPU inference with small models | 16 GB | 32 GB | Model lives in VRAM; RAM just for OS and KV-cache spillover |
| GPU inference with large models | 32 GB | 64 GB | Useful for partial offload — split model between GPU and CPU |
| CPU-only inference (7B models) | 16 GB | 32 GB | Model lives in RAM; need headroom for OS and context |
| CPU-only inference (13B models) | 32 GB | 48 GB | 13B Q4 is ~8 GB plus KV-cache and OS overhead |
| CPU-only inference (30B models) | 48 GB | 64 GB | 30B Q4 is ~19 GB; need room for context and system |
| Apple Silicon (unified memory) | 16 GB | 32-48 GB | All memory is shared; more is always better |
| Multi-GPU server | 64 GB | 128 GB | Need headroom for multiple models and batch processing |
RAM speed matters
For CPU inference, RAM bandwidth directly determines token generation speed. Faster RAM means faster inference:
- DDR4-3200: ~50 GB/s bandwidth — baseline speed
- DDR5-4800: ~77 GB/s bandwidth — ~50% faster than DDR4
- DDR5-5600: ~90 GB/s bandwidth — recommended for CPU inference
- DDR5-6400: ~100 GB/s bandwidth — best consumer DDR5
- LPDDR5X (laptops): ~60-90 GB/s — varies significantly by laptop
If you are primarily using GPU inference, RAM speed matters less since the model weights live in VRAM. If you are doing CPU inference or partial GPU offloading, investing in fast DDR5 RAM provides a noticeable speed improvement.
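The bandwidth figures above follow directly from the memory spec: peak bandwidth is the transfer rate (the number after "DDR5-") times the number of channels times the 8-byte bus width per channel. A quick sketch:

```python
def ram_bandwidth_gb_s(mt_per_s: int, channels: int = 2) -> float:
    """Theoretical peak RAM bandwidth in GB/s.

    transfer rate (MT/s) x channels x 8 bytes per channel, / 1000.
    DDR5 splits each DIMM into two 32-bit subchannels, but the total
    per-DIMM width is still 8 bytes, so the formula is unchanged.
    """
    return mt_per_s * channels * 8 / 1000

print(ram_bandwidth_gb_s(3200))  # DDR4-3200 dual channel: 51.2 GB/s
print(ram_bandwidth_gb_s(5600))  # DDR5-5600 dual channel: 89.6 GB/s
```

This is why high-core-count server platforms with 8 or 12 memory channels are so much faster for CPU inference than desktop parts: the formula scales linearly with channel count.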
Why does SSD storage matter for local AI?
Model files are large. A single quantized model ranges from 2 GB (small 3B models) to 50+ GB (large 70B models). When you download multiple models and multiple quantization levels, storage adds up quickly:
| Usage Pattern | Estimated Storage Needs |
|---|---|
| One or two small models | 10-20 GB |
| Several 7B-13B models | 30-80 GB |
| Full collection including 30B-70B models | 200-500 GB |
| Power user with many quants, fine-tunes, merges | 500 GB - 2 TB |
SSD vs HDD for local AI
- NVMe SSD: Model loading in 2-10 seconds. Instant model switching. This is the recommended option.
- SATA SSD: Model loading in 5-20 seconds. Acceptable performance.
- HDD: Model loading in 30-120+ seconds. Painfully slow when switching models. Avoid if possible.
An NVMe SSD does not affect token generation speed (that depends on GPU/CPU and memory bandwidth), but it dramatically improves the experience of loading models, switching between models, and initial startup time. A 1 TB NVMe SSD costs $60-100 and provides ample space for a large model collection.
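The load-time differences above are straightforward to estimate: best-case load time is file size divided by sequential read speed (real loads are somewhat slower due to deserialization and memory allocation). The drive speeds used here are typical round numbers, not measurements of any specific product:

```python
def load_time_seconds(model_gb: float, read_speed_gb_s: float) -> float:
    """Best-case model load time: file size / sequential read speed."""
    return model_gb / read_speed_gb_s

model_gb = 40  # e.g. a 70B Q4 model
print(load_time_seconds(model_gb, 7.0))   # PCIe 4.0 NVMe (~7 GB/s): ~6 s
print(load_time_seconds(model_gb, 0.55))  # SATA SSD (~550 MB/s): ~73 s
print(load_time_seconds(model_gb, 0.15))  # HDD (~150 MB/s): ~267 s
```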
How do multi-GPU setups work?
When a single GPU does not have enough VRAM to hold your target model, you have two options: quantize the model more aggressively to make it fit, or use multiple GPUs.
Tensor parallelism
Tensor parallelism splits each layer’s weight matrices across multiple GPUs, so every GPU works on a slice of every layer simultaneously. This approach is supported by vLLM, llama.cpp (partially), and ExLlamaV2. It requires fast inter-GPU communication for good performance.
- NVLink: The fastest inter-GPU connection. Data-center GPUs (A100, H100) reach 600+ GB/s; the RTX 3090, the last consumer card to support it, offers roughly 112 GB/s over its NVLink bridge — still far faster than PCIe. Provides near-linear scaling.
- PCIe 4.0 x16: 32 GB/s per direction. Workable for large models where each GPU does substantial work per communication step. Expect 60-80% of ideal scaling.
- PCIe 5.0 x16: 64 GB/s per direction. Better scaling than PCIe 4.0, but still well below NVLink.
Pipeline parallelism (layer splitting)
Pipeline parallelism assigns different layers of the model to different GPUs. GPU 1 processes layers 1-40, GPU 2 processes layers 41-80. This is simpler to implement and is the default approach in llama.cpp’s GPU offloading.
- Does not require NVLink
- Works across different GPU models and sizes (you can mix an RTX 3090 and an RTX 3060)
- Has a “pipeline bubble” — GPUs are idle while waiting for the previous GPU to finish its layers
- Generally 50-70% efficient compared to single-GPU inference
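Pipeline (layer) splitting is simple enough to sketch. The snippet below assigns contiguous layer counts to each GPU in proportion to its VRAM, which is the basic idea behind llama.cpp-style offload splits — a simplified illustration, not llama.cpp's actual allocation logic, which also accounts for per-layer size differences and KV-cache placement:

```python
def split_layers(total_layers: int, vram_per_gpu_gb: list) -> list:
    """Assign contiguous layer counts to GPUs proportional to VRAM.

    Illustrates pipeline (layer) splitting: each GPU hosts a block of
    consecutive layers sized to its share of total VRAM.
    """
    total_vram = sum(vram_per_gpu_gb)
    counts = [round(total_layers * v / total_vram) for v in vram_per_gpu_gb]
    counts[-1] += total_layers - sum(counts)  # absorb rounding drift
    return counts

# An 80-layer 70B model on an RTX 3090 (24 GB) + RTX 3060 (12 GB):
print(split_layers(80, [24, 12]))  # -> [53, 27]
```

This mixed-GPU case is exactly why pipeline splitting is popular: the two cards never need to exchange more than one layer's activations per token, so a fast interconnect is unnecessary.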
Practical multi-GPU recommendations
| Setup | Total VRAM | Models It Enables | Estimated Cost |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 30B at Q4, 70B at Q2 | $500-600 (used) |
| 2x RTX 3090 | 48 GB | 70B at Q4, 34B at Q8 | $1,600-1,800 (used) |
| RTX 3090 + RTX 3060 | 36 GB | 70B at Q3, 34B at Q5 | $950-1,200 (used) |
| 2x RTX 4090 | 48 GB | 70B at Q4 with generous context | $3,600-4,000 |
| 4x RTX 4090 | 96 GB | 70B at Q8, 405B at Q2 | $7,000-8,000 |
Key considerations for multi-GPU:
- Ensure your motherboard has enough PCIe slots at x8 or x16 speed
- Your power supply must handle the combined TDP (two RTX 3090s draw 700W+ under load)
- Physical space and cooling — high-end GPUs are large and produce significant heat
- Check that your inference engine supports your desired parallelism strategy
What is the complete GPU comparison table?
Here is a head-to-head comparison of the most popular GPUs for local AI:
| GPU | VRAM | Bandwidth | FP16 TFLOPS | TDP | Price (Approx.) | 7B Q4 tok/s | 13B Q4 tok/s | Max Practical Model |
|---|---|---|---|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB | 360 GB/s | 12.7 | 170W | $250-300 (used) | 40-55 | 20-30 | 13B Q4 |
| RTX 3070 8 GB | 8 GB | 448 GB/s | 20.3 | 220W | $250-300 (used) | 50-65 | N/A (VRAM limit) | 7B Q5-Q8 |
| RTX 3090 | 24 GB | 936 GB/s | 35.6 | 350W | $700-900 (used) | 70-90 | 40-55 | 30B Q4, 70B Q2 |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 22.1 | 165W | $400-450 | 45-60 | 25-35 | 14B Q4 |
| RTX 4070 | 12 GB | 504 GB/s | 29.1 | 200W | $500-550 | 55-70 | 30-40 | 13B Q4 |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 44.1 | 285W | $750-850 | 60-80 | 35-50 | 14B Q5, 30B Q2 |
| RTX 4090 | 24 GB | 1,008 GB/s | 82.6 | 450W | $1,600-2,000 | 90-120 | 55-75 | 30B Q4, 70B Q2-Q3 |
| RX 7900 XTX | 24 GB | 960 GB/s | 61.4 | 355W | $850-950 | 60-80 | 35-50 | 30B Q4 (via ROCm) |
| RX 7900 XT | 20 GB | 800 GB/s | 51.6 | 315W | $650-750 | 50-65 | 30-42 | 20B Q4 (via ROCm) |
| M1 (8-core GPU) | 8-16 GB | 68 GB/s | 2.6 | 20W | N/A (system) | 15-20 | 8-12 | 7B Q4 (8 GB) |
| M2 (10-core GPU) | 8-24 GB | 100 GB/s | 3.6 | 22W | N/A (system) | 18-22 | 10-14 | 13B Q4 (24 GB) |
| M3 Pro | 18-36 GB | 150 GB/s | 7.4 | 30W | N/A (system) | 25-32 | 15-20 | 14B Q4 (18 GB) |
| M3 Max | 36-128 GB | 400 GB/s | 14.2 | 40W | N/A (system) | 40-50 | 24-32 | 70B Q3 (128 GB) |
| M4 Pro | 24-48 GB | 273 GB/s | 9.2 | 30W | N/A (system) | 32-40 | 20-26 | 30B Q4 (48 GB) |
| M4 Max | 36-128 GB | 546 GB/s | 18.4 | 40W | N/A (system) | 50-60 | 30-38 | 70B Q4 (128 GB) |
| M4 Ultra | 128-256 GB | 819 GB/s | 36.8 | 60W | N/A (system) | 55-65 | 35-45 | 405B Q3 (256 GB) |
Notes on the table:
- Tokens per second values are approximate and depend on quantization level, context length, batch size, and inference engine.
- Apple Silicon entries show GPU core performance with unified memory. TDP values are for the full SoC, not just the GPU.
- AMD RX 7900 series performance depends heavily on ROCm driver version and inference engine support. Performance continues to improve.
- “Max Practical Model” assumes Q4_K_M quantization and 4K context unless otherwise noted.
How do you choose the right hardware for your use case?
Here is a decision framework based on what you want to do:
I want to try local AI for the first time
Use your existing hardware. Install Ollama, run `ollama run llama3.2` (or `ollama run phi3` for a smaller model), and see how it performs. If you have any discrete GPU with 6+ GB VRAM, or a Mac with 16+ GB unified memory, or a CPU with 16+ GB DDR5 RAM, you can run a useful model today. Spend $0 and decide later if you want to invest more.
I want a reliable daily-driver AI assistant
Budget option: Used RTX 3060 12 GB ($250-300) to run 7B-13B models. Alternatively, a Mac Mini M4 with 16 GB ($600) for a quiet, simple setup.
Recommended option: Used RTX 3090 ($700-900) or Mac Mini M4 Pro 24 GB ($1,600). These handle 7B-14B models effortlessly and can stretch to 30B models for more capable outputs.
I want the best quality models available locally
Recommended: RTX 4090 ($1,800) for fastest single-GPU inference up to 30B. For 70B models, a Mac Studio with 96+ GB unified memory ($3,500+) or dual RTX 3090s ($1,600-1,800) provide the VRAM headroom you need.
I want to serve AI to a team
Recommended: Multi-GPU NVIDIA setup with vLLM. Two to four RTX 4090s or A6000s behind a vLLM instance can serve 10-50 concurrent users depending on the model size and request patterns. Budget $4,000-12,000 for the GPU hardware plus a workstation or server chassis with adequate power and cooling.
I want to run the largest open-weight models (405B)
Only practical options: Mac Studio M4 Ultra with 256 GB ($9,000), or a multi-GPU server with 4-8 high-VRAM GPUs. The 405B model at Q4 requires approximately 250 GB of memory. This is the domain of dedicated infrastructure, not casual hardware.
What about power consumption and noise?
Power consumption and noise are practical considerations that many hardware guides overlook:
| Setup | Power Draw (Under Load) | Noise Level | Annual Electricity Cost (12 h/day at $0.12/kWh) |
|---|---|---|---|
| Mac Mini M4 Pro | 30-40W | Silent | ~$15-20 |
| Mac Studio M2 Ultra | 50-80W | Very quiet | ~$25-40 |
| Desktop + RTX 4060 Ti | 250-350W | Moderate | ~$130-180 |
| Desktop + RTX 3090 | 400-550W | Loud | ~$200-290 |
| Desktop + RTX 4090 | 450-600W | Loud | ~$230-310 |
| Desktop + 2x RTX 4090 | 800-1000W | Very loud | ~$410-520 |
Apple Silicon systems are dramatically more power-efficient and quieter than GPU-based desktops. If you plan to run your local AI system 24/7 as a server, or if noise is a concern (home office, bedroom), Apple Silicon or laptop GPUs (which have lower TDP variants) are worth considering despite their lower peak performance.
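The cost column above is easy to reproduce; the figures are consistent with roughly 12 hours of load per day, which is the assumption used in this sketch:

```python
def annual_cost_usd(avg_watts: float, hours_per_day: float = 12,
                    usd_per_kwh: float = 0.12) -> float:
    """Annual electricity cost for a machine under load hours_per_day.

    Assumes ~12 h/day of load, consistent with the table's figures.
    """
    kwh_per_year = avg_watts / 1000 * hours_per_day * 365
    return kwh_per_year * usd_per_kwh

print(round(annual_cost_usd(35)))   # Mac Mini M4 Pro (~35W): ~$18/year
print(round(annual_cost_usd(500)))  # Desktop + RTX 3090 (~500W): ~$263/year
```

Doubling the duty cycle to 24/7 operation doubles these figures, which is worth keeping in mind if you plan to run the machine as an always-on server.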
What about AMD GPUs?
AMD GPUs are a viable option for local AI, though the software ecosystem is less mature than NVIDIA’s.
Pros:
- The RX 7900 XTX offers 24 GB VRAM at a lower price than the RTX 4090
- AMD is investing heavily in ROCm (their CUDA equivalent)
- llama.cpp has solid ROCm support, and performance is improving with each release
- The RX 7900 XT offers 20 GB VRAM — more than most NVIDIA consumer cards
Cons:
- ROCm setup is more complex than CUDA — driver installation on Linux can be finicky
- Not all inference engines support ROCm (vLLM has ROCm support; ExLlamaV2 is CUDA-focused)
- Fewer community guides and troubleshooting resources
- Performance per TFLOP is often lower than NVIDIA due to less optimized software stacks
- Windows support for ROCm is limited; Linux is strongly recommended
Verdict: If you are comfortable with Linux and want maximum VRAM per dollar, AMD’s RX 7900 XTX is a strong choice. If you prefer a smoother, more supported experience, NVIDIA remains the safer bet. The gap is closing, and AMD is a valid choice for llama.cpp and Ollama workflows.
What about Intel GPUs?
Intel’s Arc series (A770 16 GB, A750 8 GB) can run local AI through SYCL support in llama.cpp and Vulkan backends. However, performance is well behind NVIDIA and AMD, driver support is inconsistent, and the community is small. Intel GPUs are not recommended for serious local AI workloads at this time, though they can work in a pinch for 7B models on the Arc A770.
What upgrades provide the biggest performance improvement?
If you already have a system and want to improve local AI performance, here are upgrades ranked by impact:
- Add a discrete GPU (or upgrade to one with more VRAM): The single biggest improvement. Going from CPU-only to a 12+ GB GPU typically provides a 3-8x speed increase.
- Increase VRAM (upgrade GPU): More VRAM lets you run larger models or use less aggressive quantization, directly improving output quality.
- Add more system RAM: If you are CPU-only, more RAM lets you run larger models. If you are doing partial GPU offload, more RAM provides headroom for the CPU-processed layers.
- Upgrade to DDR5 (if CPU-only): The bandwidth improvement from DDR4 to DDR5 translates to 30-50% faster CPU inference.
- Upgrade to NVMe SSD: Faster model loading and switching. Does not affect inference speed.
- Add a second GPU: Enables larger models via layer splitting. Cost-effective if you already have one good GPU.
Summary: what should you buy?
| Budget | Recommended Setup | What It Runs |
|---|---|---|
| $0 | Your existing hardware + Ollama | 1B-7B models on CPU; up to 13B if you have a GPU |
| $300 | Used RTX 3060 12 GB | 7B-13B at Q4, comfortable chat and coding |
| $800 | Used RTX 3090 24 GB | 7B-30B at Q4, 70B at Q2. The best value in local AI. |
| $1,600 | Mac Mini M4 Pro 24 GB or RTX 4090 | 7B-14B (Mac) or 7B-30B (4090), fast and efficient |
| $2,000 | Mac Mini M4 Pro 48 GB or RTX 4090 | 30B-34B (Mac) or 30B at Q4 (4090) with full quality |
| $5,000 | Mac Studio M2 Ultra 192 GB | 70B+ at Q4-Q5, silent, energy efficient |
| $8,000+ | Multi-GPU server or M4 Ultra | 405B models, team serving, maximum capability |
The local AI hardware landscape rewards patience and pragmatism. Start with what you have, benchmark real models, and upgrade when your actual workload demands it. The best hardware for local AI is the hardware that runs the models you need at the speed you find acceptable.
Ready to choose a model for your hardware? Read How to Choose the Right Local LLM for a decision framework based on your use case, or jump to Understanding Quantization to learn how quantization affects quality and performance.