Local AI Hardware Guide: GPU, CPU, RAM, and Storage Requirements

A complete guide to the hardware you need to run AI locally — covering GPU VRAM requirements, CPU-only inference, RAM sizing, Apple Silicon, storage, and multi-GPU setups for every budget.

Running AI locally requires hardware that can load model weights into memory and process them fast enough for interactive use — and the single most important factor is how much memory (RAM or VRAM) your system has. A 7B parameter model at 4-bit quantization needs roughly 4-6 GB of memory, while a 70B model needs 40+ GB. Beyond memory capacity, your inference speed depends on memory bandwidth, compute performance, and storage speed. This guide covers every hardware component that matters, with specific recommendations for every budget from $0 to $2,000+.

The good news: you do not need expensive hardware to get started. If you own a laptop or desktop made after 2020 with 16 GB of RAM, you can already run capable 7B-8B models using CPU inference. If you have a discrete GPU with 8+ GB VRAM or an Apple Silicon Mac, you can run even larger models at comfortable speeds. This guide will help you understand exactly what your current hardware can handle and what to buy if you want more.

What hardware do you need to run AI locally?

Local AI inference has four hardware requirements, in order of importance:

  1. Memory (VRAM or RAM): The model weights must fit in memory. This is the hard constraint — if you do not have enough memory, the model simply will not load.
  2. Memory bandwidth: Token generation speed is directly proportional to how fast you can read model weights from memory. More bandwidth means faster responses.
  3. Compute (CPU/GPU cores): Important for prompt processing (the “prefill” phase). More compute means faster time-to-first-token for long prompts.
  4. Storage: Models are large files (2-50+ GB). Fast storage means faster load times when switching models.

Understanding these four factors will help you make informed decisions about hardware purchases and model selection.

How much VRAM or RAM do you need for each model size?

The amount of memory required depends on two things: the model’s parameter count and the quantization level. Here is a comprehensive reference table:

| Model Size | FP16 (Full Precision) | Q8_0 (8-bit) | Q5_K_M (5-bit) | Q4_K_M (4-bit) | Q3_K_M (3-bit) | Q2_K (2-bit) |
|---|---|---|---|---|---|---|
| 1B | 2 GB | 1.1 GB | 0.8 GB | 0.7 GB | 0.6 GB | 0.5 GB |
| 3B | 6 GB | 3.3 GB | 2.4 GB | 2.1 GB | 1.7 GB | 1.4 GB |
| 7B | 14 GB | 7.7 GB | 5.3 GB | 4.4 GB | 3.6 GB | 3.0 GB |
| 8B | 16 GB | 8.5 GB | 5.8 GB | 4.9 GB | 4.0 GB | 3.3 GB |
| 13B | 26 GB | 14 GB | 9.5 GB | 7.9 GB | 6.4 GB | 5.3 GB |
| 14B | 28 GB | 15 GB | 10 GB | 8.4 GB | 6.9 GB | 5.7 GB |
| 30B-34B | 60-68 GB | 33-37 GB | 23-25 GB | 19-21 GB | 15-17 GB | 12-14 GB |
| 70B | 140 GB | 77 GB | 53 GB | 44 GB | 36 GB | 29 GB |
| 405B | 810 GB | 440 GB | 300 GB | 250 GB | 200 GB | 165 GB |

Important: These numbers represent the model weights only. You also need additional memory for:

  • KV-cache: The key-value cache used during inference grows with context length. At 4K context, expect 0.5-2 GB extra. At 32K context, expect 2-8 GB extra, depending on the model architecture.
  • Operating system and applications: Reserve at least 2-4 GB for your OS.
  • Inference engine overhead: The inference engine itself uses some memory for buffers and processing.

Rule of thumb: Add 2-4 GB to the model weight size for a comfortable estimate of total memory needed. For long-context use (16K+ tokens), add 4-8 GB.
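The sizing rules above can be folded into a quick calculator. This is a rough sketch: the bits-per-weight figures approximate common GGUF quantization formats, and the fixed overhead term is the rule of thumb from this section, not a measurement.

```python
# Rough memory estimator for quantized model weights plus runtime overhead.
# Bits-per-weight values approximate llama.cpp GGUF quants (formats carry
# scale metadata, so effective bits exceed the nominal bit width). Treat
# all outputs as ballpark figures, not guarantees.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.35,
}

def estimate_total_gb(params_billions: float, quant: str,
                      long_context: bool = False) -> float:
    """Weights plus the 2-4 GB (or 4-8 GB for 16K+ context) rule of thumb."""
    weights_gb = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    overhead_gb = 6.0 if long_context else 3.0  # midpoint of the ranges above
    return round(weights_gb + overhead_gb, 1)

for size in (7, 13, 70):
    print(f"{size}B Q4_K_M: ~{estimate_total_gb(size, 'Q4_K_M')} GB total")
```

Running this reproduces the reference table's weight sizes to within a few percent, with the rule-of-thumb headroom added on top.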

Which GPU should you buy for local AI?

The GPU is the most impactful hardware upgrade for local AI. GPUs offer 5-20x higher memory bandwidth than system RAM, which translates directly to faster token generation. Here are specific recommendations organized by budget.

Budget: $0 — Use what you already have

If you already have a modern computer, you can start running local AI today at zero cost:

  • Any discrete NVIDIA GPU with 6+ GB VRAM: Even an older GTX 1660 Ti (6 GB) or RTX 2060 (6 GB) can run 7B Q4 models.
  • Any Apple Silicon Mac: M1, M2, M3, or M4 with 8 GB or more unified memory handles 7B models well.
  • Any CPU with 16+ GB RAM: CPU-only inference with llama.cpp or Ollama runs 7B Q4 models at 8-15 tok/s. Slower than GPU, but functional.
  • Any AMD GPU with 8+ GB VRAM: ROCm support in llama.cpp enables AMD GPUs, though setup is less polished than NVIDIA’s CUDA.

You do not need to buy anything to try local AI. Install Ollama, download a 7B model, and see what your current hardware can do.

Budget: $300-$800 — The sweet spot for beginners

This range gets you into serious local AI with models up to 13B-14B running comfortably and 30B models at reduced quantization.

| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4060 Ti 16 GB | 16 GB | $400-450 | New card, great for 7B-14B models, energy efficient |
| RTX 3060 12 GB | 12 GB | $250-300 (used) | Entry-level GPU inference, runs 7B-13B at Q4 |
| RTX 3070 8 GB | 8 GB | $250-300 (used) | Fast 7B inference, limited by 8 GB VRAM |
| RTX 3080 10 GB | 10 GB | $350-400 (used) | Strong performance, 7B-13B at Q4 comfortably |
| RX 7800 XT | 16 GB | $450-500 | AMD option, 16 GB VRAM, good ROCm support |
| Used RTX 3090 | 24 GB | $700-900 | The value king — 24 GB handles up to 30B at Q4 |

Recommendation: The used RTX 3090 is the single best value for local AI. Its 24 GB VRAM handles the vast majority of models people actually want to run. At $700-900, it delivers VRAM capacity that matches the $1,800 RTX 4090. The RTX 3090 is slower in raw throughput but the VRAM is what matters most for model compatibility.

If you want a new card and do not need 24 GB, the RTX 4060 Ti 16 GB is an excellent choice. Its 16 GB VRAM runs 7B-14B models with room for generous context windows, and its Ada Lovelace architecture is power-efficient (160W TDP vs 350W for the RTX 3090).

Budget: $800-$2,000 — High-performance local AI

This range unlocks 70B models and near-frontier quality.

| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4090 | 24 GB | $1,600-2,000 | The best single consumer GPU. Fastest 7B-30B inference. 70B at Q2-Q3. |
| RTX 4080 Super | 16 GB | $950-1,100 | Fast inference for 7B-14B. Less VRAM than 4090 but strong compute. |
| Used RTX A6000 | 48 GB | $1,500-2,000 | Professional GPU, 48 GB VRAM runs 70B at Q4. No NVLink needed. |
| Mac Mini M4 Pro (24 GB) | 24 GB unified | $1,600 | Complete system, quiet, energy efficient, excellent for 7B-14B. |
| Mac Mini M4 Pro (48 GB) | 48 GB unified | $2,000 | Runs 30B-34B models comfortably. 70B at Q3-Q4 with reduced context. |

Recommendation: The RTX 4090 is the fastest consumer GPU for inference. Its combination of 24 GB VRAM, 1 TB/s bandwidth, and massive compute makes it the gold standard for enthusiasts. If you need more VRAM, a used RTX A6000 (48 GB) lets you run 70B models at Q4 without multi-GPU complexity.

For a complete, quiet, energy-efficient system, a Mac Mini M4 Pro with 48 GB is hard to beat. It draws 30W under load (vs 350-450W for a desktop with an RTX 4090), runs silently, and handles 30B models with excellent performance thanks to Apple’s unified memory architecture.

Budget: $2,000+ — Maximum capability

At this tier, you can run 70B+ models at full quality and serve multiple users.

| Setup | VRAM/Memory | Approx. Price | Best For |
|---|---|---|---|
| Dual RTX 3090 | 48 GB total | $1,800-2,200 | 70B at Q4 with good context. Requires NVLink or tensor parallel. |
| Dual RTX 4090 | 48 GB total | $3,600-4,000 | Fastest 70B inference on consumer hardware. |
| Mac Studio M2 Ultra (192 GB) | 192 GB unified | $5,000-6,000 | Runs 120B+ models. 405B at low quant. Silent, efficient. |
| Mac Studio M4 Ultra (256 GB) | 256 GB unified | $7,000-9,000 | Runs 405B at Q4. The most capable single-machine setup. |
| 4x RTX 4090 server | 96 GB total | $8,000-12,000 | High-throughput serving for teams. Multiple concurrent users. |

Recommendation: For individual use, a Mac Studio M2/M4 Ultra with maximum memory provides the simplest path to running 70B+ models. No driver issues, no multi-GPU configuration, no cooling challenges — just plug in and run. For team serving and maximum throughput, a multi-GPU NVIDIA setup with vLLM provides the best tokens-per-second-per-dollar.

How does Apple Silicon compare for local AI?

Apple Silicon deserves special attention because its unified memory architecture gives it a unique advantage for local AI.

What is unified memory?

On a traditional PC, the CPU uses system RAM (DDR4/DDR5) and the GPU uses its own dedicated VRAM (GDDR6X). These are separate memory pools connected by the PCIe bus. If a model does not fit in VRAM, the GPU must stream data over the slower PCIe connection, dramatically reducing speed.

On Apple Silicon, the CPU, GPU, and Neural Engine all share the same pool of unified memory. When you buy a Mac with 32 GB of memory, all 32 GB is available for model weights. There is no separate VRAM and RAM — it is all one fast, shared pool.

Apple Silicon performance comparison

| Chip | Max Memory | Memory Bandwidth | 7B Q4 (tok/s) | 13B Q4 (tok/s) | 70B Q4 Feasible? |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | No |
| M1 Pro | 32 GB | 200 GB/s | 25-30 | 15-20 | No |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 20-28 | Yes (Q2-Q3) |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 30-38 | Yes (Q4) |
| M2 | 24 GB | 100 GB/s | 18-22 | 10-14 | No |
| M2 Pro | 32 GB | 200 GB/s | 28-35 | 16-22 | No |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 22-30 | Yes (Q3-Q4) |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 32-40 | Yes (Q4-Q5) |
| M3 | 24 GB | 100 GB/s | 20-25 | 11-15 | No |
| M3 Pro | 36 GB | 150 GB/s | 25-32 | 15-20 | No |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 24-32 | Yes (Q3-Q4) |
| M4 | 32 GB | 120 GB/s | 22-28 | 12-16 | No |
| M4 Pro | 48 GB | 273 GB/s | 32-40 | 20-26 | Tight (Q2-Q3) |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 30-38 | Yes (Q4) |
| M4 Ultra | 256 GB | 819 GB/s | 55-65 | 35-45 | Yes (Q5+) |

When to choose Apple Silicon vs NVIDIA

Choose Apple Silicon when:

  • You want a quiet, energy-efficient, all-in-one system
  • You need to run models larger than 24 GB (the VRAM limit of consumer NVIDIA GPUs)
  • You value the macOS ecosystem and do not want to deal with Linux/Windows GPU driver issues
  • You want to run 70B+ models on a single machine without multi-GPU complexity
  • Energy cost and noise are concerns (Apple Silicon draws 30-60W vs 300-450W for high-end GPUs)

Choose NVIDIA GPUs when:

  • You need maximum tokens per second for a given model size (NVIDIA GPUs have higher bandwidth per dollar at the low and mid range)
  • You want to use CUDA-optimized tools like vLLM, ExLlamaV2, or GPTQ/AWQ quantization
  • You need multi-GPU scaling for serving multiple concurrent users
  • You are building on Linux and want the deepest ecosystem support
  • You plan to do fine-tuning or training (NVIDIA’s CUDA ecosystem for training is unmatched)

Can you run AI on CPU only?

Yes. CPU-only inference is a legitimate option, especially for smaller models and use cases where speed is not critical.

How CPU inference works

When running on CPU, the model weights are loaded into system RAM (DDR4 or DDR5) and all computation happens on the CPU. The inference engine (typically llama.cpp) uses optimized SIMD instructions (AVX2, AVX-512, ARM NEON) to process matrix multiplications on the CPU.

CPU inference performance

| CPU | RAM Type | 7B Q4 (tok/s) | 13B Q4 (tok/s) | Notes |
|---|---|---|---|---|
| Intel Core i7-13700K | DDR5-5600 | 12-18 | 6-10 | Strong AVX2 throughput (AVX-512 is disabled on consumer Raptor Lake) |
| Intel Core i5-12400 | DDR5-4800 | 8-12 | 4-7 | Budget option, usable |
| AMD Ryzen 7 7800X3D | DDR5-5200 | 10-15 | 5-9 | 3D V-Cache does not help AI workloads directly |
| AMD Ryzen 9 7950X | DDR5-5600 | 14-20 | 7-12 | 16 cores help with prefill |
| Apple M1 (CPU only) | LPDDR5 | 15-20 | 8-12 | Unified memory architecture helps |
| Intel Core Ultra 200S | DDR5-6400 | 14-18 | 7-11 | Latest gen, decent bandwidth |

When CPU-only makes sense

  • You are just getting started and want to try local AI before investing in a GPU
  • Your workload is light — a few queries per day, short prompts, short responses
  • You are running small models (1B-7B) for specific tasks like classification, extraction, or embeddings
  • You have a high-core-count server CPU with large amounts of DDR5 RAM for batch processing
  • You are on a laptop without a discrete GPU and want basic chat capabilities

CPU inference at 10-15 tok/s is comfortably faster than typical reading speed (roughly 4-5 words per second, and a token is about three quarters of a word), so it is perfectly usable for interactive chat. For batch processing where latency does not matter, CPU inference is cost-effective, since system RAM is much cheaper than VRAM per gigabyte.

How much system RAM do you need?

System RAM serves two purposes: running the operating system and applications, and (if using CPU inference or GPU offloading) holding model weights.

| Use Case | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| GPU inference with small models | 16 GB | 32 GB | Model lives in VRAM; RAM just for OS and KV-cache spillover |
| GPU inference with large models | 32 GB | 64 GB | Useful for partial offload — split model between GPU and CPU |
| CPU-only inference (7B models) | 16 GB | 32 GB | Model lives in RAM; need headroom for OS and context |
| CPU-only inference (13B models) | 32 GB | 48 GB | 13B Q4 is ~8 GB plus KV-cache and OS overhead |
| CPU-only inference (30B models) | 48 GB | 64 GB | 30B Q4 is ~19 GB; need room for context and system |
| Apple Silicon (unified memory) | 16 GB | 32-48 GB | All memory is shared; more is always better |
| Multi-GPU server | 64 GB | 128 GB | Need headroom for multiple models and batch processing |

RAM speed matters

For CPU inference, RAM bandwidth directly determines token generation speed. Faster RAM means faster inference:

  • DDR4-3200: ~50 GB/s bandwidth — baseline speed
  • DDR5-4800: ~77 GB/s bandwidth — ~50% faster than DDR4
  • DDR5-5600: ~90 GB/s bandwidth — recommended for CPU inference
  • DDR5-6400: ~100 GB/s bandwidth — best consumer DDR5
  • LPDDR5X (laptops): ~60-90 GB/s — varies significantly by laptop

If you are primarily using GPU inference, RAM speed matters less since the model weights live in VRAM. If you are doing CPU inference or partial GPU offloading, investing in fast DDR5 RAM provides a noticeable speed improvement.
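The bandwidth figures above put a hard ceiling on generation speed: producing each token requires streaming roughly the entire set of weights from memory once, so bandwidth divided by model size bounds tokens per second. A hedged back-of-envelope sketch (real engines typically reach perhaps 50-80% of this ceiling):

```python
# Upper bound on token generation speed for a memory-bandwidth-bound workload.
# Each generated token reads (roughly) all model weights from memory once,
# so bandwidth / model size caps tokens per second. Illustrative only.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

SEVEN_B_Q4_GB = 4.4  # weight size from the table earlier in this guide

for name, bw in [("DDR4-3200", 50), ("DDR5-5600", 90), ("RTX 3090", 936)]:
    ceiling = max_tokens_per_sec(bw, SEVEN_B_Q4_GB)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s for a 7B Q4 model")
```

Plugging in DDR5-5600 gives a ceiling around 20 tok/s, which is consistent with the 12-18 tok/s measured figures in the CPU table above once real-world efficiency is accounted for.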

Why does SSD storage matter for local AI?

Model files are large. A single quantized model ranges from 2 GB (small 3B models) to 50+ GB (large 70B models). When you download multiple models and multiple quantization levels, storage adds up quickly:

| Usage Pattern | Estimated Storage Needs |
|---|---|
| One or two small models | 10-20 GB |
| Several 7B-13B models | 30-80 GB |
| Full collection including 30B-70B models | 200-500 GB |
| Power user with many quants, fine-tunes, merges | 500 GB - 2 TB |

SSD vs HDD for local AI

  • NVMe SSD: Model loading in 2-10 seconds. Instant model switching. This is the recommended option.
  • SATA SSD: Model loading in 5-20 seconds. Acceptable performance.
  • HDD: Model loading in 30-120+ seconds. Painfully slow when switching models. Avoid if possible.

An NVMe SSD does not affect token generation speed (that depends on GPU/CPU and memory bandwidth), but it dramatically improves the experience of loading models, switching between models, and initial startup time. A 1 TB NVMe SSD costs $60-100 and provides ample space for a large model collection.
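Those load times are easy to estimate yourself: load time is roughly file size divided by sustained read throughput. The throughput numbers below are typical sustained figures under assumed conditions, not rated peaks:

```python
# Approximate model load time: file size divided by sustained read speed.
# Throughput values are rough, typical sustained reads (assumptions),
# not manufacturer peak specifications.

THROUGHPUT_GB_S = {"NVMe SSD": 3.0, "SATA SSD": 0.5, "HDD": 0.15}

def load_seconds(model_gb: float, storage: str) -> float:
    return model_gb / THROUGHPUT_GB_S[storage]

# A 7B Q5 model is roughly 5 GB on disk
for storage in THROUGHPUT_GB_S:
    print(f"5 GB model from {storage}: ~{load_seconds(5, storage):.0f} s")
```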

How do multi-GPU setups work?

When a single GPU does not have enough VRAM to hold your target model, you have two options: quantize the model more aggressively to make it fit, or use multiple GPUs.

Tensor parallelism

Tensor parallelism splits the model’s layers across multiple GPUs, with each GPU handling a portion of every layer. This approach is supported by vLLM, llama.cpp (partially), and ExLlamaV2. It requires fast inter-GPU communication for good performance.

  • NVLink: The fastest inter-GPU connection on cards that support it (roughly 112 GB/s on the RTX 3090; data-center GPUs like the A100 and H100 reach 600+ GB/s). Among consumer cards, only the RTX 3090 offers NVLink. Provides near-linear scaling.
  • PCIe 4.0 x16: 32 GB/s per direction. Workable for large models where each GPU does substantial work per communication step. Expect 60-80% of ideal scaling.
  • PCIe 5.0 x16: 64 GB/s per direction. Better scaling than PCIe 4.0, but still well below NVLink.

Pipeline parallelism (layer splitting)

Pipeline parallelism assigns different layers of the model to different GPUs: in an 80-layer model, for example, GPU 1 processes layers 1-40 and GPU 2 processes layers 41-80. This is simpler to implement and is the default approach in llama.cpp’s GPU offloading.

  • Does not require NVLink
  • Works across different GPU models and sizes (you can mix an RTX 3090 and an RTX 3060)
  • Has a “pipeline bubble” — GPUs are idle while waiting for the previous GPU to finish its layers
  • Generally 50-70% efficient compared to single-GPU inference

Practical multi-GPU recommendations

| Setup | Total VRAM | Models It Enables | Estimated Cost |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 30B at Q4, 70B at Q2 | $500-600 (used) |
| 2x RTX 3090 | 48 GB | 70B at Q4, 34B at Q8 | $1,600-1,800 (used) |
| RTX 3090 + RTX 3060 | 36 GB | 70B at Q3, 34B at Q5 | $950-1,200 (used) |
| 2x RTX 4090 | 48 GB | 70B at Q4 with generous context | $3,600-4,000 |
| 4x RTX 4090 | 96 GB | 70B at Q8, 405B at Q2 | $7,000-8,000 |

Key considerations for multi-GPU:

  • Ensure your motherboard has enough PCIe slots at x8 or x16 speed
  • Your power supply must handle the combined TDP (two RTX 3090s draw 700W+ under load)
  • Physical space and cooling — high-end GPUs are large and produce significant heat
  • Check that your inference engine supports your desired parallelism strategy
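A quick way to sanity-check a mixed-GPU build is to apportion the model's memory by each card's VRAM share, which is the idea behind split ratios such as llama.cpp's --tensor-split option. The helper below is an illustrative sketch of that proportional split, not llama.cpp's actual allocator, and it ignores KV-cache and buffer overhead:

```python
# Sketch of proportional layer/memory splitting across mismatched GPUs.
# Mirrors the idea behind llama.cpp's --tensor-split ratios; this is NOT
# the engine's real allocator, and KV-cache/buffer overhead is ignored.

def split_model(model_gb: float, vram_gb: list[float]) -> list[float]:
    """Assign model memory to each GPU in proportion to its VRAM."""
    total = sum(vram_gb)
    return [round(model_gb * v / total, 1) for v in vram_gb]

# RTX 3090 (24 GB) + RTX 3060 (12 GB) hosting a 70B Q3 model (~36 GB)
gpus = [24.0, 12.0]
shares = split_model(36.0, gpus)
print("per-GPU share (GB):", shares)
print("fits:", all(s <= v for s, v in zip(shares, gpus)))
```

The 3090 + 3060 example from the table above lands exactly at each card's capacity, which is why that pairing works only with no headroom to spare; in practice you would reserve a few gigabytes per GPU for the KV-cache.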

What is the complete GPU comparison table?

Here is a head-to-head comparison of the most popular GPUs for local AI:

| GPU | VRAM | Bandwidth | FP16 TFLOPS | TDP | Price (Approx.) | 7B Q4 tok/s | 13B Q4 tok/s | Max Practical Model |
|---|---|---|---|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB | 360 GB/s | 12.7 | 170W | $250-300 (used) | 40-55 | 20-30 | 13B Q4 |
| RTX 3070 8 GB | 8 GB | 448 GB/s | 20.3 | 220W | $250-300 (used) | 50-65 | N/A (VRAM limit) | 7B Q5-Q8 |
| RTX 3090 | 24 GB | 936 GB/s | 35.6 | 350W | $700-900 (used) | 70-90 | 40-55 | 30B Q4, 70B Q2 |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 22.1 | 165W | $400-450 | 45-60 | 25-35 | 14B Q4 |
| RTX 4070 | 12 GB | 504 GB/s | 29.1 | 200W | $500-550 | 55-70 | 30-40 | 13B Q4 |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 44.1 | 285W | $750-850 | 60-80 | 35-50 | 14B Q5, 30B Q2 |
| RTX 4090 | 24 GB | 1,008 GB/s | 82.6 | 450W | $1,600-2,000 | 90-120 | 55-75 | 30B Q4, 70B Q2-Q3 |
| RX 7900 XTX | 24 GB | 960 GB/s | 61.4 | 355W | $850-950 | 60-80 | 35-50 | 30B Q4 (via ROCm) |
| RX 7900 XT | 20 GB | 800 GB/s | 51.6 | 315W | $650-750 | 50-65 | 30-42 | 20B Q4 (via ROCm) |
| M1 (8-core GPU) | 8-16 GB | 68 GB/s | 2.6 | 20W | N/A (system) | 15-20 | 8-12 | 7B Q4 (8 GB) |
| M2 (10-core GPU) | 8-24 GB | 100 GB/s | 3.6 | 22W | N/A (system) | 18-22 | 10-14 | 13B Q4 (24 GB) |
| M3 Pro | 18-36 GB | 150 GB/s | 7.4 | 30W | N/A (system) | 25-32 | 15-20 | 14B Q4 (18 GB) |
| M3 Max | 36-128 GB | 400 GB/s | 14.2 | 40W | N/A (system) | 40-50 | 24-32 | 70B Q3 (128 GB) |
| M4 Pro | 24-48 GB | 273 GB/s | 9.2 | 30W | N/A (system) | 32-40 | 20-26 | 30B Q4 (48 GB) |
| M4 Max | 36-128 GB | 546 GB/s | 18.4 | 40W | N/A (system) | 50-60 | 30-38 | 70B Q4 (128 GB) |
| M4 Ultra | 128-256 GB | 819 GB/s | 36.8 | 60W | N/A (system) | 55-65 | 35-45 | 405B Q3 (256 GB) |

Notes on the table:

  • Tokens per second values are approximate and depend on quantization level, context length, batch size, and inference engine.
  • Apple Silicon entries show GPU core performance with unified memory. TDP values are for the full SoC, not just the GPU.
  • AMD RX 7900 series performance depends heavily on ROCm driver version and inference engine support. Performance continues to improve.
  • “Max Practical Model” assumes Q4_K_M quantization and 4K context unless otherwise noted.

How do you choose the right hardware for your use case?

Here is a decision framework based on what you want to do:

I want to try local AI for the first time

Use your existing hardware. Install Ollama, run ollama run llama3.2 (or phi3 for a smaller model), and see how it performs. If you have any discrete GPU with 6+ GB VRAM, or a Mac with 16+ GB unified memory, or a CPU with 16+ GB DDR5 RAM, you can run a useful model today. Spend $0 and decide later if you want to invest more.

I want a reliable daily-driver AI assistant

Budget option: Used RTX 3060 12 GB ($250-300) to run 7B-13B models. Alternatively, a Mac Mini M4 with 16 GB ($600) for a quiet, simple setup.

Recommended option: Used RTX 3090 ($700-900) or Mac Mini M4 Pro 24 GB ($1,600). These handle 7B-14B models effortlessly and can stretch to 30B models for more capable outputs.

I want the best quality models available locally

Recommended: RTX 4090 ($1,800) for fastest single-GPU inference up to 30B. For 70B models, a Mac Studio with 96+ GB unified memory ($3,500+) or dual RTX 3090s ($1,600-1,800) provide the VRAM headroom you need.

I want to serve AI to a team

Recommended: Multi-GPU NVIDIA setup with vLLM. Two to four RTX 4090s or A6000s behind a vLLM instance can serve 10-50 concurrent users depending on the model size and request patterns. Budget $4,000-12,000 for the GPU hardware plus a workstation or server chassis with adequate power and cooling.

I want to run the largest open-weight models (405B)

Only practical options: Mac Studio M4 Ultra with 256 GB ($9,000), or a multi-GPU server with 4-8 high-VRAM GPUs. The 405B model at Q4 requires approximately 250 GB of memory. This is the domain of dedicated infrastructure, not casual hardware.

What about power consumption and noise?

Power consumption and noise are practical considerations that many hardware guides overlook:

| Setup | Power Draw (Under Load) | Noise Level | Annual Electricity Cost (at $0.12/kWh) |
|---|---|---|---|
| Mac Mini M4 Pro | 30-40W | Silent | ~$15-20 |
| Mac Studio M2 Ultra | 50-80W | Very quiet | ~$25-40 |
| Desktop + RTX 4060 Ti | 250-350W | Moderate | ~$130-180 |
| Desktop + RTX 3090 | 400-550W | Loud | ~$200-290 |
| Desktop + RTX 4090 | 450-600W | Loud | ~$230-310 |
| Desktop + 2x RTX 4090 | 800-1000W | Very loud | ~$410-520 |

Apple Silicon systems are dramatically more power-efficient and quieter than GPU-based desktops. If you plan to run your local AI system 24/7 as a server, or if noise is a concern (home office, bedroom), Apple Silicon or laptop GPUs (which have lower TDP variants) are worth considering despite their lower peak performance.
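The cost column above follows a simple formula: watts, times hours of load per day, times 365, times the electricity rate. The 12 hours/day duty cycle in the sketch below is an assumption chosen because it reproduces the table's ranges, not a measured usage pattern:

```python
# Annual electricity cost for a given sustained load.
# The duty cycle (hours_per_day) is the key variable; 12 h/day is an
# assumption that matches the ranges in the table above, not a measurement.

def annual_cost(watts: float, hours_per_day: float,
                rate_per_kwh: float = 0.12) -> float:
    return watts / 1000 * hours_per_day * 365 * rate_per_kwh

for name, w in [("Mac Mini M4 Pro", 35), ("RTX 4090 desktop", 500)]:
    print(f"{name}: ~${annual_cost(w, 12):.0f}/yr at 12 h/day")
```

Doubling the duty cycle to 24/7 operation roughly doubles these figures, which is worth keeping in mind if you plan to run the machine as an always-on server.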

What about AMD GPUs?

AMD GPUs are a viable option for local AI, though the software ecosystem is less mature than NVIDIA’s.

Pros:

  • The RX 7900 XTX offers 24 GB VRAM at a lower price than the RTX 4090
  • AMD is investing heavily in ROCm (their CUDA equivalent)
  • llama.cpp has solid ROCm support, and performance is improving with each release
  • The RX 7900 XT offers 20 GB VRAM — more than most NVIDIA consumer cards

Cons:

  • ROCm setup is more complex than CUDA — driver installation on Linux can be finicky
  • Not all inference engines support ROCm (vLLM has ROCm support; ExLlamaV2 is CUDA-focused)
  • Fewer community guides and troubleshooting resources
  • Performance per TFLOP is often lower than NVIDIA due to less optimized software stacks
  • Windows support for ROCm is limited; Linux is strongly recommended

Verdict: If you are comfortable with Linux and want maximum VRAM per dollar, AMD’s RX 7900 XTX is a strong choice. If you prefer a smoother, more supported experience, NVIDIA remains the safer bet. The gap is closing, and AMD is a valid choice for llama.cpp and Ollama workflows.

What about Intel GPUs?

Intel’s Arc series (A770 16 GB, A750 8 GB) can run local AI through SYCL support in llama.cpp and Vulkan backends. However, performance is well behind NVIDIA and AMD, driver support is inconsistent, and the community is small. Intel GPUs are not recommended for serious local AI workloads at this time, though they can work in a pinch for 7B models on the Arc A770.

What upgrades provide the biggest performance improvement?

If you already have a system and want to improve local AI performance, here are upgrades ranked by impact:

  1. Add a discrete GPU (or upgrade to one with more VRAM): The single biggest improvement. Going from CPU-only to a 12+ GB GPU typically provides a 3-8x speed increase.
  2. Increase VRAM (upgrade GPU): More VRAM lets you run larger models or use less aggressive quantization, directly improving output quality.
  3. Add more system RAM: If you are CPU-only, more RAM lets you run larger models. If you are doing partial GPU offload, more RAM provides headroom for the CPU-processed layers.
  4. Upgrade to DDR5 (if CPU-only): The bandwidth improvement from DDR4 to DDR5 translates to 30-50% faster CPU inference.
  5. Upgrade to NVMe SSD: Faster model loading and switching. Does not affect inference speed.
  6. Add a second GPU: Enables larger models via layer splitting. Cost-effective if you already have one good GPU.

Summary: what should you buy?

| Budget | Recommended Setup | What It Runs |
|---|---|---|
| $0 | Your existing hardware + Ollama | 1B-7B models on CPU; up to 13B if you have a GPU |
| $300 | Used RTX 3060 12 GB | 7B-13B at Q4, comfortable chat and coding |
| $800 | Used RTX 3090 24 GB | 7B-30B at Q4, 70B at Q2. The best value in local AI. |
| $1,600 | Mac Mini M4 Pro 24 GB or RTX 4090 | 7B-14B (Mac) or 7B-30B (4090), fast and efficient |
| $2,000 | Mac Mini M4 Pro 48 GB or RTX 4090 | 30B-34B (Mac) or 30B at Q4 (4090) with full quality |
| $5,000 | Mac Studio M2 Ultra 192 GB | 70B+ at Q4-Q5, silent, energy efficient |
| $8,000+ | Multi-GPU server or M4 Ultra | 405B models, team serving, maximum capability |

The local AI hardware landscape rewards patience and pragmatism. Start with what you have, benchmark real models, and upgrade when your actual workload demands it. The best hardware for local AI is the hardware that runs the models you need at the speed you find acceptable.


Ready to choose a model for your hardware? Read How to Choose the Right Local LLM for a decision framework based on your use case, or jump to Understanding Quantization to learn how quantization affects quality and performance.

Frequently Asked Questions

How much VRAM do I need to run a local LLM?

It depends on the model size and quantization. A 7B model at Q4 quantization needs about 4-6 GB VRAM. A 13B model needs 8-10 GB. A 70B model needs 40+ GB at Q4, or 24 GB at Q2-Q3. For most users, 8-16 GB VRAM handles the most popular models comfortably.

Can I run AI without a GPU?

Yes. llama.cpp and Ollama both support CPU-only inference using system RAM. A 7B Q4 model runs at 8-15 tokens per second on a modern CPU with DDR5 RAM. It is slower than GPU inference but entirely usable for chat and light workloads.

Is Apple Silicon good for local AI?

Excellent. Apple Silicon's unified memory architecture lets the CPU and GPU share the same memory pool, so a Mac with 32 GB unified memory can run models that would require a 32 GB GPU on a PC. Memory bandwidth on M2/M3/M4 chips ranges from 100-800 GB/s, delivering strong token generation speeds.

What is the best GPU for local AI on a budget?

The used NVIDIA RTX 3090 at around $700-900 offers the best value: 24 GB VRAM handles models up to 30B parameters at Q4 and even 70B at aggressive quantization. For a new card, the RTX 4060 Ti 16 GB ($400-450) is the budget sweet spot.

Do I need an SSD to run local AI?

An SSD is strongly recommended. Model files range from 2-50+ GB, and loading them from a spinning hard drive can take minutes instead of seconds. An NVMe SSD with at least 500 GB of free space provides the best experience.