Running AI locally requires hardware that can load model weights into memory and process them fast enough for interactive use — and the single most important factor is how much memory (RAM or VRAM) your system has. A 7B parameter model at 4-bit quantization needs roughly 4-6 GB of memory, while a 70B model needs 40+ GB. Beyond memory capacity, your inference speed depends on memory bandwidth, compute performance, and storage speed. This guide covers every hardware component that matters, with specific recommendations for every budget from $0 to $2,000+.
The good news: you do not need expensive hardware to get started. If you own a laptop or desktop made after 2020 with 16 GB of RAM, you can already run capable 7B-8B models using CPU inference. If you have a discrete GPU with 8+ GB VRAM or an Apple Silicon Mac, you can run even larger models at comfortable speeds. This guide will help you understand exactly what your current hardware can handle and what to buy if you want more.
What hardware do you need to run AI locally?
Local AI inference has four hardware requirements, in order of importance:
- Memory (VRAM or RAM): The model weights must fit in memory. This is the hard constraint — if you do not have enough memory, the model simply will not load.
- Memory bandwidth: Token generation speed is directly proportional to how fast you can read model weights from memory. More bandwidth means faster responses.
- Compute (CPU/GPU cores): Important for prompt processing (the “prefill” phase). More compute means faster time-to-first-token for long prompts.
- Storage: Models are large files (2-50+ GB). Fast storage means faster load times when switching models.
Understanding these four factors will help you make informed decisions about hardware purchases and model selection.
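The bandwidth factor above can be turned into a useful back-of-envelope estimate: generating one token requires reading essentially every model weight from memory once, so memory bandwidth divided by model size gives a theoretical ceiling on tokens per second. This is a simplified sketch (it ignores compute limits, caching, and KV-cache reads), but real-world numbers land at a consistent fraction of this ceiling:

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Theoretical upper bound on token generation speed.

    Each generated token reads every model weight from memory once,
    so the ceiling is bandwidth / model size. Real throughput is lower.
    """
    return bandwidth_gb_s / model_size_gb

# A 7B model at Q4 has roughly 4.4 GB of weights:
print(max_tokens_per_second(4.4, 936))  # RTX 3090 (936 GB/s): ~213 tok/s ceiling
print(max_tokens_per_second(4.4, 50))   # DDR4-3200 (~50 GB/s): ~11 tok/s ceiling
```

Note how the DDR4 ceiling (~11 tok/s) lines up with the 8-15 tok/s CPU figures quoted elsewhere in this guide: CPU inference really is bandwidth-bound.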
How much VRAM or RAM do you need for each model size?
The amount of memory required depends on two things: the model’s parameter count and the quantization level. Here is a comprehensive reference table:
| Model Size | FP16 (Full Precision) | Q8_0 (8-bit) | Q5_K_M (5-bit) | Q4_K_M (4-bit) | Q3_K_M (3-bit) | Q2_K (2-bit) |
|---|---|---|---|---|---|---|
| 1B | 2 GB | 1.1 GB | 0.8 GB | 0.7 GB | 0.6 GB | 0.5 GB |
| 3B | 6 GB | 3.3 GB | 2.4 GB | 2.1 GB | 1.7 GB | 1.4 GB |
| 7B | 14 GB | 7.7 GB | 5.3 GB | 4.4 GB | 3.6 GB | 3.0 GB |
| 8B | 16 GB | 8.5 GB | 5.8 GB | 4.9 GB | 4.0 GB | 3.3 GB |
| 13B | 26 GB | 14 GB | 9.5 GB | 7.9 GB | 6.4 GB | 5.3 GB |
| 14B | 28 GB | 15 GB | 10 GB | 8.4 GB | 6.9 GB | 5.7 GB |
| 30B-34B | 60-68 GB | 33-37 GB | 23-25 GB | 19-21 GB | 15-17 GB | 12-14 GB |
| 70B | 140 GB | 77 GB | 53 GB | 44 GB | 36 GB | 29 GB |
| 405B | 810 GB | 440 GB | 300 GB | 250 GB | 200 GB | 165 GB |
Important: These numbers represent the model weights only. You also need additional memory for:
- KV-cache: The key-value cache used during inference grows with context length. At 4K context, expect 0.5-2 GB extra. At 32K context, expect 2-8 GB extra, depending on the model architecture.
- Operating system and applications: Reserve at least 2-4 GB for your OS.
- Inference engine overhead: The inference engine itself uses some memory for buffers and processing.
Rule of thumb: Add 2-4 GB to the model weight size for a comfortable estimate of total memory needed. For long-context use (16K+ tokens), add 4-8 GB.
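The rule of thumb can be sketched as a small calculator. One assumption to flag: "4-bit" quants like Q4_K_M actually average closer to 5 effective bits per weight, because some tensors are kept at higher precision — which is why the table above lists 7B Q4_K_M at 4.4 GB rather than 3.5 GB. The overhead constants here are just the midpoints of the ranges given above:

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       long_context: bool = False) -> float:
    """Rough total-memory estimate: weights plus overhead.

    Overhead covers OS, KV-cache, and engine buffers — midpoint of the
    2-4 GB rule of thumb, or of the 4-8 GB range for 16K+ contexts.
    """
    weights_gb = params_billions * bits_per_weight / 8
    overhead_gb = 6.0 if long_context else 3.0
    return weights_gb + overhead_gb

# Q4_K_M averages ~5 effective bits per weight:
print(estimate_memory_gb(7, 5.0))                     # 7B: ~7.4 GB total
print(estimate_memory_gb(70, 5.0, long_context=True)) # 70B, long context: ~49.8 GB
```

This is an estimation aid, not a guarantee — actual usage varies by model architecture and inference engine.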
Which GPU should you buy for local AI?
The GPU is the most impactful hardware upgrade for local AI. GPUs offer 5-20x higher memory bandwidth than system RAM, which translates directly to faster token generation. Here are specific recommendations organized by budget.
Budget: $0 — Use what you already have
If you already have a modern computer, you can start running local AI today at zero cost:
- Any discrete NVIDIA GPU with 6+ GB VRAM: Even an older GTX 1660 Ti (6 GB) or RTX 2060 (6 GB) can run 7B Q4 models.
- Any Apple Silicon Mac: M1, M2, M3, or M4 with 8 GB or more unified memory handles 7B models well.
- Any CPU with 16+ GB RAM: CPU-only inference with llama.cpp or Ollama runs 7B Q4 models at 8-15 tok/s. Slower than GPU, but functional.
- Any AMD GPU with 8+ GB VRAM: ROCm support in llama.cpp enables AMD GPUs, though setup is less polished than NVIDIA’s CUDA.
You do not need to buy anything to try local AI. Install Ollama, download a 7B model, and see what your current hardware can do.
Budget: $300-$800 — The sweet spot for beginners
This range gets you into serious local AI with models up to 13B-14B running comfortably and 30B models at reduced quantization.
| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4060 Ti 16 GB | 16 GB | $400-450 | New card, great for 7B-14B models, energy efficient |
| RTX 3060 12 GB | 12 GB | $250-300 (used) | Entry-level GPU inference, runs 7B-13B at Q4 |
| RTX 3070 8 GB | 8 GB | $250-300 (used) | Fast 7B inference, limited by 8 GB VRAM |
| RTX 3080 10 GB | 10 GB | $350-400 (used) | Strong performance, 7B-13B at Q4 comfortably |
| RX 7800 XT | 16 GB | $450-500 | AMD option, 16 GB VRAM, good ROCm support |
| Used RTX 3090 | 24 GB | $700-900 | The value king — 24 GB handles up to 30B at Q4 |
Recommendation: The used RTX 3090 is the single best value for local AI. Its 24 GB VRAM handles the vast majority of models people actually want to run. At $700-900, it delivers VRAM capacity that matches the $1,800 RTX 4090. The RTX 3090 is slower in raw throughput but the VRAM is what matters most for model compatibility.
If you want a new card and do not need 24 GB, the RTX 4060 Ti 16 GB is an excellent choice. Its 16 GB VRAM runs 7B-14B models with room for generous context windows, and its Ada Lovelace architecture is power-efficient (160W TDP vs 350W for the RTX 3090).
Budget: $800-$2,000 — High-performance local AI
This range unlocks 70B models and near-frontier quality.
| GPU | VRAM | Approx. Price | Best For |
|---|---|---|---|
| RTX 4090 | 24 GB | $1,600-2,000 | The best single consumer GPU. Fastest 7B-30B inference. 70B at Q2-Q3. |
| RTX 4080 Super | 16 GB | $950-1,100 | Fast inference for 7B-14B. Less VRAM than 4090 but strong compute. |
| Used RTX A6000 | 48 GB | $1,500-2,000 | Professional GPU, 48 GB VRAM runs 70B at Q4. No NVLink needed. |
| Mac Mini M4 Pro (24 GB) | 24 GB unified | $1,600 | Complete system, quiet, energy efficient, excellent for 7B-14B. |
| Mac Mini M4 Pro (48 GB) | 48 GB unified | $2,000 | Runs 30B-34B models comfortably. 70B at Q3-Q4 with reduced context. |
Recommendation: The RTX 4090 is the fastest consumer GPU for inference. Its combination of 24 GB VRAM, 1 TB/s bandwidth, and massive compute makes it the gold standard for enthusiasts. If you need more VRAM, a used RTX A6000 (48 GB) lets you run 70B models at Q4 without multi-GPU complexity.
For a complete, quiet, energy-efficient system, a Mac Mini M4 Pro with 48 GB is hard to beat. It draws 30W under load (vs 350-450W for a desktop with an RTX 4090), runs silently, and handles 30B models with excellent performance thanks to Apple’s unified memory architecture.
Budget: $2,000+ — Maximum capability
At this tier, you can run 70B+ models at full quality and serve multiple users.
| Setup | VRAM/Memory | Approx. Price | Best For |
|---|---|---|---|
| Dual RTX 3090 | 48 GB total | $1,800-2,200 | 70B at Q4 with good context. Requires NVLink or tensor parallel. |
| Dual RTX 4090 | 48 GB total | $3,600-4,000 | Fastest 70B inference on consumer hardware. |
| Mac Studio M2 Ultra (192 GB) | 192 GB unified | $5,000-6,000 | Runs 120B+ models. 405B at low quant. Silent, efficient. |
| Mac Studio M4 Ultra (256 GB) | 256 GB unified | $7,000-9,000 | Runs 405B at Q4. The most capable single-machine setup. |
| 4x RTX 4090 server | 96 GB total | $8,000-12,000 | High-throughput serving for teams. Multiple concurrent users. |
Recommendation: For individual use, a Mac Studio M2/M4 Ultra with maximum memory provides the simplest path to running 70B+ models. No driver issues, no multi-GPU configuration, no cooling challenges — just plug in and run. For team serving and maximum throughput, a multi-GPU NVIDIA setup with vLLM provides the best tokens-per-second-per-dollar.
How does Apple Silicon compare for local AI?
Apple Silicon deserves special attention because its unified memory architecture gives it a unique advantage for local AI.
What is unified memory?
On a traditional PC, the CPU uses system RAM (DDR4/DDR5) and the GPU uses its own dedicated VRAM (GDDR6X). These are separate memory pools connected by the PCIe bus. If a model does not fit in VRAM, the GPU must stream data over the slower PCIe connection, dramatically reducing speed.
On Apple Silicon, the CPU, GPU, and Neural Engine all share the same pool of unified memory. When you buy a Mac with 32 GB of memory, all 32 GB is available for model weights. There is no separate VRAM and RAM — it is all one fast, shared pool.
Apple Silicon performance comparison
| Chip | Max Memory | Memory Bandwidth | 7B Q4 (tok/s) | 13B Q4 (tok/s) | 70B Q4 Feasible? |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | No |
| M1 Pro | 32 GB | 200 GB/s | 25-30 | 15-20 | No |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 20-28 | Yes (Q2-Q3) |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 30-38 | Yes (Q4) |
| M2 | 24 GB | 100 GB/s | 18-22 | 10-14 | No |
| M2 Pro | 32 GB | 200 GB/s | 28-35 | 16-22 | No |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 22-30 | Yes (Q3-Q4) |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 32-40 | Yes (Q4-Q5) |
| M3 | 24 GB | 100 GB/s | 20-25 | 11-15 | No |
| M3 Pro | 36 GB | 150 GB/s | 25-32 | 15-20 | No |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 24-32 | Yes (Q3-Q4) |
| M4 | 32 GB | 120 GB/s | 22-28 | 12-16 | No |
| M4 Pro | 48 GB | 273 GB/s | 32-40 | 20-26 | Tight (Q2-Q3) |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 30-38 | Yes (Q4) |
| M4 Ultra | 256 GB | 819 GB/s | 55-65 | 35-45 | Yes (Q5+) |
When to choose Apple Silicon vs NVIDIA
Choose Apple Silicon when:
- You want a quiet, energy-efficient, all-in-one system
- You need to run models larger than 24 GB (the VRAM limit of consumer NVIDIA GPUs)
- You value the macOS ecosystem and do not want to deal with Linux/Windows GPU driver issues
- You want to run 70B+ models on a single machine without multi-GPU complexity
- Energy cost and noise are concerns (Apple Silicon draws 30-60W vs 300-450W for high-end GPUs)
Choose NVIDIA GPUs when:
- You need maximum tokens per second for a given model size (NVIDIA GPUs have higher bandwidth per dollar at the low and mid range)
- You want to use CUDA-optimized tools like vLLM, ExLlamaV2, or GPTQ/AWQ quantization
- You need multi-GPU scaling for serving multiple concurrent users
- You are building on Linux and want the deepest ecosystem support
- You plan to do fine-tuning or training (NVIDIA’s CUDA ecosystem for training is unmatched)
Can you run AI on CPU only?
Yes. CPU-only inference is a legitimate option, especially for smaller models and use cases where speed is not critical.
How CPU inference works
When running on CPU, the model weights are loaded into system RAM (DDR4 or DDR5) and all computation happens on the CPU. The inference engine (typically llama.cpp) uses optimized SIMD instructions (AVX2, AVX-512, ARM NEON) to process matrix multiplications on the CPU.
CPU inference performance
| CPU | RAM Type | 7B Q4 (tok/s) | 13B Q4 (tok/s) | Notes |
|---|---|---|---|---|
| Intel Core i7-13700K | DDR5-5600 | 12-18 | 6-10 | Good AVX-512 support |
| Intel Core i5-12400 | DDR5-4800 | 8-12 | 4-7 | Budget option, usable |
| AMD Ryzen 7 7800X3D | DDR5-5200 | 10-15 | 5-9 | 3D V-Cache does not help AI workloads directly |
| AMD Ryzen 9 7950X | DDR5-5600 | 14-20 | 7-12 | 16 cores help with prefill |
| Apple M1 (CPU only) | LPDDR5 | 15-20 | 8-12 | Unified memory architecture helps |
| Intel Core Ultra 200S | DDR5-6400 | 14-18 | 7-11 | Latest gen, decent bandwidth |
When CPU-only makes sense
- You are just getting started and want to try local AI before investing in a GPU
- Your workload is light — a few queries per day, short prompts, short responses
- You are running small models (1B-7B) for specific tasks like classification, extraction, or embeddings
- You have a high-core-count server CPU with large amounts of DDR5 RAM for batch processing
- You are on a laptop without a discrete GPU and want basic chat capabilities
CPU inference at 10-15 tok/s is actually faster than typical reading speed (roughly 4-5 words per second, with a token averaging about three-quarters of a word), so it is perfectly usable for interactive chat. For batch processing where latency does not matter, CPU inference is cost-effective since system RAM is much cheaper than VRAM per gigabyte.
How much system RAM do you need?
System RAM serves two purposes: running the operating system and applications, and (if using CPU inference or GPU offloading) holding model weights.
| Use Case | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| GPU inference with small models | 16 GB | 32 GB | Model lives in VRAM; RAM just for OS and KV-cache spillover |
| GPU inference with large models | 32 GB | 64 GB | Useful for partial offload — split model between GPU and CPU |
| CPU-only inference (7B models) | 16 GB | 32 GB | Model lives in RAM; need headroom for OS and context |
| CPU-only inference (13B models) | 32 GB | 48 GB | 13B Q4 is ~8 GB plus KV-cache and OS overhead |
| CPU-only inference (30B models) | 48 GB | 64 GB | 30B Q4 is ~19 GB; need room for context and system |
| Apple Silicon (unified memory) | 16 GB | 32-48 GB | All memory is shared; more is always better |
| Multi-GPU server | 64 GB | 128 GB | Need headroom for multiple models and batch processing |
RAM speed matters
For CPU inference, RAM bandwidth directly determines token generation speed. Faster RAM means faster inference:
- DDR4-3200: ~50 GB/s bandwidth — baseline speed
- DDR5-4800: ~77 GB/s bandwidth — ~50% faster than DDR4
- DDR5-5600: ~90 GB/s bandwidth — recommended for CPU inference
- DDR5-6400: ~100 GB/s bandwidth — best consumer DDR5
- LPDDR5X (laptops): ~60-90 GB/s — varies significantly by laptop
If you are primarily using GPU inference, RAM speed matters less since the model weights live in VRAM. If you are doing CPU inference or partial GPU offloading, investing in fast DDR5 RAM provides a noticeable speed improvement.
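The bandwidth figures above follow directly from the memory spec: peak bandwidth is the transfer rate (the number after "DDR5-") times the number of channels times the 8-byte bus width per channel. A quick sketch:

```python
def ram_bandwidth_gb_s(mt_per_s: int, channels: int = 2) -> float:
    """Theoretical peak RAM bandwidth in GB/s.

    transfer rate (MT/s) x channels x 8 bytes per channel, / 1000.
    DDR5 splits each DIMM into two 32-bit subchannels, but the total
    per-DIMM width is still 8 bytes, so the formula is unchanged.
    """
    return mt_per_s * channels * 8 / 1000

print(ram_bandwidth_gb_s(3200))  # DDR4-3200 dual channel: 51.2 GB/s
print(ram_bandwidth_gb_s(5600))  # DDR5-5600 dual channel: 89.6 GB/s
```

This is why high-core-count server platforms with 8 or 12 memory channels are so much faster for CPU inference than desktop parts: the formula scales linearly with channel count.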
Why does SSD storage matter for local AI?
Model files are large. A single quantized model ranges from 2 GB (small 3B models) to 50+ GB (large 70B models). When you download multiple models and multiple quantization levels, storage adds up quickly:
| Usage Pattern | Estimated Storage Needs |
|---|---|
| One or two small models | 10-20 GB |
| Several 7B-13B models | 30-80 GB |
| Full collection including 30B-70B models | 200-500 GB |
| Power user with many quants, fine-tunes, merges | 500 GB - 2 TB |
SSD vs HDD for local AI
- NVMe SSD: Model loading in 2-10 seconds. Instant model switching. This is the recommended option.
- SATA SSD: Model loading in 5-20 seconds. Acceptable performance.
- HDD: Model loading in 30-120+ seconds. Painfully slow when switching models. Avoid if possible.
An NVMe SSD does not affect token generation speed (that depends on GPU/CPU and memory bandwidth), but it dramatically improves the experience of loading models, switching between models, and initial startup time. A 1 TB NVMe SSD costs $60-100 and provides ample space for a large model collection.
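The load-time differences above are straightforward to estimate: best-case load time is file size divided by sequential read speed (real loads are somewhat slower due to deserialization and memory allocation). The drive speeds used here are typical round numbers, not measurements of any specific product:

```python
def load_time_seconds(model_gb: float, read_speed_gb_s: float) -> float:
    """Best-case model load time: file size / sequential read speed."""
    return model_gb / read_speed_gb_s

model_gb = 40  # e.g. a 70B Q4 model
print(load_time_seconds(model_gb, 7.0))   # PCIe 4.0 NVMe (~7 GB/s): ~6 s
print(load_time_seconds(model_gb, 0.55))  # SATA SSD (~550 MB/s): ~73 s
print(load_time_seconds(model_gb, 0.15))  # HDD (~150 MB/s): ~267 s
```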
How do multi-GPU setups work?
When a single GPU does not have enough VRAM to hold your target model, you have two options: quantize the model more aggressively to make it fit, or use multiple GPUs.
Tensor parallelism
Tensor parallelism splits each layer’s weight matrices across multiple GPUs, so every GPU works on a slice of every layer simultaneously. This approach is supported by vLLM, llama.cpp (partially), and ExLlamaV2. It requires fast inter-GPU communication for good performance.
- NVLink: The fastest inter-GPU connection. Data-center GPUs (A100, H100) reach 600+ GB/s; the RTX 3090, the last consumer card to support it, offers roughly 112 GB/s over its NVLink bridge — still far faster than PCIe. Provides near-linear scaling.
- PCIe 4.0 x16: 32 GB/s per direction. Workable for large models where each GPU does substantial work per communication step. Expect 60-80% of ideal scaling.
- PCIe 5.0 x16: 64 GB/s per direction. Better scaling than PCIe 4.0, but still well below NVLink.
Pipeline parallelism (layer splitting)
Pipeline parallelism assigns different layers of the model to different GPUs. GPU 1 processes layers 1-40, GPU 2 processes layers 41-80. This is simpler to implement and is the default approach in llama.cpp’s GPU offloading.
- Does not require NVLink
- Works across different GPU models and sizes (you can mix an RTX 3090 and an RTX 3060)
- Has a “pipeline bubble” — GPUs are idle while waiting for the previous GPU to finish its layers
- Generally 50-70% efficient compared to single-GPU inference
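Pipeline (layer) splitting is simple enough to sketch. The snippet below assigns contiguous layer counts to each GPU in proportion to its VRAM, which is the basic idea behind llama.cpp-style offload splits — a simplified illustration, not llama.cpp's actual allocation logic, which also accounts for per-layer size differences and KV-cache placement:

```python
def split_layers(total_layers: int, vram_per_gpu_gb: list) -> list:
    """Assign contiguous layer counts to GPUs proportional to VRAM.

    Illustrates pipeline (layer) splitting: each GPU hosts a block of
    consecutive layers sized to its share of total VRAM.
    """
    total_vram = sum(vram_per_gpu_gb)
    counts = [round(total_layers * v / total_vram) for v in vram_per_gpu_gb]
    counts[-1] += total_layers - sum(counts)  # absorb rounding drift
    return counts

# An 80-layer 70B model on an RTX 3090 (24 GB) + RTX 3060 (12 GB):
print(split_layers(80, [24, 12]))  # -> [53, 27]
```

This mixed-GPU case is exactly why pipeline splitting is popular: the two cards never need to exchange more than one layer's activations per token, so a fast interconnect is unnecessary.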
Practical multi-GPU recommendations
| Setup | Total VRAM | Models It Enables | Estimated Cost |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 30B at Q4, 70B at Q2 | $500-600 (used) |
| 2x RTX 3090 | 48 GB | 70B at Q4, 34B at Q8 | $1,600-1,800 (used) |
| RTX 3090 + RTX 3060 | 36 GB | 70B at Q3, 34B at Q5 | $950-1,200 (used) |
| 2x RTX 4090 | 48 GB | 70B at Q4 with generous context | $3,600-4,000 |
| 4x RTX 4090 | 96 GB | 70B at Q8, 405B at Q2 | $7,000-8,000 |
Key considerations for multi-GPU:
- Ensure your motherboard has enough PCIe slots at x8 or x16 speed
- Your power supply must handle the combined TDP (two RTX 3090s draw 700W+ under load)
- Physical space and cooling — high-end GPUs are large and produce significant heat
- Check that your inference engine supports your desired parallelism strategy
What is the complete GPU comparison table?
Here is a head-to-head comparison of the most popular GPUs for local AI:
| GPU | VRAM | Bandwidth | FP16 TFLOPS | TDP | Price (Approx.) | 7B Q4 tok/s | 13B Q4 tok/s | Max Practical Model |
|---|---|---|---|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB | 360 GB/s | 12.7 | 170W | $250-300 (used) | 40-55 | 20-30 | 13B Q4 |
| RTX 3070 8 GB | 8 GB | 448 GB/s | 20.3 | 220W | $250-300 (used) | 50-65 | N/A (VRAM limit) | 7B Q5-Q8 |
| RTX 3090 | 24 GB | 936 GB/s | 35.6 | 350W | $700-900 (used) | 70-90 | 40-55 | 30B Q4, 70B Q2 |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 22.1 | 165W | $400-450 | 45-60 | 25-35 | 14B Q4 |
| RTX 4070 | 12 GB | 504 GB/s | 29.1 | 200W | $500-550 | 55-70 | 30-40 | 13B Q4 |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 44.1 | 285W | $750-850 | 60-80 | 35-50 | 14B Q5, 30B Q2 |
| RTX 4090 | 24 GB | 1,008 GB/s | 82.6 | 450W | $1,600-2,000 | 90-120 | 55-75 | 30B Q4, 70B Q2-Q3 |
| RX 7900 XTX | 24 GB | 960 GB/s | 61.4 | 355W | $850-950 | 60-80 | 35-50 | 30B Q4 (via ROCm) |
| RX 7900 XT | 20 GB | 800 GB/s | 51.6 | 315W | $650-750 | 50-65 | 30-42 | 20B Q4 (via ROCm) |
| M1 (8-core GPU) | 8-16 GB | 68 GB/s | 2.6 | 20W | N/A (system) | 15-20 | 8-12 | 7B Q4 (8 GB) |
| M2 (10-core GPU) | 8-24 GB | 100 GB/s | 3.6 | 22W | N/A (system) | 18-22 | 10-14 | 13B Q4 (24 GB) |
| M3 Pro | 18-36 GB | 150 GB/s | 7.4 | 30W | N/A (system) | 25-32 | 15-20 | 14B Q4 (18 GB) |
| M3 Max | 36-128 GB | 400 GB/s | 14.2 | 40W | N/A (system) | 40-50 | 24-32 | 70B Q3 (128 GB) |
| M4 Pro | 24-48 GB | 273 GB/s | 9.2 | 30W | N/A (system) | 32-40 | 20-26 | 30B Q4 (48 GB) |
| M4 Max | 36-128 GB | 546 GB/s | 18.4 | 40W | N/A (system) | 50-60 | 30-38 | 70B Q4 (128 GB) |
| M4 Ultra | 128-256 GB | 819 GB/s | 36.8 | 60W | N/A (system) | 55-65 | 35-45 | 405B Q3 (256 GB) |
Notes on the table:
- Tokens per second values are approximate and depend on quantization level, context length, batch size, and inference engine.
- Apple Silicon entries show GPU core performance with unified memory. TDP values are for the full SoC, not just the GPU.
- AMD RX 7900 series performance depends heavily on ROCm driver version and inference engine support. Performance continues to improve.
- “Max Practical Model” assumes Q4_K_M quantization and 4K context unless otherwise noted.
How do you choose the right hardware for your use case?
Here is a decision framework based on what you want to do:
I want to try local AI for the first time
Use your existing hardware. Install Ollama, run `ollama run llama3.2` (or `ollama run phi3` for a smaller model), and see how it performs. If you have any discrete GPU with 6+ GB VRAM, or a Mac with 16+ GB unified memory, or a CPU with 16+ GB DDR5 RAM, you can run a useful model today. Spend $0 and decide later if you want to invest more.
I want a reliable daily-driver AI assistant
Budget option: Used RTX 3060 12 GB ($250-300) to run 7B-13B models. Alternatively, a Mac Mini M4 with 16 GB ($600) for a quiet, simple setup.
Recommended option: Used RTX 3090 ($700-900) or Mac Mini M4 Pro 24 GB ($1,600). These handle 7B-14B models effortlessly and can stretch to 30B models for more capable outputs.
I want the best quality models available locally
Recommended: RTX 4090 ($1,800) for fastest single-GPU inference up to 30B. For 70B models, a Mac Studio with 96+ GB unified memory ($3,500+) or dual RTX 3090s ($1,600-1,800) provide the VRAM headroom you need.
I want to serve AI to a team
Recommended: Multi-GPU NVIDIA setup with vLLM. Two to four RTX 4090s or A6000s behind a vLLM instance can serve 10-50 concurrent users depending on the model size and request patterns. Budget $4,000-12,000 for the GPU hardware plus a workstation or server chassis with adequate power and cooling.
I want to run the largest open-weight models (405B)
Only practical options: Mac Studio M4 Ultra with 256 GB ($9,000), or a multi-GPU server with 4-8 high-VRAM GPUs. The 405B model at Q4 requires approximately 250 GB of memory. This is the domain of dedicated infrastructure, not casual hardware.
What about power consumption and noise?
Power consumption and noise are practical considerations that many hardware guides overlook:
| Setup | Power Draw (Under Load) | Noise Level | Annual Electricity Cost (12 h/day at $0.12/kWh) |
|---|---|---|---|
| Mac Mini M4 Pro | 30-40W | Silent | ~$15-20 |
| Mac Studio M2 Ultra | 50-80W | Very quiet | ~$25-40 |
| Desktop + RTX 4060 Ti | 250-350W | Moderate | ~$130-180 |
| Desktop + RTX 3090 | 400-550W | Loud | ~$200-290 |
| Desktop + RTX 4090 | 450-600W | Loud | ~$230-310 |
| Desktop + 2x RTX 4090 | 800-1000W | Very loud | ~$410-520 |
Apple Silicon systems are dramatically more power-efficient and quieter than GPU-based desktops. If you plan to run your local AI system 24/7 as a server, or if noise is a concern (home office, bedroom), Apple Silicon or laptop GPUs (which have lower TDP variants) are worth considering despite their lower peak performance.
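The cost column above is easy to reproduce; the figures are consistent with roughly 12 hours of load per day, which is the assumption used in this sketch:

```python
def annual_cost_usd(avg_watts: float, hours_per_day: float = 12,
                    usd_per_kwh: float = 0.12) -> float:
    """Annual electricity cost for a machine under load hours_per_day.

    Assumes ~12 h/day of load, consistent with the table's figures.
    """
    kwh_per_year = avg_watts / 1000 * hours_per_day * 365
    return kwh_per_year * usd_per_kwh

print(round(annual_cost_usd(35)))   # Mac Mini M4 Pro (~35W): ~$18/year
print(round(annual_cost_usd(500)))  # Desktop + RTX 3090 (~500W): ~$263/year
```

Doubling the duty cycle to 24/7 operation doubles these figures, which is worth keeping in mind if you plan to run the machine as an always-on server.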
What about AMD GPUs?
AMD GPUs are a viable option for local AI, though the software ecosystem is less mature than NVIDIA’s.
Pros:
- The RX 7900 XTX offers 24 GB VRAM at a lower price than the RTX 4090
- AMD is investing heavily in ROCm (their CUDA equivalent)
- llama.cpp has solid ROCm support, and performance is improving with each release
- The RX 7900 XT offers 20 GB VRAM — more than most NVIDIA consumer cards
Cons:
- ROCm setup is more complex than CUDA — driver installation on Linux can be finicky
- Not all inference engines support ROCm (vLLM has ROCm support; ExLlamaV2 is CUDA-focused)
- Fewer community guides and troubleshooting resources
- Performance per TFLOP is often lower than NVIDIA due to less optimized software stacks
- Windows support for ROCm is limited; Linux is strongly recommended
Verdict: If you are comfortable with Linux and want maximum VRAM per dollar, AMD’s RX 7900 XTX is a strong choice. If you prefer a smoother, more supported experience, NVIDIA remains the safer bet. The gap is closing, and AMD is a valid choice for llama.cpp and Ollama workflows.
What about Intel GPUs?
Intel’s Arc series (A770 16 GB, A750 8 GB) can run local AI through SYCL support in llama.cpp and Vulkan backends. However, performance is well behind NVIDIA and AMD, driver support is inconsistent, and the community is small. Intel GPUs are not recommended for serious local AI workloads at this time, though they can work in a pinch for 7B models on the Arc A770.
What upgrades provide the biggest performance improvement?
If you already have a system and want to improve local AI performance, here are upgrades ranked by impact:
- Add a discrete GPU (or upgrade to one with more VRAM): The single biggest improvement. Going from CPU-only to a 12+ GB GPU typically provides a 3-8x speed increase.
- Increase VRAM (upgrade GPU): More VRAM lets you run larger models or use less aggressive quantization, directly improving output quality.
- Add more system RAM: If you are CPU-only, more RAM lets you run larger models. If you are doing partial GPU offload, more RAM provides headroom for the CPU-processed layers.
- Upgrade to DDR5 (if CPU-only): The bandwidth improvement from DDR4 to DDR5 translates to 30-50% faster CPU inference.
- Upgrade to NVMe SSD: Faster model loading and switching. Does not affect inference speed.
- Add a second GPU: Enables larger models via layer splitting. Cost-effective if you already have one good GPU.
Summary: what should you buy?
| Budget | Recommended Setup | What It Runs |
|---|---|---|
| $0 | Your existing hardware + Ollama | 1B-7B models on CPU; up to 13B if you have a GPU |
| $300 | Used RTX 3060 12 GB | 7B-13B at Q4, comfortable chat and coding |
| $800 | Used RTX 3090 24 GB | 7B-30B at Q4, 70B at Q2. The best value in local AI. |
| $1,600 | Mac Mini M4 Pro 24 GB or RTX 4090 | 7B-14B (Mac) or 7B-30B (4090), fast and efficient |
| $2,000 | Mac Mini M4 Pro 48 GB or RTX 4090 | 30B-34B (Mac) or 30B at Q4 (4090) with full quality |
| $5,000 | Mac Studio M2 Ultra 192 GB | 70B+ at Q4-Q5, silent, energy efficient |
| $8,000+ | Multi-GPU server or M4 Ultra | 405B models, team serving, maximum capability |
The local AI hardware landscape rewards patience and pragmatism. Start with what you have, benchmark real models, and upgrade when your actual workload demands it. The best hardware for local AI is the hardware that runs the models you need at the speed you find acceptable.
Ready to choose a model for your hardware? Read How to Choose the Right Local LLM for a decision framework based on your use case, or jump to Understanding Quantization to learn how quantization affects quality and performance.