Running Local LLMs on macOS: Apple Silicon Optimization Guide

Complete guide to running local LLMs on macOS with Apple Silicon. Covers Ollama, MLX, LM Studio, unified memory optimization, and model recommendations for M1 through M4 chips.

Apple Silicon Macs are among the best machines for running local LLMs, thanks to their unified memory architecture. Unlike traditional PCs, where the CPU and GPU have separate memory pools, Apple’s M-series chips share a single pool of fast memory between CPU and GPU. This means a MacBook Pro with 64 GB of unified memory can run a 70B parameter model that would require $1,600+ of GPU hardware on a PC. This guide covers everything you need to optimize local AI performance on macOS, from choosing the right tools to getting the most out of your specific chip.

Why Apple Silicon Excels at Local AI

Unified Memory Architecture

On a traditional PC, a discrete GPU has its own VRAM (typically 8-24 GB). Models must fit in that VRAM for full GPU acceleration. Apple Silicon’s unified memory means:

  • The GPU can access all system memory, not just a fixed VRAM pool
  • No data copying between CPU and GPU memory
  • A MacBook with 64 GB unified memory gives the GPU access to all 64 GB
  • Memory bandwidth scales with chip tier (M4 Max: 546 GB/s)

This architecture is why a $2,500 MacBook Pro can run models that require a $4,000+ PC setup.
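The "memory bandwidth scales with chip tier" point matters because token generation is memory-bound: producing each output token streams the entire set of model weights from memory once, so bandwidth sets a hard ceiling on tokens per second. A minimal sketch of that ceiling (model size and bandwidth figures are ballpark assumptions, not benchmarks):

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is roughly 4.9 GB of weights.
model_gb = 4.9

# Approximate memory bandwidth by chip tier (GB/s).
for chip, bw in [("M1", 68), ("M1 Pro", 200), ("M1 Max", 400), ("M4 Max", 546)]:
    print(f"{chip}: ~{max_tokens_per_second(bw, model_gb):.0f} tok/s ceiling")
```

Real throughput lands well below this ceiling (compute, KV-cache reads, overhead), but the scaling explains why each chip tier roughly doubles the one below it in the speed tables.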

Metal GPU Acceleration

Apple’s Metal framework provides GPU acceleration for LLM inference. Both Ollama (via llama.cpp’s Metal backend) and MLX use Metal to accelerate computation on the GPU cores.

Every M-series chip has GPU cores:

  • M1/M2/M3/M4: 8-10 GPU cores
  • M-Pro: 14-18 GPU cores
  • M-Max: 30-40 GPU cores
  • M-Ultra: 60-80 GPU cores

More GPU cores means faster parallel processing and higher tokens-per-second.

Performance by Chip Configuration

Expected Inference Speeds

These are token generation speeds (output tokens/second) for common models at Q4_K_M quantization:

| Chip | Memory | Llama 3.1 8B | Qwen 2.5 14B | Qwen 2.5 32B | Llama 3.1 70B |
|---|---|---|---|---|---|
| M1 | 8 GB | 12 t/s | — | — | — |
| M1 | 16 GB | 18 t/s | 8 t/s | — | — |
| M1 Pro | 16 GB | 28 t/s | 12 t/s | — | — |
| M1 Pro | 32 GB | 30 t/s | 18 t/s | 8 t/s | — |
| M1 Max | 32 GB | 40 t/s | 24 t/s | 10 t/s | — |
| M1 Max | 64 GB | 42 t/s | 28 t/s | 18 t/s | 8 t/s |
| M2 | 24 GB | 22 t/s | 14 t/s | — | — |
| M2 Pro | 32 GB | 32 t/s | 20 t/s | 10 t/s | — |
| M2 Max | 64 GB | 45 t/s | 30 t/s | 20 t/s | 12 t/s |
| M2 Max | 96 GB | 45 t/s | 30 t/s | 22 t/s | 14 t/s |
| M2 Ultra | 128 GB | 55 t/s | 40 t/s | 28 t/s | 22 t/s |
| M3 Pro | 36 GB | 35 t/s | 22 t/s | 10 t/s | — |
| M3 Max | 64 GB | 48 t/s | 32 t/s | 22 t/s | 14 t/s |
| M3 Max | 128 GB | 48 t/s | 34 t/s | 24 t/s | 18 t/s |
| M4 | 32 GB | 28 t/s | 16 t/s | — | — |
| M4 Pro | 48 GB | 42 t/s | 28 t/s | 15 t/s | — |
| M4 Max | 64 GB | 55 t/s | 38 t/s | 26 t/s | 16 t/s |
| M4 Max | 128 GB | 58 t/s | 42 t/s | 30 t/s | 22 t/s |

"—" means the model does not fit in available memory. Speeds are approximate and vary by context length and system load.

Model Recommendations

| Your Config | Best Daily Driver | Stretch Model |
|---|---|---|
| M-series 8 GB | Phi-3 Mini 3.8B, Llama 3.2 3B | Llama 3.1 8B Q3 |
| M-series 16 GB | Llama 3.1 8B, Qwen 2.5 7B | Qwen 2.5 14B Q3 |
| M-Pro 18 GB | Llama 3.1 8B Q6, Qwen 2.5 7B Q6 | Qwen 2.5 14B Q4 |
| M-Pro/Max 32 GB | Qwen 2.5 14B Q5 | Qwen 2.5 32B Q3 |
| M-Max 48 GB | Qwen 2.5 32B Q4 | Llama 3.1 70B Q2 |
| M-Max 64 GB | Qwen 2.5 32B Q6 | Llama 3.1 70B Q4 |
| M-Max/Ultra 96+ GB | Llama 3.1 70B Q4 | Llama 3.1 70B Q6 |
| M-Ultra 128+ GB | Llama 3.1 70B Q6 | Llama 3.1 70B Q8 |
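These recommendations follow from simple arithmetic: quantized weights take roughly params × bits-per-weight ÷ 8 bytes, and macOS needs headroom for the OS, apps, and KV cache. A sketch of that rule of thumb (the bits-per-weight figures and the 75% usable-memory fraction are rough assumptions):

```python
# Approximate bits per weight for common GGUF quantization tiers (rough values).
BITS_PER_WEIGHT = {"Q2": 2.6, "Q3": 3.9, "Q4": 4.8, "Q5": 5.7, "Q6": 6.6, "Q8": 8.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def fits(params_billion: float, quant: str, ram_gb: int,
         usable_fraction: float = 0.75) -> bool:
    # Leave ~25% of unified memory for macOS, apps, and the KV cache.
    return weights_gb(params_billion, quant) <= ram_gb * usable_fraction

print(f"70B Q4 weights: ~{weights_gb(70, 'Q4'):.0f} GB")
print("Fits on a 64 GB Mac:", fits(70, "Q4", 64))
print("Fits on a 32 GB Mac:", fits(70, "Q4", 32))
```

This is why 70B Q4 appears in the 64 GB row but not the 32 GB row: ~42 GB of weights fits under 48 GB of usable memory, but not under 24 GB.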

Step 1: Install Ollama

Ollama is the recommended starting point. It handles everything: model downloads, Metal GPU acceleration, memory management, and API serving.

Installation

Option A: Direct download

Download from ollama.com/download and drag to Applications.

Option B: Homebrew

brew install ollama

Option C: Terminal install script

curl -fsSL https://ollama.com/install.sh | sh

First Run

# Start Ollama (if not running from menu bar)
ollama serve &

# Run your first model
ollama run llama3.1:8b

Ollama automatically uses Metal GPU acceleration on Apple Silicon. No additional configuration needed.

Verify GPU Acceleration

# Run with verbose output
ollama run llama3.1:8b --verbose

# In the output, look for:
# "metal" - confirms Metal GPU is being used
# "eval rate: XX tokens/s" - your generation speed

You can also check Activity Monitor:

  1. Open Activity Monitor
  2. Click the GPU tab (or Window > GPU History)
  3. You should see GPU utilization when generating text
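Beyond the CLI, Ollama serves a local REST API (default port 11434). This stdlib-only sketch sends one request to the `/api/generate` endpoint and derives tokens-per-second from the `eval_count` and `eval_duration` fields in the response; it assumes `llama3.1:8b` has already been pulled and falls back to a message if no server is running:

```python
import json
import urllib.request
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    # "stream": False returns a single JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_request("llama3.1:8b", "Say hello in five words.")

try:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    print(body["response"])
    # eval_duration is in nanoseconds; eval_count is output tokens.
    print(f"{body['eval_count'] / (body['eval_duration'] / 1e9):.1f} tok/s")
except (URLError, OSError):
    print("Ollama is not reachable; start it with `ollama serve`.")
```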

Step 2: Install MLX (Apple-Native Alternative)

MLX is Apple’s machine learning framework, built specifically for Apple Silicon. It can offer better performance than llama.cpp on some models because it’s optimized for the Metal compute architecture.

Installation

# Install MLX LM (the language model package)
pip3 install mlx-lm

Running Models

# Generate text
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain quantum computing simply" \
  --max-tokens 500

# Interactive chat
mlx_lm.chat --model mlx-community/Llama-3.1-8B-Instruct-4bit

MLX Server (OpenAI-Compatible API)

# Start an API server
mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080

# Use with any OpenAI-compatible client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.1-8B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

MLX vs. Ollama

| Feature | Ollama | MLX |
|---|---|---|
| Ease of use | Easier (one command) | Requires Python |
| Model management | Built-in library | Manual or HF download |
| Performance | Very good (Metal) | Excellent (native Metal) |
| API compatibility | OpenAI-compatible | OpenAI-compatible |
| Ecosystem support | Broad (Open WebUI, Continue, etc.) | Growing |
| Fine-tuning | No | Yes (MLX supports training) |
| Custom formats | GGUF via Modelfile | MLX format |

Recommendation: Start with Ollama for its simplicity and ecosystem. Try MLX if you want to squeeze out extra performance or do fine-tuning natively on Apple Silicon.

Step 3: Install LM Studio (GUI Option)

LM Studio provides a graphical interface for browsing, downloading, and chatting with models.

  1. Download from lmstudio.ai
  2. Drag to Applications
  3. Launch and browse the model catalog
  4. Download a model and start chatting

LM Studio is excellent for exploring different models without the command line.

Optimization Tips for Apple Silicon

Memory Management

Apple Silicon shares memory between the OS, apps, and your model. To maximize available memory for AI:

# Check current memory usage
vm_stat | head -10

# See memory pressure
memory_pressure

Free up memory before running large models:

  • Close browser tabs (each Chrome tab uses 100-300 MB)
  • Quit unused applications
  • Close Electron apps (Slack, Discord, VS Code each use 500MB+)

Monitor during inference:

  • Open Activity Monitor > Memory tab
  • Watch “Memory Pressure” — green is good, yellow means pressure, red means swapping
  • If you see heavy swap, your model is too large
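`vm_stat` reports pages rather than bytes, which makes it hard to eyeball. A small parser converts its output to GB; the sample below is a captured, abridged output, so on a real Mac you would feed in the actual command output instead:

```python
import re

def free_gb(vm_stat_output: str) -> float:
    """Convert vm_stat's page counts to GB of free + inactive memory."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    gb = 0.0
    for label in ("Pages free", "Pages inactive"):
        pages = int(re.search(rf"{label}:\s+(\d+)\.", vm_stat_output).group(1))
        gb += pages * page_size / 1e9
    return gb

# Abridged sample of vm_stat output from an Apple Silicon Mac (16 KB pages).
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              120000.
Pages active:                            800000.
Pages inactive:                          300000.
"""
print(f"~{free_gb(sample):.1f} GB free + inactive")
```

Inactive pages count as reclaimable: macOS will hand them back when a model load needs them.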

Optimal Ollama Settings

# By default, macOS caps GPU-wired memory at roughly 70-75% of unified memory.
# On a 64 GB Mac you can raise that cap so larger models fit on the GPU
# (value in MB; resets on reboot):
sudo sysctl iogpu.wired_limit_mb=57344

# Keep models loaded longer (default 5 minutes)
export OLLAMA_KEEP_ALIVE=30m

# Set context size (larger = more memory used)
# Default is 2048. Increase it from inside an interactive session:
ollama run llama3.1:8b
>>> /set parameter num_ctx 4096
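Raising num_ctx costs memory because the KV cache grows linearly with context length. A sketch of the cost for Llama 3.1 8B, using figures from its public model config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_gb(num_ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Memory for keys + values across all layers at a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return num_ctx * per_token / 1e9

for ctx in (2048, 4096, 32768):
    print(f"num_ctx={ctx}: ~{kv_cache_gb(ctx):.2f} GB KV cache")
```

At the default 2048 context the cache is small change next to ~4.9 GB of weights, but a 32K context adds several GB, which is why long contexts can push an otherwise comfortable model into swap.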

Using Modelfiles for Optimization

Create optimized model configurations:

cat > ~/Modelfile-optimized << 'EOF'
FROM llama3.1:8b

# Adjust context window (lower = less memory)
PARAMETER num_ctx 4096

# Adjust batch size for prompt processing
PARAMETER num_batch 512

# Temperature for more consistent output
PARAMETER temperature 0.7

# System prompt
SYSTEM "You are a helpful assistant. Be concise and accurate."
EOF

ollama create llama3.1-optimized -f ~/Modelfile-optimized
ollama run llama3.1-optimized

SSD Performance

Models are loaded from SSD into memory. Apple’s internal NVMe SSDs are fast (up to 7.4 GB/s on M4), so model loading is quick. If you’re using an external drive:

  • Thunderbolt SSD: Fast enough, minimal impact
  • USB-C SSD: Acceptable, slightly slower model loading
  • USB-A or HDD: Too slow, not recommended

Store models on your internal SSD when possible:

# Default Ollama model location
ls ~/.ollama/models/

# Check size
du -sh ~/.ollama/models/
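Load time is roughly file size divided by sequential read speed, which is why the drive tiers above matter mostly for large models. A quick estimate (the drive speeds are ballpark assumptions):

```python
def load_seconds(model_gb: float, read_gb_s: float) -> float:
    """Rough cold-load time for a model file at a given sequential read speed."""
    return model_gb / read_gb_s

model = 42  # e.g. a 70B Q4 GGUF, roughly 42 GB
for drive, speed in [("Internal NVMe (M4)", 7.4), ("Thunderbolt SSD", 2.8),
                     ("USB-C SSD", 1.0), ("USB HDD", 0.15)]:
    print(f"{drive}: ~{load_seconds(model, speed):.0f} s")
```

A few seconds versus nearly five minutes per cold load is the practical difference between internal NVMe and a spinning disk.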

Running Specific Workloads

Code Assistant on Mac

# Install a code-optimized model
ollama pull qwen2.5-coder:7b

# Use with Continue (VS Code extension)
# Install Continue from VS Code marketplace
# Configure in ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen 2.5 Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }],
  "tabAutocompleteModel": {
    "title": "Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Document Q&A (RAG) on Mac

# Install models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Install RAG dependencies
pip3 install llama-index llama-index-llms-ollama \
  llama-index-embeddings-ollama chromadb

Image Generation on Mac

Stable Diffusion and FLUX run well on Apple Silicon via:

  • DiffusionBee: Native macOS app for Stable Diffusion
  • Draw Things: iOS/macOS app with excellent Apple Silicon support
  • ComfyUI: Python-based with MPS (Metal Performance Shaders) support

# ComfyUI on Mac
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip3 install -r requirements.txt
python3 main.py --force-fp16

Multimodal (Vision) Models

# Run a vision model
ollama run llama3.2-vision:11b

# Then provide an image path in the chat
>>> Describe this image: /path/to/photo.jpg

Web Interface Setup

Open WebUI via Docker

# Install Docker Desktop for Mac from docker.com
# Then run:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000.

Open WebUI via pip (No Docker)

pip3 install open-webui
open-webui serve

Troubleshooting macOS

Model Runs Slowly

# Check if Metal is being used
ollama run llama3.1:8b --verbose
# Look for "metal" in output

# Check memory pressure
memory_pressure
# If "System-wide memory free percentage: X%"
# is below 10%, your model is too large

# Check for swap usage
sysctl vm.swapusage
# If swap is being used heavily, use a smaller model

Ollama Not Starting

# Check if already running
pgrep ollama

# Kill existing process
pkill ollama

# Start fresh
ollama serve

# Check logs
cat ~/.ollama/logs/server.log

"Killed" or Crash During Model Load

This usually means the model is too large for available memory and macOS killed the process.

# Check system log
log show --predicate 'eventMessage contains "ollama"' --last 5m

# Use a smaller model or lower quantization
ollama run llama3.1:8b-instruct-q3_K_M

Slow Prompt Processing

Prompt processing (the initial “thinking” before generating) is slower than generation because it processes all input tokens at once. This is normal. Tips:

  • Keep prompts concise
  • Reduce context length if you don’t need long conversations
  • Close other GPU-intensive apps

Power and Battery Considerations

Running LLMs uses significant power on laptops:

| Activity | Power Draw | Battery Impact |
|---|---|---|
| Idle (model loaded) | 5-10 W | Minimal |
| Generating (8B) | 20-35 W | Moderate |
| Generating (32B+) | 35-55 W | High |
| Prompt processing | 40-60 W | High (brief) |
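Runtime on battery is roughly capacity divided by draw. A quick estimate, assuming the ~100 Wh battery of a 16-inch MacBook Pro (smaller machines carry less) and mid-range draws from the table above:

```python
def runtime_hours(battery_wh: float, draw_watts: float) -> float:
    """Rough continuous runtime at a constant power draw."""
    return battery_wh / draw_watts

battery = 100  # 16-inch MacBook Pro is ~100 Wh
for activity, watts in [("Idle, model loaded", 8), ("Generating (8B)", 28),
                        ("Generating (32B+)", 45)]:
    print(f"{activity}: ~{runtime_hours(battery, watts):.1f} h")
```

Real sessions mix idle and generation, so actual life falls between the idle and sustained-generation figures.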

Tips for battery life:

  • Use smaller models when on battery
  • Set OLLAMA_KEEP_ALIVE=5m to unload models quickly when idle
  • Plug in for extended inference sessions
  • Use a model with lower quantization (Q3 instead of Q5) on battery

Advanced: Building llama.cpp from Source

For maximum performance, build llama.cpp with Apple Silicon optimizations:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Run server
./build/bin/llama-server \
  -m /path/to/model.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080

Next Steps

Your Mac is now a local AI machine. From here, explore more models in the Ollama library, wire up the code-assistant and RAG workflows above, or add a web interface.

Frequently Asked Questions

Can I run local LLMs on an Intel Mac?

Yes, but with significant limitations. Intel Macs run models using CPU-only inference, which is slow. A 2019 MacBook Pro with 16 GB RAM can run a 7B Q4 model at about 3-5 tokens per second. Apple Silicon Macs are dramatically faster because they use the GPU cores through Metal acceleration and benefit from unified memory architecture.

How much unified memory do I need for local AI on a Mac?

8 GB runs small 1-3B models. 16 GB comfortably handles 7-8B models and is the minimum recommended configuration. 32 GB unlocks 14B-32B models. 64 GB runs 70B models. 128 GB or more lets you run 70B models at high quantization or experiment with very large models.

Should I use Ollama or MLX on Apple Silicon?

Use Ollama if you want simplicity and compatibility with the broader ecosystem (Open WebUI, Continue, API tools). Use MLX if you want the latest Apple-optimized performance, want to experiment with Apple's ML framework, or are developing ML applications. Both use Metal GPU acceleration. Ollama is built on llama.cpp; MLX is Apple's native framework.

Why is my Mac using swap memory during inference?

Your model is too large for available unified memory. macOS will swap to disk, which dramatically slows inference. Check Activity Monitor > Memory > Memory Pressure. If it's yellow or red, use a smaller model or lower quantization. Close memory-hungry applications (browsers with many tabs, Electron apps) to free memory.