Running Local LLMs on macOS: Apple Silicon Optimization Guide

Complete guide to running local LLMs on macOS with Apple Silicon. Covers Ollama, MLX, LM Studio, unified memory optimization, and model recommendations for M1 through M4 chips.

Apple Silicon Macs are among the best machines for running local LLMs, thanks to their unified memory architecture. Unlike traditional PCs, where the CPU and GPU have separate memory pools, Apple’s M-series chips share a single pool of fast memory between CPU and GPU. This means a MacBook Pro with 64 GB of unified memory can run a 70B parameter model that would require $1,600+ of GPU hardware on a PC. This guide covers everything you need to optimize local AI performance on macOS, from choosing the right tools to getting the most out of your specific chip.

Why Apple Silicon Excels at Local AI

Unified Memory Architecture

On a traditional PC, a discrete GPU has its own VRAM (typically 8-24 GB). Models must fit in that VRAM for full GPU acceleration. Apple Silicon’s unified memory means:

  • The GPU can access all system memory, not just a fixed VRAM pool
  • No data copying between CPU and GPU memory
  • A MacBook with 64 GB unified memory gives the GPU access to all 64 GB
  • Memory bandwidth scales with chip tier (M4 Max: 546 GB/s)

This architecture is why a $2,500 MacBook Pro can run models that require a $4,000+ PC setup.
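The "memory bandwidth scales with chip tier" point matters because token generation is memory-bound: producing each output token streams the entire set of model weights from memory once, so bandwidth sets a hard ceiling on tokens per second. A minimal sketch of that ceiling (model size and bandwidth figures are ballpark assumptions, not benchmarks):

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is roughly 4.9 GB of weights.
model_gb = 4.9

# Approximate memory bandwidth by chip tier (GB/s).
for chip, bw in [("M1", 68), ("M1 Pro", 200), ("M1 Max", 400), ("M4 Max", 546)]:
    print(f"{chip}: ~{max_tokens_per_second(bw, model_gb):.0f} tok/s ceiling")
```

Real throughput lands well below this ceiling (compute, KV-cache reads, overhead), but the scaling explains why each chip tier roughly doubles the one below it in the speed tables.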

Metal GPU Acceleration

Apple’s Metal framework provides GPU acceleration for LLM inference. Both Ollama (via llama.cpp’s Metal backend) and MLX use Metal to accelerate computation on the GPU cores.

Every M-series chip has GPU cores:

  • M1/M2/M3/M4: 8-10 GPU cores
  • M-Pro: 14-18 GPU cores
  • M-Max: 30-40 GPU cores
  • M-Ultra: 60-80 GPU cores

More GPU cores means faster parallel processing and higher tokens-per-second.

Performance by Chip Configuration

Expected Inference Speeds

These are token generation speeds (output tokens/second) for common models at Q4_K_M quantization:

| Chip | Memory | Llama 3.1 8B | Qwen 2.5 14B | Qwen 2.5 32B | Llama 3.1 70B |
|---|---|---|---|---|---|
| M1 | 8 GB | 12 t/s | — | — | — |
| M1 | 16 GB | 18 t/s | 8 t/s | — | — |
| M1 Pro | 16 GB | 28 t/s | 12 t/s | — | — |
| M1 Pro | 32 GB | 30 t/s | 18 t/s | 8 t/s | — |
| M1 Max | 32 GB | 40 t/s | 24 t/s | 10 t/s | — |
| M1 Max | 64 GB | 42 t/s | 28 t/s | 18 t/s | 8 t/s |
| M2 | 24 GB | 22 t/s | 14 t/s | — | — |
| M2 Pro | 32 GB | 32 t/s | 20 t/s | 10 t/s | — |
| M2 Max | 64 GB | 45 t/s | 30 t/s | 20 t/s | 12 t/s |
| M2 Max | 96 GB | 45 t/s | 30 t/s | 22 t/s | 14 t/s |
| M2 Ultra | 128 GB | 55 t/s | 40 t/s | 28 t/s | 22 t/s |
| M3 Pro | 36 GB | 35 t/s | 22 t/s | 10 t/s | — |
| M3 Max | 64 GB | 48 t/s | 32 t/s | 22 t/s | 14 t/s |
| M3 Max | 128 GB | 48 t/s | 34 t/s | 24 t/s | 18 t/s |
| M4 | 32 GB | 28 t/s | 16 t/s | — | — |
| M4 Pro | 48 GB | 42 t/s | 28 t/s | 15 t/s | — |
| M4 Max | 64 GB | 55 t/s | 38 t/s | 26 t/s | 16 t/s |
| M4 Max | 128 GB | 58 t/s | 42 t/s | 30 t/s | 22 t/s |

"—" means the model does not fit in available memory. Speeds are approximate and vary by context length and system load.

Model Recommendations

| Your Config | Best Daily Driver | Stretch Model |
|---|---|---|
| M-series 8 GB | Phi-3 Mini 3.8B, Llama 3.2 3B | Llama 3.1 8B Q3 |
| M-series 16 GB | Llama 3.1 8B, Qwen 2.5 7B | Qwen 2.5 14B Q3 |
| M-Pro 18 GB | Llama 3.1 8B Q6, Qwen 2.5 7B Q6 | Qwen 2.5 14B Q4 |
| M-Pro/Max 32 GB | Qwen 2.5 14B Q5 | Qwen 2.5 32B Q3 |
| M-Max 48 GB | Qwen 2.5 32B Q4 | Llama 3.1 70B Q2 |
| M-Max 64 GB | Qwen 2.5 32B Q6 | Llama 3.1 70B Q4 |
| M-Max/Ultra 96+ GB | Llama 3.1 70B Q4 | Llama 3.1 70B Q6 |
| M-Ultra 128+ GB | Llama 3.1 70B Q6 | Llama 3.1 70B Q8 |
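These recommendations follow from simple arithmetic: quantized weights take roughly params × bits-per-weight ÷ 8 bytes, and macOS needs headroom for the OS, apps, and KV cache. A sketch of that rule of thumb (the bits-per-weight figures and the 75% usable-memory fraction are rough assumptions):

```python
# Approximate bits per weight for common GGUF quantization tiers (rough values).
BITS_PER_WEIGHT = {"Q2": 2.6, "Q3": 3.9, "Q4": 4.8, "Q5": 5.7, "Q6": 6.6, "Q8": 8.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def fits(params_billion: float, quant: str, ram_gb: int,
         usable_fraction: float = 0.75) -> bool:
    # Leave ~25% of unified memory for macOS, apps, and the KV cache.
    return weights_gb(params_billion, quant) <= ram_gb * usable_fraction

print(f"70B Q4 weights: ~{weights_gb(70, 'Q4'):.0f} GB")
print("Fits on a 64 GB Mac:", fits(70, "Q4", 64))
print("Fits on a 32 GB Mac:", fits(70, "Q4", 32))
```

This is why 70B Q4 appears in the 64 GB row but not the 32 GB row: ~42 GB of weights fits under 48 GB of usable memory, but not under 24 GB.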

Step 1: Install Ollama

Ollama is the recommended starting point. It handles everything: model downloads, Metal GPU acceleration, memory management, and API serving.

Installation

Option A: Direct download

Download from ollama.com/download and drag to Applications.

Option B: Homebrew

brew install ollama

Option C: Terminal install script

curl -fsSL https://ollama.com/install.sh | sh

First Run

# Start Ollama (if not running from menu bar)
ollama serve &

# Run your first model
ollama run llama3.1:8b

Ollama automatically uses Metal GPU acceleration on Apple Silicon. No additional configuration needed.

Verify GPU Acceleration

# Run with verbose output
ollama run llama3.1:8b --verbose

# In the output, look for:
# "metal" - confirms Metal GPU is being used
# "eval rate: XX tokens/s" - your generation speed

You can also check Activity Monitor:

  1. Open Activity Monitor
  2. Click the GPU tab (or Window > GPU History)
  3. You should see GPU utilization when generating text
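Beyond the CLI, Ollama serves a local REST API (default port 11434). This stdlib-only sketch sends one request to the `/api/generate` endpoint and derives tokens-per-second from the `eval_count` and `eval_duration` fields in the response; it assumes `llama3.1:8b` has already been pulled and falls back to a message if no server is running:

```python
import json
import urllib.request
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    # "stream": False returns a single JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_request("llama3.1:8b", "Say hello in five words.")

try:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    print(body["response"])
    # eval_duration is in nanoseconds; eval_count is output tokens.
    print(f"{body['eval_count'] / (body['eval_duration'] / 1e9):.1f} tok/s")
except (URLError, OSError):
    print("Ollama is not reachable; start it with `ollama serve`.")
```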

Step 2: Install MLX (Apple-Native Alternative)

MLX is Apple’s machine learning framework, built specifically for Apple Silicon. It can offer better performance than llama.cpp on some models because it’s optimized for the Metal compute architecture.

Installation

# Install MLX LM (the language model package)
pip3 install mlx-lm

Running Models

# Generate text
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain quantum computing simply" \
  --max-tokens 500

# Interactive chat
mlx_lm.chat --model mlx-community/Llama-3.1-8B-Instruct-4bit

MLX Server (OpenAI-Compatible API)

# Start an API server
mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080

# Use with any OpenAI-compatible client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.1-8B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

MLX vs. Ollama

| Feature | Ollama | MLX |
|---|---|---|
| Ease of use | Easier (one command) | Requires Python |
| Model management | Built-in library | Manual or HF download |
| Performance | Very good (Metal) | Excellent (native Metal) |
| API compatibility | OpenAI-compatible | OpenAI-compatible |
| Ecosystem support | Broad (Open WebUI, Continue, etc.) | Growing |
| Fine-tuning | No | Yes (MLX supports training) |
| Custom formats | GGUF via Modelfile | MLX format |

Recommendation: Start with Ollama for its simplicity and ecosystem. Try MLX if you want to squeeze out extra performance or do fine-tuning natively on Apple Silicon.

Step 3: Install LM Studio (GUI Option)

LM Studio provides a graphical interface for browsing, downloading, and chatting with models.

  1. Download from lmstudio.ai
  2. Drag to Applications
  3. Launch and browse the model catalog
  4. Download a model and start chatting

LM Studio is excellent for exploring different models without the command line.

Optimization Tips for Apple Silicon

Memory Management

Apple Silicon shares memory between the OS, apps, and your model. To maximize available memory for AI:

# Check current memory usage
vm_stat | head -10

# See memory pressure
memory_pressure

Free up memory before running large models:

  • Close browser tabs (each Chrome tab uses 100-300 MB)
  • Quit unused applications
  • Close Electron apps (Slack, Discord, VS Code each use 500MB+)

Monitor during inference:

  • Open Activity Monitor > Memory tab
  • Watch “Memory Pressure” — green is good, yellow means pressure, red means swapping
  • If you see heavy swap, your model is too large
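`vm_stat` reports pages rather than bytes, which makes it hard to eyeball. A small parser converts its output to GB; the sample below is a captured, abridged output, so on a real Mac you would feed in the actual command output instead:

```python
import re

def free_gb(vm_stat_output: str) -> float:
    """Convert vm_stat's page counts to GB of free + inactive memory."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    gb = 0.0
    for label in ("Pages free", "Pages inactive"):
        pages = int(re.search(rf"{label}:\s+(\d+)\.", vm_stat_output).group(1))
        gb += pages * page_size / 1e9
    return gb

# Abridged sample of vm_stat output from an Apple Silicon Mac (16 KB pages).
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              120000.
Pages active:                            800000.
Pages inactive:                          300000.
"""
print(f"~{free_gb(sample):.1f} GB free + inactive")
```

Inactive pages count as reclaimable: macOS will hand them back when a model load needs them.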

Optimal Ollama Settings

# By default, macOS caps GPU-wired memory at roughly 70-75% of unified memory.
# On a 64 GB Mac you can raise that cap so larger models fit on the GPU
# (value in MB; resets on reboot):
sudo sysctl iogpu.wired_limit_mb=57344

# Keep models loaded longer (default 5 minutes)
export OLLAMA_KEEP_ALIVE=30m

# Set context size (larger = more memory used)
# Default is 2048. Increase it from inside an interactive session:
ollama run llama3.1:8b
>>> /set parameter num_ctx 4096
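Raising num_ctx costs memory because the KV cache grows linearly with context length. A sketch of the cost for Llama 3.1 8B, using figures from its public model config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_gb(num_ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Memory for keys + values across all layers at a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return num_ctx * per_token / 1e9

for ctx in (2048, 4096, 32768):
    print(f"num_ctx={ctx}: ~{kv_cache_gb(ctx):.2f} GB KV cache")
```

At the default 2048 context the cache is small change next to ~4.9 GB of weights, but a 32K context adds several GB, which is why long contexts can push an otherwise comfortable model into swap.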

Using Modelfiles for Optimization

Create optimized model configurations:

cat > ~/Modelfile-optimized << 'EOF'
FROM llama3.1:8b

# Adjust context window (lower = less memory)
PARAMETER num_ctx 4096

# Adjust batch size for prompt processing
PARAMETER num_batch 512

# Temperature for more consistent output
PARAMETER temperature 0.7

# System prompt
SYSTEM "You are a helpful assistant. Be concise and accurate."
EOF

ollama create llama3.1-optimized -f ~/Modelfile-optimized
ollama run llama3.1-optimized

SSD Performance

Models are loaded from SSD into memory. Apple’s internal NVMe SSDs are fast (up to 7.4 GB/s on M4), so model loading is quick. If you’re using an external drive:

  • Thunderbolt SSD: Fast enough, minimal impact
  • USB-C SSD: Acceptable, slightly slower model loading
  • USB-A or HDD: Too slow, not recommended

Store models on your internal SSD when possible:

# Default Ollama model location
ls ~/.ollama/models/

# Check size
du -sh ~/.ollama/models/
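Load time is roughly file size divided by sequential read speed, which is why the drive tiers above matter mostly for large models. A quick estimate (the drive speeds are ballpark assumptions):

```python
def load_seconds(model_gb: float, read_gb_s: float) -> float:
    """Rough cold-load time for a model file at a given sequential read speed."""
    return model_gb / read_gb_s

model = 42  # e.g. a 70B Q4 GGUF, roughly 42 GB
for drive, speed in [("Internal NVMe (M4)", 7.4), ("Thunderbolt SSD", 2.8),
                     ("USB-C SSD", 1.0), ("USB HDD", 0.15)]:
    print(f"{drive}: ~{load_seconds(model, speed):.0f} s")
```

A few seconds versus nearly five minutes per cold load is the practical difference between internal NVMe and a spinning disk.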

Running Specific Workloads

Code Assistant on Mac

# Install a code-optimized model
ollama pull qwen2.5-coder:7b

# Use with Continue (VS Code extension)
# Install Continue from VS Code marketplace
# Configure in ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen 2.5 Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }],
  "tabAutocompleteModel": {
    "title": "Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Document Q&A (RAG) on Mac

# Install models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Install RAG dependencies
pip3 install llama-index llama-index-llms-ollama \
  llama-index-embeddings-ollama chromadb

Image Generation on Mac

Stable Diffusion and FLUX run well on Apple Silicon via:

  • DiffusionBee: Native macOS app for Stable Diffusion
  • Draw Things: iOS/macOS app with excellent Apple Silicon support
  • ComfyUI: Python-based with MPS (Metal Performance Shaders) support

# ComfyUI on Mac
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip3 install -r requirements.txt
python3 main.py --force-fp16

Multimodal (Vision) Models

# Run a vision model
ollama run llama3.2-vision:11b

# Then provide an image path in the chat
>>> Describe this image: /path/to/photo.jpg

Web Interface Setup

Open WebUI via Docker

# Install Docker Desktop for Mac from docker.com
# Then run:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000.

Open WebUI via pip (No Docker)

pip3 install open-webui
open-webui serve

Troubleshooting macOS

Model Runs Slowly

# Check if Metal is being used
ollama run llama3.1:8b --verbose
# Look for "metal" in output

# Check memory pressure
memory_pressure
# If "System-wide memory free percentage: X%"
# is below 10%, your model is too large

# Check for swap usage
sysctl vm.swapusage
# If swap is being used heavily, use a smaller model

Ollama Not Starting

# Check if already running
pgrep ollama

# Kill existing process
pkill ollama

# Start fresh
ollama serve

# Check logs
cat ~/.ollama/logs/server.log

"Killed" or Crash During Model Load

This usually means the model is too large for available memory and macOS killed the process.

# Check system log
log show --predicate 'eventMessage contains "ollama"' --last 5m

# Use a smaller model or lower quantization
ollama run llama3.1:8b-instruct-q3_K_M

Slow Prompt Processing

Prompt processing (the initial “thinking” before generating) is slower than generation because it processes all input tokens at once. This is normal. Tips:

  • Keep prompts concise
  • Reduce context length if you don’t need long conversations
  • Close other GPU-intensive apps

Power and Battery Considerations

Running LLMs uses significant power on laptops:

| Activity | Power Draw | Battery Impact |
|---|---|---|
| Idle (model loaded) | 5-10 W | Minimal |
| Generating (8B) | 20-35 W | Moderate |
| Generating (32B+) | 35-55 W | High |
| Prompt processing | 40-60 W | High (brief) |
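Runtime on battery is roughly capacity divided by draw. A quick estimate, assuming the ~100 Wh battery of a 16-inch MacBook Pro (smaller machines carry less) and mid-range draws from the table above:

```python
def runtime_hours(battery_wh: float, draw_watts: float) -> float:
    """Rough continuous runtime at a constant power draw."""
    return battery_wh / draw_watts

battery = 100  # 16-inch MacBook Pro is ~100 Wh
for activity, watts in [("Idle, model loaded", 8), ("Generating (8B)", 28),
                        ("Generating (32B+)", 45)]:
    print(f"{activity}: ~{runtime_hours(battery, watts):.1f} h")
```

Real sessions mix idle and generation, so actual life falls between the idle and sustained-generation figures.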

Tips for battery life:

  • Use smaller models when on battery
  • Set OLLAMA_KEEP_ALIVE=5m to unload models quickly when idle
  • Plug in for extended inference sessions
  • Use a model with lower quantization (Q3 instead of Q5) on battery

Advanced: Building llama.cpp from Source

For maximum performance, build llama.cpp with Apple Silicon optimizations:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Run server
./build/bin/llama-server \
  -m /path/to/model.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080

Next Steps

Your Mac is now a local AI machine. From here, explore more models in the Ollama library, wire up the code-assistant and RAG workflows above, or add a web interface.

Frequently Asked Questions

Can I run local LLMs on an Intel Mac?

Yes, but with significant limitations. Intel Macs run models using CPU-only inference, which is slow. A 2019 MacBook Pro with 16 GB RAM can run a 7B Q4 model at about 3-5 tokens per second. Apple Silicon Macs are dramatically faster because they use the GPU cores through Metal acceleration and benefit from unified memory architecture.

How much unified memory do I need for local AI on a Mac?

8 GB runs small 1-3B models. 16 GB comfortably handles 7-8B models and is the minimum recommended configuration. 32 GB unlocks 14B-32B models. 64 GB runs 70B models. 128 GB or more lets you run 70B models at high quantization or experiment with very large models.

Should I use Ollama or MLX on Apple Silicon?

Use Ollama if you want simplicity and compatibility with the broader ecosystem (Open WebUI, Continue, API tools). Use MLX if you want the latest Apple-optimized performance, want to experiment with Apple's ML framework, or are developing ML applications. Both use Metal GPU acceleration. Ollama is built on llama.cpp; MLX is Apple's native framework.

Why is my Mac using swap memory during inference?

Your model is too large for available unified memory. macOS will swap to disk, which dramatically slows inference. Check Activity Monitor > Memory > Memory Pressure. If it's yellow or red, use a smaller model or lower quantization. Close memory-hungry applications (browsers with many tabs, Electron apps) to free memory.