Linux is the ideal operating system for running local LLMs. It offers the best GPU driver support, the most flexible configuration options, and the full ecosystem of AI tools without the limitations of Windows or macOS. This guide covers GPU setup for both NVIDIA (CUDA) and AMD (ROCm), Ollama installation, building llama.cpp from source for maximum performance, setting up systemd services for production use, and performance tuning across Ubuntu, Fedora, and Arch Linux.
Prerequisites
Check your system:
# Kernel version (5.15+ recommended)
uname -r
# CPU features (AVX2 needed for efficient inference)
lscpu | grep -i avx
# Total RAM
free -h
# List GPUs
lspci | grep -i 'vga\|3d\|display'
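The checks above can be rolled into one small preflight script. This is a sketch of our own, not an official tool; each check degrades gracefully if a utility is missing:

```bash
# One-shot preflight summary of the prerequisite checks above
KERNEL=$(uname -r)
echo "Kernel: $KERNEL"

# AVX2 matters for CPU inference speed
if grep -qi avx2 /proc/cpuinfo 2>/dev/null; then
  AVX2=yes
else
  AVX2=no
fi
echo "AVX2:   $AVX2"

# Total RAM in GiB
RAM_GIB=$(awk '/MemTotal/ {printf "%.1f", $2/1048576}' /proc/meminfo)
echo "RAM:    ${RAM_GIB} GiB"

# GPUs, if lspci is available
if command -v lspci >/dev/null 2>&1; then
  GPUS=$(lspci | grep -i 'vga\|3d\|display' || echo "none detected")
else
  GPUS="lspci not available"
fi
echo "GPU:    $GPUS"
```

As a rough rule of thumb from the rest of this guide: you want AVX2 present for CPU inference and enough RAM or VRAM to hold your quantized model.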
GPU Setup: NVIDIA (CUDA)
NVIDIA GPUs with CUDA provide the fastest local AI inference on Linux. Setup varies by distribution.
Ubuntu 22.04 / 24.04
Option A: Ubuntu’s packaged driver (simplest)
# List available drivers
ubuntu-drivers list
# Install the recommended driver
sudo ubuntu-drivers install
# Reboot
sudo reboot
Option B: NVIDIA’s official driver
# Add NVIDIA's repository
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
# Install the latest driver
sudo apt install -y nvidia-driver-560
# Reboot
sudo reboot
Option C: CUDA Toolkit (includes driver)
# Download CUDA installer
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
# Install
sudo sh cuda_12.6.0_560.28.03_linux.run
# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Fedora 39 / 40
# Install RPM Fusion repositories
sudo dnf install -y \
https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
# Install NVIDIA driver
sudo dnf install -y akmod-nvidia xorg-x11-drv-nvidia-cuda
# Wait for kernel module to build (important!)
sudo akmods --force
sudo dracut --force
# Reboot
sudo reboot
Arch Linux
# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils cuda
# For the latest driver (may be needed for newest GPUs)
# Use the nvidia-dkms package:
sudo pacman -S nvidia-dkms nvidia-utils cuda
# Reboot
sudo reboot
Verify NVIDIA Installation
# Check driver and CUDA version
nvidia-smi
# Expected output:
# Driver Version: 560.xx CUDA Version: 12.6
# And your GPU listed with memory info
# Check CUDA compiler (if CUDA Toolkit installed)
nvcc --version
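For scripting or health checks, nvidia-smi can also emit machine-readable output via its query flags. A small sketch that degrades gracefully when the driver is absent:

```bash
# Query GPU name, driver version, and VRAM as CSV;
# fall back to a message if the driver is not installed
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INFO=$(nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader)
else
  GPU_INFO="nvidia-smi not found (driver not installed or not on PATH)"
fi
echo "$GPU_INFO"
```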
Headless Server Setup
For servers without a display:
# Install driver without OpenGL (headless)
sudo apt install -y nvidia-headless-560 nvidia-utils-560
# Or from NVIDIA .run file:
sudo sh cuda_*.run --no-opengl-files
# Verify
nvidia-smi
GPU Setup: AMD (ROCm)
AMD GPUs use ROCm (Radeon Open Compute) for GPU-accelerated inference on Linux.
Supported AMD GPUs
Officially supported by ROCm:
- RX 7900 XTX, 7900 XT, 7800 XT, 7700 XT (RDNA 3)
- RX 6900 XT, 6800 XT, 6800, 6700 XT (RDNA 2)
- Radeon PRO W7900, W7800
- Instinct MI210, MI250, MI300X
Community supported (may need HSA_OVERRIDE_GFX_VERSION):
- RX 7600, 6600 XT, 6600
- Older RDNA/GCN GPUs
Ubuntu ROCm Installation
# Add AMD's repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt install -y ./amdgpu-install_6.0.60000-1_all.deb
# Install ROCm
sudo amdgpu-install -y --usecase=rocm
# Add user to required groups
sudo usermod -aG render,video $USER
# Reboot
sudo reboot
Fedora ROCm Installation
# ROCm support on Fedora requires manual setup
# Install from AMD's repository or build from source
# Check AMD's documentation for the latest Fedora instructions
# Alternative: Use Ollama which bundles ROCm support
curl -fsSL https://ollama.com/install.sh | sh
Arch Linux ROCm
# Install ROCm from AUR or official repos
sudo pacman -S rocm-hip-sdk rocm-opencl-sdk
# Add user to groups
sudo usermod -aG render,video $USER
# Reboot
sudo reboot
Verify ROCm Installation
# Check ROCm
rocm-smi
# Check HIP (ROCm's CUDA equivalent)
hipconfig --full
# For unsupported GPUs, set override
export HSA_OVERRIDE_GFX_VERSION=11.0.0 # For RDNA 3
export HSA_OVERRIDE_GFX_VERSION=10.3.0 # For RDNA 2
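Note that these exports only last for the current shell session. If Ollama runs as a systemd service (the default install, covered below), the override can be made persistent with a drop-in. This is a sketch; the `11.0.0` value assumes an RDNA 3 card, so adjust it per the table above:

```ini
# /etc/systemd/system/ollama.service.d/rocm-override.conf
# (create via: sudo systemctl edit ollama)
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.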
Installing Ollama
Ollama is the recommended starting point for all Linux distributions.
One-Line Install
curl -fsSL https://ollama.com/install.sh | sh
This script:
- Downloads the Ollama binary to /usr/local/bin/ollama
- Creates an ollama system user
- Creates a systemd service
- Starts Ollama automatically
- Detects and configures NVIDIA CUDA or AMD ROCm
Manual Install
If you prefer not to pipe to shell:
# Download binary (sudo needed to write into /usr/local/bin)
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
# Create user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo usermod -aG render,video ollama # For AMD GPU access
# Create systemd service
sudo tee /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HOME=/usr/share/ollama"
Environment="OLLAMA_HOST=0.0.0.0:11434"
[Install]
WantedBy=default.target
EOF
# Start and enable
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
Verify Installation
# Check service status
systemctl status ollama
# Check version
ollama --version
# Run a model
ollama run llama3.1:8b
# Check GPU is being used (the PROCESSOR column should show GPU)
ollama ps
# Or watch per-response timing stats; a high eval rate indicates GPU offload
ollama run llama3.1:8b --verbose
Ollama Configuration
Configure via environment variables in the systemd service:
# Edit the service
sudo systemctl edit ollama
# Add environment variables in the override file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=30m"
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address (use 0.0.0.0 for network access) |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent request slots |
| OLLAMA_MAX_LOADED_MODELS | 1 | Models loaded simultaneously |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model in memory |
| OLLAMA_NUM_GPU | auto | Number of GPU layers |
| CUDA_VISIBLE_DEVICES | all | Which GPUs to use |
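After changing OLLAMA_HOST, it is worth confirming the server actually answers at the new address. A sketch using Ollama's version endpoint (assumes curl is installed):

```bash
# Probe the version endpoint; succeeds only if Ollama answers on $HOST
HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
if command -v curl >/dev/null 2>&1 \
   && VERSION=$(curl -fsS --max-time 3 "http://$HOST/api/version" 2>/dev/null); then
  STATUS="Ollama reachable on $HOST: $VERSION"
else
  STATUS="Ollama not reachable on $HOST (service down or firewalled?)"
fi
echo "$STATUS"
```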
Building llama.cpp from Source
Building from source gives you maximum performance with optimizations for your specific hardware.
NVIDIA GPU Build
# Install dependencies
sudo apt install -y build-essential cmake git # Ubuntu
sudo dnf install -y gcc gcc-c++ cmake git # Fedora
sudo pacman -S base-devel cmake git # Arch
# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Verify
./build/bin/llama-cli --version
AMD GPU Build
cd llama.cpp
# Build with ROCm/HIP
cmake -B build \
-DGGML_HIP=ON \
-DGGML_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release \
-DAMDGPU_TARGETS="gfx1100;gfx1030" # Adjust for your GPU
cmake --build build --config Release -j$(nproc)
AMD GPU target strings:
- gfx1100: RDNA 3 (RX 7900 series)
- gfx1030: RDNA 2 (RX 6800/6900 series)
- gfx1010: RDNA 1 (RX 5700 series)
CPU-Only Build (Optimized)
cd llama.cpp
cmake -B build \
-DGGML_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
The -DGGML_NATIVE=ON flag enables CPU-specific optimizations (AVX2, AVX-512, etc.) for your exact processor.
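To measure what a build actually buys you, llama.cpp ships a benchmark tool alongside the server. A hedged sketch (assumes the build above succeeded; the MODEL path is a placeholder matching the model downloaded in the next section):

```bash
# Benchmark prompt processing (-p) and token generation (-n) throughput
BENCH=./build/bin/llama-bench
MODEL="${MODEL:-./Llama-3.1-8B-Instruct-Q4_K_M.gguf}"
if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
  "$BENCH" -m "$MODEL" -p 512 -n 128
  BENCH_RAN=yes
else
  BENCH_RAN=no
  echo "llama-bench or model not found; build first and set MODEL=/path/to/model.gguf"
fi
```

Comparing a native build against a generic one on the same model is the quickest way to confirm the flags are paying off.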
Running llama.cpp Server
# Download a model (e.g., from Hugging Face)
wget https://huggingface.co/bartowski/Llama-3.1-8B-Instruct-GGUF/resolve/main/Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Run the server
./build/bin/llama-server \
-m Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 8192 \
--threads $(nproc)
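llama-server exposes an OpenAI-compatible API alongside its native one. A quick smoke test (sketch; assumes the server above is running on port 8080 and curl is installed):

```bash
# Send a one-shot chat request; fall back to a message if the server is down
REPLY=$(curl -fsS --max-time 30 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}' \
  2>/dev/null) || REPLY="llama-server not reachable on port 8080"
echo "$REPLY"
```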
llama.cpp as a systemd Service
sudo tee /etc/systemd/system/llamacpp.service << 'EOF'
[Unit]
Description=llama.cpp Server
After=network.target
[Service]
Type=simple
User=llamacpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
-m /opt/models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 8192
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable llamacpp
sudo systemctl start llamacpp
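Since this service listens on 0.0.0.0, it is worth sandboxing it. A sketch of a hardening drop-in using standard systemd directives (paths are assumptions matching the unit above; adjust to your layout):

```ini
# /etc/systemd/system/llamacpp.service.d/hardening.conf
[Service]
# No privilege escalation, private /tmp, read-only system directories
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
# Models stay read-only for the service
ReadOnlyPaths=/opt/models
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart llamacpp`.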
Installing vLLM (Production Serving)
vLLM is the go-to for serving models to multiple users with high throughput.
# Install via pip (requires Python 3.9+)
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# Multi-GPU (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 8192
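vLLM speaks the same OpenAI-compatible API. Once the server is up, a minimal smoke test (sketch; assumes port 8000 as configured above):

```bash
# List the served models; fall back to a message if the server is down
MODELS=$(curl -fsS --max-time 5 http://localhost:8000/v1/models 2>/dev/null) \
  || MODELS="vLLM not reachable on port 8000"
echo "$MODELS"
```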
vLLM as a systemd Service
sudo tee /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
Type=simple
User=vllm
Environment="CUDA_VISIBLE_DEVICES=0,1"
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
Performance Tuning
NVIDIA GPU Tuning
# Set persistence mode (keeps GPU initialized)
sudo nvidia-smi -pm 1
# Lock application clocks (supported mainly on workstation/datacenter GPUs)
sudo nvidia-smi -ac 1593,1410 # Memory,Graphics clocks; list valid pairs with: nvidia-smi -q -d SUPPORTED_CLOCKS
# Disable ECC on cards that support it (mostly workstation/datacenter GPUs; frees ~6% VRAM)
# WARNING: Only do this if you understand the reliability trade-off
sudo nvidia-smi --ecc-config=0
# Requires reboot
# Monitor GPU during inference
watch -n 0.5 nvidia-smi
CPU Tuning
# Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Make it persistent (Ubuntu)
sudo apt install -y cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtils
# Fedora: use tuned instead
sudo dnf install -y tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile throughput-performance
# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Memory Tuning
# Reserve huge pages (can improve memory access patterns for some workloads)
echo 'vm.nr_hugepages=1024' | sudo tee -a /etc/sysctl.d/99-hugepages.conf
sudo sysctl --system
# Disable transparent huge pages if causing latency spikes
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Check NUMA topology (for multi-socket systems)
numactl --hardware
# Run Ollama pinned to the NUMA node nearest your GPU:
numactl --cpunodebind=0 --membind=0 ollama serve
IO Tuning
# Use a fast NVMe SSD for model storage
# Check disk speed
sudo hdparm -tT /dev/nvme0n1
# Mount with noatime for slightly better IO
# In /etc/fstab, add noatime to your mount options
Network Tuning (For Multi-User Serving)
# Increase connection limits
echo 'net.core.somaxconn=65535' | sudo tee -a /etc/sysctl.d/99-network.conf
echo 'net.ipv4.tcp_max_syn_backlog=65535' | sudo tee -a /etc/sysctl.d/99-network.conf
sudo sysctl --system
Docker Setup
Install Docker
# Ubuntu
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Fedora
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable docker
sudo systemctl start docker
# Arch
sudo pacman -S docker
sudo systemctl enable docker
sudo systemctl start docker
NVIDIA Container Toolkit
# Add repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
Run Ollama in Docker
# NVIDIA GPU
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
--restart unless-stopped \
ollama/ollama
# AMD GPU
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
--restart unless-stopped \
ollama/ollama:rocm
# Pull a model
docker exec -it ollama ollama pull llama3.1:8b
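The docker run flags above map directly onto a Compose file if you prefer a declarative setup. A sketch for the NVIDIA case (the file path and service name are our choice; start it with `docker compose up -d`):

```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

For AMD, swap the image for ollama/ollama:rocm and replace the `deploy` block with the `/dev/kfd` and `/dev/dri` device mappings shown above.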
Firewall Configuration
If serving models over the network:
# UFW (Ubuntu)
sudo ufw allow 11434/tcp # Ollama
sudo ufw allow 8080/tcp # llama.cpp server
sudo ufw allow 3000/tcp # Open WebUI
# firewalld (Fedora)
sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload
# iptables (Arch/any)
sudo iptables -A INPUT -p tcp --dport 11434 -j ACCEPT
Monitoring and Logging
GPU Monitoring
# Real-time NVIDIA monitoring
watch -n 0.5 nvidia-smi
# Detailed per-process GPU usage
nvidia-smi pmon -i 0 -s u -d 1
# AMD GPU monitoring
watch -n 1 rocm-smi
# Install nvtop for a nice TUI
sudo apt install -y nvtop # Ubuntu
sudo dnf install -y nvtop # Fedora
sudo pacman -S nvtop # Arch
nvtop
Ollama Logs
# View service logs
journalctl -u ollama -f
# View last 100 lines
journalctl -u ollama -n 100
# Filter by time
journalctl -u ollama --since "1 hour ago"
System Monitoring During Inference
# Combined monitoring
htop # CPU and RAM
nvtop # GPU
iotop # Disk IO
# All in one terminal with tmux
tmux new-session -d -s monitor 'htop' \; \
split-window -h 'nvtop' \; \
attach
Troubleshooting
NVIDIA Driver Issues
# Check if driver is loaded
lsmod | grep nvidia
# If not loaded, try:
sudo modprobe nvidia
# If modprobe fails, check secure boot
mokutil --sb-state
# If enabled, either disable secure boot or sign the NVIDIA module
# Check dmesg for errors
dmesg | grep -i nvidia
ROCm Not Detecting GPU
# Check permissions
groups $USER # Should include render and video
# Check device nodes
ls -la /dev/kfd /dev/dri/render*
# Try override for unsupported GPU
export HSA_OVERRIDE_GFX_VERSION=11.0.0
rocm-smi
Ollama CUDA Errors
# Common fix: reinstall Ollama to get matching CUDA libraries
curl -fsSL https://ollama.com/install.sh | sh
# Check CUDA library path
ldconfig -p | grep cuda
# Set library path explicitly
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Out of Memory
# Check available VRAM
nvidia-smi
# Check system RAM
free -h
# Kill other GPU processes
nvidia-smi # Note PIDs
kill <pid>
# Use a smaller model or lower quantization
ollama run llama3.1:8b-instruct-q3_K_M
Next Steps
Your Linux system is fully configured for local AI. Where to go next:
- Choose the right model: Model selection guide
- Add a web interface: Open WebUI + Ollama setup
- Containerize your setup: Docker and Kubernetes guide
- Build a RAG system: Local RAG Chatbot tutorial
- Deploy for your team: Enterprise Local AI guide