A local AI setup is not a single application. It’s a stack of components that work together, much like a web development stack (think LAMP or MERN). Understanding these layers helps you make informed choices instead of blindly following tutorials. This guide breaks down the three-layer architecture of local AI, explains when to use each major component, and provides five ready-to-deploy reference stacks for common scenarios.
The Three-Layer Architecture
Every local AI setup consists of three layers:
┌─────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Frameworks, RAG, agents, pipelines │
│ LangChain, LlamaIndex, Haystack │
├─────────────────────────────────────────┤
│ INTERFACE LAYER │
│ How users interact with the model │
│ Open WebUI, LM Studio, CLI, API │
├─────────────────────────────────────────┤
│ INFERENCE LAYER │
│ Runs the model, generates tokens │
│ Ollama, llama.cpp, vLLM, ExLlamaV2 │
└─────────────────────────────────────────┘
Some tools span multiple layers (LM Studio includes both inference and interface), but conceptually, these three layers are always present.
Layer 1: The Inference Engine
The inference engine is the foundation. It loads model weights into memory, processes input tokens, and generates output tokens. Everything else depends on this layer.
Ollama
What it is: An all-in-one model manager and inference server with a built-in model library.
Best for: Personal use, development, small teams, getting started quickly.
Strengths:
- Single binary, no dependencies
- Built-in model library (ollama pull llama3.1)
- Automatic GPU detection and optimization
- OpenAI-compatible API
- Cross-platform (macOS, Linux, Windows)
- Modelfile system for custom configurations
Limitations:
- Single-user optimized (limited concurrent request handling)
- Less tuning control than raw llama.cpp
- Quantization options limited to what’s in the library (or custom Modelfiles)
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run
ollama run llama3.1:8b
# API endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}]
}'
llama.cpp
What it is: The foundational C/C++ library for local LLM inference. Many other tools (including Ollama and LM Studio) are built on top of it.
Best for: Maximum performance, custom builds, embedded systems, advanced users.
Strengths:
- Maximum control over every parameter
- Excellent inference speed for a given hardware configuration, especially on CPU and Apple Silicon
- Supports every quantization level
- Can be compiled for specific CPU/GPU targets
- Server mode with OpenAI-compatible API
- Smallest resource footprint
Limitations:
- No built-in model management
- Must download GGUF files manually
- Configuration via command-line flags
- Steeper learning curve
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for macOS
cmake --build build --config Release -j$(nproc)
# Run the server
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 4096
# -ngl sets how many layers are offloaded to the GPU; -c sets the context size in tokens
vLLM
What it is: A high-throughput inference engine designed for serving multiple users.
Best for: Production deployments, multi-user servers, enterprise.
Strengths:
- PagedAttention for efficient memory use
- Continuous batching for high throughput
- Tensor parallelism (multi-GPU)
- OpenAI-compatible API server
- Supports many model architectures natively
Limitations:
- GPU-only (no CPU inference)
- Higher resource overhead
- Python-based (heavier than C++ alternatives)
- Primarily Linux
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192
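Because vLLM exposes the same OpenAI-compatible protocol as Ollama and llama.cpp's server mode, client code is interchangeable across engines. A minimal sketch, assuming the server started above is listening on localhost:8000 and the model name matches the one passed to vllm serve:
from openai import OpenAI
# Point the standard OpenAI client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)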
ExLlamaV2
What it is: A highly optimized CUDA inference library focused on speed.
Best for: Maximum generation speed on NVIDIA GPUs.
Strengths:
- Fastest token generation for NVIDIA GPUs
- EXL2 quantization format (flexible bits-per-weight)
- Excellent speculative decoding support
- Low VRAM overhead
Limitations:
- NVIDIA only
- Smaller community than llama.cpp
- Fewer integrations
MLC LLM
What it is: Machine Learning Compilation for LLMs. Compiles models for specific hardware targets.
Best for: Mobile deployment, WebGPU, edge devices.
Strengths:
- Runs on phones (iOS/Android), browsers, and edge devices
- Optimized compilation for each target
- WebGPU support for browser-based inference
Limitations:
- More complex setup than alternatives
- Smaller model selection
Decision Tree: Which Inference Engine?
Start here:
├── Just getting started?
│ └── Use Ollama
├── Need multi-user production serving?
│ └── Use vLLM
├── Need maximum single-user speed on NVIDIA?
│ └── Use ExLlamaV2
├── Need maximum control and customization?
│ └── Use llama.cpp
├── Deploying to mobile or browser?
│ └── Use MLC LLM
└── Building on Apple Silicon with ML focus?
└── Use MLX
Layer 2: The Interface
The interface layer is how you (or your users) interact with the model. This can be a CLI, a web UI, a desktop app, or an API that other applications consume.
Open WebUI
What it is: A self-hosted web interface that provides a ChatGPT-like experience for local models.
Best for: Personal use, small teams, anyone who wants a polished chat UI.
Key features:
- Multi-model support (switch between models mid-conversation)
- Conversation history and search
- User management and authentication
- Document upload and RAG
- Plugin/function system
- Mobile-responsive design
- Supports Ollama and any OpenAI-compatible backend
# Quick start with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
LM Studio
What it is: A desktop application with integrated model browser, inference engine, and chat interface.
Best for: Users who prefer a native GUI, model experimentation.
Key features:
- Built-in model browser (search and download from Hugging Face)
- Local inference engine (based on llama.cpp)
- Chat interface
- Local server mode with OpenAI-compatible API (see the example below)
- Cross-platform (macOS, Windows, Linux)
Limitations:
- Closed-source
- Cannot easily be deployed as a server for teams
- Resource usage higher than command-line tools
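LM Studio's local server mode exposes an OpenAI-compatible endpoint, so the same client code used elsewhere in this guide works here too. A minimal sketch, assuming the server is enabled in LM Studio on its default port 1234 with a model already loaded (the model identifier below is a placeholder):
from openai import OpenAI
# LM Studio's local server defaults to port 1234 (confirm in the app's server settings)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="unused")
response = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)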
Text-Generation-WebUI (Oobabooga)
What it is: A comprehensive web interface with extensive backend support and parameter control.
Best for: Advanced users who want fine-grained control over generation parameters.
Key features:
- Supports multiple backends (Transformers, llama.cpp, ExLlamaV2, GPTQ, AutoGPTQ)
- Extensive parameter tuning UI
- Character/persona system
- Extensions framework
- LoRA loading and management
Jan
What it is: An open-source desktop application focused on offline, privacy-first AI.
Best for: Privacy-focused users who want a polished desktop experience.
LibreChat
What it is: A multi-provider chat interface that supports both cloud and local backends.
Best for: Organizations that use a mix of cloud and local AI.
CLI / Terminal
What it is: Direct command-line interaction with the model.
Best for: Developers, scripting, automation.
# Ollama CLI
ollama run llama3.1:8b
# llama.cpp CLI
./llama-cli -m model.gguf -p "Hello" -n 256
# Pipe input
echo "Summarize this: $(cat document.txt)" | ollama run llama3.1:8b
API Only (Headless)
Sometimes you don’t need a user-facing interface at all. You just need an API that other services can call.
# Ollama exposes an API automatically on port 11434
# vLLM serves an OpenAI-compatible API
# llama.cpp server mode provides the same
# Any OpenAI-compatible client library works:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Hello"}]
)
Decision Tree: Which Interface?
Start here:
├── Want a ChatGPT-like web experience?
│ └── Open WebUI
├── Prefer a native desktop app?
│ ├── Want a model browser built in?
│ │ └── LM Studio
│ └── Want open-source desktop?
│ └── Jan
├── Need multi-provider (cloud + local)?
│ └── LibreChat
├── Want maximum parameter control?
│ └── Text-Generation-WebUI
├── Building automations/pipelines?
│ └── API only (headless)
└── Developer who lives in terminal?
└── CLI (Ollama CLI or llama.cpp)
Layer 3: The Application Framework
The application layer is where you build things on top of the model. RAG pipelines, AI agents, custom workflows, and integrations all live here.
LangChain
What it is: The most popular framework for building LLM-powered applications.
Best for: RAG pipelines, agents, chains, tool use, complex workflows.
Strengths:
- Massive ecosystem of integrations
- Comprehensive RAG support
- Agent and tool-use frameworks
- LangSmith for debugging and tracing
- Active community
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
response = llm.invoke([HumanMessage(content="What is local AI?")])
print(response.content)
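Beyond single calls, LangChain's chains compose prompts, models, and output parsers into reusable pipelines. A minimal sketch of a prompt-to-model chain against the same local Ollama endpoint (the prompt text is illustrative):
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph.")
# Pipe prompt -> model -> string parser into a single runnable chain
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"topic": "local AI inference"}))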
LlamaIndex
What it is: A data framework focused on connecting LLMs to your data (documents, databases, APIs).
Best for: Document Q&A, knowledge bases, structured data querying.
Strengths:
- Best-in-class document indexing and retrieval
- Many data connectors (PDF, web, databases, APIs)
- Sophisticated chunking and retrieval strategies
- Query engines for structured data
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3.1:8b", base_url="http://localhost:11434")
# Use a local embedding model (requires: ollama pull nomic-embed-text);
# otherwise LlamaIndex falls back to OpenAI's embedding API
embed = OllamaEmbedding(model_name="nomic-embed-text", base_url="http://localhost:11434")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What is in these documents?")
print(response)
Haystack
What it is: An end-to-end NLP framework by deepset, focused on production-grade pipelines.
Best for: Production NLP pipelines, search systems, enterprise deployments.
Direct API Integration
For simple use cases, you don’t need a framework at all. The OpenAI-compatible API works with any HTTP client:
import requests
response = requests.post("http://localhost:11434/v1/chat/completions", json={
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}]
})
print(response.json()["choices"][0]["message"]["content"])
// Node.js
const response = await fetch("http://localhost:11434/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3.1:8b",
messages: [{ role: "user", content: "Hello" }]
})
});
const data = await response.json();
console.log(data.choices[0].message.content);
Vector Databases (For RAG)
If you’re building RAG applications, you need a vector database to store and search embeddings; a minimal ChromaDB example follows the comparison table below:
| Database | Best For | Storage |
|---|---|---|
| ChromaDB | Local development, small-medium datasets | Embedded/file |
| FAISS | High-performance search, research | In-memory |
| Qdrant | Production deployments, filtering | Client-server |
| Weaviate | Full-featured, hybrid search | Client-server |
| pgvector | Already using PostgreSQL | PostgreSQL extension |
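As a concrete example, here is a minimal sketch of storing and querying embeddings with ChromaDB, assuming Ollama is running locally with the nomic-embed-text model pulled (see Stack 3 below) and using Ollama's /api/embeddings endpoint; the document strings are illustrative:
import chromadb
import requests
def embed(text: str) -> list[float]:
    # Ask the local Ollama server for an embedding vector
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]
client = chromadb.PersistentClient(path="./chroma_db")  # persisted to disk
collection = client.get_or_create_collection("notes")
docs = ["Llama 3.1 8B runs well on 8 GB of VRAM.", "vLLM targets multi-user serving."]
collection.add(ids=["doc1", "doc2"], documents=docs,
               embeddings=[embed(d) for d in docs])
results = collection.query(query_embeddings=[embed("Which engine is for many users?")],
                           n_results=1)
print(results["documents"][0][0])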
Five Reference Stacks
Here are five complete stacks you can deploy today, each optimized for a different scenario.
Stack 1: Personal AI Assistant
Scenario: Single user who wants a private ChatGPT replacement.
Engine: Ollama
Interface: Open WebUI
Model: Llama 3.1 8B
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
# Install Open WebUI
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, create an account, and start chatting.
Stack 2: Developer Workstation
Scenario: Software developer who wants local code assistance in their IDE plus terminal chat.
Engine: Ollama
Interface: Continue (VS Code) + CLI
Models: Qwen 2.5 Coder 7B (code) + Llama 3.1 8B (chat)
# Install models
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
# Install Continue extension in VS Code
# Then configure ~/.continue/config.json:
{
"models": [
{
"title": "Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Stack 3: Document Q&A System (RAG)
Scenario: User who wants to chat with their documents (PDFs, notes, knowledge base).
Engine: Ollama
Framework: LlamaIndex
Vector DB: ChromaDB
Interface: Custom Streamlit app or Open WebUI with RAG
Models: Llama 3.1 8B + nomic-embed-text
# Install models
ollama pull llama3.1:8b
ollama pull nomic-embed-text
# Install Python dependencies
pip install llama-index llama-index-llms-ollama \
llama-index-embeddings-ollama chromadb streamlit
# Simple RAG pipeline
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
llm = Ollama(model="llama3.1:8b")
embed = OllamaEmbedding(model_name="nomic-embed-text")
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(
documents, embed_model=embed
)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("Summarize the key findings")
print(response)
Stack 4: Team/Small Business Server
Scenario: 5-20 users who need shared access to local AI with user management.
Engine: Ollama (or vLLM for higher concurrency)
Interface: Open WebUI or LibreChat
Proxy: Nginx with HTTPS
Models: Llama 3.1 8B + Qwen 2.5 14B
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped
volumes:
  ollama_data:
  webui_data:
Stack 5: Enterprise Production Deployment
Scenario: Organization deploying AI for 100+ users with security, monitoring, and high availability.
Engine: vLLM (multi-GPU, tensor parallel)
Interface: LibreChat (multi-provider, LDAP/OIDC auth)
Orchestration: Kubernetes with NVIDIA GPU Operator
Monitoring: Prometheus + Grafana
Load Balancer: Nginx/Traefik with rate limiting
Models: Llama 3.1 70B (primary) + specialized models
This stack is covered in detail in our Enterprise Local AI Deployment guide.
Choosing Your Stack: Summary
| Scenario | Engine | Interface | Framework |
|---|---|---|---|
| Just exploring | Ollama | CLI | None |
| Personal assistant | Ollama | Open WebUI | None |
| Developer tools | Ollama | Continue/IDE | None |
| Document Q&A | Ollama | Custom/Open WebUI | LlamaIndex |
| AI applications | Ollama/vLLM | API (headless) | LangChain |
| Team server | Ollama | Open WebUI | None |
| Enterprise | vLLM | LibreChat | LangChain/Haystack |
| Maximum speed | llama.cpp/ExLlamaV2 | CLI/API | None |
| Model experimentation | LM Studio | LM Studio | None |
The beauty of the local AI ecosystem is its modularity. Start simple with Ollama and a CLI, then add components as your needs grow. Every layer can be swapped independently, so you’re never locked in.