A local AI setup is not a single application. It’s a stack of components that work together, much like a web development stack (think LAMP or MERN). Understanding these layers helps you make informed choices instead of blindly following tutorials. This guide breaks down the three-layer architecture of local AI, explains when to use each major component, and provides five ready-to-deploy reference stacks for common scenarios.
The Three-Layer Architecture
Every local AI setup consists of three layers:
┌─────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Frameworks, RAG, agents, pipelines │
│ LangChain, LlamaIndex, Haystack │
├─────────────────────────────────────────┤
│ INTERFACE LAYER │
│ How users interact with the model │
│ Open WebUI, LM Studio, CLI, API │
├─────────────────────────────────────────┤
│ INFERENCE LAYER │
│ Runs the model, generates tokens │
│ Ollama, llama.cpp, vLLM, ExLlamaV2 │
└─────────────────────────────────────────┘
Some tools span multiple layers (LM Studio includes both inference and interface), but conceptually, these three layers are always present.
Layer 1: The Inference Engine
The inference engine is the foundation. It loads model weights into memory, processes input tokens, and generates output tokens. Everything else depends on this layer.
Ollama
What it is: An all-in-one model manager and inference server with a built-in model library.
Best for: Personal use, development, small teams, getting started quickly.
Strengths:
- Single binary, no dependencies
- Built-in model library (ollama pull llama3.1)
- Automatic GPU detection and optimization
- OpenAI-compatible API
- Cross-platform (macOS, Linux, Windows)
- Modelfile system for custom configurations
Limitations:
- Single-user optimized (limited concurrent request handling)
- Less tuning control than raw llama.cpp
- Quantization options limited to what’s in the library (or custom Modelfiles)
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run
ollama run llama3.1:8b
# API endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}]
}'
llama.cpp
What it is: The foundational C/C++ library for local LLM inference. Many other tools (including Ollama and LM Studio) are built on top of it.
Best for: Maximum performance, custom builds, embedded systems, advanced users.
Strengths:
- Maximum control over every parameter
- Excellent inference speed for a given hardware configuration, especially on CPU and Apple Silicon
- Supports every quantization level
- Can be compiled for specific CPU/GPU targets
- Server mode with OpenAI-compatible API
- Smallest resource footprint
Limitations:
- No built-in model management
- Must download GGUF files manually
- Configuration via command-line flags
- Steeper learning curve
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for macOS
cmake --build build --config Release -j$(nproc)
# Run the server
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 4096
# -ngl sets how many layers are offloaded to the GPU; -c sets the context size in tokens
vLLM
What it is: A high-throughput inference engine designed for serving multiple users.
Best for: Production deployments, multi-user servers, enterprise.
Strengths:
- PagedAttention for efficient memory use
- Continuous batching for high throughput
- Tensor parallelism (multi-GPU)
- OpenAI-compatible API server
- Supports many model architectures natively
Limitations:
- GPU-only (no CPU inference)
- Higher resource overhead
- Python-based (heavier than C++ alternatives)
- Primarily Linux
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192
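Because vLLM exposes the same OpenAI-compatible protocol as Ollama and llama.cpp's server mode, client code is interchangeable across engines. A minimal sketch, assuming the server started above is listening on localhost:8000 and the model name matches the one passed to vllm serve:
from openai import OpenAI
# Point the standard OpenAI client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)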
ExLlamaV2
What it is: A highly optimized CUDA inference library focused on speed.
Best for: Maximum generation speed on NVIDIA GPUs.
Strengths:
- Fastest token generation for NVIDIA GPUs
- EXL2 quantization format (flexible bits-per-weight)
- Excellent speculative decoding support
- Low VRAM overhead
Limitations:
- NVIDIA only
- Smaller community than llama.cpp
- Fewer integrations
MLC LLM
What it is: Machine Learning Compilation for LLMs. Compiles models for specific hardware targets.
Best for: Mobile deployment, WebGPU, edge devices.
Strengths:
- Runs on phones (iOS/Android), browsers, and edge devices
- Optimized compilation for each target
- WebGPU support for browser-based inference
Limitations:
- More complex setup than alternatives
- Smaller model selection
Decision Tree: Which Inference Engine?
Start here:
├── Just getting started?
│ └── Use Ollama
├── Need multi-user production serving?
│ └── Use vLLM
├── Need maximum single-user speed on NVIDIA?
│ └── Use ExLlamaV2
├── Need maximum control and customization?
│ └── Use llama.cpp
├── Deploying to mobile or browser?
│ └── Use MLC LLM
└── Building on Apple Silicon with ML focus?
└── Use MLX
Layer 2: The Interface
The interface layer is how you (or your users) interact with the model. This can be a CLI, a web UI, a desktop app, or an API that other applications consume.
Open WebUI
What it is: A self-hosted web interface that provides a ChatGPT-like experience for local models.
Best for: Personal use, small teams, anyone who wants a polished chat UI.
Key features:
- Multi-model support (switch between models mid-conversation)
- Conversation history and search
- User management and authentication
- Document upload and RAG
- Plugin/function system
- Mobile-responsive design
- Supports Ollama and any OpenAI-compatible backend
# Quick start with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
LM Studio
What it is: A desktop application with integrated model browser, inference engine, and chat interface.
Best for: Users who prefer a native GUI, model experimentation.
Key features:
- Built-in model browser (search and download from Hugging Face)
- Local inference engine (based on llama.cpp)
- Chat interface
- Local server mode with OpenAI-compatible API (see the example below)
- Cross-platform (macOS, Windows, Linux)
Limitations:
- Closed-source
- Cannot easily be deployed as a server for teams
- Resource usage higher than command-line tools
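LM Studio's local server mode exposes an OpenAI-compatible endpoint, so the same client code used elsewhere in this guide works here too. A minimal sketch, assuming the server is enabled in LM Studio on its default port 1234 with a model already loaded (the model identifier below is a placeholder):
from openai import OpenAI
# LM Studio's local server defaults to port 1234 (confirm in the app's server settings)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="unused")
response = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)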
Text-Generation-WebUI (Oobabooga)
What it is: A comprehensive web interface with extensive backend support and parameter control.
Best for: Advanced users who want fine-grained control over generation parameters.
Key features:
- Supports multiple backends (Transformers, llama.cpp, ExLlamaV2, GPTQ, AutoGPTQ)
- Extensive parameter tuning UI
- Character/persona system
- Extensions framework
- LoRA loading and management
Jan
What it is: An open-source desktop application focused on offline, privacy-first AI.
Best for: Privacy-focused users who want a polished desktop experience.
LibreChat
What it is: A multi-provider chat interface that supports both cloud and local backends.
Best for: Organizations that use a mix of cloud and local AI.
CLI / Terminal
What it is: Direct command-line interaction with the model.
Best for: Developers, scripting, automation.
# Ollama CLI
ollama run llama3.1:8b
# llama.cpp CLI
./llama-cli -m model.gguf -p "Hello" -n 256
# Pipe input
echo "Summarize this: $(cat document.txt)" | ollama run llama3.1:8b
API Only (Headless)
Sometimes you don’t need a user-facing interface at all. You just need an API that other services can call.
# Ollama exposes an API automatically on port 11434
# vLLM serves an OpenAI-compatible API
# llama.cpp server mode provides the same
# Any OpenAI-compatible client library works:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Hello"}]
)
Decision Tree: Which Interface?
Start here:
├── Want a ChatGPT-like web experience?
│ └── Open WebUI
├── Prefer a native desktop app?
│ ├── Want a model browser built in?
│ │ └── LM Studio
│ └── Want open-source desktop?
│ └── Jan
├── Need multi-provider (cloud + local)?
│ └── LibreChat
├── Want maximum parameter control?
│ └── Text-Generation-WebUI
├── Building automations/pipelines?
│ └── API only (headless)
└── Developer who lives in terminal?
└── CLI (Ollama CLI or llama.cpp)
Layer 3: The Application Framework
The application layer is where you build things on top of the model. RAG pipelines, AI agents, custom workflows, and integrations all live here.
LangChain
What it is: The most popular framework for building LLM-powered applications.
Best for: RAG pipelines, agents, chains, tool use, complex workflows.
Strengths:
- Massive ecosystem of integrations
- Comprehensive RAG support
- Agent and tool-use frameworks
- LangSmith for debugging and tracing
- Active community
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
response = llm.invoke([HumanMessage(content="What is local AI?")])
print(response.content)
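Beyond single calls, LangChain's chains compose prompts, models, and output parsers into reusable pipelines. A minimal sketch of a prompt-to-model chain against the same local Ollama endpoint (the prompt text is illustrative):
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph.")
# Pipe prompt -> model -> string parser into a single runnable chain
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"topic": "local AI inference"}))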
LlamaIndex
What it is: A data framework focused on connecting LLMs to your data (documents, databases, APIs).
Best for: Document Q&A, knowledge bases, structured data querying.
Strengths:
- Best-in-class document indexing and retrieval
- Many data connectors (PDF, web, databases, APIs)
- Sophisticated chunking and retrieval strategies
- Query engines for structured data
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3.1:8b", base_url="http://localhost:11434")
# Use a local embedding model (requires: ollama pull nomic-embed-text);
# otherwise LlamaIndex falls back to OpenAI's embedding API
embed = OllamaEmbedding(model_name="nomic-embed-text", base_url="http://localhost:11434")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What is in these documents?")
print(response)
Haystack
What it is: An end-to-end NLP framework by deepset, focused on production-grade pipelines.
Best for: Production NLP pipelines, search systems, enterprise deployments.
Direct API Integration
For simple use cases, you don’t need a framework at all. The OpenAI-compatible API works with any HTTP client:
import requests
response = requests.post("http://localhost:11434/v1/chat/completions", json={
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}]
})
print(response.json()["choices"][0]["message"]["content"])
// Node.js
const response = await fetch("http://localhost:11434/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3.1:8b",
messages: [{ role: "user", content: "Hello" }]
})
});
const data = await response.json();
console.log(data.choices[0].message.content);
Vector Databases (For RAG)
If you’re building RAG applications, you need a vector database to store and search embeddings; a minimal ChromaDB example follows the comparison table below:
| Database | Best For | Storage |
|---|---|---|
| ChromaDB | Local development, small-medium datasets | Embedded/file |
| FAISS | High-performance search, research | In-memory |
| Qdrant | Production deployments, filtering | Client-server |
| Weaviate | Full-featured, hybrid search | Client-server |
| pgvector | Already using PostgreSQL | PostgreSQL extension |
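As a concrete example, here is a minimal sketch of storing and querying embeddings with ChromaDB, assuming Ollama is running locally with the nomic-embed-text model pulled (see Stack 3 below) and using Ollama's /api/embeddings endpoint; the document strings are illustrative:
import chromadb
import requests
def embed(text: str) -> list[float]:
    # Ask the local Ollama server for an embedding vector
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]
client = chromadb.PersistentClient(path="./chroma_db")  # persisted to disk
collection = client.get_or_create_collection("notes")
docs = ["Llama 3.1 8B runs well on 8 GB of VRAM.", "vLLM targets multi-user serving."]
collection.add(ids=["doc1", "doc2"], documents=docs,
               embeddings=[embed(d) for d in docs])
results = collection.query(query_embeddings=[embed("Which engine is for many users?")],
                           n_results=1)
print(results["documents"][0][0])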
Five Reference Stacks
Here are five complete stacks you can deploy today, each optimized for a different scenario.
Stack 1: Personal AI Assistant
Scenario: Single user who wants a private ChatGPT replacement.
Engine: Ollama
Interface: Open WebUI
Model: Llama 3.1 8B
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
# Install Open WebUI
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, create an account, and start chatting.
Stack 2: Developer Workstation
Scenario: Software developer who wants local code assistance in their IDE plus terminal chat.
Engine: Ollama
Interface: Continue (VS Code) + CLI
Models: Qwen 2.5 Coder 7B (code) + Llama 3.1 8B (chat)
# Install models
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
# Install Continue extension in VS Code
# Then configure ~/.continue/config.json:
{
"models": [
{
"title": "Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Stack 3: Document Q&A System (RAG)
Scenario: User who wants to chat with their documents (PDFs, notes, knowledge base).
Engine: Ollama
Framework: LlamaIndex
Vector DB: ChromaDB
Interface: Custom Streamlit app or Open WebUI with RAG
Models: Llama 3.1 8B + nomic-embed-text
# Install models
ollama pull llama3.1:8b
ollama pull nomic-embed-text
# Install Python dependencies
pip install llama-index llama-index-llms-ollama \
llama-index-embeddings-ollama chromadb streamlit
# Simple RAG pipeline
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
llm = Ollama(model="llama3.1:8b")
embed = OllamaEmbedding(model_name="nomic-embed-text")
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(
documents, embed_model=embed
)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("Summarize the key findings")
print(response)
Stack 4: Team/Small Business Server
Scenario: 5-20 users who need shared access to local AI with user management.
Engine: Ollama (or vLLM for higher concurrency)
Interface: Open WebUI or LibreChat
Proxy: Nginx with HTTPS
Models: Llama 3.1 8B + Qwen 2.5 14B
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped
volumes:
  ollama_data:
  webui_data:
Stack 5: Enterprise Production Deployment
Scenario: Organization deploying AI for 100+ users with security, monitoring, and high availability.
Engine: vLLM (multi-GPU, tensor parallel)
Interface: LibreChat (multi-provider, LDAP/OIDC auth)
Orchestration: Kubernetes with NVIDIA GPU Operator
Monitoring: Prometheus + Grafana
Load Balancer: Nginx/Traefik with rate limiting
Models: Llama 3.1 70B (primary) + specialized models
This stack is covered in detail in our Enterprise Local AI Deployment guide.
Choosing Your Stack: Summary
| Scenario | Engine | Interface | Framework |
|---|---|---|---|
| Just exploring | Ollama | CLI | None |
| Personal assistant | Ollama | Open WebUI | None |
| Developer tools | Ollama | Continue/IDE | None |
| Document Q&A | Ollama | Custom/Open WebUI | LlamaIndex |
| AI applications | Ollama/vLLM | API (headless) | LangChain |
| Team server | Ollama | Open WebUI | None |
| Enterprise | vLLM | LibreChat | LangChain/Haystack |
| Maximum speed | llama.cpp/ExLlamaV2 | CLI/API | None |
| Model experimentation | LM Studio | LM Studio | None |
The beauty of the local AI ecosystem is its modularity. Start simple with Ollama and a CLI, then add components as your needs grow. Every layer can be swapped independently, so you’re never locked in.