The Local AI Stack: Choosing Your Engine, UI, and Framework

Understand the three-layer local AI architecture and choose the right inference engine, user interface, and application framework for your needs. Includes five reference stacks for common scenarios.

A local AI setup is not a single application. It’s a stack of components that work together, much like a web development stack (think LAMP or MERN). Understanding these layers helps you make informed choices instead of blindly following tutorials. This guide breaks down the three-layer architecture of local AI, explains when to use each major component, and provides five ready-to-deploy reference stacks for common scenarios.

The Three-Layer Architecture

Every local AI setup consists of three layers:

┌─────────────────────────────────────────┐
│           APPLICATION LAYER             │
│  Frameworks, RAG, agents, pipelines     │
│  LangChain, LlamaIndex, Haystack        │
├─────────────────────────────────────────┤
│           INTERFACE LAYER               │
│  How users interact with the model      │
│  Open WebUI, LM Studio, CLI, API        │
├─────────────────────────────────────────┤
│           INFERENCE LAYER               │
│  Runs the model, generates tokens       │
│  Ollama, llama.cpp, vLLM, ExLlamaV2     │
└─────────────────────────────────────────┘

Some tools span multiple layers (LM Studio includes both inference and interface), but conceptually, these three layers are always present.

Layer 1: The Inference Engine

The inference engine is the foundation. It loads model weights into memory, processes input tokens, and generates output tokens. Everything else depends on this layer.
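Conceptually, every engine in this layer runs the same autoregressive loop: predict one token, append it, repeat. A minimal sketch, with a stand-in `next_token` function where a real engine would run a forward pass over the model:

```python
def generate(prompt_tokens, next_token, max_new_tokens=16, eos=-1):
    """Autoregressive generation: repeatedly predict one token and append it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # in a real engine: a neural-network forward pass
        if tok == eos:            # stop when the model emits end-of-sequence
            break
        tokens.append(tok)
    return tokens

# Stand-in "model" that just counts upward, then emits end-of-sequence.
demo = generate([1, 2, 3], lambda ts: ts[-1] + 1 if ts[-1] < 6 else -1)
print(demo)  # [1, 2, 3, 4, 5, 6]
```

Everything the engines below differ on (quantization, batching, GPU offload) is about running this loop faster or for more users at once.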

Ollama

What it is: An all-in-one model manager and inference server with a built-in model library.

Best for: Personal use, development, small teams, getting started quickly.

Strengths:

  • Single binary, no dependencies
  • Built-in model library (ollama pull llama3.1)
  • Automatic GPU detection and optimization
  • OpenAI-compatible API
  • Cross-platform (macOS, Linux, Windows)
  • Modelfile system for custom configurations

Limitations:

  • Single-user optimized (limited concurrent request handling)
  • Less tuning control than raw llama.cpp
  • Quantization options limited to what’s in the library (or custom Modelfiles)

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run
ollama run llama3.1:8b

# API endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

llama.cpp

What it is: The foundational C/C++ library for local LLM inference. Most other tools (including Ollama) are built on top of it.

Best for: Maximum performance, custom builds, embedded systems, advanced users.

Strengths:

  • Maximum control over every parameter
  • Excellent inference speed for a given hardware configuration
  • Supports every quantization level
  • Can be compiled for specific CPU/GPU targets
  • Server mode with OpenAI-compatible API
  • Smallest resource footprint

Limitations:

  • No built-in model management
  • Must download GGUF files manually
  • Configuration via command-line flags
  • Steeper learning curve

# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON on macOS
cmake --build build --config Release -j$(nproc)

# Run the server
# (-ngl 99 offloads all layers to the GPU; -c 4096 sets the context size)
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 4096

vLLM

What it is: A high-throughput inference engine designed for serving multiple users.

Best for: Production deployments, multi-user servers, enterprise.

Strengths:

  • PagedAttention for efficient memory use
  • Continuous batching for high throughput
  • Tensor parallelism (multi-GPU)
  • OpenAI-compatible API server
  • Supports many model architectures natively

Limitations:

  • GPU-only (no CPU inference)
  • Higher resource overhead
  • Python-based (heavier than C++ alternatives)
  • Primarily Linux

pip install vllm

# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192
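vLLM's advantage shows up under concurrency: many requests in flight at once get batched together on the GPU. A client-side sketch of firing requests concurrently; the `query` coroutine here is a stub standing in for a real HTTP POST to the vLLM endpoint:

```python
import asyncio

async def query(prompt):
    # Stand-in for an HTTP POST to http://localhost:8000/v1/chat/completions;
    # with vLLM, concurrent requests like these are batched on the GPU.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main(prompts):
    # Fire all requests at once instead of one-by-one.
    return await asyncio.gather(*(query(p) for p in prompts))

results = asyncio.run(main(["a", "b", "c"]))
print(results)  # ['answer to: a', 'answer to: b', 'answer to: c']
```

With a single-user engine, total latency grows roughly linearly with the number of prompts; vLLM's continuous batching keeps it much flatter.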

ExLlamaV2

What it is: A highly optimized CUDA inference library focused on speed.

Best for: Maximum generation speed on NVIDIA GPUs.

Strengths:

  • Fastest token generation for NVIDIA GPUs
  • EXL2 quantization format (flexible bits-per-weight)
  • Excellent speculative decoding support
  • Low VRAM overhead

Limitations:

  • NVIDIA only
  • Smaller community than llama.cpp
  • Fewer integrations

MLC LLM

What it is: Machine Learning Compilation for LLMs. Compiles models for specific hardware targets.

Best for: Mobile deployment, WebGPU, edge devices.

Strengths:

  • Runs on phones (iOS/Android), browsers, and edge devices
  • Optimized compilation for each target
  • WebGPU support for browser-based inference

Limitations:

  • More complex setup than alternatives
  • Smaller model selection

Decision Tree: Which Inference Engine?

Start here:
├── Just getting started?
│   └── Use Ollama
├── Need multi-user production serving?
│   └── Use vLLM
├── Need maximum single-user speed on NVIDIA?
│   └── Use ExLlamaV2
├── Need maximum control and customization?
│   └── Use llama.cpp
├── Deploying to mobile or browser?
│   └── Use MLC LLM
└── Building on Apple Silicon with ML focus?
    └── Use MLX

Layer 2: The Interface

The interface layer is how you (or your users) interact with the model. This can be a CLI, a web UI, a desktop app, or an API that other applications consume.

Open WebUI

What it is: A self-hosted web interface that provides a ChatGPT-like experience for local models.

Best for: Personal use, small teams, anyone who wants a polished chat UI.

Key features:

  • Multi-model support (switch between models mid-conversation)
  • Conversation history and search
  • User management and authentication
  • Document upload and RAG
  • Plugin/function system
  • Mobile-responsive design
  • Supports Ollama and any OpenAI-compatible backend

# Quick start with Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

LM Studio

What it is: A desktop application with integrated model browser, inference engine, and chat interface.

Best for: Users who prefer a native GUI, model experimentation.

Key features:

  • Built-in model browser (search and download from Hugging Face)
  • Local inference engine (based on llama.cpp)
  • Chat interface
  • Local server mode with OpenAI-compatible API
  • Cross-platform (macOS, Windows, Linux)

Limitations:

  • Closed-source
  • Cannot easily be deployed as a server for teams
  • Resource usage higher than command-line tools

Text-Generation-WebUI (Oobabooga)

What it is: A comprehensive web interface with extensive backend support and parameter control.

Best for: Advanced users who want fine-grained control over generation parameters.

Key features:

  • Supports multiple backends (Transformers, llama.cpp, ExLlamaV2, GPTQ, AutoGPTQ)
  • Extensive parameter tuning UI
  • Character/persona system
  • Extensions framework
  • LoRA loading and management

Jan

What it is: An open-source desktop application focused on offline, privacy-first AI.

Best for: Privacy-focused users who want a polished desktop experience.

LibreChat

What it is: A multi-provider chat interface that supports both cloud and local backends.

Best for: Organizations that use a mix of cloud and local AI.

CLI / Terminal

What it is: Direct command-line interaction with the model.

Best for: Developers, scripting, automation.

# Ollama CLI
ollama run llama3.1:8b

# llama.cpp CLI
./build/bin/llama-cli -m model.gguf -p "Hello" -n 256

# Pipe input
echo "Summarize this: $(cat document.txt)" | ollama run llama3.1:8b

API Only (Headless)

Sometimes you don’t need a user-facing interface at all. You just need an API that other services can call.

# Ollama exposes an API automatically on port 11434
# vLLM serves an OpenAI-compatible API
# llama.cpp server mode provides the same

# Any OpenAI-compatible client library works:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello"}]
)
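If you want tokens as they are generated rather than one final message, the same OpenAI-compatible endpoints support streaming via `stream=True`. A sketch, assuming Ollama on its default port; the `join_deltas` helper is ours, not part of any library:

```python
def join_deltas(chunks):
    """Assemble a full reply from streamed delta fragments, skipping None deltas."""
    return "".join(c for c in chunks if c is not None)

# Against a live server (assumes Ollama at localhost:11434):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
# stream = client.chat.completions.create(
#     model="llama3.1:8b",
#     messages=[{"role": "user", "content": "Hello"}],
#     stream=True,
# )
# print(join_deltas(chunk.choices[0].delta.content for chunk in stream))
```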

Decision Tree: Which Interface?

Start here:
├── Want a ChatGPT-like web experience?
│   └── Open WebUI
├── Prefer a native desktop app?
│   ├── Want a model browser built in?
│   │   └── LM Studio
│   └── Want open-source desktop?
│       └── Jan
├── Need multi-provider (cloud + local)?
│   └── LibreChat
├── Want maximum parameter control?
│   └── Text-Generation-WebUI
├── Building automations/pipelines?
│   └── API only (headless)
└── Developer who lives in terminal?
    └── CLI (Ollama CLI or llama.cpp)

Layer 3: The Application Framework

The application layer is where you build things on top of the model. RAG pipelines, AI agents, custom workflows, and integrations all live here.

LangChain

What it is: The most popular framework for building LLM-powered applications.

Best for: RAG pipelines, agents, chains, tool use, complex workflows.

Strengths:

  • Massive ecosystem of integrations
  • Comprehensive RAG support
  • Agent and tool-use frameworks
  • LangSmith for debugging and tracing
  • Active community

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage

llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
response = llm.invoke([HumanMessage(content="What is local AI?")])
print(response.content)

LlamaIndex

What it is: A data framework focused on connecting LLMs to your data (documents, databases, APIs).

Best for: Document Q&A, knowledge bases, structured data querying.

Strengths:

  • Best-in-class document indexing and retrieval
  • Many data connectors (PDF, web, databases, APIs)
  • Sophisticated chunking and retrieval strategies
  • Query engines for structured data

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3.1:8b", base_url="http://localhost:11434")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What is in these documents?")
print(response)

Haystack

What it is: An end-to-end NLP framework by deepset, focused on production-grade pipelines.

Best for: Production NLP pipelines, search systems, enterprise deployments.

Direct API Integration

For simple use cases, you don’t need a framework at all. The OpenAI-compatible API works with any HTTP client:

import requests

response = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
})
print(response.json()["choices"][0]["message"]["content"])

// Node.js
const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.1:8b",
    messages: [{ role: "user", content: "Hello" }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

Vector Databases (For RAG)

If you’re building RAG applications, you need a vector database to store and search embeddings:

| Database | Best For                                 | Storage              |
|----------|------------------------------------------|----------------------|
| ChromaDB | Local development, small-medium datasets | Embedded/file        |
| FAISS    | High-performance search, research        | In-memory            |
| Qdrant   | Production deployments, filtering        | Client-server        |
| Weaviate | Full-featured, hybrid search             | Client-server        |
| pgvector | Already using PostgreSQL                 | PostgreSQL extension |
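Whichever database you pick, the core operation is the same: store embedding vectors and return the ones nearest to a query embedding. As a rough illustration of what that means (not a substitute for a real vector database), nearest-neighbor search by cosine similarity is just:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query, store, k=1):
    """Return ids of the k stored vectors most similar to the query."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
store = {"cats": [1.0, 0.1], "finance": [0.0, 1.0]}
print(nearest([0.9, 0.2], store))  # ['cats']
```

Real vector databases add persistence, metadata filtering, and approximate indexes so this search stays fast at millions of vectors.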

Five Reference Stacks

Here are five complete stacks you can deploy today, each optimized for a different scenario.

Stack 1: Personal AI Assistant

Scenario: Single user who wants a private ChatGPT replacement.

Engine: Ollama
Interface: Open WebUI
Model: Llama 3.1 8B

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# Install Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000, create an account, and start chatting.

Stack 2: Developer Workstation

Scenario: Software developer who wants local code assistance in their IDE plus terminal chat.

Engine: Ollama
Interface: Continue (VS Code) + CLI
Models: Qwen 2.5 Coder 7B (code) + Llama 3.1 8B (chat)

# Install models
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b

# Install Continue extension in VS Code
# Then configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Stack 3: Document Q&A System (RAG)

Scenario: User who wants to chat with their documents (PDFs, notes, knowledge base).

Engine: Ollama
Framework: LlamaIndex
Vector DB: ChromaDB
Interface: Custom Streamlit app or Open WebUI with RAG
Models: Llama 3.1 8B + nomic-embed-text

# Install models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Install Python dependencies
pip install llama-index llama-index-llms-ollama \
  llama-index-embeddings-ollama chromadb streamlit

# Simple RAG pipeline
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

llm = Ollama(model="llama3.1:8b")
embed = OllamaEmbedding(model_name="nomic-embed-text")

documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed
)
query_engine = index.as_query_engine(llm=llm)

response = query_engine.query("Summarize the key findings")
print(response)

Stack 4: Team/Small Business Server

Scenario: 5-20 users who need shared access to local AI with user management.

Engine: Ollama (or vLLM for higher concurrency)
Interface: Open WebUI or LibreChat
Proxy: Nginx with HTTPS
Models: Llama 3.1 8B + Qwen 2.5 14B

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:

Stack 5: Enterprise Production Deployment

Scenario: Organization deploying AI for 100+ users with security, monitoring, and high availability.

Engine: vLLM (multi-GPU, tensor parallel)
Interface: LibreChat (multi-provider, LDAP/OIDC auth)
Orchestration: Kubernetes with NVIDIA GPU Operator
Monitoring: Prometheus + Grafana
Load Balancer: Nginx/Traefik with rate limiting
Models: Llama 3.1 70B (primary) + specialized models

This stack is covered in detail in our Enterprise Local AI Deployment guide.

Choosing Your Stack: Summary

| Scenario              | Engine              | Interface         | Framework          |
|-----------------------|---------------------|-------------------|--------------------|
| Just exploring        | Ollama              | CLI               | None               |
| Personal assistant    | Ollama              | Open WebUI        | None               |
| Developer tools       | Ollama              | Continue/IDE      | None               |
| Document Q&A          | Ollama              | Custom/Open WebUI | LlamaIndex         |
| AI applications       | Ollama/vLLM         | API (headless)    | LangChain          |
| Team server           | Ollama              | Open WebUI        | None               |
| Enterprise            | vLLM                | LibreChat         | LangChain/Haystack |
| Maximum speed         | llama.cpp/ExLlamaV2 | CLI/API           | None               |
| Model experimentation | LM Studio           | LM Studio         | None               |

The beauty of the local AI ecosystem is its modularity. Start simple with Ollama and a CLI, then add components as your needs grow. Every layer can be swapped independently, so you’re never locked in.

Frequently Asked Questions

What is the simplest local AI stack for beginners?

Ollama plus Open WebUI. Ollama handles model management and inference with a single install. Open WebUI provides a polished ChatGPT-like web interface. Together, they give you a complete local AI setup with minimal configuration.

Can I mix and match components from different stacks?

Yes. Most components communicate through standard APIs (typically OpenAI-compatible endpoints). You can use llama.cpp as your engine, Open WebUI as your interface, and LangChain as your framework. The OpenAI-compatible API layer makes components interchangeable.
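To make this concrete: because all three engines expose OpenAI-compatible endpoints, swapping engines is usually just a base-URL change. A sketch using the default ports shown earlier in this guide:

```python
# Default OpenAI-compatible endpoints for each engine (ports as configured above).
ENGINE_ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "vllm": "http://localhost:8000/v1",
}

def make_client(engine):
    """Build an OpenAI client pointed at the chosen local engine."""
    from openai import OpenAI
    return OpenAI(base_url=ENGINE_ENDPOINTS[engine], api_key="unused")

# client = make_client("llama.cpp")  # same application code, different engine underneath
```

Application code written against this client never needs to know which engine is serving it.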

When should I use vLLM instead of Ollama?

Use vLLM when you need to serve multiple concurrent users with high throughput. vLLM's PagedAttention and continuous batching are designed for production serving. Ollama is better for personal use, development, and small teams where simplicity matters more than maximum throughput.