Deploying local AI for an organization requires more than installing Ollama on a server. Enterprise deployments need multi-user access control, high availability, security hardening, compliance documentation, and cost-justified infrastructure. This guide provides battle-tested architecture patterns, step-by-step deployment instructions, and practical guidance for IT teams evaluating or implementing local AI for their organizations.
Why Enterprise Local AI
Organizations deploy local AI for three primary reasons:
Data Sovereignty: Sensitive data (code, legal documents, customer information, financial records) never leaves the organization’s infrastructure. This eliminates third-party data processing agreements, residency concerns, and breach risk from external providers.
Cost Control: Cloud AI costs scale with usage, which makes budgets hard to predict. A 50-person team using GPT-4 can easily generate $10,000+/month in API costs. Local infrastructure is a fixed upfront cost with comparatively small ongoing expense.
Customization: Local deployments can run specialized models, fine-tuned models, and custom pipelines that cloud providers don’t offer. Organizations can choose exactly which models serve which use cases.
Architecture Patterns
Pattern 1: Single Server (5-20 Users)
┌────────────────┐
Users (Browser) ──> │ Reverse Proxy │
│ (Nginx/Caddy) │
└──────┬─────────┘
│
┌──────▼─────────┐
│ Open WebUI / │
│ LibreChat │
└──────┬─────────┘
│
┌──────▼─────────┐
│ Ollama / vLLM │
│ (GPU Server) │
└────────────────┘
Hardware: Single server with 1-2 GPUs (RTX 4090 or A6000)
Budget: $5,000-15,000
Best for: Small teams, departments, startups
Pattern 2: Separated Tiers (20-100 Users)
┌────────────────┐
Users (Browser) ──> │ Load Balancer │
│ (HAProxy) │
└──────┬─────────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ WebUI #1 │ │ WebUI #2 │
│ (App Tier) │ │ (App Tier) │
└──────┬──────┘ └──────┬──────┘
│ │
└────────────┬────────────┘
│
┌──────▼─────────┐
│ vLLM Cluster │
│ (GPU Tier) │
│ Multi-GPU │
└──────┬─────────┘
│
┌──────▼─────────┐
│ Model Storage │
│ (NFS/S3) │
└────────────────┘
Hardware: Dedicated GPU server(s) + application server(s)
Budget: $20,000-50,000
Best for: Medium organizations, multiple departments
Pattern 3: Kubernetes Cluster (100+ Users)
┌─────────────────┐
Users ─────────────>│ Ingress / LB │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ Kubernetes │
│ │
│ ┌─────────┐ ┌─────────┐ │
│ │ WebUI │ │ WebUI │ │
│ │ Pod x2 │ │ Pod x2 │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ┌────▼───────────▼────┐ │
│ │ Service Mesh / │ │
│ │ Internal LB │ │
│ └────────┬────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ vLLM Pod (GPU) │ │
│ │ GPU Node Pool │ │
│ │ 2-8 GPU Nodes │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Persistent │ │
│ │ Storage │ │
│ └─────────────────┘ │
└─────────────────────────────┘
Hardware: Multi-node Kubernetes cluster with GPU nodes
Budget: $50,000-200,000+
Best for: Large enterprises, multiple departments, high-availability requirements
Deployment: Single Server with vLLM
This is the most common starting point for enterprise deployments.
Hardware Recommendations
| Users | GPU | RAM | Storage | Approximate Cost |
|---|---|---|---|---|
| 5-10 | 1x RTX 4090 24 GB | 64 GB | 1 TB NVMe | $5,000-7,000 |
| 10-20 | 2x RTX 4090 24 GB | 128 GB | 2 TB NVMe | $8,000-12,000 |
| 10-20 | 1x RTX A6000 48 GB | 128 GB | 2 TB NVMe | $10,000-15,000 |
| 20-50 | 2x RTX A6000 48 GB | 256 GB | 4 TB NVMe | $18,000-25,000 |
| 50-100 | 4x A100 80 GB | 512 GB | 8 TB NVMe | $80,000-120,000 |
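The sizing in the table can be sanity-checked with back-of-the-envelope VRAM math: model weights plus KV cache plus runtime overhead. A sketch in Python, with constants assumed for a Llama-3.1-8B-class model (32 layers, 8 KV heads, head dimension 128, FP16 weights); check your model's config.json before relying on them:

```python
def kv_cache_gb(context_len, n_layers=32, n_kv_heads=8,
                head_dim=128, bytes_per_value=2):
    """KV cache for one sequence: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len / 1e9

def estimate_vram_gb(params_billions, bytes_per_param=2,
                     context_len=8192, overhead_frac=0.1):
    """Weights + one full-context KV cache + ~10% runtime overhead."""
    weights = params_billions * bytes_per_param  # GB at FP16
    return weights * (1 + overhead_frac) + kv_cache_gb(context_len)

print(f"{estimate_vram_gb(8):.1f} GB")  # ≈ 18.7 GB, fits a single 24 GB GPU
```

Concurrent sequences each hold their own KV cache, so multiply the KV term by the expected batch size when sizing for many simultaneous users.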
Step 1: Server Setup
# Ubuntu 24.04 LTS recommended
# Install NVIDIA drivers
sudo apt update
sudo apt install -y nvidia-driver-560
sudo reboot
# Verify
nvidia-smi
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER  # log out and back in for the group change to take effect
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
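Before moving to Step 2, it is worth confirming that containers can actually reach the GPU (the CUDA image tag below is just an example; any recent nvidia/cuda base image works):

```shell
# Run nvidia-smi inside a throwaway container via the NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# Expect the same GPU table as the bare-metal nvidia-smi check above
```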
Step 2: Deploy vLLM
# docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:latest
container_name: vllm
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--max-model-len 8192
--gpu-memory-utilization 0.9
--max-num-seqs 32
--enable-prefix-caching
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
volumes:
- hf_cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 120s
librechat:
image: ghcr.io/danny-avila/librechat:latest
container_name: librechat
ports:
- "3080:3080"
environment:
- HOST=0.0.0.0
- MONGO_URI=mongodb://mongodb:27017/librechat
- OPENAI_API_KEY=unused
- OPENAI_REVERSE_PROXY=http://vllm:8000/v1
- ENDPOINTS=openAI
volumes:
- librechat_data:/app/data
- ./librechat.yaml:/app/librechat.yaml
depends_on:
vllm:
condition: service_healthy
mongodb:
condition: service_started
restart: unless-stopped
mongodb:
image: mongo:7
container_name: mongodb
volumes:
- mongo_data:/data/db
restart: unless-stopped
nginx:
image: nginx:alpine
container_name: nginx
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
- /etc/letsencrypt:/etc/letsencrypt:ro
depends_on:
- librechat
restart: unless-stopped
volumes:
hf_cache:
librechat_data:
mongo_data:
Step 3: Configure LibreChat
# librechat.yaml
version: 1.1.4
cache: true
endpoints:
openAI:
- name: "Local AI"
apiKey: "unused"
baseURL: "http://vllm:8000/v1"
models:
default: ["meta-llama/Llama-3.1-8B-Instruct"]
fetch: true
titleModel: "meta-llama/Llama-3.1-8B-Instruct"
summarize: true
dropParams:
- "user"
registration:
socialLogins: []
allowedDomains:
- "yourcompany.com"
Step 4: HTTPS with Nginx
# nginx.conf
server {
listen 80;
server_name ai.yourcompany.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl;
server_name ai.yourcompany.com;
ssl_certificate /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
client_max_body_size 100M;
location / {
proxy_pass http://librechat:3080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 300s;
}
}
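The server block above assumes Let's Encrypt certificates already exist under /etc/letsencrypt. One way to issue them before first launch (assuming DNS for ai.yourcompany.com points at this server and port 80 is free):

```shell
sudo apt install -y certbot
# Standalone mode binds port 80 briefly, so run this before
# starting the Nginx container
sudo certbot certonly --standalone -d ai.yourcompany.com
# Certificates land in /etc/letsencrypt/live/ai.yourcompany.com/,
# the path docker-compose.yml already mounts read-only into Nginx
```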
Step 5: Launch
# Set Hugging Face token (for model download)
echo "HF_TOKEN=hf_your_token_here" > .env
# Start everything
docker compose up -d
# Wait for vLLM to load the model (may take a few minutes)
docker compose logs -f vllm
# Once healthy, access at https://ai.yourcompany.com
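Before pointing users at the UI, you can exercise the OpenAI-compatible API directly. The vLLM container publishes no host ports, so run the check from inside the Compose network (this assumes curl is available in the image, as the healthcheck already does):

```shell
# List the loaded model(s)
docker compose exec vllm curl -s http://localhost:8000/v1/models
# Send a minimal chat completion
docker compose exec vllm curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Reply with OK."}]}'
```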
Multi-Model Deployment
Serve different models for different use cases:
# docker-compose.yml additions
services:
vllm-general:
image: vllm/vllm-openai:latest
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0 --port 8000
--max-model-len 8192
--gpu-memory-utilization 0.9
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0']
capabilities: [gpu]
vllm-code:
image: vllm/vllm-openai:latest
command: >
--model Qwen/Qwen2.5-Coder-14B-Instruct
--host 0.0.0.0 --port 8001
--max-model-len 8192
--gpu-memory-utilization 0.9
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['1']
capabilities: [gpu]
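To make both backends selectable in the chat UI, each can be registered as an OpenAI-compatible endpoint in librechat.yaml. A sketch (the endpoint names are illustrative; verify the exact schema against the LibreChat custom-endpoint documentation):

```yaml
endpoints:
  custom:
    - name: "General"
      apiKey: "unused"
      baseURL: "http://vllm-general:8000/v1"
      models:
        default: ["meta-llama/Llama-3.1-8B-Instruct"]
        fetch: true
    - name: "Code"
      apiKey: "unused"
      baseURL: "http://vllm-code:8001/v1"
      models:
        default: ["Qwen/Qwen2.5-Coder-14B-Instruct"]
        fetch: true
```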
Security Hardening
Network Security
# Firewall: Only allow HTTPS from corporate network
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 10.0.0.0/8 to any port 443 # Internal network
sudo ufw allow ssh
sudo ufw enable
# Keep the inference container on an internal-only network
# (note: this also blocks Hugging Face downloads, so pull models
# before locking the network down)
# In docker-compose.yml, use internal networks:
services:
vllm:
networks:
- internal
# No ports exposed directly
  librechat:
    networks:
      - internal
      - external
    # No published ports here; Nginx reaches librechat:3080 over the shared network
networks:
internal:
internal: true # No external access
external:
Authentication Integration
LDAP/Active Directory: LibreChat supports LDAP via environment variables (LDAP_URL, LDAP_BIND_DN, and related settings). At minimum, restrict self-registration to your corporate domain:
# librechat.yaml
registration:
socialLogins: []
allowedDomains:
- "yourcompany.com"
# For OIDC/SSO (Okta, Azure AD, Google Workspace), LibreChat is
# configured through environment variables (OPENID_CLIENT_ID,
# OPENID_CLIENT_SECRET, OPENID_ISSUER and related settings)
# rather than librechat.yaml
Open WebUI with OIDC:
environment:
- ENABLE_OAUTH_SIGNUP=true
- OAUTH_CLIENT_ID=your-client-id
- OAUTH_CLIENT_SECRET=your-client-secret
- OPENID_PROVIDER_URL=https://login.yourcompany.com/.well-known/openid-configuration
Audit Logging
# LibreChat stores all conversations in MongoDB
# Export audit logs:
docker exec mongodb mongodump --db librechat --out /tmp/audit
# Query conversation logs:
docker exec mongodb mongosh librechat --eval '
db.messages.find({
createdAt: { $gte: new Date("2026-04-01") }
}).count()
'
Data Encryption
# Encrypt model storage at rest
sudo apt install -y cryptsetup
sudo cryptsetup luksFormat /dev/sdb  # WARNING: destroys all existing data on /dev/sdb
sudo cryptsetup open /dev/sdb ai-storage
sudo mkfs.ext4 /dev/mapper/ai-storage
sudo mount /dev/mapper/ai-storage /data/ai
# TLS for all internal communication
# Use Nginx as TLS termination point
Monitoring and Observability
Prometheus + Grafana
# Add to docker-compose.yml
prometheus:
image: prom/prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
ports:
- "9090:9090"
networks:
- internal
restart: unless-stopped
grafana:
image: grafana/grafana
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/var/lib/grafana/dashboards
ports:
- "3001:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
networks:
- internal
- external
restart: unless-stopped
# prometheus.yml
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['vllm:8000']
metrics_path: /metrics
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'nvidia-gpu'
static_configs:
- targets: ['nvidia-exporter:9835']
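The scrape config references two exporters that are not yet defined in docker-compose.yml. A sketch of matching services (the nvidia_gpu_exporter image and its default port 9835 are assumptions; NVIDIA's DCGM exporter is a common alternative):

```yaml
  node-exporter:
    image: prom/node-exporter:latest
    networks:
      - internal
    restart: unless-stopped

  nvidia-exporter:
    image: utkuozdemir/nvidia_gpu_exporter:latest  # serves metrics on :9835
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - internal
    restart: unless-stopped
```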
Key Metrics to Monitor
| Metric | Threshold | Action |
|---|---|---|
| GPU utilization | >95% sustained | Add GPUs or queue requests |
| GPU memory usage | >90% | Reduce max-model-len or batch size |
| Request latency (p99) | >30s | Scale horizontally |
| Queue depth | >50 | Add capacity |
| Error rate | >1% | Investigate logs |
| Disk usage | >80% | Clean old model versions |
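The thresholds above translate directly into Prometheus alerting rules. A sketch (the metric names are assumptions that vary by exporter and vLLM version, so verify them against your /metrics output first):

```yaml
# alerts.yml (reference it from prometheus.yml under rule_files)
groups:
  - name: local-ai
    rules:
      - alert: GpuSaturated
        expr: avg(nvidia_gpu_utilization_percent) > 95
        for: 15m
        annotations:
          summary: "GPU >95% utilized for 15m: add GPUs or queue requests"
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 30
        for: 10m
        annotations:
          summary: "p99 request latency above 30s: scale horizontally"
```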
Compliance Considerations
GDPR Compliance
- Data stays local: No third-party processing
- Right to deletion: Delete user conversations from MongoDB
- Data minimization: Configure log retention policies
- Processing records: Maintain audit logs of AI usage
# Delete a user's conversation history
docker exec mongodb mongosh librechat --eval '
db.messages.deleteMany({ user: "[email protected]" })
db.conversations.deleteMany({ user: "[email protected]" })
'
HIPAA Compliance (Healthcare)
- Deploy on HIPAA-compliant infrastructure
- Enable encryption at rest and in transit
- Implement access controls and audit logging
- Sign BAA with hardware/hosting provider (if applicable)
- Document data flow and retention policies
SOC 2 Compliance
- Access controls (RBAC via LibreChat/Open WebUI)
- Audit logging (all conversations logged)
- Encryption (TLS + disk encryption)
- Change management (infrastructure as code)
- Monitoring and alerting
Cost Analysis
Total Cost of Ownership (TCO)
Example: 50-user deployment
| Cost Category | Local AI | Cloud AI (GPT-4) |
|---|---|---|
| Year 1 | | |
| Hardware | $15,000 | $0 |
| Setup/labor (40 hrs) | $4,000 | $1,000 |
| Electricity (12 months) | $1,800 | $0 |
| Maintenance (12 months) | $2,400 | $0 |
| API costs (12 months) | $0 | $96,000 |
| Year 1 Total | $23,200 | $97,000 |
| Year 2 | | |
| Electricity | $1,800 | $0 |
| Maintenance | $2,400 | $0 |
| API costs | $0 | $96,000 |
| Year 2 Total | $4,200 | $96,000 |
| 3-Year Total | $31,600 | $289,000 |
Assumptions: 50 users, ~100 requests/user/day (~110,000 requests/month on working days). Once conversation context is included, a GPT-4-class request runs to several thousand tokens, roughly $0.07 per request, or about $8,000/month. Local: 2x RTX A6000 server. Electricity at $0.12/kWh. Maintenance includes admin time.
Break-even point: Typically 2-4 months for moderate usage teams.
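The break-even claim follows from the Year 1 figures above. A quick sketch using those numbers ($19,000 local upfront including setup labor, $350/month local running cost, $1,000 cloud setup, $8,000/month cloud API spend):

```python
def break_even_months(local_upfront=19_000.0, local_monthly=350.0,
                      cloud_upfront=1_000.0, cloud_monthly=8_000.0):
    """Months until cumulative local cost drops below cumulative cloud cost.

    Solves local_upfront + local_monthly*m = cloud_upfront + cloud_monthly*m.
    """
    return (local_upfront - cloud_upfront) / (cloud_monthly - local_monthly)

print(round(break_even_months(), 1))  # ≈ 2.4 months
```

Plugging in your own usage and hardware figures shows quickly whether the local option pays off; at very low monthly cloud spend the break-even horizon stretches past the hardware's useful life.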
When Cloud AI Makes More Sense
- Very low usage (under 10 users, occasional use)
- Need for the absolute largest models (GPT-4+, Claude Opus class)
- No IT staff to maintain infrastructure
- Rapid prototyping before committing to local
- Burst capacity needs that would require overprovisioning hardware
Scaling Strategies
Horizontal Scaling
Add more GPU servers behind a load balancer:
# nginx.conf - upstream load balancing
upstream vllm_backends {
least_conn;
server vllm-1:8000;
server vllm-2:8000;
server vllm-3:8000;
}
server {
location /v1 {
proxy_pass http://vllm_backends;
}
}
Model Routing
Route requests to different backends based on the model alias the client sends (the aliases in ROUTES below are illustrative shortcuts, not real model IDs):
# Simple model router
from fastapi import FastAPI, Request
import httpx
app = FastAPI()
ROUTES = {
"general": "http://vllm-general:8000",
"code": "http://vllm-code:8001",
"long-context": "http://vllm-long:8002",
}
@app.post("/v1/chat/completions")
async def route_request(request: Request):
body = await request.json()
model = body.get("model", "general")
backend = ROUTES.get(model, ROUTES["general"])
async with httpx.AsyncClient() as client:
response = await client.post(
f"{backend}/v1/chat/completions",
json=body,
timeout=120,
)
return response.json()
Migration Path
Phase 1: Pilot (Week 1-2)
- Deploy Ollama + Open WebUI on a single machine
- Give 5-10 early adopters access
- Gather feedback on model quality and speed
Phase 2: Validation (Week 3-4)
- Evaluate model quality against cloud alternatives
- Identify use cases that work well locally
- Document requirements for production deployment
Phase 3: Production (Week 5-8)
- Deploy vLLM + LibreChat/Open WebUI with proper infrastructure
- Configure authentication, monitoring, and backups
- Roll out to broader user base
- Establish support procedures
Phase 4: Optimization (Ongoing)
- Fine-tune models on organizational data if needed
- Add specialized models for specific departments
- Implement RAG for knowledge base integration
- Scale hardware based on usage patterns
Next Steps
- Docker deployment details: Docker and Kubernetes guide
- Fine-tune for your data: Fine-Tuning guide
- Build RAG pipelines: Local RAG Chatbot
- Hardware planning: Hardware guide