Local AI Tools Directory
Every major tool in the local AI ecosystem — reviewed, compared, and organized. We cover the full landscape, not just our own projects.
45 tools and counting
Inference Engines
Run models on your hardware
ExLlamaV2
Fastest inference engine for consumer NVIDIA GPUs. Custom CUDA kernels and EXL2 quantization format deliver maximum tokens per second on desktop hardware.
llama-cpp-python
Python bindings for llama.cpp providing a high-level API and OpenAI-compatible server. The easiest way to use llama.cpp from Python applications.
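As a hedged sketch of the OpenAI-compatible server that ships with llama-cpp-python (started with `python -m llama_cpp.server --model your-model.gguf`, default port 8000): the helper builds a standard chat-completion request body using only the standard library. The base URL and model name are placeholders for your setup.

```python
# Sketch: calling llama-cpp-python's OpenAI-compatible server.
# Host, port, and model name are assumptions; adjust for your setup.
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires the server to be running locally.
    print(chat("http://localhost:8000", "local-model", "Hello!"))
```

Because the server speaks the OpenAI wire format, the official `openai` client also works by pointing its `base_url` at the same endpoint.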
llama.cpp
The foundational C/C++ inference engine that pioneered consumer-hardware LLM deployment via quantization. Powers Ollama, LM Studio, GPT4All, and KoboldCpp.
MLX
Apple's machine learning framework for Apple Silicon. Leverages unified memory architecture for efficient LLM inference on Mac with minimal data copying.
Mullama
Versatile local LLM inference engine by Cognisoc with multi-language bindings for Python, Node.js, Go, Rust, PHP, and C/C++. Supports daemon server and embedded modes.
Ollama
Single-binary LLM runner with built-in model registry, automatic GPU detection, and OpenAI-compatible REST API. The easiest way to run AI locally.
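A minimal sketch of Ollama's native REST API using only the standard library; assumes the daemon is running on its default port (11434) and that the model has already been pulled. The model name and prompt are placeholders.

```python
# Sketch: Ollama's native REST API (default port 11434).
# Assumes `ollama serve` is running and the model is pulled.
import json
import urllib.request

def generate_payload(model: str, prompt: str) -> dict:
    """Request body for POST /api/generate; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    """Send a one-shot generation request and return the response text."""
    body = json.dumps(generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    print(generate("llama3.2", "Why is the sky blue?"))
```

Ollama also exposes an OpenAI-compatible endpoint under `/v1`, so existing OpenAI client code can usually be repointed without changes.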
TensorRT-LLM
NVIDIA's high-performance LLM inference library. Achieves the highest throughput on NVIDIA GPUs with custom CUDA kernels, quantization, and in-flight batching.
vLLM
High-throughput LLM serving engine with PagedAttention, continuous batching, and tensor parallelism. Designed for multi-user production serving at scale.
Desktop Apps
GUI applications for local AI
GPT4All
Free desktop chatbot by Nomic AI that runs LLMs on consumer CPUs. Features LocalDocs for private document Q&A with no GPU required.
Jan
Open-source ChatGPT alternative that runs 100% offline. Clean desktop app with extension ecosystem, local API server, and cross-platform support.
KoboldCpp
Single-file portable LLM runner with built-in chat UI, story mode, Whisper speech-to-text, and TTS. Optimized for creative writing and roleplay.
LM Studio
The most comprehensive local LLM desktop application. Discover, download, and chat with models through a polished UI with built-in OpenAI-compatible API server.
Msty
Clean, modern desktop AI chat application with split-screen model comparison, offline mode, and support for local and remote LLM providers.
Text Generation WebUI
Feature-rich Gradio web interface by oobabooga supporting multiple inference backends including llama.cpp, ExLlamaV2, Transformers, and AutoGPTQ.
Web Interfaces
Self-hosted chat platforms
AnythingLLM
All-in-one AI application with workspace-based RAG, document ingestion, built-in vector database, and multi-user support. Chat with your documents locally.
LibreChat
Multi-provider AI chat platform with MCP support, AI Agents, plugins, and multi-user auth. Self-hosted alternative to ChatGPT with enterprise features.
Open WebUI
Self-hosted ChatGPT-like interface with 130K+ GitHub stars. Clean design, model selector, markdown rendering, plugin ecosystem, and multi-user authentication.
SillyTavern
Advanced roleplay and storytelling chat frontend with Visual Novel mode, character cards, world-building tools, and extensive customization for creative AI interaction.
Server / API
Serve models to applications
LocalAI
OpenAI API drop-in replacement that runs LLMs, image generation, audio transcription, and embedding models locally. No GPU required, fully self-hosted.
Text Generation Inference (TGI)
Hugging Face's production-grade inference server for LLMs. Optimized for throughput with continuous batching, tensor parallelism, and Flash Attention.
Mobile AI
On-device AI for phones
Llamafu
Flutter plugin by Cognisoc enabling on-device AI inference on Android and iOS with complete privacy. Supports text generation, chat, vision, function calling, and structured JSON output.
MLC LLM
Machine Learning Compilation framework for deploying LLMs on mobile devices, browsers, and edge hardware. Native iOS, Android, and WebGPU support.
Developer SDKs
Build AI applications
Guidance
Microsoft's constrained generation DSL that interleaves text templates with LLM generation. Control output structure with selects, regex, and grammar rules.
Haystack
Production-ready AI pipeline framework by deepset. Build composable RAG, question answering, and agent pipelines with modular components and any LLM backend.
LangChain
The dominant LLM application framework with 90K GitHub stars. Build chains, agents, and RAG pipelines with local models via Ollama, llama.cpp, or any OpenAI-compatible API.
LlamaIndex
Leading data framework for building RAG and agentic applications over private data. 30K+ GitHub stars, 300+ data connectors, and production-ready pipelines.
Outlines
Structured text generation library using finite-state machines to guarantee valid JSON, regex patterns, and grammar-conforming output from any LLM.
Semantic Kernel
Microsoft's open-source SDK for integrating LLMs into .NET, Python, and Java applications. Enterprise-focused with planners, plugins, and AI agent patterns.
Fine-Tuning
Train and customize models
Axolotl
Config-driven fine-tuning framework supporting LoRA, QLoRA, full fine-tuning, multi-GPU, FSDP2, and DeepSpeed. Simplifies training with YAML configuration.
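A hypothetical QLoRA config sketch in Axolotl's YAML style; the model path, dataset, and hyperparameters are placeholders to adapt before launching with Axolotl's CLI:

```yaml
# Hypothetical Axolotl QLoRA config; all values are placeholders.
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
datasets:
  - path: ./data/train.jsonl
    type: alpaca
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/qlora-llama
```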
LLaMA Factory
Web UI-driven fine-tuning framework supporting 100+ model architectures. One-click training with LoRA, QLoRA, RLHF, DPO, and comprehensive evaluation.
Unsloth
2x faster LLM training with 80% less memory via custom Triton attention kernels. Fine-tune 70B models with QLoRA on a single consumer GPU with 24GB VRAM.
Vector Databases
Store and search embeddings
ChromaDB
Lightweight, local-first open-source vector database. The default embedding store for RAG applications with simple Python/JS APIs and zero-config setup.
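A minimal sketch assuming `pip install chromadb`; the collection name and file are made up, and the `chunk()` helper is plain Python showing the typical pre-ingestion splitting step:

```python
# Sketch: ingest and query documents with ChromaDB (names are placeholders).
def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size pieces before embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

if __name__ == "__main__":
    import chromadb  # not stdlib; install separately

    client = chromadb.Client()  # in-memory, zero-config
    coll = client.create_collection("docs")
    pieces = chunk(open("notes.txt").read())
    coll.add(documents=pieces,
             ids=[f"chunk-{i}" for i in range(len(pieces))])
    # Chroma embeds the query with its default model and returns nearest chunks.
    print(coll.query(query_texts=["vector databases"], n_results=2))
```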
FAISS
Meta's C++ library for efficient similarity search and dense vector clustering. Industry standard for billion-scale nearest neighbor search with IVF and HNSW indexes.
pgvector
PostgreSQL extension for vector similarity search. Store embeddings alongside relational data in your existing Postgres database with HNSW and IVFFlat indexes.
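A SQL sketch of the basic pgvector workflow; the table, 3-dimension column (real embedding models use hundreds of dimensions), and query vector are illustrative placeholders:

```sql
-- Sketch of pgvector usage; table and dimensions are placeholders.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(3)  -- match your embedding model's dimension
);

-- HNSW index for fast approximate nearest-neighbor search (L2 distance)
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Five nearest rows to a query embedding (<-> is L2 distance)
SELECT id, content
FROM items
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'::vector
LIMIT 5;
```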
Qdrant
High-performance vector database written in Rust. Production-grade similarity search with advanced filtering, multi-tenancy, and horizontal scaling.
Weaviate
Open-source vector database with hybrid vector and keyword search, GraphQL API, built-in vectorization modules, and multi-tenancy for production AI applications.
Voice & Audio
Speech-to-text and TTS
Kokoro TTS
Lightweight, high-quality text-to-speech model using ONNX runtime. Sub-second latency, natural prosody, and minimal resource usage for local voice synthesis.
Piper TTS
Fast, lightweight neural text-to-speech system that runs on CPU. Optimized for Raspberry Pi and edge devices with 30+ languages and natural-sounding voices.
Whisper
OpenAI's open-source speech-to-text model. Run locally via faster-whisper (CTranslate2) or Whisper.cpp for real-time transcription in 100+ languages.
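A sketch of local transcription via faster-whisper, assuming `pip install faster-whisper` and an audio file on disk; the model size and file name are placeholders, and the timestamp formatter is plain Python:

```python
# Sketch: local transcription with faster-whisper (file name is a placeholder).
def fmt_ts(seconds: float) -> str:
    """Render seconds as M:SS.mmm for transcript display."""
    m, s = divmod(seconds, 60.0)
    return f"{int(m)}:{s:06.3f}"

if __name__ == "__main__":
    from faster_whisper import WhisperModel  # not stdlib

    model = WhisperModel("small", compute_type="int8")  # CPU-friendly
    segments, info = model.transcribe("meeting.wav")
    for seg in segments:
        print(f"[{fmt_ts(seg.start)} -> {fmt_ts(seg.end)}] {seg.text}")
```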
Image & Vision
Image generation and understanding
AUTOMATIC1111 Stable Diffusion WebUI
The original Stable Diffusion web interface with the largest extension ecosystem. Feature-rich UI for text-to-image, img2img, inpainting, and more.
ComfyUI
Node-based visual workflow editor for Stable Diffusion and generative AI. Build complex image, video, and audio generation pipelines with drag-and-drop nodes.
Code Assistants
AI-powered coding tools
Aider
Terminal-based AI pair programming tool. Edit code in your repo through conversation with automatic git commits, multi-file editing, and broad LLM support.
Continue
Open-source AI code assistant for VS Code and JetBrains. Local Copilot alternative with tab autocomplete, chat, and inline editing powered by any LLM.
Tabby
Self-hosted AI coding assistant with team features. GitHub Copilot alternative with code completion, chat, repository-aware context, and admin dashboard.
Educational
Learn AI internals