Building a Local RAG Chatbot: Documents, Embeddings, and Retrieval

Build a fully local RAG (Retrieval-Augmented Generation) chatbot that answers questions about your documents. Covers architecture, chunking strategies, embedding models, vector databases, and prompt engineering.

RAG (Retrieval-Augmented Generation) is the most practical way to make a local LLM answer questions about your own documents, knowledge base, or data. Instead of fine-tuning the model (which is expensive and inflexible), RAG retrieves relevant document chunks at query time and includes them in the LLM’s prompt. This guide walks you through building a complete local RAG pipeline from scratch, with no data leaving your machine, using Ollama for inference, embedding models for vector search, and ChromaDB for storage.

How RAG Works

The RAG pipeline has two phases:

Ingestion Phase (One-time)

Documents → Chunking → Embedding → Vector Database
   │            │           │              │
   PDF/TXT    Split into   Convert to    Store for
   files      passages     vectors       fast search
  1. Load documents: Read PDFs, text files, markdown, web pages
  2. Chunk: Split documents into smaller passages (typically 500-1000 characters)
  3. Embed: Convert each chunk into a numerical vector using an embedding model
  4. Store: Save vectors in a vector database (ChromaDB, FAISS)

Query Phase (Every question)

User Query → Embed Query → Search → Top K Chunks → LLM + Context → Answer
     │            │           │          │               │             │
  "What is    Convert to   Find most  Retrieve       Prompt with   Generated
   X?"        vector       similar    relevant       context       response
                           vectors    passages
  1. Embed the question: Convert the user’s question into a vector
  2. Search: Find the most similar document chunks in the vector database
  3. Build prompt: Combine the retrieved chunks with the question in a prompt
  4. Generate: Send the prompt to the LLM, which generates an answer grounded in the retrieved context
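The two phases above can be sketched end to end in plain Python. Here a toy bag-of-words counter stands in for a real embedding model, purely to show the mechanics of embed, store, rank, and prompt:

```python
import math

def embed(text, vocab):
    # Toy bag-of-words "embedding" -- a stand-in for a real model
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion phase: chunk, embed, store
chunks = ["the cat sat on the mat", "stock prices rose sharply today"]
vocab = sorted({w for c in chunks for w in c.lower().split()})
store = [(chunk, embed(chunk, vocab)) for chunk in chunks]

# Query phase: embed the question, rank chunks, build the prompt
question = "where did the cat sit"
q_vec = embed(question, vocab)
top_chunk, _ = max(store, key=lambda cv: cosine(q_vec, cv[1]))

prompt = f"Context:\n{top_chunk}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

A real pipeline swaps in a trained embedding model and a vector database, but the data flow is identical.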

Prerequisites

# Ollama with models
ollama pull llama3.1:8b           # Chat model
ollama pull nomic-embed-text       # Embedding model

# Python packages
pip install \
  langchain \
  langchain-ollama \
  langchain-chroma \
  langchain-community \
  chromadb \
  pypdf \
  unstructured

Step 1: Document Loading

Loading PDFs

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Single PDF
loader = PyPDFLoader("report.pdf")
documents = loader.load()

# Directory of PDFs
loader = DirectoryLoader(
    "./documents",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()

print(f"Loaded {len(documents)} pages")

Loading Multiple File Types

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
    CSVLoader,
)
from langchain_community.document_loaders import DirectoryLoader

# Create loaders for different file types
loaders = [
    DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader),
    DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader),
    DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader),
    DirectoryLoader("./docs", glob="**/*.csv", loader_cls=CSVLoader),
]

documents = []
for loader in loaders:
    documents.extend(loader.load())

print(f"Loaded {len(documents)} documents from all sources")

Loading Web Pages

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2",
])
web_docs = loader.load()

Step 2: Chunking Strategies

Chunking determines how documents are split into retrievable passages. This is one of the most impactful decisions in RAG quality.

Basic Recursive Splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max characters per chunk
    chunk_overlap=200,      # Overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

Semantic Chunking

Splits at natural topic boundaries rather than fixed character counts:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

semantic_chunks = semantic_splitter.split_documents(documents)

Markdown-Aware Splitting

For markdown documents, split by headers to preserve structure:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)  # markdown_text: raw markdown string loaded elsewhere

Choosing Chunk Parameters

| Document Type      | Chunk Size | Overlap | Rationale                           |
|--------------------|------------|---------|-------------------------------------|
| Technical docs     | 500-800    | 100     | Precise answers need focused chunks |
| Legal documents    | 1000-1500  | 200     | Context is critical for legal text  |
| Narrative/prose    | 1000-2000  | 200     | Larger chunks preserve storytelling |
| Code documentation | 500-1000   | 100     | Functions and explanations as units |
| FAQ/Q&A            | 300-500    | 50      | Each Q&A pair should be one chunk   |
| Research papers    | 800-1200   | 150     | Balance between sections and detail |
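Overlap exists so that sentences falling on a chunk boundary survive intact in at least one chunk. A minimal sliding-window chunker makes the effect of the two parameters concrete (a character-based sketch, not what RecursiveCharacterTextSplitter does internally):

```python
def sliding_chunks(text, chunk_size=1000, overlap=200):
    # Each chunk starts (chunk_size - overlap) characters after the
    # previous one, so consecutive chunks share `overlap` characters
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij" * 250, chunk_size=1000, overlap=200)
print(len(chunks), len(chunks[0]))  # 2500 chars -> 4 chunks
```

Raising overlap increases redundancy (more chunks, more storage) in exchange for fewer answers lost at boundaries.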

Step 3: Embedding Models

The embedding model converts text into vectors. Quality of embeddings directly impacts retrieval accuracy.

Using Ollama Embeddings

from langchain_ollama import OllamaEmbeddings

# nomic-embed-text: Good quality, fast, 768 dimensions
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Test embedding
test_vector = embeddings.embed_query("What is machine learning?")
print(f"Vector dimensions: {len(test_vector)}")  # 768

Embedding Model Options

| Model                  | Dimensions | Size   | Quality   | Speed     |
|------------------------|------------|--------|-----------|-----------|
| nomic-embed-text       | 768        | 274 MB | Good      | Fast      |
| mxbai-embed-large      | 1024       | 670 MB | Very good | Medium    |
| all-minilm             | 384        | 45 MB  | Decent    | Very fast |
| snowflake-arctic-embed | 1024       | 670 MB | Very good | Medium    |

# Pull your chosen embedding model
ollama pull nomic-embed-text
ollama pull mxbai-embed-large

Step 4: Vector Database Setup

ChromaDB runs embedded (no separate server) and persists to disk.

from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create vector store and embed all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_documents"
)

print(f"Stored {vectorstore._collection.count()} chunks in ChromaDB")

Loading an Existing Database

# Load previously created database
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_documents"
)

Adding New Documents

# Add more documents to existing database
new_docs = PyPDFLoader("new_report.pdf").load()
new_chunks = text_splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)

FAISS Alternative

FAISS (by Meta) is faster for large collections but doesn’t persist to disk by default.

from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embeddings)

# Save to disk
vectorstore.save_local("./faiss_index")

# Load from disk
vectorstore = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)

Step 5: Retrieval Configuration

Basic Retriever

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 most relevant chunks
)

# Test retrieval
results = retriever.invoke("What are the key findings?")
for doc in results:
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Content: {doc.page_content[:200]}...")
    print("---")

MMR Retrieval (Diversity)

Maximum Marginal Relevance balances relevance with diversity to avoid returning near-duplicate chunks:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,
        "fetch_k": 20,      # Fetch 20 candidates, return 4 diverse ones
        "lambda_mult": 0.5   # 0 = max diversity, 1 = max relevance
    }
)
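Under the hood, MMR greedily picks the candidate that maximizes lambda_mult × relevance − (1 − lambda_mult) × similarity-to-already-selected. A toy version of the selection loop, assuming precomputed similarity scores:

```python
def mmr_select(query_sim, doc_sims, k=4, lambda_mult=0.5):
    # query_sim[i]: similarity of candidate i to the query
    # doc_sims[i][j]: similarity between candidates i and j
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates similar to something already chosen
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR skips 1 in favor of 2
query_sim = [0.9, 0.85, 0.5]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sims, k=2))  # → [0, 2]
```

With lambda_mult=1 this reduces to plain similarity ranking; lower values trade relevance for coverage.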

Similarity Score Threshold

Only return chunks above a relevance threshold:

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.7,  # Only return if similarity > 0.7
        "k": 6
    }
)

Step 6: Building the RAG Chain

Basic RAG Chain with LangChain

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize LLM
llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.1,  # Low temperature for factual answers
)

# RAG prompt template
template = """Answer the question based only on the following context. 
If the context doesn't contain enough information to answer the question, 
say "I don't have enough information to answer that question."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Helper to format retrieved documents
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in docs
    )

# Build the chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
response = rag_chain.invoke("What are the main conclusions of the report?")
print(response)

RAG Chain with Sources

from langchain_core.runnables import RunnableParallel, RunnableLambda

# Chain that returns both answer and source documents
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(
    answer=(
        RunnableLambda(lambda x: {
            "context": format_docs(x["context"]),
            "question": x["question"],
        })
        | prompt
        | llm
        | StrOutputParser()
    )
)

result = rag_chain_with_sources.invoke("What is the budget for Q3?")
print("Answer:", result["answer"])
print("\nSources:")
for doc in result["context"]:
    print(f"  - {doc.metadata.get('source')}: {doc.page_content[:100]}...")

Conversational RAG (Chat History)

from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

# Prompt with chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", """Given a chat history and the latest user question, 
    formulate a standalone question that can be understood without 
    the chat history. Do NOT answer the question, just reformulate it."""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Full conversational RAG prompt
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question based on the following context. 
    If unsure, say you don't know.
    
    Context: {context}"""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Usage with chat history
chat_history = []

def ask(question):
    # Condense follow-ups into a standalone question before retrieval,
    # so "How do they compare to Q2?" retrieves the right chunks
    if chat_history:
        condensed = llm.invoke(contextualize_prompt.format_messages(
            chat_history=chat_history, input=question
        )).content
    else:
        condensed = question

    context_docs = retriever.invoke(condensed)
    context = format_docs(context_docs)

    messages = qa_prompt.format_messages(
        context=context,
        chat_history=chat_history,
        input=question
    )

    response = llm.invoke(messages)

    chat_history.append(HumanMessage(content=question))
    chat_history.append(AIMessage(content=response.content))

    return response.content

print(ask("What were the Q3 results?"))
print(ask("How do they compare to Q2?"))  # Uses chat history

Step 7: Alternative with LlamaIndex

LlamaIndex provides a more streamlined API for RAG:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Configure global settings
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.chunk_size = 1000
Settings.chunk_overlap = 200

# Load and index documents
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(documents)

# Save index to disk
index.storage_context.persist(persist_dir="./llama_index_store")

# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=4,
    streaming=True,
)

# Ask questions
response = query_engine.query("Summarize the key findings")
print(response)
print("\nSources:")
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f} - {node.metadata.get('file_name', 'unknown')}")

LlamaIndex Chat Engine (Conversational)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=4,
)

response = chat_engine.chat("What are the main points?")
print(response)

response = chat_engine.chat("Can you elaborate on the second point?")
print(response)

Step 8: Building a Complete Application

Streamlit RAG App

# app.py
import streamlit as st
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import tempfile
import os

st.title("Local RAG Chatbot")
st.caption("Upload documents and ask questions. Everything runs locally.")

# Initialize components
@st.cache_resource
def get_embeddings():
    return OllamaEmbeddings(model="nomic-embed-text")

@st.cache_resource
def get_llm():
    return ChatOllama(model="llama3.1:8b", temperature=0.1)

embeddings = get_embeddings()
llm = get_llm()

# File upload
uploaded_files = st.file_uploader(
    "Upload PDF documents",
    type=["pdf"],
    accept_multiple_files=True
)

if uploaded_files:
    with st.spinner("Processing documents..."):
        documents = []
        for file in uploaded_files:
            # Write to a temp file, close it, then load (keeps Windows happy)
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                tmp.write(file.getvalue())
                tmp_path = tmp.name
            loader = PyPDFLoader(tmp_path)
            documents.extend(loader.load())
            os.unlink(tmp_path)

        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        )
        chunks = splitter.split_documents(documents)

        vectorstore = Chroma.from_documents(chunks, embeddings)
        retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

        st.success(f"Processed {len(documents)} pages into {len(chunks)} chunks")

    # Chat interface
    if "messages" not in st.session_state:
        st.session_state.messages = []

    for msg in st.session_state.messages:
        st.chat_message(msg["role"]).write(msg["content"])

    if question := st.chat_input("Ask a question about your documents"):
        st.session_state.messages.append({"role": "user", "content": question})
        st.chat_message("user").write(question)

        template = """Answer based on this context:
        {context}
        
        Question: {question}
        Answer:"""
        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs):
            return "\n\n".join(d.page_content for d in docs)

        chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt | llm | StrOutputParser()
        )

        with st.chat_message("assistant"):
            response = chain.invoke(question)
            st.write(response)
            st.session_state.messages.append({"role": "assistant", "content": response})

# Run the app
streamlit run app.py

Step 9: Prompt Engineering for RAG

The prompt template significantly impacts answer quality.

Effective RAG Prompts

# Strict factual (no hallucination)
strict_template = """You are a precise assistant that answers questions 
using ONLY the provided context. Follow these rules:
1. Only use information from the context below
2. If the context doesn't contain the answer, say "This information is not in the provided documents"
3. Quote relevant passages when possible
4. Cite the source document

Context:
{context}

Question: {question}

Answer:"""

# Analytical (synthesize across documents)
analytical_template = """Analyze the following context to answer the question.
Synthesize information across multiple sources if needed.
Highlight any contradictions or gaps in the available information.

Context:
{context}

Question: {question}

Analysis:"""

# Conversational (natural dialogue)
conversational_template = """You are a helpful assistant with access to 
a knowledge base. Use the context below to answer naturally. 
If you're unsure, say so.

Context:
{context}

Question: {question}

Response:"""
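To make rule 4 of the strict template ("Cite the source document") actionable, number the chunks in the context so the model has something concrete to cite. A small formatting helper (a sketch; the `(source, text)` pair shape is an assumption, e.g. pulled from `doc.metadata` and `doc.page_content`):

```python
def numbered_context(docs):
    # docs: list of (source, text) pairs
    return "\n\n".join(
        f"[{i + 1}] ({source}) {text}"
        for i, (source, text) in enumerate(docs)
    )

docs = [("report.pdf", "Revenue grew 12% in Q3."),
        ("memo.txt", "Headcount was flat.")]
print(numbered_context(docs))
```

The model can then answer with references like "[1]", which you can map back to filenames when displaying the response.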

Step 10: Evaluation and Improvement

Testing Retrieval Quality

# Test if the right chunks are being retrieved
test_questions = [
    "What was the revenue in Q3?",
    "Who is the CEO?",
    "What are the risk factors?",
]

for question in test_questions:
    # similarity_search_with_score also returns the score
    # (for Chroma this is a distance: lower = more similar)
    results = vectorstore.similarity_search_with_score(question, k=4)
    print(f"\nQ: {question}")
    for i, (doc, score) in enumerate(results):
        print(f"  [{i+1}] score={score:.3f} - {doc.page_content[:100]}...")
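To quantify retrieval quality beyond eyeballing, track a hit rate over questions whose source document you already know. A sketch, where `retrieve` is any function returning ranked docs with a `source` field (the dict interface here is a hypothetical stand-in for your retriever):

```python
def retrieval_hit_rate(qa_pairs, retrieve, k=4):
    # qa_pairs: (question, expected_source) tuples with known answers
    hits = sum(
        any(doc["source"] == source for doc in retrieve(question)[:k])
        for question, source in qa_pairs
    )
    return hits / len(qa_pairs)

# Fake retriever for demonstration
def fake_retrieve(question):
    return [{"source": "report.pdf"}, {"source": "memo.txt"}]

pairs = [("What was Q3 revenue?", "report.pdf"),
         ("Who is the CEO?", "bio.pdf")]
print(retrieval_hit_rate(pairs, fake_retrieve))  # → 0.5
```

Re-run the metric after each chunking or embedding change to see whether retrieval actually improved.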

Common Issues and Fixes

| Problem                | Symptom                         | Fix                                              |
|------------------------|---------------------------------|--------------------------------------------------|
| Chunks too large       | Answer includes irrelevant info | Reduce chunk_size to 500                         |
| Chunks too small       | Missing context                 | Increase chunk_size to 1500                      |
| Wrong chunks retrieved | Irrelevant results              | Try a better embedding model (mxbai-embed-large) |
| Hallucination          | Makes up facts                  | Use a stricter prompt, lower temperature         |
| Slow retrieval         | Long wait for results           | Reduce k, use FAISS instead of ChromaDB          |
| Missing answers        | "Not in documents" when it is   | Increase k to 6-8, check chunking boundaries     |

Improving Quality

  1. Better chunking: Use semantic chunking or document-structure-aware splitting
  2. Better embeddings: Upgrade from nomic-embed-text to mxbai-embed-large
  3. Hybrid search: Combine vector similarity with keyword (BM25) search
  4. Reranking: After retrieval, rerank results with a cross-encoder model
  5. Larger LLM: Use a 14B or 32B model for better comprehension
  6. Query expansion: Rephrase the query to improve retrieval
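For item 3, a common way to combine vector and keyword rankings is reciprocal rank fusion (RRF), which needs only the two ranked lists, not comparable scores. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: lists of doc ids, best first (e.g. one list from BM25,
    # one from vector search); k=60 is the conventional damping constant
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_a", "doc_c", "doc_d"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents near the top of both lists float to the top of the fused ranking, which tends to be more robust than either method alone.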

No-Code RAG with Open WebUI

If you don’t want to write code, Open WebUI has built-in RAG:

# Start Open WebUI with Ollama
docker compose up -d  # Using the compose file from earlier guides

# In the Open WebUI interface:
# 1. Click the + button in a chat
# 2. Upload documents (PDF, TXT, etc.)
# 3. Ask questions about the uploaded content
# Open WebUI handles chunking, embedding, and retrieval automatically

Frequently Asked Questions

What is RAG and why use it instead of fine-tuning?

RAG (Retrieval-Augmented Generation) retrieves relevant documents and includes them in the LLM's prompt context. Unlike fine-tuning, RAG doesn't modify the model. It's faster to set up, works with any model, lets you update data instantly without retraining, and provides source attribution. Use RAG when your data changes frequently or when you need to cite sources. Use fine-tuning when you need to change the model's behavior or writing style.

How many documents can a local RAG system handle?

A local RAG system can handle thousands to tens of thousands of documents easily. ChromaDB and FAISS handle millions of embeddings on a single machine. The bottleneck is usually the initial embedding computation, not storage or retrieval: an embedding model like nomic-embed-text can process roughly 50-100 pages per minute on a modern GPU.

What chunk size should I use for RAG?

Start with 500-1000 characters with 100-200 character overlap. Smaller chunks (200-500) improve precision but may lose context. Larger chunks (1000-2000) preserve more context but may include irrelevant information. The optimal size depends on your content type: technical docs work well with smaller chunks, narrative content benefits from larger ones. Test with your actual data and queries.

Can I build RAG without writing code?

Yes. Open WebUI has built-in RAG functionality. Upload documents directly in the chat interface and ask questions about them. AnythingLLM is another no-code option with a GUI for document management and chat. For more control and customization, the code-based approaches in this guide offer maximum flexibility.