Fine-Tuning Your Own Local Model: From Data to Deployment

Learn when and how to fine-tune a local LLM. Covers dataset preparation, QLoRA training with Unsloth, evaluation, GGUF export, and deployment with Ollama.

Fine-tuning lets you customize a pre-trained LLM for your specific use case by training it on your own data. Unlike RAG (which provides context at query time), fine-tuning changes the model’s weights permanently, making it inherently better at your target task. This guide walks you through the entire process: deciding whether to fine-tune, preparing your dataset, training with QLoRA using Unsloth (the most efficient method), evaluating results, exporting to GGUF format, and deploying with Ollama.

When to Fine-Tune

Fine-tuning is powerful but not always necessary. Here’s a decision framework:

Fine-Tuning Is Right When You Need To:

  • Change output format: Always respond in JSON, markdown tables, or a specific structure
  • Adopt a consistent style: Specific tone, persona, or writing conventions
  • Learn domain terminology: Medical, legal, financial, or industry-specific language
  • Improve at a narrow task: Classification, extraction, summarization of specific content
  • Reduce prompt engineering: Eliminate repetitive system prompts by baking behavior into weights
  • Reduce model size: A fine-tuned 7B model can outperform a general 70B model on its specific task

RAG Is Better When:

  • Your knowledge base changes frequently
  • You need source attribution
  • The data is too large to fit in training
  • You need the model to reference specific documents
  • Setup time and simplicity matter more than maximum quality

Don’t Fine-Tune When:

  • A good prompt with the base model already works
  • You have fewer than 50 examples
  • Your task requires broad general knowledge
  • You can’t evaluate quality systematically

Prerequisites

Hardware Requirements

| Model Size | QLoRA VRAM | Full FT VRAM | Unsloth QLoRA | Training Time (1K samples) |
|------------|------------|--------------|---------------|----------------------------|
| 3B         | 6-8 GB     | 24 GB        | 4-6 GB        | 5-10 min                   |
| 7-8B       | 12-16 GB   | 60 GB        | 8-12 GB       | 15-30 min                  |
| 13-14B     | 16-24 GB   | 100 GB       | 12-16 GB      | 30-60 min                  |
| 32-34B     | 32-48 GB   | 200+ GB      | 24-32 GB      | 1-2 hours                  |
| 70B        | 48-80 GB   | 400+ GB      | 40-48 GB      | 2-4 hours                  |
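As a quick sanity check, the Unsloth QLoRA column can be turned into a lookup that reports the largest model class a given GPU can fine-tune. This is a minimal sketch using the table's upper bounds (the helper name is illustrative, and real headroom also depends on sequence length and batch size):

```python
# Unsloth QLoRA VRAM needs from the table above (upper bounds, GB)
UNSLOTH_QLORA_GB = {"3B": 6, "7-8B": 12, "13-14B": 16, "32-34B": 32, "70B": 48}

def largest_trainable(vram_gb):
    """Return the largest model size class that fits in the given VRAM."""
    fit = [size for size, need in UNSLOTH_QLORA_GB.items() if need <= vram_gb]
    return fit[-1] if fit else None

print(largest_trainable(16))  # → 13-14B
```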

Software Setup

# Create a virtual environment
python3 -m venv finetune-env
source finetune-env/bin/activate

# Install Unsloth (fastest QLoRA implementation)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Or install the stable version
pip install unsloth

# Additional tools
pip install datasets wandb  # Data loading and experiment tracking

Verify GPU

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Step 1: Dataset Preparation

Dataset quality is the single most important factor in fine-tuning success. Spend 80% of your effort here.

Dataset Formats

Chat/Instruction format (most common):

[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
      {"role": "assistant", "content": "The main symptoms of type 2 diabetes include increased thirst, frequent urination, fatigue, blurred vision, slow-healing wounds, and unexplained weight loss. Many people with type 2 diabetes may not notice symptoms initially, as they develop gradually."}
    ]
  },
  {
    "messages": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What is the first-line treatment for hypertension?"},
      {"role": "assistant", "content": "The first-line pharmacological treatments for hypertension typically include ACE inhibitors (e.g., lisinopril), ARBs (e.g., losartan), calcium channel blockers (e.g., amlodipine), or thiazide diuretics (e.g., hydrochlorothiazide). The choice depends on patient factors including age, ethnicity, comorbidities, and medication tolerability."}
    ]
  }
]

Alpaca format (simpler):

[
  {
    "instruction": "Summarize the following legal clause",
    "input": "The Licensee shall indemnify and hold harmless...",
    "output": "This clause requires the licensee to protect the licensor from any legal claims..."
  }
]
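If your data is already in Alpaca format, it can be mapped to the chat format with a small converter. A sketch assuming the standard Alpaca field names (the default system prompt is a placeholder):

```python
import json

def alpaca_to_chat(record, system="You are a helpful assistant."):
    """Convert one Alpaca-format record to the chat/messages format."""
    user = record["instruction"]
    if record.get("input"):  # fold the optional input into the user turn
        user += "\n\n" + record["input"]
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]}

record = {
    "instruction": "Summarize the following legal clause",
    "input": "The Licensee shall indemnify and hold harmless...",
    "output": "This clause requires the licensee to protect the licensor...",
}
print(json.dumps(alpaca_to_chat(record), indent=2))
```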

Creating Quality Training Data

Method 1: Manual curation

Write examples by hand. This produces the highest quality data but is time-intensive.

# Template for structured data creation
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Inc. Be helpful, concise, and professional."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "To reset your Acme password:\n\n1. Go to acme.com/reset\n2. Enter your email address\n3. Click the reset link in the email you receive\n4. Choose a new password (minimum 12 characters)\n\nIf you don't receive the email within 5 minutes, check your spam folder or contact [email protected]."}
        ]
    },
    # Add more examples...
]

with open("training_data.json", "w") as f:
    json.dump(examples, f, indent=2)

Method 2: Convert existing data

Transform existing Q&A databases, help desk tickets, or documentation into training format:

import json
import csv

# Convert FAQ CSV to training format
training_data = []
with open("faq.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        training_data.append({
            "messages": [
                {"role": "system", "content": "You are a product support specialist."},
                {"role": "user", "content": row["question"]},
                {"role": "assistant", "content": row["answer"]},
            ]
        })

with open("training_data.json", "w") as f:
    json.dump(training_data, f, indent=2)

print(f"Created {len(training_data)} training examples")

Method 3: Synthetic data generation

Use a powerful model to generate training examples (then manually verify):

import requests
import json

def generate_example(topic, model="llama3.1:8b"):
    """Generate a training example using Ollama."""
    prompt = f"""Create a realistic customer support conversation about {topic}.
    Format as JSON with messages array containing role and content fields.
    The assistant should be helpful, professional, and concise.
    Only output the JSON, nothing else."""
    
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return response.json()["response"]

topics = [
    "billing dispute",
    "product return",
    "account upgrade",
    "technical troubleshooting",
]

# Generate and manually review each example
for topic in topics:
    example = generate_example(topic)
    print(f"--- {topic} ---")
    print(example)
    print()
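Generated output often arrives wrapped in prose or code fences, so it is worth validating each example programmatically before it enters the dataset. A minimal sketch (the function name and checks are illustrative; rejected examples go to a pile for manual fixing):

```python
import json

def parse_generated_example(raw):
    """Extract and validate the JSON object from a model response.
    Slices from the first '{' to the last '}' to strip surrounding
    prose or code fences before parsing."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Reject structurally invalid examples rather than training on them
    msgs = obj.get("messages")
    if not isinstance(msgs, list) or len(msgs) < 2:
        return None
    if not all("role" in m and "content" in m for m in msgs):
        return None
    return obj
```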

Data Quality Checklist

Before training, verify your dataset:

  • At least 100 examples (500+ recommended)
  • Consistent formatting across all examples
  • No contradictory information
  • Responses match desired tone and style
  • No personally identifiable information (unless intentional)
  • Variety of question types and complexity levels
  • Grammar and spelling checked
  • No duplicate or near-duplicate examples
  • Train/validation split (90/10 or 80/20)

# Validate dataset
import json

with open("training_data.json") as f:
    data = json.load(f)

print(f"Total examples: {len(data)}")

# Check format
for i, example in enumerate(data):
    assert "messages" in example, f"Example {i} missing 'messages'"
    assert len(example["messages"]) >= 2, f"Example {i} needs at least 2 messages"
    for msg in example["messages"]:
        assert "role" in msg, f"Example {i} message missing 'role'"
        assert "content" in msg, f"Example {i} message missing 'content'"
        assert msg["role"] in ("system", "user", "assistant"), f"Invalid role in example {i}"

# Check for duplicates
contents = [json.dumps(ex) for ex in data]
unique = len(set(contents))
print(f"Unique examples: {unique} (duplicates: {len(data) - unique})")

# Analyze lengths
user_lens = []
assistant_lens = []
for ex in data:
    for msg in ex["messages"]:
        if msg["role"] == "user":
            user_lens.append(len(msg["content"]))
        elif msg["role"] == "assistant":
            assistant_lens.append(len(msg["content"]))

print(f"User message length: avg={sum(user_lens)/len(user_lens):.0f}, "
      f"min={min(user_lens)}, max={max(user_lens)}")
print(f"Assistant message length: avg={sum(assistant_lens)/len(assistant_lens):.0f}, "
      f"min={min(assistant_lens)}, max={max(assistant_lens)}")
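The last checklist item, the train/validation split, can be done with a small helper; the 10% default matches the 90/10 recommendation, and Step 2's train_test_split achieves the same thing inside the datasets library:

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle and split examples into (train, validation) lists."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # reproducible shuffle
    cut = int(len(examples) * (1 - val_fraction))
    return examples[:cut], examples[cut:]

# Usage with the dataset loaded above:
# train, val = train_val_split(data)
# json.dump(train, open("train.json", "w"), indent=2)
# json.dump(val, open("val.json", "w"), indent=2)
```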

Step 2: Fine-Tuning with Unsloth

Unsloth is the fastest and most memory-efficient QLoRA implementation. It’s 2x faster than standard Hugging Face training and uses 30-50% less VRAM.

Basic Training Script

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # QLoRA
)

# 2. Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank (8-64, higher = more capacity)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,     # Scaling factor
    lora_dropout=0,    # Dropout (0 for QLoRA)
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory optimization
)

# 3. Load dataset
dataset = load_dataset("json", data_files="training_data.json", split="train")

# 4. Format dataset for training
def format_chat(example):
    """Format chat messages using the model's chat template."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="epoch",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        seed=42,
    ),
)

# 6. Start training
trainer_stats = trainer.train()
print(f"Training completed in {trainer_stats.metrics['train_runtime']:.1f} seconds")

Training with Validation

# Split dataset
dataset = load_dataset("json", data_files="training_data.json", split="train")
dataset = dataset.map(format_chat)
split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=50,  # must match the eval cadence when load_best_model_at_end=True
        load_best_model_at_end=True,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        seed=42,
    ),
)

Key Hyperparameters

| Parameter             | Recommended | Range          | Effect                                |
|-----------------------|-------------|----------------|---------------------------------------|
| r (LoRA rank)         | 16          | 8-64           | Higher = more capacity, more memory   |
| lora_alpha            | 16          | Equal to r     | Scaling factor                        |
| learning_rate         | 2e-4        | 1e-5 to 5e-4   | Lower = safer but slower              |
| num_train_epochs      | 3           | 1-5            | More = better fit, risk of overfit    |
| batch_size            | 2           | 1-8            | Higher = faster, more memory          |
| gradient_accumulation | 4           | 2-16           | Effective batch = batch * accumulation |
| max_seq_length        | 2048        | 512-8192       | Must fit your longest example         |
| warmup_steps          | 10          | 5-100          | Stabilizes early training             |
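To see how these interact, here is the arithmetic for the recommended defaults on a 1,000-example dataset:

```python
# Worked example: effective batch size and optimizer steps for the
# recommended defaults (batch 2, accumulation 4, 3 epochs, 1,000 examples)
batch_size = 2
grad_accumulation = 4
epochs = 3
n_examples = 1000

effective_batch = batch_size * grad_accumulation     # 2 * 4 = 8
steps_per_epoch = n_examples // effective_batch      # 1000 // 8 = 125
total_steps = steps_per_epoch * epochs               # 125 * 3 = 375

print(effective_batch, steps_per_epoch, total_steps)  # → 8 125 375
```

At this scale, warmup_steps=10 is under 3% of the 375 total steps, comfortably inside the 5-100 range.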

Monitoring Training

# Use Weights & Biases for tracking
import wandb
wandb.init(project="local-llm-finetune")

# Add to TrainingArguments:
args = TrainingArguments(
    # ... other args ...
    report_to="wandb",
    run_name="llama3.1-8b-customer-support-v1",
)

Watch for:

  • Training loss: Should decrease steadily
  • Validation loss: Should decrease and then plateau. If it increases while training loss decreases, you’re overfitting
  • Learning rate: Should warm up then decay
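The overfitting rule above can also be checked programmatically against logged losses. A sketch with an illustrative helper (transformers additionally provides an EarlyStoppingCallback that automates this when load_best_model_at_end is enabled):

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """Heuristic check on logged losses: validation loss rose for
    `patience` consecutive evals while training loss kept falling."""
    if len(val_losses) < patience + 1 or len(train_losses) < patience + 1:
        return False
    val_rising = all(
        val_losses[-i] > val_losses[-i - 1] for i in range(1, patience + 1)
    )
    train_falling = train_losses[-1] < train_losses[-patience - 1]
    return val_rising and train_falling

# Healthy run: both losses falling
print(is_overfitting([1.0, 0.8, 0.6, 0.5], [1.0, 0.9, 0.85, 0.8]))  # → False
# Overfitting: train falls while validation climbs
print(is_overfitting([1.0, 0.8, 0.6, 0.4, 0.3], [1.0, 0.9, 1.0, 1.1, 1.2]))  # → True
```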

Step 3: Evaluation

Quick Test

# Test the fine-tuned model
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a customer support agent."},
    {"role": "user", "content": "I can't log into my account."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.1,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Systematic Evaluation

import json

# Load test questions
test_cases = [
    {
        "input": "How do I upgrade my plan?",
        "expected_topics": ["pricing", "upgrade", "steps"],
    },
    {
        "input": "I was charged twice this month.",
        "expected_topics": ["billing", "refund", "investigate"],
    },
    # Add more test cases...
]

results = []
for test in test_cases:
    messages = [
        {"role": "system", "content": "You are a customer support agent."},
        {"role": "user", "content": test["input"]},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
    
    outputs = model.generate(input_ids=inputs, max_new_tokens=256, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract just the assistant's response
    assistant_response = response.split("assistant")[-1].strip()
    
    # Check if expected topics are covered
    topics_found = [
        topic for topic in test["expected_topics"]
        if topic.lower() in assistant_response.lower()
    ]
    
    results.append({
        "input": test["input"],
        "response": assistant_response,
        "topics_covered": len(topics_found) / len(test["expected_topics"]),
    })

# Summary
avg_coverage = sum(r["topics_covered"] for r in results) / len(results)
print(f"Average topic coverage: {avg_coverage:.1%}")

for r in results:
    print(f"\nQ: {r['input']}")
    print(f"Coverage: {r['topics_covered']:.0%}")
    print(f"A: {r['response'][:200]}...")

A/B Comparison with Base Model

# Compare fine-tuned vs base model responses
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(base_model)

test_prompt = "I was charged twice this month."

for name, m, t in [("Base", base_model, base_tokenizer), ("Fine-tuned", model, tokenizer)]:
    messages = [
        {"role": "system", "content": "You are a customer support agent."},
        {"role": "user", "content": test_prompt},
    ]
    inputs = t.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    outputs = m.generate(input_ids=inputs, max_new_tokens=256, temperature=0.1)
    response = t.decode(outputs[0], skip_special_tokens=True)
    print(f"\n=== {name} Model ===")
    print(response.split("assistant")[-1].strip()[:300])

Step 4: Export to GGUF

GGUF is the format used by llama.cpp and Ollama. Export your fine-tuned model to GGUF for efficient local inference.

Save the LoRA Adapter

# Save LoRA adapter (small, ~100-500 MB)
model.save_pretrained("./output/lora-adapter")
tokenizer.save_pretrained("./output/lora-adapter")

Merge and Export to GGUF

# Merge LoRA into base model and export as GGUF
model.save_pretrained_gguf(
    "./output/gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Most common quantization
)

# Available quantization methods:
# "q4_k_m"  - Recommended default (4-bit, good quality)
# "q5_k_m"  - Higher quality (5-bit)
# "q8_0"    - Near-lossless (8-bit)
# "f16"     - Full precision (largest)

This creates a single .gguf file ready for Ollama.
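To pick between these methods, you can estimate file sizes from approximate bits-per-weight. The figures below are rough assumptions (k-quants mix precisions, so real files vary slightly by architecture):

```python
# Assumed approximate bits-per-weight for each quantization method
BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gib(n_params_billion, method):
    """Estimated GGUF file size in GiB for a parameter count and quant."""
    return n_params_billion * 1e9 * BITS_PER_WEIGHT[method] / 8 / 1024**3

for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{gguf_size_gib(8, method):.1f} GiB for an 8B model")
```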

Manual GGUF Conversion (Alternative)

If automatic export fails:

# First, save merged model in HF format
model.save_pretrained_merged(
    "./output/merged",
    tokenizer,
    save_method="merged_16bit",
)
# Then convert with llama.cpp (convert_hf_to_gguf.py emits f32/f16/bf16/q8_0;
# k-quants such as q4_k_m require a separate llama-quantize pass)
cd ~/llama.cpp
python convert_hf_to_gguf.py ./output/merged --outtype f16 --outfile model-finetuned-f16.gguf
./llama-quantize model-finetuned-f16.gguf model-finetuned-q4_k_m.gguf q4_k_m

Step 5: Deploy with Ollama

Create Modelfile

cat > Modelfile << 'EOF'
FROM ./output/gguf/unsloth.Q4_K_M.gguf

# Set chat template (match what was used in training)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

# Default system prompt
SYSTEM "You are a customer support agent for Acme Inc. Be helpful, concise, and professional."

# Generation parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 2048
PARAMETER stop "<|eot_id|>"
EOF

Import to Ollama

# Create the model in Ollama
ollama create acme-support -f Modelfile

# Test it
ollama run acme-support

# List your models
ollama list

Verify the Fine-Tuned Model

# Interactive test
ollama run acme-support
>>> How do I reset my password?
>>> I was charged twice.
>>> Can I upgrade to the enterprise plan?

# API test
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "acme-support",
    "messages": [
      {"role": "user", "content": "How do I cancel my subscription?"}
    ]
  }'

Common Issues and Solutions

Overfitting

Symptoms: Training loss is very low but validation loss increases. Model repeats training examples verbatim.

Solutions:

  • Reduce epochs (try 1-2 instead of 3)
  • Reduce LoRA rank (try 8 instead of 16)
  • Add more diverse training data
  • Increase dropout (try 0.05-0.1)

Underfitting

Symptoms: Model behavior barely changed from base model.

Solutions:

  • Increase epochs (try 5)
  • Increase LoRA rank (try 32-64)
  • Increase learning rate (try 5e-4)
  • Add more training data
  • Check that data formatting matches the model’s chat template

Catastrophic Forgetting

Symptoms: Model does your task well but becomes terrible at everything else.

Solutions:

  • Use lower LoRA rank (8-16)
  • Mix in general instruction data (10-20% of training set)
  • Use lower learning rate (1e-4)
  • Fewer epochs

Out of Memory

Solutions:

  • Reduce batch size to 1
  • Increase gradient accumulation
  • Reduce max_seq_length
  • Use Unsloth (30-50% less memory)
  • Try a smaller base model

Iterating on Your Model

Fine-tuning is iterative. Here’s a recommended workflow:

1. Start with 200 high-quality examples
2. Train for 1 epoch with default settings
3. Evaluate with test cases
4. Identify failure modes
5. Add examples that address failures
6. Retrain with updated dataset
7. Repeat until quality meets your bar

Keep a version log:

v1: 200 examples, 1 epoch  → 60% topic coverage
v2: 350 examples, 2 epochs → 75% topic coverage
v3: 500 examples, 3 epochs → 88% topic coverage
v4: 500 examples, 3 epochs, cleaned data → 92% topic coverage ✓

Frequently Asked Questions

When should I fine-tune instead of using RAG?

Fine-tune when you need to change the model's behavior, writing style, or output format consistently across all interactions. Use RAG when you need the model to reference specific documents or knowledge that changes frequently. Example: Fine-tune if you want the model to always respond in a specific corporate tone. Use RAG if you want the model to answer questions about your documentation. Many production systems use both: a fine-tuned base model with RAG for knowledge retrieval.

How much data do I need for fine-tuning?

Quality matters far more than quantity. For task-specific fine-tuning (formatting, style), 100-500 high-quality examples can produce significant improvements. For domain adaptation (medical, legal, financial), 1,000-5,000 examples are typical. For general instruction tuning, 10,000+ examples are common. Start with 200-500 carefully curated examples and iterate. Bad data actively harms the model, so invest time in data quality.

How much VRAM do I need for fine-tuning?

QLoRA fine-tuning dramatically reduces VRAM requirements. A 7B model can be fine-tuned with QLoRA on a single GPU with 12-16 GB VRAM. A 13B model needs 16-24 GB. A 70B model needs 48-80 GB (or multiple GPUs). Full fine-tuning requires 4-8x more VRAM than QLoRA. Unsloth further reduces requirements by 30-50% compared to standard QLoRA implementations.

How long does fine-tuning take?

With QLoRA on an RTX 4090 (24 GB): A 7B model with 1,000 examples takes 15-30 minutes for 3 epochs. A 13B model takes 30-60 minutes. A 70B model takes 2-4 hours. Training time scales linearly with dataset size and number of epochs. Unsloth is approximately 2x faster than standard implementations.