Fine-tuning used to be something only well-funded teams could do. You needed A100s, distributed training frameworks, and deep expertise in gradient accumulation and learning rate scheduling. Unsloth changed that. With QLoRA and Unsloth’s optimized training kernels, you can fine-tune a 7-14B parameter model on a single consumer GPU in a few hours and produce a model that genuinely outperforms the base model on your specific task.
This walkthrough takes you from raw customer support data to a deployed, fine-tuned model running in Ollama. No hand-waving. No “left as an exercise.” Every step, every config, every decision explained.
What We Are Building
We are fine-tuning Qwen3 8B into a customer support assistant for "CloudDash," a fictional SaaS cloud monitoring platform. The fine-tuned model will:
- Answer product questions accurately using our documentation
- Follow a consistent, professional tone
- Handle common support scenarios (billing, outages, feature requests, troubleshooting)
- Escalate appropriately when it cannot help
By the end, we will have a GGUF file running in Ollama that handles CloudDash support queries better than the base Qwen3 8B.
Hardware Requirements
- Minimum: GPU with 12GB VRAM (RTX 3060 12GB, RTX 4070)
- Recommended: GPU with 24GB VRAM (RTX 3090, RTX 4090)
- RAM: 32GB system RAM
- Storage: 50GB free disk space
- Time: 2-4 hours for training (depending on GPU and dataset size)
This tutorial was tested on an RTX 3090 (24GB). On 12GB VRAM, you may need to reduce batch size.
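If you are unsure whether your card has enough headroom, a back-of-envelope estimate helps. The sketch below is a rough heuristic, not a measurement: the overhead constant is an assumption that varies with sequence length, batch size, and framework version.

```python
# Back-of-envelope VRAM estimate for QLoRA fine-tuning.
# These are rough heuristics, not exact numbers.

def estimate_qlora_vram_gb(n_params_b: float, lora_rank: int = 32,
                           overhead_gb: float = 4.0) -> float:
    """Very rough VRAM estimate in GB for a QLoRA run.

    n_params_b: base model size in billions of parameters.
    overhead_gb: assumed headroom for activations, adapter gradients,
                 optimizer state, and CUDA context (an assumption).
    """
    weights_gb = n_params_b * 0.5                  # 4-bit weights: ~0.5 bytes/param
    lora_gb = n_params_b * 0.02 * lora_rank / 32   # adapters are tiny (rough)
    return weights_gb + lora_gb + overhead_gb

# print(f"~{estimate_qlora_vram_gb(8):.1f} GB")  # 8B model: roughly 8 GB plus headroom
```

In practice an 8B model in 4-bit fits comfortably in 12-16 GB at modest batch sizes, which matches the hardware table above.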
Step 1: Environment Setup
```bash
# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune

# Install Unsloth (follow their latest install instructions for your CUDA version)
pip install unsloth
pip install datasets transformers trl

# Verify GPU access
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Step 2: Prepare the Dataset
This is the most important step. The quality of your fine-tuning data determines the quality of your model. Garbage in, garbage out — and with fine-tuning, this is not a metaphor.
Dataset Format
We store each example as a list of messages, which the tokenizer's chat template renders into ChatML, the prompt format used by Qwen and many other instruction-tuned models:
```json
{
  "conversations": [
    {"role": "system", "content": "You are CloudDash Support, a helpful customer support assistant for CloudDash cloud monitoring platform."},
    {"role": "user", "content": "How do I set up alerts for CPU usage?"},
    {"role": "assistant", "content": "To set up CPU usage alerts in CloudDash:\n\n1. Navigate to Monitoring > Alert Rules\n2. Click 'Create New Rule'\n3. Select 'CPU Utilization' from the metric dropdown\n4. Set your threshold (e.g., 90% for 5 minutes)\n5. Choose notification channels (email, Slack, PagerDuty)\n6. Click 'Save Rule'\n\nThe alert will trigger when CPU usage exceeds your threshold for the specified duration. You can also set warning thresholds at a lower level (e.g., 75%) for early notification.\n\nWould you like help configuring a specific notification channel?"}
  ]
}
```
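Under the hood, the tokenizer's chat template turns this JSON into plain text with special tokens. For Qwen-family models the rendering looks roughly like this (illustrative and abbreviated; the exact output depends on the model's template):

```
<|im_start|>system
You are CloudDash Support, a helpful customer support assistant for CloudDash cloud monitoring platform.<|im_end|>
<|im_start|>user
How do I set up alerts for CPU usage?<|im_end|>
<|im_start|>assistant
To set up CPU usage alerts in CloudDash:
...<|im_end|>
```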
Building the Dataset
For a customer support model, your training data should include:
- Real support conversations (anonymized) — the gold standard
- FAQ pairs generated from your documentation
- Edge cases — billing disputes, angry customers, out-of-scope requests
- Escalation examples — knowing when to hand off to a human
Here is a Python script to structure your data:
```python
# prepare_dataset.py
import json

def create_training_example(system_prompt, user_msg, assistant_msg):
    return {
        "conversations": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    }

SYSTEM_PROMPT = (
    "You are CloudDash Support, a helpful and professional customer support "
    "assistant for CloudDash, a cloud infrastructure monitoring platform. "
    "You are knowledgeable about CloudDash features, billing, and "
    "troubleshooting. You are friendly but concise. When you cannot help "
    "with something, you escalate to a human agent."
)

# Example: build from structured data sources
training_data = []

# From FAQ documents
faqs = [
    {
        "q": "What pricing plans do you offer?",
        "a": "CloudDash offers three plans:\n\n- **Starter** ($29/month): Up to 10 hosts, 1-day data retention, email alerts\n- **Pro** ($99/month): Up to 100 hosts, 30-day retention, all notification channels, custom dashboards\n- **Enterprise** (custom pricing): Unlimited hosts, 1-year retention, SSO, dedicated support\n\nAll plans include a 14-day free trial. Would you like to start a trial or do you have questions about a specific plan?"
    },
    # ... hundreds more FAQ pairs
]

for faq in faqs:
    training_data.append(
        create_training_example(SYSTEM_PROMPT, faq["q"], faq["a"])
    )

# From real support conversations (anonymized)
# Load from your support ticket system export
# ...

# Save in JSONL format
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Created {len(training_data)} training examples")
```
Dataset Size Guidelines
- Minimum viable: 200-500 examples (you will see some improvement)
- Good results: 1,000-3,000 examples (noticeable behavior change)
- Excellent results: 5,000-10,000 examples (strong domain specialization)
- Diminishing returns: Beyond 10,000 examples for a single task
For this tutorial, we use 2,000 examples. Quality matters more than quantity — 500 excellent examples beat 5,000 sloppy ones.
Data Quality Checklist
Before training, audit your dataset:
- Consistent tone and style across all assistant responses
- No contradictory information between examples
- Responses are the length you want the model to produce (it learns length patterns)
- Edge cases are represented (refusals, escalations, multi-turn conversations)
- No personally identifiable information (PII) in the training data
- System prompt is consistent across all examples
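Parts of this checklist can be automated. The sketch below audits the `training_data.jsonl` file produced by the earlier script; the PII pattern is illustrative only (a real audit needs more than an email regex), and the checks are a starting point, not a complete validator.

```python
# audit_dataset.py -- minimal automated checks for the checklist above.
# The PII regex is illustrative, not exhaustive.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def audit(path: str) -> dict:
    system_prompts = set()
    assistant_lengths = []
    pii_hits = 0
    with open(path) as f:
        for line in f:
            convo = json.loads(line)["conversations"]
            for msg in convo:
                if msg["role"] == "system":
                    system_prompts.add(msg["content"])
                if msg["role"] == "assistant":
                    assistant_lengths.append(len(msg["content"]))
                if EMAIL_RE.search(msg["content"]):
                    pii_hits += 1  # flag for manual review
    return {
        "consistent_system_prompt": len(system_prompts) <= 1,
        "avg_assistant_chars": sum(assistant_lengths) / max(len(assistant_lengths), 1),
        "possible_pii": pii_hits,
    }

# print(audit("training_data.jsonl"))
```

Run it before every training run; a surprising average response length or a second system prompt sneaking in are both easy to miss by eye.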
Step 3: Fine-Tuning with Unsloth
```python
# train.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ============================================================
# Model Configuration
# ============================================================
model_name = "unsloth/Qwen3-8B"  # Unsloth-optimized version
max_seq_length = 4096
load_in_4bit = True  # QLoRA: load base model in 4-bit

# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)

# ============================================================
# LoRA Configuration
# ============================================================
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                  # LoRA rank: higher = more capacity, more VRAM
    lora_alpha=64,         # Scaling factor, typically 2x rank
    lora_dropout=0.05,     # Slight dropout for regularization
    target_modules=[       # Which layers to fine-tune
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)

model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# ============================================================
# Load Dataset
# ============================================================
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Format conversations into the chat template
def format_conversation(example):
    """Convert our conversation format to the model's chat template."""
    text = tokenizer.apply_chat_template(
        example["conversations"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_conversation)

# ============================================================
# Training Configuration
# ============================================================
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,              # 3 epochs is usually sufficient
        per_device_train_batch_size=4,   # Reduce to 2 if OOM on 12GB
        gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
        learning_rate=2e-4,              # Standard for QLoRA
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        logging_steps=10,
        save_steps=200,
        save_total_limit=3,
        fp16=True,                       # Use bf16=True if your GPU supports it
        optim="adamw_8bit",              # 8-bit Adam saves VRAM
        seed=42,
    ),
)

# ============================================================
# Train
# ============================================================
print("Starting training...")
trainer.train()

# Save the LoRA adapter
model.save_pretrained("./output/final")
tokenizer.save_pretrained("./output/final")
print("Training complete! Model saved to ./output/final")
```
Training Time Expectations
| GPU | 2,000 examples, 3 epochs | 5,000 examples, 3 epochs |
|---|---|---|
| RTX 3060 12GB | ~3.5 hours | ~8 hours |
| RTX 3090 24GB | ~1.5 hours | ~4 hours |
| RTX 4090 24GB | ~45 minutes | ~2 hours |
Monitoring Training
Watch the training loss. A healthy training run looks like:
- Step 0-50: Loss drops rapidly from ~2.5 to ~1.5
- Step 50-200: Loss decreases steadily to ~0.8-1.0
- Step 200+: Loss plateaus around 0.6-0.9
If the loss drops below 0.3, you are likely overfitting. Reduce epochs or increase dropout.
If the loss does not drop below 1.5, check your data formatting; the model is probably failing to learn because examples are malformed (for instance, an incorrectly applied chat template).
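You can check these thresholds programmatically. Hugging Face's Trainer saves a `log_history` list inside each checkpoint's `trainer_state.json`; the sketch below applies the heuristics above to it (the checkpoint path in the comment is an example, adjust to your run).

```python
# check_loss.py -- flag the failure modes described above from a Trainer
# log history (the `log_history` list in a checkpoint's trainer_state.json).
import json

def diagnose(log_history: list[dict]) -> str:
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if not losses:
        return "no loss entries logged"
    final = losses[-1]
    if final < 0.3:
        return "likely overfitting: reduce epochs or raise dropout"
    if final > 1.5:
        return "not learning well: check data formatting"
    return "looks healthy"

# Example usage (path is an assumption; point it at your checkpoint):
# state = json.load(open("output/checkpoint-200/trainer_state.json"))
# print(diagnose(state["log_history"]))
```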
Step 4: Evaluation
Before exporting, test the fine-tuned model:
```python
# evaluate.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output/final",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable fast inference

test_questions = [
    "How do I reset my password?",
    "Our dashboards are loading slowly, what should we check?",
    "I want to cancel my subscription and get a refund.",
    "Can CloudDash monitor my Kubernetes clusters?",
    "I'm being charged for hosts I already removed.",
]

system = (
    "You are CloudDash Support, a helpful and professional customer "
    "support assistant for CloudDash, a cloud infrastructure monitoring "
    "platform."
)

for question in test_questions:
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(f"\nQ: {question}")
    print(f"A: {response}")
    print("-" * 60)
```
What to Look For
- Tone consistency: Does the model maintain the CloudDash support voice?
- Accuracy: Does it reference correct product features and processes?
- Appropriate escalation: Does it know when to hand off to a human?
- Length calibration: Are responses similar in length to your training examples?
Compare responses with the base Qwen3 8B (without fine-tuning) to confirm improvement. The fine-tuned model should produce answers that are clearly more relevant, more consistent in tone, and more knowledgeable about the specific product.
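A small harness makes this comparison systematic. The sketch below is backend-agnostic: the two generate functions are placeholders you wire to the base and fine-tuned models yourself (for example, by wrapping the generation loop from evaluate.py, or by calling Ollama's API once both models are deployed).

```python
# compare.py -- run the same prompts through two models side by side.
# The generate callables are placeholders: wire them to your base and
# fine-tuned models.
from typing import Callable

def compare(prompts: list[str],
            base_generate: Callable[[str], str],
            tuned_generate: Callable[[str], str]) -> list[dict]:
    """Collect paired answers for each prompt."""
    return [
        {"prompt": p, "base": base_generate(p), "tuned": tuned_generate(p)}
        for p in prompts
    ]

def show(results: list[dict]) -> None:
    """Print a truncated side-by-side view for manual review."""
    for r in results:
        print(f"Q: {r['prompt']}")
        print(f"  base : {r['base'][:200]}")
        print(f"  tuned: {r['tuned'][:200]}")
        print("-" * 60)
```

Reviewing paired outputs, rather than the fine-tuned model alone, keeps you honest about whether the improvement is real.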
Step 5: Export to GGUF
Ollama uses the GGUF format. Unsloth can export directly:
```python
# export.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output/final",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Export merged model in GGUF format.
# Q5_K_M is a good balance of quality and size for an 8B model.
model.save_pretrained_gguf(
    "clouddash-support",
    tokenizer,
    quantization_method="q5_k_m",
)

print("GGUF export complete: clouddash-support-Q5_K_M.gguf")
```
Available quantization options and their trade-offs:
| Quantization | Approx. File Size (8B) | Quality Loss | Recommended For |
|---|---|---|---|
| Q8_0 | ~8.5 GB | Minimal | When VRAM is not a concern |
| Q6_K | ~6.7 GB | Very small | Best quality/size balance |
| Q5_K_M | ~5.8 GB | Small | Default recommendation |
| Q4_K_M | ~5.0 GB | Moderate | When VRAM is tight |
| Q3_K_M | ~4.0 GB | Noticeable | Minimum viable quality |
For a customer support model where accuracy matters, use Q5_K_M or Q6_K. Do not go below Q4_K_M.
Step 6: Deploy to Ollama
Create a Modelfile for Ollama:
```
# Modelfile
FROM ./clouddash-support-Q5_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"

SYSTEM """You are CloudDash Support, a helpful and professional customer support assistant for CloudDash, a cloud infrastructure monitoring platform. You are knowledgeable about CloudDash features, billing, and troubleshooting. You are friendly but concise. When you cannot help with something, you escalate to a human agent."""

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
```
Build and run:
```bash
# Create the Ollama model
ollama create clouddash-support -f Modelfile

# Test it
ollama run clouddash-support "How do I set up Kubernetes monitoring?"

# Serve via API
curl http://localhost:11434/api/generate -d '{
  "model": "clouddash-support",
  "prompt": "I need help with my billing"
}'
```
Step 7: Integration
The fine-tuned model is now accessible through Ollama's native API. (Ollama also exposes an OpenAI-compatible endpoint under /v1/, so existing OpenAI client code works too.) You can integrate it into your existing support infrastructure:
```python
# Example: integrate with a support ticket system
import requests

def get_ai_response(customer_message: str, ticket_history: list[str] | None = None) -> str:
    """Get an AI-drafted response for a support ticket."""
    context = ""
    if ticket_history:
        context = "Previous messages in this ticket:\n"
        context += "\n".join(ticket_history)
        context += "\n\nLatest customer message:\n"
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "clouddash-support",
            "messages": [
                {
                    "role": "user",
                    "content": f"{context}{customer_message}",
                }
            ],
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Use in your support workflow
draft = get_ai_response(
    "My alerts stopped working after I upgraded to the Pro plan yesterday"
)
print(draft)
# Agent reviews the draft, edits if needed, sends to customer
```
Common Pitfalls and Solutions
Pitfall: Model forgets general knowledge. Fine-tuning on a narrow domain can cause “catastrophic forgetting” of general capabilities. Mitigate by mixing 10-20% general instruction-following examples into your training data.
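The mixing step is easy to script. The sketch below assumes a general instruction dataset already converted to the same "conversations" JSONL format; the general-data file name is a placeholder.

```python
# mix_general.py -- blend general instruction data into the domain set to
# reduce catastrophic forgetting. Assumes both files use the same
# "conversations" JSONL format; the general-data file name is a placeholder.
import json
import random

def mix_datasets(domain_path: str, general_path: str,
                 general_fraction: float = 0.15, seed: int = 42) -> list[dict]:
    """Return all domain examples plus enough sampled general examples to
    make up roughly `general_fraction` of the final mix."""
    def load(path):
        with open(path) as f:
            return [json.loads(line) for line in f]
    domain = load(domain_path)
    general = load(general_path)
    # Solve n_general / (n_domain + n_general) ~= general_fraction
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# mixed = mix_datasets("training_data.jsonl", "general_instructions.jsonl")
```

Re-save the mixed list to JSONL and train on that file instead of the domain-only one.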
Pitfall: Overfitting to training examples. If the model starts reciting training examples verbatim, you have overfit. Reduce epochs, increase dropout, or add more diverse training data.
Pitfall: Wrong chat template. If the model produces garbled output, the chat template in the Modelfile likely does not match the base model’s expected format. Check the base model’s documentation.
Pitfall: GGUF export fails. Ensure you have enough RAM (the merging and export process temporarily needs the full model in memory). 32GB of system RAM should suffice for an 8B model.
Pitfall: “The model is worse than the base model.” This usually means data quality issues. Review your training examples. Remove any that are contradictory, poorly written, or off-topic. A smaller, cleaner dataset beats a larger messy one every time.
Total Time Breakdown
| Phase | Time |
|---|---|
| Dataset preparation | 1-2 hours (assuming you have source data) |
| Environment setup | 15 minutes |
| Training | 1.5-3.5 hours |
| Evaluation | 30 minutes |
| GGUF export | 10 minutes |
| Ollama deployment | 5 minutes |
| Total | 3.5-6.5 hours |
The “4 hours” in the title assumes you have your dataset ready and are using an RTX 3090 or better. The investment pays off quickly — a well-tuned support model can draft responses for 80% of common tickets, with human agents reviewing and sending.
Questions about fine-tuning? Join our community — several members have shared their fine-tuning recipes for various use cases.