Fine-tuning used to be something only well-funded teams could do. You needed A100s, distributed training frameworks, and deep expertise in gradient accumulation and learning rate scheduling. Unsloth changed that. With QLoRA and Unsloth’s optimized training kernels, you can fine-tune a 7-14B parameter model on a single consumer GPU in a few hours and produce a model that genuinely outperforms the base model on your specific task.
This walkthrough takes you from raw customer support data to a deployed, fine-tuned model running in Ollama. No hand-waving. No “left as an exercise.” Every step, every config, every decision explained.
What We Are Building
We are fine-tuning Qwen3 8B into a customer support assistant for "CloudDash," a fictional SaaS cloud monitoring platform. The fine-tuned model will:
- Answer product questions accurately using our documentation
- Follow a consistent, professional tone
- Handle common support scenarios (billing, outages, feature requests, troubleshooting)
- Escalate appropriately when it cannot help
By the end, we will have a GGUF file running in Ollama that handles CloudDash support queries better than the base Qwen3 8B.
Hardware Requirements
- Minimum: GPU with 12GB VRAM (RTX 3060 12GB, RTX 4070)
- Recommended: GPU with 24GB VRAM (RTX 3090, RTX 4090)
- RAM: 32GB system RAM
- Storage: 50GB free disk space
- Time: 2-4 hours for training (depending on GPU and dataset size)
This tutorial was tested on an RTX 3090 (24GB). On 12GB VRAM, you may need to reduce batch size.
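If you are unsure whether your card has enough headroom, a back-of-envelope estimate helps. The sketch below is a rough heuristic, not a measurement: the overhead constant is an assumption that varies with sequence length, batch size, and framework version.

```python
# Back-of-envelope VRAM estimate for QLoRA fine-tuning.
# These are rough heuristics, not exact numbers.

def estimate_qlora_vram_gb(n_params_b: float, lora_rank: int = 32,
                           overhead_gb: float = 4.0) -> float:
    """Very rough VRAM estimate in GB for a QLoRA run.

    n_params_b: base model size in billions of parameters.
    overhead_gb: assumed headroom for activations, adapter gradients,
                 optimizer state, and CUDA context (an assumption).
    """
    weights_gb = n_params_b * 0.5                  # 4-bit weights: ~0.5 bytes/param
    lora_gb = n_params_b * 0.02 * lora_rank / 32   # adapters are tiny (rough)
    return weights_gb + lora_gb + overhead_gb

# print(f"~{estimate_qlora_vram_gb(8):.1f} GB")  # 8B model: roughly 8 GB plus headroom
```

In practice an 8B model in 4-bit fits comfortably in 12-16 GB at modest batch sizes, which matches the hardware table above.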
Step 1: Environment Setup
```bash
# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune

# Install Unsloth (follow their latest install instructions for your CUDA version)
pip install unsloth
pip install datasets transformers trl

# Verify GPU access
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Step 2: Prepare the Dataset
This is the most important step. The quality of your fine-tuning data determines the quality of your model. Garbage in, garbage out — and with fine-tuning, this is not a metaphor.
Dataset Format
We store each example as a list of messages, which the tokenizer's chat template renders into ChatML, the prompt format used by Qwen and many other instruction-tuned models:
```json
{
  "conversations": [
    {"role": "system", "content": "You are CloudDash Support, a helpful customer support assistant for CloudDash cloud monitoring platform."},
    {"role": "user", "content": "How do I set up alerts for CPU usage?"},
    {"role": "assistant", "content": "To set up CPU usage alerts in CloudDash:\n\n1. Navigate to Monitoring > Alert Rules\n2. Click 'Create New Rule'\n3. Select 'CPU Utilization' from the metric dropdown\n4. Set your threshold (e.g., 90% for 5 minutes)\n5. Choose notification channels (email, Slack, PagerDuty)\n6. Click 'Save Rule'\n\nThe alert will trigger when CPU usage exceeds your threshold for the specified duration. You can also set warning thresholds at a lower level (e.g., 75%) for early notification.\n\nWould you like help configuring a specific notification channel?"}
  ]
}
```
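Under the hood, the tokenizer's chat template turns this JSON into plain text with special tokens. For Qwen-family models the rendering looks roughly like this (illustrative and abbreviated; the exact output depends on the model's template):

```
<|im_start|>system
You are CloudDash Support, a helpful customer support assistant for CloudDash cloud monitoring platform.<|im_end|>
<|im_start|>user
How do I set up alerts for CPU usage?<|im_end|>
<|im_start|>assistant
To set up CPU usage alerts in CloudDash:
...<|im_end|>
```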
Building the Dataset
For a customer support model, your training data should include:
- Real support conversations (anonymized) — the gold standard
- FAQ pairs generated from your documentation
- Edge cases — billing disputes, angry customers, out-of-scope requests
- Escalation examples — knowing when to hand off to a human
Here is a Python script to structure your data:
```python
# prepare_dataset.py
import json

def create_training_example(system_prompt, user_msg, assistant_msg):
    return {
        "conversations": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    }

SYSTEM_PROMPT = (
    "You are CloudDash Support, a helpful and professional customer support "
    "assistant for CloudDash, a cloud infrastructure monitoring platform. "
    "You are knowledgeable about CloudDash features, billing, and "
    "troubleshooting. You are friendly but concise. When you cannot help "
    "with something, you escalate to a human agent."
)

# Example: build from structured data sources
training_data = []

# From FAQ documents
faqs = [
    {
        "q": "What pricing plans do you offer?",
        "a": "CloudDash offers three plans:\n\n- **Starter** ($29/month): Up to 10 hosts, 1-day data retention, email alerts\n- **Pro** ($99/month): Up to 100 hosts, 30-day retention, all notification channels, custom dashboards\n- **Enterprise** (custom pricing): Unlimited hosts, 1-year retention, SSO, dedicated support\n\nAll plans include a 14-day free trial. Would you like to start a trial or do you have questions about a specific plan?"
    },
    # ... hundreds more FAQ pairs
]

for faq in faqs:
    training_data.append(
        create_training_example(SYSTEM_PROMPT, faq["q"], faq["a"])
    )

# From real support conversations (anonymized)
# Load from your support ticket system export
# ...

# Save in JSONL format
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Created {len(training_data)} training examples")
```
Dataset Size Guidelines
- Minimum viable: 200-500 examples (you will see some improvement)
- Good results: 1,000-3,000 examples (noticeable behavior change)
- Excellent results: 5,000-10,000 examples (strong domain specialization)
- Diminishing returns: Beyond 10,000 examples for a single task
For this tutorial, we use 2,000 examples. Quality matters more than quantity — 500 excellent examples beat 5,000 sloppy ones.
Data Quality Checklist
Before training, audit your dataset:
- Consistent tone and style across all assistant responses
- No contradictory information between examples
- Responses are the length you want the model to produce (it learns length patterns)
- Edge cases are represented (refusals, escalations, multi-turn conversations)
- No personally identifiable information (PII) in the training data
- System prompt is consistent across all examples
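Parts of this checklist can be automated. The sketch below audits the `training_data.jsonl` file produced by the earlier script; the PII pattern is illustrative only (a real audit needs more than an email regex), and the checks are a starting point, not a complete validator.

```python
# audit_dataset.py -- minimal automated checks for the checklist above.
# The PII regex is illustrative, not exhaustive.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def audit(path: str) -> dict:
    system_prompts = set()
    assistant_lengths = []
    pii_hits = 0
    with open(path) as f:
        for line in f:
            convo = json.loads(line)["conversations"]
            for msg in convo:
                if msg["role"] == "system":
                    system_prompts.add(msg["content"])
                if msg["role"] == "assistant":
                    assistant_lengths.append(len(msg["content"]))
                if EMAIL_RE.search(msg["content"]):
                    pii_hits += 1  # flag for manual review
    return {
        "consistent_system_prompt": len(system_prompts) <= 1,
        "avg_assistant_chars": sum(assistant_lengths) / max(len(assistant_lengths), 1),
        "possible_pii": pii_hits,
    }

# print(audit("training_data.jsonl"))
```

Run it before every training run; a surprising average response length or a second system prompt sneaking in are both easy to miss by eye.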
Step 3: Fine-Tuning with Unsloth
```python
# train.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ============================================================
# Model Configuration
# ============================================================
model_name = "unsloth/Qwen3-8B"  # Unsloth-optimized version
max_seq_length = 4096
load_in_4bit = True  # QLoRA: load base model in 4-bit

# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)

# ============================================================
# LoRA Configuration
# ============================================================
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                  # LoRA rank: higher = more capacity, more VRAM
    lora_alpha=64,         # Scaling factor, typically 2x rank
    lora_dropout=0.05,     # Slight dropout for regularization
    target_modules=[       # Which layers to fine-tune
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)

model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# ============================================================
# Load Dataset
# ============================================================
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Format conversations into the chat template
def format_conversation(example):
    """Convert our conversation format to the model's chat template."""
    text = tokenizer.apply_chat_template(
        example["conversations"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_conversation)

# ============================================================
# Training Configuration
# ============================================================
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,              # 3 epochs is usually sufficient
        per_device_train_batch_size=4,   # Reduce to 2 if OOM on 12GB
        gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
        learning_rate=2e-4,              # Standard for QLoRA
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        logging_steps=10,
        save_steps=200,
        save_total_limit=3,
        fp16=True,                       # Use bf16=True if your GPU supports it
        optim="adamw_8bit",              # 8-bit Adam saves VRAM
        seed=42,
    ),
)

# ============================================================
# Train
# ============================================================
print("Starting training...")
trainer.train()

# Save the LoRA adapter
model.save_pretrained("./output/final")
tokenizer.save_pretrained("./output/final")
print("Training complete! Model saved to ./output/final")
```
Training Time Expectations
| GPU | 2,000 examples, 3 epochs | 5,000 examples, 3 epochs |
|---|---|---|
| RTX 3060 12GB | ~3.5 hours | ~8 hours |
| RTX 3090 24GB | ~1.5 hours | ~4 hours |
| RTX 4090 24GB | ~45 minutes | ~2 hours |
Monitoring Training
Watch the training loss. A healthy training run looks like:
- Step 0-50: Loss drops rapidly from ~2.5 to ~1.5
- Step 50-200: Loss decreases steadily to ~0.8-1.0
- Step 200+: Loss plateaus around 0.6-0.9
If the loss drops below 0.3, you are likely overfitting. Reduce epochs or increase dropout.
If the loss does not drop below 1.5, check your data formatting; the model is probably failing to learn because examples are malformed (for instance, an incorrectly applied chat template).
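You can check these thresholds programmatically. Hugging Face's Trainer saves a `log_history` list inside each checkpoint's `trainer_state.json`; the sketch below applies the heuristics above to it (the checkpoint path in the comment is an example, adjust to your run).

```python
# check_loss.py -- flag the failure modes described above from a Trainer
# log history (the `log_history` list in a checkpoint's trainer_state.json).
import json

def diagnose(log_history: list[dict]) -> str:
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if not losses:
        return "no loss entries logged"
    final = losses[-1]
    if final < 0.3:
        return "likely overfitting: reduce epochs or raise dropout"
    if final > 1.5:
        return "not learning well: check data formatting"
    return "looks healthy"

# Example usage (path is an assumption; point it at your checkpoint):
# state = json.load(open("output/checkpoint-200/trainer_state.json"))
# print(diagnose(state["log_history"]))
```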
Step 4: Evaluation
Before exporting, test the fine-tuned model:
```python
# evaluate.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output/final",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable fast inference

test_questions = [
    "How do I reset my password?",
    "Our dashboards are loading slowly, what should we check?",
    "I want to cancel my subscription and get a refund.",
    "Can CloudDash monitor my Kubernetes clusters?",
    "I'm being charged for hosts I already removed.",
]

system = (
    "You are CloudDash Support, a helpful and professional customer "
    "support assistant for CloudDash, a cloud infrastructure monitoring "
    "platform."
)

for question in test_questions:
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(f"\nQ: {question}")
    print(f"A: {response}")
    print("-" * 60)
```
What to Look For
- Tone consistency: Does the model maintain the CloudDash support voice?
- Accuracy: Does it reference correct product features and processes?
- Appropriate escalation: Does it know when to hand off to a human?
- Length calibration: Are responses similar in length to your training examples?
Compare responses with the base Qwen3 8B (without fine-tuning) to confirm improvement. The fine-tuned model should produce answers that are clearly more relevant, more consistent in tone, and more knowledgeable about the specific product.
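A small harness makes this comparison systematic. The sketch below is backend-agnostic: the two generate functions are placeholders you wire to the base and fine-tuned models yourself (for example, by wrapping the generation loop from evaluate.py, or by calling Ollama's API once both models are deployed).

```python
# compare.py -- run the same prompts through two models side by side.
# The generate callables are placeholders: wire them to your base and
# fine-tuned models.
from typing import Callable

def compare(prompts: list[str],
            base_generate: Callable[[str], str],
            tuned_generate: Callable[[str], str]) -> list[dict]:
    """Collect paired answers for each prompt."""
    return [
        {"prompt": p, "base": base_generate(p), "tuned": tuned_generate(p)}
        for p in prompts
    ]

def show(results: list[dict]) -> None:
    """Print a truncated side-by-side view for manual review."""
    for r in results:
        print(f"Q: {r['prompt']}")
        print(f"  base : {r['base'][:200]}")
        print(f"  tuned: {r['tuned'][:200]}")
        print("-" * 60)
```

Reviewing paired outputs, rather than the fine-tuned model alone, keeps you honest about whether the improvement is real.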
Step 5: Export to GGUF
Ollama uses the GGUF format. Unsloth can export directly:
```python
# export.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output/final",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Export merged model in GGUF format.
# Q5_K_M is a good balance of quality and size for an 8B model.
model.save_pretrained_gguf(
    "clouddash-support",
    tokenizer,
    quantization_method="q5_k_m",
)

print("GGUF export complete: clouddash-support-Q5_K_M.gguf")
```
Available quantization options and their trade-offs:
| Quantization | Approx. File Size (8B) | Quality Loss | Recommended For |
|---|---|---|---|
| Q8_0 | ~8.5 GB | Minimal | When VRAM is not a concern |
| Q6_K | ~6.7 GB | Very small | Best quality/size balance |
| Q5_K_M | ~5.8 GB | Small | Default recommendation |
| Q4_K_M | ~5.0 GB | Moderate | When VRAM is tight |
| Q3_K_M | ~4.0 GB | Noticeable | Minimum viable quality |
For a customer support model where accuracy matters, use Q5_K_M or Q6_K. Do not go below Q4_K_M.
Step 6: Deploy to Ollama
Create a Modelfile for Ollama:
```
# Modelfile
FROM ./clouddash-support-Q5_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"

SYSTEM """You are CloudDash Support, a helpful and professional customer support assistant for CloudDash, a cloud infrastructure monitoring platform. You are knowledgeable about CloudDash features, billing, and troubleshooting. You are friendly but concise. When you cannot help with something, you escalate to a human agent."""

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
```
Build and run:
```bash
# Create the Ollama model
ollama create clouddash-support -f Modelfile

# Test it
ollama run clouddash-support "How do I set up Kubernetes monitoring?"

# Serve via API
curl http://localhost:11434/api/generate -d '{
  "model": "clouddash-support",
  "prompt": "I need help with my billing"
}'
```
Step 7: Integration
The fine-tuned model is now accessible through Ollama's native API. (Ollama also exposes an OpenAI-compatible endpoint under /v1/, so existing OpenAI client code works too.) You can integrate it into your existing support infrastructure:
```python
# Example: integrate with a support ticket system
import requests

def get_ai_response(customer_message: str, ticket_history: list[str] | None = None) -> str:
    """Get an AI-drafted response for a support ticket."""
    context = ""
    if ticket_history:
        context = "Previous messages in this ticket:\n"
        context += "\n".join(ticket_history)
        context += "\n\nLatest customer message:\n"
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "clouddash-support",
            "messages": [
                {
                    "role": "user",
                    "content": f"{context}{customer_message}",
                }
            ],
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Use in your support workflow
draft = get_ai_response(
    "My alerts stopped working after I upgraded to the Pro plan yesterday"
)
print(draft)
# Agent reviews the draft, edits if needed, sends to customer
```
Common Pitfalls and Solutions
Pitfall: Model forgets general knowledge. Fine-tuning on a narrow domain can cause “catastrophic forgetting” of general capabilities. Mitigate by mixing 10-20% general instruction-following examples into your training data.
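The mixing step is easy to script. The sketch below assumes a general instruction dataset already converted to the same "conversations" JSONL format; the general-data file name is a placeholder.

```python
# mix_general.py -- blend general instruction data into the domain set to
# reduce catastrophic forgetting. Assumes both files use the same
# "conversations" JSONL format; the general-data file name is a placeholder.
import json
import random

def mix_datasets(domain_path: str, general_path: str,
                 general_fraction: float = 0.15, seed: int = 42) -> list[dict]:
    """Return all domain examples plus enough sampled general examples to
    make up roughly `general_fraction` of the final mix."""
    def load(path):
        with open(path) as f:
            return [json.loads(line) for line in f]
    domain = load(domain_path)
    general = load(general_path)
    # Solve n_general / (n_domain + n_general) ~= general_fraction
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# mixed = mix_datasets("training_data.jsonl", "general_instructions.jsonl")
```

Re-save the mixed list to JSONL and train on that file instead of the domain-only one.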
Pitfall: Overfitting to training examples. If the model starts reciting training examples verbatim, you have overfit. Reduce epochs, increase dropout, or add more diverse training data.
Pitfall: Wrong chat template. If the model produces garbled output, the chat template in the Modelfile likely does not match the base model’s expected format. Check the base model’s documentation.
Pitfall: GGUF export fails. Ensure you have enough RAM (the merging and export process temporarily needs the full model in memory). 32GB of system RAM should suffice for an 8B model.
Pitfall: “The model is worse than the base model.” This usually means data quality issues. Review your training examples. Remove any that are contradictory, poorly written, or off-topic. A smaller, cleaner dataset beats a larger messy one every time.
Total Time Breakdown
| Phase | Time |
|---|---|
| Dataset preparation | 1-2 hours (assuming you have source data) |
| Environment setup | 15 minutes |
| Training | 1.5-3.5 hours |
| Evaluation | 30 minutes |
| GGUF export | 10 minutes |
| Ollama deployment | 5 minutes |
| Total | 3.5-6.5 hours |
The “4 hours” in the title assumes you have your dataset ready and are using an RTX 3090 or better. The investment pays off quickly — a well-tuned support model can draft responses for 80% of common tickets, with human agents reviewing and sending.
Questions about fine-tuning? Join our community — several members have shared their fine-tuning recipes for various use cases.