llm-finetuning
LLM Fine-Tuning with Unsloth (Done Right)
Fine-tune small open-weight LLMs locally for text tasks. Unsloth +
QLoRA makes this practical on consumer GPUs (8 GB VRAM). The workflow:
pick a model, format data as instruction-tuning pairs, fine-tune with
TRL's SFTTrainer, evaluate against zero-shot baseline, log to MLflow,
export to GGUF for llama.cpp deployment.
When to use this skill
- Input is text (articles, tickets, emails, reviews) — not tabular features
- You have at least a few hundred labeled examples
- You want model ownership — no API dependency, no per-token cost, data stays on-device
- You want a single model that can handle related text tasks later (classification today, extraction tomorrow)
When NOT to use this skill
- Input is tabular (numbers, categories) → use XGBoost (see binary/multiclass/multilabel classification skills)
- You have < 50 labeled examples → use zero-shot or few-shot prompting with a larger model via API
- You need state-of-the-art quality and cost doesn't matter → use Claude/GPT API with prompt engineering
- You're doing unsupervised text analysis → use embeddings + clustering, not fine-tuning
Model selection
Pick the largest model that fits in your VRAM with QLoRA (4-bit). Bigger models learn faster and generalize better, but the returns diminish. For most text classification tasks, 1-4B is plenty.
| Model | Params | VRAM (QLoRA) | Unsloth ID | Notes |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | ~4 GB | unsloth/Qwen3-0.6B | Smallest, fastest iteration |
| Llama-3.2 1B | 1B | ~6 GB | unsloth/Llama-3.2-1B-Instruct | Good quality/size ratio |
| Gemma-4 E2B | 2B | 8-10 GB | unsloth/gemma-4-E2B-it | Default. Strong for its size |
| Llama-3.2 3B | 3B | ~10 GB | unsloth/Llama-3.2-3B-Instruct | Benchmark vs Gemma-4 E2B |
| Phi-4 mini | 3.8B | ~12 GB | unsloth/Phi-4-mini-instruct | Needs 16+ GB VRAM |
| Gemma-4 E4B | 4B | ~17 GB | unsloth/gemma-4-E4B-it | Needs 24+ GB VRAM |
Chat templates and mask tokens
Each model family has its own chat format. When adding a new model, you need three things:
- Unsloth chat template name — passed to get_chat_template()
- Instruction mask — the token sequence that starts a user turn
- Response mask — the token sequence that starts a model turn
These are used by train_on_responses_only() to ensure the model only
learns to predict responses, not to parrot the prompt.
| Model family | Template | Instruction part | Response part |
|---|---|---|---|
| Gemma 4 | gemma-4-thinking | <\|turn>user\n | <\|turn>model\n |
| Llama 3.x | llama-3.1 | <\|start_header_id\|>user<\|end_header_id\|>\n\n | <\|start_header_id\|>assistant<\|end_header_id\|>\n\n |
| Qwen 2.5/3 | qwen-2.5 | <\|im_start\|>user\n | <\|im_start\|>assistant\n |
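When adding a model that isn't in this table, a quick way to find the two mask strings is to render a toy conversation with the template applied and read off whatever opens each turn. A minimal sketch, assuming tokenizer is already loaded for your model (as in the training pipeline below) and using the Gemma template as the example:

from unsloth.chat_templates import get_chat_template

# Apply the template you intend to train with, then render a toy exchange
# without tokenizing. The strings that open the user and model turns are the
# instruction_part and response_part to pass to train_on_responses_only().
tokenizer = get_chat_template(tokenizer, chat_template="gemma-4-thinking")
demo = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "world"},
]
print(tokenizer.apply_chat_template(demo, tokenize=False, add_generation_prompt=False))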
Data formatting
Frame every text task as instruction-tuning: the user message is the input text with a task prompt, the assistant message is the label or output.
Text classification example
messages = [
    {"role": "user", "content": (
        "Classify this news article into exactly one category: "
        "World, Sports, Business, Sci/Tech.\n\n"
        f"{article_text}"
    )},
    {"role": "assistant", "content": "Sports"},
]
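The same framing covers related text tasks. For instance, the structured-extraction variant used in the NHTSA demo later in this document swaps the single label for a small JSON object; the field names below are illustrative, and complaint_text stands in for the raw input text the way article_text does above:

import json

# Multi-target extraction framed as instruction tuning: the assistant turn is a
# JSON string, parsed at evaluation time instead of matched against a label list.
messages = [
    {"role": "user", "content": (
        "Read this vehicle complaint and respond with JSON containing "
        '"fire" (true/false), "crash" (true/false), and "component" (string).\n\n'
        f"{complaint_text}"
    )},
    {"role": "assistant", "content": json.dumps(
        {"fire": False, "crash": True, "component": "BRAKES"}
    )},
]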
Formatting for SFTTrainer
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="gemma-4-thinking")

def format_example(row):
    messages = [
        {"role": "user", "content": f"<your prompt>\n\n{row['text']}"},
        {"role": "assistant", "content": row["label_name"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False,
    )
    # Remove BOS — SFTTrainer adds its own
    bos = tokenizer.bos_token or ""
    if bos and text.startswith(bos):
        text = text[len(bos):]
    return {"formatted_text": text}

formatted = dataset.map(format_example)
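Before training, hold out an evaluation split so the zero-shot vs fine-tuned comparison later runs on examples the model never saw. A minimal sketch with the datasets library (the 10% split size is an arbitrary choice); if you adopt it, pass train_ds to SFTTrainer below instead of the full formatted dataset:

# 90/10 split; the eval split is scored twice, once with the base model
# (zero-shot baseline) and once with the fine-tuned model.
split = formatted.train_test_split(test_size=0.1, seed=3407)
train_ds = split["train"]
eval_ds = split["test"]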
Training pipeline
1. Load model with QLoRA
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    dtype=None,            # auto-detect
    max_seq_length=2048,   # keep short for classification
    load_in_4bit=True,     # QLoRA
    full_finetuning=False,
)
2. Add LoRA adapters
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,              # LoRA rank — 8 is a good default
    lora_alpha=8,     # usually equal to r
    lora_dropout=0,
    bias="none",
    random_state=3407,
)
3. Train with SFTTrainer
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted,
    args=SFTConfig(
        output_dir="/tmp/llm-ft",
        dataset_text_field="formatted_text",
        per_device_train_batch_size=1,   # 1 for 8 GB VRAM
        gradient_accumulation_steps=4,   # effective batch = 4
        warmup_steps=5,
        max_steps=60,                    # tune based on dataset size
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        save_strategy="no",
        report_to="none",                # we log to MLflow manually
    ),
)

# Only compute loss on assistant responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",    # model-specific
    response_part="<|turn>model\n",      # model-specific
)

stats = trainer.train()
Key training params to tune
| Param | Default | When to change |
|---|---|---|
| max_steps | 60 | More data → more steps. Aim for 1-3 epochs over the dataset. |
| learning_rate | 2e-4 | Lower (1e-4) if loss is unstable. |
| r (LoRA rank) | 8 | Increase to 16-32 for harder tasks. More capacity but slower. |
| max_seq_length | 2048 | Increase for long documents. Costs more VRAM. |
| gradient_accumulation_steps | 4 | Increase for more stable gradients at the cost of speed. |
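A concrete way to set max_steps from the first row: one optimizer step consumes per_device_train_batch_size * gradient_accumulation_steps examples, so N examples for E epochs needs roughly N * E / effective_batch steps. A small worked example under the defaults above:

import math

per_device_batch = 1    # from SFTConfig above
grad_accum = 4          # from SFTConfig above
epochs = 2              # 1-3 epochs is the usual range for small datasets
n_examples = 1000       # replace with len(train_ds)

effective_batch = per_device_batch * grad_accum               # 4 examples per step
max_steps = math.ceil(n_examples * epochs / effective_batch)
print(max_steps)                                              # 500 for this example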
Evaluation
Always compare zero-shot vs fine-tuned
The most important evaluation is: did fine-tuning actually help? Run the base model (before training) on the eval set, then run the fine-tuned model. Compare accuracy and F1 macro.
import torch

def classify(model, tokenizer, text, label_names):
    messages = [{"role": "user", "content": f"Classify: {', '.join(label_names)}.\n\n{text}"}]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True,
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids=inputs, max_new_tokens=20, do_sample=False)
    response = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True).strip()
    for idx, name in enumerate(label_names):
        if name.lower() in response.lower():
            return idx
    return -1  # parse failure
Metrics
Same metrics as multiclass classification:
- F1 macro — primary metric, surfaces rare-class failures
- Per-class F1 — each class has its own difficulty
- Confusion matrix — which classes get confused
- Parse rate — fraction of responses that contain a valid label. Low parse rate means the model isn't following the format.
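Putting classify() and these metrics together, a sketch of the evaluation loop with scikit-learn; it assumes eval_ds keeps a raw text column and an integer label column (adjust the column names to your dataset), and it computes accuracy/F1 only over responses that parsed, which is exactly why parse rate must be reported alongside them:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate(model, tokenizer, eval_ds, label_names):
    preds = np.array([classify(model, tokenizer, row["text"], label_names) for row in eval_ds])
    labels = np.array([row["label"] for row in eval_ds])
    parsed = preds != -1    # -1 means classify() found no valid label in the output

    metrics = {"parse_rate": float(parsed.mean())}
    if parsed.any():
        metrics["accuracy"] = accuracy_score(labels[parsed], preds[parsed])
        metrics["f1_macro"] = f1_score(labels[parsed], preds[parsed], average="macro")
        cm = confusion_matrix(labels[parsed], preds[parsed])
    else:
        cm = None
    return metrics, cm

# Run once with the base model (zero-shot baseline) and once after fine-tuning.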
Parse failures
If the model outputs "This article is about sports and entertainment"
instead of just "Sports", parsing fails. Fine-tuning should fix this —
the model learns the exact output format from training data. If parse
rate stays low after fine-tuning, increase max_steps or check the
training data format.
MLflow conventions
Log every training run to MLflow for comparison across models and hyperparameters.
import mlflow

mlflow.set_experiment("llm-finetuning")

with mlflow.start_run(run_name="gemma-4-E2B-it"):
    mlflow.log_params({
        "model_name": "unsloth/gemma-4-E2B-it",
        "lora_r": 8,
        "max_steps": 60,
        "learning_rate": 2e-4,
        "max_seq_length": 2048,
    })
    # ... train ...
    mlflow.log_metric("train_loss", stats.training_loss)
    mlflow.log_metrics({
        "zs_accuracy": zs_acc,
        "zs_f1_macro": zs_f1,
        "ft_accuracy": ft_acc,
        "ft_f1_macro": ft_f1,
    })
Compare runs: mlflow ui --port 5000
What to compare across runs
- Model size vs F1: does the 2B model beat the 0.6B?
- Steps vs F1: diminishing returns? overfitting?
- LoRA rank vs F1: does r=16 beat r=8?
- Zero-shot delta: how much did fine-tuning actually buy?
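Beyond the UI, the same comparisons can be pulled programmatically; a small sketch using mlflow.search_runs, where the params./metrics. column names match what was logged above:

import mlflow

# Returns a pandas DataFrame with one row per run; logged params and metrics
# appear as "params.*" and "metrics.*" columns.
runs = mlflow.search_runs(experiment_names=["llm-finetuning"])
cols = [
    "tags.mlflow.runName",
    "params.model_name", "params.lora_r", "params.max_steps",
    "metrics.zs_f1_macro", "metrics.ft_f1_macro",
]
print(runs[cols].sort_values("metrics.ft_f1_macro", ascending=False))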
GGUF export
After fine-tuning, merge LoRA adapters and quantize to GGUF:
model.save_pretrained_gguf(
    "finetuned-model",
    tokenizer,
    quantization_method="q4_k_m",
)
Quantization options:
- q4_k_m — 4-bit, recommended default (~3 GB for E2B)
- q8_0 — 8-bit, higher quality (~5 GB for E2B)
- f16 — full precision, no loss (~9 GB for E2B)
Inference with llama.cpp
llama-server -m finetuned-model/unsloth.Q4_K_M.gguf \
--port 8080 --ctx-size 2048
Then query via the OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role":"user","content":"Classify: World, Sports, Business, Sci/Tech.\n\nNASA launches new Mars rover..."}],
    "max_tokens": 20,
    "temperature": 0
  }'
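For programmatic scoring the same endpoint works from Python; a minimal sketch with the requests library, mirroring the curl call above:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": "Classify: World, Sports, Business, Sci/Tech.\n\n"
                       "NASA launches new Mars rover...",
        }],
        "max_tokens": 20,
        "temperature": 0,
    },
    timeout=60,
)
# llama-server returns the standard OpenAI chat-completion shape.
print(resp.json()["choices"][0]["message"]["content"])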
Pitfalls
- Wrong chat template — if you use template X for training and template Y for inference, the model will produce garbage. Always use the same template for both.
- Forgetting train_on_responses_only — without it, the model also learns to generate the user prompt, wasting capacity and often degrading classification quality.
- Too many steps → overfitting — with small datasets (< 1000 examples), 1-2 epochs is enough. Watch training loss: if it flatlines near zero, you're overfitting.
- BOS token duplication — SFTTrainer adds BOS automatically. If your formatted text already starts with BOS, strip it.
- Ignoring parse failures — if 30% of responses don't parse, your accuracy numbers are misleading (computed only on the 70% that parsed). Report parse rate alongside accuracy.
- Not comparing zero-shot — if zero-shot gets 85% accuracy, fine-tuning to 87% may not be worth the effort. Always measure the baseline.
- VRAM OOM — reduce max_seq_length, use load_in_4bit=True, set per_device_train_batch_size=1. If still OOM, use a smaller model.
Demos
This bundle includes two runnable marimo notebooks:
- demo.py — AG News text classification (4 classes, single label). Quick to run, good for validating the pipeline.
- demo_nhtsa.py — NHTSA vehicle safety complaints (multi-target: fire, crash, component via structured JSON output). Downloads NHTSA's 2.2M-record complaint database automatically. Demonstrates programmatic labeling: using structured fields from an existing database as free supervision to train a model that works from raw text alone.
Dependencies
This bundle uses PEP 723 inline script metadata. Run with:
marimo edit --sandbox demo.py
marimo edit --sandbox demo_nhtsa.py
If unsloth fails to install via sandbox (CUDA compatibility), install manually in a venv:
python -m venv .venv && source .venv/bin/activate
pip install unsloth torch trl datasets mlflow scikit-learn numpy matplotlib requests marimo
marimo edit demo.py