LLM Fine-Tuning with Unsloth (Done Right)

Fine-tune small open-weight LLMs locally for text tasks. Unsloth + QLoRA makes this practical on consumer GPUs (8 GB VRAM). The workflow: pick a model, format data as instruction-tuning pairs, fine-tune with TRL's SFTTrainer, evaluate against zero-shot baseline, log to MLflow, export to GGUF for llama.cpp deployment.

When to use this skill

  • Input is text (articles, tickets, emails, reviews) — not tabular features
  • You have at least a few hundred labeled examples
  • You want model ownership — no API dependency, no per-token cost, data stays on-device
  • You want a single model that can handle related text tasks later (classification today, extraction tomorrow)

When NOT to use this skill

  • Input is tabular (numbers, categories) → use XGBoost (see binary/multiclass/multilabel classification skills)
  • You have < 50 labeled examples → use zero-shot or few-shot prompting with a larger model via API
  • You need state-of-the-art quality and cost doesn't matter → use Claude/GPT API with prompt engineering
  • You're doing unsupervised text analysis → use embeddings + clustering, not fine-tuning

Model selection

Pick the largest model that fits in your VRAM with QLoRA (4-bit). Bigger models learn faster and generalize better, but the returns diminish. For most text classification tasks, 1-4B is plenty.

Model          Params  VRAM (QLoRA)  Unsloth ID                      Notes
Qwen3-0.6B     0.6B    ~4 GB         unsloth/Qwen3-0.6B              Smallest, fastest iteration
Llama-3.2 1B   1B      ~6 GB         unsloth/Llama-3.2-1B-Instruct   Good quality/size ratio
Gemma-4 E2B    2B      8-10 GB       unsloth/gemma-4-E2B-it          Default. Strong for its size
Llama-3.2 3B   3B      ~10 GB        unsloth/Llama-3.2-3B-Instruct   Benchmark vs Gemma-4 E2B
Phi-4 mini     3.8B    ~12 GB        unsloth/Phi-4-mini-instruct     Needs 16+ GB VRAM
Gemma-4 E4B    4B      ~17 GB        unsloth/gemma-4-E4B-it          Needs 24+ GB VRAM
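
If you're not sure which row applies, check free VRAM before downloading anything. A minimal sketch, assuming a CUDA build of PyTorch:

import torch

# Report free and total VRAM on the first CUDA device
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
else:
    print("No CUDA device found")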

Chat templates and mask tokens

Each model family has its own chat format. When adding a new model, you need three things:

  1. Unsloth chat template name — passed to get_chat_template()
  2. Instruction mask — the token sequence that starts a user turn
  3. Response mask — the token sequence that starts a model turn

These are used by train_on_responses_only() to ensure the model only learns to predict responses, not to parrot the prompt.

Model family  Template          Instruction part                                Response part
Gemma 4       gemma-4-thinking  <|turn>user\n                                   <|turn>model\n
Llama 3.x     llama-3.1         <|start_header_id|>user<|end_header_id|>\n\n    <|start_header_id|>assistant<|end_header_id|>\n\n
Qwen 2.5/3    qwen-2.5          <|im_start|>user\n                              <|im_start|>assistant\n
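
For a model family not listed above, the easiest way to find the two mask strings is to render a dummy conversation and read them off. A sketch, assuming the tokenizer has already been wrapped with get_chat_template():

# Render a two-turn conversation and print it verbatim. The literal substrings
# that open the user turn and the model turn are what you pass as
# instruction_part and response_part to train_on_responses_only().
probe = [
    {"role": "user", "content": "PROBE_USER"},
    {"role": "assistant", "content": "PROBE_ASSISTANT"},
]
print(repr(tokenizer.apply_chat_template(probe, tokenize=False)))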

Data formatting

Frame every text task as instruction-tuning: the user message is the input text with a task prompt, the assistant message is the label or output.

Text classification example

messages = [
    {"role": "user", "content": (
        "Classify this news article into exactly one category: "
        "World, Sports, Business, Sci/Tech.\n\n"
        f"{article_text}"
    )},
    {"role": "assistant", "content": "Sports"},
]
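
The same framing covers extraction-style tasks: keep the user message as the raw text plus a task prompt, and make the assistant message the structured output. A hypothetical example with a JSON target (field names are illustrative, loosely modeled on the NHTSA demo; complaint_text is assumed to hold the raw input):

messages = [
    {"role": "user", "content": (
        "Read this vehicle complaint and answer as JSON with keys "
        '"fire" (true/false), "crash" (true/false), "component" (string).\n\n'
        f"{complaint_text}"
    )},
    {"role": "assistant", "content": '{"fire": false, "crash": true, "component": "BRAKES"}'},
]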

Formatting for SFTTrainer

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="gemma-4-thinking")

def format_example(row):
    messages = [
        {"role": "user", "content": f"<your prompt>\n\n{row['text']}"},
        {"role": "assistant", "content": row["label_name"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False,
    )
    # Remove BOS — SFTTrainer adds its own
    bos = tokenizer.bos_token or ""
    if bos and text.startswith(bos):
        text = text[len(bos):]
    return {"formatted_text": text}

formatted = dataset.map(format_example)
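
For context, formatted just needs one string column; the source can be any Hugging Face dataset with a text column and a label column. A sketch using AG News, the dataset the bundled demo.py uses (the column names are AG News specifics, adjust for your data):

from datasets import load_dataset

label_names = ["World", "Sports", "Business", "Sci/Tech"]

# AG News ships integer labels 0-3; add a label_name column so
# format_example() can read row["label_name"]
dataset = (
    load_dataset("ag_news", split="train")
    .shuffle(seed=3407)
    .select(range(1000))
    .map(lambda row: {"label_name": label_names[row["label"]]})
)
formatted = dataset.map(format_example)   # same call as above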

Training pipeline

1. Load model with QLoRA

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    dtype=None,              # auto-detect
    max_seq_length=2048,     # keep short for classification
    load_in_4bit=True,       # QLoRA
    full_finetuning=False,
)

2. Add LoRA adapters

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,                    # LoRA rank — 8 is a good default
    lora_alpha=8,           # usually equal to r
    lora_dropout=0,
    bias="none",
    random_state=3407,
)
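
A quick way to confirm the adapters are as small as intended before burning GPU time. A minimal sketch (plain PyTorch, no extra API assumed):

# With r=8 the trainable fraction should be well under 1% of total weights
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")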

3. Train with SFTTrainer

from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted,
    args=SFTConfig(
        output_dir="/tmp/llm-ft",
        dataset_text_field="formatted_text",
        per_device_train_batch_size=1,     # 1 for 8 GB VRAM
        gradient_accumulation_steps=4,      # effective batch = 4
        warmup_steps=5,
        max_steps=60,                       # tune based on dataset size
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        save_strategy="no",
        report_to="none",                   # we log to MLflow manually
    ),
)

# Only compute loss on assistant responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",      # model-specific
    response_part="<|turn>model\n",         # model-specific
)

stats = trainer.train()
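
It's worth recording peak VRAM and runtime right after training, both to confirm headroom and to log with the run. A sketch assuming a single CUDA device; train_runtime is a key the Trainer normally reports, but treat it as an assumption:

import torch

peak_gb = torch.cuda.max_memory_reserved(0) / 1e9
print(f"Peak reserved VRAM: {peak_gb:.2f} GB")
print(f"Training runtime: {stats.metrics.get('train_runtime', float('nan')):.0f} s")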

Key training params to tune

Param                        Default  When to change
max_steps                    60       More data → more steps. 1-3 epochs over the dataset.
learning_rate                2e-4     Lower (1e-4) if loss is unstable.
r (LoRA rank)                8        Increase to 16-32 for harder tasks. More capacity but slower.
max_seq_length               2048     Increase for long documents. Costs more VRAM.
gradient_accumulation_steps  4        Increase for more stable gradients at the cost of speed.
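
To turn the epoch guidance into a concrete max_steps value: one optimizer step consumes one effective batch of examples, so steps = examples * epochs / effective batch. A quick calculation, assuming the settings above:

import math

effective_batch = 1 * 4       # per_device_train_batch_size * gradient_accumulation_steps
n_examples = len(formatted)   # size of the training split
epochs = 2                    # 1-3 epochs is usually enough

max_steps = math.ceil(n_examples * epochs / effective_batch)
print(f"{n_examples} examples, {epochs} epochs -> max_steps = {max_steps}")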

Evaluation

Always compare zero-shot vs fine-tuned

The most important evaluation is: did fine-tuning actually help? Run the base model (before training) on the eval set, then run the fine-tuned model. Compare accuracy and F1 macro.

import torch

def classify(model, tokenizer, text, label_names):
    messages = [{"role": "user", "content": f"Classify: {', '.join(label_names)}.\n\n{text}"}]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True,
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids=inputs, max_new_tokens=20, do_sample=False)
    response = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True).strip()
    for idx, name in enumerate(label_names):
        if name.lower() in response.lower():
            return idx
    return -1  # parse failure
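
A small harness around classify() makes the zero-shot vs fine-tuned comparison mechanical: run it once before training and once after. A sketch, assuming eval_rows is a list of dicts with "text" and integer "label" fields:

from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, tokenizer, eval_rows, label_names):
    y_true = [row["label"] for row in eval_rows]
    y_pred = [classify(model, tokenizer, row["text"], label_names) for row in eval_rows]

    parsed = [p != -1 for p in y_pred]
    parse_rate = sum(parsed) / len(parsed)

    # Score only the examples that produced a parseable label (see Parse failures below)
    yt = [t for t, ok in zip(y_true, parsed) if ok]
    yp = [p for p, ok in zip(y_pred, parsed) if ok]
    return {
        "accuracy": accuracy_score(yt, yp),
        "f1_macro": f1_score(yt, yp, average="macro"),
        "parse_rate": parse_rate,
    }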

Metrics

Same metrics as multiclass classification:

  • F1 macro — primary metric, surfaces rare-class failures
  • Per-class F1 — each class has its own difficulty
  • Confusion matrix — which classes get confused
  • Parse rate — fraction of responses that contain a valid label. Low parse rate means the model isn't following the format.

Parse failures

If the model outputs "This article is about sports and entertainment" instead of just "Sports", parsing fails. Fine-tuning should fix this — the model learns the exact output format from training data. If parse rate stays low after fine-tuning, increase max_steps or check the training data format.

MLflow conventions

Log every training run to MLflow for comparison across models and hyperparameters.

import mlflow

mlflow.set_experiment("llm-finetuning")
with mlflow.start_run(run_name="gemma-4-E2B-it"):
    mlflow.log_params({
        "model_name": "unsloth/gemma-4-E2B-it",
        "lora_r": 8,
        "max_steps": 60,
        "learning_rate": 2e-4,
        "max_seq_length": 2048,
    })
    # ... train ...
    mlflow.log_metric("train_loss", stats.training_loss)
    mlflow.log_metrics({
        "zs_accuracy": zs_acc,
        "zs_f1_macro": zs_f1,
        "ft_accuracy": ft_acc,
        "ft_f1_macro": ft_f1,
    })

Compare runs: mlflow ui --port 5000

What to compare across runs

  • Model size vs F1: does the 2B model beat the 0.6B?
  • Steps vs F1: diminishing returns? overfitting?
  • LoRA rank vs F1: does r=16 beat r=8?
  • Zero-shot delta: how much did fine-tuning actually buy?

GGUF export

After fine-tuning, merge LoRA adapters and quantize to GGUF:

model.save_pretrained_gguf(
    "finetuned-model",
    tokenizer,
    quantization_method="q4_k_m",
)

Quantization options:

  • q4_k_m — 4-bit, recommended default (~3 GB for E2B)
  • q8_0 — 8-bit, higher quality (~5 GB for E2B)
  • f16 — 16-bit, no quantization loss (~9 GB for E2B)

Inference with llama.cpp

llama-server -m finetuned-model/unsloth.Q4_K_M.gguf \
    --port 8080 --ctx-size 2048

Then query via the OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role":"user","content":"Classify: World, Sports, Business, Sci/Tech.\n\nNASA launches new Mars rover..."}],
    "max_tokens": 20,
    "temperature": 0
  }'
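
The same endpoint works from Python with any OpenAI-compatible client. A sketch using the openai package (the api_key value is a placeholder since llama-server does not check it by default, and the model name is informational because the server hosts a single GGUF):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="finetuned-model",
    messages=[{
        "role": "user",
        "content": "Classify: World, Sports, Business, Sci/Tech.\n\nNASA launches new Mars rover...",
    }],
    max_tokens=20,
    temperature=0,
)
print(resp.choices[0].message.content)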

Pitfalls

  1. Wrong chat template — if you use template X for training and template Y for inference, the model will produce garbage. Always use the same template for both.
  2. Forgetting train_on_responses_only — without it, the model also learns to generate the user prompt, wasting capacity and often degrading classification quality.
  3. Too many steps → overfitting — with small datasets (< 1000 examples), 1-2 epochs is enough. Watch training loss: if it flatlines near zero, you're overfitting.
  4. BOS token duplication — SFTTrainer adds BOS automatically. If your formatted text already starts with BOS, strip it.
  5. Ignoring parse failures — if 30% of responses don't parse, your accuracy numbers are misleading (computed only on the 70% that parsed). Report parse rate alongside accuracy.
  6. Not comparing zero-shot — if zero-shot gets 85% accuracy, fine-tuning to 87% may not be worth the effort. Always measure the baseline.
  7. VRAM OOM — reduce max_seq_length, use load_in_4bit=True, set per_device_train_batch_size=1. If still OOM, use a smaller model.

Demos

This bundle includes two runnable marimo notebooks:

  • demo.py — AG News text classification (4 classes, single label). Quick to run, good for validating the pipeline.
  • demo_nhtsa.py — NHTSA vehicle safety complaints (multi-target: fire, crash, component via structured JSON output). Downloads NHTSA's 2.2M-record complaint database automatically. Demonstrates programmatic labeling: using structured fields from an existing database as free supervision to train a model that works from raw text alone.

Dependencies

This bundle uses PEP 723 inline script metadata. Run with:

marimo edit --sandbox demo.py
marimo edit --sandbox demo_nhtsa.py

If unsloth fails to install via sandbox (CUDA compatibility), install manually in a venv:

python -m venv .venv && source .venv/bin/activate
pip install unsloth torch trl datasets mlflow scikit-learn numpy matplotlib requests marimo
marimo edit demo.py