
unsloth-training

Three training modes:
  1. GRPO - RL with reward functions (no labeled outputs needed)
  2. SFT - Supervised fine-tuning with input/output pairs
  3. Vision - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)

Key capabilities:

  • FP8 Training - 60% less VRAM, 1.4x faster (RTX 40+, H100)
  • 3x Packing - Automatic 2-5x speedup for mixed-length data
  • Docker - Official unsloth/unsloth image
  • Mobile - QAT → ExecuTorch → iOS/Android (~40 tok/s)
  • Export - GGUF, Ollama, vLLM, LM Studio, SGLang

<quick_start> GRPO with FP8 (60% less VRAM):

import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"  # Shared memory
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def correctness_reward(completions, answer, **kwargs):
    return [2.0 if extract_answer(c) == a else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
    train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()

SFT with Packing (2-5x faster):

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model, train_dataset=dataset, processing_class=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2, num_train_epochs=3,
        learning_rate=2e-4, packing=True,  # 2-5x speedup
    ),
)
trainer.train()

</quick_start>

<success_criteria> A training run is successful when:

  • Model loads without OOM errors
  • Reward (GRPO) or loss (SFT) shows improvement trend
  • Generated outputs match expected format
  • Model exported to desired format (LoRA, merged, GGUF)
  • Test inference produces reasonable outputs </success_criteria>

<activation_triggers> Explicit triggers:

  • /unsloth grpo - GRPO (RL) training
  • /unsloth sft - SFT training
  • /unsloth fp8 - FP8 training setup
  • /unsloth vision - VLM fine-tuning
  • /unsloth mobile - Phone deployment (QAT)
  • /unsloth docker - Docker container setup
  • /unsloth troubleshoot - Debug issues

Natural language:

  • "train with GRPO", "fine-tune", "reward functions"
  • "FP8 training", "fp8", "less VRAM"
  • "vision fine-tuning", "VLM", "image training"
  • "phone deployment", "mobile LLM", "ExecuTorch"
  • "docker training", "container", "unsloth docker"
  • "packing", "faster training", "500k context"
  • "export GGUF", "Ollama", "vLLM", "SGLang" </activation_triggers>

<file_locations> Core references:

  • reference/reward-design.md - Reward function patterns
  • reference/domain-examples.md - Voice AI, Sales Agent examples
  • reference/hyperparameters.md - GRPOConfig reference
  • reference/troubleshooting.md - Common fixes

New feature references:

  • reference/fp8-training.md - FP8 setup, VRAM savings
  • reference/deployment.md - Docker, vLLM, LoRA hot-swap, SGLang
  • reference/export-formats.md - GGUF, Ollama, LM Studio, Dynamic 2.0
  • reference/advanced-training.md - 500K context, packing, checkpoints
  • reference/vision-training.md - VLM fine-tuning
  • reference/mobile-deployment.md - QAT, ExecuTorch, iOS/Android

Code examples: reference/grpo/, reference/sft/ </file_locations>

<core_concepts>

When to Use GRPO vs SFT

| Method | Use When | Data Needed |
|--------|----------|-------------|
| GRPO | Improving reasoning quality | Prompts + verifiable answers |
| GRPO | Aligning behavior with preferences | Reward functions |
| GRPO | When you can verify correctness | Verifiable outputs |
| SFT | Teaching specific output format | Input/output pairs |
| SFT | Following new instructions | Conversation examples |
| SFT | Learning domain knowledge | Labeled examples |

Model Selection

| Model | Size | VRAM | Use Case |
|-------|------|------|----------|
| unsloth/Qwen2.5-0.5B-Instruct | 0.5B | 5GB | Mobile deployment (~200MB GGUF) |
| unsloth/Qwen2.5-1.5B-Instruct | 1.5B | 5GB | Learning/prototyping |
| Qwen/Qwen2.5-3B-Instruct | 3B | 8GB | Good balance (recommended start) |
| unsloth/Qwen2.5-7B-Instruct | 7B | 16GB | Production quality |
| unsloth/Phi-4 | 14B | 20GB | Strong reasoning |
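To stay inside the VRAM figures above, load the smaller models in 4-bit. A minimal loading sketch (the model name and sequence length here are just the recommended starting point from the table, not requirements):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit loading keeps the 3B model within ~8GB
)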

Core Hyperparameters

GRPO (RL):

GRPOConfig(
    num_generations=4,        # Completions per prompt (2-8)
    beta=0.04,                # KL penalty (0.01-0.1)
    learning_rate=5e-6,       # Much lower than the SFT rate (2e-4)
    max_completion_length=512,
    max_steps=300,            # Minimum for results
)

SFT:

SFTConfig(
    learning_rate=2e-4,       # Standard SFT rate
    num_train_epochs=3,       # 2-4 typical
    per_device_train_batch_size=2,
)

</core_concepts>

<reward_functions>

Reward Function Design

Reward functions are the core of GRPO. They return a list of floats for each completion.

Pattern 1: Correctness (Primary Signal)

def correctness_reward(completions, answer, **kwargs):
    """
    +2.0 for correct answer, 0.0 otherwise.
    This should be your highest-weighted reward.
    """
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards

Pattern 2: Format Compliance

def format_reward(completions, **kwargs):
    """
    +0.5 for proper XML structure with reasoning and answer tags.
    """
    rewards = []
    for completion in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
        if has_reasoning and has_answer:
            rewards.append(0.5)
        elif has_answer:
            rewards.append(0.2)
        else:
            rewards.append(0.0)
    return rewards

Pattern 3: Reasoning Quality

def reasoning_length_reward(completions, **kwargs):
    """
    +0.3 for substantive reasoning (30-200 words).
    """
    rewards = []
    for completion in completions:
        reasoning = extract_reasoning(completion)
        word_count = len(reasoning.split()) if reasoning else 0
        if 30 <= word_count <= 200:
            rewards.append(0.3)
        elif 15 <= word_count < 30:
            rewards.append(0.1)
        else:
            rewards.append(0.0)
    return rewards

Pattern 4: Negative Constraints

def no_hedging_reward(completions, **kwargs):
    """
    -0.3 penalty for uncertainty language.
    """
    hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
    rewards = []
    for completion in completions:
        has_hedging = any(phrase in completion.lower() for phrase in hedging)
        rewards.append(-0.3 if has_hedging else 0.0)
    return rewards

Typical Reward Stack

reward_funcs = [
    correctness_reward,      # +2.0 max (primary signal)
    format_reward,           # +0.5 max (structure)
    reasoning_length_reward, # +0.3 max (quality)
    no_hedging_reward,       # -0.3 max (constraint)
]
# Total range: -0.3 to +2.8
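
Before launching a run, sanity-check the stack on a couple of hand-written completions so you know each function fires as intended (a minimal sketch; the sample strings are illustrative and assume the extraction helpers from <prompt_format>):

sample_completions = [
    "<reasoning>15 plus 27 gives 42.</reasoning>\n<answer>42</answer>",
    "I think the answer is maybe 41.",
]
sample_answers = ["42", "42"]

for fn in reward_funcs:
    # Each function should return one float per completion
    print(fn.__name__, fn(completions=sample_completions, answer=sample_answers))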

For domain-specific rewards: See reference/domain-examples.md for Voice AI, Sales Agent, and Support patterns. </reward_functions>

<prompt_format>

Prompt Structure

System Prompt with XML Tags

SYSTEM_PROMPT = """You are a helpful assistant that thinks step-by-step.

Always respond in this exact format:
<reasoning>
[Your step-by-step thinking process]
</reasoning>
<answer>
[Your final answer - just the number or short response]
</answer>
"""

Extraction Helpers

import re

def extract_answer(text: str) -> str:
    """Extract answer from XML tags"""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    """Extract reasoning from XML tags"""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

Dataset Format

GRPO (prompt-only):

dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]}
    ],
    "answer": ex["answer"]  # Ground truth for verification
})
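
For example, a GRPO-ready GSM8K setup might look like this (a sketch assuming the openai/gsm8k layout, where the gold answer follows "####" inside the answer field):

from datasets import load_dataset

raw = load_dataset("openai/gsm8k", "main", split="train")
dataset = raw.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]},
    ],
    "answer": ex["answer"].split("####")[-1].strip(),  # keep only the final number
})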

SFT (full conversations):

dataset = dataset.map(lambda ex: {
    "conversations": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": ex["output"]}
    ]
})
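
If you prefer to render the chat template yourself rather than letting SFTTrainer handle it, a minimal sketch (assumes the tokenizer ships a chat template and the role names above match what it expects):

dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(
        ex["conversations"], tokenize=False, add_generation_prompt=False
    )
})
# Then point the trainer at the rendered field, e.g. SFTConfig(dataset_text_field="text", ...)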

</prompt_format>

<model_export>

Save and Deploy

Save LoRA Only (~100MB)

model.save_lora("grpo_lora")

Merge and Save Full Model

model.save_pretrained_merged(
    "grpo_merged", tokenizer,
    save_method="merged_16bit",
)

Export to GGUF for Ollama

model.save_pretrained_gguf(
    "grpo_gguf", tokenizer,
    quantization_method="q4_k_m",  # Options: q4_k_m, q8_0, q5_k_m
)

Test with Ollama

# Create Modelfile
cat > Modelfile << EOF
FROM ./grpo_gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
PARAMETER temperature 0.7
EOF

ollama create my-model -f Modelfile
ollama run my-model "Solve: 15 + 27 = ?"
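
Test Inference in Python

You can also sanity-check generations with the in-memory model before (or instead of) exporting. A minimal sketch (assumes the trained model, tokenizer, and SYSTEM_PROMPT from earlier sections are still loaded via FastLanguageModel; generation settings are illustrative):

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch the Unsloth model to inference mode
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Solve: 15 + 27 = ?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))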

</model_export>

GRPO training: → GRPOConfig, reward functions, dataset prep → Reference: reference/grpo/basic_grpo.py

SFT training: → SFTTrainer, dataset formatting → Reference: reference/sft/sales_extractor_training.py

Reward function design: → 4 patterns (correctness, format, quality, constraints) → Reference: reference/reward-design.md, reference/domain-examples.md

FP8 training: → 60% VRAM savings, env vars, pre-quantized models → Reference: reference/fp8-training.md

Docker setup: → Official image, volumes, Jupyter/SSH → Reference: reference/deployment.md

Vision fine-tuning: → FastVisionModel, VLM data format → Reference: reference/vision-training.md

Mobile deployment: → QAT, ExecuTorch, iOS/Android → Reference: reference/mobile-deployment.md

Long context / packing: → 500K context, 2-5x speedup → Reference: reference/advanced-training.md

Export formats: → GGUF methods, Ollama, vLLM, SGLang → Reference: reference/export-formats.md

Training issues: → Common fixes → Reference: reference/troubleshooting.md

<troubleshooting_quick>

Quick Troubleshooting

| Symptom | Fix |
|---------|-----|
| Reward not increasing | Wait 300+ steps, then increase learning_rate 2x |
| Reward spiky/unstable | Decrease learning_rate 0.5x, increase beta |
| Model outputs garbage | Increase beta 2-4x, check prompt format |
| Out of memory | Reduce max_completion_length, num_generations=2 |
| No reasoning appearing | Train 500+ steps, use model >= 1.5B |
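
To see whether the reward trend is genuinely flat, inspect the trainer logs after (or during) a run. A minimal sketch (assumes the GRPOTrainer instance from the quick start; exact log key names depend on your TRL version):

history = trainer.state.log_history  # list of dicts logged during training
rewards = [h["reward"] for h in history if "reward" in h]
if rewards:
    n = min(10, len(rewards))
    print(f"mean reward: first {n} logs {sum(rewards[:n])/n:.3f}, last {n} logs {sum(rewards[-n:])/n:.3f}")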

For detailed troubleshooting: See reference/troubleshooting.md </troubleshooting_quick>

<training_checklist>

Pre-Training Checklist

GRPO:

  • Model loads without OOM
  • LoRA configured with use_gradient_checkpointing="unsloth"
  • Dataset has prompt and answer fields
  • At least one reward function defined and tested
  • num_generations >= 2
  • beta set (0.01-0.1, start at 0.04)
  • learning_rate set (1e-6 to 1e-5)
  • At least 300 steps planned

SFT:

  • Model loads without OOM
  • Dataset has conversations or text field
  • Chat template applied correctly
  • learning_rate ~2e-4
  • 2-4 epochs planned </training_checklist>

<cost_estimates>

Cost Estimates

Training Data Generation:

  • 50 seeds x 10 variations x ~1000 tokens = ~500K tokens
  • Claude Sonnet: ~$1.50
  • GPT-4o-mini: ~$0.30

Training Compute:

  • Google Colab T4: Free (30-60 min for 1000 examples)
  • RunPod A40 (48GB): ~$0.79/hr

Total per experiment: ~$0.50-$1.00 on RunPod </cost_estimates>

<example_session>

Example Session

User: "I want to train a model that gives concise voice responses"

Claude:

  1. First, let's set up the model:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024, load_in_4bit=True,
)
  2. For voice AI, you need brevity rewards. Here's a custom stack:
def brevity_reward(completions, **kwargs):
    """Voice responses under 50 words get +0.5"""
    return [0.5 if len(c.split()) <= 50 else -0.3 for c in completions]

def speakable_reward(completions, **kwargs):
    """Penalize markdown that sounds bad spoken"""
    bad = ["**", "- ", "```", "http"]
    return [-0.2 * sum(1 for b in bad if b in c) for c in completions]

reward_funcs = [correctness_reward, brevity_reward, speakable_reward]
  3. See reference/domain-examples.md for complete Voice AI reward patterns. </example_session>