Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models with frameworks such as Unsloth and Axolotl.

What Do I Need?

| Need | Resource |
|------|----------|
| Planning my dataset - requirements, strategy, quality checklist | resources/dataset-strategy.md |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | resources/generation-techniques.md |
| ChatML format details - structure, specification, common issues, framework compatibility | resources/chatml-format.md |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | resources/examples.md |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | resources/quality-validation.md |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | resources/framework-integration.md |

Workflow

Phase 1: Gather Requirements

Start with these essential clarifying questions:

Task Definition:

  • What is the model being trained to do? (e.g., customer support, code generation, creative writing)
  • What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
  • How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)

Quality & Diversity:

  • Complexity range: simple to complex mix, or focus on specific difficulty level?
  • Diversity: edge cases, error handling, unusual scenarios?
  • Tone/style: professional, friendly, technical, concise, detailed?
  • Response length preferences?
  • Any specific formats: code blocks, lists, tables, JSON?

Dataset Composition:

  • Distribution across subtopics: evenly distributed or weighted?
  • Include negative examples (what NOT to do)?
  • Need validation split? (Recommend 10-20% of total)

See resources/dataset-strategy.md for detailed question templates.

Phase 2: Create Generation Plan

Present a plan covering:

  • Number and distribution of examples across categories
  • Key topics/scenarios to cover
  • Diversity strategies (phrasing variations, complexity levels, edge cases)
  • System prompt approach (consistent vs. varied)
  • Quality assurance approach

Get user approval before generating.
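For sign-off, the plan can be summarized in a compact structure. A hypothetical example (the domain, categories, and counts below are illustrative only):

```python
# Hypothetical generation plan for a customer-support dataset.
plan = {
    "total_examples": 500,
    "categories": {
        "billing_questions": 150,      # weighted toward common scenarios
        "troubleshooting": 150,
        "account_management": 100,
        "edge_cases_and_errors": 100,  # ~20% edge cases
    },
    "system_prompt": "consistent",     # one shared prompt vs. varied
    "complexity_mix": {"simple": 0.4, "moderate": 0.4, "complex": 0.2},
    "validation_split": 0.15,          # 15% held out
}
```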

Phase 3: Generate Synthetic Data

Create examples following these quality standards:

Key Principles:

  • Realistic scenarios reflecting real-world use cases
  • Natural language with varied phrasing and formality levels
  • Accurate, helpful responses aligned with desired behavior
  • Consistent ChatML formatting throughout
  • Balanced difficulty (unless specified)
  • Meaningful variety (no repetition)
  • Include edge cases and error scenarios

Diversity Techniques:

  • Vary query phrasing (questions, commands, statements)
  • Include different expertise levels (beginner, intermediate, expert)
  • Cover both positive and negative examples
  • Mix short and long-form responses
  • Include multi-step reasoning when appropriate
  • Add context variations

See resources/generation-techniques.md for detailed techniques, domain-specific guidance, and batch generation workflow.
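One way to apply these techniques systematically is to cross a small set of topics with different query framings before writing each ChatML record. A minimal sketch (the topics, framings, and write_answer stub are placeholders, not part of this skill's scripts):

```python
import json

# Cross topics with phrasing styles so each topic appears as a
# question, a command, and a statement (all content is illustrative).
topics = ["reverse a string", "read a CSV file", "handle a KeyError"]
framings = [
    "How do I {} in Python?",        # question
    "Show me how to {} in Python.",  # command
    "I'm not sure how to {}.",       # statement
]

def write_answer(topic: str) -> str:
    # Placeholder: real generation produces a full, accurate answer.
    return f"Here is how to {topic} in Python: ..."

with open("training_data.jsonl", "w") as f:
    for topic in topics:
        for framing in framings:
            record = {"messages": [
                {"role": "system", "content": "You are a helpful Python assistant."},
                {"role": "user", "content": framing.format(topic)},
                {"role": "assistant", "content": write_answer(topic)},
            ]}
            f.write(json.dumps(record) + "\n")
```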

Phase 4: Validate & Document

Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

Quality Checklist:

  • JSON validation passed (no errors)
  • Analysis shows good diversity metrics
  • Manual sample review passed
  • No duplicate or near-duplicate examples (see the check sketched below)
  • All required fields present
  • Realistic user queries
  • Accurate, helpful responses
  • Balanced category distribution
  • Dataset metadata documented

See resources/quality-validation.md for validation details, troubleshooting, and documentation templates.
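The near-duplicate item is hard to check by eye. A rough standard-library sketch that compares normalized user queries pairwise (the 0.9 similarity threshold is a judgment call, and this is not the actual script logic):

```python
import json
from difflib import SequenceMatcher

def user_text(example: dict) -> str:
    # Join all user turns, lowercased, into one rough comparison key.
    return " ".join(m["content"].lower().strip()
                    for m in example["messages"] if m["role"] == "user")

with open("training_data.jsonl") as f:
    queries = [user_text(json.loads(line)) for line in f]

# Flag pairs whose user text is >90% similar. O(n^2), so fine for a
# few thousand examples but too slow far beyond that.
for i in range(len(queries)):
    for j in range(i + 1, len(queries)):
        ratio = SequenceMatcher(None, queries[i], queries[j]).ratio()
        if ratio > 0.9:
            print(f"Possible near-duplicate: examples {i} and {j} ({ratio:.2f})")
```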

Phase 5: Integration & Training

Prepare for training with your framework of choice:

Output Files:

  • training_data.jsonl - Main training set
  • validation_data.jsonl - Optional validation set
  • dataset_info.txt - Metadata and statistics
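If a validation split was requested in Phase 1, a simple random hold-out covers most cases and produces the two data files above. A minimal sketch (all_examples.jsonl is a placeholder name for the combined generated set):

```python
import json
import random

# Hold out 10% of the generated examples as a validation set.
random.seed(42)  # reproducible split

with open("all_examples.jsonl") as f:  # placeholder input file name
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
cut = int(len(examples) * 0.9)

with open("training_data.jsonl", "w") as f:
    for ex in examples[:cut]:
        f.write(json.dumps(ex) + "\n")

with open("validation_data.jsonl", "w") as f:
    for ex in examples[cut:]:
        f.write(json.dumps(ex) + "\n")
```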

Framework Setup:

  • Unsloth: Automatic ChatML detection, efficient 4-bit training
  • Axolotl: Specify type: chat_template and chat_template: chatml
  • Hugging Face: Use the tokenizer's apply_chat_template() method (see the sketch below)
  • Custom: Load from JSONL, handle ChatML formatting

See resources/framework-integration.md for setup code, hyperparameters, deployment options, and best practices.
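For the Hugging Face route, apply_chat_template() renders a messages array into the model's expected prompt string. A minimal sketch (the model name below is only one example of a ChatML-templated tokenizer):

```python
import json
from transformers import AutoTokenizer

# Any tokenizer that ships a ChatML chat template works here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

with open("training_data.jsonl") as f:
    example = json.loads(f.readline())

# Render the messages into a single training string.
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)
```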

ChatML Format Overview

Each training example is a JSON object with a messages array:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}

Roles:

  • system: Sets assistant behavior (optional but recommended)
  • user: User's input/query
  • assistant: Model's expected response

Multi-turn: Add additional user/assistant message pairs for conversations.
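For example, a two-turn conversation (pretty-printed here for readability; in the JSONL file each example stays on one line):

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "How do I reverse a string in Python?"},
  {"role": "assistant", "content": "Use slicing: `text[::-1]`"},
  {"role": "user", "content": "Does that work on lists too?"},
  {"role": "assistant", "content": "Yes, `my_list[::-1]` returns a reversed copy of the list."}
]}
```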

See resources/chatml-format.md for detailed specification, validation, common issues, and framework-specific notes.

Tool Reference

Scripts in scripts/

validate_chatml.py

Validates ChatML format JSONL files:

```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```

Checks:

  • Valid JSON formatting
  • Required fields (messages, role, content)
  • Valid role values (system, user, assistant)
  • Proper message order
  • Duplicate detection
  • Diversity metrics
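For reference, the core structural checks look roughly like this simplified sketch (not the actual script):

```python
import json
import sys

VALID_ROLES = {"system", "user", "assistant"}

def check_line(line_no: int, line: str) -> list:
    errors = []
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"line {line_no}: invalid JSON ({e})"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return [f"line {line_no}: missing or empty 'messages' array"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"line {line_no}: message {i} has an invalid role")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            errors.append(f"line {line_no}: message {i} has empty content")
    # A system message, if present, should be the first message.
    if "system" in [m.get("role") for m in messages][1:]:
        errors.append(f"line {line_no}: system message not at position 0")
    return errors

with open(sys.argv[1]) as f:
    for n, line in enumerate(f, start=1):
        for err in check_line(n, line):
            print(err)
```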

analyze_dataset.py

Provides comprehensive statistics and analysis:

```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

Reports:

  • Dataset overview (total examples, message counts)
  • Message length statistics
  • System prompt variations
  • User query patterns (questions, commands, code-related, length categories)
  • Assistant response patterns (code blocks, lists, headers, length categories)
  • Quality indicators (diversity score, balance ratio)
  • Token estimates and cost projection
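Several of these statistics can be approximated in a few lines of standard-library Python; a rough sketch (the ~4 characters-per-token figure is a common heuristic, not an exact count):

```python
import json
from statistics import mean

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

msg_counts = [len(ex["messages"]) for ex in examples]
user_lens = [len(m["content"]) for ex in examples
             for m in ex["messages"] if m["role"] == "user"]
system_prompts = {m["content"] for ex in examples
                  for m in ex["messages"] if m["role"] == "system"}
total_chars = sum(len(m["content"]) for ex in examples for m in ex["messages"])

print(f"examples: {len(examples)}")
print(f"avg messages per example: {mean(msg_counts):.1f}")
print(f"avg user query length: {mean(user_lens):.0f} chars")
print(f"distinct system prompts: {len(system_prompts)}")
print(f"estimated tokens: ~{total_chars // 4}")  # rough 4-chars/token heuristic
```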

Common Workflows

Small Dataset (100-200 examples)

  1. Gather requirements
  2. Create generation plan for 1-2 categories
  3. Generate in single batch, review quality
  4. Validate and document
  5. Ready for training

Medium Dataset (500-1000 examples)

  1. Gather requirements
  2. Create detailed plan with multiple categories
  3. Generate in 2-3 batches, reviewing after each
  4. Analyze diversity and adjust approach
  5. Fill any gaps
  6. Final validation and documentation

Large Dataset (2000+ examples)

  1. Gather comprehensive requirements
  2. Create multi-batch generation plan
  3. Batch 1 (50-100): Foundation examples
  4. Batch 2 (100-200): Complexity expansion
  5. Batch 3 (100-200): Coverage filling
  6. Batch 4 (50-100): Polish and validation
  7. Run full validation suite
  8. Generate comprehensive documentation

Best Practices

Start Small, Iterate

  1. Generate 10-20 examples first
  2. Review and get feedback
  3. Refine approach based on feedback
  4. Scale up to full dataset

Quality Over Quantity

  • Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
  • Each example should teach something new
  • Maintain consistent response quality throughout

Diversify Systematically

  • Vary query phrasing (questions, commands, statements)
  • Cover different expertise levels
  • Mix response complexities
  • Include edge cases (typically 20-30% of dataset)
  • Use batch generation workflow for large datasets

Test Before Deployment

  • Test dataset with actual training framework
  • Monitor training metrics for issues
  • Test fine-tuned model outputs before deployment
  • Compare results to base model
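A quick pre-deployment spot-check is to run the same prompts through the base and fine-tuned models side by side. A sketch using transformers (both model paths are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model_path: str, prompt: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt", add_generation_prompt=True,
    )
    outputs = model.generate(inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Placeholder paths: compare the base model against your fine-tune.
for path in ["base-model-path", "finetuned-model-path"]:
    print(f"--- {path} ---")
    print(generate(path, "How do I reverse a string in Python?"))
```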

Document Everything

  • Keep notes on generation parameters
  • Save different dataset versions
  • Document any modifications made
  • Record generation strategies used
  • Track model performance metrics

Advanced Features

Batch Generation Strategy

For datasets 500+ examples:

  • Generate 50-100 examples at a time
  • Review distribution and diversity after each batch
  • Adjust generation strategy based on identified gaps
  • Prevents repetition and maintains creativity
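In practice this can be a loop that generates a batch, recomputes the category distribution, and steers the next batch toward the largest gap. A hedged sketch (generate_batch and the targets are stand-ins for your actual generation step and plan):

```python
import json
from collections import Counter

targets = {"billing": 200, "troubleshooting": 200, "edge_cases": 100}
counts = Counter()

def generate_batch(category: str, n: int) -> list:
    # Placeholder: swap in the real generation step (e.g. model calls).
    return [{"messages": [
        {"role": "user", "content": f"({category}) placeholder query {i}"},
        {"role": "assistant", "content": "placeholder response"},
    ]} for i in range(n)]

with open("training_data.jsonl", "a") as f:
    while sum(counts.values()) < sum(targets.values()):
        # Pick the category furthest from its target (gap filling).
        category = max(targets, key=lambda c: targets[c] - counts[c])
        batch = generate_batch(category, min(50, targets[category] - counts[category]))
        for ex in batch:
            f.write(json.dumps(ex) + "\n")
        counts[category] += len(batch)
        # Review each batch for diversity/quality before continuing.
```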

Common Pitfalls to Avoid

  • Over-templating: Creates repetitive patterns (vary naturally)
  • Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
  • Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
  • Inconsistent Quality: Quality degradation over time (use quality checklist)
  • JSON Errors: Invalid formatting breaking training (always validate)
  • Missing Context: System prompts without detail (provide clear instructions)
  • Response Mismatch: Responses don't address queries (verify relevance)

Dataset Size Recommendations

| Task Complexity | Recommended Size | Notes |
|-----------------|------------------|-------|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |

Resources

  • resources/ - six reference guides listed in the "What Do I Need?" table above
  • scripts/validate_chatml.py - ChatML format validator
  • scripts/analyze_dataset.py - dataset statistics and analysis

Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration
