Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.

Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.

Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Each sample is a complete conversation
  • Multi-turn: alternate user/assistant messages
  • System prompts optional: {"role": "system", "content": "..."}
  • JSONL format, one sample per line
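As a sketch, records in this format can be checked before training. The `validate_sft_record` helper below is illustrative, not part of any library:

```python
import json

def validate_sft_record(line: str) -> bool:
    """Check one JSONL line against the messages schema above."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    for msg in messages:
        if msg.get("role") not in allowed_roles:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            return False
    # Train on the final assistant turn, so conversations must end with one.
    return messages[-1]["role"] == "assistant"

sample = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
```

Run a check like this over every line before training so malformed samples fail fast rather than silently degrading the run.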

Preference Learning (DPO/ORPO/KTO)

Requires paired comparisons [2]:

{"prompt": "...", "chosen": "...", "rejected": "..."}
  • chosen and rejected must respond to the same prompt
  • Quality difference should be clear and consistent
  • Annotator agreement >70% indicates usable samples [1]

For KTO, pairs aren't required—just binary labels on completions [7]:

{"prompt": "...", "completion": "...", "label": true/false}

Reward Modeling (RLHF)

Needs ranked responses [1]:

{"prompt": "...", "responses": ["best", "second", "worst"]}

Quality Checklist

Before training, verify:

  • No duplicates — exact and near-duplicate removal [3]
  • No empty fields — all required fields populated
  • Consistent format — schema matches throughout
  • Appropriate length — not too short (noise) or too long (truncation)
  • Clean text — proper encoding, no HTML/boilerplate artifacts [8]
  • Manual inspection — reviewed a random sample of 50-100 examples
  • No PII/sensitive data — unless intentionally included
  • License verified — legal to use for training
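The exact-duplicate check can be sketched with content hashing (near-duplicate removal via MinHash needs a dedicated library, e.g. datasketch):

```python
import hashlib
import json

def dedup_exact(records: list[dict]) -> list[dict]:
    """Drop exact duplicates by hashing a canonical JSON serialization."""
    seen, unique = set(), []
    for rec in records:
        # sort_keys makes the hash stable across key ordering.
        key = hashlib.sha256(
            json.dumps(rec, sort_keys=True, ensure_ascii=False).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```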

Common Quality Issues

| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
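The repetition, alpha-ratio, and length heuristics above can be implemented as simple filters. A minimal sketch with the thresholds taken from the table:

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word-level trigrams that are unique (1.0 = no repetition)."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)

def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

def passes_filters(text: str) -> bool:
    """Apply the document-level thresholds from the table above."""
    return (
        unique_trigram_ratio(text) >= 0.30  # repetitive text
        and alpha_ratio(text) >= 0.50       # low-quality text
        and len(text.split()) >= 100        # too short
    )
```

These are cheap first-pass heuristics; tune the thresholds on a manually inspected sample before filtering at scale.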

Data Sources

High quality:

  • Curated human annotations [1]
  • Expert-written examples
  • Filtered high-quality web data [3]

Medium quality:

  • Synthetic data from stronger models (distillation)
  • Community Q&A with voting signals
  • Filtered user-generated content

Use with caution:

  • Raw web scrapes
  • Unfiltered synthetic data
  • Data without clear provenance [6]

Sizing Guidelines

| Dataset Size | Use Case | Source |
|---|---|---|
| 100-1K | Quick experiments, specific behaviors | |
| 1K-10K | Production SFT, domain adaptation | |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |

Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

  • JSONL — one JSON object per line, human-readable
  • Parquet — efficient for large datasets, built-in compression [3]
  • Sharding — split files >500MB into chunks
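A sharding pass over a JSONL file can be sketched as follows (the `shard_jsonl` helper and its `.shardNNN` naming scheme are illustrative):

```python
def shard_jsonl(path: str, max_bytes: int = 500 * 1024 * 1024) -> list[str]:
    """Split a JSONL file into shards, each at most max_bytes."""
    shards, shard_idx, written, out = [], 0, 0, None
    with open(path, "r", encoding="utf-8") as src:
        for line in src:
            size = len(line.encode("utf-8"))
            # Start a new shard when the next line would exceed the limit.
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                shard_path = f"{path}.shard{shard_idx:03d}"
                out = open(shard_path, "w", encoding="utf-8")
                shards.append(shard_path)
                shard_idx, written = shard_idx + 1, 0
            out.write(line)
            written += size
    if out is not None:
        out.close()
    return shards
```

Splitting on line boundaries keeps every shard a valid JSONL file on its own.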

References

  1. Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
  2. TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
  3. FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
  4. Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
  5. Tinker API — Training API using messages format for SFT, DPO/RLHF support
  6. Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
  7. KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
  8. C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal