Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.

Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.

Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Each sample is a complete conversation
  • Multi-turn: alternate user/assistant messages
  • System prompts optional: {"role": "system", "content": "..."}
  • JSONL format, one sample per line
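As a sketch, records in this format can be checked before training. The `validate_sft_record` helper below is illustrative, not part of any library:

```python
import json

def validate_sft_record(line: str) -> bool:
    """Check one JSONL line against the messages schema above."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    for msg in messages:
        if msg.get("role") not in allowed_roles:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            return False
    # Train on the final assistant turn, so conversations must end with one.
    return messages[-1]["role"] == "assistant"

sample = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
```

Run a check like this over every line before training so malformed samples fail fast rather than silently degrading the run.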

Preference Learning (DPO/ORPO/KTO)

Requires paired comparisons [2]:

{"prompt": "...", "chosen": "...", "rejected": "..."}
  • chosen and rejected must respond to the same prompt
  • Quality difference should be clear and consistent
  • Annotator agreement >70% indicates usable samples [1]

For KTO, pairs aren't required—just binary labels on completions [7]:

{"prompt": "...", "completion": "...", "label": true/false}

Reward Modeling (RLHF)

Needs ranked responses [1]:

{"prompt": "...", "responses": ["best", "second", "worst"]}

Quality Checklist

Before training, verify:

  • No duplicates — exact and near-duplicate removal [3]
  • No empty fields — all required fields populated
  • Consistent format — schema matches throughout
  • Appropriate length — not too short (noise) or too long (truncation)
  • Clean text — proper encoding, no HTML/boilerplate artifacts [8]
  • Manual inspection — reviewed a random sample of 50-100 examples
  • No PII/sensitive data — unless intentionally included
  • License verified — legal to use for training
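The exact-duplicate check can be sketched with content hashing (near-duplicate removal via MinHash needs a dedicated library, e.g. datasketch):

```python
import hashlib
import json

def dedup_exact(records: list[dict]) -> list[dict]:
    """Drop exact duplicates by hashing a canonical JSON serialization."""
    seen, unique = set(), []
    for rec in records:
        # sort_keys makes the hash stable across key ordering.
        key = hashlib.sha256(
            json.dumps(rec, sort_keys=True, ensure_ascii=False).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```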

Common Quality Issues

| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
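The repetition, alpha-ratio, and length heuristics above can be implemented as simple filters. A minimal sketch with the thresholds taken from the table:

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word-level trigrams that are unique (1.0 = no repetition)."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)

def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

def passes_filters(text: str) -> bool:
    """Apply the document-level thresholds from the table above."""
    return (
        unique_trigram_ratio(text) >= 0.30  # repetitive text
        and alpha_ratio(text) >= 0.50       # low-quality text
        and len(text.split()) >= 100        # too short
    )
```

These are cheap first-pass heuristics; tune the thresholds on a manually inspected sample before filtering at scale.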

Data Sources

High quality:

  • Curated human annotations [1]
  • Expert-written examples
  • Filtered high-quality web data [3]

Medium quality:

  • Synthetic data from stronger models (distillation)
  • Community Q&A with voting signals
  • Filtered user-generated content

Use with caution:

  • Raw web scrapes
  • Unfiltered synthetic data
  • Data without clear provenance [6]

Sizing Guidelines

| Dataset Size | Use Case | Source |
|---|---|---|
| 100-1K | Quick experiments, specific behaviors | |
| 1K-10K | Production SFT, domain adaptation | |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |

Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

  • JSONL — one JSON object per line, human-readable
  • Parquet — efficient for large datasets, built-in compression [3]
  • Sharding — split files >500MB into chunks
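A sharding pass over a JSONL file can be sketched as follows (the `shard_jsonl` helper and its `.shardNNN` naming scheme are illustrative):

```python
def shard_jsonl(path: str, max_bytes: int = 500 * 1024 * 1024) -> list[str]:
    """Split a JSONL file into shards, each at most max_bytes."""
    shards, shard_idx, written, out = [], 0, 0, None
    with open(path, "r", encoding="utf-8") as src:
        for line in src:
            size = len(line.encode("utf-8"))
            # Start a new shard when the next line would exceed the limit.
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                shard_path = f"{path}.shard{shard_idx:03d}"
                out = open(shard_path, "w", encoding="utf-8")
                shards.append(shard_path)
                shard_idx, written = shard_idx + 1, 0
            out.write(line)
            written += size
    if out is not None:
        out.close()
    return shards
```

Splitting on line boundaries keeps every shard a valid JSONL file on its own.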

References

  1. Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
  2. TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
  3. FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
  4. Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
  5. Tinker API — Training API using messages format for SFT, DPO/RLHF support
  6. Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
  7. KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
  8. C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal