training-data-curation
Training Data Curation Guidelines
Best practices for gathering and preparing training data for LLM fine-tuning.
Data Quality Principles
Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
Format Requirements
Supervised Fine-Tuning (SFT)
Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Each sample is a complete conversation
- Multi-turn: alternate user/assistant messages
- System prompts optional: {"role": "system", "content": "..."}
- JSONL format, one sample per line
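The format rules above can be checked mechanically before training. The following is a minimal sketch, not a reference validator: the function name `validate_sft_line` is hypothetical, and it assumes that a system message may only appear first, that turns start with a user message, and that each conversation ends with an assistant message. Those last three rules are assumptions for illustration and are not stated by any of the cited formats.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    if problems:
        return problems

    # Assumption: an optional system prompt is only allowed as the first message.
    turns = messages[1:] if messages[0]["role"] == "system" else messages

    # Assumption: turns alternate user/assistant, starting with the user.
    expected = "user"
    for msg in turns:
        if msg["role"] != expected:
            problems.append("roles do not alternate user/assistant")
            break
        expected = "assistant" if expected == "user" else "user"

    # Assumption: a training sample should end with an assistant reply.
    if not turns or turns[-1]["role"] != "assistant":
        problems.append("conversation should end with an assistant message")
    return problems
```

Running this over every line of a JSONL file and logging the line numbers with non-empty results gives a quick first pass before manual inspection.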
Preference Learning (DPO/ORPO/KTO)
Requires paired comparisons [2]:
{"prompt": "...", "chosen": "...", "rejected": "..."}
- chosen and rejected must respond to the same prompt
- Quality difference should be clear and consistent
- Annotator agreement >70% indicates usable samples [1]
For KTO, pairs aren't required, just a binary label on each completion (true for desirable, false for undesirable) [7]:
{"prompt": "...", "completion": "...", "label": true}
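Because KTO only needs a label per completion, an existing paired dataset can be reused by splitting each pair into two labeled records. This is a small sketch under that assumption; `pair_to_kto` is a hypothetical helper name, and the field names simply mirror the two schemas shown in this section.

```python
def pair_to_kto(pair: dict) -> list[dict]:
    """Split one {"prompt", "chosen", "rejected"} record into two KTO records."""
    prompt = pair["prompt"]
    chosen, rejected = pair["chosen"], pair["rejected"]
    if chosen == rejected:
        # Identical completions carry no preference signal.
        raise ValueError("chosen and rejected must differ")
    return [
        {"prompt": prompt, "completion": chosen, "label": True},
        {"prompt": prompt, "completion": rejected, "label": False},
    ]
```

For example, `pair_to_kto({"prompt": "2+2?", "chosen": "4", "rejected": "5"})` yields one record labeled `True` for "4" and one labeled `False` for "5".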
Reward Modeling (RLHF)
Needs ranked responses [1]:
{"prompt": "...", "responses": ["best", "second", "worst"]}
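If the reward-model trainer expects pairwise comparisons rather than full rankings (an assumption here, the source does not say which), a ranked list can be expanded into every better/worse pair. The helper name `ranked_to_pairs` is hypothetical, and the sketch assumes `responses` is ordered best first as in the schema above.

```python
from itertools import combinations


def ranked_to_pairs(record: dict) -> list[dict]:
    """Expand a best-first ranking into (chosen, rejected) comparison pairs."""
    prompt = record["prompt"]
    # combinations() preserves input order, so the first item of each
    # pair always ranks higher than the second.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(record["responses"], 2)
    ]
```

A ranking of n responses produces n(n-1)/2 pairs, so three ranked responses become three comparisons.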
Quality Checklist
Before training, verify:
- No duplicates — exact and near-duplicate removal [3]
- No empty fields — all required fields populated
- Consistent format — schema matches throughout
- Appropriate length — not too short (noise) or too long (truncation)
- Clean text — proper encoding, no HTML/boilerplate artifacts [8]
- Manual inspection — reviewed random sample of 50-100 examples
- No PII/sensitive data — unless intentionally included
- License verified — legal to use for training
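Two of the checklist items, exact-duplicate removal and drawing a random sample for manual inspection, can be sketched with the standard library. The helper names and the normalization (lowercasing and collapsing whitespace) are assumptions for illustration. This only catches exact matches after normalization; near-duplicates still need something like MinHash.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_exact(records: list[dict], key) -> list[dict]:
    """Keep the first record for each distinct normalized key(record) text."""
    seen = set()
    kept = []
    for record in records:
        digest = hashlib.sha256(normalize(key(record)).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


def sample_for_review(records: list[dict], n: int = 50, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of up to n records for manual review."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))
```

Usage might look like `clean = dedup_exact(records, key=lambda r: r["prompt"])` followed by `sample_for_review(clean, n=100)`.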
Common Quality Issues
| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
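The repetition, alpha-ratio, and length thresholds in the table can be expressed as simple text heuristics. This is a sketch, not the implementation used by any of the cited tools: it assumes word-level trigrams split on whitespace, excludes whitespace from the alpha-ratio denominator, and uses hypothetical function names. The default thresholds are the ones listed in the table.

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that are distinct (1.0 if there are none)."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)


def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def passes_heuristics(
    text: str,
    min_unique_trigrams: float = 0.30,
    min_alpha: float = 0.50,
    min_words: int = 100,
) -> bool:
    """Apply the length, repetition, and alpha-ratio checks in that order."""
    if len(text.split()) < min_words:
        return False
    if unique_trigram_ratio(text) < min_unique_trigrams:
        return False
    return alpha_ratio(text) >= min_alpha
```

The 100-word minimum is aimed at documents. For short chat-style samples, pass a smaller `min_words` so that valid brief answers are not discarded.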
Data Sources
High quality:
Medium quality:
- Synthetic data from stronger models (distillation)
- Community Q&A with voting signals
- Filtered user-generated content
Use with caution:
- Raw web scrapes
- Unfiltered synthetic data
- Data without clear provenance [6]
Sizing Guidelines
| Dataset Size | Use Case | Source |
|---|---|---|
| 100-1K | Quick experiments, specific behaviors | — |
| 1K-10K | Production SFT, domain adaptation | — |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |
Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].
File Format
- JSONL — one JSON object per line, human-readable
- Parquet — efficient for large datasets, built-in compression [3]
- Sharding — split files >500MB into chunks
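Sharding a large JSONL file can be done line by line so that no record is split across files. The sketch below is one way to do it under stated assumptions: `shard_jsonl` and the `<stem>-00000.jsonl` naming scheme are hypothetical, and the 500MB default follows the guideline above.

```python
from pathlib import Path


def shard_jsonl(src, out_dir, max_bytes: int = 500 * 1024 * 1024) -> list[Path]:
    """Split a JSONL file into shards of at most max_bytes, on line boundaries.

    A single line longer than max_bytes is still written whole to its own shard.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shards: list[Path] = []
    out = None
    written = 0
    try:
        with src.open("rb") as f:
            for line in f:
                # Start a new shard for the first line, or when adding this
                # line would push a non-empty shard over the size limit.
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.stem}-{len(shards):05d}.jsonl"
                    out = path.open("wb")
                    shards.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return shards
```

Calling `shard_jsonl("train.jsonl", "shards/")` returns the list of shard paths in order. Concatenating them reproduces the original file byte for byte.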
References
- [1] Llama 2 Paper: Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
- [2] TRL Library: HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
- [3] FineWeb Paper: Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
- [4] Data-Juicer: Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
- [5] Tinker API: Training API using messages format for SFT, DPO/RLHF support
- [6] Data Provenance Initiative: Longpre et al. (2023). Dataset licensing and attribution audit
- [7] KTO Paper: Ethayarajh et al. (2024). Binary preference learning without pairs
- [8] C4/T5 Paper: Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal