golden-dataset
Golden Dataset
Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/ loaded on-demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Curation
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md |
Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md |
Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md |
Difficulty stratification, domain coverage, balance guidelines |
Management
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md |
JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md |
Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md |
GitHub Actions automation, pre-deployment validation, weekly backups |
Validation
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md |
Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md |
Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md |
Difficulty distribution, pre-commit hooks, full dataset validation |
Add Workflow
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md |
9-phase curation, parallel quality analysis, bias detection |
Quick Start Example
from app.shared.services.embeddings import embed_text
async def validate_before_add(document: dict, source_url_map: dict) -> dict:
"""Pre-addition validation for golden dataset entries."""
errors = []
# 1. URL contract check
if "placeholder" in document.get("source_url", ""):
errors.append("URL must be canonical, not a placeholder")
# 2. Content quality
if len(document.get("title", "")) < 10:
errors.append("Title too short (min 10 chars)")
# 3. Tag requirements
if len(document.get("tags", [])) < 2:
errors.append("At least 2 domain tags required")
return {"valid": len(errors) == 0, "errors": errors}
Key Decisions
| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
Common Mistakes
- Using placeholder URLs instead of canonical source URLs
- Skipping embedding regeneration after restore
- Not validating referential integrity between documents and queries
- Over-indexing on articles (neglecting tutorials, research papers)
- Missing difficulty distribution balance in test queries
- Not running verification after backup/restore operations
- Testing restore procedures in production instead of staging
- Committing SQL dumps instead of JSON (not version-control friendly)
Evaluations
See test-cases.json for 9 test cases across all categories.
Related Skills
ork:rag-retrieval- Retrieval evaluation using golden datasetlangfuse-observability- Tracing patterns for curation workflowsork:testing-patterns- General testing patterns and strategiesai-native-development- Embedding generation for restore
Capability Details
curation
Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents
management
Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD
validation
Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves:
- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps