skills/yonatangross/orchestkit/golden-dataset-curation

golden-dataset-curation

SKILL.md

Golden Dataset Curation

Curate high-quality documents for the golden dataset with multi-agent validation

Overview

This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements golden-dataset-management which handles backup/restore.

When to use this skill:

  • Adding new documents to the golden dataset
  • Classifying content types and difficulty levels
  • Generating test queries for new documents
  • Running multi-agent quality analysis

Content Types

Type Description Quality Focus
article Technical articles, blog posts Depth, accuracy, actionability
tutorial Step-by-step guides Completeness, clarity, code quality
research_paper Academic papers, whitepapers Rigor, citations, methodology
documentation API docs, reference materials Accuracy, completeness, examples
video_transcript Transcribed video content Structure, coherence, key points
code_repository README, code analysis Code quality, documentation

Difficulty Levels

Level Semantic Complexity Expected Score Characteristics
trivial Direct keyword match >0.85 Technical terms, exact phrases
easy Common synonyms >0.70 Well-known concepts, slight variations
medium Paraphrased intent >0.55 Conceptual queries, multi-topic
hard Multi-hop reasoning >0.40 Cross-domain, comparative analysis
adversarial Edge cases Graceful degradation Robustness tests, off-domain

Quality Dimensions

Dimension Weight Perfect Acceptable Failing
Accuracy 0.25 0.95-1.0 0.70-0.94 <0.70
Coherence 0.20 0.90-1.0 0.60-0.89 <0.60
Depth 0.25 0.90-1.0 0.55-0.89 <0.55
Relevance 0.30 0.95-1.0 0.70-0.94 <0.70

Evaluation focuses:

  • Accuracy: Technical correctness, code validity, up-to-date info
  • Coherence: Logical structure, clear flow, consistent terminology
  • Depth: Comprehensive coverage, edge cases, appropriate detail
  • Relevance: Alignment with AI/ML, backend, frontend, DevOps domains

Multi-Agent Pipeline

INPUT: URL/Content
        |
        v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|  PARALLEL ANALYSIS AGENTS                      |
|  Quality | Difficulty | Domain  | Query Gen   |
+-----------------------------------------------+
         |
         v
+------------------+
| CONSENSUS        |  Weighted score + confidence
| AGGREGATOR       |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry

Decision Thresholds

Quality Score Confidence Decision
>= 0.75 >= 0.70 include
>= 0.55 any review
< 0.55 any exclude

Quality Thresholds

# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2          # At least 2 domain tags
required_queries: 3       # At least 3 test queries

Coverage Balance Guidelines

Maintain balanced coverage across:

  • Content types: Don't over-index on articles
  • Difficulty levels: Need trivial AND hard queries
  • Domains: Spread across AI/ML, backend, frontend, etc.

Duplicate Prevention Checklist

Before adding:

  1. Check URL against existing source_url_map.json
  2. Run semantic similarity against existing document embeddings
  3. Warn if >80% similar to existing document

Provenance Tracking

Always record:

  • Source URL (canonical)
  • Curation date
  • Agent scores (for audit trail)
  • Langfuse trace ID

Langfuse Integration

Trace Structure

trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id}
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})

Managed Prompts

Prompt Name Purpose
golden-content-classifier Classify content_type
golden-difficulty-classifier Assign difficulty
golden-domain-tagger Extract tags
golden-query-generator Generate test queries

References

For detailed implementation patterns, see:

  • references/selection-criteria.md - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
  • references/annotation-patterns.md - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration

Related Skills

  • golden-dataset-management - Backup/restore operations
  • golden-dataset-validation - Validation rules and checks
  • langfuse-observability - Tracing patterns
  • pgvector-search - Duplicate detection

Version: 1.0.0 (December 2025) Issue: #599

Capability Details

content-classification

Keywords: content type, classification, document type, golden dataset Solves:

  • Classify document content types for golden dataset
  • Categorize entries by domain and purpose
  • Identify content requiring special handling

difficulty-stratification

Keywords: difficulty, stratification, complexity level, challenge rating Solves:

  • Assign difficulty levels to golden dataset entries
  • Ensure balanced difficulty distribution
  • Identify edge cases and challenging examples

quality-evaluation

Keywords: quality, evaluation, quality dimensions, quality criteria Solves:

  • Evaluate entry quality against defined criteria
  • Score entries on multiple quality dimensions
  • Identify entries needing improvement

multi-agent-analysis

Keywords: multi-agent, parallel analysis, consensus, agent evaluation Solves:

  • Run parallel agent evaluations on entries
  • Aggregate consensus from multiple analysts
  • Resolve disagreements in classifications
Weekly Installs
11
GitHub Stars
95
First Seen
Jan 22, 2026
Installed on
claude-code8
antigravity6
gemini-cli6
opencode6
windsurf5
github-copilot5