# Genie Benchmark Generator

Creates, validates, and syncs benchmark question suites for Genie Space evaluation. Handles three intake paths, ground truth SQL validation, and MLflow Evaluation Dataset synchronization.

## When to Use This Skill

- Creating benchmark questions for a new Genie Space optimization
- Refreshing benchmarks after arbiter corrections (ground truth was wrong)
- Generating synthetic benchmarks from Genie Space trusted assets
- Validating ground truth SQL against the live warehouse

## Inputs (from Orchestrator)

| Input | Type | Description |
|---|---|---|
| `space_id` | `str` | Genie Space ID |
| `uc_schema` | `str` | Unity Catalog schema for MLflow datasets |
| `domain` | `str` | Domain key (e.g., `"cost"`) |
| `user_questions` | `list` | Optional user-provided benchmark questions |

## Outputs (to Orchestrator)

| Output | Type | Description |
|---|---|---|
| `eval_dataset_name` | `str` | MLflow Evaluation Dataset name in UC (always produced when `uc_schema` is set) |
| `gt_validation_report` | `dict` | Per-question validation results |
| `yaml_path` | `str` | Path to the saved `golden-queries.yaml` |
| `benchmark_count` | `int` | Number of benchmarks created |

## Optional Outputs

| Output | Type | Description |
|---|---|---|
| `example_question_sqls` | `list[dict]` | Up to 10 example Q&A pairs for the Genie Space config. Each entry has `id` (32-hex UUID via `uuid.uuid4().hex`), `question`, and `sql`. |
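Each entry can be built with a small helper. This is an illustrative sketch, not part of the skill's API; the only firm requirement from the table above is the 32-hex `id` format.

```python
import uuid

def make_example_question_sql(question: str, sql: str) -> dict:
    """Build one example_question_sqls entry for the Genie Space config.

    The Genie API requires a 32-character hex id (uuid.uuid4().hex),
    not the dashed str(uuid.uuid4()) form.
    """
    return {
        "id": uuid.uuid4().hex,  # 32 hex chars, no dashes
        "question": question,
        "sql": sql,
    }
```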

**Warning:** The evaluator REQUIRES `eval_dataset_name` to use `mlflow.genai.evaluate()`. Skipping dataset sync means the Datasets and Evaluation tabs will be empty. Always run `sync_yaml_to_mlflow_dataset()`.

**Template variable handoff:** Ground truth SQL in `expected_sql` may use `${catalog}` and `${gold_schema}` template variables. The evaluator must receive `catalog` and `gold_schema` as job parameters and call `resolve_sql()` before any `spark.sql()` execution.
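`resolve_sql()` itself lives on the evaluator side; as a minimal sketch under the assumption that it does plain `${name}` substitution, `string.Template` matches the variable syntax used in the benchmark YAML:

```python
from string import Template

def resolve_sql(sql: str, catalog: str, gold_schema: str) -> str:
    """Substitute ${catalog} and ${gold_schema} before spark.sql() execution.

    Template.substitute() raises KeyError if an unresolved ${name} remains,
    surfacing typos before Spark reports a PARSE_SYNTAX_ERROR.
    """
    return Template(sql).substitute(catalog=catalog, gold_schema=gold_schema)
```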

## Workflow

```text
User provides questions?
├── YES (10+) → Validate → Report → Save
├── YES (1-9) → Validate → Augment → Review → Save
└── NO (0)    → Inspect assets → Generate → Review → Save
                        ↓ (all paths)
                Validate GT SQL (spark.sql)
                Sync to MLflow Evaluation Dataset
```

1. Prompt the user for benchmark questions (always ask before generating synthetic ones)
2. Validate each question against the live Genie Space trusted assets
3. Generate or augment synthetic benchmarks to fill coverage gaps
4. Show the full suite to the user for review before proceeding
5. Validate ground truth SQL: execute via `spark.sql()`, store the result hash and a sample
6. Save to `golden-queries.yaml`
7. Sync to an MLflow Evaluation Dataset in Unity Catalog

**Load:** Read `benchmark-intake-workflow.md` for the full three-path intake implementation.

**Load:** Read `benchmark-patterns.md` for question-writing rules, category coverage, and domain-specific patterns.

**Load:** Read `gt-validation.md` for ground truth SQL validation, retry logic, and MLflow dataset sync code.

## Benchmark Question Schema

Every benchmark question requires these fields:

```yaml
- id: "domain_NNN"
  question: "Natural language question"
  expected_sql: "Full SQL that returns the correct result"
  expected_asset: "MV|TVF|TABLE"
  category: "aggregation|ranking|time-series|comparison|list"
  source: "user|synthetic|augmented"
  priority: "P0|P1|P2"   # P0 = hard gate, P1 = default, P2 = nice-to-have
```

Optional fields (auto-populated by GT validation):

- `required_tables`, `required_columns`, `required_joins`
- `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`
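A hedged sketch of a schema check against the required fields above. The helper name and error strings are illustrative; the skill's own validation lives in `validate_benchmarks()`:

```python
REQUIRED_FIELDS = {"id", "question", "expected_sql", "expected_asset",
                   "category", "source", "priority"}
VALID_CATEGORIES = {"aggregation", "ranking", "time-series", "comparison", "list"}
VALID_PRIORITIES = {"P0", "P1", "P2"}

def check_benchmark_fields(bench: dict) -> list[str]:
    """Return a list of schema problems for one benchmark entry (empty = OK)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - bench.keys())]
    if bench.get("category") not in VALID_CATEGORIES:
        problems.append(f"bad category: {bench.get('category')!r}")
    if bench.get("priority") not in VALID_PRIORITIES:
        problems.append(f"bad priority: {bench.get('priority')!r}")
    return problems
```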

## ID Format Requirements

| Entity | ID Format | Example | Notes |
|---|---|---|---|
| Benchmark questions | `domain_NNN` | `cost_001` | Human-readable, sequential within domain |
| `example_question_sqls` | 32-hex UUID | `a1b2c3d4e5f6...` | Via `uuid.uuid4().hex`; required by the Genie API |
| `sql_functions` (TVFs) | 32-hex UUID | `f6e5d4c3b2a1...` | Via `uuid.uuid4().hex`; required by the Genie API |

See `golden-queries.yaml` for the full template.

## Benchmark Splits (P12)

- `assign_splits()` assigns `train` / `val` / `held_out` splits with 60/20/20 ratios.
- The `held_out` split is never used during optimization; it is reserved for post-deploy overfitting checks.
- The `split` field is optional and backward-compatible.
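A minimal sketch of how a 60/20/20 `assign_splits()` could work, assuming a seeded shuffle so `held_out` membership stays stable across re-runs (the actual implementation is in the skill's scripts):

```python
import random

def assign_splits(benchmarks: list, seed: int = 42) -> list:
    """Assign train/val/held_out splits with 60/20/20 ratios, in place."""
    order = list(range(len(benchmarks)))
    random.Random(seed).shuffle(order)  # deterministic for a given seed
    n = len(benchmarks)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    for rank, idx in enumerate(order):
        if rank < n_train:
            split = "train"
        elif rank < n_train + n_val:
            split = "val"
        else:
            split = "held_out"  # never touched during optimization
        benchmarks[idx]["split"] = split
    return benchmarks
```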

## SQL Dependency Map

- `parse_sql_dependencies()` auto-extracts `required_tables`, `required_columns`, and `required_joins` from GT SQL.
- These fields are auto-populated during `validate_benchmarks()` if not already set.
- The evaluator's `eval_scope="slice"` uses them to filter benchmarks down to those touching patched objects.
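Full dependency extraction (columns, joins, aliases, CTEs) needs a real SQL parser, but the `required_tables` half of `parse_sql_dependencies()` can be approximated with a regex over `FROM`/`JOIN` targets. The helper name below is hypothetical:

```python
import re

def parse_required_tables(sql: str) -> list[str]:
    """Sketch of required_tables extraction from GT SQL.

    Captures FROM/JOIN targets, including ${catalog}-style template
    references; deduplicates and sorts for stable YAML output.
    """
    pattern = r"\b(?:FROM|JOIN)\s+([\w.${}]+)"
    return sorted(set(re.findall(pattern, sql, flags=re.IGNORECASE)))
```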

## Enhanced GT Validation

- `validate_benchmarks()` also populates `expected_columns` from the GT SQL result.
- All validation output fields: `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`.
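One way `expected_result_hash` could be derived, assuming an order-insensitive hash so nondeterministic row ordering does not invalidate stored hashes (`hash_result` is an illustrative helper, not the skill's API):

```python
import hashlib
import json

def hash_result(rows: list, columns: list) -> str:
    """Order-insensitive SHA-256 over a GT SQL result set.

    Each row is serialized to JSON, then the serialized rows are sorted
    so two executions returning the same rows in different order hash
    identically. Column names are included so schema drift changes the hash.
    """
    canonical = sorted(json.dumps(list(row), default=str) for row in rows)
    payload = json.dumps({"columns": columns, "rows": canonical})
    return hashlib.sha256(payload.encode()).hexdigest()
```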

## Coverage Requirements (MUST)

Benchmarks MUST meet minimum coverage thresholds. Insufficient coverage produces statistically unreliable evaluation results and blocks the optimization loop (the evaluator's `min_benchmarks` guard).

| Domain Size | Min Questions | Min Categories |
|---|---|---|
| Small (1-3 tables) | 10 | 4 |
| Medium (4-8 tables) | 15 | 6 |
| Large (9+ tables) | 20 | 8 |
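The thresholds above can be enforced with a small pre-sync check. `check_coverage` and its messages are illustrative; the hard gate at evaluation time is the evaluator's `min_benchmarks` guard:

```python
# Thresholds taken from the coverage table: (min tables, min questions, min categories)
COVERAGE_THRESHOLDS = [
    (9, 20, 8),  # large: 9+ tables
    (4, 15, 6),  # medium: 4-8 tables
    (1, 10, 4),  # small: 1-3 tables
]

def check_coverage(n_tables: int, benchmarks: list) -> list[str]:
    """Return coverage violations for the domain size (empty = passes)."""
    for min_tables, min_q, min_cats in COVERAGE_THRESHOLDS:
        if n_tables >= min_tables:
            break  # first matching row is the binding threshold
    problems = []
    if len(benchmarks) < min_q:
        problems.append(f"need >= {min_q} questions, have {len(benchmarks)}")
    categories = {b["category"] for b in benchmarks}
    if len(categories) < min_cats:
        problems.append(f"need >= {min_cats} categories, have {len(categories)}")
    return problems
```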

## Common Mistakes

| Mistake | Consequence | Fix |
|---|---|---|
| Generating benchmarks without asking the user first | Misses real user intent | Always prompt before synthetic generation |
| Skipping GT SQL validation | Chasing wrong ground truth | Execute every GT SQL via `spark.sql()` before acceptance |
| Not validating against the live space | Questions reference non-trusted assets | Check all SQL references against the space config |
| Insufficient category coverage | Missing failure patterns in evaluation | Ensure 4+ categories; fill gaps with synthetic questions |
| Template variables in `expected_sql` without a documented handoff | Evaluator gets `PARSE_SYNTAX_ERROR` | Document that the evaluator must receive `catalog` and `gold_schema` as parameters and call `resolve_sql()` |
| Skipping dataset sync to UC | Evaluator can't use `mlflow.genai.evaluate(data=...)`; Evaluation tab is empty | Always run `sync_yaml_to_mlflow_dataset()`; it is required for the MLflow GenAI evaluation flow |

## Ground Truth Validation Handoff

The generator's GT validation functions (`validate_ground_truth_sql()` and `validate_with_retry()` from `references/gt-validation.md`) are consumed by the evaluator in Cell 3a as a structural pre-check before evaluation begins. The evaluator:

1. Executes each benchmark's `expected_sql` with `LIMIT 1`
2. Auto-remediates failures via LLM with full schema context (`information_schema.columns` + `INFORMATION_SCHEMA.ROUTINES`)
3. Emits `gt_remediation_queue.yaml` for unrepairable benchmarks, which the orchestrator routes back to the generator

The generator should expect to receive remediation requests containing the original SQL and Spark error messages, and should produce replacement benchmarks in response.
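As a sketch, applying one remediation replacement might look like the following, assuming queue entries carry a `question_id` matching the benchmark `id` (the exact `gt_remediation_queue.yaml` shape is not specified here). Stale GT validation fields are cleared so the entry goes through validation again:

```python
def apply_remediation(benchmarks: list, question_id: str, replacement_sql: str) -> None:
    """Replace a benchmark's expected_sql after remediation.

    Clears result fields populated by the previous GT validation pass
    so the replacement SQL must be re-validated before sync.
    """
    by_id = {b["id"]: b for b in benchmarks}
    bench = by_id[question_id]  # KeyError surfaces an unknown benchmark id
    bench["expected_sql"] = replacement_sql
    for stale in ("expected_result_hash", "expected_result_sample",
                  "expected_row_count", "expected_columns"):
        bench.pop(stale, None)
```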

## HARD CONSTRAINTS

- `sync_yaml_to_mlflow_dataset()` MUST run when `uc_schema` is available. Skipping it leaves the Datasets tab empty.
- Generated benchmarks for large domains (9+ tables) MUST meet the minimum question count (20) and minimum category count (8). These are hard requirements, not recommendations.

## Benchmark Correction

When the orchestrator's arbiter detects that Genie produced correct SQL but the benchmark `expected_sql` was wrong (a `genie_correct` verdict), it passes `corrected_questions` to the generator. This triggers the benchmark correction workflow:

1. Receive `corrected_questions` from the orchestrator. Each entry contains `question_id`, `verdict`, `expected_sql` (old), `generated_sql` (Genie's correct SQL), and `rationale`.
2. Update the benchmark YAML: replace the `expected_sql` for each corrected question with Genie's SQL.
3. Re-validate the updated benchmarks and re-run `sync_yaml_to_mlflow_dataset()` to confirm the new expected SQL is valid.
4. Return the updated `eval_dataset_name` to the orchestrator so the next evaluation uses the corrected benchmarks.

This ensures benchmark quality improves over iterations rather than penalizing Genie for correct answers.
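The correction steps above can be sketched as follows, assuming `corrected_questions` entries have the shape described in step 1 (`apply_corrections` is an illustrative helper name):

```python
def apply_corrections(benchmarks: list, corrected_questions: list) -> list[str]:
    """Adopt Genie's SQL wherever the arbiter ruled the benchmark wrong.

    Only genie_correct verdicts are applied; other verdicts pass through
    untouched. Returns the ids of updated benchmarks so the caller can
    re-validate and re-sync just those entries.
    """
    by_id = {b["id"]: b for b in benchmarks}
    updated = []
    for corr in corrected_questions:
        if corr["verdict"] != "genie_correct":
            continue
        bench = by_id[corr["question_id"]]
        bench["expected_sql"] = corr["generated_sql"]  # Genie's correct SQL
        updated.append(corr["question_id"])
    return updated
```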

## Scripts

### `benchmark_generator.py`

Standalone CLI for benchmark generation, validation, and MLflow sync:

```shell
python scripts/benchmark_generator.py --space-id <ID> --domain cost --uc-schema catalog.schema
python scripts/benchmark_generator.py --space-id <ID> --domain cost --validate-only
```

## Reference Index

| Reference | What to Find |
|---|---|
| `benchmark-intake-workflow.md` | Three-path intake, validation pipeline, augmentation strategy |
| `benchmark-patterns.md` | Question-writing rules, SQL expectations, domain patterns, validation checklist |
| `gt-validation.md` | GT SQL execution, retry logic, hash storage, MLflow dataset sync |
| `golden-queries.yaml` | Template with example cost-domain benchmarks |