# Genie Benchmark Generator

Creates, validates, and syncs benchmark question suites for Genie Space evaluation. Handles three intake paths, ground truth SQL validation, and MLflow Evaluation Dataset synchronization.

## When to Use This Skill

- Creating benchmark questions for a new Genie Space optimization
- Refreshing benchmarks after arbiter corrections (ground truth was wrong)
- Generating synthetic benchmarks from Genie Space trusted assets
- Validating ground truth SQL against the live warehouse

## Inputs (from Orchestrator)

| Input | Type | Description |
|---|---|---|
| `space_id` | `str` | Genie Space ID |
| `uc_schema` | `str` | Unity Catalog schema for MLflow datasets |
| `domain` | `str` | Domain key (e.g., `"cost"`) |
| `user_questions` | `list` | Optional user-provided benchmark questions |

## Outputs (to Orchestrator)

| Output | Type | Description |
|---|---|---|
| `eval_dataset_name` | `str` | MLflow Evaluation Dataset name in UC (always produced when `uc_schema` is set) |
| `gt_validation_report` | `dict` | Per-question validation results |
| `yaml_path` | `str` | Path to the saved `golden-queries.yaml` |
| `benchmark_count` | `int` | Number of benchmarks created |

## Optional Outputs

| Output | Type | Description |
|---|---|---|
| `example_question_sqls` | `list[dict]` | Up to 10 example Q&A pairs for the Genie Space config. Each entry has `id` (32-hex UUID via `uuid.uuid4().hex`), `question`, and `sql`. |
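Each entry can be built with a small helper. This is an illustrative sketch, not part of the skill's API; the only firm requirement from the table above is the 32-hex `id` format.

```python
import uuid

def make_example_question_sql(question: str, sql: str) -> dict:
    """Build one example_question_sqls entry for the Genie Space config.

    The Genie API requires a 32-character hex id (uuid.uuid4().hex),
    not the dashed str(uuid.uuid4()) form.
    """
    return {
        "id": uuid.uuid4().hex,  # 32 hex chars, no dashes
        "question": question,
        "sql": sql,
    }
```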

**Warning:** The evaluator REQUIRES `eval_dataset_name` to use `mlflow.genai.evaluate()`. Skipping dataset sync means the Datasets and Evaluation tabs will be empty. Always run `sync_yaml_to_mlflow_dataset()`.

**Template variable handoff:** Ground truth SQL in `expected_sql` may use `${catalog}` and `${gold_schema}` template variables. The evaluator must receive `catalog` and `gold_schema` as job parameters and call `resolve_sql()` before any `spark.sql()` execution.
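`resolve_sql()` itself lives on the evaluator side; as a minimal sketch under the assumption that it does plain `${name}` substitution, `string.Template` matches the variable syntax used in the benchmark YAML:

```python
from string import Template

def resolve_sql(sql: str, catalog: str, gold_schema: str) -> str:
    """Substitute ${catalog} and ${gold_schema} before spark.sql() execution.

    Template.substitute() raises KeyError if an unresolved ${name} remains,
    surfacing typos before Spark reports a PARSE_SYNTAX_ERROR.
    """
    return Template(sql).substitute(catalog=catalog, gold_schema=gold_schema)
```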

## Workflow

```text
User provides questions?
├── YES (10+) → Validate → Report → Save
├── YES (1-9) → Validate → Augment → Review → Save
└── NO (0)    → Inspect assets → Generate → Review → Save
                        ↓ (all paths)
                Validate GT SQL (spark.sql)
                Sync to MLflow Evaluation Dataset
```

1. Prompt the user for benchmark questions (always ask before generating synthetic ones)
2. Validate each question against the live Genie Space trusted assets
3. Generate or augment synthetic benchmarks to fill coverage gaps
4. Show the full suite to the user for review before proceeding
5. Validate ground truth SQL: execute via `spark.sql()`, store the result hash and a sample
6. Save to `golden-queries.yaml`
7. Sync to an MLflow Evaluation Dataset in Unity Catalog

**Load:** Read `benchmark-intake-workflow.md` for the full three-path intake implementation.

**Load:** Read `benchmark-patterns.md` for question-writing rules, category coverage, and domain-specific patterns.

**Load:** Read `gt-validation.md` for ground truth SQL validation, retry logic, and MLflow dataset sync code.

## Benchmark Question Schema

Every benchmark question requires these fields:

```yaml
- id: "domain_NNN"
  question: "Natural language question"
  expected_sql: "Full SQL that returns the correct result"
  expected_asset: "MV|TVF|TABLE"
  category: "aggregation|ranking|time-series|comparison|list"
  source: "user|synthetic|augmented"
  priority: "P0|P1|P2"   # P0 = hard gate, P1 = default, P2 = nice-to-have
```

Optional fields (auto-populated by GT validation):

- `required_tables`, `required_columns`, `required_joins`
- `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`
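A hedged sketch of a schema check against the required fields above. The helper name and error strings are illustrative; the skill's own validation lives in `validate_benchmarks()`:

```python
REQUIRED_FIELDS = {"id", "question", "expected_sql", "expected_asset",
                   "category", "source", "priority"}
VALID_CATEGORIES = {"aggregation", "ranking", "time-series", "comparison", "list"}
VALID_PRIORITIES = {"P0", "P1", "P2"}

def check_benchmark_fields(bench: dict) -> list[str]:
    """Return a list of schema problems for one benchmark entry (empty = OK)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - bench.keys())]
    if bench.get("category") not in VALID_CATEGORIES:
        problems.append(f"bad category: {bench.get('category')!r}")
    if bench.get("priority") not in VALID_PRIORITIES:
        problems.append(f"bad priority: {bench.get('priority')!r}")
    return problems
```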

## ID Format Requirements

| Entity | ID Format | Example | Notes |
|---|---|---|---|
| Benchmark questions | `domain_NNN` | `cost_001` | Human-readable, sequential within domain |
| `example_question_sqls` | 32-hex UUID | `a1b2c3d4e5f6...` | Via `uuid.uuid4().hex`; required by the Genie API |
| `sql_functions` (TVFs) | 32-hex UUID | `f6e5d4c3b2a1...` | Via `uuid.uuid4().hex`; required by the Genie API |

See `golden-queries.yaml` for the full template.

## Benchmark Splits (P12)

- `assign_splits()` assigns `train` / `val` / `held_out` splits with 60/20/20 ratios.
- The `held_out` split is never used during optimization; it is reserved for post-deploy overfitting checks.
- The `split` field is optional and backward-compatible.
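A minimal sketch of how a 60/20/20 `assign_splits()` could work, assuming a seeded shuffle so `held_out` membership stays stable across re-runs (the actual implementation is in the skill's scripts):

```python
import random

def assign_splits(benchmarks: list, seed: int = 42) -> list:
    """Assign train/val/held_out splits with 60/20/20 ratios, in place."""
    order = list(range(len(benchmarks)))
    random.Random(seed).shuffle(order)  # deterministic for a given seed
    n = len(benchmarks)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    for rank, idx in enumerate(order):
        if rank < n_train:
            split = "train"
        elif rank < n_train + n_val:
            split = "val"
        else:
            split = "held_out"  # never touched during optimization
        benchmarks[idx]["split"] = split
    return benchmarks
```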

## SQL Dependency Map

- `parse_sql_dependencies()` auto-extracts `required_tables`, `required_columns`, and `required_joins` from GT SQL.
- These fields are auto-populated during `validate_benchmarks()` if not already set.
- The evaluator's `eval_scope="slice"` uses them to filter benchmarks down to those touching patched objects.
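Full dependency extraction (columns, joins, aliases, CTEs) needs a real SQL parser, but the `required_tables` half of `parse_sql_dependencies()` can be approximated with a regex over `FROM`/`JOIN` targets. The helper name below is hypothetical:

```python
import re

def parse_required_tables(sql: str) -> list[str]:
    """Sketch of required_tables extraction from GT SQL.

    Captures FROM/JOIN targets, including ${catalog}-style template
    references; deduplicates and sorts for stable YAML output.
    """
    pattern = r"\b(?:FROM|JOIN)\s+([\w.${}]+)"
    return sorted(set(re.findall(pattern, sql, flags=re.IGNORECASE)))
```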

## Enhanced GT Validation

- `validate_benchmarks()` also populates `expected_columns` from the GT SQL result.
- All validation output fields: `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`.
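One way `expected_result_hash` could be derived, assuming an order-insensitive hash so nondeterministic row ordering does not invalidate stored hashes (`hash_result` is an illustrative helper, not the skill's API):

```python
import hashlib
import json

def hash_result(rows: list, columns: list) -> str:
    """Order-insensitive SHA-256 over a GT SQL result set.

    Each row is serialized to JSON, then the serialized rows are sorted
    so two executions returning the same rows in different order hash
    identically. Column names are included so schema drift changes the hash.
    """
    canonical = sorted(json.dumps(list(row), default=str) for row in rows)
    payload = json.dumps({"columns": columns, "rows": canonical})
    return hashlib.sha256(payload.encode()).hexdigest()
```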

## Coverage Requirements (MUST)

Benchmarks MUST meet minimum coverage thresholds. Insufficient coverage produces statistically unreliable evaluation results and blocks the optimization loop (the evaluator's `min_benchmarks` guard).

| Domain Size | Min Questions | Min Categories |
|---|---|---|
| Small (1-3 tables) | 10 | 4 |
| Medium (4-8 tables) | 15 | 6 |
| Large (9+ tables) | 20 | 8 |
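The thresholds above can be enforced with a small pre-sync check. `check_coverage` and its messages are illustrative; the hard gate at evaluation time is the evaluator's `min_benchmarks` guard:

```python
# Thresholds taken from the coverage table: (min tables, min questions, min categories)
COVERAGE_THRESHOLDS = [
    (9, 20, 8),  # large: 9+ tables
    (4, 15, 6),  # medium: 4-8 tables
    (1, 10, 4),  # small: 1-3 tables
]

def check_coverage(n_tables: int, benchmarks: list) -> list[str]:
    """Return coverage violations for the domain size (empty = passes)."""
    for min_tables, min_q, min_cats in COVERAGE_THRESHOLDS:
        if n_tables >= min_tables:
            break  # first matching row is the binding threshold
    problems = []
    if len(benchmarks) < min_q:
        problems.append(f"need >= {min_q} questions, have {len(benchmarks)}")
    categories = {b["category"] for b in benchmarks}
    if len(categories) < min_cats:
        problems.append(f"need >= {min_cats} categories, have {len(categories)}")
    return problems
```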

## Common Mistakes

| Mistake | Consequence | Fix |
|---|---|---|
| Generating benchmarks without asking the user first | Misses real user intent | Always prompt before synthetic generation |
| Skipping GT SQL validation | Chasing wrong ground truth | Execute every GT SQL via `spark.sql()` before acceptance |
| Not validating against the live space | Questions reference non-trusted assets | Check all SQL references against the space config |
| Insufficient category coverage | Missing failure patterns in evaluation | Ensure 4+ categories; fill gaps with synthetic questions |
| Template variables in `expected_sql` without a documented handoff | Evaluator gets `PARSE_SYNTAX_ERROR` | Document that the evaluator must receive `catalog` and `gold_schema` as parameters and call `resolve_sql()` |
| Skipping dataset sync to UC | Evaluator can't use `mlflow.genai.evaluate(data=...)`; Evaluation tab is empty | Always run `sync_yaml_to_mlflow_dataset()`; it is required for the MLflow GenAI evaluation flow |

## Ground Truth Validation Handoff

The generator's GT validation functions (`validate_ground_truth_sql()` and `validate_with_retry()` from `references/gt-validation.md`) are consumed by the evaluator in Cell 3a as a structural pre-check before evaluation begins. The evaluator:

1. Executes each benchmark's `expected_sql` with `LIMIT 1`
2. Auto-remediates failures via LLM with full schema context (`information_schema.columns` + `INFORMATION_SCHEMA.ROUTINES`)
3. Emits `gt_remediation_queue.yaml` for unrepairable benchmarks, which the orchestrator routes back to the generator

The generator should expect to receive remediation requests containing the original SQL and Spark error messages, and should produce replacement benchmarks in response.
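As a sketch, applying one remediation replacement might look like the following, assuming queue entries carry a `question_id` matching the benchmark `id` (the exact `gt_remediation_queue.yaml` shape is not specified here). Stale GT validation fields are cleared so the entry goes through validation again:

```python
def apply_remediation(benchmarks: list, question_id: str, replacement_sql: str) -> None:
    """Replace a benchmark's expected_sql after remediation.

    Clears result fields populated by the previous GT validation pass
    so the replacement SQL must be re-validated before sync.
    """
    by_id = {b["id"]: b for b in benchmarks}
    bench = by_id[question_id]  # KeyError surfaces an unknown benchmark id
    bench["expected_sql"] = replacement_sql
    for stale in ("expected_result_hash", "expected_result_sample",
                  "expected_row_count", "expected_columns"):
        bench.pop(stale, None)
```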

## HARD CONSTRAINTS

- `sync_yaml_to_mlflow_dataset()` MUST run when `uc_schema` is available. Skipping it leaves the Datasets tab empty.
- Generated benchmarks for large domains (9+ tables) MUST meet the minimum question count (20) and minimum category count (8). These are hard requirements, not recommendations.

## Benchmark Correction

When the orchestrator's arbiter detects that Genie produced correct SQL but the benchmark `expected_sql` was wrong (a `genie_correct` verdict), it passes `corrected_questions` to the generator. This triggers the benchmark correction workflow:

1. Receive `corrected_questions` from the orchestrator. Each entry contains `question_id`, `verdict`, `expected_sql` (old), `generated_sql` (Genie's correct SQL), and `rationale`.
2. Update the benchmark YAML: replace the `expected_sql` for each corrected question with Genie's SQL.
3. Re-validate the updated benchmarks and re-run `sync_yaml_to_mlflow_dataset()` to confirm the new expected SQL is valid.
4. Return the updated `eval_dataset_name` to the orchestrator so the next evaluation uses the corrected benchmarks.

This ensures benchmark quality improves over iterations rather than penalizing Genie for correct answers.
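The correction steps above can be sketched as follows, assuming `corrected_questions` entries have the shape described in step 1 (`apply_corrections` is an illustrative helper name):

```python
def apply_corrections(benchmarks: list, corrected_questions: list) -> list[str]:
    """Adopt Genie's SQL wherever the arbiter ruled the benchmark wrong.

    Only genie_correct verdicts are applied; other verdicts pass through
    untouched. Returns the ids of updated benchmarks so the caller can
    re-validate and re-sync just those entries.
    """
    by_id = {b["id"]: b for b in benchmarks}
    updated = []
    for corr in corrected_questions:
        if corr["verdict"] != "genie_correct":
            continue
        bench = by_id[corr["question_id"]]
        bench["expected_sql"] = corr["generated_sql"]  # Genie's correct SQL
        updated.append(corr["question_id"])
    return updated
```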

## Scripts

### `benchmark_generator.py`

Standalone CLI for benchmark generation, validation, and MLflow sync:

```shell
python scripts/benchmark_generator.py --space-id <ID> --domain cost --uc-schema catalog.schema
python scripts/benchmark_generator.py --space-id <ID> --domain cost --validate-only
```

## Reference Index

| Reference | What to Find |
|---|---|
| `benchmark-intake-workflow.md` | Three-path intake, validation pipeline, augmentation strategy |
| `benchmark-patterns.md` | Question-writing rules, SQL expectations, domain patterns, validation checklist |
| `gt-validation.md` | GT SQL execution, retry logic, hash storage, MLflow dataset sync |
| `golden-queries.yaml` | Template with example cost-domain benchmarks |