# Genie Benchmark Generator

`genie-benchmark-generator` creates, validates, and syncs benchmark question suites for Genie Space evaluation. It handles three intake paths, ground truth SQL validation, and MLflow Evaluation Dataset synchronization.
## When to Use This Skill
- Creating benchmark questions for a new Genie Space optimization
- Refreshing benchmarks after arbiter corrections (ground truth was wrong)
- Generating synthetic benchmarks from Genie Space trusted assets
- Validating ground truth SQL against the live warehouse
## Inputs (from Orchestrator)

| Input | Type | Description |
|---|---|---|
| `space_id` | str | Genie Space ID |
| `uc_schema` | str | Unity Catalog schema for MLflow datasets |
| `domain` | str | Domain key (e.g., `"cost"`) |
| `user_questions` | list | Optional user-provided benchmark questions |
## Outputs (to Orchestrator)

| Output | Type | Description |
|---|---|---|
| `eval_dataset_name` | str | MLflow Evaluation Dataset name in UC (always produced when `uc_schema` is set) |
| `gt_validation_report` | dict | Per-question validation results |
| `yaml_path` | str | Path to saved `golden-queries.yaml` |
| `benchmark_count` | int | Number of benchmarks created |
### Optional Outputs

| Output | Type | Description |
|---|---|---|
| `example_question_sqls` | list[dict] | Up to 10 example Q&A pairs for the Genie Space config. Each entry has `id` (32-hex UUID via `uuid.uuid4().hex`), `question`, and `sql`. |
**Warning:** The evaluator REQUIRES `eval_dataset_name` to use `mlflow.genai.evaluate()`. Skipping dataset sync means the Datasets and Evaluation tabs will be empty. Always run `sync_yaml_to_mlflow_dataset()`.
**Template variable handoff:** Ground truth SQL in `expected_sql` may use `${catalog}` and `${gold_schema}` template variables. The evaluator must receive `catalog` and `gold_schema` as job parameters and call `resolve_sql()` before any `spark.sql()` execution.
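The resolution step can be sketched with the standard library's `string.Template`; this is an assumption about how `resolve_sql()` behaves, not the actual implementation from the evaluator:

```python
from string import Template


def resolve_sql(sql: str, catalog: str, gold_schema: str) -> str:
    """Substitute ${catalog} / ${gold_schema} template variables in GT SQL.

    Hypothetical sketch: the real resolve_sql() lives in the evaluator and
    may handle additional variables or escaping.
    """
    return Template(sql).substitute(catalog=catalog, gold_schema=gold_schema)


resolved = resolve_sql(
    "SELECT * FROM ${catalog}.${gold_schema}.daily_cost",
    catalog="main",
    gold_schema="gold",
)
print(resolved)  # SELECT * FROM main.gold.daily_cost
```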
## Workflow

```
User provides questions?
├── YES (10+) → Validate → Report → Save
├── YES (1-9) → Validate → Augment → Review → Save
└── NO (0)    → Inspect assets → Generate → Review → Save
              ↓
    Validate GT SQL (spark.sql)
              ↓
    Sync to MLflow Evaluation Dataset
```
- Prompt the user for benchmark questions (always ask before generating synthetic)
- Validate each question against the live Genie Space trusted assets
- Generate / augment synthetic benchmarks to fill coverage gaps
- Show to the user for review before proceeding
- Validate ground truth SQL — execute via `spark.sql()`, store result hash + sample
- Save to `golden-queries.yaml`
- Sync to MLflow Evaluation Dataset in Unity Catalog
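The intake branching above can be sketched as a small selector. This is illustrative only; the function name `choose_intake_path` and the returned path labels are not part of the skill's API:

```python
from typing import Optional


def choose_intake_path(user_questions: Optional[list]) -> str:
    """Pick the intake path from the number of user-provided questions.

    Thresholds mirror the workflow diagram: 10+, 1-9, or 0 questions.
    """
    n = len(user_questions or [])
    if n >= 10:
        return "validate_report_save"
    if n >= 1:
        return "validate_augment_review_save"
    return "inspect_generate_review_save"
```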
**Load:** Read `benchmark-intake-workflow.md` for the full three-path intake implementation.
**Load:** Read `benchmark-patterns.md` for question writing rules, category coverage, and domain-specific patterns.
**Load:** Read `gt-validation.md` for ground truth SQL validation, retry logic, and MLflow dataset sync code.
## Benchmark Question Schema

Every benchmark question requires these fields:

```yaml
- id: "domain_NNN"
  question: "Natural language question"
  expected_sql: "Full SQL that returns correct result"
  expected_asset: "MV|TVF|TABLE"
  category: "aggregation|ranking|time-series|comparison|list"
  source: "user|synthetic|augmented"
  priority: "P0|P1|P2"  # P0 = hard gate, P1 = default, P2 = nice-to-have
```
Optional fields (auto-populated by GT validation): `required_tables`, `required_columns`, `required_joins`, `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`.
## ID Format Requirements

| Entity | ID Format | Example | Notes |
|---|---|---|---|
| Benchmark questions | `domain_NNN` | `cost_001` | Human-readable, sequential within domain |
| `example_question_sqls` | 32-hex UUID | `a1b2c3d4e5f6...` | Via `uuid.uuid4().hex` — required by Genie API |
| `sql_functions` (TVFs) | 32-hex UUID | `f6e5d4c3b2a1...` | Via `uuid.uuid4().hex` — required by Genie API |

See `golden-queries.yaml` for the full template.
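Both ID styles can be generated like this; `next_benchmark_id` is a hypothetical helper for the sequential format, while `uuid.uuid4().hex` is the documented requirement for Genie asset IDs:

```python
import uuid


def next_benchmark_id(domain: str, existing: list) -> str:
    """Next sequential human-readable ID, e.g. cost_001, cost_002 (sketch)."""
    nums = [
        int(i.rsplit("_", 1)[1])
        for i in existing
        if i.startswith(domain + "_")
    ]
    return f"{domain}_{(max(nums, default=0) + 1):03d}"


def genie_asset_id() -> str:
    """32-hex UUID, as required by the Genie API for example_question_sqls and TVFs."""
    return uuid.uuid4().hex


print(next_benchmark_id("cost", ["cost_001", "cost_002"]))  # cost_003
assert len(genie_asset_id()) == 32
```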
## Benchmark Splits (P12)

- `assign_splits()` assigns `train`/`val`/`held_out` with 60/20/20 ratios.
- The `held_out` split is never used during optimization; it is reserved only for post-deploy overfitting checks.
- The `split` field is optional and backward-compatible.
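A minimal sketch of the split assignment, assuming a deterministic shuffle-and-slice approach; the real `assign_splits()` may stratify by category or priority:

```python
import random


def assign_splits(benchmarks: list, seed: int = 42) -> list:
    """Assign train/val/held_out splits with 60/20/20 ratios (sketch).

    Deterministic for a given seed so re-runs produce stable splits.
    """
    items = list(benchmarks)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(n * 0.6)
    n_val = round(n * 0.2)
    for i, b in enumerate(items):
        if i < n_train:
            b["split"] = "train"
        elif i < n_train + n_val:
            b["split"] = "val"
        else:
            b["split"] = "held_out"  # reserved for post-deploy checks
    return items
```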
## SQL Dependency Map

- `parse_sql_dependencies()` auto-extracts `required_tables`, `required_columns`, and `required_joins` from GT SQL.
- These fields are auto-populated during `validate_benchmarks()` if not already set.
- Used by the evaluator's `eval_scope="slice"` to filter benchmarks by patched objects.
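A naive regex sketch of the table-extraction part; the real `parse_sql_dependencies()` also extracts `required_columns` and join conditions, and a production version would use a SQL parser rather than regexes:

```python
import re


def parse_sql_dependencies(sql: str) -> dict:
    """Extract table references from FROM/JOIN clauses (naive sketch).

    Handles ${catalog}-style template variables in table names, but not
    subqueries, CTE aliases, or quoted identifiers.
    """
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([\w.${}]+)", sql, flags=re.IGNORECASE)
    joins = re.findall(r"\bJOIN\s+([\w.${}]+)", sql, flags=re.IGNORECASE)
    return {
        "required_tables": sorted(set(tables)),
        "required_joins": sorted(set(joins)),
    }
```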
## Enhanced GT Validation

- `validate_benchmarks()` now also populates `expected_columns` from the GT SQL result.
- All validation output fields: `expected_result_hash`, `expected_result_sample`, `expected_row_count`, `expected_columns`.
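One plausible way to compute `expected_result_hash` is an order-insensitive digest over the result rows; this is an assumption about the hashing scheme, not the implementation from `gt-validation.md`:

```python
import hashlib


def result_hash(rows: list, columns: list) -> str:
    """Order-insensitive hash of a GT SQL result set (sketch).

    Rows are stringified and sorted so the hash is stable regardless of
    result ordering; the column list is included so schema changes are caught.
    """
    canonical = "|".join(columns) + "\n" + "\n".join(
        sorted(",".join(map(str, row)) for row in rows)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```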
## Coverage Requirements (MUST)
Benchmarks MUST meet minimum coverage thresholds. Insufficient coverage produces statistically unreliable evaluation results and blocks the optimization loop (evaluator's min_benchmarks guard).
| Domain Size | Min Questions | Min Categories |
|---|---|---|
| Small (1-3 tables) | 10 | 4 |
| Medium (4-8 tables) | 15 | 6 |
| Large (9+ tables) | 20 | 8 |
## Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Generating benchmarks without asking user first | Misses real user intent | Always prompt before synthetic generation |
| Skipping GT SQL validation | Chasing wrong ground truth | Execute every GT SQL via `spark.sql()` before acceptance |
| Not validating against live space | Questions reference non-trusted assets | Check all SQL references against space config |
| Insufficient category coverage | Missing failure patterns in evaluation | Ensure 4+ categories, fill gaps with synthetic |
| Template variables in `expected_sql` without documenting handoff | Evaluator gets `PARSE_SYNTAX_ERROR` | Document that evaluator must receive `catalog` and `gold_schema` as parameters and call `resolve_sql()` |
| Skipping dataset sync to UC | Evaluator can't use `mlflow.genai.evaluate(data=...)`, Evaluation tab empty | Always run `sync_yaml_to_mlflow_dataset()` — it's required for the MLflow GenAI evaluation flow |
## Ground Truth Validation Handoff

The generator's GT validation functions (`validate_ground_truth_sql()` and `validate_with_retry()` from `references/gt-validation.md`) are consumed by the evaluator in Cell 3a as a structural pre-check before evaluation begins. The evaluator:

- Executes each benchmark's `expected_sql` with `LIMIT 1`
- Auto-remediates failures via LLM with full schema context (`information_schema.columns` + `INFORMATION_SCHEMA.ROUTINES`)
- Emits `gt_remediation_queue.yaml` for unrepairable benchmarks, which the orchestrator routes back to the generator
The generator should expect to receive remediation requests containing the original SQL and Spark error messages, and produce replacement benchmarks.
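A retry wrapper of the kind `validate_with_retry()` suggests could look like this; the signature and return shape are assumptions, since the real helper lives in `references/gt-validation.md`:

```python
import time


def validate_with_retry(run_sql, sql: str, attempts: int = 3, delay: float = 1.0) -> dict:
    """Run GT SQL with retries for transient warehouse errors (sketch).

    run_sql is a callable, e.g. lambda s: spark.sql(s).collect().
    Returns a dict with ok/result or ok/error plus the attempt count.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return {"ok": True, "result": run_sql(sql), "attempts": attempt}
        except Exception as err:  # e.g. warehouse cold start, timeout
            last_err = err
            time.sleep(delay * attempt)  # linear backoff between attempts
    return {"ok": False, "error": str(last_err), "attempts": attempts}
```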
## HARD CONSTRAINTS

- `sync_yaml_to_mlflow_dataset()` MUST run when `uc_schema` is available. Skipping leaves the Datasets tab empty.
- Generated benchmarks for large domains (9+ tables) MUST meet the minimum question count (20) and minimum category count (8). These are hard requirements, not recommendations.
## Benchmark Correction

When the orchestrator's arbiter detects that Genie produced correct SQL but the benchmark `expected_sql` was wrong (`genie_correct` verdict), it passes `corrected_questions` to the generator. This triggers a benchmark correction workflow:

1. Receive `corrected_questions` from the orchestrator. Each entry contains `question_id`, `verdict`, `expected_sql` (old), `generated_sql` (Genie's correct SQL), and `rationale`.
2. Update the benchmark YAML: replace the `expected_sql` for each corrected question with Genie's SQL.
3. Re-validate the updated benchmarks via `sync_yaml_to_mlflow_dataset()` to ensure the new expected SQL is valid.
4. Return the updated `eval_dataset_name` to the orchestrator so the next evaluation uses corrected benchmarks.
This ensures benchmark quality improves over iterations rather than penalizing Genie for correct answers.
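The correction step can be sketched as a dict-merge over the loaded benchmarks; `apply_corrections` and the `source: "corrected"` provenance marker are assumptions, not part of the documented schema:

```python
def apply_corrections(benchmarks: list, corrected_questions: list) -> list:
    """Replace expected_sql with Genie's SQL for genie_correct verdicts (sketch)."""
    fixes = {
        c["question_id"]: c["generated_sql"]
        for c in corrected_questions
        if c.get("verdict") == "genie_correct"
    }
    for b in benchmarks:
        if b["id"] in fixes:
            b["expected_sql"] = fixes[b["id"]]
            b["source"] = "corrected"  # assumed provenance marker, not in the schema
    return benchmarks
```

After this, the updated list is re-saved to `golden-queries.yaml` and re-synced so the next evaluation run sees the corrected ground truth.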
## Scripts

### `benchmark_generator.py`

Standalone CLI for benchmark generation, validation, and MLflow sync:

```bash
python scripts/benchmark_generator.py --space-id <ID> --domain cost --uc-schema catalog.schema
python scripts/benchmark_generator.py --space-id <ID> --domain cost --validate-only
```
## Reference Index
| Reference | What to Find |
|---|---|
| `benchmark-intake-workflow.md` | Three-path intake, validation pipeline, augmentation strategy |
| `benchmark-patterns.md` | Question writing rules, SQL expectations, domain patterns, validation checklist |
| `gt-validation.md` | GT SQL execution, retry logic, hash storage, MLflow dataset sync |
| `golden-queries.yaml` | Template with example cost domain benchmarks |