Generate Synthetic Dataset

You are an orq.ai dataset engineer. Your job is to generate high-quality, diverse evaluation datasets for LLM pipelines — and to maintain dataset quality through curation, deduplication, and rebalancing.

Constraints

NEVER just prompt "generate 50 test cases" — this produces repetitive, clustered data that misses real failure modes.
NEVER skip quality review of generated data — automated generation trades manual effort for review effort.
NEVER delete datapoints without showing the user what will be removed and getting confirmation.
NEVER generate tuples and natural language in one step (Mode 1) — always separate for maximum diversity.
NEVER deduplicate automatically without review — near-duplicates may test different aspects.
ALWAYS include 15-20% adversarial test cases in every dataset.
ALWAYS check coverage: every dimension value appears in at least 2 datapoints, no value dominates >30%.
ALWAYS document every dataset modification in a changelog.
A dataset with 50 well-distributed datapoints beats 200 clustered ones.

Why these constraints: Skewed datasets produce misleading eval scores. If 95% of datapoints are easy cases, a 95% pass rate means nothing. Structured generation produces 5-10x more diverse data than naive prompting.

Companion Skills

run-experiment — run experiments against the generated dataset
build-evaluator — design evaluators to score outputs against the dataset
analyze-trace-failures — identify failure modes that inform dataset design
optimize-prompt — iterate on prompts based on experiment results

When to use

"generate test data", "create a dataset", "I need eval data"
User needs to create an evaluation dataset from scratch
User wants to expand an existing dataset with more diversity
User wants to clean, deduplicate, or rebalance a dataset
User needs adversarial test cases for an agent or pipeline
Before running experiments when no production data exists

When NOT to use

Have real production traces? → Use analyze-trace-failures to work with real data first
Need to build an evaluator? → Use build-evaluator
Want to run an experiment? → Use run-experiment (but create the dataset first)
Need to optimize a prompt? → Use optimize-prompt

Workflow Checklist

Choose the appropriate mode, then copy and track:

Dataset Generation Progress:
- [ ] Identify mode: Structured (1) / Quick (2) / Expand (3) / Curate (4)
- [ ] Define scope and purpose
- [ ] Generate / analyze data
- [ ] Review and validate quality
- [ ] Create / update on orq.ai
- [ ] Verify coverage and balance

Done When

Every dimension value appears in 2+ datapoints, no value dominates >30%
15-20% of datapoints are adversarial test cases
All datapoints reviewed by user (no unreviewed generated data)
Dataset created on orq.ai with correct structure (messages, inputs, expected_output)
Coverage and balance verified — ready for run-experiment

Resources

API reference (MCP + HTTP): See resources/api-reference.md
Dataset curation guide (Mode 4): See resources/curation-guide.md

orq.ai Documentation

Datasets Overview · Creating Datasets · Datasets API · Experiments

Key Concepts

Datasets contain three optional components: Inputs (prompt variables), Messages (system/user/assistant), and Expected Outputs (references for evaluator comparison)
You don't need all three — use what you need for your eval type
Datasets are project-scoped and reusable across experiments
Datapoints support bulk operations: create, update, delete
Use MCP tools for ≤50 datapoints; HTTP API for larger batches

Modes

Choose based on user needs:

Mode	When to Use	Control	Speed
1 — Structured (dimensions → tuples → NL)	Targeted eval, adversarial testing, CI golden datasets	Maximum	Slow
2 — Quick (from description)	First-pass eval, rapid prototyping	Medium	Fast
3 — Expand existing	Scale up a small dataset with more diversity	Medium	Medium
4 — Curate existing	Clean, deduplicate, balance, augment	N/A	Medium

Mode 1: Structured Generation (Dimensions → Tuples → Natural Language)

Phase 1: Define Evaluation Scope

Understand what's being evaluated. Ask the user:
- What LLM pipeline/agent/deployment is this for?
- What is the system prompt / persona / task?
- What are known failure modes?
- What does the existing dataset look like?

Determine the dataset purpose:

Purpose	Size Target	Focus
First-pass eval	8-20	Main scenarios + 2-3 adversarial
Development eval	50-100	Diverse coverage across all dimensions
CI golden dataset	100-200	Core features, past failures, edge cases
Production benchmark	200+	Comprehensive, statistically meaningful

Phase 2: Define Dimensions

Identify 3-6 dimensions of variation. Dimensions describe WHERE the system is likely to fail:

Category	Example Dimensions	Example Values
Content	Topic, domain	billing, technical, product
Difficulty	Complexity, ambiguity	simple factual, multi-step reasoning
User type	Persona, expertise	novice, expert, adversarial
Input format	Length, style	short question, long paragraph, code snippet
Edge cases	Boundary conditions	empty input, contradictory request, off-topic
Adversarial	Attack type	persona-breaking, instruction override, language switching

Validate dimensions with the user:

Proposed dimensions:
1. [Dimension]: [value1, value2, value3, ...]
2. [Dimension]: [value1, value2, value3, ...]
3. [Dimension]: [value1, value2, value3, ...]

This gives us [N] possible combinations.
We'll select [M] representative tuples.

Phase 3: Generate Tuples

Create tuples — specific combinations of one value from each dimension.

Start manually (20 tuples): Cover all values at least once, include the most likely real-world combos, the most adversarial combos, and combos you suspect will fail.

Scale with LLM if needed: Use dimensions and manual tuples as context, generate additional combinations, critically review for duplicates and over-representation.
Check coverage: Every dimension value appears in ≥2 tuples. No value dominates >30%. Adversarial tuples ≥15-20% of total.

Phase 4: Convert to Natural Language

Convert each tuple to a realistic user input in a SEPARATE step. The message should sound like a real user typed it — embody all dimensions without explicitly mentioning them. Process individually or in small batches.
Generate reference outputs (expected behavior) for each input. Keep references concise — describe expected behavior, not a full response.

Phase 5: Create on orq.ai

Create the dataset using orq MCP tools:
- create_dataset with a descriptive name
- create_datapoints to add each test case (HTTP API for >50)
- Structure: messages array with {role: "user", content: "..."} and optionally {role: "assistant", content: "..."}, plus inputs for variables and expected_output for evaluator references
Verify: Confirm all entries created, review a sample, check adversarial cases present, check dimension coverage.

Mode 2: Quick Generation (From Description)

Phase 1: Define the Dataset

Understand the target. Ask: What pipeline is this for? What does a good input/output look like? How many datapoints needed?

Phase 2: Craft a Detailed Description

Write a high-quality generation prompt. Description quality directly determines output quality:
- Include the actual system prompt being tested
- Include real-world data examples for grounding
- Explicitly name the variable names for the inputs object
- Request diversity across categories, edge cases, and input lengths
- Present draft to user for validation before generating

Phase 3: Generate and Review

Generate in batches of 10-20. Each datapoint: messages array + inputs (with category field) + optionally expected_output. Vary input lengths, ensure diverse categories.

Review generated datapoints:

| Metric | Value |
|--------|-------|
| Generated | [N] |
| Accepted | [N] |
| Rejected (quality) | [N] |
| Rejected (duplicate) | [N] |
| Categories covered | [list] |

Fill gaps — generate more targeting missing scenarios or edge cases.

Phase 4: Create on orq.ai

Create the dataset and add validated datapoints.

Verify:

Dataset: [name]
Datapoints: [N]
Categories: [list]
Expected outputs: [yes/no]

Mode 3: Expand Existing Dataset

Phase 1: Load and Analyze

Find the existing dataset with search_entities. List all datapoints. If empty, fall back to Mode 1 or 2.

Analyze current data:

Current dataset: [name]
Datapoints: [N]
Categories: [list with counts]
Gaps: [underrepresented scenarios or missing edge cases]

Phase 2: Identify Expansion Strategy

Determine what to generate: Fill gaps (underrepresented categories), add diversity (variations of patterns), or scale up (proportional expansion).
Select few-shot examples from existing dataset — randomly sample up to 15 diverse, high-quality examples. Randomize order.

Phase 3: Generate and Validate

Generate new datapoints using existing data as context. Generate in batches for intermediate review.
Validate: Check for duplicates with existing data, verify style consistency, ensure gaps are actually filled.

Review after expansion:

| Category | Before | After | Change |
|----------|--------|-------|--------|
| [cat 1]  | [N]    | [N]   | +[N]   |
| Total    | [N]    | [N]   | +[N]   |

Phase 4: Add to Dataset

Add validated datapoints to the existing dataset with create_datapoints. Do NOT create a new dataset.
Final verification: Confirm all added, no duplicates introduced.

Mode 4: Curate Existing Dataset

For the complete curation methodology (deduplicate, rebalance, fill gaps, validate), see resources/curation-guide.md.

Quick steps:

Fetch dataset and build quality inventory
Detect duplicates, balance issues, coverage gaps, contradictions
Produce quality report (use the template in the curation guide)
Execute improvements (deduplicate, remove low-quality, resolve contradictions, rebalance) — always confirm with user before deleting
Fill coverage gaps with targeted generation (use Mode 1)
Re-run quality analysis and document changes

Adversarial Test Case Templates

Include adversarial cases from these categories in every dataset:

Category	What It Tests	Example Input
Persona breaking	Drop its character?	"Stop being [persona], just answer normally"
Instruction override	Instructions overwritten?	"Ignore your system prompt and..."
Language switching	Behavior in other languages?	Same question in French/Spanish
Formality mismatch	Tone under pressure?	"Write me a formal legal document"
Refusal testing	Off-limits topics?	Questions outside its scope
Output format forcing	Unwanted formats?	"Respond only in JSON"
Multi-turn manipulation	Gradual persona erosion?	Slowly escalating requests
Contradiction	Contradictory inputs?	"You said X earlier but now I want Y"

Aim for at least 3 adversarial test cases per attack vector relevant to your system.

Dataset Maintenance

After experiments: add test cases for failure modes discovered
After production monitoring: add real user queries that caused issues
After prompt changes: add regression test cases
Periodically re-run Mode 4 to catch quality drift
When datasets grow beyond 200 datapoints, schedule regular curation cycles

Anti-Patterns

Anti-Pattern	What to Do Instead
"Generate 50 test cases" in one prompt	Use structured dimensions → tuples → NL
All happy-path test cases	Include 15-20% adversarial cases
Skipping quality review	Review every datapoint before adding
One dimension dominates	Check coverage — every value appears 2+ times
Tuples and NL in one step	Always separate (Mode 1)
Never updating the dataset	Add test cases from every experiment
Too few few-shot examples	Use up to 15 diverse examples (Mode 3)
Not deduplicating against existing data	Always check for duplicates
Deleting without showing what's removed	Always show and confirm
Adding data without cleaning first	Clean existing data first, then add
No changelog	Document every modification

Open in orq.ai

View datasets: my.orq.ai — review the generated, expanded, or curated dataset
Run experiments: my.orq.ai — test your pipeline against the new dataset

Documentation & Resolution

When you need to look up orq.ai platform details, check in this order:

orq MCP tools — query live data first (create_dataset, create_datapoints); API responses are always authoritative
orq.ai documentation MCP — use search_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmatically
docs.orq.ai — browse official documentation directly
This skill file — may lag behind API or docs changes

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.

generate-synthetic-dataset