ai-generating-data
Generate Synthetic Training Data
Guide the user through generating high-quality synthetic training data with DSPy. This solves the "I don't have data" problem that blocks every other AI workflow.
When you need synthetic data
- Cold start: You're building a new feature and have zero labeled examples
- Not enough for optimization: You have 10-30 examples but optimizers need 200+
- Privacy/compliance: You can't use real customer data for training
- Edge cases: Your AI works on common inputs but fails on rare ones
- Unbalanced categories: Some categories have 500 examples, others have 10
- New categories: You added a category and have no examples for it
- Schema changed: Your input/output format changed, old data doesn't fit
- Proof of concept: PM wants a demo by Friday, no time to collect real data
The core idea
Define a generator signature whose outputs match your task's input/output fields. Use an LM to produce examples. Filter for quality. Use for optimization.
Research shows this works surprisingly well:
- Optimized generator prompts match models trained on 100K+ human labels using only 10 gold labels (arXiv 2406.11706)
- DSPy-optimized Chain-of-Thought generation outperforms hand-written static templates (arXiv 2508.13930)
The key insight: the prompt used to generate data is a critical hyperparameter — optimizing it matters more than generating more data.
Step 1: Define what an example looks like
Your generator's outputs should match your task's inputs and expected outputs.
import dspy
# Your task — what the AI will do in production
class ClassifyTicket(dspy.Signature):
"""Classify a support ticket into a category."""
ticket_text: str = dspy.InputField()
category: str = dspy.OutputField()
# Generator — produces examples for your task
class GenerateTicketExample(dspy.Signature):
"""Generate a realistic support ticket with its correct category."""
category: str = dspy.InputField(desc="the target category to generate an example for")
ticket_text: str = dspy.OutputField(desc="a realistic support ticket for this category")
The generator's output fields become inputs to your task. Think of it as: "given what I want the answer to be, generate a realistic input."
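For instance, one generated ticket becomes one task-shaped training example (a minimal sketch, assuming an LM is already configured via dspy.configure):
gen = dspy.Predict(GenerateTicketExample)
out = gen(category="billing")
# The generated ticket_text becomes the task's input; the category we
# conditioned on becomes the known-correct label.
example = dspy.Example(
    ticket_text=out.ticket_text, category="billing"
).with_inputs("ticket_text")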
Multi-field tasks
If your task has multiple inputs or outputs, mirror all of them:
# Task: extract structured data from text
class ExtractContact(dspy.Signature):
"""Extract contact info from a message."""
message: str = dspy.InputField()
name: str = dspy.OutputField()
email: str = dspy.OutputField()
phone: str = dspy.OutputField()
# Generator: produce realistic messages with known contact info
class GenerateContactExample(dspy.Signature):
"""Generate a realistic message that contains contact information."""
name: str = dspy.InputField(desc="the person's name to embed in the message")
email: str = dspy.InputField(desc="the email address to embed in the message")
phone: str = dspy.InputField(desc="the phone number to embed in the message")
message: str = dspy.OutputField(desc="a realistic message containing this contact info")
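Usage mirrors the single-field case: condition on the known outputs, then flip the generated message into a task example (a sketch with hardcoded sample values):
gen = dspy.Predict(GenerateContactExample)
out = gen(name="Dana Reyes", email="dana@example.com", phone="555-0142")
# All three known fields become the task's expected outputs.
example = dspy.Example(
    message=out.message,
    name="Dana Reyes",
    email="dana@example.com",
    phone="555-0142",
).with_inputs("message")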
Step 2: Write seed examples
Start with 5-10 hand-written examples. These anchor the generator's understanding of what "realistic" means for your domain.
seeds = [
    dspy.Example(
        ticket_text="I was charged twice for my subscription this month. Order #4521.",
        category="billing"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="The app crashes when I try to upload a profile photo on Android.",
        category="bug"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="How do I export my data to CSV? I can't find the option anywhere.",
        category="how-to"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="I'd love to see dark mode added. The white background hurts my eyes.",
        category="feature-request"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="My account got locked after too many login attempts. Please help.",
        category="account"
    ).with_inputs("ticket_text"),
]
Even 5 seeds dramatically improve generation quality over zero.
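One direct way to put the seeds to work is to attach them to the generator as few-shot demos. A sketch, assuming setting .demos directly on a Predict (the seeds already carry both signature fields, category and ticket_text; re-keying marks category as the input side to match the generator):
generator = dspy.Predict(GenerateTicketExample)
generator.demos = [seed.with_inputs("category") for seed in seeds]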
Step 3: Generate in batches
Two patterns depending on your LM provider:
Pattern A: n=N batch generation
When your provider supports the n parameter (OpenAI does), this generates multiple completions in one call — faster and often more diverse:
generator = dspy.Predict(GenerateTicketExample, n=20)
response = generator(category="billing")
examples = [
    dspy.Example(ticket_text=t, category="billing").with_inputs("ticket_text")
    for t in response.completions.ticket_text
]
Pattern B: Loop generation
Works with any provider. More control over each example:
examples = []
categories = ["billing", "bug", "how-to", "feature-request", "account"]
generator = dspy.Predict(GenerateTicketExample)
for category in categories:
    for _ in range(40):
        result = generator(category=category)
        examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category)
            .with_inputs("ticket_text")
        )
print(f"Generated {len(examples)} examples")
The n parameter isn't supported by all providers — use the loop pattern as a reliable fallback.
Generation strategies
Pick the strategy that fits your gap:
Category-driven — generate N per category (fixes imbalance):
for category in categories:
    for _ in range(50):
        result = generator(category=category)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
Seed-and-vary — pass a seed example with a variation instruction:
class GenerateVariation(dspy.Signature):
"""Generate a variation of this support ticket with a different tone and phrasing."""
original_ticket: str = dspy.InputField(desc="the original ticket to vary")
variation_type: str = dspy.InputField(desc="how to vary it: tone, length, complexity, or language")
ticket_text: str = dspy.OutputField(desc="a new ticket with the same meaning but different style")
vary = dspy.Predict(GenerateVariation)
for seed in seeds:
    for variation in ["angry tone", "very brief", "verbose and detailed", "non-native English"]:
        result = vary(original_ticket=seed.ticket_text, variation_type=variation)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=seed.category).with_inputs("ticket_text"))
Scenario-driven — specify edge case scenarios:
class GenerateScenarioTicket(dspy.Signature):
"""Generate a support ticket matching a specific scenario."""
category: str = dspy.InputField()
scenario: str = dspy.InputField(desc="the specific scenario to generate")
ticket_text: str = dspy.OutputField()
gen = dspy.Predict(GenerateScenarioTicket)
scenarios = [
("billing", "customer charged in wrong currency"),
("billing", "refund for a cancelled subscription"),
("bug", "issue only happens on slow network connections"),
("bug", "multi-step reproduction involving two features"),
("how-to", "customer is non-technical and confused by jargon"),
]
for category, scenario in scenarios:
    result = gen(category=category, scenario=scenario)
    examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
Difficulty-driven — generate easy, medium, hard examples separately:
class GenerateByDifficulty(dspy.Signature):
"""Generate a support ticket at a specific difficulty level for classification."""
category: str = dspy.InputField()
difficulty: str = dspy.InputField(desc="easy (clear-cut), medium (some ambiguity), or hard (could be multiple categories)")
ticket_text: str = dspy.OutputField()
gen = dspy.Predict(GenerateByDifficulty)
for category in categories:
for difficulty in ["easy", "medium", "hard"]:
for i in range(15):
result = gen(category=category, difficulty=difficulty)
examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
Diversity trick — add a random sindex field to push the LM toward varied outputs:
import random
class GenerateDiverse(dspy.Signature):
"""Generate a unique and realistic support ticket."""
category: str = dspy.InputField()
sindex: str = dspy.InputField(desc="a unique seed index for diversity")
ticket_text: str = dspy.OutputField()
gen = dspy.Predict(GenerateDiverse)
for category in categories:
    for _ in range(50):
        result = gen(category=category, sindex=str(random.randint(0, 1_000_000)))
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
The random sindex prevents the LM from falling into repetitive patterns.
Step 4: Filter for quality
Generated data always contains some bad examples. Filter aggressively — aim to generate 2-3x what you need and keep ~50%.
Simple: metric-based filtering
Run each generated example through your task program and check with your metric:
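The loop below assumes a metric function is defined. For this classification task, a minimal exact-match metric might look like this (a sketch; adapt it to your own output fields):
def metric(example, prediction, trace=None):
    # Exact match on the predicted category, ignoring case and whitespace
    return example.category.strip().lower() == prediction.category.strip().lower()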
program = dspy.ChainOfThought(ClassifyTicket)
filtered = []
for ex in examples:
    pred = program(**ex.inputs())
    if metric(ex, pred):
        filtered.append(ex)
print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")
This works when your program is already decent — it filters out examples that are confusing or mislabeled.
Robust: LM-based assessment
Use a separate assessment step to check realism and correctness:
class AssessExample(dspy.Signature):
"""Is this a realistic and correctly labeled example?"""
ticket_text: str = dspy.InputField()
category: str = dspy.InputField()
is_realistic: bool = dspy.OutputField(desc="true if this looks like a real support ticket")
is_correctly_labeled: bool = dspy.OutputField(desc="true if the category matches the ticket")
assessor = dspy.Predict(AssessExample)
filtered = []
for ex in examples:
    result = assessor(ticket_text=ex.ticket_text, category=ex.category)
    if result.is_realistic and result.is_correctly_labeled:
        filtered.append(ex)
print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")
Quality gates with dspy.Suggest
For tighter integration, build quality checks into the generator itself. When a Suggest constraint fails, DSPy retries the generation:
class QualityGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateTicketExample)
        self.assess = dspy.Predict(AssessExample)

    def forward(self, category):
        result = self.generate(category=category)
        assessment = self.assess(ticket_text=result.ticket_text, category=category)
        dspy.Suggest(assessment.is_realistic, "Generated ticket should be realistic")
        dspy.Suggest(assessment.is_correctly_labeled, "Category label should be correct")
        return result

generator = QualityGenerator()
# DSPy retries generation when Suggest constraints fail. Depending on your DSPy
# version, assertions may need activating first (e.g. generator.activate_assertions());
# newer releases replace Suggest with modules like dspy.Refine.
Check for duplicates
Remove duplicates to keep your dataset diverse. The normalized exact-match check below catches verbatim repeats; a fuzzier near-duplicate check follows.
seen = set()
unique = []
for ex in filtered:
    # Normalize whitespace and case, keep the first occurrence of each ticket
    key = ex.ticket_text.strip().lower()
    if key not in seen:
        seen.add(key)
        unique.append(ex)
print(f"Removed {len(filtered) - len(unique)} exact duplicates")
filtered = unique
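For near-duplicates (the same ticket lightly reworded), a simple token-overlap check needs no extra dependencies. A sketch: the 0.8 threshold is an assumption to tune, and the loop is O(n²), which is fine for a few thousand examples:
def jaccard(a: str, b: str) -> float:
    # Overlap between the two tickets' word sets
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

deduped = []
for ex in filtered:
    if all(jaccard(ex.ticket_text, kept.ticket_text) < 0.8 for kept in deduped):
        deduped.append(ex)
print(f"Removed {len(filtered) - len(deduped)} near-duplicates")
filtered = deduped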
Step 5: Optimize the generator itself (advanced)
Research (arXiv 2406.11706) shows that optimizing the prompt used to generate data dramatically improves downstream quality. This is meta-optimization: optimizing the generator so it produces better training data.
class DataGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateTicketExample)

    def forward(self, category):
        return self.generate(category=category)

# Define a metric that measures generated data quality
def generator_metric(example, prediction, trace=None):
    # Check if a downstream classifier gets the right answer on this generated example
    classifier = dspy.Predict(ClassifyTicket)
    task_example = dspy.Example(
        ticket_text=prediction.ticket_text, category=example.category
    ).with_inputs("ticket_text")
    task_pred = classifier(**task_example.inputs())
    return task_pred.category.lower() == example.category.lower()
# Optimize the generator's prompts. The seeds were keyed on ticket_text for the
# task, but the generator takes category as its input, so re-key them first.
gen_seeds = [seed.with_inputs("category") for seed in seeds]
optimizer = dspy.BootstrapFewShot(metric=generator_metric)
optimized_generator = optimizer.compile(DataGenerator(), trainset=gen_seeds)

# Now generate with the optimized generator
better_examples = []
for category in categories:
    for _ in range(50):
        result = optimized_generator(category=category)
        better_examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text")
        )
This closes the loop: better generator prompts produce better data, which produces better task programs.
Step 6: Use generated data for optimization
Full pipeline: generate, filter, split, optimize, evaluate.
import random
from dspy.evaluate import Evaluate
# Shuffle and split
random.shuffle(filtered)
split = int(len(filtered) * 0.8)
trainset = filtered[:split]
devset = filtered[split:]
print(f"Train: {len(trainset)}, Dev: {len(devset)}")
# Configure your task LM (can be cheaper than the generator LM)
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# Build and optimize your task program
program = dspy.ChainOfThought(ClassifyTicket)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)
# Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
score = evaluator(optimized)
print(f"Score on synthetic dev set: {score:.1f}%")
# Save
optimized.save("optimized_program.json")
If you have even a small number of real examples, use them as the dev set instead — real data gives more trustworthy evaluation.
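For instance, assuming real_examples is a hypothetical list of hand-collected dspy.Example objects loaded elsewhere:
# Train on synthetic data, evaluate on real data for a trustworthy score
trainset = filtered          # synthetic, generated and filtered above
devset = real_examples       # hypothetical: your hand-collected real examples
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)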
Common scenarios
Cold start — zero real data. Write 5-10 seeds. Generate 200+ synthetic examples across all categories. Filter and optimize. See examples.md for a full walkthrough.
Edge case gaps — your AI works at 85% but fails on specific scenarios. Run error analysis, identify the failure patterns, then use scenario-driven generation targeting those gaps. Re-optimize with the augmented dataset.
Privacy/compliance — can't use real customer data. Generate synthetic examples with realistic patterns but no PII. Validate with domain-specific assessments. The dspy.Suggest quality gate pattern ensures generated data meets your standards.
New categories — added a category with no examples. Use category-driven generation to produce 50+ examples for the new category, then retrain.
Rebalancing — some categories have 500 examples, others have 10. Generate more for underrepresented categories until all are roughly balanced.
Schema changed — your input/output format changed. Generate new examples matching the new schema rather than manually converting old data.
Tips and pitfalls
- Always validate generated data — LMs produce plausible but wrong labels. Filter aggressively.
- Mix synthetic with real data when available — even 20 real examples mixed in improve quality significantly.
- Use a stronger model to generate and a cheaper model for your task: e.g., generate with GPT-4o, run your task on GPT-4o-mini. See the sketch after this list.
- Generate more than you need — aim for 2-3x your target, keep ~50% after filtering.
- Check for duplicates — LMs tend to repeat themselves, especially without the diversity trick.
- Iterate — generate, optimize, evaluate, identify gaps, generate more for gaps.
- Don't trust synthetic eval scores blindly — if possible, validate final quality on real data.
- The n parameter for batch generation isn't supported by all providers; use the loop pattern as a reliable fallback.
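A sketch of the stronger-generator/cheaper-task split mentioned above, using dspy.context to scope the generation LM (the model names are illustrative):
strong_lm = dspy.LM("openai/gpt-4o")       # generator: higher quality
cheap_lm = dspy.LM("openai/gpt-4o-mini")   # task: cheaper at volume

dspy.configure(lm=cheap_lm)  # default LM for the task program

generator = dspy.Predict(GenerateTicketExample)
with dspy.context(lm=strong_lm):
    # Everything inside this block runs on the stronger model
    result = generator(category="billing")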
Additional resources
- For end-to-end worked examples (cold start, edge cases, privacy), see examples.md
- Use /ai-improving-accuracy to measure and improve your optimized program
- Use /ai-fine-tuning once you have enough generated data for weight optimization
- Use /ai-kickoff to scaffold a project, then fill data with this skill
- Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do