ai-improving-accuracy
# Measure and Improve Your AI
Guide the user through measuring how well their AI works, then systematically improving it. This is a loop: define "good" -> measure -> improve -> verify.
## The Workflow

1. Define what "good" means: write a metric
2. Measure current quality: run an evaluation
3. Improve: choose an optimizer, run it
4. Verify: re-evaluate to confirm improvement
5. Iterate or ship
## Step 1: Define what "good" means (write a metric)

A metric takes an example (which carries the expected answer), the program's prediction, and an optional `trace`, and returns a score: a bool or a float.

### Exact match (simplest)

```python
def metric(example, prediction, trace=None):
    return prediction.answer == example.answer
```
### Normalized match (handles capitalization/whitespace)

```python
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
```
### Partial credit (for multi-field outputs)

```python
def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)
```
### F1 score (for text overlap)

```python
def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```
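If you'd rather not hand-roll overlap metrics, recent DSPy versions ship evaluation helpers; the exact exports vary by version, so treat this as an assumption to verify:

```python
# Assumption: DSPy 2.5+; check your version's dspy.evaluate exports
from dspy.evaluate import SemanticF1, answer_exact_match

semantic_metric = SemanticF1()  # LM-judged overlap of key facts rather than raw tokens
```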
### AI-as-judge (for open-ended tasks)

When exact match is too strict (summaries, creative tasks, open-ended Q&A):

```python
import dspy

class AssessQuality(dspy.Signature):
    """Assess if the predicted answer is correct and complete."""

    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct
```
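A judge adds one extra LM call per evaluated example, which can dominate evaluation cost. A sketch of running the judge on a cheaper model via `dspy.context`; the model id is an assumption:

```python
# Assumption: a cheaper judge model; any LiteLLM-style model id works here
judge_lm = dspy.LM("openai/gpt-4o-mini")

def metric(example, prediction, trace=None):
    # Swap in the judge LM for this call only; the program under test keeps its own LM
    with dspy.context(lm=judge_lm):
        result = dspy.Predict(AssessQuality)(
            question=example.question,
            gold_answer=example.answer,
            predicted_answer=prediction.answer,
        )
    return result.is_correct
```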
### Composite metric (multiple criteria)

```python
def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, "reasoning", "")) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning
```
### Training-aware metric

The `trace` parameter is not `None` during optimization. Use it to enforce stricter requirements during training; with bootstrapping optimizers, traces that pass this stricter check are the ones kept as few-shot demos:

```python
def metric(example, prediction, trace=None):
    correct = prediction.answer == example.answer
    if trace is not None:
        # During optimization, also require substantive reasoning
        has_reasoning = len(getattr(prediction, "reasoning", "")) > 50
        return correct and has_reasoning
    return correct
```
## Step 2: Measure current quality (run evaluation)

### Prepare test data

If you don't have enough examples, use /ai-generating-data to generate synthetic training data.

```python
import dspy
import json
from datasets import load_dataset

# Manual creation
devset = [
    dspy.Example(question="What is DSPy?", answer="A framework for LM programs").with_inputs("question"),
    # 20-100+ examples for reliable evaluation
]

# From CSV/JSON
with open("test_data.json") as f:
    data = json.load(f)
devset = [dspy.Example(**x).with_inputs("question") for x in data]

# From HuggingFace
dataset = load_dataset("squad", split="validation[:100]")
devset = [
    dspy.Example(question=x["question"], answer=x["answers"]["text"][0]).with_inputs("question")
    for x in dataset
]
```
### Run evaluation

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,  # show 5 example results
)
baseline_score = evaluator(my_program)
print(f"Baseline: {baseline_score}")
```
## Step 3: Improve (choose an optimizer)

### Quick guide: which optimizer?

| Training examples | Recommended optimizer | Expected improvement | Typical cost |
|---|---|---|---|
| <20 | GEPA (instruction tuning) | 5-15% | ~$0.50 |
| 20-50 | BootstrapFewShot | 5-20% | ~$0.50-2 |
| 50-200 | BootstrapFewShot, then MIPROv2 | 15-35% | ~$2-10 |
| 50+ | VizPy ContraPrompt / PromptGrad | 5-18% | ~$0 (free tier) |
| 200-500 | MIPROv2 (auto="medium") | 20-40% | ~$5-15 |
| 500+ | MIPROv2 (auto="heavy") or BootstrapFinetune | 25-50% | ~$15-50+ |
### Start here

```
|
+- Just getting started (<50 examples)? -> BootstrapFewShot
|    Quick, cheap, usually gives a solid boost.
|
+- Want better prompts (50+ examples)? -> MIPROv2
|    Optimizes both instructions and examples.
|    Best general-purpose prompt optimizer.
|
+- Want to tune instructions only (<50 examples)? -> GEPA
|    Good when you have few examples.
|
+- Need maximum quality (500+ examples)? -> BootstrapFinetune
|    Fine-tunes the model weights.
|    Best for production with smaller/cheaper models.
|
+- Want to combine approaches? -> BetterTogether
     Jointly optimizes prompts and weights.
```
**Stacking tip:** Run BootstrapFewShot first, then MIPROv2 on the result (see the sketch below). This often beats either alone: bootstrap finds good examples, then MIPRO refines the instructions.
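A minimal sketch of that stacking, reusing the `metric` and `trainset` from earlier; the hyperparameters are illustrative:

```python
# Stage 1: bootstrap good few-shot demos
bootstrap = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
bootstrapped = bootstrap.compile(my_program, trainset=trainset)

# Stage 2: refine instructions starting from the bootstrapped program
mipro = dspy.MIPROv2(metric=metric, auto="medium")
optimized = mipro.compile(bootstrapped, trainset=trainset)
```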
Optimized prompts are model-specific. If you change models, re-run your optimizer. See /ai-switching-models.
### BootstrapFewShot (start here)

Fast, cheap. Finds good examples by running your program and keeping successful traces.

```python
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
optimized = optimizer.compile(my_program, trainset=trainset)
```

**Cost:** minimal (one pass through the trainset). **Expected improvement:** 5-20%.
### MIPROv2 (recommended for most cases)

Optimizes both instructions and examples. Best general-purpose optimizer.

```python
optimizer = dspy.MIPROv2(
    metric=metric,
    auto="medium",  # "light", "medium", "heavy"
)
optimized = optimizer.compile(my_program, trainset=trainset)
```

- `"light"`: quick, ~$1-2
- `"medium"`: balanced, ~$5-10
- `"heavy"`: thorough, ~$15-30

**Expected improvement:** 15-35%.
### GEPA (instruction tuning)

Good with few examples or when you want to focus on instruction quality. GEPA takes the metric at construction time, and also accepts a `reflection_lm` for proposing new instructions (see the DSPy docs):

```python
optimizer = dspy.GEPA(metric=metric, auto="light")
optimized = optimizer.compile(my_program, trainset=trainset)
```
### VizPy (third-party alternative)

VizPy is a commercial prompt optimizer that offers ContraPromptOptimizer (classification) and PromptGradOptimizer (generation). Like GEPA, it optimizes instructions only, not few-shot demos or Pydantic field descriptions. The free tier includes 10 optimization runs/month.

For setup, usage, and a comparison with GEPA/MIPROv2, see /dspy-vizpy.
### BootstrapFinetune (maximum quality)

Fine-tunes model weights for the biggest accuracy gains. Requires 500+ examples and a fine-tunable model:

```python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
optimized = optimizer.compile(my_program, trainset=trainset)
```

For the full fine-tuning workflow (decision framework, prerequisites, model distillation, BetterTogether), see /ai-fine-tuning.
### When optimization plateaus

If your score stops improving, check these common causes:

| Symptom | Likely cause | Fix |
|---|---|---|
| Score stuck at 60-70% despite optimization | Task too complex for single step | /ai-decomposing-tasks: break into subtasks |
| Optimizer overfits (train score high, dev score flat) | Too little training data | /ai-generating-data: generate more examples |
| Score varies wildly between runs | Non-deterministic metric or small devset | Increase devset to 100+, set temperature=0 (sketch below) |
| Diminishing returns from more demos | Prompt is maxed out; model is the limit | /ai-switching-models: try a more capable model |
| Score high but real users complain | Metric doesn't match real quality | Rewrite metric based on actual failure patterns |
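For the run-to-run variance row above, a minimal sketch of pinning decoding temperature; the model id is an assumption:

```python
# Near-deterministic decoding reduces variance between evaluation runs
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0))
```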
## Step 4: Verify improvement

```python
optimized_score = evaluator(optimized)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Optimized: {optimized_score:.1f}%")
print(f"Improvement: {optimized_score - baseline_score:.1f}%")
```
## Step 5: Save and ship

```python
optimized.save("optimized_program.json")

# Load later: requires constructing the same program class first
my_program = MyProgram()
my_program.load("optimized_program.json")
```
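State-only saving (above) stores prompts and demos, so loading needs the identical program class. Recent DSPy versions can also save the whole program to a directory; a sketch, assuming a version with `save_program` support:

```python
# Whole-program save: writes the architecture plus its optimized state
optimized.save("optimized_program/", save_program=True)

# Load without redefining the class
loaded = dspy.load("optimized_program/")
```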
## Key patterns

- Start simple: exact match metric + BootstrapFewShot, then upgrade if needed
- Validate your metric: manually check 10-20 examples to make sure the metric scores correctly
- More data helps: optimizers work better with more training examples
- Never evaluate on the trainset: always use a held-out devset
- Use `display_table`: looking at actual predictions reveals metric bugs
- Iterate: run optimization, check results, improve metric, re-optimize
## Additional resources

- For the optimizer comparison table and metric patterns, see reference.md
- Once quality is good, use /ai-cutting-costs to reduce your AI bill
- Use /ai-monitoring to track quality in production after deployment
- Use /ai-tracking-experiments to log, compare, and manage multiple optimization runs
- Accuracy plateaued despite optimization? Try /ai-decomposing-tasks to restructure your task
- If things are broken, use /ai-fixing-errors to diagnose
- Install /ai-do if you do not have it; it routes any AI problem to the right skill and is the fastest way to work: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do`