dspy-vizpy
VizPy — Commercial Prompt Optimizer for DSPy
Guide the user through integrating VizPy as a drop-in prompt optimizer alongside or instead of DSPy's native optimizers (GEPA, MIPROv2).
Step 1: Understand the optimization need
Before recommending VizPy, clarify:
- Classification or generation? — ContraPromptOptimizer handles classification (fixed categories); PromptGradOptimizer handles generation (open-ended text). The answer determines which optimizer to use.
- Already tried DSPy native optimizers? — If not, start with GEPA or MIPROv2 first. VizPy is best as a comparison or when native optimizers plateau.
- Data privacy constraints? — VizPy is SaaS — training data is sent to their servers. If data cannot leave the network, use GEPA instead.
- How many optimization runs do they need? — Free tier allows 10 runs/month. Beyond that requires Pro ($20/mo).
What is VizPy
VizPy is a commercial SaaS prompt optimization service (vizpy.vizops.ai) that provides two optimizers for DSPy programs:
- ContraPromptOptimizer — for classification tasks (sentiment, routing, tagging)
- PromptGradOptimizer — for generation tasks (summarization, content creation, Q&A)
Both optimize the instruction string only — the same limitation as dspy.GEPA. They do NOT optimize few-shot demos, Pydantic field descriptions, or model weights.
Pricing
| Tier | Optimization runs/month | Cost |
|---|---|---|
| Free | 10 | $0 |
| Pro | Unlimited | $20/mo |
When to use VizPy
Use VizPy when:
- You want to compare a commercial optimizer against DSPy's native ones
- You've tried GEPA and want a different instruction-tuning approach
- You want ContraPrompt's contrastive approach for classification tasks
- You want PromptGrad's gradient-inspired approach for generation tasks
Do NOT use VizPy when:
- You need few-shot demo optimization — use `dspy.BootstrapFewShot` or `dspy.MIPROv2`
- You need to optimize Pydantic field descriptions — VizPy only tunes instructions (same as GEPA). See the workaround in `/dspy-gepa`
- You need to tune model weights — use `dspy.BootstrapFinetune`
- You want a fully open-source solution — use `dspy.GEPA` or `dspy.MIPROv2`
Setup
pip install vizpy
Set your API key:
import vizpy
vizpy.api_key = "your-vizpy-api-key" # from vizpy.vizops.ai/dashboard
Or via environment variable:
export VIZPY_API_KEY="your-vizpy-api-key"
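If you go the environment-variable route, it can be worth failing fast when the key is missing, since an unset key otherwise surfaces later as a confusing auth error at `.compile()` time (see Gotchas below). A minimal sketch; the explicit check and error message are ours, not part of VizPy:

```python
import os
import vizpy

# Fail fast with a clear message instead of an opaque auth error during .compile().
api_key = os.environ.get("VIZPY_API_KEY")
if not api_key:
    raise RuntimeError("VIZPY_API_KEY is not set; get a key at vizpy.vizops.ai/dashboard")
vizpy.api_key = api_key
```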
ContraPromptOptimizer (classification)
Best for tasks with a fixed set of output categories.
import dspy
import vizpy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) # or "anthropic/claude-sonnet-4-5-20250929", etc.
# 1. Define your classifier
classify = dspy.ChainOfThought("text -> label")
# 2. Prepare training data
trainset = [
dspy.Example(text="Great product!", label="positive").with_inputs("text"),
dspy.Example(text="Terrible service.", label="negative").with_inputs("text"),
# ... 50+ examples recommended
]
# 3. Define a metric
def metric(example, prediction, trace=None):
return prediction.label.lower() == example.label.lower()
# 4. Optimize with VizPy
optimizer = vizpy.ContraPromptOptimizer(metric=metric)
optimized = optimizer.compile(classify, trainset=trainset)
# 5. Use the optimized program
result = optimized(text="This exceeded my expectations!")
print(result.label)
# 6. Save
optimized.save("vizpy_optimized_classifier.json")
How ContraPrompt works
ContraPromptOptimizer uses contrastive examples — it identifies cases where the current instruction fails and generates instruction candidates that distinguish between confusing categories. This is particularly effective when categories are semantically close (e.g., "billing" vs "account" tickets).
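To make the idea concrete, here is a rough sketch of one contrastive step, written against the classifier example above. This illustrates the concept only: VizPy's actual algorithm runs server-side and is not public, and `confusion_pairs` is a hypothetical helper.

```python
from collections import Counter

def confusion_pairs(program, trainset):
    """Tally (gold, predicted) label pairs the current instruction confuses."""
    pairs = Counter()
    for ex in trainset:
        pred = program(text=ex.text)
        if pred.label.lower() != ex.label.lower():
            pairs[(ex.label, pred.label)] += 1
    # Frequent pairs (e.g., ("billing", "account")) tell the optimizer which
    # distinctions a new instruction candidate should spell out explicitly.
    return pairs.most_common(5)
```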
PromptGradOptimizer (generation)
Best for open-ended generation tasks where output quality is on a spectrum.
import dspy
import vizpy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) # or "anthropic/claude-sonnet-4-5-20250929", etc.
# 1. Define your generator
summarize = dspy.ChainOfThought("article -> summary")
# 2. Prepare training data
trainset = [
dspy.Example(
article="Long article text here...",
summary="Expected summary."
).with_inputs("article"),
# ... 50+ examples
]
# 3. Define a metric (can be AI-as-judge)
class AssessQuality(dspy.Signature):
"""Assess if the summary captures key points accurately."""
article: str = dspy.InputField()
gold_summary: str = dspy.InputField()
predicted_summary: str = dspy.InputField()
score: float = dspy.OutputField(desc="0.0 to 1.0")
def metric(example, prediction, trace=None):
judge = dspy.Predict(AssessQuality)
result = judge(
article=example.article,
gold_summary=example.summary,
predicted_summary=prediction.summary,
)
return result.score
# 4. Optimize with VizPy
optimizer = vizpy.PromptGradOptimizer(metric=metric)
optimized = optimizer.compile(summarize, trainset=trainset)
# 5. Use
result = optimized(article="New article text...")
print(result.summary)
How PromptGrad works
PromptGradOptimizer uses gradient-inspired optimization — it estimates how instruction changes affect output quality scores and iteratively adjusts the instruction in the direction that improves the metric. This works well for generation tasks where quality is continuous rather than binary.
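In spirit, the loop resembles greedy hill climbing over instruction text. The sketch below illustrates the concept only, not VizPy's implementation; `propose_edit` is a hypothetical helper (e.g., an LM call that rewrites the current instruction).

```python
def avg_score(program, examples, metric):
    # Mean metric over a batch: the scalar signal that stands in for a gradient.
    return sum(metric(ex, program(article=ex.article)) for ex in examples) / len(examples)

def hill_climb(program, propose_edit, examples, metric, steps=5):
    best, best_score = program, avg_score(program, examples, metric)
    for _ in range(steps):
        candidate = propose_edit(best)                  # rewrite the instruction
        score = avg_score(candidate, examples, metric)  # did the edit move the metric up?
        if score > best_score:                          # keep only improving edits
            best, best_score = candidate, score
    return best
```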
VizPy vs DSPy native optimizers
| Aspect | VizPy ContraPrompt | VizPy PromptGrad | dspy.GEPA | dspy.MIPROv2 |
|---|---|---|---|---|
| Best for | Classification | Generation | Both | Both |
| What it tunes | Instructions only | Instructions only | Instructions only | Instructions + demos |
| Data needed | ~50 examples | ~50 examples | ~50 examples | ~200 examples |
| Expected improvement | 5-18% | 5-18% | 5-15% | 15-35% |
| Cost | Free tier (10 runs) | Free tier (10 runs) | ~$0.50 (LM calls) | ~$5-15 (LM calls) |
| Open source | No (SaaS) | No (SaaS) | Yes | Yes |
| Feedback-driven | Contrastive examples | Gradient-inspired | Textual feedback | Scalar scores |
| Pydantic field desc | No | No | No | No |
Decision guide
Want instruction-only optimization?
|
+- Classification task?
|  +- Want open-source? -> dspy.GEPA
|  +- Want to try commercial? -> vizpy.ContraPromptOptimizer
|
+- Generation task?
|  +- Want open-source? -> dspy.GEPA
|  +- Want to try commercial? -> vizpy.PromptGradOptimizer
|
+- Want instructions AND demos? -> dspy.MIPROv2
Switching between VizPy and GEPA
VizPy optimizers are drop-in replacements for GEPA — same .compile() interface:
# With GEPA
optimizer = dspy.GEPA(metric=metric, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
# With VizPy ContraPrompt (swap one line)
optimizer = vizpy.ContraPromptOptimizer(metric=metric)
optimized = optimizer.compile(program, trainset=trainset)
The optimized program is a standard DSPy program either way — save(), load(), and Evaluate all work identically.
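Because the interfaces match, one practical pattern is to compile with both and keep whichever scores higher on a held-out devset. A sketch using only the APIs shown above; `program`, `trainset`, `devset`, and `metric` are assumed defined as in the earlier examples, and remember each VizPy `.compile()` consumes one free-tier run:

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)

candidates = {
    "gepa": dspy.GEPA(metric=metric, auto="light").compile(program, trainset=trainset),
    "vizpy": vizpy.ContraPromptOptimizer(metric=metric).compile(program, trainset=trainset),
}
# Pick the winner on held-out data.
best_name = max(candidates, key=lambda name: evaluator(candidates[name]))
print(f"Best on devset: {best_name}")
```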
Important limitations
- Instruction-only optimization — VizPy does NOT optimize Pydantic `Field(description=...)`, `InputField(desc=...)`, or `OutputField(desc=...)`. Same limitation as GEPA. See `/dspy-gepa` for the workaround (flatten field descriptions into the instruction); a minimal sketch follows this list.
- SaaS dependency — your training data is sent to VizPy's servers for optimization. Check your data privacy requirements.
- No offline mode — requires internet access and a valid API key.
- Free tier limits — 10 optimization runs per month. Each `.compile()` call counts as one run.
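The flattening workaround from the first limitation, sketched minimally (the signature and field names here are illustrative, not from VizPy's docs):

```python
import dspy

class TicketTriage(dspy.Signature):
    """Route a support ticket to a category.

    Field guidance lives here, in the instruction the optimizer can rewrite,
    rather than in InputField(desc=...) / OutputField(desc=...), which it cannot:
    - text: the raw customer message
    - label: exactly one of "billing", "account", "technical"
    """
    text: str = dspy.InputField()
    label: str = dspy.OutputField()
```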
Verifying the optimization
After running .compile(), compare baseline vs optimized:
from dspy.evaluate import Evaluate
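# devset: a held-out list of dspy.Example objects, kept separate from trainset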
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
# Baseline
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score}")
# After VizPy optimization
optimized_score = evaluator(optimized)
print(f"Optimized: {optimized_score}")
print(f"Improvement: {optimized_score - baseline_score:.1f}%")
If the optimized score is not higher, the instruction change may not help this task. Try a different optimizer (GEPA, MIPROv2) or add few-shot demos with MIPROv2.
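Before switching, it can help to eyeball the remaining failures. A minimal sketch against the classifier example above (field names follow that example):

```python
# Print devset cases the optimized program still gets wrong.
for ex in devset[:25]:
    pred = optimized(text=ex.text)
    if not metric(ex, pred):
        print(f"text={ex.text!r}  gold={ex.label}  predicted={pred.label}")
```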
Gotchas
- Claude uses VizPy for few-shot demo optimization. VizPy only tunes the instruction string, not demos. If the user needs demos, use `dspy.BootstrapFewShot` or `dspy.MIPROv2` first, then layer VizPy on top for instruction tuning (see the sketch after this list).
- Claude picks ContraPromptOptimizer for generation tasks. ContraPrompt is designed for classification (fixed categories). For open-ended generation (summaries, articles, Q&A), use PromptGradOptimizer instead.
- Claude skips the evaluation step after VizPy optimization. Without comparing baseline vs optimized scores on a held-out devset, there is no way to know whether VizPy helped. Always run `dspy.Evaluate` before and after.
- Claude forgets `vizpy.api_key` or `VIZPY_API_KEY`. VizPy is SaaS and requires authentication. Without the API key set, `.compile()` fails with a confusing auth error. Set it before any optimizer calls.
- Claude recommends VizPy without mentioning the data privacy implication. Training data is sent to VizPy's servers during optimization. Always ask about data sensitivity before recommending VizPy over the fully local GEPA alternative.
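The demo-then-instruction layering from the first gotcha, sketched with the APIs shown earlier. Whether VizPy preserves the compiled demos when it rewrites the instruction is an assumption worth verifying on your own program:

```python
# 1. Bootstrap few-shot demos with a native DSPy optimizer.
bootstrapped = dspy.BootstrapFewShot(metric=metric).compile(classify, trainset=trainset)

# 2. Then tune the instruction string on top with VizPy.
optimized = vizpy.ContraPromptOptimizer(metric=metric).compile(bootstrapped, trainset=trainset)
```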
Additional resources
- VizPy docs
- VizPy dashboard
- For API details, see reference.md
- For worked examples comparing VizPy and GEPA side-by-side, see examples.md
Cross-references
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- GEPA (open-source instruction optimizer) — `/dspy-gepa`
- MIPROv2 (instructions + demos, best overall) — `/dspy-miprov2`
- Improving accuracy (full optimizer comparison) — `/ai-improving-accuracy`
- Evaluating results before and after — `/dspy-evaluate`
- Install `/ai-do` if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do`