ai-sorting
Build an AI Content Sorter
Build an AI sorter with DSPy: define categories, load data, evaluate, optimize, and deploy.
Step 1: Define the sorting task
Ask the user:
- What are you sorting? (tickets, emails, reviews, messages, comments, etc.)
- What are the categories? (list all labels/buckets)
- One category per item, or multiple? (e.g., "priority" vs "all applicable tags")
- Do you have labeled examples already? (a CSV, database, spreadsheet with items + their correct category)
The answers determine which pattern to use below.
Step 2: Build the sorter
Single category (most common)
```python
import dspy
from typing import Literal

# Configure your LM — works with any provider
lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# Define your categories
CATEGORIES = ["billing", "technical", "account", "feature_request", "general"]

class SortContent(dspy.Signature):
    """Sort the customer message into the correct support category."""

    message: str = dspy.InputField(desc="The content to sort")
    category: Literal[tuple(CATEGORIES)] = dspy.OutputField(desc="The assigned category")

sorter = dspy.ChainOfThought(SortContent)
```
Literal locks the output to valid categories — the model cannot invent labels.
| Module | When to use | Tradeoff |
|---|---|---|
| `ChainOfThought` | Default — most classification tasks | ~5-15% accuracy gain over `Predict`, but 2x tokens |
| `Predict` | Binary/obvious categories (spam vs not-spam) | Faster and cheaper; skip the reasoning step if it is not helping |
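Switching modules is a one-line change. For example, to drop the reasoning step and use `Predict` with the same `SortContent` signature defined above:

```python
# Same signature, no reasoning step; cheaper and faster for obvious categories
sorter = dspy.Predict(SortContent)
```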
Multiple tags
When items can belong to several categories at once (e.g., a news article that's both "technology" and "business"):
```python
class TagContent(dspy.Signature):
    """Assign all applicable tags to the content."""

    message: str = dspy.InputField(desc="The content to tag")
    tags: list[Literal[tuple(CATEGORIES)]] = dspy.OutputField(desc="All applicable tags")

tagger = dspy.ChainOfThought(TagContent)
```
Handling "none of the above"
If real-world content might not fit any category, add an explicit catch-all rather than hoping the model picks the least-bad option:
```python
CATEGORIES = ["billing", "technical", "account", "feature_request", "other"]
```
This gives the model a safe escape hatch and makes it easy to filter out uncategorized items for human review.
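For example, a quick sketch of routing uncategorized items to review (the `messages` list here is a hypothetical input batch):

```python
# Collect anything the sorter puts in the catch-all bucket for human review
review_queue = [
    msg for msg in messages
    if sorter(message=msg).category == "other"
]
```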
Sorting with context
Sometimes classification depends on extra context — a customer's plan tier, previous interactions, or business rules. Add those as input fields:
```python
class SortWithContext(dspy.Signature):
    """Sort the ticket considering the customer's context."""

    message: str = dspy.InputField(desc="The support message")
    customer_tier: str = dspy.InputField(desc="Customer plan: free, pro, or enterprise")
    category: Literal[tuple(CATEGORIES)] = dspy.OutputField()
    priority: Literal["low", "medium", "high", "urgent"] = dspy.OutputField()
```
Step 3: Load your data
If the user has labeled data, help them load it. The key step is converting their data into dspy.Example objects and marking which fields are inputs (what the model sees) vs outputs (what it should predict).
From a CSV or DataFrame
```python
import pandas as pd

df = pd.read_csv("labeled_tickets.csv")  # columns: message, category

dataset = [
    dspy.Example(message=row["message"], category=row["category"]).with_inputs("message")
    for _, row in df.iterrows()
]

# Split into train/dev sets (80/20)
split = len(dataset) * 4 // 5
trainset, devset = dataset[:split], dataset[split:]
```
From a list of dicts
```python
data = [
    {"message": "I was charged twice", "category": "billing"},
    {"message": "Can't log in", "category": "technical"},
    # ...
]

dataset = [dspy.Example(**d).with_inputs("message") for d in data]
```
From transcripts (VTT, LiveKit, Recall)
Transcripts are a common source for sorting — classifying call topics, tagging meeting segments, routing conversations. The key is extracting the text content from whatever format you have.
WebVTT (.vtt) files:
```python
import re

def load_vtt(path):
    """Extract text lines from a VTT transcript, stripping timestamps."""
    text = open(path).read()
    # Remove the VTT header, cue numbers, and timestamp lines
    lines = [
        line.strip()
        for line in text.split("\n")
        if line.strip()
        and not line.startswith("WEBVTT")
        and not re.match(r"\d{2}:\d{2}", line)
        and not line.strip().isdigit()
    ]
    return " ".join(lines)

# Sort entire transcripts by topic
transcript = load_vtt("meeting.vtt")
dataset = [dspy.Example(message=transcript, category="standup").with_inputs("message")]
```
LiveKit transcripts (from LiveKit Agents egress or webhook data):
```python
import json

def load_livekit_transcript(path):
    """Extract text from a LiveKit transcript JSON export."""
    data = json.load(open(path))
    # LiveKit transcription segments have text + timestamps
    segments = data.get("segments", data.get("results", []))
    return " ".join(seg.get("text", "") for seg in segments)

transcript = load_livekit_transcript("call_transcript.json")
```
Recall.ai transcripts:
```python
def load_recall_transcript(transcript_data):
    """Extract text from a Recall.ai transcript response.

    transcript_data is the JSON from Recall's /transcript endpoint.
    """
    return " ".join(
        entry["words"]
        for entry in transcript_data
        if entry.get("words")
    )
```
Sorting transcript segments — often you want to classify individual segments rather than whole transcripts (e.g., tag each speaker turn by topic):
```python
import webvtt  # pip install webvtt-py

def vtt_to_segments(path):
    """Parse VTT into individual segments for per-segment sorting."""
    return [
        dspy.Example(message=caption.text, category="").with_inputs("message")
        for caption in webvtt.read(path)
        if caption.text.strip()
    ]
```
From Langfuse traces
If you're sorting AI interactions logged in Langfuse — classifying traces by quality, topic, failure mode, etc.:
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch traces to classify
traces = langfuse.fetch_traces(limit=200).data

dataset = [
    dspy.Example(
        message=trace.input.get("message", str(trace.input)),
        # If traces are already scored/tagged in Langfuse, use that as the label
        category=trace.tags[0] if trace.tags else ""
    ).with_inputs("message")
    for trace in traces
    if trace.input
]

# Filter out unlabeled ones for training, keep them for batch classification
labeled = [ex for ex in dataset if ex.category]
unlabeled = [ex for ex in dataset if not ex.category]
```
No labeled data yet
If the user doesn't have labeled examples, they have two options:
- Label a small set by hand — even 20-30 examples helps. Suggest they pick representative examples from each category.
- Use /ai-generating-data — generate synthetic training data from category descriptions.
Step 4: Evaluate quality
Before optimizing, measure how the baseline performs:
```python
from dspy.evaluate import Evaluate

def sorting_metric(example, prediction, trace=None):
    return prediction.category == example.category

evaluator = Evaluate(
    devset=devset,
    metric=sorting_metric,
    num_threads=4,
    display_progress=True,
    display_table=5,  # show 5 example results
)

score = evaluator(sorter)
print(f"Baseline accuracy: {score}%")
```
Multi-label metric
For multi-tag classification, exact match is too strict. Use Jaccard similarity (intersection over union):
```python
def multilabel_metric(example, pred, trace=None):
    gold = set(example.tags)
    predicted = set(pred.tags)
    if not gold and not predicted:
        return 1.0
    return len(gold & predicted) / len(gold | predicted)
```
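The same `Evaluate` harness from above accepts this metric; a sketch, assuming `devset` holds multi-tag examples with a `tags` field:

```python
# Score the tagger with Jaccard similarity instead of exact match
evaluator = Evaluate(devset=devset, metric=multilabel_metric, num_threads=4)
score = evaluator(tagger)
```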
Step 5: Optimize accuracy
| Optimizer | When to use | What it optimizes |
|---|---|---|
| `BootstrapFewShot` | Start here — fast, typically gives 10-20% accuracy bump | Selects few-shot demos from training data |
| `MIPROv2` | Accuracy plateaus after `BootstrapFewShot` | Demos + instructions jointly |
```python
optimizer = dspy.BootstrapFewShot(
    metric=sorting_metric,
    max_bootstrapped_demos=4,
)
optimized_sorter = optimizer.compile(sorter, trainset=trainset)

# Re-evaluate
score = evaluator(optimized_sorter)
print(f"Optimized accuracy: {score}%")
```
If accuracy plateaus, upgrade to MIPROv2:
```python
optimizer = dspy.MIPROv2(
    metric=sorting_metric,
    auto="medium",  # "light", "medium", or "heavy"
)
optimized_sorter = optimizer.compile(sorter, trainset=trainset)
```
Training hints for tricky examples
If certain examples are ambiguous ("I want to cancel" — is that billing or account?), add a hint field that's only present during training:
```python
class SortWithHint(dspy.Signature):
    """Sort the message into the correct category."""

    message: str = dspy.InputField()
    hint: str = dspy.InputField(desc="Clarifying context for ambiguous cases")
    category: Literal[tuple(CATEGORIES)] = dspy.OutputField()

# In training data, provide hints
trainset = [
    dspy.Example(
        message="I want to cancel",
        hint="Customer is asking about canceling their subscription billing",
        category="billing"
    ).with_inputs("message", "hint"),
]

# At inference time, pass hint="" or omit it
```
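A sketch of the full round trip: compile with the hinted training examples, then call with an empty hint at inference (reusing `sorting_metric` from Step 4):

```python
hint_sorter = dspy.BootstrapFewShot(metric=sorting_metric).compile(
    dspy.ChainOfThought(SortWithHint), trainset=trainset
)

# No hint available at inference time, so pass an empty string
result = hint_sorter(message="I want to cancel", hint="")
print(result.category)
```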
Step 6: Use it
Single item
```python
result = optimized_sorter(message="I was charged twice on my credit card last month")
print(f"Category: {result.category}")
print(f"Reasoning: {result.reasoning}")
```
Batch processing
For sorting many items at once, use dspy.Evaluate with your data or a simple loop. The evaluator handles threading automatically:
```python
# Quick batch with a loop
results = []
for item in items:
    result = optimized_sorter(message=item["text"])
    results.append({"text": item["text"], "category": result.category})

# Or use pandas
df["category"] = df["message"].apply(
    lambda msg: optimized_sorter(message=msg).category
)
```
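For larger batches, a thread pool keeps several LM calls in flight at once. A minimal sketch using only the standard library, assuming `items` is the same list of dicts as above:

```python
from concurrent.futures import ThreadPoolExecutor

def sort_one(item):
    return {"text": item["text"], "category": optimized_sorter(message=item["text"]).category}

# map() preserves input order; max_workers caps concurrent LM requests
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(sort_one, items))
```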
Confidence-based routing
When you need to know how sure the model is — for example, to escalate low-confidence items to a human:
```python
class SortWithConfidence(dspy.Signature):
    """Sort the content and rate your confidence."""

    message: str = dspy.InputField()
    category: Literal[tuple(CATEGORIES)] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="Confidence between 0.0 and 1.0")

sorter = dspy.ChainOfThought(SortWithConfidence)
result = sorter(message="I think there might be an issue")

if result.confidence < 0.7:
    # Flag for human review
    print(f"Low confidence ({result.confidence}) — needs human review")
else:
    print(f"Category: {result.category} (confidence: {result.confidence})")
```
Save and load
Persist your optimized sorter so you don't have to re-optimize every time:
```python
# Save
optimized_sorter.save("ticket_sorter.json")

# Load later
sorter = dspy.ChainOfThought(SortContent)
sorter.load("ticket_sorter.json")
```
Gotchas
- Using `Literal[list]` instead of `Literal[tuple(list)]`. Claude writes `Literal[["a", "b"]]`, which raises a `TypeError`. It must be `Literal[tuple(["a", "b"])]` — Python requires a tuple of values inside `Literal`.
- Categories > 15 degrade accuracy. With many flat categories, the LM confuses semantically close labels. Use hierarchical classification (coarse category first, then sub-category) instead of a flat list (see the sketch after this list).
- Omitting a catch-all category. Without "other" or "unknown", the model is forced to misclassify edge cases into the closest wrong bucket. Always include an explicit escape hatch for content that does not fit.
- Using verbose category names like "Issues related to billing". Short, unambiguous names ("billing_issue") give the LM a clearer signal. Add a `desc` field on the signature only if the name alone is ambiguous.
- Skipping adversarial inputs in the dev set. Inputs that span two categories or contain no relevant content expose classification weaknesses early. Add these before optimizing, not after.
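A minimal sketch of that hierarchical pattern; the two-level `TAXONOMY` map and its labels are hypothetical:

```python
# Hypothetical two-level taxonomy: coarse category -> sub-categories
TAXONOMY = {
    "billing": ["refund", "invoice_error", "double_charge"],
    "technical": ["login", "bug", "performance"],
}

class SortCoarse(dspy.Signature):
    """Pick the top-level category."""

    message: str = dspy.InputField()
    category: Literal[tuple(TAXONOMY)] = dspy.OutputField()

coarse_sorter = dspy.ChainOfThought(SortCoarse)

def sort_hierarchical(message: str):
    coarse = coarse_sorter(message=message).category

    # Build a per-branch signature so Literal only offers that branch's sub-categories
    class SortFine(dspy.Signature):
        """Pick the sub-category within the chosen top-level category."""

        message: str = dspy.InputField()
        subcategory: Literal[tuple(TAXONOMY[coarse])] = dspy.OutputField()

    fine = dspy.ChainOfThought(SortFine)(message=message).subcategory
    return coarse, fine
```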
Cross-references
Install any skill:

```
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
```

- Need scores instead of categories? See /ai-scoring
- Measure and improve sorting accuracy — see /ai-improving-accuracy
- Generate training data when you have none — see /ai-generating-data
- Define input/output contracts for signatures — see /dspy-signatures
- Add reasoning before classification — see /dspy-chain-of-thought
- Simple classification without reasoning — see /dspy-predict
- Constrain output quality with assertions — see /dspy-assertions
- Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do`
Additional resources
- For worked examples (sentiment, intent routing, topics, hierarchical), see examples.md