evaluate-multimodal
Evaluate Your Multimodal Agent
This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.
Step 1: Identify Modalities
Read the codebase to understand what your agent processes:
- Images: classification, analysis, generation, OCR
- Audio: transcription, voice agents, audio Q&A
- PDFs/Documents: parsing, extraction, summarization
- Mixed: multiple input types in one pipeline
Step 2: Read the Relevant Docs
Use the LangWatch MCP:
- fetch_scenario_docs → search for multimodal pages (image analysis, audio testing, file analysis)
- fetch_langwatch_docs → search for evaluation SDK docs
For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:
- Download/load documents
- Define extraction pipeline
- Use LangWatch experiment SDK to evaluate extraction accuracy
Step 3: Set Up Evaluation by Modality
Image Evaluation
LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:
- Loads test images
- Runs the agent on each image
- Uses an LLM-as-judge evaluator to assess output quality
```python
import langwatch

experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
```
Audio Evaluation
Use Scenario's audio testing patterns:
- Audio-to-text: verify transcription accuracy
- Audio-to-audio: verify voice agent responses
- Use fetch_scenario_docs with the url multimodal/audio-to-text.md
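For audio-to-text checks, one common metric is word error rate (WER): the word-level edit distance between the reference transcript and the agent's output, normalized by reference length. A minimal sketch of the metric itself (how you obtain the hypothesis transcript depends on your own agent):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[-1][-1] / max(len(ref), 1)

# One substituted word out of four → WER 0.25
print(word_error_rate("turn on the lights", "turn off the lights"))
```

You can feed the resulting score into the same experiment loop shown for images, or use it as a pass/fail threshold per entry.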
PDF/Document Evaluation
Follow the pattern from the PDF parsing evaluation example:
- Load documents (PDFs, CSVs, etc.)
- Define extraction/parsing pipeline
- Evaluate extraction accuracy against expected fields
- Use structured evaluation (exact match for fields, LLM judge for summaries)
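The structured part of that evaluation can be as simple as field-level exact match. A minimal sketch, assuming your extraction pipeline returns a flat dict of fields (the invoice field names here are illustrative, not from the example notebook):

```python
def field_accuracy(expected: dict, extracted: dict) -> dict:
    """Exact-match score per field, plus the overall match ratio."""
    matches = {k: extracted.get(k) == v for k, v in expected.items()}
    return {
        "per_field": matches,
        "accuracy": sum(matches.values()) / len(matches) if matches else 0.0,
    }

expected = {"invoice_number": "INV-001", "total": "99.50", "currency": "EUR"}
extracted = {"invoice_number": "INV-001", "total": "99.50", "currency": "USD"}
print(field_accuracy(expected, extracted))  # 2 of 3 fields match
```

Use exact match like this for structured fields, and reserve the LLM judge for free-text outputs such as summaries, where exact match is too strict.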
File Analysis
For agents that process arbitrary files:
- Use Scenario's file analysis patterns
- Use fetch_scenario_docs with the url multimodal/multimodal-files.md
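When the agent accepts arbitrary files, your evaluation harness usually needs to route each test file to the right pipeline first. A minimal sketch using the standard library's MIME-type guessing (the route names are illustrative):

```python
import mimetypes

def route_file(path: str) -> str:
    """Pick an evaluation pipeline based on the file's guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("audio/"):
        return "audio"
    if mime == "application/pdf":
        return "pdf"
    return "other"

print(route_file("scan.pdf"))   # pdf
print(route_file("photo.jpg"))  # image
```

Note that mimetypes guesses from the extension only; if your agent sniffs file contents, mirror that logic in the harness instead.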
Step 4: Generate Domain-Specific Test Data
For each modality, generate or collect test data that matches the agent's actual use case:
- If it's a medical imaging agent → use relevant medical image samples
- If it's a document parser → use real document types the agent encounters
- If it's a voice assistant → record realistic voice prompts
Step 5: Run and Iterate
Run the evaluation, review results, fix issues, re-run until quality is acceptable.
Common Mistakes
- Do NOT evaluate multimodal agents with text-only metrics — use image-aware judges
- Do NOT skip testing with real file formats — synthetic descriptions aren't enough
- Do NOT forget to handle file loading errors in evaluations
- Do NOT use generic test images — use domain-specific ones matching the agent's purpose
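On the file-loading point: a single unreadable file should skip that entry, not abort the whole run. A minimal sketch of a defensive loader you might wrap around each dataset entry:

```python
from pathlib import Path
from typing import Optional

def load_bytes(path: str) -> Optional[bytes]:
    """Load a test file, returning None instead of crashing the whole run."""
    try:
        return Path(path).read_bytes()
    except OSError as exc:  # missing file, permission error, etc.
        print(f"skipping {path}: {exc}")
        return None

data = load_bytes("does/not/exist.pdf")
print(data is None)  # the eval loop can skip this entry and continue
```

Inside the experiment loop, check for None and record the entry as skipped (or failed) so missing files are visible in the results rather than silently dropped.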