evaluate-multimodal
Evaluate Your Multimodal Agent
This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.
Step 1: Identify Modalities
Read the codebase to understand what your agent processes:
- Images: classification, analysis, generation, OCR
- Audio: transcription, voice agents, audio Q&A
- PDFs/Documents: parsing, extraction, summarization
- Mixed: multiple input types in one pipeline
Step 2: Read the Relevant Docs
Use the LangWatch MCP:
- fetch_scenario_docs → search for multimodal pages (image analysis, audio testing, file analysis)
- fetch_langwatch_docs → search for evaluation SDK docs
For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:
- Download/load documents
- Define extraction pipeline
- Use LangWatch experiment SDK to evaluate extraction accuracy
Step 3: Set Up Evaluation by Modality
Image Evaluation
LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:
- Loads test images
- Runs the agent on each image
- Uses an LLM-as-judge evaluator to assess output quality
```python
import langwatch

experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
```
Audio Evaluation
Use Scenario's audio testing patterns:
- Audio-to-text: verify transcription accuracy
- Audio-to-audio: verify voice agent responses
- Use fetch_scenario_docs with the url multimodal/audio-to-text.md
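For audio-to-text checks, one common metric is word error rate (WER): the word-level edit distance between the reference transcript and the agent's output, normalized by reference length. A minimal sketch of the metric itself (how you obtain the hypothesis transcript depends on your own agent):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[-1][-1] / max(len(ref), 1)

# One substituted word out of four → WER 0.25
print(word_error_rate("turn on the lights", "turn off the lights"))
```

You can feed the resulting score into the same experiment loop shown for images, or use it as a pass/fail threshold per entry.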
PDF/Document Evaluation
Follow the pattern from the PDF parsing evaluation example:
- Load documents (PDFs, CSVs, etc.)
- Define extraction/parsing pipeline
- Evaluate extraction accuracy against expected fields
- Use structured evaluation (exact match for fields, LLM judge for summaries)
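The structured part of that evaluation can be as simple as field-level exact match. A minimal sketch, assuming your extraction pipeline returns a flat dict of fields (the invoice field names here are illustrative, not from the example notebook):

```python
def field_accuracy(expected: dict, extracted: dict) -> dict:
    """Exact-match score per field, plus the overall match ratio."""
    matches = {k: extracted.get(k) == v for k, v in expected.items()}
    return {
        "per_field": matches,
        "accuracy": sum(matches.values()) / len(matches) if matches else 0.0,
    }

expected = {"invoice_number": "INV-001", "total": "99.50", "currency": "EUR"}
extracted = {"invoice_number": "INV-001", "total": "99.50", "currency": "USD"}
print(field_accuracy(expected, extracted))  # 2 of 3 fields match
```

Use exact match like this for structured fields, and reserve the LLM judge for free-text outputs such as summaries, where exact match is too strict.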
File Analysis
For agents that process arbitrary files:
- Use Scenario's file analysis patterns
- Use fetch_scenario_docs with the url multimodal/multimodal-files.md
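When the agent accepts arbitrary files, your evaluation harness usually needs to route each test file to the right pipeline first. A minimal sketch using the standard library's MIME-type guessing (the route names are illustrative):

```python
import mimetypes

def route_file(path: str) -> str:
    """Pick an evaluation pipeline based on the file's guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("audio/"):
        return "audio"
    if mime == "application/pdf":
        return "pdf"
    return "other"

print(route_file("scan.pdf"))   # pdf
print(route_file("photo.jpg"))  # image
```

Note that mimetypes guesses from the extension only; if your agent sniffs file contents, mirror that logic in the harness instead.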
Step 4: Generate Domain-Specific Test Data
For each modality, generate or collect test data that matches the agent's actual use case:
- If it's a medical imaging agent → use relevant medical image samples
- If it's a document parser → use real document types the agent encounters
- If it's a voice assistant → record realistic voice prompts
Step 5: Run and Iterate
Run the evaluation, review results, fix issues, re-run until quality is acceptable.
Common Mistakes
- Do NOT evaluate multimodal agents with text-only metrics — use image-aware judges
- Do NOT skip testing with real file formats — synthetic descriptions aren't enough
- Do NOT forget to handle file loading errors in evaluations
- Do NOT use generic test images — use domain-specific ones matching the agent's purpose
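On the file-loading point: a single unreadable file should skip that entry, not abort the whole run. A minimal sketch of a defensive loader you might wrap around each dataset entry:

```python
from pathlib import Path
from typing import Optional

def load_bytes(path: str) -> Optional[bytes]:
    """Load a test file, returning None instead of crashing the whole run."""
    try:
        return Path(path).read_bytes()
    except OSError as exc:  # missing file, permission error, etc.
        print(f"skipping {path}: {exc}")
        return None

data = load_bytes("does/not/exist.pdf")
print(data is None)  # the eval loop can skip this entry and continue
```

Inside the experiment loop, check for None and record the entry as skipped (or failed) so missing files are visible in the results rather than silently dropped.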