Evaluate Your Multimodal Agent

This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.

Step 1: Identify Modalities

Read the codebase to understand what your agent processes:

  • Images: classification, analysis, generation, OCR
  • Audio: transcription, voice agents, audio Q&A
  • PDFs/Documents: parsing, extraction, summarization
  • Mixed: multiple input types in one pipeline
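When triaging a codebase, the modality inventory above can be sketched as a simple extension lookup. A minimal sketch, purely illustrative — the map is not exhaustive and your agent may route by MIME type instead:

```python
import os

# Illustrative mapping from file extension to modality; extend per codebase.
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".wav": "audio", ".mp3": "audio",
    ".pdf": "document", ".csv": "document",
}

def detect_modality(path: str) -> str:
    """Guess the modality of a test file from its extension; default to text."""
    ext = os.path.splitext(path)[1].lower()
    return MODALITY_BY_EXTENSION.get(ext, "text")
```

Grouping your test files this way makes it easy to see which of the per-modality setups in Step 3 you actually need.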

Step 2: Read the Relevant Docs

Use the LangWatch MCP:

  • fetch_scenario_docs → search for multimodal pages (image analysis, audio testing, file analysis)
  • fetch_langwatch_docs → search for evaluation SDK docs

For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:

  • Download/load documents
  • Define extraction pipeline
  • Use LangWatch experiment SDK to evaluate extraction accuracy

Step 3: Set Up Evaluation by Modality

Image Evaluation

LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:

  1. Loads test images
  2. Runs the agent on each image
  3. Uses an LLM-as-judge evaluator to assess output quality

```python
import langwatch

experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
```

Audio Evaluation

Use Scenario's audio testing patterns:

  • Audio-to-text: verify transcription accuracy
  • Audio-to-audio: verify voice agent responses
  • Use fetch_scenario_docs with the url parameter pointing to multimodal/audio-to-text.md
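For audio-to-text checks, transcription accuracy is commonly scored with word error rate (WER): word-level edit distance divided by the reference length. A minimal, dependency-free sketch (the helper name and sample strings are illustrative; Scenario's own audio utilities may differ):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic edit-distance dynamic programming over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

You can then assert a WER threshold per test clip (e.g. fail the eval when WER exceeds 0.1 for clean studio audio).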

PDF/Document Evaluation

Follow the pattern from the PDF parsing evaluation example:

  1. Load documents (PDFs, CSVs, etc.)
  2. Define extraction/parsing pipeline
  3. Evaluate extraction accuracy against expected fields
  4. Use structured evaluation (exact match for fields, LLM judge for summaries)
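Steps 3 and 4 above can be combined in one small scorer: exact match per expected field, leaving free-text fields (like summaries) to an LLM judge. A sketch with illustrative field names — not the SDK's API:

```python
def score_extraction(extracted: dict, expected: dict) -> dict:
    """Per-field exact match plus an overall accuracy for one document."""
    results = {
        field: extracted.get(field) == value
        for field, value in expected.items()
    }
    # Fraction of expected fields the pipeline got exactly right.
    results["accuracy"] = sum(results.values()) / max(len(expected), 1)
    return results
```

Logging the per-field booleans (rather than only the aggregate) makes it obvious which fields the parser consistently misses across the dataset.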

File Analysis

For agents that process arbitrary files:

  • Use Scenario's file analysis patterns
  • fetch_scenario_docs with the url parameter pointing to multimodal/multimodal-files.md

Step 4: Generate Domain-Specific Test Data

For each modality, generate or collect test data that matches the agent's actual use case:

  • If it's a medical imaging agent → use relevant medical image samples
  • If it's a document parser → use real document types the agent encounters
  • If it's a voice assistant → record realistic voice prompts

Step 5: Run and Iterate

Run the evaluation, review results, fix issues, re-run until quality is acceptable.

Common Mistakes

  • Do NOT evaluate multimodal agents with text-only metrics — use image-aware judges
  • Do NOT skip testing with real file formats — synthetic descriptions aren't enough
  • Do NOT forget to handle file loading errors in evaluations
  • Do NOT use generic test images — use domain-specific ones matching the agent's purpose