evaluations
Set Up Evaluations for Your Agent
LangWatch Evaluations is a comprehensive QA system. Map the user's request to one branch:
| User says... | They need... | Go to... |
|---|---|---|
| "test my agent", "benchmark", "compare models" | Experiments | Step A |
| "monitor production", "track quality", "block harmful content", "safety" | Online Evaluation (includes guardrails) | Step B |
| "create an evaluator", "scoring function" | Evaluators | Step C |
| "create a dataset", "test data" | Datasets | Step D |
| "evaluate" (ambiguous) | Ask: "batch test or production monitoring?" | - |
Where Evaluations Fit
Evaluations sit at the component level of the testing pyramid — they test specific aspects of an agent with many input/output examples. This differs from scenarios, which are end-to-end and multi-turn.
Use evaluations when you have many examples with clear correct answers, or for CI quality gates. Use scenarios for multi-turn behavior and tool-calling sequences.
Determine Scope
If the user's request is general ("set up evaluations"):
- Read the codebase to understand the agent
- Study git history to understand what changed and why — focus on agent behavior changes, prompt tweaks, bug fixes. Read commit messages for context.
- Set up an experiment + evaluator + dataset
- After the experiment is working, summarize results and suggest improvements (consultant mode — see end of skill).
If the user's request is specific ("add a faithfulness evaluator"):
- Focus on the specific need
- Create the targeted evaluator, dataset, or experiment
- Verify it works
Detect Context
If you're in a codebase (package.json, pyproject.toml, etc.) — use the SDK for experiments and guardrails; use the CLI for evaluators, datasets, monitors. If there is no codebase, drive everything via the CLI. If ambiguous, ask the user.
Some features are code-only (experiments, guardrails) and some are platform-only (monitors). Evaluators work on both surfaces.
Plan Limits
LangWatch's free plan has limits on prompts, scenarios, evaluators, experiments, and datasets. When you hit a limit, the API returns "Free plan limit of N reached..." with an upgrade link.
How to handle:
- Work within the limits — if 3 scenarios are allowed, create 3 meaningful ones, not 10.
- Make every creation count: each one should demonstrate clear value.
- Show what works FIRST. If you hit a limit, summarize what was accomplished and direct the user to upgrade at https://app.langwatch.ai/settings/subscription.
- Do NOT delete existing resources to make room, and do NOT reuse a scenario set to cram in more tests.
If LANGWATCH_ENDPOINT is set in .env, the user is self-hosted — direct them to {LANGWATCH_ENDPOINT}/settings/license instead.
Prerequisites
Use langwatch docs <path> to read documentation as Markdown. Some useful entry points:
langwatch docs # Docs index
langwatch docs integration/python/guide # Python integration
langwatch docs integration/typescript/guide # TypeScript integration
langwatch docs prompt-management/cli # Prompts CLI
langwatch scenario-docs # Scenario docs index
Discover commands with langwatch --help and langwatch <subcommand> --help. List and get commands accept --format json for machine-readable output. Read the docs first instead of guessing SDK APIs or CLI flags.
If no shell is available, fetch the same Markdown over plain HTTP — append .md to any docs path (e.g. https://langwatch.ai/docs/integration/python/guide.md). Index: https://langwatch.ai/docs/llms.txt. Scenario index: https://langwatch.ai/scenario/llms.txt
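For instance, a quick sketch of that HTTP fallback using only the Python standard library (the URLs are the ones listed above):

```python
import urllib.request

# Fetch the docs index and one guide as Markdown (append .md to any docs path).
for url in (
    "https://langwatch.ai/docs/llms.txt",
    "https://langwatch.ai/docs/integration/python/guide.md",
):
    with urllib.request.urlopen(url) as response:
        print(response.read().decode("utf-8")[:500])  # preview the first 500 characters
```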
Then read the evaluations overview:
langwatch docs evaluations/overview
Step A: Experiments (Batch Testing) — Code Approach
Create a script or notebook that runs the agent against a dataset and measures quality.
- Read the SDK docs:
langwatch docs evaluations/experiments/sdk
- Analyze the agent code to understand its inputs/outputs.
- Create a dataset with examples that look like real production data — domain-realistic, not generic.
- Create the experiment file:
Python (Jupyter):
```python
import langwatch
import pandas as pd

data = {
    "input": ["domain-specific question 1", "domain-specific question 2"],
    "expected_output": ["expected answer 1", "expected answer 2"],
}
df = pd.DataFrame(data)

evaluation = langwatch.experiment.init("agent-evaluation")

for index, row in evaluation.loop(df.iterrows()):
    response = my_agent(row["input"])
    evaluation.evaluate(
        "ragas/answer_relevancy",
        index=index,
        data={"input": row["input"], "output": response},
        settings={"model": "openai/gpt-5-mini", "max_tokens": 2048},
    )
```
TypeScript:
```typescript
import { LangWatch } from "langwatch";

const langwatch = new LangWatch();

const dataset = [
  { input: "domain-specific question", expectedOutput: "expected answer" },
];

const evaluation = await langwatch.experiments.init("agent-evaluation");

await evaluation.run(dataset, async ({ item, index }) => {
  const response = await myAgent(item.input);
  await evaluation.evaluate("ragas/answer_relevancy", {
    index,
    data: { input: item.input, output: response },
    settings: { model: "openai/gpt-5-mini", max_tokens: 2048 },
  });
});
```
- Run it. ALWAYS execute the experiment after creating it — an unrun experiment is useless. For Python notebooks, run the cells or use `jupyter nbconvert --to notebook --execute`. For TypeScript, run `npx tsx experiment.ts`.
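Both examples assume a `my_agent` / `myAgent` entry point that you supply. A hypothetical stub (not part of the LangWatch SDK) lets you verify the experiment wiring before plugging in the real agent:

```python
def my_agent(user_input: str) -> str:
    # Placeholder — replace with your real agent call (LLM client, chain,
    # or agent framework invocation). The canned answer only exists so the
    # experiment loop can be executed end to end.
    return f"Stub answer for: {user_input}"
```

Once the loop runs green, swap the stub for the actual agent invocation and re-run.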
Step B: Online Evaluation (Production Monitoring & Guardrails)
Platform mode: Monitors (continuous async scoring)
langwatch docs evaluations/online-evaluation/overview
Create monitors via the CLI (langwatch monitor --help for the flag set). Optionally configure further at https://app.langwatch.ai → Evaluations → Monitors.
Code mode: Guardrails (synchronous blocking)
langwatch docs evaluations/guardrails/code-integration
Add guardrail checks in agent code:
```python
import langwatch

@langwatch.trace()
def my_agent(user_input):
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",
        name="Jailbreak Detection",
        as_guardrail=True,
        data={"input": user_input},
    )
    if not guardrail.passed:
        return "I can't help with that request."
    ...
```
Key distinction: Monitors measure (async). Guardrails act (sync via as_guardrail=True).
Step C: Evaluators (Scoring Functions)
Read the docs first:
langwatch docs evaluations/evaluators/overview
langwatch docs evaluations/evaluators/list # Browse available evaluators
In code, call evaluators via the SDK as shown in Step A. To create or manage evaluators on the platform, use langwatch evaluator --help. If unsure which --type values are valid, run langwatch evaluator create --help first.
If you need an LLM-as-judge evaluator, verify a model provider is configured (langwatch model-provider list).
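As a sketch, adding a faithfulness check to the Step A loop for a RAG agent might look like this — the `ragas/faithfulness` slug, its `contexts` field, and the `my_rag_agent` helper are assumptions; confirm the evaluator and its required fields against the evaluators list first:

```python
# Reuses `evaluation` and `df` from the Step A experiment setup.
for index, row in evaluation.loop(df.iterrows()):
    # my_rag_agent is hypothetical: returns the answer plus the retrieved passages.
    response, retrieved_docs = my_rag_agent(row["input"])
    evaluation.evaluate(
        "ragas/faithfulness",  # assumed slug — verify via langwatch docs evaluations/evaluators/list
        index=index,
        data={
            "input": row["input"],
            "output": response,
            "contexts": retrieved_docs,  # passages the answer should be grounded in
        },
        settings={"model": "openai/gpt-5-mini"},
    )
```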
Step D: Datasets
Read the docs first:
langwatch docs datasets/overview
langwatch docs datasets/programmatic-access
langwatch docs datasets/ai-dataset-generation
Use langwatch dataset --help for create/upload/download. Generate data tailored to the agent:
| Agent type | Dataset examples |
|---|---|
| Chatbot | Realistic user questions matching the bot's persona |
| RAG pipeline | Questions with expected answers testing retrieval quality |
| Classifier | Inputs with expected category labels |
| Code assistant | Coding tasks with expected outputs |
| Customer support | Support tickets and customer questions |
| Summarizer | Documents with expected summaries |
CRITICAL: The dataset MUST be specific to what the agent ACTUALLY does. Before generating any data:
- Read the agent's system prompt word by word
- Read the agent's function signatures and tool definitions
- Understand the agent's domain, persona, and constraints
Then generate data reflecting EXACTLY this agent's real-world usage. NEVER use generic examples like "What is 2+2?", "What is the capital of France?", or "Explain quantum computing" — every example must be something a real user of THIS specific agent would say.
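As an illustration of what "specific to this agent" means, here is a hypothetical dataset for a billing-support agent — the rows, file name, and agent domain are invented for the example, and the exact upload command should come from `langwatch dataset --help`:

```python
import pandas as pd

# Hypothetical examples for a billing-support agent: every row is something a
# real user of THIS agent would ask, never generic trivia.
df = pd.DataFrame({
    "input": [
        "I was charged twice for my Pro subscription this month",
        "How do I download the invoice for March?",
        "Can I switch from annual to monthly billing without losing my discount?",
    ],
    "expected_output": [
        "Acknowledge the duplicate charge and explain the refund process and timeline",
        "Point to Billing -> Invoices and walk through the download steps",
        "Explain the plan-change policy and what happens to the existing discount",
    ],
})
df.to_csv("billing_support_dataset.csv", index=False)
# Upload via the CLI — check `langwatch dataset --help` for the exact subcommand and flags.
```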
Consultant Mode
Once the experiment is working and initial results are delivered, transition to consultant mode to help the user get maximum value: summarize the results and suggest 2-3 domain-specific improvements based on what you learned from the codebase.
Phase 1 — read first. Before generating ANY content: read the codebase end-to-end (every system prompt, function, tool definition), study git history for agent-related changes (git log --oneline -30, then drill into prompt/agent/eval-related commits — the WHY in commit messages matters more than the WHAT), and read READMEs and comments for domain context.
Phase 2 — quick wins. Generate best-effort content based on what you learned. Run everything, iterate until green. Show the user what works — the a-ha moment.
Phase 3 — go deeper. Once Phase 2 lands, summarize what you delivered, then suggest 2-3 specific improvements grounded in the codebase: domain edge cases, areas that need expert terminology or real data, integration points (APIs, databases, file uploads), or regression patterns from git history that deserve test coverage. Ask light questions with options, not open-ended ("Want scenarios for X or Y?", "I noticed Z was a recurring issue — add a regression test?", "Do you have real customer queries I could use?"). Respect "that's enough" and wrap up cleanly.
Do NOT ask permission before Phase 1 and 2 — deliver value first. Do NOT ask generic questions or overwhelm with too many suggestions. Do NOT generate generic datasets — everything must reflect the actual domain.
Common Mistakes
- Do NOT say "run an evaluation" — be specific: experiment, monitor, or guardrail
- Do NOT use generic/placeholder datasets — generate domain-specific examples
- Do NOT skip running the experiment to verify it works
- Monitors measure (async); guardrails act (sync, via code with `as_guardrail=True`)