# LangGraph Testing & Evaluation
Practical workflows for validating agent quality with:
- Unit/integration tests
- Trajectory evaluation
- LangSmith dataset evaluations
- A/B-style comparisons between versions
Use this file for the high-level flow. Load `references/*` for detailed implementation.
## Start Here
Choose the smallest approach that answers your question:
| Goal | Primary method | Load first |
|---|---|---|
| Validate node logic quickly | Unit tests with mocks | references/unit-testing-patterns.md |
| Validate multi-step agent behavior | Trajectory evaluation | references/trajectory-evaluation.md |
| Track quality over datasets over time | LangSmith evaluation | references/langsmith-evaluation.md |
| Compare old vs new agent versions | A/B comparison | references/ab-testing.md |
Recommended order:
1. Unit tests
2. Integration/trajectory checks
3. Dataset evaluation in LangSmith
4. A/B comparison before deployment
## Quick Commands
Run from repo root.
### Generate test scaffolding

```bash
# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```
### Run trajectory evaluation

```bash
# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini

# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```
### Run LangSmith dataset evaluation

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4

# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```
### Compare two agent versions

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json

# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```
### Create mock response configs

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```
## Core Workflow
1. Define test scope.
   - Unit: deterministic logic in one node/function.
   - Integration: node interactions and routing.
   - End-to-end: complete response quality on realistic inputs.
2. Start from deterministic checks.
   - Mock LLM/tool IO for speed and repeatability.
   - Keep real-model tests as a smaller, explicit suite.
3. Build/curate dataset examples.
   - Use stable inputs and expected outputs.
   - Keep the schema simple: `inputs` and `outputs` objects (optional `metadata`); see the sketch after this list.
   - Compatibility note: scripts also accept singular keys (`input`, `output`) for legacy datasets.
4. Run evaluation with explicit gates.
   - Use evaluator keys that map to deployment decisions.
   - Set thresholds in CI for regression prevention.
5. Compare versions before rollout.
   - Run the same dataset on both versions.
   - Check both quality and latency.
6. Diagnose failures from traces/experiments.
   - Inspect low-scoring examples.
   - Split failures by pattern (routing, tool usage, hallucination, latency spikes).
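For reference, a minimal sketch of that dataset shape. The top-level `examples` key matches the bundled sample asset; the field names inside `inputs`/`outputs` (`question`, `answer`, `category`) are illustrative placeholders, not a required schema.

```python
# Minimal sketch of the expected dataset shape: a top-level "examples" list,
# each entry with "inputs" and "outputs" objects and optional "metadata".
# The inner field names are illustrative -- match your agent's own schema.
import json

dataset = {
    "examples": [
        {
            "inputs": {"question": "What does the refund policy cover?"},
            "outputs": {"answer": "Refunds are available within 30 days."},
            "metadata": {"category": "policy"},
        },
        {
            "inputs": {"question": "Summarize the latest incident report."},
            "outputs": {"answer": "A duplicate charge affected a small set of orders."},
            "metadata": {"category": "summarization"},
        },
    ]
}

with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)
```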
## Current References (Load On Demand)
### references/unit-testing-patterns.md
Load when:
- You need node-level and routing test patterns.
- You need pytest/vitest/Jest integration patterns.
- You need robust mocking and flaky-test reduction.
### references/trajectory-evaluation.md
Load when:
- You need trajectory match evaluation (`strict`, `unordered`, `subset`, `superset`).
- You need LLM-as-judge trajectory scoring.
- You need LangSmith experiment comparison for trajectory results.
### references/langsmith-evaluation.md
Load when:
- You need dataset creation/management in LangSmith.
- You need evaluator signatures and experiment runs in Python/TS.
- You need CI-friendly workflows with quality thresholds.
### references/ab-testing.md
Load when:
- You need offline A/B comparison methodology.
- You need significance testing and interpretation.
- You need production traffic split strategy and guardrails.
## Assets
### assets/templates/test_template.py
- Runnable Python pytest template aligned with current LangGraph testing patterns.
- Includes:
  - Compiled-graph invocation with `thread_id`
  - Single-node testing via `compiled_graph.nodes[...]`
  - Integration-test placeholder
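For orientation before opening the template, a hedged sketch of those three patterns. It assumes a hypothetical `my_agent` module exposing a compiled graph with a checkpointer; the node name `classify`, the message shapes, and the asserted keys are placeholders to adapt.

```python
# tests/test_my_agent.py -- illustrative only; mirror your real graph, node
# names, and state schema. Assumes `my_agent.graph` is a compiled LangGraph
# graph configured with an in-memory checkpointer.
import pytest

from my_agent import graph  # hypothetical module exposing the compiled graph


def test_full_graph_invocation_with_thread_id():
    # Compiled-graph invocation: a thread_id is required once a checkpointer
    # is configured, and it keeps test runs isolated from each other.
    config = {"configurable": {"thread_id": "test-thread-1"}}
    result = graph.invoke({"messages": [("user", "hello")]}, config)
    assert "messages" in result


def test_single_node_in_isolation():
    # Single-node testing: invoke one node with a handcrafted state instead of
    # running the whole graph. The node name "classify" is illustrative.
    node = graph.nodes["classify"]
    output = node.invoke({"messages": [("user", "refund my order")]})
    assert output  # assert on the partial state update the node returns


@pytest.mark.integration
def test_end_to_end_quality():
    # Integration placeholder: run against real models only in an explicit,
    # smaller suite (e.g. `pytest -m integration`).
    pass
```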
### assets/datasets/sample_dataset.json
- Deterministic seed dataset for LangSmith ingestion.
- Uses the `examples: [{ inputs, outputs, metadata }]` format.
### assets/examples/README.md
- Documentation-only index for current asset usage.
- Notes where runnable assets live today.
## Script Interface Summary
### scripts/generate_test_cases.py / .js
Use for fast test scaffolding.

Inputs:
- Graph module path
  - Python: `my_module:graph` or `my_module.graph`
  - JS/TS: `./file.ts:graph`

Outputs:
- Framework-specific starter tests in the target directory.
### scripts/run_trajectory_eval.py / .js
Use for trajectory scoring with either:
- `--method match`
- `--method llm-judge`

Supports:
- Local dataset files (`.json`)
- LangSmith dataset names
- Optional reference trajectory file with `--reference-trajectory`
- Match modes: `strict`, `unordered`, `subset`, `superset`

Local-only mode:
- `--no-langsmith` in both the Python and JavaScript scripts (requires a local JSON dataset file)
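To build intuition for the four match modes, a hand-rolled sketch of their semantics. This is not the script's implementation; a "trajectory" here is simply the ordered list of tool/node names the agent visited.

```python
# Hand-rolled illustration of the four trajectory match modes -- a sketch of
# the semantics only, not the evaluation script's actual logic.
def matches(actual: list[str], reference: list[str], mode: str) -> bool:
    if mode == "strict":
        # Same steps in the same order.
        return actual == reference
    if mode == "unordered":
        # Same steps, order ignored (counts must still agree).
        return sorted(actual) == sorted(reference)
    if mode == "subset":
        # The agent used no steps outside the reference trajectory.
        return set(actual) <= set(reference)
    if mode == "superset":
        # The agent covered at least every reference step.
        return set(actual) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")


reference = ["search", "summarize", "respond"]
assert matches(["search", "summarize", "respond"], reference, "strict")
assert matches(["summarize", "search", "respond"], reference, "unordered")
assert not matches(["search", "browse", "respond"], reference, "subset")
```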
### scripts/evaluate_with_langsmith.py / .js
Use for dataset-based evaluation runs and experiment tracking.

Supports:
- Existing dataset by name
- Dataset creation from a JSON examples file
- Multiple evaluators (`--evaluators accuracy,latency,...`)
- Concurrency control (`--max-concurrency`)

Python-only:
- `--no-upload` to run without uploading experiment results
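If the bundled evaluators are not enough, a rough sketch of a custom run with the LangSmith SDK. The dataset name, target function, and scoring rule are placeholders; the `(run, example)` evaluator signature returning a keyed score is the SDK convention.

```python
# Rough sketch of a custom LangSmith evaluation run, separate from the
# bundled script. Dataset name, target, and scoring logic are placeholders.
from langsmith import evaluate


def run_agent(inputs: dict) -> dict:
    # Call your compiled graph here and return its output dict.
    return {"answer": "..."}


def accuracy(run, example) -> dict:
    # Compare the run's output against the example's reference output and
    # return a keyed score that CI thresholds can gate on.
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "accuracy", "score": float(predicted.strip() == expected.strip())}


results = evaluate(
    run_agent,
    data="my_dataset",            # existing LangSmith dataset name
    evaluators=[accuracy],
    experiment_prefix="baseline",
    max_concurrency=4,
)
```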
### scripts/compare_agents.py / .js
Use for offline version comparisons:
- Shared dataset input
- Success/latency summaries
- JSON report output for CI artifacts
- Local JSON datasets or LangSmith datasets (the JS script supports `--no-langsmith` to disable remote loading)
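For intuition about what the comparison does, a hand-rolled sketch (not the script's actual report schema): run both versions over the same examples and summarize success rate and latency side by side.

```python
# Illustrative offline comparison loop -- not the script's report format.
# The equality-based "success" check is a placeholder for a real evaluator.
import statistics
import time


def compare(run_v1, run_v2, examples: list[dict]) -> dict:
    report = {}
    for name, run_fn in [("v1", run_v1), ("v2", run_v2)]:
        successes, latencies = 0, []
        for ex in examples:
            start = time.perf_counter()
            output = run_fn(ex["inputs"])
            latencies.append(time.perf_counter() - start)
            if output.get("answer") == ex["outputs"].get("answer"):
                successes += 1
        report[name] = {
            "success_rate": successes / len(examples),
            "p50_latency_s": statistics.median(latencies),
        }
    return report
```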
### scripts/mock_llm_responses.py / .js
Use for deterministic test doubles:
- `single`
- `sequence`
- `conditional`
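To illustrate the three styles (the generated config schema lives in the script and is not reproduced here), a hand-rolled test double with the same behaviors:

```python
# Hand-rolled sketch of the three mock styles: single (fixed reply),
# sequence (scripted replies in order), conditional (route by prompt text).
# The class and its API are illustrative, not the script's config format.
class MockLLM:
    def __init__(self, single=None, sequence=None, conditional=None):
        self._single = single
        self._sequence = iter(sequence or [])
        self._conditional = conditional or {}

    def invoke(self, prompt: str) -> str:
        if self._single is not None:
            return self._single
        for needle, response in self._conditional.items():
            if needle in prompt:
                return response
        return next(self._sequence)


# Usage: a sequence mock that first "plans" and then "answers".
llm = MockLLM(sequence=["PLAN: search the docs", "FINAL: LangGraph is a graph runtime"])
assert llm.invoke("user question") == "PLAN: search the docs"
assert llm.invoke("continue") == "FINAL: LangGraph is a graph runtime"
```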
## Decision Rules
If behavior is deterministic and local:
- Use unit tests first.
If behavior depends on tool sequence/routing:
- Add trajectory evaluation.
If behavior depends on realistic distribution quality:
- Run LangSmith dataset evaluation.
If approving a replacement model/prompt/graph:
- Run A/B comparison and check both quality and latency.
## Common Failure Patterns
### Flaky tests
- Cause: real-model nondeterminism in unit scope.
- Fix: mock LLM/tool calls for unit tests; reserve real-model tests for separate integration marks.
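A minimal sketch of that fix, assuming the node under test reads a module-level `llm` object; the module path `my_agent.nodes`, the `route_request` function, and the message shapes are placeholders.

```python
# Patch the model call so the unit test cannot flake on real-model output.
# Module path, node function, and asserted keys are illustrative.
from unittest.mock import patch

from langchain_core.messages import AIMessage

from my_agent.nodes import route_request  # hypothetical node function


def test_router_node_is_deterministic():
    canned = AIMessage(content="route: billing")
    with patch("my_agent.nodes.llm") as mock_llm:
        # The node's llm.invoke(...) call now returns the canned message.
        mock_llm.invoke.return_value = canned
        result = route_request({"messages": [("user", "I was double charged")]})
    assert result["next"] == "billing"
```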
### High trajectory variance
- Cause: overly strict matching for workflows with equivalent paths.
- Fix: switch match mode (`unordered`, `subset`, or `superset`) where appropriate.
### Regressions hidden by averages
- Cause: only aggregate score monitored.
- Fix: inspect per-example failures and segment by category metadata.
### Latency regressions with same quality
- Cause: no explicit latency gate.
- Fix: include latency evaluator and CI threshold.
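A sketch of that fix using the LangSmith `(run, example)` evaluator convention; the 2-second budget, the gate thresholds, and the aggregation step are placeholders to adapt.

```python
# Latency evaluator plus an explicit CI gate. The 2-second budget and the
# threshold values are placeholders; wire enforce_gates into your CI step
# after aggregating per-example scores from the evaluation run.
def latency(run, example) -> dict:
    elapsed = (run.end_time - run.start_time).total_seconds()
    return {"key": "latency_ok", "score": float(elapsed <= 2.0)}


def enforce_gates(mean_scores: dict, thresholds: dict) -> None:
    # Fail CI (non-zero exit) when any tracked metric drops below its gate.
    failing = {
        key: mean_scores.get(key, 0.0)
        for key, minimum in thresholds.items()
        if mean_scores.get(key, 0.0) < minimum
    }
    if failing:
        raise SystemExit(f"Quality gate failed: {failing}")


# enforce_gates({"accuracy": 0.92, "latency_ok": 0.90},
#               {"accuracy": 0.90, "latency_ok": 0.95})  # exits non-zero
```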
## Minimal Best Practices
- Keep fast deterministic tests as the largest share.
- Version datasets and keep them stable.
- Track both correctness and latency.
- Add explicit go/no-go thresholds in CI.
- Compare candidate vs baseline before production rollout.
- Investigate failures with trace-level evidence, not only aggregate scores.