testing-llm
# LLM & AI Testing Patterns
Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).
## Quick Reference

| Area | File | Purpose |
|---|---|---|
| Rules | `rules/llm-evaluation.md` | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | `rules/llm-mocking.md` | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | `references/deepeval-ragas-api.md` | Full API reference for DeepEval and RAGAS metrics |
| Reference | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| Reference | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | `references/planner-agent.md` | Explores app and produces Markdown test plans |
| Checklist | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | `examples/llm-test-patterns.md` | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
## When to Use This Skill
- Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
- Validating RAG pipeline output quality
- Setting up deterministic LLM tests in CI
- Building evaluation pipelines with quality gates
- Applying agentic test patterns (plan -> generate -> heal)
## LLM Mock Quick Start

Mock LLM responses for fast, deterministic unit tests:

```python
from unittest.mock import AsyncMock, patch
import pytest


@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```

Key rule: NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.
## DeepEval Quality Quick Start

Validate LLM output quality with multi-dimensional metrics:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```
### 2026 library updates (DeepEval 2.3, RAGAS 1.2)

DeepEval 2.3 introduces self-explaining scores — every metric now emits a `reason` field alongside the numeric score, so a failing CI build gets a human-readable explanation without a second LLM call:
```python
metric = AnswerRelevancyMetric(threshold=0.7, include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
# 0.62 "Response addresses the topic but omits the date asked for."
```
RAGAS 1.2 ships dynamic recalibration — when the grader model drifts (e.g. GPT-5.2 → future Gemini 3.1), RAGAS records the shift and adjusts the threshold so historical scores stay comparable across evals:
```python
from ragas.evaluation import evaluate
from ragas.metrics import faithfulness, context_recall

result = evaluate(
    dataset,
    metrics=[faithfulness, context_recall],
    recalibrate=True,  # 1.2+ — normalizes against the grader baseline
)
```
Bump floors: `deepeval >= 2.3`, `ragas >= 1.2`. Older releases silently drop the `reason`/`recalibrate` kwargs.
## Quality Metrics Thresholds
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |
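
A minimal sketch wiring these thresholds into a single DeepEval gate; the `expected_output` and `context` values are illustrative additions (the contextual and hallucination metrics require them), not part of the quick-start example above:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris.",  # needed by the contextual metrics
    context=["Paris is the capital of France."],  # needed by the hallucination metric
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    HallucinationMetric(threshold=0.3),        # passes while the score stays <= 0.3
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
])
```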
## Structured Output Validation

Always validate LLM output with Pydantic schemas:

```python
import pytest
from pydantic import BaseModel, Field


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


@pytest.mark.asyncio
async def test_structured_output():
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```
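
To cover the schema-violation path as well, a short negative test can reuse the `LLMResponse` model above; the malformed payload here is a hypothetical example:

```python
import pytest
from pydantic import ValidationError


def test_rejects_malformed_llm_output():
    # A confidence outside [0, 1] must fail validation rather than slip through.
    bad_payload = {"answer": "Paris", "confidence": 1.7, "sources": []}
    with pytest.raises(ValidationError):
        LLMResponse.model_validate(bad_payload)
```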
## VCR.py for Integration Tests

Record and replay LLM API calls for deterministic integration tests:

```python
import os

import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        "filter_headers": ["authorization", "x-api-key"],
    }


@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```
## Agentic Test Workflow

The three-agent pattern for end-to-end test automation:

```
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
```

- Planner (`references/planner-agent.md`): Explores your app, produces Markdown test plans from PRDs or natural language requests. Requires `seed.spec.ts` for app context.
- Generator (`references/generator-agent.md`): Converts Markdown specs into Playwright tests. Actively validates selectors against the running app. Uses semantic locators (`getByRole`, `getByLabel`, `getByText`).
- Healer (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.
## Edge Cases to Always Test
For every LLM integration, cover these paths:
- Empty/null inputs -- empty strings, None values
- Long inputs -- truncation behavior near token limits
- Timeouts -- fail-open vs fail-closed behavior
- Schema violations -- invalid structured output
- Prompt injection -- adversarial input resistance
- Unicode -- non-ASCII characters in prompts and responses
See `checklists/llm-test-checklist.md` for the complete checklist.
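
For the timeout case above, a minimal sketch that reuses the `mock_llm` fixture and the hypothetical `synthesize_findings`/`sample_findings` names from the mocking example; the call is wrapped in `asyncio.wait_for` so a hung provider fails closed within a sub-second deadline:

```python
import asyncio
from unittest.mock import patch

import pytest


@pytest.mark.asyncio
async def test_llm_timeout_fails_closed(mock_llm):
    async def never_responds(*args, **kwargs):
        await asyncio.sleep(10)  # simulate a hung provider

    mock_llm.side_effect = never_responds
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        with pytest.raises(asyncio.TimeoutError):
            # Sub-second deadline keeps the suite fast and verifies fail-closed behavior.
            await asyncio.wait_for(synthesize_findings(sample_findings), timeout=0.5)
```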
## Anti-Patterns

| Anti-Pattern | Correct Approach |
|---|---|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only `is not None` | Schema validation + quality metrics |
## Related Skills

- `ork:testing-unit` — Unit testing fundamentals, AAA pattern
- `ork:testing-integration` — Integration testing for AI pipelines
- `ork:golden-dataset` — Evaluation dataset management