# AI Eval in CI (`ai-eval-ci`)
## Overview

Test AI agents and LLM outputs the same way you test code: automated evaluations that run in CI, compare against baselines, and fail the build when quality drops. No dashboards to check manually, just `npx promptfoo@latest eval --ci` and a red or green build.
## When to Use

- Adding quality gates before deploying AI features to production
- Catching prompt regressions when system prompts or models change
- Comparing model performance (GPT-4o vs Claude Sonnet vs local Llama)
- Validating RAG pipeline accuracy against a test dataset
- Benchmarking agent tool-calling accuracy and latency
## Instructions

### Strategy 1: Promptfoo (Config-Driven Evals)

Promptfoo is one of the most popular open-source eval frameworks: define test cases in YAML, run them against multiple providers, and get a comparison matrix.
```yaml
# promptfooconfig.yaml — Eval configuration
# Tests a customer support agent across 3 models with quality assertions
description: "Customer support agent eval"

providers:
  - id: openai:gpt-4o
  - id: anthropic:messages:claude-sonnet-4-20250514
  - id: ollama:llama3.1:8b

prompts:
  - |
    You are a customer support agent for a SaaS product.
    Respond helpfully and accurately. If you don't know, say so.
    Customer message: {{message}}

tests:
  - vars:
      message: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response explains the password reset process clearly"
      - type: not-contains
        value: "I don't know"
      - type: latency
        threshold: 3000 # Must respond within 3 seconds

  - vars:
      message: "Can I get a refund for my annual plan?"
    assert:
      - type: llm-rubric
        value: "Response acknowledges the refund request and explains the policy"
      - type: not-contains
        value: "I'm an AI" # Don't break character

  - vars:
      message: "Your product deleted all my data!"
    assert:
      - type: llm-rubric
        value: "Response shows empathy, takes the issue seriously, and offers next steps"
      - type: sentiment
        threshold: 0.3 # Must not be dismissive

  - vars:
      message: "What's the weather in Tokyo?"
    assert:
      - type: llm-rubric
        value: "Response politely redirects to product-related topics"
      - type: not-contains
        value: "Tokyo" # Should not answer off-topic questions
```
```bash
# Run evals locally
npx promptfoo@latest eval

# Run in CI with threshold — exits non-zero if any test fails
npx promptfoo@latest eval --ci --output results.json

# Compare two prompt versions
npx promptfoo@latest eval --prompts prompt-v1.txt prompt-v2.txt --share
```
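
If several tests share an assertion, such as a global latency budget, promptfoo's `defaultTest` block applies it to every test case so you don't repeat it per test. A minimal sketch, extending the config above:

```yaml
# promptfooconfig.yaml (excerpt) — assertions applied to every test case
defaultTest:
  assert:
    - type: latency
      threshold: 3000 # Shared 3-second budget instead of per-test latency asserts
```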
### Strategy 2: Custom Eval Framework (TypeScript)

Use this when you need full control: custom scoring logic, database-backed test sets, or domain-specific metrics.
```typescript
// eval.ts — Custom AI eval framework with CI integration

/**
 * Runs evaluation suites against AI agents/LLMs.
 * Each eval defines inputs, expected behavior, and scoring criteria.
 * Writes results to eval/results.json and exits with code 1 if any
 * score drops below its threshold.
 */
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

interface EvalCase {
  name: string;
  input: string;
  rubric: string; // What "good" looks like
  threshold: number; // Minimum passing score, 0-1
  metadata?: Record<string, unknown>;
}

interface EvalResult {
  name: string;
  score: number;
  pass: boolean;
  output: string;
  reasoning: string;
  latencyMs: number;
}

const openai = new OpenAI();

/**
 * Score an AI output using LLM-as-judge.
 * Returns a score 0-1 with reasoning.
 */
async function judge(output: string, rubric: string): Promise<{ score: number; reasoning: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Cheap model for judging
    messages: [
      {
        role: "system",
        content: `You are an eval judge. Score the AI output against the rubric.
Return JSON: {"score": 0.0-1.0, "reasoning": "brief explanation"}
Score 1.0 = perfect match. Score 0.0 = complete failure.`,
      },
      {
        role: "user",
        content: `Rubric: ${rubric}\n\nAI Output:\n${output}`,
      },
    ],
    response_format: { type: "json_object" },
    temperature: 0, // Deterministic judging
  });
  return JSON.parse(response.choices[0].message.content!);
}

/**
 * Run a single eval case against your AI agent.
 */
async function runEval(
  agentFn: (input: string) => Promise<string>,
  evalCase: EvalCase
): Promise<EvalResult> {
  const start = Date.now();
  const output = await agentFn(evalCase.input);
  const latencyMs = Date.now() - start;
  const { score, reasoning } = await judge(output, evalCase.rubric);
  return {
    name: evalCase.name,
    score,
    pass: score >= evalCase.threshold,
    output: output.slice(0, 200),
    reasoning,
    latencyMs,
  };
}

/**
 * Run all evals, persist results, and exit with an appropriate code for CI.
 */
async function runSuite(
  agentFn: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<void> {
  console.log(`Running ${cases.length} evals...\n`);
  const results: EvalResult[] = [];
  for (const evalCase of cases) {
    const result = await runEval(agentFn, evalCase);
    results.push(result);
    const icon = result.pass ? "✅" : "❌";
    console.log(
      `${icon} ${result.name}: ${result.score.toFixed(2)} (threshold: ${evalCase.threshold}) [${result.latencyMs}ms]`
    );
    if (!result.pass) {
      console.log(`   Reasoning: ${result.reasoning}`);
    }
  }

  // Summary
  const passed = results.filter((r) => r.pass).length;
  const failed = results.filter((r) => !r.pass).length;
  const avgScore = results.reduce((s, r) => s + r.score, 0) / results.length;
  console.log(`\n📊 Results: ${passed} passed, ${failed} failed (avg score: ${avgScore.toFixed(2)})`);

  // Persist results so CI can post them to the PR (see workflow below)
  writeFileSync("eval/results.json", JSON.stringify(results, null, 2));

  // CI exit code
  if (failed > 0) {
    console.log("\n❌ Eval suite FAILED — quality below threshold");
    process.exit(1);
  } else {
    console.log("\n✅ Eval suite PASSED");
  }
}

export { runSuite };
export type { EvalCase };
```
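
For CI, the workflow in Strategy 3 invokes an `eval/run.ts` entry point. A minimal sketch of what that file could look like; `answerSupportQuestion` is a hypothetical stand-in for your real agent call:

```typescript
// eval/run.ts — hypothetical entry point wiring runSuite to an agent.
import { runSuite, type EvalCase } from "./eval";

// Placeholder agent: replace with your API route, SDK call, or model call.
async function answerSupportQuestion(input: string): Promise<string> {
  return `Stub answer for: ${input}`;
}

const cases: EvalCase[] = [
  {
    name: "password-reset",
    input: "How do I reset my password?",
    rubric: "Explains the password reset process clearly",
    threshold: 0.7,
  },
  {
    name: "off-topic-redirect",
    input: "What's the weather in Tokyo?",
    rubric: "Politely redirects to product-related topics",
    threshold: 0.7,
  },
];

runSuite(answerSupportQuestion, cases).catch((err) => {
  console.error(err);
  process.exit(1);
});
```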
### Strategy 3: GitHub Actions Integration

Run the suite on every pull request that touches prompts, agents, or eval code, and post the scores back to the PR as a comment.
```yaml
# .github/workflows/ai-eval.yml
name: AI Eval

on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/agents/**"
      - "eval/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci

      - name: Run AI evals
        run: npx tsx eval/run.ts --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval/results.json'));
            const body = results.map(r =>
              `${r.pass ? '✅' : '❌'} **${r.name}**: ${r.score.toFixed(2)} (${r.latencyMs}ms)`
            ).join('\n');
            github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: `## AI Eval Results\n\n${body}`
            });
```
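
The suite above fails on absolute thresholds. To also catch relative drops (the baseline comparison mentioned in the overview), a small gate script can diff the current run against a committed snapshot. A sketch, assuming a hypothetical `eval/baseline.json` saved from a known-good run:

```typescript
// eval/check-baseline.ts — hypothetical drift gate comparing the current
// run's average score against a committed baseline snapshot.
import { readFileSync } from "node:fs";

interface EvalResult {
  name: string;
  score: number;
}

const current: EvalResult[] = JSON.parse(readFileSync("eval/results.json", "utf8"));
const baseline: EvalResult[] = JSON.parse(readFileSync("eval/baseline.json", "utf8"));

const avg = (rs: EvalResult[]) => rs.reduce((sum, r) => sum + r.score, 0) / rs.length;
const drop = avg(baseline) - avg(current);

if (drop > 0.05) { // Tolerate 0.05 of judge noise before failing the build
  console.error(`❌ Average score dropped ${drop.toFixed(2)} below baseline`);
  process.exit(1);
}
console.log("✅ Within baseline tolerance");
```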
## Examples

### Example 1: Add quality gates to a RAG chatbot
User prompt: "Set up automated evals for our RAG customer support bot. It should test accuracy on 50 known Q&A pairs and fail the deploy if accuracy drops below 85%."
The agent will:
- Create a test dataset from the 50 known Q&A pairs
- Write a promptfoo config with an llm-rubric assertion for each pair (sketched below)
- Set the pass threshold at 0.85
- Add a GitHub Actions workflow that runs on PRs touching `prompts/` or `src/agents/`
- Post eval results as a PR comment
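
A sketch of the generated config, assuming the Q&A pairs are exported to a hypothetical `qa_pairs.csv` with `message` and `expected` columns (promptfoo can load test cases from CSV files):

```yaml
# promptfooconfig.yaml — hypothetical accuracy gate over a CSV test set
description: "RAG support bot accuracy gate"
providers:
  - id: openai:gpt-4o
prompts:
  - "Answer using the support knowledge base.\n\nQuestion: {{message}}"
tests: file://qa_pairs.csv # 50 known Q&A pairs: message, expected columns
defaultTest:
  assert:
    - type: llm-rubric
      value: "Response is factually consistent with: {{expected}}"
      threshold: 0.85 # Fail the deploy below 85%
```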
### Example 2: Compare models before switching
User prompt: "We're considering switching from GPT-4o to Claude Sonnet. Run our eval suite against both and show me which performs better."
The agent will:
- Configure promptfoo with both providers
- Run the existing eval suite against both models
- Generate a comparison table with per-test scores, latency, and cost
- Recommend a model based on score-to-cost ratio
## Guidelines

- Eval every prompt change — treat prompts like code; test before deploying
- LLM-as-judge is good enough — GPT-4o-mini costs pennies and correlates well with human judgment
- Use temperature 0 for judges — deterministic scoring reduces noise
- Keep test sets diverse — happy path, edge cases, adversarial inputs, off-topic
- Set realistic thresholds — start at 0.7, tighten as the agent improves
- Track scores over time — log results to detect gradual quality drift
- Separate eval cost from production cost — eval uses cheap judge models, production uses the best
- Cache eval results — don't re-run unchanged tests; hash input+prompt for cache keys (see the sketch after this list)
- Run evals on PRs, not just main — catch regressions before merge
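
A minimal sketch of that caching idea, assuming a JSON file cache; `cached` and the cache path are illustrative names:

```typescript
// eval/cache.ts — hypothetical cache keyed on a hash of prompt + input,
// so unchanged test cases skip the agent and judge calls entirely.
import { createHash } from "node:crypto";
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const CACHE_PATH = "eval/.cache.json";

function cacheKey(prompt: string, input: string): string {
  return createHash("sha256").update(prompt).update("\0").update(input).digest("hex");
}

function loadCache(): Record<string, unknown> {
  return existsSync(CACHE_PATH)
    ? JSON.parse(readFileSync(CACHE_PATH, "utf8"))
    : {};
}

/** Return the cached value for (prompt, input), or compute and store it. */
export async function cached<T>(
  prompt: string,
  input: string,
  compute: () => Promise<T>
): Promise<T> {
  const cache = loadCache();
  const key = cacheKey(prompt, input);
  if (key in cache) return cache[key] as T;
  const value = await compute();
  cache[key] = value;
  writeFileSync(CACHE_PATH, JSON.stringify(cache, null, 2));
  return value;
}
```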