aws-bedrock-evals
AWS Bedrock Evaluation Jobs
Overview
Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
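Use pre-computed inference whenever your application involves tool use, agent loops, multi-turn state, or external orchestration. Live inference only covers simple prompt-in/text-out calls, so it cannot reproduce those steps during the eval job.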
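As a rough illustration of the pre-computed mode, the sketch below turns already-collected prompt-response pairs into JSONL lines. The field names (`prompt`, `referenceResponse`, `modelResponses`, `response`, `modelIdentifier`) are an assumption based on the dataset format for bring-your-own-inference jobs, and the helper names `to_eval_record` and `to_jsonl` are hypothetical; check the current Bedrock documentation for the exact schema before relying on it.

```python
import json


def to_eval_record(prompt, response, model_identifier, reference=None):
    """Build one dataset record from a pre-collected prompt-response pair.

    Field names are assumed, not confirmed by this document; verify them
    against the Bedrock evaluation dataset schema.
    """
    record = {
        "prompt": prompt,
        "modelResponses": [
            {"response": response, "modelIdentifier": model_identifier}
        ],
    }
    # Only include a reference answer when one was actually collected.
    if reference is not None:
        record["referenceResponse"] = reference
    return record


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


if __name__ == "__main__":
    # Hypothetical pair captured from an application that used tool calling.
    pairs = [
        to_eval_record(
            prompt="What is the status of order 1234?",
            response="Order 1234 shipped yesterday.",
            model_identifier="my-app-v1",
        )
    ]
    print(to_jsonl(pairs), end="")
```

Because the response is captured after your own orchestration has run, the judge scores the final answer your application actually produced, tool calls included, rather than a fresh single-turn completion.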
Pipeline