Amazon Bedrock Evaluation Jobs

Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
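
To make the judge loop concrete, here is a minimal conceptual sketch in Python. It is not Bedrock's implementation or API: JudgeResult, evaluate, mean_scores, and toy_judge are hypothetical names, and the toy judge stands in for a real evaluator model so the example runs offline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical result shape: one score plus written reasoning per metric and pair.
@dataclass
class JudgeResult:
    metric: str
    score: float
    reasoning: str

# A judge takes (metric instructions, prompt, response) and returns (score, reasoning).
Judge = Callable[[str, str, str], tuple[float, str]]

def evaluate(pairs: list[tuple[str, str]], metrics: dict[str, str], judge: Judge) -> list[JudgeResult]:
    """Score every prompt-response pair once per metric."""
    results = []
    for prompt, response in pairs:
        for metric, instructions in metrics.items():
            score, reasoning = judge(instructions, prompt, response)
            results.append(JudgeResult(metric, score, reasoning))
    return results

def mean_scores(results: list[JudgeResult]) -> dict[str, float]:
    """Average the scores for each metric across all pairs."""
    by_metric: dict[str, list[float]] = {}
    for r in results:
        by_metric.setdefault(r.metric, []).append(r.score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}

# Stand-in judge (assumption): a real job would call an evaluator model here.
def toy_judge(instructions: str, prompt: str, response: str) -> tuple[float, str]:
    if response.strip():
        return 1.0, "Response is non-empty."
    return 0.0, "Response is empty."

pairs = [("What is 2+2?", "4"), ("Name a color.", "")]
metrics = {"completeness": "Score 1 if the response answers the prompt, else 0."}
results = evaluate(pairs, metrics, toy_judge)
summary = mean_scores(results)
```

The point of the sketch is the shape of the data: each pair yields one score and one explanation per metric, and per-metric averages summarize the run.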

Pre-computed Inference vs Live Inference

Mode	How it works	Use when
Live Inference	Bedrock generates responses during the eval job	Simple prompt-in/text-out, no tool calling
Pre-computed Inference	You pre-collect responses and supply them in a JSONL dataset	Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock

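Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration. In these cases the evaluation job cannot reproduce your application's behavior from a prompt alone, so you capture the real responses yourself and have the judge score those.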
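
As a rough sketch of what preparing a pre-computed dataset can look like, the Python below writes one JSON object per line. The field names (prompt, referenceResponse, category, modelResponses, response, modelIdentifier) reflect my understanding of the Bedrock prompt dataset format for supplied responses and should be checked against the current AWS documentation. The helper names to_record and to_jsonl and the identifier "my-agent-v1" are hypothetical.

```python
import json
from typing import Optional

def to_record(prompt: str, response: str, model_identifier: str,
              reference: Optional[str] = None,
              category: Optional[str] = None) -> dict:
    """Build one dataset record from a response your application already produced.

    Field names are assumptions based on the Bedrock dataset format; verify
    them against the current documentation before submitting a job.
    """
    record = {
        "prompt": prompt,
        "modelResponses": [
            {"response": response, "modelIdentifier": model_identifier}
        ],
    }
    # Optional fields are included only when a value is supplied.
    if reference is not None:
        record["referenceResponse"] = reference
    if category is not None:
        record["category"] = category
    return record

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)

# Example: final answers collected from a hypothetical tool-calling agent.
collected = [
    to_record("What is the weather in Paris?",
              "It is 18 degrees and cloudy in Paris.",
              "my-agent-v1", category="weather"),
    to_record("Convert 10 km to miles.",
              "10 km is about 6.21 miles.",
              "my-agent-v1", reference="6.21 miles"),
]
jsonl_text = to_jsonl(collected)

# Write the dataset to disk before uploading it for the evaluation job.
with open("precomputed_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(jsonl_text)
```

Each line carries the prompt together with the response your own pipeline generated, so the evaluation job only needs to run the judge over the recorded pairs.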

Pipeline
