Netra Evaluation Setup

Use this skill to build reliable evaluation pipelines in Netra that catch regressions and measure quality over time.

When To Use

You need repeatable quality checks for prompts, models, or agent logic.
You want both subjective and deterministic scoring.
You need a baseline before deploying AI changes.

Evaluation Design Framework

Define quality dimensions.
Build or import a dataset.
Select evaluator types per dimension.
Map each evaluator variable to the dataset field, agent response, or metadata value it should read.
Run test suites and inspect failures.
Iterate on the prompt, policy, or tool logic.

Choosing Evaluator Types

Use LLM-as-Judge for subjective criteria:
- correctness, relevance, helpfulness, hallucination checks, safety.
Use Code Evaluators for deterministic criteria:
- JSON schema validity, regex formats, strict business rules.
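
As an illustration of the deterministic side, here is a minimal sketch of a code-style check written in plain Python. It does not use any Netra API; the function name, the required keys, the allowed statuses, and the order ID format are all hypothetical stand-ins for your own business rules, and the shape of the returned verdict is an assumption rather than Netra's evaluator contract.

```python
import json
import re

# Hypothetical business rules, for illustration only.
ORDER_ID_PATTERN = re.compile(r"ORD-\d{6}")
REQUIRED_KEYS = {"order_id", "status"}
ALLOWED_STATUSES = {"shipped", "pending", "cancelled"}


def evaluate_output(output: str) -> dict:
    """Return a pass/fail verdict with a reason for one agent response."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"invalid JSON: {exc.msg}"}

    if not isinstance(payload, dict):
        return {"passed": False, "reason": "top-level value is not an object"}

    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        return {"passed": False, "reason": f"missing keys: {sorted(missing)}"}

    order_id = payload["order_id"]
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        return {"passed": False, "reason": "order_id has the wrong format"}

    status = payload["status"]
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        return {"passed": False, "reason": "status is not an allowed value"}

    return {"passed": True, "reason": "ok"}
```

Returning a reason alongside the verdict makes failures easier to triage than a bare boolean would.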

Procedure

Create a dataset from production traces when possible.
Add edge cases and negative tests manually.
Select evaluators from Library or My Evaluators.
Configure pass thresholds and scoring output.
Map evaluator variables to:
- dataset fields (input, expected_output),
- agent response,
- execution metadata (latency/tokens/model).
Test each evaluator in Playground before saving.
Run the evaluation test suite via the SDK and review the results in Test Runs.
Track pass rate, latency, and cost across versions.
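
The last step, tracking metrics across versions, can be sketched independently of Netra. The example below assumes you have already exported per-row results into a simple record; `RowResult`, `summarize`, and `compare` are hypothetical names, not part of the Netra SDK, and the fields are assumptions about what you choose to record.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RowResult:
    """One evaluated row (hypothetical shape, not a Netra type)."""
    passed: bool
    latency_ms: float
    cost_usd: float


def summarize(rows: list[RowResult]) -> dict:
    """Aggregate pass rate, mean latency, and total cost for one version."""
    if not rows:
        raise ValueError("no rows to summarize")
    return {
        "pass_rate": sum(r.passed for r in rows) / len(rows),
        "mean_latency_ms": mean(r.latency_ms for r in rows),
        "total_cost_usd": sum(r.cost_usd for r in rows),
    }


def compare(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric in the baseline."""
    return {key: candidate[key] - baseline[key] for key in baseline}
```

A positive `pass_rate` delta with a flat or negative cost delta is the result you are hoping for; a higher pass rate bought with much higher latency or cost deserves a closer look before deploying.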

Python Evaluation Run Pattern

from netra import Netra

Netra.init(app_name="my-app")

def task_fn(input_data):
    # Implement your app logic and return the generated output string.
    # my_agent_response is a placeholder for your own agent call; define or import it first.
    return my_agent_response(input_data)


dataset = Netra.evaluation.get_dataset(dataset_id="YOUR_DATASET_ID")
result = Netra.evaluation.run_test_suite(
    name="baseline-eval",
    data=dataset,
    task=task_fn,
)

print(result)

Quality Checklist

Dataset includes realistic production examples.
Dataset includes edge cases and refusal scenarios.
Evaluator prompt scales are explicit (for example, 0-10 with a clear rubric).
Thresholds were calibrated on representative samples.
Variable mappings are validated on at least 5 sample rows.
Failures are triaged as model issue, prompt issue, tool issue, or evaluator issue.
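
One way to calibrate a threshold on representative samples is sketched below. It assumes you have a small set of rows where a human has already judged pass or fail and the evaluator has produced a numeric score; `calibrate_threshold` is a hypothetical helper, not a Netra function, and maximizing raw agreement is only one of several reasonable criteria.

```python
def calibrate_threshold(samples: list[tuple[float, bool]]) -> float:
    """Pick the cutoff whose verdicts agree most often with human labels.

    Each sample is (evaluator_score, human_says_pass). A row passes when
    its score is greater than or equal to the threshold.
    """
    if not samples:
        raise ValueError("need at least one labeled sample")

    # Only observed scores are tried as candidate thresholds.
    candidates = sorted({score for score, _ in samples})

    def agreement(threshold: float) -> int:
        return sum((score >= threshold) == label for score, label in samples)

    # max() keeps the first best candidate, so ties resolve to the lowest threshold.
    return max(candidates, key=agreement)
```

Because only observed scores are tried, the sketch cannot return a cutoff above every sample; if the humans failed every row, treat that as a sign the sample set needs more passing examples.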

Common Pitfalls

Over-relying on a single evaluator.
Vague LLM-as-Judge prompts with ambiguous scoring.
Incorrect variable mapping causing false failures.
Treating pass/fail as enough without reviewing traces.
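
To guard against the variable-mapping pitfall, a quick pre-flight check over a handful of rows can catch empty or misspelled fields before a full run. This is a minimal sketch in plain Python: `validate_mapping` and the variable names in the usage example are hypothetical, the rows are assumed to be plain dictionaries, and only the field names `input` and `expected_output` come from the procedure above.

```python
def validate_mapping(
    mapping: dict[str, str],
    rows: list[dict],
    sample_size: int = 5,
) -> list[str]:
    """Report evaluator variables that point at missing or empty fields.

    mapping: evaluator variable name -> dataset field name.
    rows: dataset rows as dictionaries; only the first sample_size are checked.
    """
    problems = []
    for index, row in enumerate(rows[:sample_size]):
        for variable, field in mapping.items():
            value = row.get(field)
            if value is None or value == "":
                problems.append(
                    f"row {index}: variable '{variable}' maps to "
                    f"empty or missing field '{field}'"
                )
    return problems


# Usage: the second row misspells the expected_output field.
mapping = {"question": "input", "reference": "expected_output"}
rows = [
    {"input": "hi", "expected_output": "hello"},
    {"input": "bye", "expected": "goodbye"},
]
for problem in validate_mapping(mapping, rows):
    print(problem)
```

An empty result does not prove the mapping is semantically right, only that every variable resolves to some value, so it complements rather than replaces a manual look at the sampled rows.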
