# Netra Evaluation Setup
Use this skill to build reliable evaluation pipelines in Netra that catch regressions and measure quality over time.
## When To Use
- You need repeatable quality checks for prompts, models, or agent logic.
- You want both subjective and deterministic scoring.
- You need a baseline before deploying AI changes.
## Evaluation Design Framework
- Define quality dimensions.
- Build or import a dataset.
- Select evaluator types per dimension.
- Map variables carefully.
- Run test suites and inspect failures.
- Iterate prompt, policy, or tool logic.
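The dimension-selection step above can be sketched as a plain-Python plan object; the dimension names, evaluator labels, and thresholds below are illustrative examples, not a Netra schema.

```python
# Illustrative evaluation plan: each quality dimension is paired with an
# evaluator type and a pass threshold. All names here are examples.
EVAL_PLAN = {
    "correctness":   {"evaluator": "llm_judge", "threshold": 7},     # 0-10 rubric
    "json_validity": {"evaluator": "code",      "threshold": 1},     # binary
    "latency_ms":    {"evaluator": "code",      "threshold": 2000},  # hard limit
}

def dimensions_by_type(plan, evaluator_type):
    """Return the dimension names handled by a given evaluator type."""
    return [name for name, cfg in plan.items()
            if cfg["evaluator"] == evaluator_type]
```

Keeping the plan as data (rather than scattering thresholds through code) makes it easy to diff between evaluation versions.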
## Choosing Evaluator Types
- Use LLM-as-Judge for subjective criteria:
  - correctness, relevance, helpfulness, hallucination checks, safety.
- Use Code Evaluators for deterministic criteria:
  - JSON schema validity, regex formats, strict business rules.
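A minimal sketch of both evaluator styles, using only the standard library. The rubric text, the required JSON keys, and the `ORD-` order-id format are hypothetical placeholders for your own criteria.

```python
import json
import re

# Hypothetical LLM-as-Judge rubric: an explicit 0-10 scale with anchored
# bands reduces scoring ambiguity.
JUDGE_PROMPT = """Rate the response for correctness on a 0-10 scale:
0-3: factually wrong or contradicts the expected output.
4-6: partially correct, with omissions or minor errors.
7-10: correct and consistent with the expected output.
Return only the integer score.
Question: {input}
Expected: {expected_output}
Response: {response}"""

def evaluate_json_output(output, required_keys=("answer", "sources")):
    """Code evaluator: output must parse as a JSON object with every
    required key. Key names are illustrative."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)

def evaluate_order_id_format(output):
    """Code evaluator: strict business-rule regex (hypothetical format:
    'ORD-' followed by exactly six digits)."""
    return re.fullmatch(r"ORD-\d{6}", output.strip()) is not None
```

Code evaluators like these return unambiguous booleans, so they are cheap to run on every row and never drift the way a judge prompt can.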
## Procedure
- Create a dataset from production traces when possible.
- Add edge cases and negative tests manually.
- Select evaluators from Library or My Evaluators.
- Configure pass thresholds and scoring output.
- Map evaluator variables to:
  - dataset fields (`input`, `expected_output`),
  - agent response,
  - execution metadata (latency/tokens/model).
- Test each evaluator in Playground before saving.
- Run evaluation test suite via SDK and review Test Runs.
- Track pass rate, latency, and cost across versions.
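The variable-mapping step above can be smoke-tested on a handful of rows before a full run. The field names below come from the mapping list; the row-dict shape is an assumption about your dataset export, not a Netra contract.

```python
def validate_mapping(rows, required_fields=("input", "expected_output")):
    """Check the first few dataset rows for the fields an evaluator expects.
    Returns a list of (row_index, missing_field) problems; an empty list
    means the sampled rows map cleanly."""
    problems = []
    for i, row in enumerate(rows[:5]):  # checklist below suggests >= 5 rows
        for field in required_fields:
            if field not in row or row[field] in (None, ""):
                problems.append((i, field))
    return problems
```

Running this before the suite catches the "incorrect variable mapping" failure mode early, when it is a one-line fix rather than a misleading batch of failed test runs.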
## Python Evaluation Run Pattern

```python
from netra import Netra

Netra.init(app_name="my-app")

def task_fn(input_data):
    # Implement your app logic and return the generated output string.
    return my_agent_response(input_data)

dataset = Netra.evaluation.get_dataset(dataset_id="YOUR_DATASET_ID")
result = Netra.evaluation.run_test_suite(
    name="baseline-eval",
    data=dataset,
    task=task_fn,
)
print(result)
```
## Quality Checklist
- Dataset includes realistic production examples.
- Dataset includes edge cases and refusal scenarios.
- Evaluator prompt scales are explicit (for example, 0-10 with a clear rubric).
- Thresholds were calibrated on representative samples.
- Variable mappings are validated on at least 5 sample rows.
- Failures are triaged as model issue, prompt issue, tool issue, or evaluator issue.
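Threshold calibration from the checklist can be done empirically: score a representative sample with the judge, label the same sample by hand, and pick the cutoff that best agrees with the human labels. A minimal sketch, assuming 0-10 judge scores and boolean human labels:

```python
def calibrate_threshold(scores, human_pass, candidates=range(0, 11)):
    """Pick the judge-score cutoff that maximizes agreement with human
    pass/fail labels. scores: list of 0-10 judge scores; human_pass: a
    parallel list of booleans from manual review."""
    def agreement(threshold):
        return sum((s >= threshold) == h for s, h in zip(scores, human_pass))
    return max(candidates, key=agreement)
```

A threshold chosen this way is grounded in your own data rather than a round number, and re-running the calibration after a judge-prompt change shows whether the scale drifted.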
## Common Pitfalls
- Over-relying on a single evaluator.
- Vague LLM-as-Judge prompts with ambiguous scoring.
- Incorrect variable mapping causing false failures.
- Treating pass/fail as enough without reviewing traces.
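One way to avoid over-relying on a single evaluator is to combine verdicts: treat deterministic checks as hard gates and let subjective checks pass by majority. The evaluator names below are illustrative, not a fixed scheme.

```python
def aggregate_verdict(evaluator_results, required=("json_valid",)):
    """Combine several evaluator verdicts into one pass/fail.
    Hard-required checks (e.g. schema validity) must all pass; the
    remaining, more subjective checks pass by simple majority."""
    if not all(evaluator_results[name] for name in required):
        return False
    optional = [v for name, v in evaluator_results.items()
                if name not in required]
    return sum(optional) > len(optional) / 2 if optional else True
```

Even with an aggregate verdict, failed rows still deserve a trace review: the aggregate tells you *that* a row failed, while the per-evaluator results and trace tell you *why*.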