skills/ajrlewis/ai-skills/addon-llm-judge-evals

addon-llm-judge-evals

SKILL.md

Add-on: LLM Judge Evals

Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.

Compatibility

  • Works with all stacks.
  • Best paired with addon-deterministic-eval-suite.

Inputs

Collect:

  • JUDGE_BACKEND: auto | langchain | google-adk (default auto).
  • JUDGE_MODEL: model id to run scoring.
  • JUDGE_TIMEOUT_SECONDS: default 60.
  • JUDGE_MAX_RETRIES: default 2.
  • JUDGE_TEMPERATURE: default 0.
  • JUDGE_FAIL_ON_BACKEND_MISMATCH: yes | no (default yes).
  • JUDGE_RUBRIC_MODE: product | security | developer-experience | custom.
  • PASS_THRESHOLD: default 0.75.
  • BLOCK_ON_JUDGE_FAIL: yes | no (default no).

Integration Workflow

  1. Add judge artifacts:
config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md
  • Copy and adapt this skill's bundled starter script:
  • scripts/run_llm_judge.py
  • Place the adapted result in the target project at scripts/evals/run_llm_judge.py.
  1. Define rubric:
  • scoring categories and weights
  • failure reasons template
  • required evidence links (files/lines/commands)
  1. Execute judge run:
  • evaluate generated files against rubric per scenario
  • resolve backend from config/skill_manifest.json plus judge inputs
  • use a single adapter boundary for backend-specific scoring
  • store structured JSON + markdown summary
  • replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative
  1. Merge policy:
  • default advisory (BLOCK_ON_JUDGE_FAIL=no)
  • blocking only when explicitly configured

Backend Resolution Contract

  • scripts/evals/run_llm_judge.py must read config/skill_manifest.json as the source of truth for selected skills and declared judge capabilities.
  • The manifest should include:
{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}
  • Resolution order:
    • If JUDGE_BACKEND != auto, use the requested backend only if the matching addon is present in the manifest.
    • If JUDGE_BACKEND=auto and only addon-langchain-llm is present, use langchain.
    • If JUDGE_BACKEND=auto and only addon-google-agent-dev-kit is present, use google-adk.
    • If both addons are present, fail and require explicit JUDGE_BACKEND.
    • If neither addon is present, fail with an explicit unsupported configuration error.
  • Model resolution:
    • JUDGE_MODEL wins when set.
    • For langchain, fall back to DEFAULT_MODEL.
    • For google-adk, fall back to ADK_DEFAULT_MODEL.
  • The judge runner should expose a stable adapter interface (for example JudgeBackend.score(prompt)) so rubric logic, thresholding, and report generation stay backend-agnostic.

Required Template

evals/judge/rubric.md

# Judge Rubric

- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)

Pass threshold: 0.75

Guardrails

  • Documentation contract for generated code:

    • Python: write module docstrings and docstrings for public classes, methods, and functions.
    • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
    • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
    • Apply this contract even when using template snippets below; expand templates as needed.
  • Never replace deterministic gates with judge scores.

  • Keep prompts/rubrics versioned in repo for auditability.

  • Record model/version and timestamp for each run.

  • Surface uncertainty as explicit notes, not silent pass.

  • Do not infer judge backend from incidental files or imports; use the manifest and explicit inputs.

  • If multiple LLM-capable addons are installed, do not guess. Require an explicit JUDGE_BACKEND.

Validation Checklist

  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.
test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true

Decision Justification Rule

  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.
Weekly Installs
7
First Seen
Mar 2, 2026
Installed on
openclaw7
gemini-cli7
github-copilot7
codex7
kimi-cli7
cursor7