Data Refresh & Eval Skill
Workflow for keeping the eval dataset fresh and running quality checks on agent responses.
Quick Start
cd ~/Code/skillrecordings/support/packages/cli
# Refresh dataset from Front (last 30 days, 200 responses max)
# `date -d` is GNU syntax; on macOS/BSD use "$(date -v-30d +%Y-%m-%d)" instead
bun src/index.ts dataset build --since "$(date -d '30 days ago' +%Y-%m-%d)" --limit 200 --output data/eval-dataset.json
# Run routing eval
bun src/index.ts eval routing data/eval-dataset.json
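The `--since` value above is computed with `date`, whose relative-date flags differ between GNU (Linux) and BSD (macOS). A minimal sketch of a portable helper follows; `days_ago` is a hypothetical name and not part of the CLI.

```shell
# days_ago N: print the date N days before today as YYYY-MM-DD.
# Hypothetical helper, not part of the support CLI.
days_ago() {
  # Reject anything that is not a non-negative integer.
  case "$1" in
    ''|*[!0-9]*)
      echo "days_ago: expected a non-negative integer" >&2
      return 2
      ;;
  esac
  # Try GNU date first; fall back to the BSD/macOS -v adjustment flag.
  date -d "$1 days ago" +%Y-%m-%d 2>/dev/null || date "-v-$1d" +%Y-%m-%d
}
```

With the helper defined in the current shell, the date argument can be written as `--since "$(days_ago 30)"` on either platform.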
Dataset Commands
Build fresh dataset
# Recent data (recommended for ongoing work)
bun src/index.ts dataset build --since 2025-01-01 --limit 200 --output data/eval-dataset.json
# App-specific
bun src/index.ts dataset build --app total-typescript --limit 100 --output data/tt-dataset.json
# Include conversation history for context
bun src/index.ts dataset build --since 2025-01-01 --include-history --output data/dataset-with-history.json
# Only labeled responses (good/bad)
bun src/index.ts dataset build --labeled-only --output data/labeled-only.json
Convert to evalite format
bun src/index.ts dataset to-evalite -i data/eval-dataset.json -o data/evalite-format.json
Running Evals
Routing eval (default thresholds)
bun src/index.ts eval routing data/eval-dataset.json
Custom thresholds
bun src/index.ts eval routing data/eval-dataset.json \
--min-precision 0.95 \
--min-recall 0.98 \
--max-fp-rate 0.02 \
--max-fn-rate 0.01
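For reference, the four thresholds correspond to standard confusion-matrix metrics. The sketch below computes them from raw counts using the textbook definitions. How the CLI itself computes these values is not shown in this document, so treat the formulas as an assumption; `routing_metrics` is a hypothetical helper.

```shell
# routing_metrics TP FP TN FN: print textbook precision, recall,
# false-positive rate, and false-negative rate for the given counts.
# Hypothetical helper; the CLI's own definitions may differ.
routing_metrics() {
  awk -v tp="$1" -v fp="$2" -v tn="$3" -v fn="$4" 'BEGIN {
    # Guard every ratio against a zero denominator.
    precision = (tp + fp) ? tp / (tp + fp) : 0
    recall    = (tp + fn) ? tp / (tp + fn) : 0
    fp_rate   = (fp + tn) ? fp / (fp + tn) : 0
    fn_rate   = (fn + tp) ? fn / (fn + tp) : 0
    printf "precision=%.4f recall=%.4f fp_rate=%.4f fn_rate=%.4f\n",
      precision, recall, fp_rate, fn_rate
  }'
}
```

For example, `routing_metrics 90 10 890 10` prints `precision=0.9000 recall=0.9000 fp_rate=0.0111 fn_rate=0.1000`, which would fail the strict thresholds shown above.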
JSON output for CI/automation
bun src/index.ts eval routing data/eval-dataset.json --json
Response Analysis
Find bad responses for debugging
# List responses rated "bad"
bun src/index.ts responses list --rating bad
# Get details with conversation context
bun src/index.ts responses get <actionId> --context
# Export bad responses for analysis
bun src/index.ts responses export --rating bad -o bad-responses.json
Analyze unrated responses
bun src/index.ts responses list --rating unrated --limit 50
Recommended Workflow
Daily data refresh
cd ~/Code/skillrecordings/support/packages/cli
# 1. Pull fresh data (`date -d` is GNU syntax; on macOS/BSD use "$(date -v-7d +%Y-%m-%d)")
bun src/index.ts dataset build --since "$(date -d '7 days ago' +%Y-%m-%d)" --limit 100 --output data/eval-dataset.json
# 2. Check dataset size (number of eval points)
jq 'length' data/eval-dataset.json
# 3. Run eval
bun src/index.ts eval routing data/eval-dataset.json
# 4. Check for failures
bun src/index.ts responses list --rating bad --limit 10
Pre-deploy validation
# 1. Build comprehensive dataset
bun src/index.ts dataset build --since 2025-01-01 --limit 500 --output data/full-dataset.json
# 2. Run eval with strict thresholds
bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json
# 3. Check exit code
echo "Exit code: $?"
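In a CI script, the exit code is usually acted on rather than just printed. The sketch below wraps any command and propagates its status. It assumes the eval command exits non-zero when a threshold is missed, which the exit-code check above implies but this document does not state outright. `run_gate` is a hypothetical name.

```shell
# run_gate CMD [ARGS...]: run a command, report pass/fail, and
# return the command's own exit status so CI can stop on failure.
# Hypothetical helper, not part of the support CLI.
run_gate() {
  local status
  if "$@"; then
    echo "eval passed"
  else
    status=$?
    echo "eval failed (exit code ${status})" >&2
    return "$status"
  fi
}
```

A deploy script could then call `run_gate bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json` and abort when it returns non-zero.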
Dataset Schema
Each eval point includes:
id - Action ID
app - App slug (total-typescript, aihero, etc.)
conversationId - Front conversation ID
customerEmail - Customer email (if available)
triggerMessage - The inbound message that triggered the response
  subject, body, timestamp
agentResponse - The agent's drafted response
  text, category, timestamp
label - "good" | "bad" | undefined
labeledBy - Who approved/rejected
conversationHistory - (optional) Full message history
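Since `label` may be absent, it can help to see how many points in a dataset are labeled before running an eval. The sketch below tallies labels with `jq`. It assumes the dataset file is a top-level JSON array of eval points with `label` as a top-level field, as the schema above suggests. `label_counts` is a hypothetical helper.

```shell
# label_counts FILE: print one "label=count" line per label value.
# Points without a label are counted as "unlabeled".
# Hypothetical helper; assumes FILE holds a JSON array of eval points.
label_counts() {
  jq -r '
    map(.label // "unlabeled")     # substitute a marker for missing labels
    | group_by(.)                  # group identical labels (sorted)
    | map("\(.[0])=\(length)")     # format each group as label=count
    | .[]
  ' "$1"
}
```

Running `label_counts data/eval-dataset.json` prints lines such as `good=…`, `bad=…`, and `unlabeled=…`, depending on what the file contains.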
Environment
Required in .env.local:
FRONT_API_TOKEN= # Front API access
DATABASE_URL= # Database connection
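When the variables are supplied through the shell environment (for example in CI) rather than a `.env.local` file, a quick pre-flight check can catch a missing value before a long dataset build starts. This is a minimal sketch. `require_env` is a hypothetical helper, and it only sees exported variables, not values that the runtime loads from `.env.local` on its own.

```shell
# require_env NAME...: succeed only if every named variable is
# exported and non-empty; list the offenders on stderr otherwise.
# Hypothetical helper, not part of the support CLI.
require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv only reports variables present in the exported environment.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or empty: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the variables listed above, the check would be `require_env FRONT_API_TOKEN DATABASE_URL`.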
Troubleshooting
"FRONT_API_TOKEN environment variable required"
# Export every variable the file defines so the CLI process can see it
set -a; source apps/front/.env.local; set +a
# or set in .env.local at repo root
Dataset building slowly
The Front API is rate-limited, so large builds take a while. Use --limit to cap the number of responses fetched per build.
No labeled data
Labels come from human-in-the-loop (HITL) approvals and rejections. New responses start unlabeled.
Repository: skillrecordings/support
First Seen: Feb 28, 2026