Skill Tester & Analyzer

A meta-skill for deeply testing and auditing other Claude skills. It instruments test runs to capture raw API call traces, records all script stdin/stdout/stderr with timing, and runs deterministic security scans followed by dedicated security and code review subagents against any scripts embedded in the skill.


Session Directory Layout

<report_root>/<skill_name>_<YYYYMMDD_HHMMSS>/
├── manifest.json          # Validation results and session metadata (created by setup_test_env.py)
├── sandbox/               # Isolated workspace for script execution
├── inventory.json         # Skill structure scan
├── scan_results.json      # Deterministic security findings (B9 — runs first)
├── prompt_lint.json       # Deterministic prompt quality findings (B11 — runs first)
├── prompt_review.json     # AI prompt quality analysis (receives prompt_lint as input)
├── api_log.jsonl          # All Claude API calls (one JSON object per line)
├── script_runs.jsonl      # All script executions with I/O
├── security_report.json   # AI security analysis (receives scan_results as input)
├── code_review.json       # Code quality review
├── session_report.html    # Claude Code session trace (API calls, tool use, conversation)
└── report.html            # Unified interactive HTML report
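
As a rough illustration of the layout above, the sketch below builds a session directory name in the `<skill_name>_<YYYYMMDD_HHMMSS>` form and reads a `.jsonl` file such as `api_log.jsonl` or `script_runs.jsonl`. The helper names `session_dir_name` and `read_jsonl` are hypothetical and not part of the skill; the fields inside each logged record are not documented here, so the reader simply returns plain dictionaries.

```python
import json
from datetime import datetime
from pathlib import Path


def session_dir_name(skill_name: str, when: datetime) -> str:
    """Build '<skill_name>_<YYYYMMDD_HHMMSS>' (hypothetical helper)."""
    return f"{skill_name}_{when.strftime('%Y%m%d_%H%M%S')}"


def read_jsonl(path: Path) -> list:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Reading line by line keeps memory use flat on long traces and lets a partially written log from an interrupted session still be parsed up to its last complete line.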

Modes

| Mode | Description | Phases Run | Command |
|------|-------------|------------|---------|
| Full (default) | Complete analysis: scan → prompt-lint → test → security → review → report | All (2-9) | /st:run |
| Audit | Static analysis only, no test execution | 2-4, 6-7, 9 | /st:audit |
| Trace | Runtime capture only, no security/code review | 2, 5, 8, 9 | /st:trace |
| Report | Re-generate HTML from existing session data | 9 only | /st:report |
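
The mode-to-phase mapping can be written down as a small lookup table. This is only a sketch of the table's contents: `MODE_PHASES` and `phases_for` are hypothetical names, and how the skill actually dispatches phases is not described in this document.

```python
# Phase numbers per mode, transcribed from the Modes table.
MODE_PHASES = {
    "full": [2, 3, 4, 5, 6, 7, 8, 9],
    "audit": [2, 3, 4, 6, 7, 9],
    "trace": [2, 5, 8, 9],
    "report": [9],
}


def phases_for(mode: str = "full") -> list:
    """Return the phases a mode runs; 'full' is the documented default."""
    try:
        return MODE_PHASES[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```

Taken together, the Audit and Trace phase lists cover exactly the phases of Full, and every mode ends with phase 9.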

Commands

| Command | Mode | Phases | Purpose |
|---------|------|--------|---------|
| /st:init | All | 1 | Set up session: target, mode, prompts, report location |
| /st:run | Full | 2-9 | Execute all analysis phases |
| /st:audit | Audit | 2-4, 6-7, 9 | Static analysis only |
| /st:trace | Trace | 2, 5, 8, 9 | Runtime capture only |
| /st:report | Report | 9 | Regenerate HTML from session data |
| /st:status | | N/A | Show session state |
| /st:resume | Any | Variable | Resume interrupted session |




Interpreting Results

Security Severity Levels

| Level | Meaning | Action |
|-------|---------|--------|
| CRITICAL | Active exploit risk (e.g., shell injection, RCE, hardcoded production key) | Block: do not use the skill; fix immediately |
| HIGH | Likely data exposure or privilege escalation | Fix before production |
| MEDIUM | Defense-in-depth gap; not immediately exploitable | Fix in next iteration |
| LOW | Style/practice issue with minor security implications | Note in report |
| INFO | Observation, no risk | Informational only |
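
One way to act on these levels is to take the most severe finding in a report and look up its action. The sketch below assumes the findings have already been reduced to a list of level strings; the actual schema of `security_report.json` is not documented here, and `Severity`, `ACTIONS`, and `recommended_action` are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table, ordered least to most severe."""
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Actions transcribed from the severity table.
ACTIONS = {
    Severity.CRITICAL: "Block: do not use the skill; fix immediately",
    Severity.HIGH: "Fix before production",
    Severity.MEDIUM: "Fix in next iteration",
    Severity.LOW: "Note in report",
    Severity.INFO: "Informational only",
}


def recommended_action(levels: list) -> str:
    """Return the action for the most severe level in `levels`."""
    if not levels:
        return "No findings"  # assumption: an empty report needs no action
    worst = max(Severity[name.upper()] for name in levels)
    return ACTIONS[worst]
```

For example, `recommended_action(["LOW", "HIGH", "INFO"])` returns the HIGH action, since a single severe finding outweighs any number of minor ones.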

Code Quality Score (0–10)

| Range | Interpretation |
|-------|----------------|
| 9–10 | Production-ready |
| 7–8 | Minor improvements needed |
| 5–6 | Significant gaps: refactoring advised |
| < 5 | Major issues: rework required |
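
The score bands can be expressed as a simple threshold check. The table lists whole-number ranges only, so the handling of fractional scores below (an 8.5 falls into the 7–8 band) is an assumption, and `interpret_score` is a hypothetical helper rather than part of the skill.

```python
def interpret_score(score: float) -> str:
    """Map a 0-10 code quality score to its interpretation band."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be within 0-10, got {score}")
    if score >= 9:
        return "Production-ready"
    if score >= 7:
        return "Minor improvements needed"
    if score >= 5:
        return "Significant gaps: refactoring advised"
    return "Major issues: rework required"
```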