model-evaluation-benchmark
Pass
Audited by Gen Agent Trust Hub on Mar 6, 2026
Risk Level: SAFECOMMAND_EXECUTIONPROMPT_INJECTIONDATA_EXFILTRATION
Full Analysis
- [COMMAND_EXECUTION]: The skill executes local Python scripts (run_benchmarks.py), GitHub CLI commands (gh pr close, gh issue close), and Git commands (git worktree remove) to manage benchmark workflows and cleanup.- [PROMPT_INJECTION]: The skill possesses an indirect prompt injection surface by ingesting and processing untrusted data.
- Ingestion points: Benchmark task definitions in BENCHMARK_TASKS.md and execution results in result.json.
- Boundary markers: No delimiters or instructions to ignore embedded commands are present in the prompt templates.
- Capability inventory: The agent can execute shell commands, interact with GitHub repositories, and spawn subagents.
- Sanitization: There is no evidence of validation or sanitization of ingested content before it is passed to the reviewer subagent.- [DATA_EXFILTRATION]: The skill accesses a hidden directory within the user's home folder (~/.amplihack/.claude/runtime/benchmarks/suite_v3/) to retrieve benchmark results. While this appears to be the tool's intended runtime path, accessing locations outside the project workspace increases potential data exposure risks.
Audit Metadata