ln-840-benchmark-compare

Warn

Audited by Gen Agent Trust Hub on Apr 12, 2026

Risk Level: MEDIUMCOMMAND_EXECUTIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill executes a complex bash orchestration script (run-benchmark.sh) and several Node.js scripts to manage isolated benchmarking environments, including filesystem operations and git worktree management.
  • [COMMAND_EXECUTION]: The benchmark runner invokes the claude CLI with the --dangerously-skip-permissions flag. This configuration bypasses the platform's security controls that normally require human authorization for sensitive operations like file writes or command execution, enabling the benchmarked agent to act autonomously.
  • [PROMPT_INJECTION]: The skill presents an indirect prompt injection surface by reading scenario data from references/goals.md and interpolating it into automated agent prompts without sanitization. Evidence: (1) Ingestion point: references/goals.md (2) Boundary markers: Absent (3) Capability inventory: Full shell and filesystem access via the claude CLI (4) Sanitization: Absent.
Audit Metadata
Risk Level
MEDIUM
Analyzed
Apr 12, 2026, 04:40 PM