The Agent Skills Directory

[COMMAND_EXECUTION]: The skill uses a verification script (run_checks.py) that executes shell commands defined in task files using subprocess.run with shell=True, which can be exploited if task definitions are malicious.
[COMMAND_EXECUTION]: The skill mandates the use of the --dangerously-skip-permissions flag when invoking the claude CLI, which bypasses the standard requirement for human approval of agent tool calls.
[COMMAND_EXECUTION]: The execution logic unsets environment variables CLAUDECODE and CLAUDE_CODE_ENTRYPOINT to bypass safety-critical recursion and nesting limits for agent sessions.
[REMOTE_CODE_EXECUTION]: The benchmark runner executes prompts and verification steps that can contain arbitrary code; if the tasks or the skills being evaluated are untrusted, this creates a vector for malicious code execution on the host system.
[PROMPT_INJECTION]: The skill utilizes the --append-system-prompt feature to force specific instructions on nested agent sessions, overriding their default behavior and safety boundaries.

skill-benchmark