skill-eval

Fail

Audited by Gen Agent Trust Hub on Mar 9, 2026

Risk Level: HIGHCOMMAND_EXECUTIONREMOTE_CODE_EXECUTIONDATA_EXFILTRATIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: Multiple scripts including eval_grader.py, run_eval.py, and improve_description.py use the subprocess module to execute external commands. They primarily interact with the claude CLI to spawn sub-agents for evaluation tasks and lsof to manage network ports. While central to the skill's purpose, this involves executing arbitrary tasks through the CLI.
  • [REMOTE_CODE_EXECUTION]: Automated scans identified a potential remote code execution pattern in generate_review.py. The script generates an HTML report containing fetch calls to /api/feedback. This is designed to send user feedback from a browser back to a local HTTP server started by the script. Although intended for a local feedback loop, the pattern of sending data from a browser to a local execution environment is a high-risk vector.
  • [DYNAMIC_CODE_GENERATION]: The extract_scripts.py script parses execution transcripts for code blocks and automatically generates candidate Python or Bash scripts. Since transcripts are generated from potentially untrusted data or skill behaviors, this poses a risk where malicious code could be automatically packaged into a new script for the user to run.
  • [DATA_EXFILTRATION]: generate_review.py implements a local HTTP server using HTTPServer bound to 127.0.0.1. This server is used to receive data from the interactive HTML reports. While restricted to the local loopback interface, it introduces a network-listening surface on the host machine.
  • [PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection.
  • Ingestion points: scripts/eval_grader.py (reads execution transcripts), scripts/extract_scripts.py (reads transcripts), scripts/run_eval.py (processes arbitrary user queries).
  • Boundary markers: The system prompts for the grader and analyzer agents use some tagging (e.g., <skill_content>), but do not provide robust instructions to ignore potentially malicious commands embedded within the transcripts being analyzed.
  • Capability inventory: The skill has the ability to read/write files and execute shell commands via claude -p sub-agents.
  • Sanitization: There is no evidence of sanitization or escaping of the transcript content before it is passed to the LLM-based grader or script extractor.
Recommendations
  • HIGH: Downloads and executes remote code from: unknown (check file) - DO NOT USE without thorough review
Audit Metadata
Risk Level
HIGH
Analyzed
Mar 9, 2026, 08:15 PM