The Agent Skills Directory

[COMMAND_EXECUTION]: Multiple scripts including eval_grader.py, run_eval.py, and improve_description.py use the subprocess module to execute external commands. They primarily interact with the claude CLI to spawn sub-agents for evaluation tasks and lsof to manage network ports. While central to the skill's purpose, this involves executing arbitrary tasks through the CLI.
[REMOTE_CODE_EXECUTION]: Automated scans identified a potential remote code execution pattern in generate_review.py. The script generates an HTML report containing fetch calls to /api/feedback. This is designed to send user feedback from a browser back to a local HTTP server started by the script. Although intended for a local feedback loop, the pattern of sending data from a browser to a local execution environment is a high-risk vector.
[DYNAMIC_CODE_GENERATION]: The extract_scripts.py script parses execution transcripts for code blocks and automatically generates candidate Python or Bash scripts. Since transcripts are generated from potentially untrusted data or skill behaviors, this poses a risk where malicious code could be automatically packaged into a new script for the user to run.
[DATA_EXFILTRATION]: generate_review.py implements a local HTTP server using HTTPServer bound to 127.0.0.1. This server is used to receive data from the interactive HTML reports. While restricted to the local loopback interface, it introduces a network-listening surface on the host machine.
[PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection.
Ingestion points: scripts/eval_grader.py (reads execution transcripts), scripts/extract_scripts.py (reads transcripts), scripts/run_eval.py (processes arbitrary user queries).
Boundary markers: The system prompts for the grader and analyzer agents use some tagging (e.g., <skill_content>), but do not provide robust instructions to ignore potentially malicious commands embedded within the transcripts being analyzed.
Capability inventory: The skill has the ability to read/write files and execute shell commands via claude -p sub-agents.
Sanitization: There is no evidence of sanitization or escaping of the transcript content before it is passed to the LLM-based grader or script extractor.

skill-eval