The Agent Skills Directory

[REMOTE_CODE_EXECUTION]: The skill provides code templates that utilize the exec() function to perform code-based grading of agent outputs.
Evidence: Found in SKILL.md under Example 1 (Simple Coding Agent Eval).
Risk: This pattern allows for the execution of arbitrary Python code. If the agent evaluates untrusted output without a strictly enforced sandbox, it can lead to code execution on the host environment.
[COMMAND_EXECUTION]: The skill recommends using subprocess.run() to execute terminal commands such as pytest for grading coding tasks.
Evidence: Found in SKILL.md in the grade_swe_bench and grade_coding_agent function templates.
Risk: If inputs such as repo_path or test_file are derived from untrusted sources without validation, it could lead to command injection or unauthorized filesystem access.
[PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection due to its core functionality of ingesting and analyzing untrusted data from other agents.
Ingestion points: The skill processes external data via outcome["code"], qa_case["input"], and agent transcripts in SKILL.md.
Boundary markers: The provided grader templates do not include delimiters or instructions to ignore embedded commands in the data being processed.
Capability inventory: The skill utilizes Shell, Write, and Read tools, and includes instructions for command and code execution.
Sanitization: No explicit validation or sanitization of the ingested content is shown in the provided grading examples.

agent-evaluation