agent-evaluation
Pass
Audited by Gen Agent Trust Hub on Mar 11, 2026
Risk Level: SAFE
Tags: COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, PROMPT_INJECTION
Full Analysis
- [DYNAMIC_EXECUTION]: The skill provides Python templates that demonstrate the use of `subprocess.run` to execute tests and the `exec()` function to evaluate code generated during agent trials. These techniques are standard for coding agent evaluation harnesses.
  - Evidence: `SKILL.md` contains code snippets such as `subprocess.run(["pytest", ...])` and `exec(code)  # In sandbox`.
- [INDIRECT_PROMPT_INJECTION]: The skill is designed to analyze and grade data produced by other agents, such as transcripts and multi-turn conversation histories, creating an attack surface where instructions embedded in the analyzed data could influence the evaluator.
  - Ingestion points: Functions like `analyze_transcript`, `grade_coding_agent`, and `grade_research_agent` in `SKILL.md` process trial outcomes and conversation records.
  - Boundary markers: The provided templates do not implement delimiters or specific instructions to isolate analyzed data from processing logic.
- Capability inventory: The skill is configured with `Read`, `Write`, `Shell`, `Grep`, and `Glob` permissions.
- Sanitization: No explicit validation logic is provided in the templates, though the documentation lists sanitization as a best practice in its summary section.
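The dynamic-execution pattern flagged above can be sketched as follows. This is an illustrative reconstruction, not the skill's actual code: the names `run_tests` and `evaluate_candidate` are hypothetical, and a real harness must confine the `exec()` call to a genuine sandbox (container, seccomp filter, or similar) rather than rely on a comment.

```python
import subprocess


def run_tests(test_dir: str, timeout: int = 120) -> bool:
    """Hypothetical helper: run a trial's pytest suite in a subprocess.

    Mirrors the subprocess.run(["pytest", ...]) pattern cited in the audit.
    """
    result = subprocess.run(
        ["pytest", test_dir, "-q"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


def evaluate_candidate(code: str) -> dict:
    """Hypothetical helper: evaluate agent-generated code via exec().

    exec() runs arbitrary code with the harness's own privileges, which is
    why the audit tags this REMOTE_CODE_EXECUTION; in production this call
    must execute inside an isolated sandbox, not in-process.
    """
    namespace: dict = {}
    exec(code, namespace)  # In sandbox
    return namespace
```

The `timeout` argument and `capture_output` keep a misbehaving test run from hanging or flooding the evaluator, but they do not constitute isolation on their own.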
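One way to supply the boundary markers the templates lack is to wrap all agent-produced data in sentinel delimiters before it reaches the grading prompt, and to strip any sentinel strings an attacker may have embedded in the transcript itself. The sentinel values and the `wrap_untrusted` helper below are assumptions for illustration, not part of `SKILL.md`:

```python
# Illustrative sentinel strings; any hard-to-guess, clearly labeled
# delimiters would serve the same purpose.
SENTINEL_OPEN = "<<UNTRUSTED_TRANSCRIPT>>"
SENTINEL_CLOSE = "<<END_UNTRUSTED_TRANSCRIPT>>"


def wrap_untrusted(transcript: str) -> str:
    """Wrap agent-produced data in explicit boundary markers.

    The evaluator's system prompt can then instruct the model to treat
    everything between the markers as data to be graded, never as
    instructions to follow.
    """
    # Remove sentinel strings embedded in the transcript so an attacker
    # cannot fake an early close and smuggle text outside the boundary.
    cleaned = transcript.replace(SENTINEL_OPEN, "").replace(SENTINEL_CLOSE, "")
    return f"{SENTINEL_OPEN}\n{cleaned}\n{SENTINEL_CLOSE}"
```

Delimiters alone do not make injection impossible, but combined with explicit grading instructions they shrink the attack surface the audit identifies at the `analyze_transcript` and grading ingestion points.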
Audit Metadata