agent-evaluation

Pass

Audited by Gen Agent Trust Hub on Mar 11, 2026

Risk Level: SAFECOMMAND_EXECUTIONREMOTE_CODE_EXECUTIONPROMPT_INJECTION
Full Analysis
  • [DYNAMIC_EXECUTION]: The skill provides Python templates that demonstrate the use of subprocess.run to execute tests and the exec() function to evaluate code generated during agent trials. These techniques are standard for coding agent evaluation harnesses. \n
  • Evidence: SKILL.md contains code snippets such as subprocess.run([\"pytest\", ...]) and exec(code) # In sandbox.\n- [INDIRECT_PROMPT_INJECTION]: The skill is designed to analyze and grade data produced by other agents, such as transcripts and multi-turn conversation histories, creating an attack surface where instructions embedded in the analyzed data could influence the evaluator. \n
  • Ingestion points: Functions like analyze_transcript, grade_coding_agent, and grade_research_agent in SKILL.md process trial outcomes and conversation records.\n
  • Boundary markers: The provided templates do not implement delimiters or specific instructions to isolate analyzed data from processing logic.\n
  • Capability inventory: The skill is configured with Read, Write, Shell, Grep, and Glob permissions.\n
  • Sanitization: No explicit validation logic is provided in the templates, though the documentation lists sanitization as a best practice in its summary section.
Audit Metadata
Risk Level
SAFE
Analyzed
Mar 11, 2026, 01:50 PM