agent-evaluation

Warn

Audited by Gen Agent Trust Hub on Mar 11, 2026

Risk Level: MEDIUM (REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS)
Full Analysis
  • [REMOTE_CODE_EXECUTION]: The skill contains Python examples that use the exec() function to evaluate code outcomes. Evidence: exec(code) in the 'Simple Coding Agent Eval' section of SKILL.md.
  • [COMMAND_EXECUTION]: The skill suggests using subprocesses to run test suites like pytest on code generated by agents. Evidence: subprocess.run(['pytest', test_spec['test_file']]) in SKILL.md.
  • [EXTERNAL_DOWNLOADS]: The skill documentation references several external research benchmarks and tools. Evidence: Mentions of SWE-bench, WebArena, and tau2-Bench in SKILL.md.
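The first two findings both concern running agent-generated code on the evaluator's machine. As an illustration only (not taken from SKILL.md or from the audit), the sketch below shows one common way to reduce that exposure: run the candidate code in a separate interpreter process with a timeout instead of calling exec() in-process. The function name run_untrusted, the file name candidate.py, and the result dictionary layout are hypothetical. This isolates crashes and hangs from the evaluator but is not a security sandbox; truly untrusted code would still need a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a child Python process rather than exec() in-process.

    Hypothetical helper: it contains crashes and infinite loops, but it
    does NOT restrict file, network, or system access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* env vars and user site dir)
                [sys.executable, "-I", str(script)],
                cwd=workdir,          # keep relative writes inside the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,    # child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "returncode": None,
                    "stdout": "", "stderr": "timeout"}
    return {
        "passed": proc.returncode == 0,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

The same pattern (explicit argument list, no shell, a timeout, and a throwaway working directory) could be applied to the pytest invocation cited in the COMMAND_EXECUTION finding.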
Audit Metadata
Risk Level: MEDIUM
Analyzed: Mar 11, 2026, 09:08 AM