agent-evaluation
Warn
Audited by Gen Agent Trust Hub on Mar 11, 2026
Risk Level: MEDIUM | REMOTE_CODE_EXECUTION | COMMAND_EXECUTION | EXTERNAL_DOWNLOADS
Full Analysis
- [REMOTE_CODE_EXECUTION]: The skill contains Python examples that call exec() to run code and check its outcome. Evidence: exec(code) in the 'Simple Coding Agent Eval' section of SKILL.md.
- [COMMAND_EXECUTION]: The skill suggests using subprocesses to run test suites like pytest on code generated by agents. Evidence: subprocess.run(['pytest', test_spec['test_file']]) in SKILL.md.
- [EXTERNAL_DOWNLOADS]: The skill documentation references several external research benchmarks and tools. Evidence: Mentions of SWE-bench, WebArena, and tau2-Bench in SKILL.md.
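The exec() and subprocess findings above both involve running agent-generated code. One common way to reduce the risk is to run that code in a separate interpreter process with a timeout instead of calling exec() in the evaluator's own process. The sketch below is illustrative only and is not taken from SKILL.md: the function name run_untrusted, the file name candidate.py, and the result keys are hypothetical. It gives process separation and a time limit, not a security sandbox; a real harness would likely add container or OS-level isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run a code string in a child Python process instead of exec().

    Hypothetical helper for illustration; not part of the audited skill.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the candidate code to a throwaway file in a scratch directory.
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores PYTHON* env vars
                # and the user site-packages directory).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "timed_out": True, "stdout": "", "stderr": ""}
    return {
        "passed": proc.returncode == 0,
        "timed_out": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A caller would treat a non-zero exit code or a timeout as a failed evaluation, for example by checking run_untrusted(code)["passed"] before scoring the agent's output.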
Audit Metadata