The Agent Skills Directory

[REMOTE_CODE_EXECUTION]: The skill contains Python examples that call exec() to run the code under evaluation and check its outcome. Evidence: exec(code) in the 'Simple Coding Agent Eval' section of SKILL.md.
[COMMAND_EXECUTION]: The skill suggests running test suites such as pytest in a subprocess against code generated by agents. Evidence: subprocess.run(['pytest', test_spec['test_file']]) in SKILL.md.
[EXTERNAL_DOWNLOADS]: The skill documentation references several external research benchmarks and tools. Evidence: Mentions of SWE-bench, WebArena, and tau2-Bench in SKILL.md.
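
The first two findings both come down to running agent-generated code inside the evaluator. One possible mitigation is to run that code in a separate, time-limited interpreter process rather than calling exec() in-process. The following is a minimal sketch under that assumption, not the skill's own method: the function name run_untrusted, the file name candidate.py, and the returned dictionary shape are all hypothetical and do not appear in SKILL.md. A child process is not a sandbox; it limits what a crash or hang can do to the evaluator, but file system and network access would still need container- or OS-level isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout: float = 5.0) -> dict:
    """Run a code string in a child Python process instead of exec().

    Hypothetical helper for illustration; it is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the candidate code to a throwaway directory.
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* env vars and user site-packages)
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,  # the child is killed if it runs longer than this
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "stdout": "", "stderr": "timeout"}
    # A zero exit code is treated as success in this sketch.
    return {
        "passed": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Called as run_untrusted("print(1 + 1)"), the helper returns a dictionary whose "passed" field reflects the child's exit code. The same pattern could wrap the pytest invocation from the second finding by adding a timeout and a dedicated working directory to the existing subprocess.run call.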

agent-evaluation