evaluating-code-models
Pass
Audited by Gen Agent Trust Hub on Mar 28, 2026
Risk Level: SAFE
Flagged capabilities: COMMAND_EXECUTION · EXTERNAL_DOWNLOADS · REMOTE_CODE_EXECUTION · PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill instructs the agent to execute shell commands: installing the evaluation harness ('pip install -e .') and launching benchmarks with 'accelerate launch' (see the launch sketch after this list).
- [EXTERNAL_DOWNLOADS]: It fetches source code from the BigCode Project's GitHub repository and pulls Docker images from the GitHub Container Registry for multi-language evaluation.
- [REMOTE_CODE_EXECUTION]: The harness exposes flags such as '--allow_code_execution', which runs code generated by the model under evaluation, and '--trust_remote_code', which runs code shipped in Hugging Face repositories. These operations are essential for benchmarking but execute untrusted code.
- [PROMPT_INJECTION]: The skill is exposed to indirect prompt injection because it processes external data from the benchmarks it evaluates. Ingestion points: benchmark data such as HumanEval and MBPP loaded via 'datasets' (SKILL.md). Boundary markers: none identified. Capability inventory: execution of code snippets via the 'code_eval' metric and subprocess execution during harness launch. Sanitization: the documentation recommends running evaluation inside Docker containers to isolate execution from the host environment (a containerized run is sketched after this list).
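For reference, a minimal sketch of the install-and-launch flow the flagged flags belong to. Only '--allow_code_execution' and '--trust_remote_code' are quoted by the skill itself; the repository URL, the 'main.py' entry point, and the '--model'/'--tasks' options are assumptions about the BigCode harness, and the model and task names are placeholders:

```sh
# Fetch and install the harness in editable mode
# (EXTERNAL_DOWNLOADS + COMMAND_EXECUTION surface).
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .

# Launch a benchmark. The last two flags are the REMOTE_CODE_EXECUTION
# surface noted above: model-generated code and remote model code both run.
accelerate launch main.py \
  --model <org/model-name> \
  --tasks humaneval \
  --allow_code_execution \
  --trust_remote_code
```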
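And a sketch of the Docker-isolated run the sanitization note refers to, assuming a published harness container image on the GitHub Container Registry and the harness's option for evaluating pre-generated completions; the image name, mount path, file name, and flags other than '--allow_code_execution' are assumptions, not audited facts:

```sh
# Evaluate pre-generated completions inside a container so that model-written
# code executes isolated from the host environment (the recommended mitigation).
docker run -it --rm \
  -v "$(pwd)/generations_py.json:/app/generations_py.json:ro" \
  ghcr.io/bigcode-project/evaluation-harness \
  python3 main.py \
    --model <org/model-name> \
    --tasks humaneval \
    --load_generations_path /app/generations_py.json \
    --allow_code_execution
```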
Audit Metadata