evaluating-code-models
Warn
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: MEDIUM
Tags: EXTERNAL_DOWNLOADS, REMOTE_CODE_EXECUTION, COMMAND_EXECUTION
Full Analysis
- [Unverifiable Dependencies & Remote Code Execution] (MEDIUM): The documentation instructs users to use the `--trust_remote_code` flag, which allows execution of arbitrary Python code defined in remote Hugging Face model repositories. This is a high-risk feature that can lead to remote code execution if the model being evaluated contains malicious logic.
- [Privilege Escalation] (MEDIUM): The troubleshooting guide (references/issues.md) suggests building Docker images with `sudo make`. This encourages users to run build scripts with elevated privileges, which could be exploited if the local environment or the Makefile is compromised.
- [Unverifiable Dependencies & Remote Code Execution] (LOW): The skill directs users to download and install packages and containers from non-whitelisted sources, including `git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git` and `docker pull ghcr.io/bigcode-project/evaluation-harness-multiple`. These sources are outside the trusted list for this analysis.
- [Indirect Prompt Injection] (LOW): The tool is designed to ingest and execute model-generated code (untrusted data) via the `--allow_code_execution` flag.
  - Ingestion points: Model outputs are loaded from `generations.json` for evaluation.
  - Boundary markers: None identified in the provided documentation to delimit code from instructions.
  - Capability inventory: The harness executes code in the local environment or Docker containers using Python and various language runtimes.
  - Sanitization: None mentioned; the tool relies on the assumption that the user will use isolated environments (like Docker) as suggested in the documentation.
- [Dynamic Execution] (MEDIUM): The core functionality of the tool involves dynamic execution of arbitrary strings as code, which is a significant risk surface despite being the intended purpose of the benchmarking harness.
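The dynamic-execution pattern the findings describe, loading model-generated candidates from `generations.json` and running each one, can be sketched as follows. This is a minimal illustration of the risk surface, not the harness's actual implementation; the function name, the JSON layout, and the snippet contents are all hypothetical. A child interpreter plus a timeout is the bare minimum here; the documentation's own suggestion is full Docker isolation.

```python
import json
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Execute one untrusted, model-generated snippet in a child process.

    The snippet is written to a temp file and run under a separate
    interpreter with a timeout, so a hang or crash cannot take down
    the harness process itself.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )

# Hypothetical generations.json layout: a list of candidate programs per task.
generations = json.loads('[["print(2 + 2)"]]')
result = run_generated_code(generations[0][0])
print(result.stdout.strip())  # -> 4
```

Note that process isolation alone limits only availability, not capability: the child still inherits the user's filesystem and network access, which is why the findings flag the absence of sanitization as a risk.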
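To make the `--trust_remote_code` finding concrete: enabling the flag ultimately means Python source fetched from a model repository is executed in the local environment. The sketch below shows what that trust amounts to; the string stands in for code downloaded from a remote repo, and none of this is the harness's or Hugging Face's real loading code.

```python
# Stand-in for model-repository Python fetched over the network.
remote_source = """
def load_model():
    # A malicious repository could put anything here: file reads,
    # network calls, subprocess spawning, etc.
    return "model loaded"
"""

namespace: dict = {}
exec(remote_source, namespace)    # this is what "trusting remote code" amounts to
print(namespace["load_model"]())  # -> model loaded
```

Whatever the fetched source defines runs with the full privileges of the evaluating process, which is why the audit rates the flag as a remote-code-execution risk rather than a mere configuration option.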
Audit Metadata