evaluating-code-models

Warn

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: MEDIUM (EXTERNAL_DOWNLOADS, REMOTE_CODE_EXECUTION, COMMAND_EXECUTION)
Full Analysis
  • [Unverifiable Dependencies & Remote Code Execution] (MEDIUM): The documentation instructs users to pass the --trust_remote_code flag, which allows arbitrary Python code defined in a remote Hugging Face model repository to run locally. This is a high-risk feature: if the model being evaluated contains malicious logic, loading it results in remote code execution.
  • [Privilege Escalation] (MEDIUM): The troubleshooting guide (references/issues.md) suggests building Docker images using sudo make. This encourages users to execute build scripts with elevated privileges, which could be exploited if the local environment or the Makefile is compromised.
  • [Unverifiable Dependencies & Remote Code Execution] (LOW): The skill directs users to download and install packages and containers from non-whitelisted sources, including git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git and docker pull ghcr.io/bigcode-project/evaluation-harness-multiple. These sources are outside the trusted list for this analysis.
  • [Indirect Prompt Injection] (LOW): The tool is designed to ingest and execute model-generated code (untrusted data) via the --allow_code_execution flag.
    • Ingestion points: Model outputs are loaded from generations.json for evaluation.
    • Boundary markers: None identified in the provided documentation to delimit code from instructions.
    • Capability inventory: The harness executes code in the local environment or Docker containers using Python and various language runtimes.
    • Sanitization: None mentioned; the tool relies on the assumption that the user will use isolated environments (like Docker) as suggested in the documentation.
  • [Dynamic Execution] (MEDIUM): The core functionality of the tool involves dynamic execution of arbitrary strings as code, which is a significant risk surface despite being the intended purpose of the benchmarking harness.
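
To make the dynamic-execution finding concrete, the sketch below shows one way a caller might run a model-generated code string in a separate, time-limited Python process instead of the evaluating interpreter. It is a minimal illustration, not the harness's actual implementation: the helper name run_untrusted and the 5-second default timeout are hypothetical. A subprocess with a timeout limits hangs and keeps the snippet out of the caller's process, but it is not a sandbox, so the container isolation the documentation suggests is still needed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run a code string in a separate Python process with a time limit.

    Hypothetical helper for illustration only. Returns (passed, output),
    where passed is True when the child exits with status 0.
    This is NOT a sandbox: the child can still read files and use the
    network with the caller's privileges.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the candidate to a throwaway directory that is deleted afterwards.
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode: PYTHON* environment
                # variables and the user site directory are ignored.
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,  # the child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout + proc.stderr


if __name__ == "__main__":
    passed, output = run_untrusted("print(1 + 1)")
    print(passed, output.strip())
```

Running the same helper inside a container with no network access and a non-root user would address more of the risks listed above than the subprocess boundary alone.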
Audit Metadata
  • Risk Level: MEDIUM
  • Analyzed: Feb 17, 2026, 06:06 PM