evaluating-code-models
Fail
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: HIGH
Tags: REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS
Full Analysis
- [REMOTE_CODE_EXECUTION] (HIGH): The `references/issues.md` file encourages the use of the `--trust_remote_code` flag. This flag allows the execution of arbitrary Python code defined in a HuggingFace model's repository, which is a critical security risk when interacting with untrusted model sources.
- [COMMAND_EXECUTION] (HIGH): The `references/issues.md` file instructs users to run `sudo make` for building Docker images. Granting administrative privileges to a build process for external code is a high-risk operation.
- [COMMAND_EXECUTION] (HIGH): The skill repeatedly promotes the use of the `--allow_code_execution` flag (e.g., in `references/benchmarks.md`). While necessary for benchmarking, this capability allows the execution of arbitrary, unvetted code generated by an AI model directly on the user's system or within a container.
- [EXTERNAL_DOWNLOADS] (MEDIUM): The documentation guides users to download Docker images from `ghcr.io/bigcode-project` and install packages from various indices. These are non-whitelisted external sources that represent a supply chain risk.
- [INDIRECT_PROMPT_INJECTION] (LOW): The harness is designed to ingest external datasets from HuggingFace and execute code based on that data. This creates a surface where a malicious dataset could compromise the system through the execution capability.
- Ingestion points: Data loaded from HuggingFace via the `--tasks` argument in `references/benchmarks.md`.
- Boundary markers: None identified; code is executed directly as part of the benchmark.
- Capability inventory: Execution of arbitrary code on the host or in Docker via `accelerate launch main.py --allow_code_execution`.
- Sanitization: None; the tool's purpose is to execute the generated output for testing.
Recommendations
- The automated review detected serious security threats; if this skill is used at all, its execution-related steps should be confined to an isolated, disposable environment.
Audit Metadata