evaluating-code-models

Audit verdict: Fail

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: HIGH
Findings: REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS
Full Analysis
  • [REMOTE_CODE_EXECUTION] (HIGH): The references/issues.md file encourages the use of the --trust_remote_code flag. This flag allows the execution of arbitrary Python code defined in a HuggingFace model's repository, which is a critical security risk when interacting with untrusted model sources.
  • [COMMAND_EXECUTION] (HIGH): The references/issues.md file instructs users to run sudo make for building Docker images. Granting administrative privileges to a build process for external code is a high-risk operation.
  • [COMMAND_EXECUTION] (HIGH): The skill repeatedly promotes the use of the --allow_code_execution flag (e.g., in references/benchmarks.md). While necessary for benchmarking, this capability allows the execution of arbitrary, unvetted code generated by an AI model directly on the user's system or within a container.
  • [EXTERNAL_DOWNLOADS] (MEDIUM): The documentation guides users to download Docker images from ghcr.io/bigcode-project and install packages from various indices. These are non-whitelisted external sources that represent a supply chain risk.
  • [INDIRECT_PROMPT_INJECTION] (LOW): The harness is designed to ingest external datasets from HuggingFace and execute code based on that data. This creates a surface where a malicious dataset could compromise the system through the execution capability.
  • Ingestion points: Data loaded from HuggingFace via the --tasks argument in references/benchmarks.md.
  • Boundary markers: None identified; code is executed directly as part of the benchmark.
  • Capability inventory: Execution of arbitrary code on the host or in Docker via accelerate launch main.py --allow_code_execution.
  • Sanitization: None; the tool's purpose is to execute the generated output for testing.
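Since the harness itself performs no sanitization, containment has to happen at the point of execution. A minimal sketch of one mitigation, running generated code in a separate, isolated interpreter process with a timeout and a stripped environment (the helper name and timeout value are illustrative, not part of the audited skill, and this reduces rather than eliminates the risk):

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> subprocess.CompletedProcess:
    """Execute model-generated code in a child interpreter with a
    wall-clock timeout, an empty environment, and a throwaway working
    directory, so hangs and inherited secrets are contained."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I runs Python in isolated mode: ignores PYTHON* env vars
            # and the user site-packages directory.
            [sys.executable, "-I", "-c", code],
            cwd=workdir,          # scratch directory, deleted afterwards
            env={},               # no inherited PATH, tokens, or credentials
            capture_output=True,
            text=True,
            timeout=timeout,      # raises subprocess.TimeoutExpired on a hang
        )

result = run_untrusted("print(2 + 2)")
print(result.stdout.strip())  # prints: 4
```

For stronger isolation this subprocess should itself run inside a network-disabled container, since a plain child process still shares the host filesystem and network.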
Recommendations
  • Run benchmark code execution only inside an isolated container (no network, non-root user), never directly on the host.
  • Avoid --trust_remote_code except for model repositories that have been manually vetted.
  • Do not build Docker images with sudo; use a rootless or otherwise unprivileged build.
  • Pin Docker images and package indices to specific digests and versions to reduce supply chain risk.
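The --trust_remote_code concern can be narrowed by gating the flag behind an explicit allowlist rather than passing True unconditionally. A minimal sketch (the helper name and the example repository entry are illustrative, not part of the audited skill):

```python
# Repositories whose custom modeling code has been manually reviewed.
# Maintain this set deliberately; it is an example, not a recommendation
# of any specific repository.
TRUSTED_REPOS = {"bigcode/starcoder"}

def trust_remote_code_for(repo_id: str) -> bool:
    """Return True only for manually vetted repositories.

    Pass the result as the trust_remote_code argument when loading a
    model, instead of hardcoding True for every repository.
    """
    return repo_id in TRUSTED_REPOS

print(trust_remote_code_for("bigcode/starcoder"))   # prints: True
print(trust_remote_code_for("unvetted/model"))      # prints: False
```

This keeps the dangerous capability opt-in per repository, so adding a new model source forces a conscious review step.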
Audit Metadata
  • Risk Level: HIGH
  • Analyzed: Feb 17, 2026, 04:54 PM