evaluating-code-models

Audit result: Fail

Audited by Gen Agent Trust Hub on Feb 16, 2026

Risk Level: HIGH
Tags: REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS
Full Analysis
  • [Indirect Prompt Injection] (HIGH): The documentation extensively promotes the use of the --allow_code_execution flag across multiple benchmarks (HumanEval, MBPP, DS-1000, etc.) in references/benchmarks.md. This flag enables the execution of code generated by LLMs, which is a significant security risk if a model is poisoned or produces malicious output.
  • Ingestion points: External code generated by LLMs during the benchmarking process.
  • Boundary markers: None specified in the instructions; the harness executes the generated code directly.
  • Capability inventory: Full Python/shell execution on the host system depending on the benchmark and language.
  • Sanitization: None provided; the harness relies on the user to provide a safe environment (an isolation sketch follows this list).
  • [Dynamic Execution] (HIGH): In references/issues.md, the guide suggests using the --trust_remote_code flag to resolve model loading issues. This flag allows the HuggingFace transformers library to execute arbitrary Python code contained in the model's repository, potentially leading to Remote Code Execution (RCE) if the model source is untrusted (a pin-and-inspect sketch follows this list).
  • [Remote Code Execution] (MEDIUM): The skill instructs users to install various Python packages and run Docker containers pulled from external sources (a pinning sketch follows this list).
  • Evidence: pip install lctk sortedcontainers and docker pull ghcr.io/bigcode-project/evaluation-harness-multiple in references/benchmarks.md.
  • [Privilege Escalation] (LOW): The troubleshooting guide in references/issues.md builds Docker images with sudo (sudo make DOCKERFILE=Dockerfile all), which grants root privileges to the build process (a rootless-build sketch follows this list).
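
Isolation sketch for --allow_code_execution. This is a minimal, hedged example of the "safe environment" the first finding says the user must provide: it reuses the container image named in references/benchmarks.md, disables networking, and runs as a non-root user. The mount path, the container entrypoint, and every harness flag other than --allow_code_execution are assumptions, not taken from the skill itself.

    # Evaluate previously generated solutions inside an isolated container.
    # Assumption: the image entrypoint invokes the harness and accepts these flags.
    docker run --rm --network none --user 1000:1000 \
        -v "$PWD/generations.json:/data/generations.json:ro" \
        ghcr.io/bigcode-project/evaluation-harness-multiple \
        --tasks humaneval \
        --allow_code_execution \
        --load_generations_path /data/generations.json

Disabling the network also limits what prompt-injected or otherwise malicious generated code can reach beyond the container.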
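
Pin-and-inspect sketch for --trust_remote_code. Before enabling the flag, the custom modeling code it would execute can be downloaded at a fixed revision and reviewed. A minimal sketch assuming the huggingface_hub CLI is available; ORG/MODEL and COMMIT_SHA are placeholders.

    # Fetch an exact revision of the model repository without loading the model
    pip install --upgrade "huggingface_hub[cli]"
    huggingface-cli download ORG/MODEL --revision COMMIT_SHA --local-dir ./model-snapshot
    # These Python files are what trust_remote_code would import and execute
    find ./model-snapshot -name '*.py' -print

Loading the model with the same pinned revision keeps the reviewed code and the executed code identical.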
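
Pinning sketch for the external downloads. Version pins and image digests tie what runs locally to what was reviewed. The package and image names come from references/benchmarks.md; X.Y.Z, A.B.C, and DIGEST are placeholders to fill in after review.

    # Pin Python dependencies to reviewed versions
    pip install lctk==X.Y.Z sortedcontainers==A.B.C
    # Pull the evaluation image by immutable digest instead of a mutable tag
    docker pull ghcr.io/bigcode-project/evaluation-harness-multiple@sha256:DIGEST

For stricter verification, pip can also enforce hash checking with a requirements file and --require-hashes.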
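
Rootless-build sketch for the sudo step. The same Makefile target can be built without root by pointing Docker at a per-user rootless daemon. A minimal sketch assuming docker-ce-rootless-extras is installed; the make target is the one named in references/issues.md.

    # One-time setup of a rootless Docker daemon for the current user
    dockerd-rootless-setuptool.sh install
    export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
    # Build the image without sudo
    make DOCKERFILE=Dockerfile all
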
Recommendations
  • The automated audit detected serious security risks. Review the findings above before using this skill: run code executed under --allow_code_execution only in an isolated environment, enable --trust_remote_code only for model sources that have been reviewed, and pin or verify externally downloaded packages and container images.
Audit Metadata
  • Risk Level: HIGH
  • Analyzed: Feb 16, 2026, 04:04 AM