evaluating-code-models
Fail
Audited by Gen Agent Trust Hub on Feb 16, 2026
Risk Level: HIGH (REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS)
Full Analysis
- [Indirect Prompt Injection] (HIGH): The documentation extensively promotes the use of the `--allow_code_execution` flag across multiple benchmarks (HumanEval, MBPP, DS-1000, etc.) in `references/benchmarks.md`. This flag enables the execution of code generated by LLMs, which is a significant security risk if a model is poisoned or produces malicious output (see the first sketch following this list).
- Ingestion points: External code generated by LLMs during the benchmarking process.
- Boundary markers: None specified in the instructions; the harness executes the generated code directly.
- Capability inventory: Full Python/shell execution on the host system depending on the benchmark and language.
- Sanitization: None provided; the harness relies on the user to provide a safe environment.
- [Dynamic Execution] (HIGH): In `references/issues.md`, the guide suggests using the `--trust_remote_code` flag to resolve model loading issues. This flag allows the HuggingFace transformers library to execute arbitrary Python code contained within the model's repository, potentially leading to Remote Code Execution (RCE) if the model source is untrusted (see the second sketch following this list).
- [Remote Code Execution] (MEDIUM): The skill instructs users to install various Python packages and run Docker containers from external sources.
- Evidence: `pip install lctk sortedcontainers` and `docker pull ghcr.io/bigcode-project/evaluation-harness-multiple` in `references/benchmarks.md` (see the third sketch following this list).
- [Privilege Escalation] (LOW): The troubleshooting guide in `references/issues.md` includes commands utilizing `sudo` for building Docker images (`sudo make DOCKERFILE=Dockerfile all`), which grants elevated privileges to the build process.
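The risk behind `--allow_code_execution` is easiest to see in miniature. The sketch below is illustrative only, not the harness's actual runner: it assumes a hypothetical `generated_solution` string returned by a model and executes it the way a benchmark runner typically would, in a subprocess with a timeout and no shell, which still amounts to full Python execution on the host.

```python
# Illustrative sketch only -- not the bigcode harness's real implementation.
# It shows why --allow_code_execution means "run whatever the model wrote":
# the generated text is saved and executed as ordinary Python on the host.
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical model output for a HumanEval-style task.
generated_solution = """
def add(a, b):
    return a + b

assert add(2, 3) == 5
"""

def run_generated_code(code: str, timeout_s: int = 10) -> bool:
    """Write model-generated code to a temp file and execute it.

    Even with a timeout and shell=False, this is arbitrary code execution:
    the snippet can import os, open sockets, or read any file the harness
    can read. Run it only inside an isolated container or disposable VM.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

if __name__ == "__main__":
    print("passed" if run_generated_code(generated_solution) else "failed")
```

Whatever the harness does internally, the trust boundary is the same: the only meaningful mitigation at this layer is the environment, such as a network-isolated container, because the generated code itself cannot be trusted.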
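The `--trust_remote_code` issue maps onto the `trust_remote_code` argument of the transformers loading APIs. A minimal sketch follows; `load_untrusted` and `load_with_custom_code` are hypothetical helper names, and `gpt2` is used only as a stock built-in architecture. With the flag enabled, Python modules shipped in the model repository are imported and executed at load time, so the safer pattern is to keep it off by default and, when it is genuinely required, pin the exact revision you reviewed.

```python
# Minimal sketch of the trade-off behind trust_remote_code; helper names and
# the audited commit hash are placeholders, not part of the skill being audited.
from transformers import AutoModelForCausalLM

def load_untrusted(model_id: str):
    """Safer default: no Python files from the model repo are executed."""
    return AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=False)

def load_with_custom_code(model_id: str, audited_commit: str):
    """Opt-in path: custom modeling code from the repo runs at load time.

    Pin the exact commit you reviewed so a later push to the repository
    cannot silently change what executes on your machine.
    """
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,   # imports and runs custom modeling code from the repo
        revision=audited_commit,  # e.g. the full commit hash you have audited
    )

if __name__ == "__main__":
    # Works for any architecture that ships with transformers itself.
    model = load_untrusted("gpt2")
    print(type(model).__name__)
```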
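For the external-download findings, one low-effort mitigation is to pin and verify dependency versions before an evaluation run. The sketch below is a generic pattern, not part of the skill: the package name and version in `PINNED` are examples standing in for whatever lockfile the environment actually uses.

```python
# A minimal sketch: verify that externally sourced packages match pinned
# versions before allowing the evaluation to proceed. The pins here are
# examples, not an authoritative requirements set for this skill.
from importlib import metadata

PINNED = {
    "sortedcontainers": "2.4.0",  # example pin; take the value from your lockfile
}

def check_pins(pins: dict[str, str]) -> list[str]:
    """Return a list of mismatches between installed and pinned versions."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, pinned {wanted}")
    return problems

if __name__ == "__main__":
    for line in check_pins(PINNED) or ["all pins satisfied"]:
        print(line)
```

The analogous step for the pulled container image is referencing it by digest rather than by tag, so the bytes that run are the bytes that were reviewed.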
Recommendations
- AI detected serious security threats
Audit Metadata