The Agent Skills Directory

[Unverifiable Dependencies & Remote Code Execution] (MEDIUM): The documentation instructs users to pass the --trust_remote_code flag, which allows arbitrary Python code defined in a remote Hugging Face model repository to run on the evaluating machine. This is a high-risk feature: if the repository of the model being evaluated contains malicious logic, the result is remote code execution.
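One common way to narrow this risk is to pin the model to a reviewed commit before enabling remote code. The sketch below is illustrative only: `pinned_load_kwargs` is a hypothetical helper, not part of the harness, and it assumes the model is loaded through a Hugging Face transformers `from_pretrained` call, which accepts `revision` and `trust_remote_code` keyword arguments.

```python
import re

# A full Git commit SHA-1: exactly 40 lowercase hex characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str, trust_remote_code: bool = False) -> dict:
    """Build keyword arguments for a `from_pretrained` call.

    Hypothetical helper: refuses to enable remote code unless `revision`
    is a full commit hash, so the code that runs is the code that was reviewed.
    A branch name such as "main" can move to new, unreviewed code at any time.
    """
    if trust_remote_code and not _COMMIT_SHA.fullmatch(revision):
        raise ValueError("trust_remote_code requires a full 40-character commit hash")
    return {"revision": revision, "trust_remote_code": trust_remote_code}
```

A caller would then pass the result through, for example `AutoModelForCausalLM.from_pretrained(model_id, **pinned_load_kwargs(sha, trust_remote_code=True))`. Pinning does not make the remote code safe; it only ensures that what executes is the version that was inspected.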
[Privilege Escalation] (MEDIUM): The troubleshooting guide (references/issues.md) suggests building Docker images with sudo make. This encourages users to run build scripts with elevated privileges, so a compromised Makefile or local environment would execute as root.
[Unverifiable Dependencies & Remote Code Execution] (LOW): The skill directs users to download and install packages and containers from non-whitelisted sources, including git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git and docker pull ghcr.io/bigcode-project/evaluation-harness-multiple. These sources are outside the trusted list for this analysis.
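A typical mitigation for unverified sources is to pin each download to an immutable identifier: a commit hash for the repository and a content digest for the container image. The following is a minimal sketch under that assumption. Both function names are hypothetical, and the functions only build argument lists for the standard `git` and `docker` CLIs without running them.

```python
from __future__ import annotations

import re

_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")
_SHA256_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def pinned_pull_command(image: str, digest: str) -> list[str]:
    """Return a `docker pull` argv that references the image by digest."""
    if not _SHA256_DIGEST.fullmatch(digest):
        raise ValueError("digest must look like sha256:<64 hex characters>")
    return ["docker", "pull", f"{image}@{digest}"]


def pinned_checkout_commands(repo_url: str, commit: str, dest: str) -> list[list[str]]:
    """Return argv lists that clone a repository and check out one exact commit."""
    if not _COMMIT_SHA.fullmatch(commit):
        raise ValueError("commit must be a full 40-character hash")
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]
```

The digest and commit values must come from a version that has actually been reviewed; none are given in the documentation under analysis, so they are left as parameters here.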
[Indirect Prompt Injection] (LOW): The tool is designed to ingest and execute model-generated code (untrusted data) via the --allow_code_execution flag.
Ingestion points: Model outputs are loaded from generations.json for evaluation.
Boundary markers: None identified in the provided documentation to delimit code from instructions.
Capability inventory: The harness executes code in the local environment or Docker containers using Python and various language runtimes.
Sanitization: None mentioned; the tool relies on the assumption that the user will use isolated environments (like Docker) as suggested in the documentation.
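Because no sanitization is described, a structural check at the ingestion point is one inexpensive safeguard. The sketch below assumes that generations.json holds a JSON array with one inner array of candidate code strings per problem; that layout is an assumption, not something the analyzed documentation confirms. The `load_generations` function and the size cap are hypothetical.

```python
from __future__ import annotations

import json

# Hypothetical upper bound on the size of a single generated candidate.
MAX_CANDIDATE_CHARS = 20_000


def load_generations(path: str) -> list[list[str]]:
    """Load generations and reject anything that is not a list of lists of strings."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    for i, candidates in enumerate(data):
        if not isinstance(candidates, list):
            raise ValueError(f"problem {i}: expected an array of candidates")
        for j, code in enumerate(candidates):
            if not isinstance(code, str):
                raise ValueError(f"problem {i}, candidate {j}: expected a string")
            if len(code) > MAX_CANDIDATE_CHARS:
                raise ValueError(f"problem {i}, candidate {j}: candidate too large")
    return data
```

This validates shape and size only. It does not make the generated code safe to run, so isolation is still required when the candidates are executed.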
[Dynamic Execution] (MEDIUM): The core functionality of the tool is the dynamic execution of arbitrary strings as code. Although this is the intended purpose of a benchmarking harness, it remains a significant risk surface.
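To illustrate the kind of containment this finding calls for, the sketch below runs an untrusted string in a separate interpreter process with a wall-clock timeout and a throwaway working directory. It is not how the harness itself executes code, and `run_untrusted` is a hypothetical name. A child process is not a sandbox: it keeps the caller's privileges and network access, so it complements a container or VM and does not replace one.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run `code` in a separate Python process and capture its output.

    -I starts the interpreter in isolated mode (ignores PYTHON* environment
    variables and the user site directory). Raises subprocess.TimeoutExpired
    if the child runs longer than `timeout_s`; the child is killed in that case.
    """
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],
            cwd=workdir,          # relative file writes land in a temp directory
            capture_output=True,  # keep stdout/stderr out of the parent's streams
            text=True,
            timeout=timeout_s,
        )
```

A caller can inspect `returncode`, `stdout`, and `stderr` on the returned object and treat a `TimeoutExpired` exception as a failed test case.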

evaluating-code-models