evaluating-code-models

Audit verdict: Fail

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: HIGH
Findings: REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS
Full Analysis
  • [REMOTE_CODE_EXECUTION] (HIGH): The references/issues.md file encourages the use of the --trust_remote_code flag. This flag allows the execution of arbitrary Python code defined in a HuggingFace model's repository, which is a critical security risk when interacting with untrusted model sources.
  • [COMMAND_EXECUTION] (HIGH): The references/issues.md file instructs users to run sudo make for building Docker images. Granting administrative privileges to a build process for external code is a high-risk operation.
  • [COMMAND_EXECUTION] (HIGH): The skill repeatedly promotes the use of the --allow_code_execution flag (e.g., in references/benchmarks.md). While necessary for benchmarking, this capability allows the execution of arbitrary, unvetted code generated by an AI model directly on the user's system or within a container.
  • [EXTERNAL_DOWNLOADS] (MEDIUM): The documentation guides users to download Docker images from ghcr.io/bigcode-project and install packages from various indices. These are non-whitelisted external sources that represent a supply chain risk.
  • [INDIRECT_PROMPT_INJECTION] (LOW): The harness is designed to ingest external datasets from HuggingFace and execute code based on that data. This creates a surface where a malicious dataset could compromise the system through the execution capability.
  • Ingestion points: Data loaded from HuggingFace via the --tasks argument in references/benchmarks.md.
  • Boundary markers: None identified; code is executed directly as part of the benchmark.
  • Capability inventory: Execution of arbitrary code on the host or in Docker via accelerate launch main.py --allow_code_execution.
  • Sanitization: None; the tool's purpose is to execute the generated output for testing.
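Since the harness itself performs no sanitization, containment has to happen at the point of execution. A minimal sketch of one mitigation, running generated code in a separate, isolated interpreter process with a timeout and a stripped environment (the helper name and timeout value are illustrative, not part of the audited skill, and this reduces rather than eliminates the risk):

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> subprocess.CompletedProcess:
    """Execute model-generated code in a child interpreter with a
    wall-clock timeout, an empty environment, and a throwaway working
    directory, so hangs and inherited secrets are contained."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I runs Python in isolated mode: ignores PYTHON* env vars
            # and the user site-packages directory.
            [sys.executable, "-I", "-c", code],
            cwd=workdir,          # scratch directory, deleted afterwards
            env={},               # no inherited PATH, tokens, or credentials
            capture_output=True,
            text=True,
            timeout=timeout,      # raises subprocess.TimeoutExpired on a hang
        )

result = run_untrusted("print(2 + 2)")
print(result.stdout.strip())  # prints: 4
```

For stronger isolation this subprocess should itself run inside a network-disabled container, since a plain child process still shares the host filesystem and network.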
Recommendations
  • Run benchmark code execution only inside an isolated container (no network, non-root user), never directly on the host.
  • Avoid --trust_remote_code except for model repositories that have been manually vetted.
  • Do not build Docker images with sudo; use a rootless or otherwise unprivileged build.
  • Pin Docker images and package indices to specific digests and versions to reduce supply chain risk.
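The --trust_remote_code concern can be narrowed by gating the flag behind an explicit allowlist rather than passing True unconditionally. A minimal sketch (the helper name and the example repository entry are illustrative, not part of the audited skill):

```python
# Repositories whose custom modeling code has been manually reviewed.
# Maintain this set deliberately; it is an example, not a recommendation
# of any specific repository.
TRUSTED_REPOS = {"bigcode/starcoder"}

def trust_remote_code_for(repo_id: str) -> bool:
    """Return True only for manually vetted repositories.

    Pass the result as the trust_remote_code argument when loading a
    model, instead of hardcoding True for every repository.
    """
    return repo_id in TRUSTED_REPOS

print(trust_remote_code_for("bigcode/starcoder"))   # prints: True
print(trust_remote_code_for("unvetted/model"))      # prints: False
```

This keeps the dangerous capability opt-in per repository, so adding a new model source forces a conscious review step.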
Audit Metadata
  • Risk Level: HIGH
  • Analyzed: Feb 17, 2026, 04:54 PM