evaluating-code-models

Audit result: Fail

Audited by Gen Agent Trust Hub on Feb 16, 2026

Risk Level: HIGH
Tags: REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, EXTERNAL_DOWNLOADS
Full Analysis
  • [Indirect Prompt Injection] (HIGH): The documentation extensively promotes the use of the --allow_code_execution flag across multiple benchmarks (HumanEval, MBPP, DS-1000, etc.) in references/benchmarks.md. This flag enables the execution of code generated by LLMs, which is a significant security risk if a model is poisoned or produces malicious output.
  • Ingestion points: External code generated by LLMs during the benchmarking process.
  • Boundary markers: None specified in the instructions; the harness executes the generated code directly.
  • Capability inventory: Full Python/shell execution on the host system depending on the benchmark and language.
  • Sanitization: None provided; the harness relies on the user to provide a safe environment (an isolation sketch follows this list).
  • [Dynamic Execution] (HIGH): In references/issues.md, the guide suggests using the --trust_remote_code flag to resolve model loading issues. This flag allows the HuggingFace transformers library to execute arbitrary Python code contained in the model's repository, potentially leading to Remote Code Execution (RCE) if the model source is untrusted (a pin-and-inspect sketch follows this list).
  • [Remote Code Execution] (MEDIUM): The skill instructs users to install various Python packages and run Docker containers pulled from external sources (a pinning sketch follows this list).
  • Evidence: pip install lctk sortedcontainers and docker pull ghcr.io/bigcode-project/evaluation-harness-multiple in references/benchmarks.md.
  • [Privilege Escalation] (LOW): The troubleshooting guide in references/issues.md builds Docker images with sudo (sudo make DOCKERFILE=Dockerfile all), which grants root privileges to the build process (a rootless-build sketch follows this list).
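
Isolation sketch for --allow_code_execution. This is a minimal, hedged example of the "safe environment" the first finding says the user must provide: it reuses the container image named in references/benchmarks.md, disables networking, and runs as a non-root user. The mount path, the container entrypoint, and every harness flag other than --allow_code_execution are assumptions, not taken from the skill itself.

    # Evaluate previously generated solutions inside an isolated container.
    # Assumption: the image entrypoint invokes the harness and accepts these flags.
    docker run --rm --network none --user 1000:1000 \
        -v "$PWD/generations.json:/data/generations.json:ro" \
        ghcr.io/bigcode-project/evaluation-harness-multiple \
        --tasks humaneval \
        --allow_code_execution \
        --load_generations_path /data/generations.json

Disabling the network also limits what prompt-injected or otherwise malicious generated code can reach beyond the container.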
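
Pin-and-inspect sketch for --trust_remote_code. Before enabling the flag, the custom modeling code it would execute can be downloaded at a fixed revision and reviewed. A minimal sketch assuming the huggingface_hub CLI is available; ORG/MODEL and COMMIT_SHA are placeholders.

    # Fetch an exact revision of the model repository without loading the model
    pip install --upgrade "huggingface_hub[cli]"
    huggingface-cli download ORG/MODEL --revision COMMIT_SHA --local-dir ./model-snapshot
    # These Python files are what trust_remote_code would import and execute
    find ./model-snapshot -name '*.py' -print

Loading the model with the same pinned revision keeps the reviewed code and the executed code identical.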
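
Pinning sketch for the external downloads. Version pins and image digests tie what runs locally to what was reviewed. The package and image names come from references/benchmarks.md; X.Y.Z, A.B.C, and DIGEST are placeholders to fill in after review.

    # Pin Python dependencies to reviewed versions
    pip install lctk==X.Y.Z sortedcontainers==A.B.C
    # Pull the evaluation image by immutable digest instead of a mutable tag
    docker pull ghcr.io/bigcode-project/evaluation-harness-multiple@sha256:DIGEST

For stricter verification, pip can also enforce hash checking with a requirements file and --require-hashes.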
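
Rootless-build sketch for the sudo step. The same Makefile target can be built without root by pointing Docker at a per-user rootless daemon. A minimal sketch assuming docker-ce-rootless-extras is installed; the make target is the one named in references/issues.md.

    # One-time setup of a rootless Docker daemon for the current user
    dockerd-rootless-setuptool.sh install
    export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
    # Build the image without sudo
    make DOCKERFILE=Dockerfile all
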
Recommendations
  • The automated audit detected serious security risks. Review the findings above before using this skill: run code executed under --allow_code_execution only in an isolated environment, enable --trust_remote_code only for model sources that have been reviewed, and pin or verify externally downloaded packages and container images.
Audit Metadata
  • Risk Level: HIGH
  • Analyzed: Feb 16, 2026, 04:04 AM