evaluating-llms-harness

Pass

Audited by Gen Agent Trust Hub on Mar 28, 2026

Risk Level: SAFE
Flagged capabilities: COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, EXTERNAL_DOWNLOADS, PROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill provides numerous command-line examples for the lm_eval tool and shows how to fold evaluation into training workflows via shell scripts and os.system() calls; a sketch of that pattern appears after this list.
  • [REMOTE_CODE_EXECUTION]: The documentation describes code-executing functionality, notably the --allow_code_execution flag for benchmarks like HumanEval and the !function YAML tag for loading custom Python logic from local utility files (see the illustrative pairing after this list).
  • [EXTERNAL_DOWNLOADS]: The instructions cover installing research-oriented Python packages from PyPI and retrieving model weights and datasets from trusted sources such as the HuggingFace Hub.
  • [PROMPT_INJECTION]: Documentation of execution-based benchmarks creates an Indirect Prompt Injection surface (a defensive sketch follows this list):
    1. Ingestion points: model-generated code from benchmarks like HumanEval.
    2. Boundary markers: none specified in the documentation.
    3. Capability inventory: Python code execution via the benchmarking framework when the execution flag is enabled.
    4. Sanitization: none mentioned, as the tool's purpose is to run generated output for verification.
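
To make the COMMAND_EXECUTION finding concrete, here is a minimal sketch of the pattern the skill documents: invoking the lm_eval CLI from a training workflow. The checkpoint path and task list are placeholders, and subprocess.run is used in place of the os.system() calls the skill shows, since it avoids shell interpolation of the checkpoint path.

```python
# Minimal sketch (assumptions: lm-eval installed via `pip install lm-eval`,
# and ./checkpoints/step_1000 is a placeholder HF-format checkpoint dir).
import subprocess

def evaluate_checkpoint(checkpoint_dir: str) -> None:
    """Run a small benchmark suite against a freshly saved checkpoint."""
    subprocess.run(
        [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={checkpoint_dir}",
            "--tasks", "hellaswag,arc_easy",
            "--batch_size", "8",
            "--output_path", "results",
        ],
        check=True,  # fail loudly if the evaluation run errors out
    )

# e.g. called after each checkpoint save:
# evaluate_checkpoint("./checkpoints/step_1000")
```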
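
The !function mechanism flagged under REMOTE_CODE_EXECUTION works by pointing a YAML task config at a Python callable in a local utility file. An illustrative pairing is sketched below; the metric, field names, and file name are assumptions, not taken from the skill, and the signature follows the lm-evaluation-harness convention of receiving a document and the model's results.

```python
# utils.py -- illustrative helper that a task YAML could reference via:
#   process_results: !function utils.process_results
# (hypothetical example; the harness imports and runs this local code)

def process_results(doc: dict, results: list) -> dict:
    """Score a single example: exact match against the gold answer."""
    prediction = results[0].strip()
    gold = doc["answer"].strip()  # "answer" is an assumed dataset field
    return {"exact_match": float(prediction == gold)}
```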
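
Since the documentation specifies no sanitization for model-generated code, any isolation is left to the operator. A minimal defensive sketch follows, assuming the operator wraps execution in a separate process with a hard timeout; this is not part of the harness itself.

```python
# Illustrative only: run untrusted, model-generated code in a separate
# process with a hard timeout. Real deployments should add OS-level
# isolation (containers, seccomp, no network) on top of this.
import subprocess
import sys

def run_untrusted(code: str, timeout_s: int = 5) -> bool:
    """Return True if the generated code ran to completion in time."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```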
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Mar 28, 2026, 06:07 PM