evaluating-llms-harness
Pass
Audited by Gen Agent Trust Hub on Mar 28, 2026
Risk Level: SAFE
Flags: COMMAND_EXECUTION · REMOTE_CODE_EXECUTION · EXTERNAL_DOWNLOADS · PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill provides numerous command-line examples for the lm_eval tool and demonstrates how to integrate evaluation into training workflows using shell scripts and os.system() calls.
- [REMOTE_CODE_EXECUTION]: The documentation describes functionality involving code execution, notably the --allow_code_execution flag for benchmarks like HumanEval and the !function YAML tag for loading custom Python logic from local utility files.
- [EXTERNAL_DOWNLOADS]: The instructions cover installing research-oriented Python packages from PyPI and retrieving model weights and datasets from trusted sources such as the HuggingFace Hub.
- [PROMPT_INJECTION]: Documentation of execution-based benchmarks creates an indirect prompt-injection surface.
  1. Ingestion points: model-generated code from benchmarks like HumanEval.
  2. Boundary markers: none specified in the documentation.
  3. Capability inventory: Python code execution via the benchmarking framework when the execution flag is enabled.
  4. Sanitization: none mentioned, since the tool's purpose is to run generated output for verification.
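The execution surface described above hinges on whether the code-execution flag is passed at all. A minimal sketch of the opt-in pattern the analysis implies: the helper name `build_eval_command` and its argument shape are hypothetical, and the --allow_code_execution flag name is the one cited in the analysis above, not a guaranteed part of every harness version.

```python
# Hypothetical wrapper illustrating the opt-in execution surface.
# Nothing here is the skill's actual code; it sketches how a caller
# might keep execution-based benchmarks behind an explicit flag.
def build_eval_command(model_path, tasks, allow_code_execution=False):
    """Build an lm_eval-style CLI invocation as an argument list.

    Execution-based benchmarks (e.g. HumanEval) only receive the
    code-execution flag when the caller explicitly opts in.
    """
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", ",".join(tasks),
    ]
    if allow_code_execution:
        # Running model-generated code is the injection surface noted
        # above: gate it behind an explicit flag, never a default.
        cmd.append("--allow_code_execution")
    return cmd

# Safe default: a log-likelihood task gets no execution flag.
print(build_eval_command("my-model", ["hellaswag"]))
```

Passing the command as an argument list (rather than a shell string) also avoids shell-quoting pitfalls when the model path or task names come from configuration.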
Audit Metadata