evaluating-llms-harness

Pass

Audited by Gen Agent Trust Hub on Mar 28, 2026

Risk Level: SAFE
Flagged capabilities: COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, EXTERNAL_DOWNLOADS, PROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill provides numerous command-line examples for the lm_eval tool and shows how to fold evaluation into training workflows via shell scripts and os.system() calls; a sketch of that pattern appears after this list.
  • [REMOTE_CODE_EXECUTION]: The documentation describes code-executing functionality, notably the --allow_code_execution flag for benchmarks like HumanEval and the !function YAML tag for loading custom Python logic from local utility files (see the illustrative pairing after this list).
  • [EXTERNAL_DOWNLOADS]: The instructions cover installing research-oriented Python packages from PyPI and retrieving model weights and datasets from trusted sources such as the HuggingFace Hub.
  • [PROMPT_INJECTION]: Documentation of execution-based benchmarks creates an Indirect Prompt Injection surface (a defensive sketch follows this list):
    1. Ingestion points: model-generated code from benchmarks like HumanEval.
    2. Boundary markers: none specified in the documentation.
    3. Capability inventory: Python code execution via the benchmarking framework when the execution flag is enabled.
    4. Sanitization: none mentioned, as the tool's purpose is to run generated output for verification.
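
To make the COMMAND_EXECUTION finding concrete, here is a minimal sketch of the pattern the skill documents: invoking the lm_eval CLI from a training workflow. The checkpoint path and task list are placeholders, and subprocess.run is used in place of the os.system() calls the skill shows, since it avoids shell interpolation of the checkpoint path.

```python
# Minimal sketch (assumptions: lm-eval installed via `pip install lm-eval`,
# and ./checkpoints/step_1000 is a placeholder HF-format checkpoint dir).
import subprocess

def evaluate_checkpoint(checkpoint_dir: str) -> None:
    """Run a small benchmark suite against a freshly saved checkpoint."""
    subprocess.run(
        [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={checkpoint_dir}",
            "--tasks", "hellaswag,arc_easy",
            "--batch_size", "8",
            "--output_path", "results",
        ],
        check=True,  # fail loudly if the evaluation run errors out
    )

# e.g. called after each checkpoint save:
# evaluate_checkpoint("./checkpoints/step_1000")
```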
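
The !function mechanism flagged under REMOTE_CODE_EXECUTION works by pointing a YAML task config at a Python callable in a local utility file. An illustrative pairing is sketched below; the metric, field names, and file name are assumptions, not taken from the skill, and the signature follows the lm-evaluation-harness convention of receiving a document and the model's results.

```python
# utils.py -- illustrative helper that a task YAML could reference via:
#   process_results: !function utils.process_results
# (hypothetical example; the harness imports and runs this local code)

def process_results(doc: dict, results: list) -> dict:
    """Score a single example: exact match against the gold answer."""
    prediction = results[0].strip()
    gold = doc["answer"].strip()  # "answer" is an assumed dataset field
    return {"exact_match": float(prediction == gold)}
```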
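
Since the documentation specifies no sanitization for model-generated code, any isolation is left to the operator. A minimal defensive sketch follows, assuming the operator wraps execution in a separate process with a hard timeout; this is not part of the harness itself.

```python
# Illustrative only: run untrusted, model-generated code in a separate
# process with a hard timeout. Real deployments should add OS-level
# isolation (containers, seccomp, no network) on top of this.
import subprocess
import sys

def run_untrusted(code: str, timeout_s: int = 5) -> bool:
    """Return True if the generated code ran to completion in time."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```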
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Mar 28, 2026, 06:07 PM