evaluating-llms-harness

Pass

Audited by Gen Agent Trust Hub on Apr 27, 2026

Risk Level: SAFE
Categories flagged for review: COMMAND_EXECUTION, REMOTE_CODE_EXECUTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill documentation provides Python and shell script templates that use os.system and subprocess to automate benchmark runs and to track training progress via model checkpoints. These are standard automation practices in machine learning development workflows.
  • [REMOTE_CODE_EXECUTION]: The instructions describe the legitimate use of the allow_code_execution flag for benchmarks like HumanEval. This feature is a core component of the lm-evaluation-harness library used to verify the functional correctness of model-generated code by executing it in a controlled environment.
  • [DATA_EXFILTRATION]: The skill mentions an option to disable SSL certificate verification (verify_certificate=false) when connecting to local API endpoints. This is explicitly labeled for development use only and is a common configuration for testing local inference servers like Ollama or vLLM.
  • [CREDENTIALS_UNSAFE]: The documentation demonstrates how to securely handle API keys (OpenAI, Anthropic) using environment variables (e.g., export OPENAI_API_KEY=sk-...) rather than hardcoding them into scripts, following security best practices.
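The automation pattern described above (driving benchmark runs via subprocess while reading API keys from environment variables rather than hardcoding them) can be sketched as follows. The exact lm_eval CLI flags shown (`--model`, `--tasks`) are assumptions based on the lm-evaluation-harness command-line interface; verify them against the library's own documentation before use.

```python
import os
import subprocess


def build_eval_command(model: str, tasks: list[str]) -> list[str]:
    """Assemble an lm_eval invocation as an argument list.

    Flag names are assumptions modeled on the lm-evaluation-harness CLI;
    building a list (rather than a shell string) avoids shell-injection
    issues that os.system invocations can introduce.
    """
    return ["lm_eval", "--model", model, "--tasks", ",".join(tasks)]


def run_eval(model: str, tasks: list[str]) -> None:
    """Run a benchmark, requiring the API key via the environment.

    Following the practice the audit highlights, the key is never
    hardcoded; the caller must `export OPENAI_API_KEY=...` first.
    """
    if "OPENAI_API_KEY" not in os.environ:
        raise RuntimeError("Set OPENAI_API_KEY in the environment first")
    subprocess.run(build_eval_command(model, tasks), check=True)
```

Passing the command as a list to `subprocess.run` keeps arguments out of the shell entirely, which is one reason such templates are considered routine automation rather than a command-injection risk.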
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Apr 27, 2026, 07:07 AM