evaluation-metrics

Warn

Audited by Gen Agent Trust Hub on Mar 1, 2026

Risk Level: MEDIUMCOMMAND_EXECUTIONPROMPT_INJECTIONEXTERNAL_DOWNLOADS
Full Analysis
  • [COMMAND_EXECUTION]: The evaluate_humaneval function in SKILL.md utilizes the human_eval.execution.check_correctness method. This method is designed to execute Python code generated by a language model to verify its functional correctness. Executing unverified model output on a host system poses a significant security risk, as a malicious or compromised model could generate code that performs unauthorized file access, network operations, or other harmful actions.- [PROMPT_INJECTION]: Several components, including RAGMetrics and HallucinationDetector, are vulnerable to Indirect Prompt Injection. These components interpolate untrusted data (such as model predictions and retrieved contexts) directly into instructions given to another LLM without proper sanitization.
  • Ingestion points: The variables prediction, context, and answer used in SKILL.md and scripts/llm_evaluator.py are populated from potentially untrusted external sources.
  • Boundary markers: The evaluation prompts lack clear delimiters (like XML tags or triple quotes) and instructions to ignore any commands embedded within the data variables.
  • Capability inventory: The skill possesses the capability to generate text via an LLM and, critically, execute code via the HumanEval benchmark logic.
  • Sanitization: There is no evidence of filtering, escaping, or validation of the input data before it is interpolated into the evaluation prompts.- [EXTERNAL_DOWNLOADS]: The skill fetches evaluation metric scripts and pre-trained models (such as BERTScore models) from Hugging Face's official repositories using the evaluate and transformers libraries.
Audit Metadata
Risk Level
MEDIUM
Analyzed
Mar 1, 2026, 07:47 PM