llm-evaluation

Pass

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: SAFEPROMPT_INJECTIONEXTERNAL_DOWNLOADS
Full Analysis
  • Indirect Prompt Injection (LOW): Vulnerability identified where untrusted content can manipulate the results of the 'judge' model.\n
  • Ingestion points: Untrusted data enters via parameters such as input_text, output_text, and context in scripts/evaluator-template.py and examples/evaluation-patterns.md.\n
  • Boundary markers: Absent. The code does not use delimiters (e.g., XML tags or triple quotes) or explicit 'ignore embedded instructions' warnings to isolate the text being evaluated.\n
  • Capability inventory: Evaluation scores produced by LLM calls directly drive 'Quality Gate' logic and model preference selection, which affects downstream application behavior.\n
  • Sanitization: Absent. Data is used directly in f-strings with only simple length truncation for token management, offering no protection against instruction hijacking.\n- External Downloads (SAFE): The skill utilizes standard, industry-standard libraries for its stated purpose.\n
  • Evidence: Mentions and utilizes ragas, numpy, scipy, langsmith, langfuse, and datasets for evaluation metrics and observability tracking.
Audit Metadata
Risk Level
SAFE
Analyzed
Feb 17, 2026, 06:11 PM