llm-evaluation

Pass

Audited by Gen Agent Trust Hub on Apr 17, 2026

Risk Level: SAFE
Full Analysis
  • [SAFE]: The skill provides educational content and Python code snippets for calculating common NLP metrics such as BLEU, ROUGE, and BERTScore using established libraries.
  • [SAFE]: Dependencies referenced in code snippets, including nltk, scikit-learn, scipy, transformers, and detoxify, are standard, well-known libraries in the data science and machine learning ecosystem.
  • [SAFE]: All network-related patterns (e.g., using the OpenAI API or Hugging Face transformers) represent standard integration practices for LLM development and evaluation.
  • [SAFE]: No patterns indicative of prompt injection, data exfiltration, credential harvesting, or malicious persistence were found in the skill's instructions or implementation examples.
  • [SAFE]: The skill uses well-known Microsoft models (DeBERTa) as scoring backbones for evaluation tasks, a standard industry practice.
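To illustrate the kind of metric code the audited skill contains, the following is a minimal, hand-rolled sketch of sentence-level BLEU using only the Python standard library. This is an illustration written for this report, not code taken from the skill: the skill's own snippets rely on established libraries such as nltk, whose `sentence_bleu` also provides smoothing for short or low-overlap hypotheses.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    No smoothing in this sketch: any zero precision yields 0.0."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
hypothesis = "the cat is on the mat".split()
print(sentence_bleu(reference, hypothesis))  # identical sentences score 1.0
```

In practice the skill's approach of delegating to nltk (`nltk.translate.bleu_score.sentence_bleu`) is preferable, since the library handles smoothing, multiple references, and configurable n-gram weights.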
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Apr 17, 2026, 01:49 PM