llm-evaluation
Pass
Audited by Gen Agent Trust Hub on Apr 17, 2026
Risk Level: SAFE
Full Analysis
- [SAFE]: The skill provides educational content and Python code snippets for calculating common NLP metrics such as BLEU, ROUGE, and BERTScore using established libraries.
- [SAFE]: Dependencies referenced in code snippets, including nltk, scikit-learn, scipy, transformers, and detoxify, are standard, well-known libraries in the data science and machine learning ecosystem.
- [SAFE]: All network-related patterns (e.g., using the OpenAI API or Hugging Face transformers) represent standard integration practices for LLM development and evaluation.
- [SAFE]: No patterns indicative of prompt injection, data exfiltration, credential harvesting, or malicious persistence were found in the skill's instructions or implementation examples.
- [SAFE]: The skill uses Microsoft's well-known DeBERTa models for evaluation tasks, a standard practice in the industry.
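For context on the findings above: the metric snippets audited presumably follow the standard nltk pattern shown in this minimal sketch. The example sentences and variable names here are hypothetical illustrations, not code taken from the skill itself.

```python
# Hypothetical sketch of a sentence-level BLEU computation with nltk,
# the kind of snippet the audit findings describe; example data is made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # tokenized hypothesis

# Smoothing prevents a zero score when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
```

Snippets of this shape only read their own in-memory strings and perform no network or filesystem access, which is consistent with the SAFE findings.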
Audit Metadata