evaluation-metrics

Warn

Audited by Socket on Mar 1, 2026

1 alert found:

Security (MEDIUM)
SKILL.md

This skill is a benign-looking LLM evaluation and benchmarking framework. The primary security concern is functional: the human-eval benchmark executes model-generated code (arbitrary code execution), and the framework uses raw LLM outputs directly in numeric or logical contexts without validation. These behaviors are expected for evaluation tasks, but they are high-risk when models or inputs are untrusted or when execution is not sandboxed. The provided code shows no clear signs of credential harvesting, obfuscation, or deliberate exfiltration.

Mitigations:

- Run model-generated code in strong isolation (containers or sandboxes).
- Validate and sanitize model outputs before numeric conversion.
- Configure the model client and dataset loaders to respect privacy and avoid sending sensitive test data to third parties.
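As a minimal sketch of the "validate before numeric conversion" mitigation, the helper below rejects any LLM output that is not a bare number within an expected range, instead of calling `float()` on untrusted text. `parse_model_score` is a hypothetical name for illustration, not part of the audited skill's code.

```python
import re

def parse_model_score(raw: str, lo: float = 0.0, hi: float = 1.0):
    """Extract a numeric score from raw LLM output, or return None.

    Hypothetical validation helper: accepts only a bare number
    (after stripping whitespace) that falls inside [lo, hi].
    Chatty output like "Sure! The score is 0.85" is rejected
    rather than partially parsed.
    """
    match = re.fullmatch(r"[-+]?\d+(?:\.\d+)?", raw.strip())
    if match is None:
        return None  # not a bare number: refuse to convert
    value = float(match.group(0))
    if not (lo <= value <= hi):
        return None  # numeric but outside the expected range
    return value
```

A stricter variant could also log rejected outputs for later inspection; the point is that untrusted model text never reaches `float()` or downstream arithmetic without an explicit format and range check.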

Confidence: 75%
Severity: 75%
Audit Metadata
Analyzed At
Mar 1, 2026, 07:48 PM
Package URL
pkg:socket/skills-sh/pluginagentmarketplace%2Fcustom-plugin-ai-engineer%2Fevaluation-metrics%2F@92d01b690b1c056b7e22e6ac7289ba886b856db3