evaluation-metrics
Audited by Socket on Mar 1, 2026
1 alert found:
Security
This skill is a benign-looking LLM evaluation and benchmarking framework. The primary security concern is functional: the human-eval benchmark executes model-generated code (arbitrary code execution), and the framework uses raw LLM outputs directly in numeric or logical contexts without validation. These behaviors are expected for evaluation tasks but are high-risk when models or inputs are untrusted or when execution is not sandboxed. There are no clear signs of credential harvesting, obfuscation, or deliberate exfiltration in the provided code.

Mitigations:
- Run code execution in strong isolation (containers/sandboxes).
- Validate and sanitize model outputs before numeric conversion.
- Ensure the model client and dataset loaders are configured to respect privacy and not send sensitive test data to third parties.