benchmark-agents
Warn
Audited by Socket on Mar 15, 2026
1 alert found:
Anomaly (LOW): SKILL.md
SUSPICIOUS: the skill's behavior largely matches its stated benchmarking purpose, but it materially expands agent authority by installing remote plugin code from GitHub, spawning autonomous interactive Claude sessions, and inspecting session logs and artifacts. The footprint is coherent for an internal eval harness, yet the trust placed in remotely installed code and the multi-agent execution model make it operationally high-risk rather than clearly malicious.
Confidence: 82%
Severity: 68%
Audit Metadata