Evals
Pass
Audited by Gen Agent Trust Hub on May 2, 2026
Risk Level: SAFECOMMAND_EXECUTION
Full Analysis
- [COMMAND_EXECUTION]: The skill employs
Bun.$inTools/AlgorithmBridge.ts,Graders/CodeBased/BinaryTests.ts, andGraders/CodeBased/StaticAnalysis.tsto execute shell commands. This is the primary mechanism for running test suites (e.g.,pytest,bun test), static analysis tools (e.g.,biome,tsc,ruff), and interacting with internal project management tools like THE ALGORITHM. This behavior is essential for the skill's stated purpose of evaluating code and agent behavior. - [COMMAND_EXECUTION]: The skill's main entry point (
SKILL.md) and all workflow instructions (e.g.,Workflows/RunEval.md,Workflows/CompareModels.md) mandate the execution of acurlcommand tohttp://localhost:8888/notifyupon invocation. This is used for local voice notifications and does not involve communication with external untrusted domains. - [PROMPT_INJECTION]: An indirect prompt injection surface exists in
Tools/FailureToTask.ts, which ingests data fromData/failures.jsonl. 1. Ingestion points:Data/failures.jsonl(parsed intoFailureLogobjects). 2. Boundary markers: Absent; descriptions and behaviors from the logs are directly interpolated into task descriptions and LLM-as-judge rubrics. 3. Capability inventory: Shell command execution viaBun.$in code-based graders. 4. Sanitization: Absent; descriptions are used verbatim. While this presents an attack surface where malicious input in a failure log could influence the agent during task generation, the risk is mitigated by the intended use case of processing internal project failures.
Audit Metadata