Evals

Pass

Audited by Gen Agent Trust Hub on May 2, 2026

Risk Level: SAFECOMMAND_EXECUTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill employs Bun.$ in Tools/AlgorithmBridge.ts, Graders/CodeBased/BinaryTests.ts, and Graders/CodeBased/StaticAnalysis.ts to execute shell commands. This is the primary mechanism for running test suites (e.g., pytest, bun test), static analysis tools (e.g., biome, tsc, ruff), and interacting with internal project management tools like THE ALGORITHM. This behavior is essential for the skill's stated purpose of evaluating code and agent behavior.
  • [COMMAND_EXECUTION]: The skill's main entry point (SKILL.md) and all workflow instructions (e.g., Workflows/RunEval.md, Workflows/CompareModels.md) mandate the execution of a curl command to http://localhost:8888/notify upon invocation. This is used for local voice notifications and does not involve communication with external untrusted domains.
  • [PROMPT_INJECTION]: An indirect prompt injection surface exists in Tools/FailureToTask.ts, which ingests data from Data/failures.jsonl. 1. Ingestion points: Data/failures.jsonl (parsed into FailureLog objects). 2. Boundary markers: Absent; descriptions and behaviors from the logs are directly interpolated into task descriptions and LLM-as-judge rubrics. 3. Capability inventory: Shell command execution via Bun.$ in code-based graders. 4. Sanitization: Absent; descriptions are used verbatim. While this presents an attack surface where malicious input in a failure log could influence the agent during task generation, the risk is mitigated by the intended use case of processing internal project failures.
Audit Metadata
Risk Level
SAFE
Analyzed
May 2, 2026, 01:03 AM