Evals
Pass
Audited by Gen Agent Trust Hub on Mar 5, 2026
Risk Level: SAFE
Capabilities: COMMAND_EXECUTION
Full Analysis
- [COMMAND_EXECUTION]: The skill uses shell commands to perform automated testing and integration tasks.
  - `BinaryTestsGrader.ts` executes test commands (e.g., `pytest`, `bun test`) against local codebases to verify functionality.
  - `StaticAnalysisGrader.ts` runs analysis tools such as linters and type-checkers to assess code quality.
  - `AlgorithmBridge.ts` executes CLI commands to interact with the `THEALGORITHM` skill for reporting results.
- [DATA_INGESTION]: The skill manages its operational state and configuration through local file access.
  - It reads task configurations, evaluation suites, and agent transcripts from the local filesystem to provide context for grading.
  - It maintains a local failure log in `Data/failures.jsonl` to track and convert agent errors into test cases.
- [PROMPT_INJECTION]: The skill's use of LLM-based grading introduces a surface for indirect prompt injection from the content being evaluated.
  - Ingestion points: `LLMRubricGrader` and `NaturalLanguageAssertGrader` receive untrusted output from other agent runs as input for grading.
  - Boundary markers: The prompts used for LLM judges do not implement strict delimiters to separate the grading instructions from the content being analyzed.
  - Capability inventory: While the skill can execute shell commands, these are triggered by deterministic logic in the runner based on task definitions, not directly by the LLM judge's output.
  - Sanitization: Content under evaluation is passed to the LLM judge without prior sanitization or escaping.
Audit Metadata