skills/steffen025/pai-opencode/Evals/Gen Agent Trust Hub

Evals

Pass

Audited by Gen Agent Trust Hub on Mar 5, 2026

Risk Level: SAFECOMMAND_EXECUTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill utilizes shell commands to perform automated testing and integration tasks.\n
  • BinaryTestsGrader.ts executes test commands (e.g., pytest, bun test) against local codebases to verify functionality.\n
  • StaticAnalysisGrader.ts runs analysis tools like linters and type-checkers to assess code quality.\n
  • AlgorithmBridge.ts executes CLI commands to interact with the THEALGORITHM skill for reporting results.\n- [DATA_INGESTION]: The skill manages its operational state and configurations through local file access.\n
  • It reads task configurations, evaluation suites, and agent transcripts from the local filesystem to provide context for grading.\n
  • It maintains a local failure log in Data/failures.jsonl to track and convert agent errors into test cases.\n- [PROMPT_INJECTION]: The skill's use of LLM-based grading introduces a surface for indirect prompt injection from the content being evaluated.\n
  • Ingestion points: The LLMRubricGrader and NaturalLanguageAssertGrader receive untrusted output from other agent runs as input for grading.\n
  • Boundary markers: The prompts used for LLM judges do not implement strict delimiters to separate the grading instructions from the content being analyzed.\n
  • Capability inventory: While the skill can execute shell commands, these are triggered by deterministic logic in the runner based on task definitions, not directly by the LLM judge's output.\n
  • Sanitization: Content under evaluation is passed to the LLM judge without prior sanitization or escaping.
Audit Metadata
Risk Level
SAFE
Analyzed
Mar 5, 2026, 07:38 AM