compare-agents

Pass

Audited by Gen Agent Trust Hub on Apr 29, 2026

Risk Level: SAFECOMMAND_EXECUTIONEXTERNAL_DOWNLOADSDATA_EXFILTRATIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill instructs the agent to generate and execute local Python (evaluate.py) and TypeScript (evaluate.ts) scripts to run the evaluatorq runner. This involves shell access to install dependencies and initiate the evaluation process.
  • [EXTERNAL_DOWNLOADS]: The skill requires the installation of external packages evaluatorq and orq-ai-sdk from PyPI, and @orq-ai/evaluatorq from NPM. These are official packages provided by the skill author (orq-ai) for agent evaluation.
  • [DATA_EXFILTRATION]: When configured with an ORQ_API_KEY, the evaluation script transmits datapoints, agent inputs, and agent responses to the orq.ai platform (api.orq.ai). This is the intended functional behavior for visualizing experiment results.
  • [PROMPT_INJECTION]: The skill implements an 'LLM-as-a-judge' pattern where an evaluator agent processes the outputs of compared agents. This creates a surface for indirect prompt injection where a compared agent could manipulate the evaluator's output or logic.
  • Ingestion points: Agent responses and dataset inputs are processed as variables in the evaluation scripts (see resources/evaluatorq-api.md).
  • Boundary markers: The provided templates do not use specific delimiters or instructions to ignore potential commands embedded within agent responses.
  • Capability inventory: The generated scripts and the evaluatorq runner have the capability to execute shell commands and perform network operations as detailed in the SKILL.md and resources/job-patterns.md files.
  • Sanitization: No explicit sanitization or filtering of the agent's output is performed before it is passed to the LLM-as-a-judge evaluator.
Audit Metadata
Risk Level
SAFE
Analyzed
Apr 29, 2026, 01:37 PM