compare-agents
Pass
Audited by Gen Agent Trust Hub on Apr 29, 2026
Risk Level: SAFECOMMAND_EXECUTIONEXTERNAL_DOWNLOADSDATA_EXFILTRATIONPROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill instructs the agent to generate and execute local Python (
evaluate.py) and TypeScript (evaluate.ts) scripts to run theevaluatorqrunner. This involves shell access to install dependencies and initiate the evaluation process. - [EXTERNAL_DOWNLOADS]: The skill requires the installation of external packages
evaluatorqandorq-ai-sdkfrom PyPI, and@orq-ai/evaluatorqfrom NPM. These are official packages provided by the skill author (orq-ai) for agent evaluation. - [DATA_EXFILTRATION]: When configured with an
ORQ_API_KEY, the evaluation script transmits datapoints, agent inputs, and agent responses to the orq.ai platform (api.orq.ai). This is the intended functional behavior for visualizing experiment results. - [PROMPT_INJECTION]: The skill implements an 'LLM-as-a-judge' pattern where an evaluator agent processes the outputs of compared agents. This creates a surface for indirect prompt injection where a compared agent could manipulate the evaluator's output or logic.
- Ingestion points: Agent responses and dataset inputs are processed as variables in the evaluation scripts (see
resources/evaluatorq-api.md). - Boundary markers: The provided templates do not use specific delimiters or instructions to ignore potential commands embedded within agent responses.
- Capability inventory: The generated scripts and the
evaluatorqrunner have the capability to execute shell commands and perform network operations as detailed in theSKILL.mdandresources/job-patterns.mdfiles. - Sanitization: No explicit sanitization or filtering of the agent's output is performed before it is passed to the LLM-as-a-judge evaluator.
Audit Metadata