benchmark-skills

Pass

Audited by Gen Agent Trust Hub on Mar 10, 2026

Risk Level: SAFECOMMAND_EXECUTIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill uses the bun run benchmark command to execute its testing harness. This allows for local code execution within the agent's environment, which is necessary for the skill's stated purpose of benchmarking performance.
  • [PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection because it processes content from untrusted evals.json files and uses an LLM to judge the output, potentially causing the model to obey instructions embedded in the test data.
  • Ingestion points: Untrusted data enters the system context via user-provided evals/evals.json files referenced in the documentation.
  • Boundary markers: The documentation does not provide instructions for using delimiters or boundary markers to isolate the evaluation data from the agent's instructions.
  • Capability inventory: The skill can execute shell commands via bun and likely performs network requests to external LLM APIs for the judging process.
  • Sanitization: No sanitization, validation, or escaping of the user-provided prompt content or assertions is described.
Audit Metadata
Risk Level
SAFE
Analyzed
Mar 10, 2026, 03:57 AM