benchmark-skills
Pass
Audited by Gen Agent Trust Hub on Mar 10, 2026
Risk Level: SAFECOMMAND_EXECUTIONPROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill uses the
bun run benchmarkcommand to execute its testing harness. This allows for local code execution within the agent's environment, which is necessary for the skill's stated purpose of benchmarking performance. - [PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection because it processes content from untrusted
evals.jsonfiles and uses an LLM to judge the output, potentially causing the model to obey instructions embedded in the test data. - Ingestion points: Untrusted data enters the system context via user-provided
evals/evals.jsonfiles referenced in the documentation. - Boundary markers: The documentation does not provide instructions for using delimiters or boundary markers to isolate the evaluation data from the agent's instructions.
- Capability inventory: The skill can execute shell commands via
bunand likely performs network requests to external LLM APIs for the judging process. - Sanitization: No sanitization, validation, or escaping of the user-provided prompt content or assertions is described.
Audit Metadata