skill-forge-benchmark

Pass

Audited by Gen Agent Trust Hub on Apr 8, 2026

Risk Level: SAFECOMMAND_EXECUTIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill invokes python scripts/aggregate_benchmark.py using parameters from the benchmark configuration to summarize performance metrics.
  • [PROMPT_INJECTION]: The skill demonstrates an indirect prompt injection surface by consuming external data from evals/evals.json and various run logs.
  • Ingestion points: Reads evaluation sets and trial results (grading.json, timing.json) from the workspace.
  • Boundary markers: None; the skill lacks specific delimiters or instructions to ignore embedded prompts in the data.
  • Capability inventory: Executes shell-based Python scripts and orchestrates the activity of other sub-agents.
  • Sanitization: There is no mention of input validation or content filtering for the ingested JSON files.
Audit Metadata
Risk Level
SAFE
Analyzed
Apr 8, 2026, 02:59 AM