skill-forge-benchmark
Pass
Audited by Gen Agent Trust Hub on Apr 8, 2026
Risk Level: SAFE | COMMAND_EXECUTION | PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill invokes `python scripts/aggregate_benchmark.py` using parameters from the benchmark configuration to summarize performance metrics.
- [PROMPT_INJECTION]: The skill demonstrates an indirect prompt injection surface by consuming external data from `evals/evals.json` and various run logs.
- Ingestion points: Reads evaluation sets and trial results (`grading.json`, `timing.json`) from the workspace.
- Boundary markers: None; the skill lacks specific delimiters or instructions to ignore embedded prompts in the data.
- Capability inventory: Executes shell-based Python scripts and orchestrates the activity of other sub-agents.
- Sanitization: There is no mention of input validation or content filtering for the ingested JSON files.
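
The missing boundary markers and sanitization noted above could be addressed along the following lines. This is a minimal sketch, not part of the audited skill: the function name `wrap_untrusted` and the marker strings are hypothetical, and the only assumption about the ingested files is that they contain JSON.

```python
import json

# Hypothetical delimiters for illustration; the audited skill defines none.
BEGIN_MARK = "<<<UNTRUSTED_DATA"
END_MARK = "UNTRUSTED_DATA>>>"


def wrap_untrusted(raw_text: str, source: str) -> str:
    """Validate raw_text as JSON and wrap it in boundary markers.

    `source` is a label supplied by the caller (for example a file name)
    and is assumed to be trusted.
    """
    # Reject anything that is not well-formed JSON (raises ValueError).
    parsed = json.loads(raw_text)

    # Re-serialize so that only normalized JSON reaches the prompt.
    body = json.dumps(parsed, ensure_ascii=True, sort_keys=True)

    # Angle brackets can only occur inside JSON strings, so replacing them
    # with \u escapes keeps the JSON valid and equivalent while preventing
    # the data from reproducing the END_MARK sequence.
    body = body.replace("<", "\\u003c").replace(">", "\\u003e")

    return f"{BEGIN_MARK} source={source}\n{body}\n{END_MARK}"
```

A caller would pass the text of a file such as `grading.json` through this function before placing it in a prompt, together with an instruction to treat everything between the markers as data rather than as instructions. Delimiters reduce the injection surface but do not eliminate it.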
Audit Metadata