benchmark-datasets

Pass

Audited by Gen Agent Trust Hub on Mar 18, 2026

Risk Level: SAFEPROMPT_INJECTIONREMOTE_CODE_EXECUTIONDATA_EXFILTRATION
Full Analysis
  • [PROMPT_INJECTION]: The skill facilitates the processing of adversarial prompts and jailbreak attempts (e.g., 'DAN', roleplay) for model evaluation, creating an indirect prompt injection surface.
  • Ingestion points: Various benchmark datasets like HarmBench and JailbreakBench referenced in the SKILL.md.
  • Boundary markers: No explicit delimiters or instructions to ignore embedded commands are shown in the illustrative code.
  • Capability inventory: Interaction with models via model.generate and similar generation functions.
  • Sanitization: The provided examples do not include content sanitization or validation of the benchmark data.
  • [REMOTE_CODE_EXECUTION]: The skill references established Python packages autoattack and textattack for evaluating model robustness against adversarial attacks.
  • [DATA_EXFILTRATION]: The skill includes links to official repositories and websites for AI safety research, such as harmbench.org and robustbench.github.io.
Audit Metadata
Risk Level
SAFE
Analyzed
Mar 18, 2026, 07:13 AM