safety-filter-bypass

Fail

Audited by Gen Agent Trust Hub on Feb 16, 2026

Risk Level: HIGH
Primary Threat: PROMPT_INJECTION
Full Analysis
  • [PROMPT_INJECTION] (HIGH): The skill provides specific frameworks (fictional framing, educational context) and technical methods to bypass AI guardrails and safety filters, directly mapped to OWASP LLM01.
  • [OBFUSCATION] (HIGH): The Python implementation automates the generation of adversarial variants using homoglyphs (Cyrillic substitution) and zero-width characters (U+200B, U+FEFF) to evade keyword and regex filters.
  • [METADATA_POISONING] (MEDIUM): The skill is framed as a security assessment tool but provides functional code and instructions for developing jailbreak payloads, which could be misused by an agent to execute harmful instructions.
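The obfuscation finding above can be checked for defensively. The following is a minimal detection sketch, not code from the audited skill: it flags the zero-width characters named in the finding (U+200B, U+FEFF, plus the U+200C/U+200D joiners, included as an assumption) and mixed Latin/Cyrillic script within a single token, the signature of homoglyph substitution. The function name `find_obfuscation` is illustrative.

```python
import unicodedata

# Zero-width characters commonly used to slip past keyword/regex filters.
# U+200B and U+FEFF are named in the audit finding; U+200C/U+200D are
# added here as an assumption, since joiners are used the same way.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def find_obfuscation(text: str) -> dict:
    """Flag zero-width characters and mixed-script tokens
    (e.g. Cyrillic homoglyphs inside otherwise-Latin words)."""
    # Positions of invisible characters in the raw string.
    zero_width_hits = [i for i, ch in enumerate(text) if ch in ZERO_WIDTH]

    # Tokens whose alphabetic characters span more than one script.
    mixed_script_tokens = []
    for token in text.split():
        scripts = set()
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if name.startswith("CYRILLIC"):
                    scripts.add("Cyrillic")
                elif name.startswith("LATIN"):
                    scripts.add("Latin")
        if len(scripts) > 1:
            mixed_script_tokens.append(token)

    return {"zero_width": zero_width_hits, "mixed_script": mixed_script_tokens}

# "p\u0430ss\u200bword": Cyrillic U+0430 replaces Latin "a", and a
# zero-width space is inserted mid-word to break keyword matching.
print(find_obfuscation("p\u0430ss\u200bword"))
```

A real filter would normalize input (e.g. strip zero-width characters and apply a confusables mapping per Unicode UTS #39) before keyword matching, rather than only reporting hits.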
Recommendations
  • Automated analysis detected serious security threats in this skill; it should not be installed or executed.
Audit Metadata
Risk Level
HIGH
Analyzed
Feb 16, 2026, 12:00 PM