guardrails-safety-filter-builder

Pass

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: SAFE
Full Analysis
  • Indirect Prompt Injection (LOW): The skill processes untrusted user input to classify topics and generate responses. While this represents a standard injection surface, the skill implements specific mitigation strategies including a detection function (detect_prompt_injection) and output validation to minimize risk.
  • Ingestion points: user_input variable in classify_topic and apply_guardrails functions.
  • Boundary markers: Absent in the classification prompt interpolation, but prefaced by a detection step.
  • Capability inventory: Calls to an internal LLM tool for classification and response generation.
  • Sanitization: Includes regex-based detection for common injection patterns and PII redaction logic.
  • Unverifiable Dependencies & Remote Code Execution (SAFE): The skill references the presidio-analyzer and presidio-anonymizer libraries. These are reputable, well-known open-source packages for PII management.
  • Prompt Injection (SAFE): The skill contains lists of prompt injection patterns (e.g., 'ignore previous instructions'), but these are used strictly for detection and filtering purposes, not as instructions to the agent.
Audit Metadata
Risk Level
SAFE
Analyzed
Feb 17, 2026, 05:43 PM