guardrails-safety-filter-builder
Pass
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: SAFE
Full Analysis
- Indirect Prompt Injection (LOW): The skill processes untrusted user input to classify topics and generate responses. While this represents a standard injection surface, the skill implements specific mitigation strategies including a detection function (
detect_prompt_injection) and output validation to minimize risk. - Ingestion points:
user_inputvariable inclassify_topicandapply_guardrailsfunctions. - Boundary markers: Absent in the classification prompt interpolation, but prefaced by a detection step.
- Capability inventory: Calls to an internal LLM tool for classification and response generation.
- Sanitization: Includes regex-based detection for common injection patterns and PII redaction logic.
- Unverifiable Dependencies & Remote Code Execution (SAFE): The skill references the
presidio-analyzerandpresidio-anonymizerlibraries. These are reputable, well-known open-source packages for PII management. - Prompt Injection (SAFE): The skill contains lists of prompt injection patterns (e.g., 'ignore previous instructions'), but these are used strictly for detection and filtering purposes, not as instructions to the agent.
Audit Metadata