guardrails-safety
Pass
Audited by Gen Agent Trust Hub on Mar 10, 2026
Risk Level: SAFEPROMPT_INJECTION
Full Analysis
- [PROMPT_INJECTION]: The skill includes phrases like 'ignore previous instructions' and 'act as' within a regex list. These are intended for defensive detection by the 'InjectionDetector' class and do not represent a malicious attempt to override agent behavior.
- [PROMPT_INJECTION]: The 'ConstitutionalFilter' class presents an indirect prompt injection surface when reprocessing data.
- Ingestion points: The 'filter' method in 'SKILL.md' accepts an untrusted 'response' string for evaluation.
- Boundary markers: The prompts used for the 'critic' and 'reviser' models do not use delimiters or instructions to prevent the model from obeying commands embedded in the text being analyzed.
- Capability inventory: The skill performs additional LLM calls ('critic.generate', 'reviser.generate') based on the content of the 'response' and 'critique' strings.
- Sanitization: No injection-specific sanitization is performed to remove potential instructions from the response text before it is interpolated into the filter prompts.
Audit Metadata