constitutional-ai
Pass
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: SAFE
PROMPT_INJECTION
Full Analysis
- [PROMPT_INJECTION] (LOW): Indirect Prompt Injection surface area in training workflows.
- Ingestion points: The skill processes untrusted user data (`prompts` and `responses`) into template-based prompts for AI critiques and revisions (e.g., Workflow 1, Workflow 2, and Workflow 3).
- Boundary markers: The skill uses basic field headers (`Question:`, `Response:`, `Critique:`) but lacks explicit 'ignore instructions' delimiters or escaping for the variable data injected into the templates.
- Capability inventory: The skill demonstrates Python code for model training (`SFTTrainer`, `RewardTrainer`, `PPOTrainer`) and local file/directory creation (`output_dir='constitutional-reward-model'`).
- Sanitization: No evidence of sanitization or validation of the input strings before they are formatted into LLM templates, so a prompt or response could in theory contain instructions that manipulate the critique/revision phase.
Audit Metadata