constitutional-ai

Pass

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: SAFEPROMPT_INJECTION
Full Analysis
  • [PROMPT_INJECTION] (LOW): Indirect Prompt Injection surface area in training workflows.
  • Ingestion points: The skill processes untrusted user data (prompts and responses) into template-based prompts for AI critiques and revisions (e.g., Workflow 1, Workflow 2, and Workflow 3).
  • Boundary markers: The skill uses basic field headers (Question:, Response:, Critique:) but lacks explicit 'ignore instructions' delimiters or escaping for the variable data injected into the templates.
  • Capability inventory: The skill demonstrates Python code for model training (SFTTrainer, RewardTrainer, PPOTrainer) and local file/directory creation (output_dir='constitutional-reward-model').
  • Sanitization: No evidence of sanitization or validation of the input strings before formatting them into LLM templates, making it theoretically possible for a prompt or response to contain instructions that manipulate the critique/revision phase.
Audit Metadata
Risk Level
SAFE
Analyzed
Feb 17, 2026, 06:08 PM