reddit-moderate

Pass

Audited by Gen Agent Trust Hub on Apr 1, 2026

Risk Level: SAFEPROMPT_INJECTIONCOMMAND_EXECUTION
Full Analysis
  • [PROMPT_INJECTION]: The skill manages an indirect prompt injection surface when processing Reddit content.
  • Ingestion points: Untrusted content is fetched from Reddit modqueue, reports, and author history via the reddit_mod.py script.
  • Boundary markers: The classification prompt utilizes <untrusted-content> tags to isolate raw user data from instructions.
  • Capability inventory: The skill enables the agent to execute moderation actions such as approve, remove, and lock on Reddit items.
  • Sanitization: Explicit instructions are provided to the model to evaluate potential injection attempts (e.g., 'ignore previous instructions') as malicious signals for classification rather than as valid commands.
  • [COMMAND_EXECUTION]: The skill utilizes a local script reddit_mod.py for Reddit API interactions and managing subreddit-specific data directories.
  • [SAFE]: Credentials are managed via environment variables and local .env files with appropriate file permissions (chmod 600) recommended.
Audit Metadata
Risk Level
SAFE
Analyzed
Apr 1, 2026, 05:55 AM