llamaguard

Pass

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: SAFE
Full Analysis
  • [Indirect Prompt Injection] (LOW): The skill is designed to process untrusted user input for the purpose of safety classification. While this provides a surface for indirect prompt injection, it is the intended primary function of the moderation tool. The code follows best practices by using chat templates for input formatting.
  • Ingestion points: check_input, check_output, and moderate_endpoint functions.
  • Boundary markers: Utilizes tokenizer.apply_chat_template to separate user content from system instructions.
  • Capability inventory: Capabilities are limited to local model inference and result classification; no file-writing or arbitrary command execution is present.
  • Sanitization: The LlamaGuard model itself serves as the sanitization mechanism.
  • [External Downloads] (LOW): The skill downloads model weights from the meta-llama organization on Hugging Face. This organization is listed as a trusted source, and the loading methods (from_pretrained, vllm.LLM) are standard for the field.
Audit Metadata
Risk Level
SAFE
Analyzed
Feb 17, 2026, 06:24 PM