AI Security Hardening

Secure LLM and AI systems against prompt injection, jailbreaks, data leakage, and supply chain threats in production environments.

When to Use This Skill

Use this skill when:

  • Deploying an LLM-powered application handling sensitive user data
  • Protecting against prompt injection attacks in AI agents
  • Implementing output filtering and content moderation
  • Securing model weights and API endpoints from theft
  • Achieving SOC2 or ISO 27001 compliance for AI systems

AI-Specific Threat Model

Threat                    Risk                          Control
─────────────────────────────────────────────────────────────────────────────────
Prompt injection          System prompt override        Input sanitization, separate context
Data exfiltration         PII in model outputs          Output filtering, DLP scanning
Jailbreaking              Policy bypass                 Content moderation, guardrails
Model theft               Weight extraction via API     Rate limiting, access controls
Training data poisoning   Backdoored fine-tuned model   Dataset validation, provenance
Supply chain attack       Malicious model weights       Signature verification, scanning
Insecure output           XSS/SQLi from LLM response    Output encoding, parameterized queries
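The "separate context" control in the first row means never concatenating untrusted user text into the system prompt. A minimal sketch of that separation (`build_messages` is an illustrative helper, not part of any library):

```python
def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Keep untrusted user text in the 'user' role; never interpolate it
    into the system prompt, where it could override instructions."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
```

Structured message roles are not a complete defense, but they prevent the most basic form of injection: user text being parsed as part of the instructions themselves.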

Prompt Injection Defense

import re
from typing import Optional

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions",
    r"you\s+are\s+now\s+",
    r"new\s+instructions?:",
    r"system\s+prompt",
    r"forget\s+everything",
    r"act\s+as\s+",
    r"jailbreak",
    r"dan\s+mode",
    r"<\s*system\s*>",
    r"\[INST\]",
]

def detect_prompt_injection(user_input: str) -> tuple[bool, Optional[str]]:
    """Return (is_suspicious, matched_pattern)."""
    normalized = user_input.lower().strip()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, normalized, re.IGNORECASE):
            return True, pattern
    return False, None

def sanitize_user_input(user_input: str, max_length: int = 4000) -> str:
    """Sanitize input before passing to LLM."""
    # Truncate
    user_input = user_input[:max_length]

    # Remove null bytes and control characters
    user_input = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', user_input)

    # Check for injection
    suspicious, pattern = detect_prompt_injection(user_input)
    if suspicious:
        raise ValueError(f"Potential prompt injection detected: {pattern}")

    return user_input
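Regex checks like these are easy to evade with zero-width characters or full-width look-alike glyphs (e.g. "ig​nore"). A hedged sketch of a normalization pass to run before the pattern matching above (`normalize_for_matching` is an illustrative helper):

```python
import re
import unicodedata

# Zero-width and BOM characters commonly used to split trigger words
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize_for_matching(text: str) -> str:
    """Strip zero-width characters, fold full-width and other Unicode
    compatibility forms to ASCII via NFKC, and lowercase, so that
    obfuscated trigger phrases match the plain-ASCII patterns."""
    text = ZERO_WIDTH.sub("", text)
    return unicodedata.normalize("NFKC", text).lower()
```

Normalization narrows the evasion space but does not close it; pattern lists should be treated as one layer, not the whole defense.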

Guardrails with NeMo Guardrails

# app.py: wire NeMo Guardrails around every LLM call
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./guardrails-config")
rails = LLMRails(config)

async def safe_llm_call(user_message: str) -> str:
    response = await rails.generate_async(
        messages=[{"role": "user", "content": user_message}]
    )
    return response["content"]

# guardrails-config/config.yml
models:
  - type: main
    engine: openai
    model: gpt-4o-mini

rails:
  input:
    flows:
      - check jailbreak
      - check sensitive data
  output:
    flows:
      - check output for PII
      - check output for harmful content

Output Filtering & PII Scrubbing

import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

PII_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD",
                "US_SSN", "IBAN_CODE", "IP_ADDRESS", "LOCATION"]

def scrub_pii_from_output(text: str) -> str:
    """Remove PII from LLM output before returning to user."""
    results = analyzer.analyze(text=text, entities=PII_ENTITIES, language="en")
    if not results:
        return text
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

def validate_output_safety(output: str) -> bool:
    """Check output doesn't contain prompt injection artifacts."""
    dangerous_patterns = [
        r"<\s*script\s*>",         # XSS
        r"javascript:",             # XSS
        r";\s*(DROP|DELETE|INSERT)", # SQLi
        r"\$\{.*\}",               # template injection
        r"`.*`",                   # command injection in some contexts
    ]
    for pattern in dangerous_patterns:
        if re.search(pattern, output, re.IGNORECASE):
            return False
    return True
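validate_output_safety rejects suspicious output, but when output will be rendered in a web page it should also be encoded regardless of what the check found. A minimal sketch using only the standard library:

```python
import html

def encode_for_html(llm_output: str) -> str:
    """Encode LLM output before inserting it into an HTML page, so any
    markup the model emits is displayed as text rather than executed."""
    return html.escape(llm_output, quote=True)
```

Encoding at the render boundary is the actual XSS fix; the regex check above is a coarse tripwire, not a sanitizer.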

API Security for LLM Endpoints

from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt  # PyJWT
import os
import time
from collections import defaultdict

app = FastAPI()
security = HTTPBearer()
SECRET_KEY = os.environ["JWT_SECRET"]  # load from a secret store, never hard-code

# Rate limiting (per API key; in-memory and per-process, use Redis for multi-instance deployments)
request_counts = defaultdict(list)

def rate_limit(api_key: str, max_requests: int = 100, window_seconds: int = 60):
    now = time.time()
    requests = request_counts[api_key]
    # Remove old requests outside window
    request_counts[api_key] = [t for t in requests if now - t < window_seconds]
    if len(request_counts[api_key]) >= max_requests:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    request_counts[api_key].append(now)

async def verify_token(
    credentials: HTTPAuthorizationCredentials = Depends(security)
) -> dict:
    try:
        payload = jwt.decode(credentials.credentials, SECRET_KEY, algorithms=["HS256"])
        rate_limit(payload["sub"])
        return payload
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.post("/v1/chat/completions")
async def chat(request: Request, token: dict = Depends(verify_token)):
    body = await request.json()

    # Input validation (guard against an empty messages array)
    user_msg = (body.get("messages") or [{}])[-1].get("content", "")
    try:
        safe_input = sanitize_user_input(user_msg)
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))

    # Call LLM and scrub output (call_llm is the application's own model client)
    response = await call_llm(safe_input, token["scope"])
    response["choices"][0]["message"]["content"] = scrub_pii_from_output(
        response["choices"][0]["message"]["content"]
    )
    return response

Model Weight Security

# Verify model weights with SHA-256 hash before loading
MODEL_DIR="./models/llama-3.1-8b"
EXPECTED_HASH="sha256:abc123..."

# Generate a deterministic hash of the downloaded weights
# (strip sha256sum's trailing " -" and add the sha256: prefix so it compares cleanly)
actual_hash="sha256:$(find "$MODEL_DIR" -name "*.safetensors" | sort | xargs sha256sum | sha256sum | awk '{print $1}')"
echo "Model hash: $actual_hash"

# Compare (automate in CI/CD)
if [ "$actual_hash" != "$EXPECTED_HASH" ]; then
  echo "ERROR: Model hash mismatch — possible tampering!"
  exit 1
fi

# Scan model files for embedded malware (ModelScan)
pip install modelscan
modelscan scan -p "$MODEL_DIR"
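The same verification can run inside the serving process itself, before any weights are loaded. A hedged sketch (`hash_model_dir` is an illustrative helper; it hashes files in sorted filename order so the digest is deterministic across machines):

```python
import hashlib
from pathlib import Path

def hash_model_dir(model_dir: str) -> str:
    """SHA-256 over the contents of all *.safetensors files in the
    directory, visited in sorted filename order for determinism."""
    digest = hashlib.sha256()
    for path in sorted(Path(model_dir).glob("*.safetensors")):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def verify_model(model_dir: str, expected_hash: str) -> None:
    """Refuse to proceed if the on-disk weights don't match the pin."""
    actual = hash_model_dir(model_dir)
    if actual != expected_hash:
        raise RuntimeError(f"Model hash mismatch, possible tampering: {actual}")
```

Note this scheme differs slightly from the shell pipeline above (it hashes file contents directly rather than a hash-of-hashes), so pin whichever digest format your pipeline actually produces.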

Network Isolation for AI Services

# Kubernetes NetworkPolicy — isolate LLM API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-api-isolation
  namespace: ai-services
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: backend           # only backend can call LLM
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring        # metrics only
    ports:
    - protocol: TCP
      port: 9090
  # Block egress to internet — prevent data exfiltration
  # (allow only internal cluster traffic)

Audit Logging

import structlog
from datetime import datetime, timezone

audit_log = structlog.get_logger("ai.audit")

def log_llm_interaction(
    user_id: str,
    session_id: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    was_filtered: bool,
    injection_detected: bool,
):
    audit_log.info(
        "llm_interaction",
        timestamp=datetime.now(timezone.utc).isoformat(),
        user_id=user_id,
        session_id=session_id,
        model=model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        was_filtered=was_filtered,
        injection_detected=injection_detected,
        # DO NOT log prompt/completion content — PII risk
    )

Common Issues

Issue                            Cause                        Fix
─────────────────────────────────────────────────────────────────────────────────
False positive injection blocks  Overly broad regex           Tune patterns; use an ML-based classifier at high traffic
PII in model outputs             Model trained on PII data    Add Presidio scrubbing to the output layer
API key leakage                  Keys in logs or responses    Mask keys in logging; store keys in a vault
Model weight tampering           Unverified downloads         Always verify SHA-256; run modelscan
Rate limit bypass                Per-IP instead of per-user   Rate limit on authenticated user ID, not IP
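For the key-leakage row, a hedged sketch of log masking (the `sk-` prefix is an assumption based on OpenAI-style keys; adapt the pattern to your provider's key format):

```python
import re

# Assumes OpenAI-style keys with an "sk-" prefix; adjust for other providers
API_KEY_RE = re.compile(r"\b(sk-[A-Za-z0-9]{4})[A-Za-z0-9]{8,}\b")

def mask_api_keys(message: str) -> str:
    """Keep the prefix and first four characters of a key for
    correlation in logs; redact the rest."""
    return API_KEY_RE.sub(r"\1****", message)
```

Run this as a logging filter or structlog processor so masking happens in one place rather than at every call site.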

Best Practices

  • Never log raw prompts or completions — they may contain PII or sensitive data.
  • Treat LLM output as untrusted input — always encode before rendering in HTML.
  • Use network policies to prevent LLM pods from making outbound internet calls.
  • Rotate API keys quarterly; use short-lived JWT tokens for service-to-service auth.
  • Run modelscan on any model downloaded from the internet before serving.
