skills/adaptationio/skrillz/observability-alert-manager

observability-alert-manager

SKILL.md

Observability Alert Manager

Configure and manage Grafana alerts for Claude Code monitoring using enhanced telemetry.

Data Source

Primary: {job="claude_code_enhanced"} in Loki

Operations

create-alert

Define new alert rule. Parameters: name, query (LogQL), threshold, duration, severity, notification.

list-alerts

Show all configured alerts and their status.

test-alert

Simulate alert conditions.

delete-alert

Remove alert rule.

Pre-built Alert Templates

Session Alerts

  1. Long Session Duration: Session >1 hour

    {job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600
    
  2. High Turn Count: Session >50 turns

    {job="claude_code_enhanced", event_type="session_end"} | json | turn_count > 50
    
  3. Session Error Spike: >5 errors in session

    {job="claude_code_enhanced", event_type="session_end"} | json | error_count > 5
    

Error Alerts

  1. High Error Rate: >5 errors/hour

    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5
    
  2. Specific Tool Failures: Bash errors

    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error", tool="Bash"} [1h]) > 3
    

Context Alerts

  1. High Context Usage: >80% context window

    {job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80
    
  2. Auto Compaction Triggered: Context full

    {job="claude_code_enhanced", event_type="context_compact", trigger="auto"}
    

Subagent Alerts

  1. Excessive Subagent Spawning: >10 subagents/session
    {job="claude_code_enhanced", event_type="session_end"} | json | subagents_spawned > 10
    

Activity Alerts

  1. Telemetry Staleness: No data >10min

    absent_over_time({job="claude_code_enhanced"} [10m])
    
  2. Unusual Activity Spike: >100 tool calls/hour

    count_over_time({job="claude_code_enhanced", event_type="tool_call"} [1h]) > 100
    

Prompt Pattern Alerts

  1. Debugging Session Spike: Many debugging prompts
    count_over_time({job="claude_code_enhanced", event_type="user_prompt", pattern="debugging"} [1h]) > 10
    

Example Alert Configurations

Create High Error Rate Alert

create-alert \
  --name "High Error Rate" \
  --query 'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5' \
  --severity warning \
  --notification slack

Create Context Usage Alert

create-alert \
  --name "High Context Usage" \
  --query '{job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80' \
  --severity info \
  --notification email

Create Session Duration Alert

create-alert \
  --name "Long Session Warning" \
  --query '{job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600' \
  --severity info \
  --notification dashboard

Grafana Alert Setup

Via Grafana UI

  1. Navigate to Alerting → Alert rules
  2. Create new rule with Loki data source
  3. Enter LogQL query from templates above
  4. Configure conditions and notifications

Via API

curl -X POST http://localhost:3000/api/ruler/grafana/api/v1/rules/claude-code \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "claude-code-alerts",
    "rules": [
      {
        "alert": "HighErrorRate",
        "expr": "count_over_time({job=\"claude_code_enhanced\", status=\"error\"} [1h]) > 5",
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": "High error rate detected"}
      }
    ]
  }'

Notification Channels

  • Slack: Webhook integration
  • Email: SMTP configuration
  • PagerDuty: Incident management
  • Dashboard: On-screen annotations

Alert Severity Levels

Level Use Case
critical Immediate action required
warning Needs attention soon
info Informational, no action needed

Scripts

  • scripts/create-alert.sh - Create new alert
  • scripts/list-alerts.sh - List all alerts
  • scripts/test-alerts.sh - Test alert conditions
  • scripts/import-alert-templates.sh - Import all pre-built templates
Weekly Installs
1
Installed on
claude-code1