Observability Alert Manager
Configure and manage Grafana alerts for Claude Code monitoring using enhanced telemetry.
Data Source
Primary: `{job="claude_code_enhanced"}` in Loki
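Before wiring alerts, it can help to confirm that this stream actually contains data. A minimal sketch using Loki's query API, assuming Loki is reachable on its default port 3100 on the same host as the Grafana instance used later in this document:

```bash
# Pull a few recent enhanced-telemetry lines (the API defaults to roughly the last hour)
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="claude_code_enhanced"}' \
  --data-urlencode 'limit=5'
```

If this returns an empty `result` array, fix telemetry ingestion before creating alert rules.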
Operations
- `create-alert`: Define new alert rule. Parameters: name, query (LogQL), threshold, duration, severity, notification.
- `list-alerts`: Show all configured alerts and their status.
- `test-alert`: Simulate alert conditions.
- `delete-alert`: Remove alert rule.
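A sketch of a typical round-trip through these operations using the bundled scripts (see Scripts below); the `create-alert` flags mirror the example configurations later in this document, and the other scripts are assumed to take no arguments:

```bash
# Create a rule, then confirm it registered and simulate its condition
./scripts/create-alert.sh \
  --name "Bash Failure Spike" \
  --query 'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error", tool="Bash"} [1h]) > 3' \
  --severity warning \
  --notification slack
./scripts/list-alerts.sh   # show all configured alerts and their status
./scripts/test-alerts.sh   # test alert conditions
```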
Pre-built Alert Templates
Session Alerts
- Long Session Duration: Session >1 hour
  `{job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600`
- High Turn Count: Session >50 turns
  `{job="claude_code_enhanced", event_type="session_end"} | json | turn_count > 50`
- Session Error Spike: >5 errors in session
  `{job="claude_code_enhanced", event_type="session_end"} | json | error_count > 5`
Error Alerts
- High Error Rate: >5 errors/hour
  `count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5`
- Specific Tool Failures: >3 Bash errors/hour
  `count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error", tool="Bash"} [1h]) > 3`
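If an absolute count is too noisy across sessions of different sizes, a proportional variant is possible because LogQL supports arithmetic between range aggregations; a sketch that alerts when more than 10% of tool results in the last hour are errors (the 0.1 threshold is an assumption to tune):

```logql
sum(count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]))
  /
sum(count_over_time({job="claude_code_enhanced", event_type="tool_result"} [1h])) > 0.1
```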
Context Alerts
- High Context Usage: >80% context window
  `{job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80`
- Auto Compaction Triggered: Context full
  `{job="claude_code_enhanced", event_type="context_compact", trigger="auto"}`
Subagent Alerts
- Excessive Subagent Spawning: >10 subagents/session
  `{job="claude_code_enhanced", event_type="session_end"} | json | subagents_spawned > 10`
Activity Alerts
- Telemetry Staleness: No data >10min
  `absent_over_time({job="claude_code_enhanced"} [10m])`
- Unusual Activity Spike: >100 tool calls/hour
  `count_over_time({job="claude_code_enhanced", event_type="tool_call"} [1h]) > 100`
Prompt Pattern Alerts
- Debugging Session Spike: >10 debugging prompts/hour
  `count_over_time({job="claude_code_enhanced", event_type="user_prompt", pattern="debugging"} [1h]) > 10`
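Before picking a threshold for pattern-based alerts, it can help to see how prompt patterns are distributed; a sketch of a grouped count, assuming `pattern` is an indexed label as the selector above implies:

```logql
sum by (pattern) (count_over_time({job="claude_code_enhanced", event_type="user_prompt"} [1h]))
```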
Example Alert Configurations
Create High Error Rate Alert
```bash
create-alert \
  --name "High Error Rate" \
  --query 'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5' \
  --severity warning \
  --notification slack
```
Create Context Usage Alert
```bash
create-alert \
  --name "High Context Usage" \
  --query '{job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80' \
  --severity info \
  --notification email
```
Create Session Duration Alert
```bash
create-alert \
  --name "Long Session Warning" \
  --query '{job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600' \
  --severity info \
  --notification dashboard
```
Grafana Alert Setup
Via Grafana UI
- Navigate to Alerting → Alert rules
- Create new rule with Loki data source
- Enter LogQL query from templates above
- Configure conditions and notifications
Via API
```bash
curl -X POST http://localhost:3000/api/ruler/grafana/api/v1/rules/claude-code \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "claude-code-alerts",
    "rules": [
      {
        "alert": "HighErrorRate",
        "expr": "count_over_time({job=\"claude_code_enhanced\", status=\"error\"} [1h]) > 5",
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": "High error rate detected"}
      }
    ]
  }'
```
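To confirm the rule group was stored, the same ruler endpoint can be read back; a sketch assuming it also accepts GET (adjust credentials to your instance):

```bash
# List all Grafana-managed rule groups, including the claude-code namespace posted above
curl -s -u admin:admin http://localhost:3000/api/ruler/grafana/api/v1/rules
```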
Notification Channels
- Slack: Webhook integration
- Email: SMTP configuration
- PagerDuty: Incident management
- Dashboard: On-screen annotations
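These channels map onto Grafana contact points. A minimal sketch for the Slack case, assuming a Grafana 9+ instance that exposes the alerting provisioning API at `/api/v1/provisioning/contact-points`; the webhook URL is a placeholder:

```bash
curl -X POST http://localhost:3000/api/v1/provisioning/contact-points \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "claude-code-slack",
    "type": "slack",
    "settings": {"url": "https://hooks.slack.com/services/XXX/YYY/ZZZ"}
  }'
```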
Alert Severity Levels
| Level | Use Case |
|---|---|
| critical | Immediate action required |
| warning | Needs attention soon |
| info | Informational, no action needed |
Scripts
- `scripts/create-alert.sh` - Create new alert
- `scripts/list-alerts.sh` - List all alerts
- `scripts/test-alerts.sh` - Test alert conditions
- `scripts/import-alert-templates.sh` - Import all pre-built templates
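A sketch of the typical bootstrap flow, assuming the import and list scripts take no arguments:

```bash
./scripts/import-alert-templates.sh   # load every pre-built template above
./scripts/list-alerts.sh              # verify the rules are registered
```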