alerting-dashboard-builder
Alerting & Dashboard Builder
Build effective alerts and dashboards based on SLOs.
SLO Definition
slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  - name: api_latency
    objective: 95% # 95% of requests under 500ms
    window: 30d
    sli: |
      histogram_quantile(0.95,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
      ) < 0.5
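An objective and a window together imply an error budget. As a rough sketch of that arithmetic (the function name `error_budget_minutes` is hypothetical, and it assumes the budget is spent as full outage time), the 99.9% / 30d availability SLO works out like this:

```python
def error_budget_minutes(objective: float, window_days: int) -> float:
    """Minutes of total outage the SLO tolerates over the window.

    objective is a ratio (0.999 for 99.9%), window_days the SLO window.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - objective) * window_minutes


# 99.9% over 30 days leaves 0.1% of 43,200 minutes.
budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

The same budget can be read as a request count instead: 0.1% of all requests served in the window may fail.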
Alert Rules
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 1% of the 30d budget in 1h is a burn rate of 7.2
      # (0.01 * 720h / 1h), so the error-ratio threshold is 7.2 * 0.001.
      - alert: AvailabilitySLOFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[1h])) /
          sum(rate(http_requests_total[1h])))) > (7.2 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning 1% error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"
      # Slow burn: 10% of the 30d budget in 24h is a burn rate of 3
      # (0.10 * 720h / 24h), so the error-ratio threshold is 3 * 0.001.
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[24h])) /
          sum(rate(http_requests_total[24h])))) > (3 * 0.001)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning 10% of the error budget per 24h"
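The rule comments describe each alert as a fraction of the error budget consumed within the alert window. A minimal sketch of how such a fraction maps to an error-ratio threshold, assuming a 30d (720h) SLO window and the 99.9% objective; the function names are hypothetical:

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(objective: float, budget_fraction: float,
                          alert_window_hours: float,
                          slo_window_hours: float = 30 * 24) -> float:
    """Error ratio above which the given budget fraction burns in the window."""
    rate = burn_rate(budget_fraction, alert_window_hours, slo_window_hours)
    return rate * (1.0 - objective)


# Fast burn: 1% of the budget in 1h. Slow burn: 10% of the budget in 24h.
fast = error_ratio_threshold(0.999, 0.01, 1)
slow = error_ratio_threshold(0.999, 0.10, 24)
print(f"fast burn threshold: {fast:.4f}")  # 0.0072
print(f"slow burn threshold: {slow:.4f}")  # 0.0030
```

A burn rate of 1 means the budget lasts exactly the full window, so an alert threshold equal to the raw budget (0.001 here) fires on merely sustainable consumption.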
Dashboard Template
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
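The Error Budget Remaining panel is meant to show the fraction of the budget still unspent: the consumed error ratio divided by the allowed error ratio, subtracted from 1. A small sketch of that arithmetic, assuming `slo_availability` is the measured availability ratio over the SLO window (the function name is hypothetical):

```python
def budget_remaining(availability: float, objective: float) -> float:
    """Fraction of the error budget left; negative means the SLO is blown."""
    allowed_errors = 1.0 - objective      # e.g. 0.001 for a 99.9% objective
    consumed_errors = 1.0 - availability  # observed error ratio
    return 1.0 - consumed_errors / allowed_errors


# 99.95% measured against a 99.9% objective: half the budget is spent.
print(f"{budget_remaining(0.9995, 0.999):.0%}")  # 50%
# 99.8% measured: twice the budget has been spent.
print(f"{budget_remaining(0.998, 0.999):.0%}")   # -100%
```

Plotting the value as a percentage with a floor line at 0 makes an overspent budget visible at a glance.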
What to Do When an Alert Fires
# Alert Response Guide
## HighErrorRate
**What it means:** More than 5% of requests are failing
**First steps:**
1. Check recent deployments (rollback if needed)
2. Review error logs for patterns
3. Check the health of dependent services
4. Verify database connectivity
**Escalation:** If not resolved in 15 min, page on-call lead
## HighLatency
**What it means:** p95 latency above 2 seconds
**First steps:**
1. Check database query performance
2. Review recent code changes
3. Check cache hit rates
4. Look for slow external API calls
**Temporary mitigation:**
- Scale up instances
- Enable aggressive caching
## LowAvailability
**What it means:** Availability below 99.5%
**First steps:**
1. Check infrastructure (AWS status page)
2. Review load balancer health checks
3. Check for DDoS activity
4. Verify auto-scaling is functioning
Output Checklist
- SLOs defined
- Alert rules configured
- Dashboards created
- Runbooks linked
- Response guides documented
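The "Runbooks linked" item can be checked mechanically. The following is only a sketch: it assumes the rule file has already been parsed into plain dictionaries (for example by a YAML parser) and that runbook links live under an annotation key named `runbook`, as in the fast-burn rule; `alerts_missing_runbook` is a hypothetical helper.

```python
def alerts_missing_runbook(groups: list[dict]) -> list[str]:
    """Return the names of alerting rules that have no runbook annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no runbook
            if not rule.get("annotations", {}).get("runbook"):
                missing.append(rule["alert"])
    return missing


# Parsed form of the slo_alerts group, reduced to the fields being checked.
groups = [{
    "name": "slo_alerts",
    "rules": [
        {"alert": "AvailabilitySLOFastBurn",
         "annotations": {
             "summary": "Burning 1% error budget per hour",
             "runbook": "https://runbooks.example.com/availability-fast-burn"}},
        {"alert": "AvailabilitySLOSlowBurn",
         "annotations": {"summary": "Burning error budget slowly"}},
    ],
}]

print(alerts_missing_runbook(groups))  # ['AvailabilitySLOSlowBurn']
```

Running a check like this in CI keeps the checklist item from drifting as new rules are added.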