Observability & Reliability Engineering
Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.
Quick Health Check (/16)
Score your current observability posture:
| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |
12-16: Production-grade. Focus on optimization.
8-11: Foundation exists. Fill the gaps systematically.
4-7: Significant risk. Prioritize alerting + incident response.
0-3: Flying blind. Start with Phase 1 immediately.
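The scoring above is simple enough to automate. A minimal sketch (signal names and tier wording mirror the rubric; the `assess` helper is illustrative, not part of any library):

```python
# Hypothetical scoring helper for the eight-signal health check above.
# Each signal scores 2 (healthy), 1 (weak), or 0 (missing); max is 16.

TIERS = [
    (12, "Production-grade. Focus on optimization."),
    (8, "Foundation exists. Fill the gaps systematically."),
    (4, "Significant risk. Prioritize alerting + incident response."),
    (0, "Flying blind. Start with Phase 1 immediately."),
]

def assess(scores: dict) -> tuple:
    """Sum per-signal scores (0/1/2) and map the total to a guidance tier."""
    if any(s not in (0, 1, 2) for s in scores.values()):
        raise ValueError("each signal must score 0, 1, or 2")
    total = sum(scores.values())
    guidance = next(msg for floor, msg in TIERS if total >= floor)
    return total, guidance

total, guidance = assess({
    "structured_logging": 2, "metrics": 2, "tracing": 1, "alerting": 1,
    "incidents": 1, "slos": 0, "on_call": 1, "cost": 0,
})
print(total, guidance)  # 8 → "Foundation exists. ..."
```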
Phase 1: Structured Logging
Log Architecture
Application → Structured JSON → Log Router → Storage → Query Engine
                                     ↓
                               Alert Pipeline
Required Fields (Every Log Line)
| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
Contextual Fields (Add Per Domain)
http:
method: POST
path: /api/v1/orders
status: 201
client_ip: 203.0.113.42
user_agent: "Mozilla/5.0..."
request_id: "req_abc123"
business:
user_id: "usr_456"
tenant_id: "tenant_789"
order_id: "ord_012"
action: "checkout"
amount_cents: 4999
currency: "USD"
error:
type: "PaymentDeclinedError"
message: "Card declined: insufficient funds"
code: "CARD_DECLINED"
stack: "..."
retry_count: 2
retryable: true
Log Level Decision Tree
Is the process about to crash?
→ FATAL (exit after logging)
Did an operation fail that needs human attention?
→ ERROR (page someone or create ticket)
Did something unexpected happen but we recovered?
→ WARN (review in daily triage)
Is this a normal business event worth recording?
→ INFO (audit trail, business metrics)
Is this useful for debugging but noisy in production?
→ DEBUG (off in prod, on in staging)
Is this only useful when stepping through code?
→ TRACE (never in production)
Log Level Rules
- ERROR means action required — if no one needs to act on it, it's WARN
- INFO is for business events — not internal implementation details
- No logging inside tight loops — aggregate and log summary
- Log at boundaries — API entry/exit, queue consume/publish, DB calls
- Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
PII & Secret Scrubbing
scrub_patterns:
- field_patterns: ["password", "secret", "token", "api_key", "authorization"]
action: replace_with_redacted
- field_patterns: ["email", "phone", "ssn", "national_id"]
action: sha256_hash
- field_patterns: ["credit_card", "card_number"]
action: mask_last_4
- field_patterns: ["client_ip", "ip_address"]
action: zero_last_octet
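The four scrub actions above are straightforward to implement at the logger level. A sketch (field-name sets and helper names are illustrative; matching here is by field name, as in the policy):

```python
import hashlib
import re

# Sketch of a logger-level scrubber implementing the scrub_patterns policy
# above: redact credentials, hash identifiers, mask card numbers to the last
# four digits, and zero the last octet of IP addresses.

REDACT = ("password", "secret", "token", "api_key", "authorization")
HASH = {"email", "phone", "ssn", "national_id"}
MASK_LAST_4 = {"credit_card", "card_number"}
ZERO_LAST_OCTET = {"client_ip", "ip_address"}

def scrub(event: dict) -> dict:
    """Return a copy of a log event with sensitive fields neutralized."""
    out = {}
    for key, value in event.items():
        name = key.lower()
        if any(p in name for p in REDACT):
            out[key] = "[REDACTED]"
        elif name in HASH:
            # Truncated SHA-256 keeps values correlatable without exposing PII.
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        elif name in MASK_LAST_4:
            digits = re.sub(r"\D", "", str(value))
            out[key] = "*" * (len(digits) - 4) + digits[-4:]
        elif name in ZERO_LAST_OCTET:
            out[key] = re.sub(r"\.\d+$", ".0", str(value))
        else:
            out[key] = value
    return out

print(scrub({"api_key": "sk_live_123", "client_ip": "203.0.113.42",
             "card_number": "4242 4242 4242 4242", "order_id": "ord_012"}))
```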
Logger Setup (By Language)
Node.js (Pino):
import pino from 'pino';
import { AsyncLocalStorage } from 'node:async_hooks';
const als = new AsyncLocalStorage<Record<string, string>>();
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
mixin: () => als.getStore() ?? {},
redact: ['req.headers.authorization', '*.password', '*.token'],
timestamp: pino.stdTimeFunctions.isoTime,
});
// crypto.randomUUID() is a global in Node 19+; on older runtimes, import it from 'node:crypto'
app.use((req, res, next) => {
const ctx = {
trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
request_id: crypto.randomUUID(),
service: 'payment-api',
version: process.env.APP_VERSION,
};
als.run(ctx, () => next());
});
Python (structlog):
import structlog
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso", utc=True),
structlog.processors.JSONRenderer(),
],
)
log = structlog.get_logger()
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
Go (zerolog):
log := zerolog.New(os.Stdout).With().
Timestamp().
Str("service", "payment-api").
Str("version", version).
Logger()
reqLog := log.With().Str("trace_id", traceID).Logger()
Log Storage Decision
| Volume | Solution | Retention | Cost |
|---|---|---|---|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |
10 Logging Anti-Patterns
| # | Anti-Pattern | Fix |
|---|---|---|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |
Phase 2: Metrics Collection
The RED Method (Request-Driven Services)
For every service endpoint, track:
| Metric | What | Prometheus Example |
|---|---|---|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |
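The bookkeeping behind RED is small enough to sketch in-process. This is illustrative only (in production you would use a Prometheus client library's Counter and Histogram rather than this hypothetical `REDTracker`):

```python
from collections import defaultdict

# Minimal in-process sketch of RED bookkeeping per endpoint: request counts
# keyed by (method, path, status) give Rate and Errors; recorded durations
# give the latency distribution.

class REDTracker:
    def __init__(self):
        self.requests = defaultdict(int)    # (method, path, status) -> count
        self.durations = defaultdict(list)  # (method, path) -> [seconds]

    def observe(self, method, path, status, seconds):
        self.requests[(method, path, status)] += 1
        self.durations[(method, path)].append(seconds)

    def error_rate(self, method, path):
        """Errors = 5xx responses, matching status=~"5.." in the table above."""
        total = sum(c for (m, p, _), c in self.requests.items() if (m, p) == (method, path))
        errors = sum(c for (m, p, s), c in self.requests.items()
                     if (m, p) == (method, path) and 500 <= s < 600)
        return errors / total if total else 0.0

    def duration_quantile(self, method, path, q=0.5):
        """Nearest-rank quantile over observed durations (crude but illustrative)."""
        xs = sorted(self.durations[(method, path)])
        return xs[int(q * (len(xs) - 1))] if xs else 0.0

red = REDTracker()
red.observe("POST", "/api/v1/orders", 201, 0.142)
red.observe("POST", "/api/v1/orders", 500, 0.980)
red.observe("POST", "/api/v1/orders", 201, 0.120)
print(red.error_rate("POST", "/api/v1/orders"))  # one failure out of three
```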
The USE Method (Infrastructure Resources)
For every resource (CPU, memory, disk, network):
| Metric | What | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |
Golden Signals (Google SRE)
| Signal | Meaning | Source |
|---|---|---|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |
Metric Types & When to Use Each
| Type | Use Case | Example |
|---|---|---|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |
Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
Naming Conventions
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio
# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
Label Design Rules
| Rule | Why | Example |
|---|---|---|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
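Path normalization, the fix for the third rule, can be done with a small regex pass before the path is used as a label. A sketch (the `:id` placeholder convention and the ID-detection regex are assumptions, not a standard):

```python
import re

# Sketch of label normalization to keep path cardinality bounded: numeric and
# long hex/UUID-like segments collapse to a :id placeholder before the path is
# attached to a metric as a label.

ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F-]{16,})$")

def normalize_path(path: str) -> str:
    parts = [":id" if ID_SEGMENT.match(p) else p for p in path.split("/")]
    return "/".join(parts)

print(normalize_path("/api/users/123"))             # /api/users/:id
print(normalize_path("/api/users/123/orders/456"))  # /api/users/:id/orders/:id
```

With this in place, a million distinct user URLs map to a single time series per method and status.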
Instrumentation Checklist
application_metrics:
- http_request_duration_seconds: histogram {method, path, status}
- http_request_size_bytes: histogram {method, path}
- http_response_size_bytes: histogram {method, path}
- http_requests_in_flight: gauge
- orders_processed_total: counter {status, payment_method}
- order_value_dollars: histogram {payment_method}
- user_signups_total: counter {source}
- db_query_duration_seconds: histogram {query_type, table}
- db_connections_active: gauge {pool}
- db_connections_idle: gauge {pool}
- cache_requests_total: counter {result: hit|miss}
- external_api_duration_seconds: histogram {service, endpoint}
- external_api_errors_total: counter {service, error_type}
- queue_messages_published_total: counter {queue}
- queue_messages_consumed_total: counter {queue, status}
- queue_processing_duration_seconds: histogram {queue}
- queue_depth: gauge {queue}
- queue_consumer_lag: gauge {queue, consumer_group}
infrastructure_metrics:
- cpu_usage_percent: gauge {instance}
- memory_usage_bytes: gauge {instance}
- disk_usage_bytes: gauge {instance, mount}
- disk_io_seconds: counter {instance, device}
- network_bytes: counter {instance, direction}
- container_cpu_usage: gauge {pod, container}
- container_memory_usage: gauge {pod, container}
Stack Recommendations
| Component | Options | Recommendation |
|---|---|---|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |
Phase 3: Distributed Tracing
Trace Architecture
Client Request
→ API Gateway (root span)
→ Auth Service (child span)
→ Order Service (child span)
→ Database Query (child span)
→ Payment Service (child span)
→ Stripe API (child span)
→ Notification Service (child span)
→ Email Provider (child span)
OpenTelemetry Setup
Auto-instrumentation (Node.js):
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
'@opentelemetry/instrumentation-express': { enabled: true },
})],
serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();
Custom spans for business logic:
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service');
async function processPayment(order: Order) {
return tracer.startActiveSpan('process-payment', async (span) => {
span.setAttributes({
'order.id': order.id,
'order.amount_cents': order.amountCents,
'payment.method': order.paymentMethod,
});
try {
const result = await chargeCard(order);
span.setAttributes({ 'payment.status': result.status });
return result;
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
span.recordException(err);
throw err;
} finally {
span.end();
}
});
}
Sampling Strategies
| Strategy | When | Config |
|---|---|---|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |
Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
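The recommended combination is easy to reason about in pure Python. This sketch mirrors the idea behind OpenTelemetry's ParentBased + TraceIdRatioBased samplers (the `should_sample` helper is illustrative, not the SDK API): inherit the parent's decision when one exists, otherwise decide deterministically from the trace_id so every service agrees without coordination.

```python
from typing import Optional

def should_sample(trace_id: str, parent_sampled: Optional[bool],
                  ratio: float = 0.1) -> bool:
    """Parent-based + probabilistic sampling decision (illustrative sketch)."""
    if parent_sampled is not None:
        # Parent-based: a child never contradicts its parent's decision.
        return parent_sampled
    # Deterministic by trace_id: interpret the low 64 bits of the hex trace_id
    # as an integer and keep the trace if it falls below ratio * 2^64.
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound

# Root span with no parent: ~10% of trace_ids fall under the bound.
print(should_sample("0" * 32, parent_sampled=None))   # low trace_id → sampled
print(should_sample("f" * 32, parent_sampled=None))   # high trace_id → dropped
```

Tail-based sampling then runs at the collector, after spans arrive, so errors and outliers survive even when the head-based decision above would have dropped them.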
Context Propagation
| Header | Standard | Format |
|---|---|---|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |
Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
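Parsing the traceparent header follows directly from the format above: four dash-separated fields, with the sampled decision carried in bit 0 of the flags byte. A minimal sketch (no validation beyond field lengths; the example header value is the one from the W3C specification):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: 00-{trace_id}-{span_id}-{flags}."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 = sampled flag
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["trace_id"], ctx["sampled"])  # trace_id, True
```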
Trace Storage
| Volume | Solution | Retention |
|---|---|---|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |
Phase 4: SLOs, SLIs & Error Budgets
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLI | Measurement |
|---|---|---|---|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |
SLO Definition Template
slo:
name: "Payment API Availability"
service: payment-api
owner: payments-team
sli:
type: availability
definition: "Proportion of non-5xx responses"
measurement: |
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
target: 99.95%
window: rolling_30d
error_budget:
total_minutes: 21.9
burn_rate_alerts:
- severity: critical
burn_rate: 14.4x
short_window: 5m
long_window: 1h
- severity: warning
burn_rate: 6x
short_window: 30m
long_window: 6h
- severity: ticket
burn_rate: 1x
short_window: 6h
long_window: 3d
consequences:
budget_remaining_above_50pct: "Normal development velocity"
budget_remaining_20_to_50pct: "Prioritize reliability work"
budget_remaining_below_20pct: "Feature freeze; reliability only"
budget_exhausted: "All hands on reliability until budget recovers"
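The numbers in the template are derived, not chosen: the budget in minutes is (1 − target) × window, and a burn rate is the multiple of normal consumption that would spend a given budget fraction in a given time. A sketch of the arithmetic (the template's 21.9 minutes comes from an average 30.44-day month; the classic 14.4x threshold assumes a strict 30-day window and 2% of budget spent in one hour):

```python
# Error budget and burn-rate arithmetic behind the SLO template above.

def budget_minutes(target: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime: (1 - target) fraction of the window."""
    return (1 - target) * window_minutes

def burn_rate(budget_fraction: float, hours: float, window_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `hours`."""
    return budget_fraction * window_hours / hours

print(round(budget_minutes(0.9995), 1))                    # 21.6 (strict 30-day window)
print(round(budget_minutes(0.9995, 30.44 * 24 * 60), 1))   # 21.9 (average month)
print(round(burn_rate(0.02, 1), 1))                        # 14.4x: 2% of budget in 1h
```

At a 14.4x burn rate, a 30-day budget is gone in roughly two days, which is why it pages immediately; the 1x ticket-severity alert just means the budget is on pace to be exactly spent.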
Common SLO Targets
| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|---|---|---|---|---|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
Error Budget Tracking
error_budget_review:
week: "2026-W08"
service: payment-api
slo_target: 99.95%
budget:
total_minutes_this_period: 21.9
consumed_minutes: 8.2
remaining_minutes: 13.7
remaining_percent: 62.6%
incidents_consuming_budget:
- date: "2026-02-18"
duration_minutes: 5.1
cause: "Database connection pool exhaustion"
preventable: true
action: "Increase pool size + add saturation alert"
- date: "2026-02-20"
duration_minutes: 3.1
cause: "Upstream payment provider timeout"
preventable: false
action: "Add circuit breaker with fallback"
velocity_decision: "Normal — 62.6% budget remaining"
reliability_work_this_week:
- "Add connection pool saturation alert"
- "Implement circuit breaker for payment provider"
Phase 5: Alert Design
Alert Quality Principles
- Every alert must be actionable — if no one needs to act, it's not an alert
- Every alert needs a runbook — linked directly in the alert annotation
- Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
- Multi-window burn rate — not static thresholds (see SLO alerts above)
- Alert on absence, not just presence — "no orders in 15 min" catches silent failures
Alert Severity Levels
| Severity | Response Time | Channel | Who | Example |
|---|---|---|---|---|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require a sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |
Alert Template (Prometheus Alertmanager)
groups:
- name: payment-api-slo
rules:
- alert: PaymentAPIHighErrorRate
expr: |
(
sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
) > 0.01
for: 5m
labels:
severity: critical
service: payment-api
team: payments
annotations:
summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
description: "5xx error rate has exceeded 1% for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-errors"
dashboard: "https://grafana.internal/d/payment-api"
- alert: PaymentAPINoTraffic
expr: |
sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
for: 5m
labels:
severity: critical
service: payment-api
annotations:
summary: "Payment API receiving zero traffic for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"
- alert: PaymentAPILatencyHigh
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
runbook: "https://wiki.internal/runbooks/payment-api-latency"
Runbook Template
# Runbook: PaymentAPIHighErrorRate
## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.
## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)
## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
- Database: [dashboard link]
- Stripe API: [status page]
- Redis cache: [dashboard link]
4. Check application logs:
kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'
## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |
## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging
## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min
Phase 6: Dashboard Architecture
Dashboard Hierarchy
L1: Executive / Business Dashboard (non-technical stakeholders)
↓
L2: Service Overview Dashboard (on-call, quick triage)
↓
L3: Service Deep-Dive Dashboard (debugging specific service)
↓
L4: Infrastructure Dashboard (resource-level details)
L1: Business Dashboard
panels:
- title: "Revenue per Minute"
type: stat
query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
- title: "Active Users (5min)"
type: stat
query: "count(count by (user_id) (http_requests_total{...}[5m]))"
- title: "Checkout Success Rate"
type: gauge
query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
thresholds: [95, 98, 99.5]
- title: "Error Budget Remaining"
type: gauge
query: "1 - (error_budget_consumed / error_budget_total)"
L2: Service Overview Dashboard
Every service gets one of these with identical layout:
row_1_traffic:
- "Request Rate (rps)" — timeseries, by status code
- "Error Rate (%)" — timeseries, threshold line at SLO
- "Active Requests" — gauge
row_2_latency:
- "Latency Distribution" — heatmap
- "p50 / p95 / p99" — timeseries, threshold lines
- "Latency by Endpoint" — table, sorted by p99
row_3_dependencies:
- "Downstream Latency" — timeseries per dependency
- "Downstream Error Rate" — timeseries per dependency
- "Database Query Duration" — timeseries by query type
row_4_resources:
- "CPU Usage" — timeseries per pod
- "Memory Usage" — timeseries per pod
- "Pod Restarts" — stat
row_5_business:
- "Business Metric 1" — service-specific
- "Business Metric 2" — service-specific
Dashboard Rules
- Time range default: last 1 hour — most debugging happens in recent time
- Variable selectors at top: environment, service, instance
- Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
- Link alerts to dashboards — every alert annotation includes dashboard URL
- No more than 15 panels per dashboard — split into L3 if needed
- Include "as of" timestamp — so screenshots in incidents are unambiguous
- Dashboard as code — store Grafana JSON in git, provision via API
Phase 7: Incident Response
Incident Severity Classification
| Severity | Criteria | Response | Communication |
|---|---|---|---|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |
Incident Roles
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |
Incident Response Workflow
1. DETECT
- Alert fires → on-call paged
- Customer report → support escalates
- Internal discovery → engineer reports
2. TRIAGE (first 5 minutes)
- Confirm the issue is real (not false alert)
- Classify severity (SEV-1 through SEV-4)
- Open incident channel: #inc-YYYY-MM-DD-short-description
- Assign roles (IC, Tech Lead, Comms)
3. MITIGATE (next 5-30 minutes)
- Goal: STOP THE BLEEDING, not find root cause
- Options (try in order):
a. Rollback last deploy
b. Scale up / restart pods
c. Toggle feature flag off
d. Redirect traffic / enable fallback
e. Manual data fix
- Document every action with timestamp
4. STABILIZE
- Confirm mitigation is working (metrics back to normal)
- Monitor for 15-30 min for recurrence
- Update status page: "Monitoring fix"
5. RESOLVE
- Confirm all metrics healthy for 30+ min
- Update status page: "Resolved"
- Schedule post-mortem (within 48 hours for SEV-1/2)
- Send internal summary to stakeholders
Incident Channel Template
📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie
Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes
Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved
Phase 8: Post-Mortem Framework
Blameless Post-Mortem Template
post_mortem:
title: "Payment API Connection Pool Exhaustion"
date: "2026-02-22"
severity: SEV-2
duration: 27 minutes (14:23 — 14:50 UTC)
authors: ["@alice", "@bob"]
reviewers: ["@engineering-leads"]
status: action_items_in_progress
summary: |
A deployment at 14:15 introduced a connection leak in the payment API.
Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
checkout requests. Rolled back at 14:31; recovered by 14:50.
impact:
user_impact: "~340 users saw checkout failures over 27 minutes"
revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
data_impact: "No data loss. 12 orders failed; users could retry successfully."
timeline:
- time: "14:15"
event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
- time: "14:23"
event: "PaymentAPIHighErrorRate alert fired"
- time: "14:25"
event: "IC assigned, confirmed via dashboard"
- time: "14:28"
event: "Root cause identified: new ORM query not releasing connections"
- time: "14:31"
event: "Rollback initiated: v2.3.1 → v2.3.0"
- time: "14:35"
event: "Error rate declining"
- time: "14:50"
event: "Resolved: error rate <0.1% sustained"
root_cause: |
The v2.3.1 deploy introduced a new database query in the order validation
path. The query used a raw connection instead of the pool's managed client,
so connections were acquired but never released. Under load, the pool
exhausted within 8 minutes.
contributing_factors:
- "No integration test for connection pool behavior under load"
- "Connection pool saturation metric existed but had no alert"
- "Code review didn't catch raw connection usage"
what_went_well:
- "Alert fired within 8 minutes of deploy"
- "IC assigned in 2 minutes"
- "Root cause identified in 3 minutes (clear in logs)"
- "Rollback executed cleanly"
what_went_wrong:
- "8-minute detection gap after deploy"
- "No canary deployment to catch before full rollout"
- "Connection pool saturation had no alert"
action_items:
- action: "Add connection pool saturation alert (>80% for 2 min)"
owner: "@bob"
priority: P1
due: "2026-02-25"
status: in_progress
ticket: "ENG-1234"
- action: "Enable canary deployments for payment-api"
owner: "@alice"
priority: P1
due: "2026-03-01"
ticket: "ENG-1235"
- action: "Add linting rule: no raw DB connections in application code"
owner: "@charlie"
priority: P2
due: "2026-03-07"
ticket: "ENG-1236"
- action: "Load test payment-api connection pool in staging"
owner: "@bob"
priority: P2
due: "2026-03-07"
ticket: "ENG-1237"
lessons_learned:
- "Resource saturation metrics need alerts, not just dashboards"
- "Canary deployments are mandatory for Tier 0 services"
- "ORM abstractions don't guarantee connection safety — review raw queries"
Post-Mortem Meeting Agenda (60 minutes)
1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in
5 Whys Exercise
Problem: 5xx errors in payment API
Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this
Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
Phase 9: On-Call Operations
On-Call Structure
on_call:
rotation: weekly
handoff_day: Monday 10:00 UTC
primary:
response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
escalation_after: 15 minutes no-ack
secondary:
response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
escalation_after: 30 minutes no-ack
manager_escalation:
trigger: SEV-1 unresolved after 30 minutes
handoff_checklist:
- Review open incidents and active alerts
- Check error budget status for all services
- Read post-mortems from previous week
- Verify PagerDuty schedule and contact info
- Test alert routing (send test page)
On-Call Health Metrics
| Metric | Healthy | Needs Attention | Unhealthy |
|---|---|---|---|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
Weekly On-Call Review Template
on_call_review:
week: "2026-W08"
engineer: "@bob"
incidents:
total: 7
sev_1: 0
sev_2: 1
sev_3: 4
false_positives: 2
after_hours: 3
time_spent:
incident_response: "4.5 hours"
toil_automation: "2 hours"
runbook_updates: "1 hour"
improvements_made:
- "Silenced noisy disk alert on dev servers"
- "Added auto-remediation for pod restart threshold"
improvements_needed:
- "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
- "Payment retry logic needs circuit breaker (caused 3 alerts)"
handoff_notes: |
Watch payment-api p99 latency — it's been creeping up since Wednesday.
Stripe changed their sandbox endpoints; staging may throw errors.
Phase 10: Chaos Engineering & Reliability Testing
Chaos Principles
- Start with a hypothesis: "If X fails, the system should Y"
- Run in production (start small — one instance, one AZ)
- Minimize blast radius with automatic rollback
- Build confidence incrementally: staging → canary → production
Chaos Experiment Template
chaos_experiment:
name: "Payment DB failover"
hypothesis: "If the primary database becomes unavailable, traffic should
failover to the replica within 30 seconds with <1% error rate spike"
steady_state:
- metric: "checkout_success_rate"
expected: ">99.5%"
- metric: "db_query_duration_p99"
expected: "<200ms"
injection:
type: "network_partition"
target: "payment-db-primary"
duration: "5 minutes"
blast_radius: "single AZ"
abort_conditions:
- "checkout_success_rate < 95% for > 60 seconds"
- "revenue_per_minute drops > 50%"
- "any SEV-1 incident declared"
results:
failover_time: "22 seconds"
error_spike: "0.3% for 25 seconds"
hypothesis_confirmed: true
follow_up_actions:
- "Document failover behavior in runbook"
- "Add failover time as SLI (target: <30s)"
Chaos Engineering Maturity Levels
| Level | What You Test | Tools |
|---|---|---|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |
Phase 11: Observability Cost Optimization
Cost Drivers (Ranked)
| # | Driver | Typical % of Bill | Optimization |
|---|---|---|---|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |
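Sampling repetitive logs, the biggest lever in the table above, needs only a per-key counter. A sketch of 1:N sampling (the `LogSampler` class is illustrative; real log routers expose this as a pipeline stage):

```python
from collections import defaultdict

# Sketch of 1:N sampling for repetitive logs such as health checks: exactly
# one event in every N with the same key survives, so volume drops ~(N-1)/N
# while the signal ("this endpoint is still being hit") is preserved.

class LogSampler:
    def __init__(self, n: int = 100):
        self.n = n
        self.seen = defaultdict(int)  # key -> events observed so far

    def should_emit(self, key: str) -> bool:
        """Emit the 1st, (N+1)th, (2N+1)th ... event for each key."""
        self.seen[key] += 1
        return self.seen[key] % self.n == 1

sampler = LogSampler(n=100)
emitted = sum(sampler.should_emit("GET /health") for _ in range(250))
print(emitted)  # 3 — events 1, 101, and 201 pass through
```

A useful refinement is to attach the drop count to each emitted line (e.g. `dropped_since_last: 99`) so aggregate rates stay recoverable.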
Cost Reduction Checklist
```yaml
cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"
  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"
  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
```
Monthly Cost Review Template
```yaml
observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"
  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }
  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"
  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"
```
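Filling in the percentage and `cost_per` fields is simple division, but scripting it keeps every monthly review consistent. A sketch with made-up numbers — all figures and the function itself are illustrative:

```python
def cost_review(total: float, breakdown: dict[str, float],
                requests: int, engineers: int, budget: float) -> dict:
    """Derive the percentage and per-unit fields of the review template."""
    return {
        "pct": {k: round(100 * v / total, 1) for k, v in breakdown.items()},
        "cost_per_request": total / requests,
        "cost_per_engineer": total / engineers,
        "budget_status": "over_budget" if total > budget else "on_track",
    }

review = cost_review(
    total=12_000.0,
    breakdown={"logs": 6_000, "metrics": 2_400, "traces": 1_800, "infra": 1_800},
    requests=300_000_000,
    engineers=25,
    budget=15_000.0,
)
print(review["pct"]["logs"])        # 50.0
print(review["cost_per_request"])   # 4e-05
print(review["budget_status"])      # on_track
```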
Phase 12: Advanced Patterns
Correlation: Connecting the Three Pillars
Every log line includes `trace_id` and `span_id`; every trace span includes `service` and `operation`; every metric carries a `service` label. These shared fields enable two common correlation paths:

- Metric → trace → log: alert fires (metric) → click through to the dashboard → filter to the alert's time window → search traces for the same service and window → find the failing trace → filter logs by its `trace_id` → see the exact error
- Log → trace → metric: support ticket (user report) → find the `request_id` in logs → extract the `trace_id` → view the full trace → identify the slow span → check that span's service metrics → confirm the pattern
Synthetic Monitoring
```yaml
synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"
  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
```
Feature Flag Observability
```yaml
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate"
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
```
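The auto-disable rule is a ratio test against the control group. A sketch — the 2x ratio comes from the config above, while the guard function and the zero-division floor are illustrative:

```python
def should_auto_disable(variant_error_rate: float, control_error_rate: float,
                        ratio: float = 2.0, floor: float = 1e-6) -> bool:
    """Disable the flag when the variant errors more than `ratio`x the control."""
    baseline = max(control_error_rate, floor)  # guard against a zero-error control
    return variant_error_rate / baseline > ratio

print(should_auto_disable(0.05, 0.01))   # True: variant errors at 5x control
print(should_auto_disable(0.015, 0.01))  # False: 1.5x is within tolerance
```

In production you would also require a minimum sample size per variant before acting, so a single early error on a low-traffic flag does not trip the kill switch.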
Observability Maturity Model
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate alerts based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |
Quality Scoring Rubric (0-100)
| Dimension | Weight | 0 | 5 | 10 |
|---|---|---|---|---|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false-positive rate |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |

- 90-100: World-class. Teach others.
- 70-89: Production-ready. Fill specific gaps.
- 50-69: Functional but fragile.
- <50: Significant reliability risk.
10 Observability Commandments
- Structured or it didn't happen — unstructured logs are technical debt
- Correlate everything — trace_id connects logs, traces, and metrics
- Alert on symptoms, not causes — users don't care about CPU, they care about latency
- Every alert gets a runbook — no runbook = no alert
- SLOs drive velocity — error budgets decide when to ship vs stabilize
- Dashboards have hierarchy — executives don't need pod CPU graphs
- Blameless post-mortems always — blame prevents learning
- Cost is a feature — observability that bankrupts you isn't observability
- You build it, you run it — the team that ships code owns its observability
- Practice failure — chaos engineering builds confidence
12 Natural Language Commands
| Command | What It Does |
|---|---|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |
⚡ Level Up Your Observability
This skill gives you the methodology. For industry-specific implementation patterns, see the AfrexAI storefront linked below.
🔗 More Free Skills by AfrexAI
- afrexai-devops-engine — CI/CD, infrastructure, deployment strategies
- afrexai-api-architect — API design, security, versioning
- afrexai-database-engineering — Schema design, query optimization, migrations
- afrexai-code-reviewer — Code review methodology with SPEAR framework
- afrexai-prompt-engineering — System prompt design, testing, optimization
Browse all AfrexAI skills: clawhub.com | Full storefront