monitoring-ops
Monitoring Operations
Comprehensive observability patterns covering the three pillars (metrics, logging, tracing), alerting strategies, dashboard design, and infrastructure monitoring for production systems.
Three Pillars Quick Reference
Use this table to decide which observability signal fits your need:
| Pillar | Best For | Tools | Data Type |
|---|---|---|---|
| Metrics | Aggregated numeric measurements, trends, alerting on thresholds | Prometheus, Datadog, CloudWatch, StatsD | Time-series (numeric) |
| Logs | Discrete events, error details, audit trails, debugging context | Loki, ELK, CloudWatch Logs, Fluentd | Unstructured/structured text |
| Traces | Request flow across services, latency breakdown, dependency mapping | Jaeger, Tempo, Zipkin, Datadog APM | Span trees (structured) |
When to use which:
- "How many requests per second?" → Metrics (counter + rate)
- "Why did this specific request fail?" → Logs (error message + stack trace)
- "Where is the latency in this request?" → Traces (span waterfall)
- "Is the system healthy right now?" → Metrics (gauges + alerts)
- "What happened at 3:42 AM?" → Logs (timestamped event search)
- "Which downstream service caused the timeout?" → Traces (span analysis)
Correlation is key: Connect all three by embedding trace_id in log entries, recording exemplars in metrics, and linking trace spans to log queries.
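For example, a minimal Python sketch of the log side of that correlation, assuming the opentelemetry-api package and stdlib logging (the TraceContextFilter name is illustrative, not a library class):

```python
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Inject the active trace/span IDs into every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # All-zero IDs mean "no active span"; log empty strings then.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'
))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
```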
Metrics Type Decision Tree
Use this tree to select the correct metric type:
```
What are you measuring?
│
├─ A count of events that only goes up?
│  └─ COUNTER
│     Examples: http_requests_total, errors_total, bytes_sent_total
│     Use rate() or increase() to get per-second or per-interval values
│     Never use a counter's raw value — it resets on restart
│
├─ A current value that goes up AND down?
│  └─ GAUGE
│     Examples: temperature_celsius, active_connections, queue_depth
│     Use for snapshots of current state
│     Can use avg_over_time(), max_over_time() for trends
│
├─ A distribution of values (latency, size)?
│  │
│  ├─ Need aggregatable quantiles across instances?
│  │  └─ HISTOGRAM
│  │     Examples: http_request_duration_seconds, response_size_bytes
│  │     Define buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
│  │     Use histogram_quantile() for percentiles (p50, p95, p99)
│  │     Aggregatable across instances (histograms can be summed)
│  │
│  └─ Need pre-calculated quantiles on a single instance?
│     └─ SUMMARY
│        Examples: go_gc_duration_seconds
│        Pre-calculates quantiles client-side
│        NOT aggregatable across instances
│        Prefer histogram unless you have a specific reason
│
└─ None of the above?
   └─ INFO metric (labels only, value=1)
      Examples: build_info{version="1.2.3", commit="abc123"}
      Use for metadata exposed as metrics
```
Rule of thumb: Start with counters and histograms. Add gauges for current state. Avoid summaries unless you have a compelling reason.
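A minimal sketch of each type with the Python prometheus_client library (metric names mirror the tree's examples; Info appends an _info suffix, yielding build_info):

```python
from prometheus_client import Counter, Gauge, Histogram, Info

# COUNTER: only goes up; query with rate()/increase()
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])

# GAUGE: current state, can go up and down
ACTIVE_CONNECTIONS = Gauge("active_connections", "Open connections")

# HISTOGRAM: distribution with explicit buckets, aggregatable
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

# INFO: metadata as labels, value fixed at 1 (exported as build_info)
BUILD = Info("build", "Build metadata")
BUILD.info({"version": "1.2.3", "commit": "abc123"})

REQUESTS.labels(method="POST", status="200").inc()
ACTIVE_CONNECTIONS.set(42)
REQUEST_DURATION.observe(0.073)
```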
Alerting Decision Tree
```
What type of alert do you need?
│
├─ Known threshold with a fixed boundary?
│  └─ THRESHOLD-BASED
│     Example: CPU > 90% for 5 minutes
│     Pros: Simple, predictable, easy to understand
│     Cons: Requires manual tuning, doesn't adapt to patterns
│     Best for: Resource limits, error rate spikes, queue depth
│
├─ Normal behavior varies by time/season?
│  └─ ANOMALY-BASED
│     Example: Traffic 3 standard deviations below normal for this hour
│     Pros: Adapts to patterns, catches novel failures
│     Cons: Noisy during transitions, requires training data
│     Best for: Traffic patterns, business metrics, gradual degradation
│
└─ Defined reliability targets?
   └─ SLO-BASED (PREFERRED)
      Example: Error budget burn rate > 14.4x for 1 hour
      Pros: Aligned with user impact, reduces noise, principled
      Cons: Requires SLI/SLO definition, more complex setup
      Best for: User-facing services, platform reliability
```
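The 14.4x figure falls out of simple arithmetic; here is a sketch of the burn-rate math, using the multiwindow values from the Google SRE Workbook:

```python
# Burn rate is the multiple of a steady burn that would exactly
# exhaust the error budget over the full SLO window. Alerting when
# 2% of a 30-day budget burns in 1 hour gives 14.4x.
def burn_rate(budget_fraction: float, window_hours: float,
              slo_window_days: float = 30.0) -> float:
    """Burn-rate multiplier that consumes budget_fraction of the
    error budget within window_hours."""
    return budget_fraction * slo_window_days * 24 / window_hours

print(burn_rate(0.02, 1))   # 14.4 -> page
print(burn_rate(0.05, 6))   # 6.0  -> page (slower burn)
print(burn_rate(0.10, 72))  # 1.0  -> ticket
```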
Severity Levels
| Severity | Response | Examples | Routing |
|---|---|---|---|
| Critical (P1) | Page on-call immediately | Service down, data loss risk, security breach | PagerDuty high-urgency, phone call |
| Warning (P2) | Investigate within hours | Elevated error rate, disk 80% full, SLO burn rate elevated | PagerDuty low-urgency, Slack alert channel |
| Info (P3) | Review next business day | Deployment completed, certificate expiring in 30 days | Slack info channel, ticket auto-created |
When to Page vs When to Ticket
Page (wake someone up) when:
- Users are currently impacted
- Data loss is occurring or imminent
- Security incident is active
- Error budget will be exhausted within hours
Create ticket (don't page) when:
- Issue is not user-facing yet
- Automated remediation is possible
- Degradation is slow and has runway
- Issue is during business hours and can be triaged normally
Structured Logging Quick Reference
Standard JSON Log Format
```json
{
  "timestamp": "2026-03-09T14:32:01.123Z",
  "level": "ERROR",
  "message": "Failed to process payment",
  "service": "payment-api",
  "version": "1.4.2",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req-abc123",
  "user_id": "usr-789",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Gateway response timeout after 30s",
    "stack": "..."
  },
  "duration_ms": 30042,
  "http": {
    "method": "POST",
    "path": "/api/v1/payments",
    "status_code": 504
  }
}
```
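One way to emit logs in roughly this shape, sketched with the Python structlog library (exact field names and casing depend on your processor chain):

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True, key="timestamp"),
        # structlog's default message key is "event"; rename to match
        # the format above (EventRenamer ships in newer structlog).
        structlog.processors.EventRenamer("message"),
        structlog.processors.JSONRenderer(),
    ]
)

# Bind stable fields once; per-event fields go on the call itself.
log = structlog.get_logger().bind(service="payment-api", version="1.4.2")
log.error(
    "Failed to process payment",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    request_id="req-abc123",
    duration_ms=30042,
)
```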
Log Level Decision Guide
| Level | When to Use | Examples |
|---|---|---|
| DEBUG | Development only, verbose internal state | Variable values, SQL queries, cache hits/misses |
| INFO | Normal operations worth recording | Request completed, job started/finished, config loaded |
| WARN | Degraded but still functioning | Retry succeeded, fallback used, approaching limit |
| ERROR | Operation failed, needs attention | Payment failed, API call error, constraint violation |
| FATAL | Process cannot continue, must exit | Database unreachable at startup, invalid config, OOM |
Rules:
- Never log at ERROR for expected conditions (user input validation → WARN)
- Every ERROR should be actionable — if no one will act on it, use WARN
- DEBUG should be off in production by default
- INFO should not be noisy — 1-5 log lines per request, not 50
Correlation IDs
- Generate a request_id (UUID v4 or ULID) at the edge/gateway
- Propagate it through all internal services via headers (X-Request-ID)
- Include trace_id and span_id from distributed tracing
- Log all three IDs in every log entry for cross-referencing
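A minimal sketch of the generate-or-reuse step in Python (the downstream URL is hypothetical; any HTTP client that forwards headers works):

```python
import uuid

import httpx

def get_or_create_request_id(headers: dict[str, str]) -> str:
    # At the edge: generate when absent; internal hops reuse the value.
    return headers.get("x-request-id") or str(uuid.uuid4())

def call_downstream(request_id: str) -> httpx.Response:
    # Forward the same ID on every outbound request.
    return httpx.get(
        "https://inventory.internal/api/v1/stock",  # hypothetical service
        headers={"X-Request-ID": request_id},
    )
```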
Distributed Tracing Quick Reference
Core Concepts
- Trace: End-to-end journey of a request across all services
- Span: A single unit of work (HTTP call, DB query, function execution)
- Context propagation: Passing trace/span IDs between services via headers
W3C TraceContext Header
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  │                                │                └─ flags (01=sampled)
             │  │                                └─ parent span ID (16 hex)
             │  └─ trace ID (32 hex)
             └─ version (00)
```
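Parsing the header by hand, as an illustration; in practice the W3C TraceContext propagator in an OpenTelemetry SDK handles this for you:

```python
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                # currently always "00"
        "trace_id": trace_id,              # 32 hex chars
        "parent_span_id": parent_span_id,  # 16 hex chars
        "sampled": int(flags, 16) & 0x01 == 0x01,
    }

print(parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
))
```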
Sampling Strategies
| Strategy | How It Works | Use When |
|---|---|---|
| Head-based (ratio) | Decide at trace start, propagate decision | Low traffic, need predictable volume |
| Always-on | Sample everything | Development, low-traffic services |
| Parent-based | Follow parent's sampling decision | Default for most services |
| Tail-based | Decide after trace completes (at Collector) | Need error/slow traces, high traffic |
Recommendation: Use parent-based + tail-based at the Collector. This captures all error traces and slow traces while controlling volume.
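The head/parent-based half can be set in application code; a sketch with opentelemetry-python (the 10% ratio is an illustrative value, and the tail-based half lives in the Collector's tail_sampling processor, not in the app):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(
    # Respect the parent's decision; sample 10% of new root traces.
    sampler=ParentBased(root=TraceIdRatioBased(0.1))
)
```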
Trace ID in Logs
Always include trace_id in structured log entries. This enables jumping from a log line to the full trace view:
Log entry → trace_id → Jaeger/Tempo → full request waterfall
Tool Selection Matrix
| Feature | Prometheus + Grafana | Datadog | Grafana Cloud | CloudWatch |
|---|---|---|---|---|
| Cost | Free (infra costs) | $$$$ (per host/metric) | $$ (usage-based) | $$ (AWS-native) |
| Setup complexity | High (self-managed) | Low (SaaS agent) | Medium (managed) | Low (AWS-native) |
| Metrics | Prometheus (excellent) | Built-in (excellent) | Mimir (excellent) | Built-in (good) |
| Logs | Loki (good) | Built-in (excellent) | Loki (good) | CloudWatch Logs (good) |
| Traces | Jaeger/Tempo (good) | APM (excellent) | Tempo (good) | X-Ray (adequate) |
| Alerting | Alertmanager (good) | Built-in (excellent) | Grafana Alerting (good) | CloudWatch Alarms (adequate) |
| Dashboards | Grafana (excellent) | Built-in (excellent) | Grafana (excellent) | Dashboards (adequate) |
| Retention | Configurable (unlimited) | 15 months default | Configurable | Up to 15 months |
| Multi-cloud | Yes | Yes | Yes | AWS only |
| Best for | Cost-conscious, control | Full-featured, enterprise | Open-source + managed | AWS-native shops |
Recommendation path:
- Starting out / budget-conscious: Prometheus + Grafana + Loki + Tempo (all free, self-hosted)
- Small team, want managed: Grafana Cloud free tier (10k metrics, 50GB logs, 50GB traces)
- Enterprise, need everything: Datadog (expensive but comprehensive)
- AWS-only shop: CloudWatch + X-Ray (simplest if already on AWS)
Dashboard Design
USE Method (Infrastructure)
For every resource (CPU, memory, disk, network):
| Signal | Question | Metric Example |
|---|---|---|
| Utilization | How busy is it? | node_cpu_seconds_total (% busy) |
| Saturation | How overloaded is it? | node_load1 (run queue length) |
| Errors | Are there error events? | node_network_receive_errs_total |
RED Method (Services)
For every service endpoint:
| Signal | Question | Metric Example |
|---|---|---|
| Rate | How many requests per second? | rate(http_requests_total[5m]) |
| Errors | How many are failing? | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | How long do they take? | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
Four Golden Signals (Google SRE)
| Signal | What to Measure | Alert Threshold Guidance |
|---|---|---|
| Latency | Time to serve a request (distinguish success vs error latency) | p99 > 2x baseline |
| Traffic | Demand on the system (requests/sec, sessions, transactions) | Anomaly detection |
| Errors | Rate of failed requests (explicit 5xx, implicit policy violations) | > 0.1% of traffic |
| Saturation | How "full" the service is (CPU, memory, queue depth) | > 80% capacity |
Dashboard Layout Best Practices
- Top row: Key health indicators (error rate, latency p99, availability %)
- Second row: Traffic and throughput (requests/sec, active users)
- Third row: Resource utilization (CPU, memory, disk, network)
- Bottom rows: Detailed breakdowns (by endpoint, by status code, by region)
- Use variables: Service, environment, time range as dropdown selectors
- Include annotations: Deployments, incidents, config changes as vertical markers
Common Gotchas
| Gotcha | Why It Happens | Fix |
|---|---|---|
| Cardinality explosion | Using unbounded label values (user ID, request path, query string) | Use bounded labels only; aggregate high-cardinality data in logs, not metrics |
| Alert fatigue | Too many alerts, too sensitive thresholds, alerts on non-actionable symptoms | Require runbook for every alert; tune thresholds; use SLO-based alerting |
| Missing correlation IDs | Logs, metrics, and traces not linked together | Include trace_id in all log entries; use exemplars in metrics |
| Sampling bias | Head-based sampling drops error/slow traces at high sample rates | Use tail-based sampling at the Collector to always capture errors and slow traces |
| Log volume costs | DEBUG or verbose INFO in production, logging full request/response bodies | Set production to INFO minimum; truncate large payloads; use sampling for verbose paths |
| Metric naming inconsistency | Different teams use different naming conventions | Adopt OpenMetrics naming: namespace_subsystem_unit_suffix (e.g., http_server_request_duration_seconds) |
| Dashboard sprawl | Everyone creates dashboards, nobody maintains them | Standardize with USE/RED templates; review quarterly; delete unused dashboards |
| SLO too aggressive | Setting 99.99% availability without the budget or architecture for it | Start with 99.5% or 99.9%; tighten only when consistently meeting targets with margin |
| Missing baseline | Alerting on absolute thresholds without understanding normal behavior | Collect 2-4 weeks of baseline data before setting alert thresholds |
| Over-instrumentation | Instrumenting every function, creating too many spans/metrics | Instrument at service boundaries; use auto-instrumentation for HTTP/DB/gRPC; add manual spans selectively |
| Ignoring metric staleness | Assuming a metric that stops reporting means zero | Use absent() or up == 0 to detect missing scrapers; distinguish "zero" from "not reporting" |
| Alerting on cause not symptom | Alerting on CPU usage instead of user-facing error rate | Alert on symptoms (error rate, latency); use cause metrics (CPU, memory) for investigation |
| No retention policy | Storing all metrics/logs at full resolution forever | Define retention tiers: 15s resolution for 2 weeks, 1m for 3 months, 5m for 1 year |
| Dashboard without context | Graphs with no units, no description, no threshold lines | Add units to Y-axis, threshold lines for SLOs, panel descriptions explaining what "good" looks like |
Reference Files
| File | Contents | Lines |
|---|---|---|
| metrics-alerting.md | Prometheus, Grafana, OpenTelemetry metrics, SLI/SLO/SLA, alert routing, runbooks, uptime monitoring | ~650 |
| logging.md | Structured logging, log levels, correlation IDs, aggregation (Loki, ELK), retention, PII masking, language-specific | ~550 |
| tracing.md | OpenTelemetry, spans, context propagation, sampling, Jaeger, async tracing, DB/HTTP/gRPC instrumentation | ~600 |
| infrastructure.md | Health checks, K8s probes, Docker HEALTHCHECK, infra metrics, APM, cost optimization, incident response | ~550 |
See Also
- docker-ops — Container monitoring with cAdvisor, Docker stats, and health checks
- ci-cd-ops — Pipeline observability, deployment tracking, build metrics
- nginx-ops — Nginx access/error log parsing, request metrics, upstream monitoring
- python-observability-ops — Python-specific instrumentation with structlog, opentelemetry-python
- OpenTelemetry documentation
- Prometheus best practices
- Google SRE Book — Monitoring chapter
- Grafana dashboards library