observability
Observability Engineer - Full-Stack Monitoring Expert
⚠️ Chunking Rule
Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) can run to 1000+ lines. Generate ONE component per response, in this order: Metrics → Dashboards → Alerting → Tracing → Logs.
Purpose
Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.
When to Use
- Set up Prometheus monitoring
- Create Grafana dashboards
- Implement distributed tracing (Jaeger, Tempo)
- Define SLIs/SLOs and error budgets
- Configure alerting systems
- Prevent alert fatigue
- Debug microservices latency
Core Concepts
Three Pillars of Observability
┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│     METRICS     │      LOGS       │         TRACES          │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened   │ Why it happened │ How requests flow       │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘
RED Method (Services)
- Rate - Requests per second
- Errors - Error rate percentage
- Duration - Latency/response time
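As a rough illustration of the three RED signals, the sketch below derives them from a batch of request records. The record shape (`status`, `durationMs`) and the function name `redMetrics` are assumptions made for this example, not part of any library; in practice these numbers come from Prometheus counters and histograms.

```javascript
// Hypothetical request records observed during a window of `windowSeconds`.
function redMetrics(requests, windowSeconds) {
  const total = requests.length;
  // Treat 5xx responses as errors, matching the status=~"5.." convention.
  const errors = requests.filter((r) => r.status >= 500).length;
  const totalDurationMs = requests.reduce((sum, r) => sum + r.durationMs, 0);
  return {
    ratePerSecond: total / windowSeconds,
    errorPercent: total === 0 ? 0 : (errors / total) * 100,
    meanDurationMs: total === 0 ? 0 : totalDurationMs / total,
  };
}

const sampleRequests = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 400 },
  { status: 404, durationMs: 40 },
];
console.log(redMetrics(sampleRequests, 2));
// { ratePerSecond: 2, errorPercent: 25, meanDurationMs: 160 }
```

A mean is used for Duration only to keep the sketch short; percentiles from histograms are usually more informative.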
USE Method (Resources)
- Utilization - % time resource is busy
- Saturation - Queue length/wait time
- Errors - Error count
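For a single host, two of the USE signals can be approximated with Node's built-in `os` module. This is a sketch under assumptions: the function names are hypothetical, utilization is averaged since boot rather than over a recent window, load average per core stands in for saturation, and the Errors dimension is not covered.

```javascript
const os = require('os');

// Utilization: share of CPU time not spent idle, summed over all cores.
function cpuUtilization(cpus) {
  let idle = 0;
  let total = 0;
  for (const cpu of cpus) {
    for (const [mode, ms] of Object.entries(cpu.times)) {
      total += ms;
      if (mode === 'idle') idle += ms;
    }
  }
  return total === 0 ? 0 : 1 - idle / total;
}

// Saturation proxy: 1-minute load average per core (above 1 suggests queued work).
function cpuSaturation(loadAvg1m, coreCount) {
  return coreCount === 0 ? 0 : loadAvg1m / coreCount;
}

const cores = os.cpus();
console.log({
  utilization: cpuUtilization(cores),
  saturation: cpuSaturation(os.loadavg()[0], cores.length),
});
```

On Windows `os.loadavg()` always returns zeros, so the saturation value is only meaningful on Unix-like hosts.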
Prometheus Setup
Installation (Kubernetes)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d
Key Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
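The `keep` relabel action can be modeled roughly as follows: the source label values are joined with `;` and the target survives only if the fully anchored regex matches. This is a simplified sketch, not Prometheus code; `keepTarget` and the pod objects are hypothetical, and JavaScript regex syntax differs from the RE2 syntax Prometheus uses.

```javascript
// Simplified model of a `keep` relabel rule.
function keepTarget(labels, rule) {
  // Missing labels are treated as empty strings before joining.
  const value = rule.sourceLabels.map((name) => labels[name] ?? '').join(';');
  // Relabel regexes are anchored at both ends.
  return new RegExp(`^(?:${rule.regex})$`).test(value);
}

const scrapeRule = {
  sourceLabels: ['__meta_kubernetes_pod_annotation_prometheus_io_scrape'],
  regex: 'true',
};
const pods = [
  { name: 'api', labels: { __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'true' } },
  { name: 'batch', labels: { __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'false' } },
  { name: 'legacy', labels: {} },
];
const keptPods = pods.filter((pod) => keepTarget(pod.labels, scrapeRule)).map((pod) => pod.name);
console.log(keptPods); // [ 'api' ]
```

Only pods annotated with `prometheus.io/scrape: "true"` remain as scrape targets; unannotated pods are dropped.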
Recording Rules
groups:
  - name: api_metrics
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_requests_error_rate:percentage
        expr: (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))) * 100
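To make the recording rules more concrete, here is a simplified model of what `rate()` computes over raw counter samples, including counter resets. It is a sketch only: Prometheus also extrapolates to the window boundaries, which this version skips, and the sample data and function names are made up for illustration.

```javascript
// samples: [{ t: seconds, v: counterValue }] in time order.
function simpleRate(samples) {
  if (samples.length < 2) return null;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (e.g. process restart): count from zero.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// Error percentage, mirroring the second recording rule.
function errorRatePercent(errorSamples, totalSamples) {
  const errors = simpleRate(errorSamples);
  const total = simpleRate(totalSamples);
  if (errors === null || !total) return null;
  return (errors * 100) / total;
}

const totalSeries = [
  { t: 0, v: 100 }, { t: 15, v: 190 }, { t: 30, v: 30 }, { t: 45, v: 120 }, { t: 60, v: 210 },
];
const errorSeries = [
  { t: 0, v: 0 }, { t: 15, v: 3 }, { t: 30, v: 6 }, { t: 45, v: 9 }, { t: 60, v: 15 },
];
console.log(simpleRate(totalSeries)); // 5 requests per second
console.log(errorRatePercent(errorSeries, totalSeries)); // 5 percent
```

The reset at `t = 30` (190 down to 30) contributes 30 to the increase instead of a negative delta, which is why counters remain usable across restarts.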
Grafana Dashboards
Dashboard Design Principles
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
Essential Queries
# Request rate
sum(rate(http_requests_total[5m])) by (service)
# Error rate %
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
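The P95 query relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative buckets. The sketch below follows that idea for classic histograms; it is an approximation of the behavior, not Prometheus source, and the bucket values are invented. It assumes `0 < q <= 1` and at least two buckets, the last being `+Inf`.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly within the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.5, latencyBuckets)); // 0.1
console.log(histogramQuantile(0.95, latencyBuckets)); // approximately 0.5
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.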
Distributed Tracing
OpenTelemetry Setup (Node.js)
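const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
// Batch finished spans and export them to Jaeger (exporter defaults apply).
// Note: addSpanProcessor is the SDK 1.x API; SDK 2.x takes spanProcessors in the constructor.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new JaegerExporter()));
// Register this provider as the global tracer provider.
provider.register();
// Auto-instrument incoming and outgoing HTTP calls.
registerInstrumentations({
  instrumentations: [new HttpInstrumentation()],
});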
Context Propagation
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
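The header above follows the W3C Trace Context layout `version-traceid-parentid-flags`. A minimal sketch of parsing it and building the header for a downstream call is shown here; `parseTraceparent`, `childTraceparent`, and the child span ID are hypothetical names and values for illustration, and only version `00` formatting is handled. In real services the OpenTelemetry propagators do this for you.

```javascript
// version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff is forbidden and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Outgoing request: keep the trace ID, substitute this service's own span ID.
function childTraceparent(parent, spanId) {
  const flags = parent.sampled ? '01' : '00';
  return `00-${parent.traceId}-${spanId}-${flags}`;
}

const incoming = parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
console.log(incoming.traceId); // 0af7651916cd43dd8448eb211c80319c
console.log(childTraceparent(incoming, '00f067aa0ba902b7'));
// 00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01
```

The trace ID stays constant across every hop, which is what lets the tracing backend stitch spans from different services into one request journey.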
Jaeger Deployment
# Kubernetes
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
SLIs/SLOs
Defining SLOs
slos:
  - name: api_availability
    target: 99.9% # 43.2 min downtime per 30 days (about 40.3 min over the 28d window)
    window: 28d
    sli: sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))
  - name: api_latency_p95
    target: 99% # 99% of requests complete in < 500ms
    window: 28d
    sli: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
Error Budget
Error Budget = 1 - SLO Target
Example: 99.9% SLO → 0.1% error budget → 43.2 min per 30-day month (about 40.3 min over a 28-day window)
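The arithmetic behind the example can be sketched as follows. The function names are hypothetical, and the request counts in the usage lines are invented to show how much budget remains after some failures.

```javascript
// Allowed downtime in minutes for a target such as 99.9 over a window in days.
function errorBudgetMinutes(sloTargetPercent, windowDays) {
  const budgetFraction = (100 - sloTargetPercent) / 100;
  return budgetFraction * windowDays * 24 * 60;
}

// Fraction of the request-based budget still unspent (negative once overspent).
function budgetRemaining(sloTargetPercent, goodRequests, totalRequests) {
  const allowedBad = ((100 - sloTargetPercent) / 100) * totalRequests;
  const actualBad = totalRequests - goodRequests;
  return allowedBad === 0 ? 0 : 1 - actualBad / allowedBad;
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"
console.log(errorBudgetMinutes(99.9, 28).toFixed(1)); // "40.3"
// 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
console.log(budgetRemaining(99.9, 999500, 1000000).toFixed(2)); // "0.50"
```

Note that the minute figure depends on the window length, so a 28-day SLO window yields a slightly smaller budget than a 30-day month.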
Burn Rate Alerts
rules:
  - alert: SLOErrorBudgetBurnFast
    expr: slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Fast error budget burn - consuming 2% budget in 1 hour"
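The 14.4 threshold follows from the budget arithmetic: burning 2% of a 30-day budget in one hour means consuming it 14.4 times faster than the SLO allows. The sketch below reproduces that number and the two-window condition; the function names and error ratios are assumptions for illustration, and the `slo:http_availability:burn_rate_*` series in the rule are presumed to be recording rules defined elsewhere.

```javascript
// Burn rate = observed error ratio / error ratio the SLO allows.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Burn rate at which `budgetFraction` of the budget is gone within `alertHours`.
function burnRateThreshold(budgetFraction, alertHours, sloWindowDays) {
  return (budgetFraction * sloWindowDays * 24) / alertHours;
}

// Fire only when both the long and the short window exceed the threshold.
function fastBurn(errorRatio1h, errorRatio5m, sloTarget, threshold = 14.4) {
  return (
    burnRate(errorRatio1h, sloTarget) > threshold &&
    burnRate(errorRatio5m, sloTarget) > threshold
  );
}

console.log(burnRateThreshold(0.02, 1, 30).toFixed(1)); // "14.4"
console.log(fastBurn(0.02, 0.03, 0.999)); // true: burn rates of roughly 20 and 30
console.log(fastBurn(0.02, 0.001, 0.999)); // false: the 5m window has recovered
```

The short window acts as a reset check: once errors stop, the 5m burn rate falls quickly and the alert clears even though the 1h average is still elevated.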
Alert Fatigue Prevention
Multi-Window Alerting
# Combine short + long windows to reduce false positives
- alert: HighLatency
  expr: |
    (job:http_request_duration:p95_5m > 1 and job:http_request_duration:p95_1h > 0.8)
  for: 5m
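How the two-window condition and the `for: 5m` hold interact can be sketched as a small state machine: a breach first puts the alert in a pending state, and it fires only once the breach has lasted the full hold duration. This is a simplified model with hypothetical names and invented samples, not Prometheus's rule evaluator.

```javascript
// samples: [{ t: seconds, p95Short, p95Long }] at successive rule evaluations.
function alertStates(samples, { shortLimit = 1, longLimit = 0.8, forSeconds = 300 } = {}) {
  let activeSince = null;
  return samples.map(({ t, p95Short, p95Long }) => {
    // Both windows must breach, as in the `and` expression.
    const breaching = p95Short > shortLimit && p95Long > longLimit;
    if (!breaching) {
      activeSince = null; // any recovery resets the hold timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

const evaluations = [
  { t: 0, p95Short: 2.0, p95Long: 0.6 }, // short spike only
  { t: 150, p95Short: 1.2, p95Long: 0.9 },
  { t: 300, p95Short: 1.3, p95Long: 0.9 },
  { t: 450, p95Short: 1.4, p95Long: 1.0 },
  { t: 600, p95Short: 0.7, p95Long: 1.0 }, // short window recovered
];
console.log(alertStates(evaluations));
// [ 'inactive', 'pending', 'pending', 'firing', 'inactive' ]
```

The brief spike at `t = 0` never pages anyone because the 1h window is still healthy, which is the noise reduction the multi-window pattern is meant to deliver.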
Severity Levels
| Severity | Response | Examples |
|---|---|---|
| critical | Page immediately | Service down, data loss |
| warning | Review in hours | Degraded performance |
| info | Daily review | Capacity planning |
Best Practices
- Start with RED/USE methods for consistent metrics
- Use recording rules for expensive queries
- Implement multi-window alerts to reduce noise
- Set achievable SLOs (don't aim for 100%)
- Track error budget consistently
- Correlate traces with metrics using trace IDs
- Sample traces appropriately (1-10% in production)
- Add context to spans (user_id, request_id)
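For the sampling guideline, one common approach is to derive the decision from the trace ID so that every service keeps or drops the same traces. The sketch below is a simplified illustration of that idea with a hypothetical `shouldSample` function; it is not the OpenTelemetry sampler implementation, whose details vary by SDK.

```javascript
const crypto = require('crypto');

// Deterministic ratio sampling keyed on the trace ID (32 hex characters).
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the lower 8 bytes of the ID as an unsigned 64-bit integer.
  const low64 = BigInt('0x' + traceId.slice(16));
  // Scale the ratio to the 64-bit range: floor(ratio * 2^53) * 2^11.
  const threshold = BigInt(Math.floor(ratio * 2 ** 53)) << 11n;
  return low64 < threshold;
}

// The same ID always yields the same decision.
const exampleTraceId = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample(exampleTraceId, 0.1)); // false
console.log(shouldSample(exampleTraceId, 0.6)); // true

// Rough check that about 5% of random trace IDs are kept.
let keptCount = 0;
const trials = 20000;
for (let i = 0; i < trials; i++) {
  if (shouldSample(crypto.randomBytes(16).toString('hex'), 0.05)) keptCount++;
}
console.log((keptCount / trials).toFixed(3));
```

Because the decision depends only on the trace ID, a trace is never half-recorded: either every participating service samples it or none does.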
Related Skills
devops - Infrastructure provisioning