observability-design
Observability Design
Design comprehensive observability across metrics, logs, and traces to understand system behavior and debug issues.
Context
You are building observability for a system. The user struggles to debug production issues or wants better visibility. Read their current monitoring setup.
Domain Context
Based on Google's SRE practices and observability research:
- Metrics: Quantitative measurements (latency, error rate, QPS). Fast queries, low cardinality.
- Logs: Event records with context (error messages, user actions). High volume, searchable.
- Traces: Request flow across services. Shows dependencies and latency breakdown.
- Correlation IDs: Link logs and traces across services. Essential for debugging.
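- Cardinality: The number of unique label-value combinations for a metric; each combination is stored as a separate time series. High cardinality (millions of unique values) overwhelms metric storage and slows queries.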
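To make the cardinality point concrete, here is a small worked sketch. It assumes that each distinct combination of label values becomes its own time series, as is typical in Prometheus-style systems. The label names and value counts are hypothetical.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
low = {"service": 20, "endpoint": 50, "status_class": 5}
high = {**low, "user_id": 1_000_000}

print(series_count(low))   # 5000
print(series_count(high))  # 5000000000
```

Adding one unbounded label multiplies the series count by the number of values it can take, which is why a single user ID label can dominate the total.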
Instructions
- Define Key Metrics: For each critical path, specify SLI metrics (success rate, latency, saturation) and a target for each. Example for order checkout: success rate >99.9%, p99 latency <500ms.
- Design Metrics Collection: Instrument code with metrics (request count, latency histogram, error count). Use a metrics library (Prometheus, StatsD). Keep cardinality low.
- Configure Logging: Log key events (authentication, errors, deployments). Include the correlation ID in every log entry. Aggregate logs centrally (ELK, Datadog).
- Implement Distributed Tracing: Assign every request a trace ID at the entry point and pass it to every downstream service. Record a span for each operation (service name, operation, latency, result).
- Build Dashboards & Alerts: The dashboard shows a health overview (SLI status). Alert on SLI violations. Every alert requires a runbook (the action to resolve it).
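The logging and tracing steps above both depend on an ID that is set once at the entry point and attached to everything the request produces. The following is a minimal sketch of that idea using only the Python standard library (`logging`, `contextvars`, `uuid`). The logger name `checkout`, the function names, and the JSON field names are assumptions for illustration, not a prescribed schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a central store can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": record.correlation_id,
        })


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's ID if one arrived, else mint one at the entry point."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("order accepted")
        return cid  # would be forwarded to downstream calls, e.g. in a header
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()
log = build_logger(buffer)
request_id = handle_request(log, incoming_id="abc123")
print(buffer.getvalue().strip())
```

Because the ID lives in a context variable, code deeper in the call stack can log without receiving the ID as an argument. A real tracing setup would also record spans and propagate the ID over the wire, which this sketch leaves out.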
Anti-Patterns
- High Cardinality Metrics: Emitting a metric with user ID as a label. Result: millions of unique time series that overwhelm storage. Guard: Use low-cardinality labels (service, endpoint); never user ID or request ID.
- Logging Everything: Logging every request and every database call. Result: petabytes of data in which the signal is hard to find. Guard: Log sparingly; focus on errors and critical paths.
- No Correlation IDs: Log entries from different services cannot be tied to the same request. Result: a request cannot be traced across services. Guard: Generate a correlation ID at the entry point and pass it to all downstream services.
- Alerts Without Runbooks: An alert fires but nobody knows what to do. Result: ignored alerts. Guard: Every alert must have a clear runbook explaining the cause and the fix.
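Two of the guards above can be checked mechanically before a change ships. The sketch below lints metric and alert definitions for forbidden labels and missing runbooks. The definition format, the metric and alert names, and the `runbook` field are all hypothetical; a real setup would read them from whatever configuration the monitoring system uses.

```python
# Labels that are unbounded by nature; the list is illustrative, not exhaustive.
FORBIDDEN_LABELS = {"user_id", "request_id"}


def lint_metric(name: str, labels: list[str]) -> list[str]:
    """Return a problem for each label that is known to be high cardinality."""
    return [
        f"metric {name}: label '{label}' is high cardinality"
        for label in labels
        if label in FORBIDDEN_LABELS
    ]


def lint_alert(alert: dict) -> list[str]:
    """Return a problem if the alert has no runbook reference."""
    if not alert.get("runbook"):
        return [f"alert {alert['name']}: missing runbook"]
    return []


# Hypothetical definitions, as they might be loaded from a config file.
metrics = {
    "http_requests_total": ["service", "endpoint"],
    "checkout_latency_seconds": ["service", "user_id"],
}
alerts = [
    {"name": "CheckoutErrorRateHigh", "runbook": "runbooks/checkout-errors.md"},
    {"name": "CheckoutLatencyHigh"},
]

problems = [p for name, labels in metrics.items() for p in lint_metric(name, labels)]
problems += [p for alert in alerts for p in lint_alert(alert)]
for problem in problems:
    print(problem)
```

Running a check like this in CI is one way to keep the guards from depending on reviewer memory.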
Further Reading
- Google's Site Reliability Engineering: observability practices at scale
- The Art of Monitoring by James Turnbull: comprehensive monitoring and alerting strategy
- Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda: modern observability with traces and logs