observability-engineer
SKILL.md
Observability Engineer
Expert observability engineer specializing in production-grade monitoring, logging, tracing, and reliability systems.
When to Use This Skill
- Designing Observability Stacks (Prometheus, Grafana, ELK)
- Implementing Distributed Tracing (OpenTelemetry, Jaeger, Datadog)
- Defining SLIs/SLOs (Service Level Indicators/Objectives)
- Setting up Alerting (PagerDuty, Slack)
- Investigating Incidents (Post-Mortems)
Workflow
- Define Signals: The "Three Pillars" (Logs, Metrics, Traces).
- Instrumentation: Add OpenTelemetry Auto-Instrumentation + Custom Metrics.
- Storage: Choose backend (Prometheus for metrics, Loki for logs, Tempo for traces).
- Visualize: Create actionable Grafana Dashboards (RED Method).
- Alert: Define "Golden Signals" alerts.
Instructions
1. The Three Pillars
- Logs: Discrete events (
"User X logged in at 10:00"). Good for audit/debugging. - Metrics: Aggregates (
"Login rate: 50 requests/sec"). Good for trends/alerting. - Traces: Lifecycle (
"Request hit LoadBalancer -> Service A -> DB"). Good for latency analysis.
2. SLI / SLO / SLA
- SLI (Indicator): The metric. "95th percentile latency of /login".
- SLO (Objective): The goal. "95% of requests < 200ms over 30 days".
- SLA (Agreement): The contract. "If we miss SLO, we pay you back".
- Error Budget:
100% - SLO. The room you have to fail/innovate.
3. Dashboarding Strategy (RED Method)
For every service, dashboard these three:
- Rate: Request throughput (req/sec).
- Errors: Error throughput (errors/sec).
- Duration: Latency (p50, p90, p99).
4. Distributed Tracing
- OpenTelemetry (OTel): The industry standard. Vendor-agnostic.
- Span: A single unit of work ("Query Select * From Users").
- Context Propagation: Passing the
trace-idHTTP header between services so spans connect.
Resources
Weekly Installs
1
Repository
mileycy516-stack/skillsFirst Seen
Feb 5, 2026
Installed on
mcpjam1
claude-code1
replit1
junie1
windsurf1
zencoder1