skills/mileycy516-stack/skills/observability-engineer

observability-engineer

SKILL.md

Observability Engineer

Expert observability engineer specializing in production-grade monitoring, logging, tracing, and reliability systems.

When to Use This Skill

Designing Observability Stacks (Prometheus, Grafana, ELK)
Implementing Distributed Tracing (OpenTelemetry, Jaeger, Datadog)
Defining SLIs/SLOs (Service Level Indicators/Objectives)
Setting up Alerting (PagerDuty, Slack)
Investigating Incidents (Post-Mortems)

Workflow

Define Signals: The "Three Pillars" (Logs, Metrics, Traces).
Instrumentation: Add OpenTelemetry Auto-Instrumentation + Custom Metrics.
Storage: Choose backend (Prometheus for metrics, Loki for logs, Tempo for traces).
Visualize: Create actionable Grafana Dashboards (RED Method).
Alert: Define "Golden Signals" alerts.

Instructions

1. The Three Pillars

Logs: Discrete events ("User X logged in at 10:00"). Good for audit/debugging.
Metrics: Aggregates ("Login rate: 50 requests/sec"). Good for trends/alerting.
Traces: Lifecycle ("Request hit LoadBalancer -> Service A -> DB"). Good for latency analysis.

2. SLI / SLO / SLA

SLI (Indicator): The metric. "95th percentile latency of /login".
SLO (Objective): The goal. "95% of requests < 200ms over 30 days".
SLA (Agreement): The contract. "If we miss SLO, we pay you back".
Error Budget: 100% - SLO. The room you have to fail/innovate.

3. Dashboarding Strategy (RED Method)

For every service, dashboard these three:

Rate: Request throughput (req/sec).
Errors: Error throughput (errors/sec).
Duration: Latency (p50, p90, p99).

4. Distributed Tracing

OpenTelemetry (OTel): The industry standard. Vendor-agnostic.
Span: A single unit of work ("Query Select * From Users").
Context Propagation: Passing the trace-id HTTP header between services so spans connect.

Resources

Weekly Installs

1

Repository

mileycy516-stack/skills

First Seen

Feb 5, 2026

Installed on

mcpjam1

claude-code1

replit1

junie1

windsurf1

zencoder1