Monitoring & Observability
Comprehensive monitoring, observability, and alerting strategies for production systems.
Three Pillars of Observability
Metrics
- Quantitative measurements (counters, gauges, histograms)
- Time-series data (Prometheus, InfluxDB, Datadog)
- Examples: request latency, error rate, CPU usage
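The three metric types differ mainly in how their values are allowed to change. The sketch below is a toy, standard-library illustration of those semantics only; the class names mirror common terminology but this is not the `prometheus_client` API, and the bucket bounds are arbitrary assumptions.

```python
import bisect

class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Value that can go up or down, e.g. requests currently in flight."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into upper-bound buckets, e.g. latencies."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per bound plus an overflow slot for larger values.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left places a value equal to a bound in that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

requests_total = Counter()
in_flight = Gauge()
latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds

for seconds in (0.05, 0.3, 0.1, 2.0):
    requests_total.inc()
    latency.observe(seconds)
in_flight.set(3)
```

A counter answers "how many so far", a gauge answers "how many right now", and a histogram answers "how are the values distributed".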
Logs
- Structured event records
- Searchable and filterable
- Examples: application logs, access logs, error logs
Traces
- Request flow through the system
- Distributed tracing (Jaeger, Zipkin)
- Shows dependencies and bottlenecks
Implementation Approaches
Metrics Collection
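# Assumes a Flask app; fetch_users() is the application's own data-access function.
from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)

# Count every request and record how long each one takes.
request_count = Counter('http_requests_total', 'Total requests')
latency = Histogram('http_request_duration_seconds', 'Request latency')

@app.route('/api/users')
def get_users():
    request_count.inc()
    with latency.time():
        return fetch_users()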
Structured Logging
{
  "timestamp": "2025-02-07T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "request_id": "req_12345",
  "user_id": "user_789",
  "error_code": "DB_CONNECTION_FAILED",
  "message": "Failed to connect to database",
  "duration_ms": 1500
}
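A record like the one above can be produced with Python's standard `logging` module and a custom formatter. This is a minimal sketch, not a recommended library: the `JsonFormatter` class and its `CONTEXT_FIELDS` tuple are hypothetical names chosen for illustration, and the output goes to an in-memory stream as a stand-in for stdout or a log file.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, supplied by callers via `extra=`.
    CONTEXT_FIELDS = ("service", "request_id", "user_id",
                      "error_code", "duration_ms")

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were actually provided.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("user-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in `stream` only

logger.error(
    "Failed to connect to database",
    extra={
        "service": "user-service",
        "request_id": "req_12345",
        "user_id": "user_789",
        "error_code": "DB_CONNECTION_FAILED",
        "duration_ms": 1500,
    },
)
log_line = stream.getvalue().strip()
```

Keeping one JSON object per line lets log shippers parse entries without multi-line handling.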
Distributed Tracing
- Instrument application code
- Propagate trace IDs across services
- Collect traces centrally (Jaeger, Zipkin)
- Visualize service dependencies
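Propagating trace IDs usually means forwarding a header on every outgoing call. The sketch below hand-rolls that step using the W3C Trace Context `traceparent` header layout (`version-traceid-spanid-flags`); the function names are hypothetical, and in practice a tracing SDK such as OpenTelemetry would normally handle this rather than application code.

```python
import secrets

def new_traceparent():
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled flag

def child_traceparent(incoming):
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def outgoing_headers(incoming_headers):
    """Build headers for a downstream request from the inbound ones."""
    incoming = incoming_headers.get("traceparent")
    value = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": value}

# The edge service has no inbound trace, so it starts one.
edge = outgoing_headers({})
# A downstream service continues the same trace with its own span.
downstream = outgoing_headers(edge)
```

Because every hop reuses the trace ID, the central collector can stitch the spans from all services into one request timeline.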
Popular Tools
| Category | Tools |
|---|---|
| Metrics | Prometheus, Grafana, Datadog, New Relic |
| Logging | ELK Stack, Splunk, CloudWatch, Loki |
| Tracing | Jaeger, Zipkin, Datadog APM |
| APM | New Relic, Datadog, Dynatrace |
Best Practices
- Structured Logging - JSON format with context
- Contextual Data - Request IDs, user IDs, service names
- Sampling - Record a sample of high-volume logs and traces instead of every event, to control cost
- Retention Policy - Balance cost and retention needs
- Alerts - On error rates, latency, resource usage
- Dashboards - Visualize key metrics
- Runbooks - Document how to respond to alerts
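One way to apply the sampling practice is to decide per request rather than per log line, so a sampled request keeps all of its entries. The sketch below is an assumption about how this could be done, not a standard API: `should_log` is a hypothetical helper that always keeps errors and keeps a fixed fraction of other requests based on a hash of the request ID.

```python
import hashlib

def should_log(request_id, level, sample_rate=0.1):
    """Return True if this request's log entries should be kept."""
    # Never drop errors; they are the entries most needed later.
    if level in ("ERROR", "CRITICAL"):
        return True
    # Hash the request ID to a stable number in [0, 1), so every
    # service makes the same keep/drop decision for the same request.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

kept = sum(should_log(f"req_{i}", "INFO") for i in range(10_000))
```

With the default rate, `kept` should land near 10% of the requests, while the decision for any single request ID stays the same across calls.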
Key Metrics to Monitor
- Request rate and latency (p50, p95, p99)
- Error rate and error types
- Resource usage (CPU, memory, disk)
- Database query performance
- Cache hit rates
- Queue depths
- User session counts
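To make the latency percentiles and error rate concrete, the sketch below computes them from raw samples using the nearest-rank method. The sample data and function names are made up for illustration; metrics backends typically derive percentiles from histogram buckets instead of raw values.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def error_rate(status_codes):
    """Fraction of responses that were server errors (HTTP 5xx)."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)

# Hypothetical samples from ten requests.
latencies_ms = [12, 15, 11, 240, 18, 14, 13, 900, 16, 17]
statuses = [200, 200, 200, 500, 200, 200, 200, 503, 200, 404]

summary = {
    "p50_ms": percentile(latencies_ms, 50),  # 15
    "p95_ms": percentile(latencies_ms, 95),  # 900
    "p99_ms": percentile(latencies_ms, 99),  # 900
    "error_rate": error_rate(statuses),      # 0.2
}
```

The gap between p50 and p95 here shows why averages alone are misleading: most requests are fast, but the slowest ones dominate the tail.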
References
- Prometheus Monitoring Best Practices
- Observability Engineering (O'Reilly)
- Google SRE Book
- ELK Stack Documentation
- OpenTelemetry Project