Observability & Monitoring
SKILL.md
Observability & Monitoring Skill
See what's happening in production. Debug without reproducing. Understand system behavior at scale.
The Three Pillars
| Pillar | What | When | Tools |
|---|---|---|---|
| Logs | Discrete events | Debugging, auditing | Winston, Pino, Serilog |
| Metrics | Aggregated measurements | Alerting, dashboards | Prometheus, CloudWatch |
| Traces | Request flow across services | Distributed debugging | Jaeger, Zipkin |
Modern approach: OpenTelemetry unifies all three.
Logging Best Practices
Structured Logging
// ❌ Bad: Unstructured
console.log(`User ${userId} clicked button ${buttonId}`);
// ✅ Good: Structured
logger.info('Button clicked', {
userId,
buttonId,
timestamp: Date.now(),
sessionId: ctx.sessionId
});
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Something failed, needs attention | Payment failed |
| WARN | Unexpected but handled | Retry succeeded |
| INFO | Business events | User logged in |
| DEBUG | Developer details | Cache hit/miss |
| TRACE | Verbose internals | Function entry/exit |
Correlation IDs
Track requests across services:
// Middleware to propagate trace ID
app.use((req, res, next) => {
req.traceId = req.headers['x-trace-id'] || uuid();
res.setHeader('x-trace-id', req.traceId);
next();
});
// Include in all logs
logger.info('Processing request', { traceId: req.traceId, ...data });
Metrics Patterns
The RED Method (Request-focused)
For services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
The USE Method (Resource-focused)
For infrastructure:
- Utilization: % time resource busy
- Saturation: Queue depth
- Errors: Error count
Key Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative totals | requests_total |
| Gauge | Current value | temperature, queue_size |
| Histogram | Value distribution | request_duration_seconds |
| Summary | Quantiles | response_time_p99 |
Golden Signals (SRE)
- Latency — Time to serve request
- Traffic — Demand on system
- Errors — Failed requests rate
- Saturation — How full is the system
Distributed Tracing
Span Structure
Trace: user-checkout-abc123
├── Span: api-gateway (50ms)
│ ├── Span: auth-service (10ms)
│ └── Span: order-service (35ms)
│ ├── Span: inventory-check (8ms)
│ └── Span: payment-service (20ms)
│ └── Span: database-write (5ms)
Context Propagation
// OpenTelemetry automatic propagation
import { trace, context, propagation } from '@opentelemetry/api';
// Extract context from incoming request
const ctx = propagation.extract(context.active(), req.headers);
// Create span with parent context
const span = tracer.startSpan('process-order', undefined, ctx);
// Propagate to outgoing request
propagation.inject(context.active(), headers);
OpenTelemetry Setup
Node.js Quick Start
// tracing.ts - Load FIRST
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
.NET Quick Start
// Program.cs
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());
Alerting Strategy
Alert Hierarchy
| Severity | Response | Example |
|---|---|---|
| P1/Critical | Wake someone up | Service down |
| P2/High | Fix within hours | Error rate > 5% |
| P3/Medium | Fix within days | Disk 80% |
| P4/Low | Fix when convenient | Deprecation warning |
Alert Anti-Patterns
❌ Alert fatigue — Too many non-actionable alerts ❌ Missing runbook — Alert with no remediation steps ❌ Threshold-only — Alert on static value, not trend ❌ No owner — Alert goes to void
Good Alert Template
alert: HighErrorRate
expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: high
team: backend
annotations:
summary: "Error rate above 5%"
runbook: "https://runbooks.example.com/high-error-rate"
dashboard: "https://grafana.example.com/d/errors"
Dashboard Design
Layout Principles
┌─────────────────────────────────────────────────────────┐
│ SERVICE HEALTH │
│ [Status] [Error Rate] [Latency P50] [Latency P99] │
├─────────────────────────────────────────────────────────┤
│ TRAFFIC │
│ [Requests/sec graph over time] │
├─────────────────────────────────────────────────────────┤
│ ERRORS │ LATENCY │
│ [Error breakdown by type] │ [Latency histogram] │
├─────────────────────────────────────────────────────────┤
│ RESOURCES │
│ [CPU] [Memory] [Disk] [Network] │
└─────────────────────────────────────────────────────────┘
Dashboard Hierarchy
- Overview — Executive view, all services
- Service — Single service deep dive
- Debug — Detailed metrics for investigation
Cloud Provider Tools
| Cloud | Metrics | Logs | Traces |
|---|---|---|---|
| Azure | Azure Monitor | Log Analytics | App Insights |
| AWS | CloudWatch | CloudWatch Logs | X-Ray |
| GCP | Cloud Monitoring | Cloud Logging | Cloud Trace |
Azure Application Insights
// Node.js
import { useAzureMonitor } from '@azure/monitor-opentelemetry';
useAzureMonitor({
azureMonitorExporterOptions: {
connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
}
});
VS Code Extension Observability
For VS Code extensions like Alex:
What to Monitor
| Metric | Why |
|---|---|
| Command execution time | User experience |
| Activation time | Startup performance |
| Error rates by command | Reliability |
| Memory usage | Resource efficiency |
| API call latency | External dependencies |
Telemetry Implementation
import * as vscode from 'vscode';
const telemetry = vscode.env.createTelemetryLogger({
sendEventData(eventName, data) {
// Send to your telemetry backend
},
sendErrorData(error, data) {
// Send errors with context
}
});
// Usage
telemetry.logUsage('command.executed', {
commandId: 'alex.meditate',
durationMs: 1500
});
Debugging Patterns
Log-Driven Debugging
- Find error in logs
- Get correlation ID
- Search all logs with that ID
- Reconstruct timeline
Trace-Driven Debugging
- Find slow/failed trace
- Examine span waterfall
- Identify bottleneck span
- Drill into that service
Metric-Driven Debugging
- Notice anomaly in dashboard
- Correlate with other metrics
- Narrow time window
- Switch to logs/traces for details
Implementation Checklist
New Service
- Structured logging configured
- Correlation ID propagation
- Basic metrics (RED/USE)
- Health check endpoint
- OpenTelemetry instrumentation
- Dashboard created
- Alerts defined with runbooks
Production Readiness
- Error rates < 0.1% baseline
- P99 latency acceptable
- Logs searchable and retained
- Traces sampling configured
- On-call runbooks written
Related Skills
- performance-profiling — Deep dive into specific bottlenecks
- incident-response — Using observability during outages
- infrastructure-as-code — Deploying monitoring stack
- security-review — Audit logging requirements
"If you can't measure it, you can't improve it." — Peter Drucker
Good observability means finding the problem before your users do.