Observability & Monitoring Skill

See what's happening in production. Debug without reproducing. Understand system behavior at scale.

The Three Pillars

Pillar	What	When	Tools
Logs	Discrete events	Debugging, auditing	Winston, Pino, Serilog
Metrics	Aggregated measurements	Alerting, dashboards	Prometheus, CloudWatch
Traces	Request flow across services	Distributed debugging	Jaeger, Zipkin

Modern approach: OpenTelemetry provides one vendor-neutral API and SDK for all three signals.

Logging Best Practices

Structured Logging

// ❌ Bad: Unstructured
console.log(`User ${userId} clicked button ${buttonId}`);

// ✅ Good: Structured
logger.info('Button clicked', {
  userId,
  buttonId,
  timestamp: Date.now(),
  sessionId: ctx.sessionId
});

Log Levels

Level	Usage	Example
ERROR	Something failed, needs attention	Payment failed
WARN	Unexpected but handled	Retry succeeded
INFO	Business events	User logged in
DEBUG	Developer details	Cache hit/miss
TRACE	Verbose internals	Function entry/exit
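
To make the level table concrete, here is a minimal sketch of a structured logger that drops entries below a configured threshold. `createLogger` and its `write` callback are hypothetical names for illustration, not part of Winston, Pino, or any other library; a real logger would also handle serialization errors and output streams.

```typescript
// Severity order, lowest to highest, matching the table above.
const LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR'] as const;
type Level = (typeof LEVELS)[number];
type Fields = Record<string, unknown>;

// Hypothetical helper: `write` receives one JSON string per log entry.
function createLogger(minLevel: Level, write: (line: string) => void) {
  const threshold = LEVELS.indexOf(minLevel);
  return (level: Level, message: string, fields: Fields = {}): void => {
    if (LEVELS.indexOf(level) < threshold) return; // below threshold: drop
    // Core keys come last so caller-supplied fields cannot overwrite them.
    write(JSON.stringify({ ...fields, level, message, timestamp: Date.now() }));
  };
}

// Usage, assuming a threshold of INFO: the DEBUG entry is dropped.
const lines: string[] = [];
const log = createLogger('INFO', (line) => lines.push(line));
log('DEBUG', 'Cache miss', { key: 'user:42' });
log('ERROR', 'Payment failed', { orderId: 'o-1001' });
```

Emitting one JSON object per line keeps entries machine-parseable, so a log backend can filter on `level` or `orderId` without regular expressions.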

Correlation IDs

Track requests across services:

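// Middleware to propagate the trace ID
import { randomUUID } from 'node:crypto';

app.use((req, res, next) => {
  // Reuse the caller's ID if present; otherwise start a new one
  req.traceId = req.headers['x-trace-id'] || randomUUID();
  res.setHeader('x-trace-id', req.traceId);
  next();
});

// Include in all logs (traceId goes last so data cannot overwrite it)
logger.info('Processing request', { ...data, traceId: req.traceId });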
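
Passing the trace ID by hand on every call is easy to forget. One possible approach, sketched below, is to bind the ID once per request and log through the bound function. `bindContext` and `baseLog` are hypothetical names; Pino offers a comparable idea through `logger.child(bindings)`.

```typescript
type Fields = Record<string, unknown>;
type LogFn = (message: string, fields?: Fields) => void;

// Hypothetical helper: returns a logger that merges `context` into every
// entry. Context goes last so per-call fields cannot overwrite the trace ID.
function bindContext(log: LogFn, context: Fields): LogFn {
  return (message, fields = {}) => log(message, { ...fields, ...context });
}

// Stand-in sink that records entries in memory instead of writing to stdout.
const entries: Fields[] = [];
const baseLog: LogFn = (message, fields = {}) => {
  entries.push({ message, ...fields });
};

// Usage: bind once per request, then log without repeating the ID.
const requestLog = bindContext(baseLog, { traceId: 'abc123' });
requestLog('Processing request', { path: '/checkout' });
requestLog('Request complete', { status: 200 });
```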

Metrics Patterns

The RED Method (Request-focused)

For services:

Rate: Requests per second
Errors: Failed requests per second
Duration: Request latency distribution
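
As an illustration of the three RED numbers, the sketch below computes them from raw request samples over a fixed window. In practice a metrics backend derives these from counters and histograms; `RequestSample` and `summarizeRed` are hypothetical names, and the nearest-rank percentile is one of several common definitions.

```typescript
interface RequestSample {
  durationMs: number;
  failed: boolean;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function summarizeRed(samples: RequestSample[], windowSec: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const failures = samples.filter((s) => s.failed).length;
  return {
    ratePerSec: samples.length / windowSec, // Rate
    errorsPerSec: failures / windowSec,     // Errors
    p50Ms: percentile(durations, 50),       // Duration: median
    p99Ms: percentile(durations, 99),       // Duration: tail
  };
}

// Ten requests in a 5-second window, latencies 10..100 ms, two failures.
const samples: RequestSample[] = Array.from({ length: 10 }, (_, i) => ({
  durationMs: (i + 1) * 10,
  failed: i < 2,
}));
const red = summarizeRed(samples, 5);
// red: { ratePerSec: 2, errorsPerSec: 0.4, p50Ms: 50, p99Ms: 100 }
```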

The USE Method (Resource-focused)

For infrastructure:

Utilization: Percentage of time the resource is busy
Saturation: Amount of queued work the resource cannot yet service (e.g., queue depth)
Errors: Error count
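
A rough sketch of the same three numbers for a single resource, assuming periodic samples that record busy time, queue depth, and error events. The `ResourceSample` shape and `summarizeUse` are hypothetical; real collectors expose these values in their own formats.

```typescript
interface ResourceSample {
  busyMs: number;     // time the resource spent doing work in this interval
  intervalMs: number; // length of the sampling interval
  queueDepth: number; // work items waiting at sample time
  errors: number;     // error events seen in this interval
}

// Returns NaN for utilization and saturation when `samples` is empty.
function summarizeUse(samples: ResourceSample[]) {
  const total = (pick: (s: ResourceSample) => number): number =>
    samples.reduce((sum, s) => sum + pick(s), 0);
  return {
    utilization: total((s) => s.busyMs) / total((s) => s.intervalMs),
    saturation: total((s) => s.queueDepth) / samples.length, // mean queue depth
    errors: total((s) => s.errors),
  };
}

const use = summarizeUse([
  { busyMs: 800, intervalMs: 1000, queueDepth: 2, errors: 0 },
  { busyMs: 600, intervalMs: 1000, queueDepth: 4, errors: 1 },
]);
// use: { utilization: 0.7, saturation: 3, errors: 1 }
```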

Key Metric Types

Type	Use Case	Example
Counter	Cumulative totals	requests_total
Gauge	Current value	temperature, queue_size
Histogram	Value distribution	request_duration_seconds
Summary	Quantiles	response_time_p99
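
The behavioral differences between the first three types can be sketched in a few lines. These classes are illustrative stand-ins, not the API of any metrics client; the histogram uses cumulative buckets in the style Prometheus exposes them.

```typescript
// Counter: only goes up; rates are derived from it at query time.
class Counter {
  private total = 0;
  inc(by = 1): void {
    if (by < 0) throw new Error('counters cannot decrease');
    this.total += by;
  }
  get value(): number { return this.total; }
}

// Gauge: a current value that can move in either direction.
class Gauge {
  private current = 0;
  set(v: number): void { this.current = v; }
  get value(): number { return this.current; }
}

// Histogram: each bucket counts observations <= its upper bound, so counts
// are cumulative. `count` plays the role of the +Inf bucket.
class Histogram {
  readonly upperBounds: number[];
  readonly counts: number[];
  count = 0;
  sum = 0;
  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds;
    this.counts = upperBounds.map(() => 0);
  }
  observe(v: number): void {
    this.count += 1;
    this.sum += v;
    this.upperBounds.forEach((bound, i) => {
      if (v <= bound) this.counts[i] += 1;
    });
  }
}

const requestsTotal = new Counter();
const queueSize = new Gauge();
const requestDuration = new Histogram([0.1, 0.5, 1]); // seconds
[0.05, 0.3, 0.7, 2].forEach((seconds) => {
  requestsTotal.inc();
  requestDuration.observe(seconds);
});
queueSize.set(3);
// requestDuration.counts is [1, 2, 3]; requestDuration.count is 4
```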

Golden Signals (SRE)

Latency: Time to serve a request
Traffic: Demand on the system
Errors: Rate of failed requests
Saturation: How full the system is

Distributed Tracing

Span Structure

Trace: user-checkout-abc123
├── Span: api-gateway (50ms)
│   ├── Span: auth-service (10ms)
│   └── Span: order-service (35ms)
│       ├── Span: inventory-check (8ms)
│       └── Span: payment-service (20ms)
│           └── Span: database-write (5ms)
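
Tracing backends usually receive spans as flat records, each pointing at its parent by span ID, and rebuild the tree for display. The sketch below shows that reconstruction for the checkout trace above. The `SpanRecord` shape is simplified and the span IDs are made up for illustration.

```typescript
interface SpanRecord {
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
  durationMs: number;
}

// Render the flat span list as an indented tree, children under parents.
function renderTrace(spans: SpanRecord[]): string[] {
  const children = new Map<string | undefined, SpanRecord[]>();
  for (const span of spans) {
    const siblings = children.get(span.parentSpanId) ?? [];
    siblings.push(span);
    children.set(span.parentSpanId, siblings);
  }
  const lines: string[] = [];
  const visit = (span: SpanRecord, depth: number): void => {
    lines.push(`${'  '.repeat(depth)}${span.name} (${span.durationMs}ms)`);
    for (const child of children.get(span.spanId) ?? []) visit(child, depth + 1);
  };
  for (const root of children.get(undefined) ?? []) visit(root, 0);
  return lines;
}

const checkout: SpanRecord[] = [
  { spanId: 'a', name: 'api-gateway', durationMs: 50 },
  { spanId: 'b', parentSpanId: 'a', name: 'auth-service', durationMs: 10 },
  { spanId: 'c', parentSpanId: 'a', name: 'order-service', durationMs: 35 },
  { spanId: 'd', parentSpanId: 'c', name: 'inventory-check', durationMs: 8 },
  { spanId: 'e', parentSpanId: 'c', name: 'payment-service', durationMs: 20 },
  { spanId: 'f', parentSpanId: 'e', name: 'database-write', durationMs: 5 },
];
const tree = renderTrace(checkout);
```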

Context Propagation

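// OpenTelemetry manual context propagation
import { trace, context, propagation } from '@opentelemetry/api';

// Tracer name is illustrative
const tracer = trace.getTracer('order-service');

// Extract the caller's context from the incoming request headers
const parentCtx = propagation.extract(context.active(), req.headers);

// Create a span as a child of the extracted context
const span = tracer.startSpan('process-order', undefined, parentCtx);

// Inject the new span's context into the outgoing request headers,
// so the downstream service becomes a child of this span
const headers = {};
propagation.inject(trace.setSpan(parentCtx, span), headers);

// End the span once the outgoing call completes
span.end();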
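
With the W3C Trace Context propagator, which the OpenTelemetry SDKs typically register by default, the injected header is `traceparent`, formatted as `version-traceid-parentid-flags`. The sketch below parses and formats version `00` of that header to show what actually crosses the wire. The function names are hypothetical, and the example value is the one used in the W3C specification.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a version-00 traceparent header; returns null if it is malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Only the sampled flag is preserved; other flag bits are written as zero.
function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedHeader = parseTraceParent(incoming);
```

A service that continues the trace keeps `traceId`, replaces `parentId` with its own span ID, and forwards the result downstream.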

OpenTelemetry Setup

Node.js Quick Start

// tracing.ts - load before any other module so auto-instrumentation can patch them
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

.NET Quick Start

// Program.cs
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter());

Alerting Strategy

Alert Hierarchy

Severity	Response	Example
P1/Critical	Wake someone up	Service down
P2/High	Fix within hours	Error rate > 5%
P3/Medium	Fix within days	Disk 80%
P4/Low	Fix when convenient	Deprecation warning

Alert Anti-Patterns

❌ Alert fatigue: Too many non-actionable alerts
❌ Missing runbook: Alert with no remediation steps
❌ Threshold-only: Alert on static value, not trend
❌ No owner: Alert goes to void

Good Alert Template

alert: HighErrorRate
expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
  severity: high
  team: backend
annotations:
  summary: "Error rate above 5%"
  runbook: "https://runbooks.example.com/high-error-rate"
  dashboard: "https://grafana.example.com/d/errors"
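
The `for: 5m` clause means the expression must stay true for five minutes before the alert fires, which filters out brief spikes. The sketch below imitates that pending-then-firing behavior. `AlertRule` is a hypothetical class for illustration, not how Prometheus is implemented.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Hypothetical evaluator: the condition must hold continuously for `forMs`.
class AlertRule {
  private readonly threshold: number;
  private readonly forMs: number;
  private breachedSince: number | null = null;

  constructor(threshold: number, forMs: number) {
    this.threshold = threshold;
    this.forMs = forMs;
  }

  // Evaluate one sample of the error ratio taken at time `nowMs`.
  evaluate(errorRatio: number, nowMs: number): AlertState {
    if (errorRatio <= this.threshold) {
      this.breachedSince = null; // any recovery resets the timer
      return 'inactive';
    }
    if (this.breachedSince === null) this.breachedSince = nowMs;
    return nowMs - this.breachedSince >= this.forMs ? 'firing' : 'pending';
  }
}

const MINUTE = 60 * 1000;
const rule = new AlertRule(0.05, 5 * MINUTE);
const states: AlertState[] = [
  rule.evaluate(0.08, 0 * MINUTE), // breach starts: pending
  rule.evaluate(0.02, 2 * MINUTE), // recovered: inactive, timer reset
  rule.evaluate(0.09, 3 * MINUTE), // new breach: pending
  rule.evaluate(0.07, 8 * MINUTE), // breached for 5 minutes: firing
];
```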

Dashboard Design

Layout Principles

┌─────────────────────────────────────────────────────────┐
│                   SERVICE HEALTH                         │
│  [Status] [Error Rate] [Latency P50] [Latency P99]      │
├─────────────────────────────────────────────────────────┤
│                   TRAFFIC                                │
│  [Requests/sec graph over time]                         │
├─────────────────────────────────────────────────────────┤
│           ERRORS             │        LATENCY           │
│  [Error breakdown by type]   │  [Latency histogram]     │
├─────────────────────────────────────────────────────────┤
│                   RESOURCES                              │
│  [CPU] [Memory] [Disk] [Network]                        │
└─────────────────────────────────────────────────────────┘

Dashboard Hierarchy

Overview — Executive view, all services
Service — Single service deep dive
Debug — Detailed metrics for investigation

Cloud Provider Tools

Cloud	Metrics	Logs	Traces
Azure	Azure Monitor	Log Analytics	App Insights
AWS	CloudWatch	CloudWatch Logs	X-Ray
GCP	Cloud Monitoring	Cloud Logging	Cloud Trace

Azure Application Insights

// Node.js
import { useAzureMonitor } from '@azure/monitor-opentelemetry';

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
  }
});

VS Code Extension Observability

For VS Code extensions like Alex:

What to Monitor

Metric	Why
Command execution time	User experience
Activation time	Startup performance
Error rates by command	Reliability
Memory usage	Resource efficiency
API call latency	External dependencies

Telemetry Implementation

import * as vscode from 'vscode';

const telemetry = vscode.env.createTelemetryLogger({
  sendEventData(eventName, data) {
    // Send to your telemetry backend
  },
  sendErrorData(error, data) {
    // Send errors with context
  }
});

// Usage
telemetry.logUsage('command.executed', {
  commandId: 'alex.meditate',
  durationMs: 1500
});

Debugging Patterns

Log-Driven Debugging

Find error in logs
Get correlation ID
Search all logs with that ID
Reconstruct timeline
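
The four steps above can be sketched against an in-memory list of log entries. A real log backend would run the equivalent as a query; the `LogEntry` shape and both function names are assumptions made for this example.

```typescript
interface LogEntry {
  timestamp: number; // epoch milliseconds
  service: string;
  level: string;
  message: string;
  traceId: string;
}

// Steps 1-2: find the first error and take its correlation ID.
function firstErrorTraceId(entries: LogEntry[]): string | undefined {
  return entries.find((e) => e.level === 'ERROR')?.traceId;
}

// Steps 3-4: collect every entry with that ID and order them by time.
function timelineFor(entries: LogEntry[], traceId: string): string[] {
  return entries
    .filter((e) => e.traceId === traceId) // filter copies, so input is untouched
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((e) => `${e.timestamp} [${e.service}] ${e.level} ${e.message}`);
}

const logs: LogEntry[] = [
  { timestamp: 1030, service: 'payment', level: 'ERROR', message: 'Card declined', traceId: 't-1' },
  { timestamp: 1000, service: 'gateway', level: 'INFO', message: 'POST /checkout', traceId: 't-1' },
  { timestamp: 1010, service: 'gateway', level: 'INFO', message: 'GET /health', traceId: 't-2' },
  { timestamp: 1020, service: 'order', level: 'INFO', message: 'Order created', traceId: 't-1' },
];
const failingTrace = firstErrorTraceId(logs);
const timeline = failingTrace ? timelineFor(logs, failingTrace) : [];
```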

Trace-Driven Debugging

Find slow/failed trace
Examine span waterfall
Identify bottleneck span
Drill into that service
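
One way to identify the bottleneck span is to rank spans by self time: a span's duration minus the time spent in its direct children. The sketch below applies that to the checkout trace from the Span Structure section. It assumes children run sequentially, and `Span`, `selfTimeMs`, and `findBottleneck` are hypothetical names.

```typescript
interface Span {
  name: string;
  durationMs: number;
  children: Span[];
}

// Self time assumes sequential children; concurrent children would need
// interval arithmetic instead of a simple sum.
function selfTimeMs(span: Span): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(span.durationMs - childTotal, 0);
}

// Walk the tree and return the span with the largest self time.
function findBottleneck(root: Span): Span {
  let worst = root;
  const visit = (span: Span): void => {
    if (selfTimeMs(span) > selfTimeMs(worst)) worst = span;
    span.children.forEach(visit);
  };
  visit(root);
  return worst;
}

const makeSpan = (name: string, durationMs: number, children: Span[] = []): Span =>
  ({ name, durationMs, children });

const checkoutTrace = makeSpan('api-gateway', 50, [
  makeSpan('auth-service', 10),
  makeSpan('order-service', 35, [
    makeSpan('inventory-check', 8),
    makeSpan('payment-service', 20, [makeSpan('database-write', 5)]),
  ]),
]);
const bottleneck = findBottleneck(checkoutTrace);
// bottleneck.name is 'payment-service' (20ms total, 15ms self time)
```

Ranking by total duration would always pick the root span, which is why self time is the more useful signal here.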

Metric-Driven Debugging

Notice anomaly in dashboard
Correlate with other metrics
Narrow time window
Switch to logs/traces for details

Implementation Checklist

New Service

Production Readiness

Related Skills

performance-profiling — Deep dive into specific bottlenecks
incident-response — Using observability during outages
infrastructure-as-code — Deploying monitoring stack
security-review — Audit logging requirements

"If you can't measure it, you can't improve it." (commonly attributed to Peter Drucker)

Good observability means finding the problem before your users do.