Observability Designer

The agent designs production-ready observability strategies that combine the three pillars (metrics, logs, traces) with SLI/SLO frameworks, golden signals monitoring, and alert optimization.

Workflow

Catalogue services -- List every service in scope with its type (request-driven, pipeline, storage), criticality tier (T1-T3), and owning team. Validate that at least one T1 service exists before proceeding.
Define SLIs per service -- For each service, select SLIs from the Golden Signals table. Map each SLI to a concrete Prometheus/InfluxDB metric expression.
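Set SLO targets -- Assign SLO targets based on criticality tier and user expectations. Calculate the corresponding error budget (e.g., a 99.9% availability target allows about 43.8 minutes of downtime in an average 30.44-day month, or 43.2 minutes over a 30-day window).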
Design burn-rate alerts -- Create multi-window burn-rate alert rules for each SLO. Validate that every alert has a clear runbook link and response action.
Build dashboards -- Generate dashboard specs following the hierarchy: Overview > Service > Component > Instance. Cap each screen at 7 panels. Include SLO target reference lines.
Configure log aggregation -- Define structured log format, set log levels, assign correlation IDs, and configure retention policies per tier.
Instrument traces -- Set up distributed tracing with sampling strategy (head-based for dev, tail-based for production). Define span boundaries at service and database call points.
Validate coverage -- Confirm every T1 service has metrics, logs, and traces. Confirm every alert has a runbook. Confirm dashboard load time is under 2 seconds.
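The catalogue and coverage checks in the workflow can be sketched as a small validator. This is a minimal illustration, not the agent's actual implementation: the `Service` record, the `validate_catalogue` helper, and the example service names are all hypothetical.

```python
from dataclasses import dataclass, field

# The three pillars every T1 service is expected to emit.
PILLARS = frozenset({"metrics", "logs", "traces"})


@dataclass
class Service:
    """Hypothetical catalogue entry; field names are assumptions."""
    name: str
    kind: str    # "request-driven", "pipeline", or "storage"
    tier: int    # criticality tier: 1 (highest) to 3
    owner: str   # owning team
    pillars: set = field(default_factory=set)


def validate_catalogue(services):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []
    # The workflow requires at least one T1 service before proceeding.
    if not any(s.tier == 1 for s in services):
        problems.append("no T1 service in scope")
    # Every T1 service must have metrics, logs, and traces.
    for s in services:
        if s.tier == 1:
            missing = PILLARS - s.pillars
            if missing:
                problems.append(f"{s.name}: missing {', '.join(sorted(missing))}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every gap in the catalogue.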
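The error-budget and burn-rate steps reduce to a few lines of arithmetic. The sketch below assumes an availability-style SLO and a 30-day window by default. The window pairs and thresholds (14.4, 6, 1) are commonly cited values for a 30-day SLO and are an assumption here, not something this document specifies. All function and class names are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Minutes of full unavailability an availability SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than the sustainable rate the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


@dataclass(frozen=True)
class BurnRateWindow:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Assumed thresholds for a 30-day SLO; tune them for your own window and paging policy.
WINDOWS = (
    BurnRateWindow("1h", "5m", 14.4, "page"),
    BurnRateWindow("6h", "30m", 6.0, "page"),
    BurnRateWindow("3d", "6h", 1.0, "ticket"),
)


def firing_alerts(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateWindow]:
    """Return the windows where BOTH the long and short burn rates exceed the threshold."""
    fired = []
    for w in WINDOWS:
        long_rate = burn_rate(error_ratios[w.long_window], slo_target)
        short_rate = burn_rate(error_ratios[w.short_window], slo_target)
        # Requiring both windows avoids paging on a spike that has already ended.
        if long_rate > w.threshold and short_rate > w.threshold:
            fired.append(w)
    return fired
```

For example, `error_budget_minutes(0.999)` gives roughly 43.2 minutes for a 30-day window, and `error_budget_minutes(0.999, 30.4375)` gives roughly 43.8 minutes for an average month. In production the same comparison would normally live in alert rules evaluated by the metrics backend. The Python form is only meant to make the logic explicit.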
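For the log aggregation step, one possible structured format is a JSON line per event carrying a correlation ID. The sketch below uses only the Python standard library. The field names (`timestamp`, `level`, `logger`, `message`, `correlation_id`) and the helpers are assumptions for illustration, not a format prescribed by this document.

```python
import json
import logging
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone

# Holds the correlation ID for the current request or task; "-" when none is set.
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def new_correlation_id() -> str:
    """Generate a correlation ID and bind it to the current context."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def build_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr at the given level."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A request handler would call `new_correlation_id()` on entry (or bind an ID received from an upstream header) so that every line logged during that request can be joined with its trace. Retention per tier is configured in the log backend rather than in application code.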

SLI/SLO Quick Reference