Monitoring & Observability Expert

⚠️ MANDATORY COMPLIANCE ⚠️

CRITICAL: The 5-step workflow outlined in this document MUST be followed in exact order for EVERY monitoring and observability task. Skipping steps or deviating from the procedure will result in incomplete observability coverage, missed alerts, or blind spots in production systems. This is non-negotiable.

File Structure

SKILL.md (this file): Main instructions and MANDATORY workflow
examples.md: Usage scenarios with different monitoring setups and observability patterns
Memory: Project-specific memory accessed via memoryStore.getSkillMemory("monitoring-expert", "{project-name}"). See MemoryStore Interface.

Interface References

ContextProvider: contextProvider.getIndex("devops"), contextProvider.getIndex("infrastructure") — for platform and tooling context. See ContextProvider Interface.
MemoryStore: memoryStore.getSkillMemory("monitoring-expert", "{project-name}") — for project-specific monitoring decisions. See MemoryStore Interface.
Schemas: Validate agent configs with agent_config.schema.json.

Focus Areas

A monitoring and observability design is evaluated across seven dimensions:

Metrics Design: Define golden signals (latency, traffic, errors, saturation) for every service. Design custom application metrics, establish SLIs (Service Level Indicators) and SLOs (Service Level Objectives), and choose appropriate metric types (counters, gauges, histograms, summaries).
Alerting Strategy: Prevent alert fatigue through intelligent threshold tuning and alert grouping. Define severity levels (critical, warning, info), establish escalation policies with clear ownership, and link every alert to an actionable runbook.
Distributed Tracing: Implement trace context propagation across service boundaries. Instrument spans at key operation points, configure sampling strategies (head-based, tail-based, adaptive) to balance cost and visibility, and correlate traces with logs and metrics.
Log Aggregation: Design structured logging standards with consistent field schemas. Define appropriate log levels (ERROR, WARN, INFO, DEBUG), implement correlation IDs for request tracking across services, and configure log search, analysis, and retention policies.
Dashboard Design: Build operational dashboards for real-time system health monitoring. Create executive views for high-level SLO tracking, design drill-down patterns from overview to service-specific detail, and ensure dashboards answer on-call questions within seconds.
Health Checks: Implement liveness probes (is the process running?), readiness probes (can it serve traffic?), and startup probes for slow-initializing services. Monitor dependency health, expose circuit breaker status, and design graceful degradation indicators.
Incident Response: Define on-call workflows with clear rotation schedules and handoff procedures. Establish incident severity levels (SEV1–SEV4) with response time expectations, create post-mortem templates, and build a culture of blameless retrospectives.
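
As an illustration of the Log Aggregation focus area, the sketch below emits structured JSON log lines that carry a correlation ID. It uses only the Python standard library. The field names (timestamp, level, logger, message, correlation_id) and the helper names JsonFormatter and build_logger are assumptions made for this example, not a schema defined by this skill.

```python
import contextvars
import json
import logging

# Holds the ID of the request currently being handled (hypothetical default "-").
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed field set."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Read at format time, so every line in a request shares the ID.
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would typically call correlation_id.set(...) with the ID taken from an incoming header before logging, so that downstream log search can group every line belonging to one request.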

Note: This skill designs and recommends monitoring solutions. It generates configuration, instrumentation code, and runbooks but does not deploy infrastructure unless explicitly requested.

MANDATORY WORKFLOW (MUST FOLLOW EXACTLY)

⚠️ STEP 1: Assess Monitoring Needs (REQUIRED)

YOU MUST:

Identify all services, components, and dependencies in the target system
Determine existing SLOs, SLAs, or uptime requirements
Map critical user-facing paths and backend processing pipelines
Identify current monitoring gaps, blind spots, or pain points
Understand deployment environment (Kubernetes, VMs, serverless, hybrid)

DO NOT PROCEED WITHOUT UNDERSTANDING THE SYSTEM AND ITS REQUIREMENTS

⚠️ STEP 2: Design Observability Stack (REQUIRED)

YOU MUST:

Select tools: Choose appropriate monitoring, tracing, logging, and dashboarding tools based on requirements, scale, and existing infrastructure
Define metrics: Design golden signals for each service plus custom business metrics
Plan instrumentation: Determine what to instrument (HTTP handlers, database calls, queue consumers, external API calls) and how (SDK, auto-instrumentation, sidecar)
Design alerting rules: Create alert definitions with severity, threshold, duration, and runbook links
Plan dashboards: Sketch dashboard hierarchy (overview → service → component)

DO NOT PROCEED WITHOUT A COMPLETE OBSERVABILITY DESIGN
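
The alert definitions described in this step can be captured in a tool-neutral form before they are translated into a specific backend. The following is a minimal sketch, assuming a hypothetical AlertRule record and validate_rule helper. Requiring a runbook only for critical and warning alerts mirrors the Step 5 alert quality check, and the sample URL is a placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass

# Severity levels named in the Alerting Strategy focus area.
SEVERITIES = ("critical", "warning", "info")


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, backend-agnostic alert definition."""
    name: str
    severity: str
    threshold: float
    duration_seconds: int   # how long the condition must hold before firing
    runbook_url: str = ""


def validate_rule(rule: AlertRule) -> list[str]:
    """Return a list of problems; an empty list means the rule is well-formed."""
    problems = []
    if rule.severity not in SEVERITIES:
        problems.append(f"{rule.name}: unknown severity '{rule.severity}'")
    if rule.duration_seconds <= 0:
        problems.append(f"{rule.name}: duration must be positive")
    if rule.severity in ("critical", "warning") and not rule.runbook_url:
        problems.append(f"{rule.name}: missing runbook link")
    return problems


if __name__ == "__main__":
    rule = AlertRule(
        name="payments_high_error_rate",
        severity="critical",
        threshold=0.01,
        duration_seconds=300,
        runbook_url="https://example.com/runbooks/payments-errors",  # placeholder
    )
    print(validate_rule(rule))
```

Keeping the definition separate from the backend syntax makes it easier to review severity, duration, and runbook coverage before writing Prometheus rules, Datadog monitors, or CloudWatch alarms.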

⚠️ STEP 3: Load Project Memory (REQUIRED)

YOU MUST:

Load project memory: memoryStore.getSkillMemory("monitoring-expert", "{project-name}")
If memory exists, review:
- Previous monitoring stack decisions and rationale
- Established alerting rules and thresholds
- SLO definitions and error budget status
If no memory exists, note this is a first-time setup for the project
Load relevant context: contextProvider.getIndex("devops") for platform knowledge and, where the task touches infrastructure, contextProvider.getIndex("infrastructure")

See MemoryStore Interface for method details.

DO NOT PROCEED WITHOUT CHECKING PROJECT MEMORY
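
The branch between "memory exists" and "first-time setup" can be sketched as follows. The real memoryStore API is defined in the MemoryStore Interface and is not reproduced here. InMemoryStore, get_skill_memory, and the dictionary keys below are hypothetical stand-ins used only to show the control flow.

```python
from typing import Optional


class InMemoryStore:
    """Hypothetical stand-in for memoryStore, backed by a plain dict."""

    def __init__(self, data: Optional[dict] = None):
        self._data = dict(data or {})

    def get_skill_memory(self, skill: str, project: str) -> Optional[dict]:
        # Returns None when nothing has been stored for this skill and project.
        return self._data.get((skill, project))


def load_project_memory(store: InMemoryStore, project: str) -> dict:
    """Load prior monitoring decisions, or flag a first-time setup."""
    memory = store.get_skill_memory("monitoring-expert", project)
    if memory is None:
        return {"first_time_setup": True, "stack_decisions": [],
                "alert_rules": [], "slos": []}
    return {
        "first_time_setup": False,
        # The three items this step asks to review (key names are assumptions).
        "stack_decisions": memory.get("stack_decisions", []),
        "alert_rules": memory.get("alert_rules", []),
        "slos": memory.get("slos", []),
    }
```

Returning the same shape in both branches lets later steps read prior decisions without checking whether memory existed.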

⚠️ STEP 4: Implement Monitoring (REQUIRED)

YOU MUST:

Instrument code: Add metrics emission, trace spans, and structured log statements to application code
Configure alerting: Write alert rules (Prometheus alerting rules, Datadog monitors, CloudWatch alarms, etc.)
Build dashboards: Create dashboard definitions (Grafana JSON, Datadog dashboard API, etc.)
Define health checks: Implement liveness, readiness, and dependency health endpoints
Create runbooks: Write an actionable runbook for every critical and warning alert
Set up SLOs: Define SLO configurations with error budget tracking

DO NOT USE ARBITRARY THRESHOLDS — BASE ALL VALUES ON REQUIREMENTS AND HISTORICAL DATA
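
One way to ground a threshold in an SLO instead of picking a number is to compute the error budget for a measurement window and compare observed failures against it. The sketch below assumes a request-based availability SLO. The names ErrorBudget and error_budget are illustrative and the input figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    sli: float                 # observed success ratio in the window (0.0 to 1.0)
    allowed_failures: float    # failures the SLO tolerates for this request volume
    remaining_fraction: float  # share of the budget still unspent; negative if overspent
    burn_rate: float           # observed error rate divided by the error rate the SLO allows


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute SLI, error budget, and burn rate for one measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = failed_requests / total_requests
    allowed_failures = allowed_error_rate * total_requests
    return ErrorBudget(
        sli=1.0 - observed_error_rate,
        allowed_failures=allowed_failures,
        remaining_fraction=1.0 - failed_requests / allowed_failures,
        burn_rate=observed_error_rate / allowed_error_rate,
    )


if __name__ == "__main__":
    # Invented example: a 99.9% target over one million requests with 250 failures.
    print(error_budget(0.999, 1_000_000, 250))
```

A burn rate above 1.0 means errors are arriving faster than the SLO allows for that window, which is a more defensible basis for an alert threshold than a fixed error count.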

⚠️ STEP 5: Review & Output (REQUIRED)

YOU MUST validate the monitoring setup against these criteria:

Coverage check:
- All services have golden signal metrics
- Critical paths have distributed traces
- Structured logging with correlation IDs is in place
- Health check endpoints exist for all services
Alert quality check:
- Every alert has a defined severity level
- Every critical/warning alert links to a runbook
- No duplicate or overlapping alerts
- Alert thresholds are based on SLOs, not arbitrary values
Dashboard check:
- Overview dashboard answers "is the system healthy?" at a glance
- Service dashboards enable drill-down for troubleshooting
- Dashboards load within 5 seconds
Output all artifacts to /claudedocs/ following OUTPUT_CONVENTIONS.md
Update project memory: Use memoryStore.update("monitoring-expert", "{project-name}", ...) to store decisions, tool choices, and SLO definitions

See MemoryStore Interface for update() and append() method details.

DO NOT SKIP VALIDATION
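
The coverage check in this step lends itself to a small automated report. The following sketch assumes each service is described by a hypothetical ServiceCoverage record; how those records are populated (manual inventory, config scan, or otherwise) is outside this example.

```python
from __future__ import annotations

from dataclasses import dataclass

# The four golden signals named in the Metrics Design focus area.
GOLDEN_SIGNALS = frozenset({"latency", "traffic", "errors", "saturation"})


@dataclass(frozen=True)
class ServiceCoverage:
    """Hypothetical summary of what is instrumented for one service."""
    name: str
    golden_signals: frozenset
    on_critical_path: bool
    traced: bool
    structured_logs: bool
    health_endpoint: bool


def coverage_gaps(services: list[ServiceCoverage]) -> dict[str, list[str]]:
    """Map each service name to its missing coverage items; complete services are omitted."""
    gaps = {}
    for svc in services:
        missing = []
        absent = GOLDEN_SIGNALS - set(svc.golden_signals)
        if absent:
            missing.append("golden signals: " + ", ".join(sorted(absent)))
        # Tracing is only required on critical paths, per the coverage check.
        if svc.on_critical_path and not svc.traced:
            missing.append("distributed tracing")
        if not svc.structured_logs:
            missing.append("structured logging with correlation IDs")
        if not svc.health_endpoint:
            missing.append("health check endpoint")
        if missing:
            gaps[svc.name] = missing
    return gaps
```

An empty result means every listed service passes the coverage check; any non-empty entry names the items to fix before output.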

Compliance Checklist

Before completing ANY monitoring or observability task, verify:

Step 1: System services, dependencies, and SLO requirements identified
Step 2: Observability stack designed with tools, metrics, alerts, and dashboards
Step 3: Project memory loaded (or noted as first-time setup)
Step 4: Monitoring implemented — code instrumented, alerts configured, dashboards built
Step 5: Coverage validated, artifacts output to /claudedocs/, memory updated

FAILURE TO COMPLETE ALL STEPS INVALIDATES THE MONITORING DESIGN

Output File Naming Convention

Format: monitoring_{scope}_{date}.md

Where:

{scope} = target service or system name (e.g., payments, api_gateway, full_stack)
{date} = ISO 8601 calendar date in YYYY-MM-DD form (e.g., 2026-02-12)

Examples:

monitoring_payments_2026-02-12.md
monitoring_api_gateway_2026-02-12.md
monitoring_full_stack_2026-02-12.md
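
The naming convention above can be applied with a small helper. This is a sketch: the function name output_filename is hypothetical, and restricting {scope} to lowercase letters, digits, and underscores is an assumption inferred from the examples, not a stated rule.

```python
import re
from datetime import date


def output_filename(scope: str, day: date) -> str:
    """Build monitoring_{scope}_{date}.md with an ISO 8601 date."""
    # Assumed scope format, based on the examples (payments, api_gateway, full_stack).
    if not re.fullmatch(r"[a-z0-9_]+", scope):
        raise ValueError(f"unexpected scope format: {scope!r}")
    return f"monitoring_{scope}_{day.isoformat()}.md"


if __name__ == "__main__":
    print(output_filename("api_gateway", date(2026, 2, 12)))
```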

Observability Stack Reference

Metrics & Monitoring

Tool	Type	Best For
Prometheus	Open source	Kubernetes-native metrics, pull-based collection
Grafana	Open source	Dashboarding, visualization, multi-source queries
Datadog	SaaS	Full-stack observability, APM, infrastructure monitoring
CloudWatch	AWS	AWS-native services, Lambda, ECS, RDS
New Relic	SaaS	APM, browser monitoring, AI-powered insights

Distributed Tracing

Tool	Type	Best For
OpenTelemetry	Open standard	Vendor-neutral instrumentation, auto-instrumentation
Jaeger	Open source	Distributed tracing backend, trace visualization
Zipkin	Open source	Lightweight tracing, B3 propagation
AWS X-Ray	AWS	AWS-native distributed tracing

Log Aggregation

Tool	Type	Best For
ELK Stack	Open source	Full-text search, log analytics, Kibana dashboards
Loki	Open source	Grafana-native, label-based log aggregation
Fluentd/Fluent Bit	Open source	Log collection, routing, transformation
Splunk	Enterprise	Enterprise log management, SIEM integration

Alerting & Incident Management

Tool	Type	Best For
PagerDuty	SaaS	On-call management, escalation policies, incident response
OpsGenie	SaaS	Alert routing, on-call schedules, integrations
Alertmanager	Open source	Prometheus alert routing, grouping, silencing
Grafana Alerting	Open source	Unified alerting across data sources

Version History

v1.0.0 (2026-02-12): Initial release
- Mandatory 5-step workflow for monitoring and observability design
- Golden signals and SLI/SLO framework
- Distributed tracing with OpenTelemetry support
- Alerting strategy with fatigue prevention
- Dashboard design patterns
- Health check and liveness/readiness probe guidance
- Incident response workflow integration
- Project memory integration for monitoring decisions persistence
