monitoring-observability

Installation
SKILL.md

Monitoring and Observability

If the system can fail in a way users notice, you should be able to see it before they tell you.

Context

Monitoring and observability translate system behavior into signals responders can trust. The goal is not to collect every metric. The goal is to make risky boundaries, user impact, and recovery state visible fast enough to guide action.

In a lifecycle-aware system, observability should preserve release intent. For brownfield work, make sure dashboards and alerts distinguish safe supported behavior from unsupported or coexistence-sensitive paths.

Inputs

  • architecture-doc -- produced by the preceding skill in the lifecycle
  • ci-cd-pipeline -- produced by the preceding skill in the lifecycle
  • api-contract -- produced by the preceding skill in the lifecycle

Process

Step 1: Map User-Critical and Boundary-Critical Signals

Start from the reviewed architecture, API contract, and delivery slice:

  • which flows must succeed
  • which flows must fail closed
  • which queues, bridges, or background workers can amplify failures
  • which rollback or coexistence indicators matter after release

Do not start from "what metrics are easy to collect."

Step 2: Define a Small Set of Actionable Signals

At minimum, cover:

  • availability and latency for user-facing routes
  • error-rate segmentation for supported vs unsupported paths where that distinction matters
  • queue depth, retry rate, or backfill lag for async seams
  • release or deploy markers so responders can correlate changes quickly
  • rollback or fallback health signals if the current slice depends on them

Step 3: Build Dashboards That Support Triage

Dashboards should answer:

  • what is broken
  • who is affected
  • whether the system is degrading or recovering
  • whether rollback, fail-closed behavior, or fallback mode is working

Prefer a small number of responder dashboards over a wall of charts.

Step 4: Design Alerts for Actionability

Each alert should have:

  • a clear trigger
  • user or business impact context
  • owner or escalation target
  • immediate next step or linked runbook

Alerts that cannot change behavior are noise.

Step 5: Validate with Release and Incident Scenarios

Before relying on the setup:

  • verify alerts fire for the risky boundary you care about
  • verify dashboards show release markers and recovery clearly
  • verify the signal is strong enough to support incident-response and rollback decisions

Outputs

  • monitoring-config -- produced by this skill
  • alert-rules -- produced by this skill
  • service-dashboard -- produced by this skill

Quality Gate

  • User-facing and boundary-critical flows are explicitly instrumented
  • Alerts are actionable and mapped to an owner or runbook
  • Release markers and rollback or fallback health are visible
  • Async seams or coexistence boundaries are observable where applicable
  • Responder dashboard supports triage without requiring ad hoc queries first

Anti-Patterns

  1. Metric soup -- Thousands of charts, no operational clarity.
  2. CPU-first monitoring -- Infra metrics without route or contract visibility.
  3. No release markers -- Incidents take longer because nobody can correlate behavior to deploys.
  4. Alerts without action -- Noise trains responders to ignore the system.
  5. No boundary-specific telemetry -- Supported and unsupported behavior are merged until the incident is already expensive.

Related Skills

  • ci-cd -- provides release and rollback context
  • incident-response -- consumes observability signals during incidents
  • runbooks -- attaches response steps to alerts
  • capacity-planning (planned) -- uses operational signals for projection

Distribution

  • Public install surface: skills/.curated
  • Canonical authoring source: skills/07-operations/monitoring-observability/SKILL.md
  • This package is exported for npx skills add/update compatibility.
  • Packaging stability: beta
  • Capability readiness: beta
Related skills

More from yknothing/prodcraft

Installs
6
First Seen
Mar 27, 2026