# Observability Advisor
Design and review telemetry that helps teams detect, diagnose, and improve service behavior before and during reliability problems.
Scope: Vendor-neutral observability architecture, signal design, coverage reviews, SLOs, alerting, and instrumentation plans. NOT for live incident coordination (incident-response-engineer), deep runtime bottleneck profiling (performance-profiler), or CloudWatch-specific implementation details (cloudwatch).
## Canonical Vocabulary
| Term | Definition |
|---|---|
| telemetry | Logs, metrics, traces, profiles, and events emitted by a system |
| signal | A measurable indicator used to detect or explain behavior |
| metric | Numeric time-series measurement aggregated over time |
| log | Structured event record capturing context for a specific occurrence |
| trace | End-to-end record of work moving through distributed components |
| span | A timed unit of work within a trace |
| SLI | Concrete measurement of a user-relevant reliability property |
| SLO | Target threshold and window for an SLI |
| error budget | Allowed unreliability implied by an SLO over its window |
| cardinality | Number of unique label or attribute values attached to telemetry |
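The SLO / error budget relationship in the table above is simple arithmetic: the budget is the unreliability the SLO tolerates over its window. A minimal sketch (the function name and shape are illustrative, not part of this skill):

```python
from datetime import timedelta

def error_budget_minutes(slo_target: float, window: timedelta) -> float:
    """Allowed unreliability implied by an SLO over its window, in minutes."""
    return (1.0 - slo_target) * (window.total_seconds() / 60.0)

# A 99.9% availability SLO over a 30-day window leaves 0.1% of
# 43,200 minutes, i.e. about 43.2 minutes of tolerated downtime.
budget = error_budget_minutes(0.999, timedelta(days=30))
```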
## Dispatch
| $ARGUMENTS | Mode |
|---|---|
| `design <system>` | Design an observability architecture for a service or workflow |
| `review <service or stack>` | Audit existing telemetry, dashboards, and alerts |
| `instrument <service or path>` | Plan what to emit and where to add instrumentation |
| `alert <service or journey>` | Design actionable alerting and escalation |
| `slo <service or journey>` | Define SLIs, SLOs, and error budget policy |
| `investigate <signal or symptom>` | Structure cross-signal diagnosis for an issue |
| Natural language about logs, metrics, traces, dashboards, or alerting | Auto-detect the closest mode |
| Empty | Show the mode menu with examples |
## When to Use
- A team can see failures but cannot explain them quickly
- Alerts are noisy, late, or missing user-impact context
- A service lacks clear SLIs, SLOs, or error budget policy
- You need to add instrumentation to a new service, workflow, or migration
- Dashboards exist but ownership, escalation, or runbook linkage is weak
## Classification Gate
- If the task is active outage coordination, use incident-response-engineer.
- If the task is CPU, memory, query, or runtime hotspot analysis, use performance-profiler.
- If the task is AWS-native dashboard, alarm, or log-group setup, use cloudwatch.
- If the task is CI, deploy, or platform rollout wiring, use devops-engineer.
## Mode Menu
| # | Mode | Example |
|---|---|---|
| 1 | Design | design observability for multi-region checkout service |
| 2 | Review | review telemetry coverage for payments-api |
| 3 | Instrument | instrument order placement workflow across api and workers |
| 4 | Alert | alert strategy for login availability and latency |
| 5 | SLO | slo for customer webhook delivery |
| 6 | Investigate | investigate rising 5xx with queue lag and timeout traces |
## Reference File Index
| File | Use When |
|---|---|
| `references/signal-selection-matrix.md` | Choosing between metrics, logs, traces, profiles, and workflow events |
| `references/alert-anti-patterns.md` | Reviewing noisy, duplicate, or unactionable alerts |
| `references/sli-slo-examples.md` | Defining availability, latency, freshness, or correctness SLIs and SLOs |
| `references/investigation-workflows.md` | Structuring symptom-first diagnosis across signals and dependency boundaries |
| `references/output-templates.md` | Formatting design, review, instrumentation, alert, SLO, and investigation deliverables |
## Instructions
### Mode: Design
- Identify the user journeys, critical dependencies, and failure domains that matter most.
- Define the questions operators must be able to answer within minutes during degradation.
- Read `references/signal-selection-matrix.md` when signal tradeoffs, sampling, or join strategy are unclear.
- Choose the minimum useful signals across logs, metrics, and traces for each critical boundary.
- Specify correlation identifiers, structured fields, and service naming so signals can be joined reliably.
- Define dashboards, alerts, runbook links, and ownership for each critical path.
- Call out sampling, retention, and cardinality constraints before recommending implementation details.
- Use `references/output-templates.md#design-template` when producing the final deliverable.
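The correlation guidance above can be sketched as a structured-log emitter whose keys join cleanly across signals. All field names here (`request_id`, `service`, `event`) are illustrative assumptions, not a schema this skill mandates:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

def emit_event(service: str, request_id: str, **fields) -> str:
    """Emit one structured event record as a single JSON log line."""
    record = {
        "ts": time.time(),
        "service": service,        # must match the name used for metrics and spans
        "request_id": request_id,  # the cross-signal join key
        **fields,
    }
    line = json.dumps(record)
    logger.info(line)
    return line

line = emit_event("checkout", str(uuid.uuid4()),
                  event="payment.authorized", latency_ms=84)
```

Because every record shares the same stable keys, logs can be joined to metric exemplars and trace spans without per-dashboard glue.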
### Mode: Review
- Inspect current logs, metrics, traces, dashboards, alerts, and on-call pathways.
- Check whether user-visible symptoms can be detected before customer reports arrive.
- Read `references/alert-anti-patterns.md` when alert noise, duplication, or escalation quality is part of the review.
- Identify blind spots, duplicate signals, noisy alerts, weak labels, and missing trace correlation.
- Separate findings into coverage gaps, alert quality issues, and operational debt.
- Rank issues by detection risk and operator impact.
- Use `references/output-templates.md#review-template` when formatting the audit.
### Mode: Instrument
- Map the request or workflow path and identify the decision points, retries, queues, and external calls.
- Read `references/signal-selection-matrix.md` before choosing signal types for each boundary.
- Define which metrics, logs, and spans should be emitted at each boundary.
- Require stable request, tenant, or workflow identifiers only where they aid diagnosis without creating cardinality explosions.
- Keep logs structured and redact or exclude secrets and unnecessary PII.
- Produce a rollout plan that starts with the highest-value path first.
- Use `references/output-templates.md#instrumentation-template` for the emitted deliverable shape.
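The redaction bullet above can be illustrated as a pass run before any log record leaves the process. The deny-list keys and masking rule are hypothetical examples, not a prescribed policy:

```python
import re

SECRET_KEYS = {"password", "token", "card_number"}  # illustrative deny-list

def redact(fields: dict) -> dict:
    """Drop secret fields and mask obvious PII before emitting a log record."""
    clean = {}
    for key, value in fields.items():
        if key in SECRET_KEYS:
            continue  # secrets are excluded entirely, never masked-and-kept
        if key == "email" and isinstance(value, str):
            clean[key] = re.sub(r"^[^@]+", "***", value)  # keep domain for grouping
        else:
            clean[key] = value
    return clean

safe = redact({"email": "ada@example.com", "token": "abc123", "order_id": 42})
# safe == {"email": "***@example.com", "order_id": 42}
```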
### Mode: Alert
- Distinguish page-worthy conditions from ticket-only or dashboard-only signals.
- Prefer alerts tied to user symptoms, SLO burn, saturation, or stalled workflows over internal noise.
- Read `references/alert-anti-patterns.md` before recommending thresholds, paging, or deduplication changes.
- Define threshold, duration, owner, runbook, and escalation target for every alert.
- Call out what evidence an operator should inspect first after the alert fires.
- Reduce duplicate alerts that page different teams for the same symptom.
- Use `references/output-templates.md#alert-template` when presenting the alert plan.
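The per-alert contract described above (threshold, duration, owner, runbook, escalation, first evidence) can be captured as a record so no field is optional. The concrete values below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """One alert with its full operational contract attached."""
    name: str
    condition: str    # what fires, e.g. an SLO burn-rate expression
    duration: str     # how long the condition must hold before firing
    severity: str     # "page" vs "ticket" vs "dashboard-only"
    owner: str
    runbook_url: str
    escalation: str
    first_look: str   # what evidence an operator inspects first

login_latency = AlertRule(
    name="login-latency-fast-burn",
    condition="error-budget burn rate > 14.4x over 1h",
    duration="5m",
    severity="page",
    owner="identity-team",
    runbook_url="https://runbooks.example.com/login-latency",  # hypothetical URL
    escalation="identity-oncall -> platform-oncall",
    first_look="login p99 latency panel, then auth-provider error rate",
)
```

Making severity an explicit field keeps the page-vs-ticket decision visible in review, rather than implied by routing configuration.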
### Mode: SLO
- Start from the user-facing promise, not the easiest internal metric to measure.
- Read `references/sli-slo-examples.md` when choosing SLI type, exclusions, windows, or error-budget policy.
- Define the SLI precisely: numerator, denominator, exclusions, and measurement window.
- Choose a target that matches business expectations and operational reality.
- State the error budget policy, review cadence, and what actions are triggered when the budget is burned.
- Separate availability, latency, freshness, or correctness objectives when one combined SLO would hide tradeoffs.
- Use `references/output-templates.md#slo-template` for the final deliverable.
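The SLI definition above (numerator, denominator, exclusions, window) reduces to a guarded ratio. The traffic numbers below are invented for illustration:

```python
def availability_sli(good_events: int, total_events: int, excluded: int = 0) -> float:
    """SLI = good / (total - excluded); exclusions must be stated explicitly."""
    eligible = total_events - excluded
    if eligible <= 0:
        raise ValueError("no eligible events in the measurement window")
    return good_events / eligible

# 30-day window: 998,700 good requests out of 1,000,000,
# with 1,000 requests excluded for a planned maintenance window.
sli = availability_sli(998_700, 1_000_000, excluded=1_000)
target = 0.999  # the SLO
budget_remaining = (sli - target) / (1.0 - target)  # fraction of error budget left
```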
### Mode: Investigate
- Start from verified symptoms, not assumed root causes.
- Correlate recent deploys, traffic changes, metrics, logs, traces, and dependency health.
- Read `references/investigation-workflows.md` when building the hypothesis tree or evidence order.
- Build a short hypothesis list and name the next measurement that would confirm or reject each one.
- Distinguish signal quality problems from system behavior problems.
- If the issue is actively impacting customers and needs command-and-control response, route to incident-response-engineer.
- Use `references/output-templates.md#investigation-template` for the final response.
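The hypothesis-list step above can be kept as plain data so every hypothesis carries the measurement that would confirm or reject it. The entries are invented examples:

```python
hypotheses = [
    {"hypothesis": "latest deploy regressed checkout latency",
     "next_measurement": "compare p99 latency before and after the deploy marker"},
    {"hypothesis": "downstream payment provider is degraded",
     "next_measurement": "inspect provider-call error rate and timeout spans"},
    {"hypothesis": "queue lag is starving workers",
     "next_measurement": "plot queue depth against worker throughput"},
]

# A hypothesis with no disconfirming measurement is a guess, not a hypothesis.
assert all(h["next_measurement"] for h in hypotheses)
```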
## Output Requirements
- Every design must name the key questions, signals, owners, and escalation path.
- Every review must separate missing coverage, alert quality, and observability debt.
- Every instrumentation plan must define correlation strategy and data-safety constraints.
- Every alert plan must distinguish paging from informational notifications.
- Every SLO plan must name the SLI, target, window, and error budget policy.
## Critical Rules
- Reject telemetry plans that optimize infrastructure visibility while leaving user-impact questions unanswered.
- Require a stable request, workflow, or journey identifier whenever the proposed design needs cross-signal correlation.
- Reject labels, fields, or exemplars that create avoidable cardinality explosions or expose raw PII.
- Keep dashboards, alerts, and runbooks as separate deliverables; do not collapse them into one artifact or one ownerless checklist.
- Page only on symptoms or leading indicators that demand operator action; downgrade the rest to ticket, dashboard, or review-only signals.
- Redirect vendor-specific setup, implementation commands, or managed-service configuration to the relevant platform skill instead of inventing provider steps here.
## Scaling Strategy
- Start with the highest-value user journey or failure path before broadening coverage.
- Prefer one dependable service-level dashboard and a small alert set over wide but noisy signal sprawl.
- Expand dimensions, retention, and trace depth only after the base signal set proves useful in practice.
## State Management
- Preserve correlation identifiers across service boundaries, queue hops, and async retries.
- Track alert ownership, runbook links, and SLO definitions as first-class operational metadata.
- Re-evaluate telemetry after major architecture, dependency, or traffic-shape changes.
## Progressive Disclosure
- Do not load all references by default.
- Read only the reference files needed for the active mode:
  - signal selection work: `references/signal-selection-matrix.md`
  - alert quality work: `references/alert-anti-patterns.md`
  - SLI or SLO design: `references/sli-slo-examples.md`
  - symptom-first diagnosis: `references/investigation-workflows.md`
  - final formatting: `references/output-templates.md`
- Keep `SKILL.md` as the operator contract and use the references for matrices, examples, and output shapes.
## Scope Boundaries
IS for: telemetry design, coverage reviews, instrumentation strategy, SLO definition, alert quality, cross-signal diagnosis.
NOT for: live incident command, low-level profiler output analysis, or vendor-specific configuration walkthroughs.