Observability Platform

Use When

  • Use when building production observability for a SaaS platform — SigNoz self-hosted stack, OpenTelemetry instrumentation for Node.js/PHP/Android/iOS, Prometheus + Grafana RED dashboards, alerting rules, distributed tracing with Jaeger, Sentry for errors, SLO/SLI design, error budget burn-rate alerts, runbooks, and blameless postmortems.
  • The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

  • The task is unrelated to observability-platform or would be better handled by a more specific companion skill.
  • The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

  • Gather relevant project context, constraints, and the concrete problem to solve; load references only as needed.
  • Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

  • Read this SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.
  • Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
  • Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

  • Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
  • Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
  • Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

  • Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
  • Loading every reference file by default instead of using progressive disclosure.

Outputs

  • A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
  • Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
  • References used, companion skills, or follow-up actions when they materially improve execution.

References

  • Use the references/ directory for deep detail after reading the core workflow below.

Three Pillars of Observability

Logs answer what happened at time T. Metrics answer how often and how much over a window. Traces answer why a request was slow or failing across services. Correlation is the whole point — every log line, metric exemplar, and span must share a trace_id so a Grafana alert links to SigNoz traces that link to logs.

Rule of thumb: emit a metric for what you must watch 24/7, a log for what you investigate after an alert, a span for what crosses a service boundary.
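A minimal TypeScript sketch of all three signals on one operation (prom-client and the pino logger from the next section assumed; checkout_failures_total and submitToPaymentService are hypothetical names):

import { trace, SpanStatusCode } from "@opentelemetry/api";
import { Counter } from "prom-client";
import { logger } from "./logger"; // the pino instance defined below

// Hypothetical payment call — stands in for any cross-service boundary.
declare function submitToPaymentService(cartId: string): Promise<{ id: string }>;

const checkoutFailures = new Counter({
  name: "checkout_failures_total",
  help: "Failed checkout attempts",
  labelNames: ["reason"],
});

const tracer = trace.getTracer("checkout");

export async function checkout(cartId: string) {
  // Span: the operation crosses a service boundary.
  return tracer.startActiveSpan("checkout.submit", async (span) => {
    try {
      const order = await submitToPaymentService(cartId);
      // Log: the business event you investigate after an alert.
      logger.info({ cart_id: cartId, order_id: order.id }, "order.created");
      return order;
    } catch (e) {
      // Metric: the failure rate you watch 24/7.
      checkoutFailures.inc({ reason: "payment" });
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      span.end();
    }
  });
}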

Structured JSON Logging

Schema — every line must include timestamp (ISO-8601 UTC), level, service, trace_id, span_id, user_id, tenant_id, msg, plus event fields. Levels:

  • FATAL (process exiting; page immediately)
  • ERROR (request failed unrecovered; ticket, or page by volume)
  • WARN (recovered degradation; trend only)
  • INFO (business events like order.created; default prod level)
  • DEBUG (off in prod, on per-request via header)
  • TRACE (local or short-lived debugging only)

Node.js (pino):

import pino from "pino";
import { trace, context } from "@opentelemetry/api";
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    level: (label) => ({ level: label.toUpperCase() }),
    log: (obj) => {
      const ctx = trace.getSpan(context.active())?.spanContext();
      return ctx ? { ...obj, trace_id: ctx.traceId, span_id: ctx.spanId } : obj;
    },
  },
  base: { service: process.env.SERVICE_NAME ?? "api" },
  timestamp: pino.stdTimeFunctions.isoTime,
});
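
The DEBUG-per-request toggle from the level list can be a pino child logger keyed off a request header. A sketch assuming Express and a hypothetical x-debug-log header:

import express from "express";

const app = express();

// Honour a per-request debug header without touching the global level.
// pino child loggers accept a level override (pino >= 7).
app.use((req, _res, next) => {
  const log = req.header("x-debug-log") === "1"
    ? logger.child({ debug_request: true }, { level: "debug" })
    : logger;
  (req as any).log = log; // or augment Express.Request in a .d.ts
  next();
});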

PHP (monolog):

$log = new Monolog\Logger('billing-api');
$h = new Monolog\Handler\StreamHandler('php://stdout', Monolog\Logger::INFO);
$h->setFormatter(new Monolog\Formatter\JsonFormatter());
$log->pushHandler($h);
$log->pushProcessor(function (array $r): array {
    $ctx = \OpenTelemetry\API\Trace\Span::getCurrent()->getContext();
    $r['extra']['trace_id'] = $ctx->getTraceId();
    $r['extra']['span_id']  = $ctx->getSpanId();
    return $r;
});
$log->info('invoice.paid', ['tenant_id' => 'acme', 'invoice_id' => 'inv_42']);

SigNoz Setup

OpenTelemetry-native APM with ClickHouse storage — a self-hosted alternative to Datadog you actually own.

version: "3.9"
services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.1.2
    volumes: ["./ch-data:/var/lib/clickhouse"]
    ulimits: { nofile: { soft: 262144, hard: 262144 } }
  query-service:
    image: signoz/query-service:0.47.0
    environment: { ClickHouseUrl: tcp://clickhouse:9000, STORAGE: clickhouse }
    depends_on: [clickhouse]
  frontend:
    image: signoz/frontend:0.47.0
    environment: { FRONTEND_API_ENDPOINT: http://query-service:8080 }
    ports: ["3301:3301"]
  otel-collector:
    image: signoz/signoz-otel-collector:0.102.1
    volumes: ["./otel-collector.yaml:/etc/otel-collector-config.yaml:ro"]
    ports: ["4317:4317", "4318:4318"]

Retention — set ClickHouse TTLs in Settings → Retention: traces 15d, logs 30d, metrics 90d. Disk is the dominant cost; shorter TTL beats sampling. First dashboard: Services → pick billing-api → Save Panel → pin to Service Health. SigNoz auto-derives RED from spans — do this before writing custom PromQL.

OpenTelemetry Node.js

Install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http. Load otel.js before any application code (node --require ./otel.js, or node --import ./otel.js for ESM):

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";
import { trace, SpanStatusCode } from "@opentelemetry/api";

new NodeSDK({
  resource: new Resource({
    "service.name": "billing-api",
    "deployment.environment": process.env.NODE_ENV ?? "dev",
    "service.version": process.env.GIT_SHA ?? "0.0.0",
  }),
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4318/v1/traces" }),
  sampler: new TraceIdRatioBasedSampler(0.1),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();

const tracer = trace.getTracer("billing");
export async function chargeInvoice(id: string) {
  return tracer.startActiveSpan("billing.chargeInvoice", async (span) => {
    span.setAttribute("invoice.id", id);
    try { return await stripe.charges.create({ /* ... */ }); }
    catch (e) { span.recordException(e as Error); span.setStatus({ code: SpanStatusCode.ERROR }); throw e; }
    finally { span.end(); }
  });
}

OpenTelemetry PHP

composer require open-telemetry/sdk open-telemetry/exporter-otlp open-telemetry/opentelemetry-auto-slim. Register the bootstrap below in composer.json autoload.files so it runs before application code:

use OpenTelemetry\SDK\Sdk;
use OpenTelemetry\SDK\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\Contrib\Otlp\{SpanExporter, OtlpHttpTransportFactory};

$exporter = new SpanExporter((new OtlpHttpTransportFactory())
    ->create('http://otel-collector:4318/v1/traces', 'application/x-protobuf'));
$provider = TracerProvider::builder()->addSpanProcessor(new BatchSpanProcessor($exporter))->build();
Sdk::builder()->setTracerProvider($provider)->setAutoShutdown(true)->buildAndRegisterGlobal();

Slim route spans are auto-captured. Use semantic attribute names: http.method, http.route, http.status_code, db.system, db.statement — Grafana and SigNoz panels key off the spec; do not invent variants.

OpenTelemetry Android

// build.gradle.kts
dependencies {
    implementation(platform("io.opentelemetry:opentelemetry-bom:1.38.0"))
    implementation("io.opentelemetry.android:android-agent:0.8.0-alpha")
}
// MyApp.kt
class MyApp : Application() {
    lateinit var rum: OpenTelemetryRum
    override fun onCreate() {
        super.onCreate()
        rum = OpenTelemetryRum.builder(this, OtelRumConfig())
            .setEndpoint("https://otel.acme.io")
            .addInstrumentation(AnrInstrumentation())
            .addInstrumentation(CrashReporterInstrumentation())
            .build()
    }
}
// Compose screen tracking via CompositionLocal
val LocalTracer = staticCompositionLocalOf<Tracer> { error("no tracer") }
@Composable fun ScreenSpan(name: String, content: @Composable () -> Unit) {
    val tracer = LocalTracer.current
    DisposableEffect(name) {
        val span = tracer.spanBuilder("screen.$name").startSpan()
        onDispose { span.end() }
    }
    content()
}

ANR detector correlates frozen frames with active trace_id — a jank spike in Grafana resolves to the Compose frame that blocked the main thread.

OpenTelemetry iOS

SPM: add https://github.com/open-telemetry/opentelemetry-swift at 1.10.1.

import OpenTelemetryApi
import OpenTelemetrySdk
import OpenTelemetryProtocolExporterHttp
import URLSessionInstrumentation

func bootstrapOtel() {
    let resource = Resource(attributes: ["service.name": .string("ios-app"), "deployment.environment": .string("prod")])
    let exporter = OtlpHttpTraceExporter(endpoint: URL(string: "https://otel.acme.io/v1/traces")!)
    let provider = TracerProviderBuilder()
        .add(spanProcessor: BatchSpanProcessor(spanExporter: exporter))
        .with(resource: resource).build()
    OpenTelemetry.registerTracerProvider(tracerProvider: provider)
    URLSessionInstrumentation(configuration: URLSessionInstrumentationConfiguration())
}
struct TracedScreen<Content: View>: View {
    let name: String
    @Environment(\.scenePhase) private var phase
    @ViewBuilder let content: () -> Content
    @State private var span: Span?
    var body: some View {
        content()
            .onAppear { span = OpenTelemetry.instance.tracerProvider.get(instrumentationName: "ui").spanBuilder(spanName: "screen.\(name)").startSpan() }
            .onDisappear { span?.end() }
            .onChange(of: phase) { if $0 == .background { span?.end(); span = nil } }
    }
}

Prometheus Metrics

Four types — Counter (monotonic increments, use rate()), Gauge (up and down), Histogram (bucketed, aggregate with histogram_quantile()), Summary (client-side quantiles — avoid, they do not merge across instances).

RED naming plus latency buckets covering 5 ms to 10 s:

http_requests_total{service,method,route,status}
http_request_duration_seconds_bucket{service,route,le}
http_requests_in_flight{service}
LatencyBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

Never use high-cardinality labels (user_id, request_id). 10 k tenants × 50 routes = 500 k series = Prometheus OOM.
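
A prom-client sketch wiring these names and buckets, assuming Express; the route label must be the matched route pattern, never the raw URL, or cardinality explodes:

import express from "express";
import { Counter, Histogram, Gauge, register } from "prom-client";

const SERVICE = process.env.SERVICE_NAME ?? "api";

const requests = new Counter({
  name: "http_requests_total",
  help: "HTTP requests by route and status",
  labelNames: ["service", "method", "route", "status"],
});
const duration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration",
  labelNames: ["service", "route"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
const inFlight = new Gauge({
  name: "http_requests_in_flight",
  help: "In-flight HTTP requests",
  labelNames: ["service"],
});

const app = express();

app.use((req, res, next) => {
  inFlight.inc({ service: SERVICE });
  const stop = duration.startTimer({ service: SERVICE });
  res.on("finish", () => {
    // req.route is the matched pattern (e.g. /invoices/:id), not the raw URL.
    const route = req.route?.path ?? "unmatched";
    requests.inc({ service: SERVICE, method: req.method, route, status: String(res.statusCode) });
    stop({ route });
    inFlight.dec({ service: SERVICE });
  });
  next();
});

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});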

Grafana Dashboards

Commit dashboard JSON to ops/grafana/dashboards/service-health.json:

{
  "title": "Service Health — RED",
  "templating": { "list": [
    { "name": "service", "type": "query", "query": "label_values(http_requests_total, service)", "includeAll": true, "multi": true },
    { "name": "env", "type": "custom", "query": "prod,staging,dev" }
  ]},
  "panels": [
    { "title": "Request Rate", "targets": [{ "expr": "sum(rate(http_requests_total{service=~\"$service\"}[5m])) by (service)" }] },
    { "title": "Error Ratio", "targets": [{ "expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=~\"$service\"}[5m]))" }] },
    { "title": "Latency P95", "targets": [{ "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le, service))" }] }
  ]
}

Rule: one RED dashboard per service family, not per instance. Use $service so the dashboard scales.
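
A file provider makes Grafana load the committed JSON at startup; a provisioning sketch (folder name and mount path are assumptions):

# /etc/grafana/provisioning/dashboards/ops.yaml
apiVersion: 1
providers:
  - name: ops-dashboards
    folder: Ops
    type: file
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards  # mount ops/grafana/dashboards here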

Alerting Rules

groups:
- name: api-red
  interval: 30s
  rules:
  - alert: ApiHighErrorRate
    expr: |
      sum(rate(http_requests_total{service="billing-api",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="billing-api"}[5m])) > 0.02
    for: 5m
    labels: { severity: critical, team: platform }
    annotations:
      summary: "billing-api 5xx above 2% for 5m"
      description: "Error ratio {{ $value | humanizePercentage }} — check recent deploy."
      runbook_url: "https://runbooks.acme.io/billing-api/high-error-rate"
  - alert: ApiLatencyP95High
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="billing-api"}[5m])) by (le)) > 0.5
    for: 10m
    labels: { severity: warning, team: platform }
    annotations: { summary: "billing-api P95 > 500 ms", runbook_url: "https://runbooks.acme.io/billing-api/latency" }

Severity — critical pages on-call, warning opens a ticket, info is dashboard-only. Alertmanager routing:

route:
  receiver: default
  group_by: [alertname, service]
  routes:
    - { matchers: [severity="critical"], receiver: pagerduty }
    - { matchers: [severity="warning"], receiver: opsgenie-ticket }
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [service]
receivers:
  - { name: pagerduty, pagerduty_configs: [{ routing_key: "${PD_ROUTING_KEY}" }] }
  - { name: opsgenie-ticket, opsgenie_configs: [{ api_key: "${OG_API_KEY}", priority: P3 }] }

Inhibition suppresses warning when critical for the same service is firing — one incident, one page.

Distributed Tracing with Jaeger

W3C trace context header — every service MUST propagate traceparent: 00-<trace-id>-<parent-span-id>-<flags>, e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 (flags 01 = sampled).
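
HTTP auto-instrumentation propagates this header for you; anything custom (a message queue, a webhook fanout) needs manual inject/extract. A TypeScript sketch with a hypothetical queue client:

import { context, propagation } from "@opentelemetry/api";

// Hypothetical queue client and message shape.
declare const queue: { publish(msg: { headers: Record<string, string>; body: unknown }): Promise<void> };
declare const msg: { headers: Record<string, string>; body: unknown };
declare function handleMessage(m: typeof msg): void;

// Producer: write traceparent (and tracestate) into the message headers.
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await queue.publish({ headers: carrier, body: { invoice_id: "inv_42" } });

// Consumer: restore the context so new spans join the original trace.
const parentCtx = propagation.extract(context.active(), msg.headers);
context.with(parentCtx, () => handleMessage(msg));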

Sampling — head-based TraceIdRatioBased(0.1) at first hop is cheap but drops errors; tail-based at the collector retains them. SigNoz tail_sampling config:

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow,   type: latency,     latency: { threshold_ms: 1000 } }
      - { name: sample, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

Analysis workflow: alert fires → open SigNoz → filter service.name and status=error in the alert window → sort by duration → open longest trace. The bottleneck span is almost always a DB query or a serial loop of HTTP calls.

Sentry Setup

Next.js sentry.client.config.ts:

import * as Sentry from "@sentry/nextjs";
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NEXT_PUBLIC_APP_ENV,
  tracesSampleRate: 0.1,
  replaysSessionSampleRate: 0.1, replaysOnErrorSampleRate: 1.0,
  integrations: [Sentry.replayIntegration({ maskAllText: true, blockAllMedia: true })],
});

sentry.server.config.ts mirrors this with tracesSampleRate: 0.2 and profilesSampleRate: 0.1. Upload source maps via withSentryConfig in next.config.js with SENTRY_AUTH_TOKEN; without this, stack traces are minified noise.
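
A next.config.js sketch (org and project slugs are placeholders; option names vary across @sentry/nextjs versions):

// next.config.js
const { withSentryConfig } = require("@sentry/nextjs");

const nextConfig = { /* your existing Next.js config */ };

module.exports = withSentryConfig(nextConfig, {
  org: "acme",                               // placeholder Sentry org slug
  project: "web",                            // placeholder project slug
  authToken: process.env.SENTRY_AUTH_TOKEN,  // CI secret; enables the source-map upload
  silent: true,                              // quiet the plugin in CI logs
});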

  • iOS: SentrySDK.start { $0.dsn = ...; $0.tracesSampleRate = 0.2; $0.enableAutoPerformanceTracing = true }.
  • Android: SentryAndroid.init(this) { it.dsn = BuildConfig.SENTRY_DSN; it.tracesSampleRate = 0.2; it.isAttachScreenshot = true }.

Triage: new issue → on-call triages within 24 h and assigns an owner → in-progress → resolved, or ignored with a written reason. Never leave an issue in-progress past a sprint.

SLO/SLI Design

An SLI is a measurement; an SLO is a promise; an SLA is the contract with consequences. Write the SLI in PromQL first.

# Availability SLI
sum(rate(http_requests_total{service="billing-api",status!~"5.."}[5m])) / sum(rate(http_requests_total{service="billing-api"}[5m]))
# Latency SLI — P95 under 500 ms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="billing-api"}[5m])) by (le)) < 0.5

SLO — 99.9% availability over a rolling 28-day window. Error budget: (1 - 0.999) × 28d × 24h × 60min = 40.32 minutes of bad traffic per 28d. Publish the SLO in Grafana alongside current burn. When the budget is spent, freeze feature releases and spend the next sprint on reliability — error budgets make the trade-off explicit instead of tribal.
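
Recording the SLI once keeps dashboards and burn-rate alerts consistent; a sketch (the slo: rule-name convention is a suggestion, not a Prometheus requirement):

groups:
- name: slo-billing-api
  rules:
  - record: slo:availability:ratio_rate5m
    expr: |
      sum(rate(http_requests_total{service="billing-api",status!~"5.."}[5m]))
      / sum(rate(http_requests_total{service="billing-api"}[5m]))
    labels: { service: billing-api }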

Error Budget Burn Rate

Burn rate = (error ratio over window) / (1 − SLO). Multi-window multi-burn-rate (Google SRE Workbook Ch. 5):

  • Fast burn — 1h window, 14.4× rate → 2% of monthly budget burns in 1h → page on-call.
  • Slow burn — 6h window, 6× rate → 5% in 6h → ticket, no page.

- alert: SloFastBurn
  expr: |
    (sum(rate(http_requests_total{service="billing-api",status=~"5.."}[1h]))
     / sum(rate(http_requests_total{service="billing-api"}[1h]))) > (14.4 * 0.001)
    and
    (sum(rate(http_requests_total{service="billing-api",status=~"5.."}[5m]))
     / sum(rate(http_requests_total{service="billing-api"}[5m]))) > (14.4 * 0.001)
  for: 2m
  labels: { severity: critical, slo: availability }
  annotations: { summary: "billing-api SLO fast burn (14.4x)", runbook_url: "https://runbooks.acme.io/slo/fast-burn" }
- alert: SloSlowBurn
  expr: |
    (sum(rate(http_requests_total{service="billing-api",status=~"5.."}[6h]))
     / sum(rate(http_requests_total{service="billing-api"}[6h]))) > (6 * 0.001)
  for: 15m
  labels: { severity: warning, slo: availability }

The and between 1h and 5m windows prevents stale firing — the short window must still be bad for the page to be valid.

Real User Monitoring (RUM)

Core Web Vitals at P75 from real browsers — LCP ≤ 2.5 s, INP ≤ 200 ms, CLS ≤ 0.1. Sentry Performance captures these automatically with @sentry/nextjs. Session replay — modest sample rates, aggressive masking:

Sentry.init({
  replaysSessionSampleRate: 0.1, replaysOnErrorSampleRate: 1.0,
  integrations: [Sentry.replayIntegration({
    maskAllText: true, maskAllInputs: true, blockAllMedia: true, blockClass: "sensitive",
  })],
});

Privacy — replays must mask all PII by default. An unmasked DOB or IBAN on a support agent screen is a GDPR breach. Add class="sensitive" to elements rendering personal data.

On-Call Runbooks

Every paging alert must have a runbook_url. Keep the runbook to one page with these sections:

  1. Symptom — exact alert text and Grafana panel link.
  2. Impact — which users see what, whether revenue is blocked.
  3. First checks — three dashboards and two log queries for the first 5 minutes.
  4. Known causes — ranked by frequency with remediation steps.
  5. Escalation — who to wake when step 4 fails.

Severity ladder — SEV1 revenue-impacting outage (page primary within 5 min), SEV2 major degradation beyond SLO (page within 15 min), SEV3 minor, workaround exists (ticket next business day), SEV4 cosmetic or internal-only (backlog).

Escalation — primary on-call → secondary after 15 min no-ack → engineering manager after 30 min → VP Eng after 60 min. Status page updates every 30 min during SEV1–SEV2 even if "no new information." Silence is the second incident.

Blameless Postmortem

Write one within 5 business days of every SEV1 and SEV2.

  1. Timeline — reconstructed from trace_ids, log timestamps, chat logs. Absolute UTC, not "around 3 pm."
  2. Root cause (5-why) — never stop at the first "why." Deploy broke API → config key misnamed → typo not caught in CI → schema test disabled → no one owned it.
  3. Impact — users affected, revenue lost, error budget spent.
  4. Went well / went poorly — no names; focus on systems and signals.
  5. Action items — each a JIRA ticket with owner and due date. No vague "improve monitoring"; write "add alert KafkaLagHigh threshold 5000 by 2026-05-01, owner @dee."

Publish internally. Anonymise and publish externally for customer-impacting SEV1s — it builds more trust than any marketing page.

Production Dashboards

Four dashboards every SaaS needs — Service Health (RED per service), Infrastructure (host CPU/memory/disk/network), Business KPIs (MRR, active users, signups, churn), SLO Tracker (current SLO % vs target, budget remaining per service). Pin all four to the on-call TV.

# Service Health
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Infrastructure
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes
# Business KPIs
sum(business_mrr_usd)
sum(increase(business_signups_total[1d]))
sum(active_users_gauge{window="24h"})
sum(increase(business_churn_total[30d])) / sum(business_customers_gauge offset 30d)
# SLO Tracker — availability % and budget remaining
sum(rate(http_requests_total{status!~"5.."}[28d])) by (service)
  / sum(rate(http_requests_total[28d])) by (service)
1 - (sum(rate(http_requests_total{status=~"5.."}[28d])) by (service)
     / sum(rate(http_requests_total[28d])) by (service)) / 0.001

During an incident, responders should not hunt dashboards — they should read them.

Companion Skills

  • observability-monitoring — foundational logs/metrics/traces — load first
  • reliability-engineering — toil elimination, risk quantification, incident response
  • database-reliability — database SLIs, replication lag tracking
  • kubernetes-platform — K8s pod/node metrics, PodMonitor for Prometheus
  • cicd-pipelines — deployment alerts tied to releases

Sources

  • Observability Engineering — Majors, Fong-Jones, Miranda (O'Reilly); Site Reliability Engineering — Google (free at sre.google/books)
  • SigNoz signoz.io/docs; OpenTelemetry opentelemetry.io/docs; Prometheus prometheus.io/docs; Grafana grafana.com/docs/grafana; Sentry docs.sentry.io