monitor

Production Monitoring Setup

Set up monitoring for a production application. If a target is provided, scope recommendations to that service or feature. Cover all four pillars, then configure health checks, alerts, SLOs, and dashboards.

The Four Pillars

Monitor all four. Missing any one creates a blind spot.

1. Errors

Track unhandled exceptions, failed API calls, and client-side errors. Don't wait for users to report them.

| What to capture | How |
| --- | --- |
| Unhandled exceptions (server) | Global error handler reports to the error tracking service. Include stack trace and request context. |
| Unhandled exceptions (client) | window.onerror and onunhandledrejection wired to error tracking. |
| Failed API calls | Log 4xx and 5xx responses with request path, status, and duration. |
| Client-side errors | React error boundaries catch render failures. Report with the component tree. |

Tool examples: Sentry (full stack, source maps, release tracking), LogRocket (session replay + error context).
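
As a concrete illustration, here is a minimal sketch of the client-side wiring described above. The reportError helper and the /api/errors endpoint are placeholders, not part of any SDK; swap in Sentry.captureException or whatever the project actually uses.

// lib/client-error-reporting.ts (illustrative sketch; adapt to your error tracking SDK)

// Hypothetical transport: replace with your SDK call or your own ingestion endpoint.
function reportError(payload: { message: string; stack?: string; source: string }) {
  navigator.sendBeacon('/api/errors', JSON.stringify(payload))
}

export function installClientErrorReporting() {
  // Uncaught synchronous errors anywhere on the page.
  window.onerror = (message, _source, _line, _col, error) => {
    reportError({ message: String(message), stack: error?.stack, source: 'window.onerror' })
  }

  // Rejected promises that nothing awaited or .catch()-ed.
  window.onunhandledrejection = (event) => {
    const reason = event.reason
    reportError({
      message: reason instanceof Error ? reason.message : String(reason),
      stack: reason instanceof Error ? reason.stack : undefined,
      source: 'unhandledrejection',
    })
  }
}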

2. Performance

Measure response times and user experience. Use percentiles, not averages.

| Metric | Target | How to measure |
| --- | --- | --- |
| API response time (p50) | < 200ms | Server-side timing middleware. |
| API response time (p95) | < 500ms | Same middleware, track percentile distribution. |
| API response time (p99) | < 1000ms | Same middleware. If p99 spikes, investigate outlier queries. |
| Largest Contentful Paint | < 2.5s | Real User Monitoring (RUM) or Lighthouse CI. |
| First Input Delay | < 100ms | RUM. |
| Cumulative Layout Shift | < 0.1 | RUM or Lighthouse CI. |

Tool examples: Vercel Analytics (zero-config for Next.js), Speedlify (self-hosted Lighthouse tracking).
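
A minimal sketch of the timing-middleware idea, assuming an Express-style server and an in-memory sample buffer. A real setup would ship durations to a metrics backend instead of computing percentiles in process.

// middleware/request-timing.ts (illustrative sketch, not a production metrics pipeline)
import type { Request, Response, NextFunction } from 'express'

const samples: number[] = [] // bounded window of recent request durations (ms)
const MAX_SAMPLES = 10_000

export function requestTiming(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint()
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6
    samples.push(ms)
    if (samples.length > MAX_SAMPLES) samples.shift()
  })
  next()
}

// Percentile over the current window: percentile(95) is the value below which
// 95% of recent requests completed.
export function percentile(p: number): number {
  if (samples.length === 0) return 0
  const sorted = [...samples].sort((a, b) => a - b)
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)
  return sorted[idx]
}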

3. Availability

Know when your service is down before your users do.

| What to check | Frequency | Alert if |
| --- | --- | --- |
| Health check endpoint | Every 30s | Two consecutive failures. |
| SSL certificate expiry | Daily | Less than 14 days remaining. |
| DNS resolution | Every 5m | Resolution fails or returns an unexpected IP. |
| Key third-party services | Every 1m | Dependency returns errors for > 2 minutes. |

Tool examples: Better Uptime (status pages + incident management), Checkly (synthetic monitoring with Playwright).
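
To make the "two consecutive failures" rule concrete, here is a minimal external probe sketch. The hosted tools above implement this for you; the URL and notifyOnCall function are placeholders, and global fetch assumes Node 18+.

// scripts/uptime-probe.ts (illustrative; hosted uptime monitors do this for you)
const HEALTH_URL = 'https://example.com/api/health' // placeholder URL
const INTERVAL_MS = 30_000

let consecutiveFailures = 0

async function probe() {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5_000) })
    consecutiveFailures = res.ok ? 0 : consecutiveFailures + 1
  } catch {
    consecutiveFailures += 1
  }
  // Alert only on the second consecutive failure to skip transient blips.
  if (consecutiveFailures === 2) {
    notifyOnCall(`Health check failing: ${HEALTH_URL}`)
  }
}

// Placeholder: wire this to PagerDuty, Slack, or whatever the team uses.
function notifyOnCall(message: string) {
  console.error(message)
}

setInterval(probe, INTERVAL_MS)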

4. Business Metrics

Technical metrics alone don't tell you if the product works.

| Metric | Why it matters |
| --- | --- |
| Signups per hour/day | Detects registration flow breakage immediately. |
| Conversion rate | Drop signals checkout or onboarding issues. |
| Key feature usage | Confirms new features are actually being used. |
| Error rate per user action | Ties technical errors to user impact. |

Tool examples: PostHog (open-source product analytics, feature flags, session replay), Mixpanel (funnel and retention analysis).
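
For example, instrumenting the signup flow with PostHog might look like the sketch below; the event and property names are arbitrary, and the API key and host are placeholders.

// lib/analytics.ts (sketch using posthog-js; adapt to whichever analytics tool is in use)
import posthog from 'posthog-js'

posthog.init('<your-project-api-key>', { api_host: 'https://us.i.posthog.com' })

// Call this from the signup success path so a broken registration flow
// shows up as a sudden drop in 'user_signed_up' events.
export function trackSignup(userId: string, plan: string) {
  posthog.identify(userId)
  posthog.capture('user_signed_up', { plan })
}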

Health Check Endpoint

Create a /api/health endpoint that verifies real connectivity, not just "the server is running."

// app/api/health/route.ts
import { NextResponse } from 'next/server'
// `db` and `redis` are stand-ins for the app's own database and cache clients;
// import them from wherever they are initialized in the project.

export async function GET() {
  const checks: Record<string, 'ok' | 'fail'> = {}

  // Check database connectivity
  try {
    await db.query('SELECT 1')
    checks.database = 'ok'
  } catch {
    checks.database = 'fail'
  }

  // Check external services (cache, queue, etc.)
  try {
    await redis.ping()
    checks.cache = 'ok'
  } catch {
    checks.cache = 'fail'
  }

  const allHealthy = Object.values(checks).every((s) => s === 'ok')

  return NextResponse.json(
    { status: allHealthy ? 'healthy' : 'degraded', checks },
    { status: allHealthy ? 200 : 503 },
  )
}

Adapt the checks to the actual services in the stack. Return 503 if any check fails so load balancers and uptime monitors detect it.

Alerting Strategy

Alert on actionable signals only. Noisy alerts get ignored.

| Signal | Severity | Notification channel | Response time |
| --- | --- | --- | --- |
| Health check down (2+ consecutive) | Critical | PagerDuty / phone call | < 15 minutes |
| Error rate > 5% of requests | Critical | Slack #incidents + PagerDuty | < 15 minutes |
| Error rate > 1% of requests | Warning | Slack #alerts | < 1 hour |
| p95 response time > 1s | Warning | Slack #alerts | < 1 hour |
| SSL cert expiry < 14 days | Info | Slack #ops | < 1 day |
| Disk usage > 80% | Warning | Slack #ops | < 4 hours |
| Deploy completed | Info | Slack #deploys | No response needed |

Rules for good alerting:

  • Don't alert on single transient errors. Use thresholds and windows (e.g., "error rate > 5% over 5 minutes"); see the sketch after this list.
  • Every alert must have a clear owner and a documented first-response action.
  • Review alert fatigue monthly. If an alert fires > 3 times without action, fix it or remove it.
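
A minimal sketch of the threshold-plus-window idea, assuming request outcomes are recorded as timestamped samples in memory; in practice this query runs against the metrics store.

// lib/error-rate-window.ts (illustrative; a metrics backend normally does this)
type Sample = { timestamp: number; isError: boolean }

const WINDOW_MS = 5 * 60 * 1000 // 5-minute window
const THRESHOLD = 0.05          // alert when > 5% of requests in the window errored

const samples: Sample[] = []

export function recordRequest(isError: boolean) {
  samples.push({ timestamp: Date.now(), isError })
}

export function shouldAlert(now: number = Date.now()): boolean {
  // Drop samples that have aged out of the window.
  while (samples.length > 0 && samples[0].timestamp < now - WINDOW_MS) {
    samples.shift()
  }
  if (samples.length === 0) return false
  const errors = samples.filter((s) => s.isError).length
  return errors / samples.length > THRESHOLD
}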

SLOs (Service Level Objectives)

Define SLOs for key services. SLOs set expectations; SLIs measure them.

| Service | SLI (what you measure) | SLO (target) | Error budget |
| --- | --- | --- | --- |
| API availability | % of requests returning non-5xx | 99.9% | 8.7 hours downtime / year |
| API latency | p95 response time | < 500ms | 5% of requests can exceed |
| Web vitals (LCP) | p75 LCP across all pages | < 2.5s | 25% of page loads can exceed |
| Data pipeline freshness | Time since last successful sync | < 15 minutes | 4 allowed SLO violations / month |

How to use error budgets:

  • When the budget is healthy, ship fast and take calculated risks.
  • When the budget is low, freeze non-critical deploys and focus on reliability.
  • Track budget burn rate weekly, as in the sketch below. A sudden spike means something broke.
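
A small sketch of the error budget arithmetic, assuming a 99.9% availability SLO and request counts pulled from whatever metrics store is in use.

// lib/error-budget.ts (illustrative arithmetic; plug in counts from your metrics store)
const SLO_TARGET = 0.999 // 99.9% of requests must return non-5xx

export function errorBudget(totalRequests: number, failedRequests: number) {
  // Budget: how many failures the SLO tolerates over the window.
  const allowedFailures = totalRequests * (1 - SLO_TARGET)
  const remaining = allowedFailures - failedRequests
  // Burn ratio: fraction of the budget already consumed (> 1 means the SLO is blown).
  const burnRatio = allowedFailures > 0 ? failedRequests / allowedFailures : 0
  return { allowedFailures, remaining, burnRatio }
}

// Example: 10M requests this window, 7,000 of them failed.
// The budget is 10,000 failures, so 70% of it is burned with 3,000 to spare.
// const { remaining, burnRatio } = errorBudget(10_000_000, 7_000)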

Dashboard Essentials

Build a team dashboard with these panels. Keep it to one screen.

| Panel | What it shows | Why |
| --- | --- | --- |
| Error rate (last 24h) | Errors per minute, with deploy markers | Correlate errors with deploys instantly. |
| Latency trends (last 24h) | p50, p95, p99 lines on one chart | Spot gradual degradation before it becomes critical. |
| Active users (real-time) | Current connected / active users | Context for error rates: 10 errors with 10 users is worse than 10 errors with 10,000. |
| Uptime status | Green/red for each monitored endpoint | Glanceable health. |
| Deploy history | Last 10 deploys with timestamp and author | Quick reference for "what changed?" |
| SLO burn rate | Error budget remaining for the period | Know when to slow down. |

Incident Response

Use severity levels to set expectations and escalation.

| Level | Definition | Example | Response expectation |
| --- | --- | --- | --- |
| SEV1 | Service down or major data loss. Most users affected. | API returning 500 for all requests. Payment processing broken. | All hands. Respond in < 15 min. Communicate status every 30 min. |
| SEV2 | Significant degradation. Some users affected. | Slow response times. One region down. Feature broken for a subset of users. | On-call responds in < 30 min. Hourly status updates. |
| SEV3 | Minor issue. Workaround exists. | Non-critical feature broken. Cosmetic bug in production. | Address within business hours. No status page update needed. |

For every incident:

  1. Acknowledge the alert and declare severity.
  2. Open an incident channel (or thread) for coordination.
  3. Mitigate first, investigate second. Get the service back up, then find root cause.
  4. Write a brief post-incident review: what happened, why, and what will prevent recurrence.