DevOps — Infrastructure & Deployment

Ship reliably. Monitor everything. Fix fast.

Deployment Checklist

Before any deploy:

All tests pass in CI (not just locally)
Environment variables set in target environment
Database migrations tested against production-like data
Rollback plan documented (even if it's "revert this commit")
Health check endpoint exists and returns 200

CI/CD Pipeline

push → lint → typecheck → test → build → deploy staging → smoke test → deploy prod

Stage	Fails?	Action
Lint/Types	Block merge	Fix locally
Tests	Block merge	Fix or update tests
Build	Block merge	Fix build errors
Staging deploy	Block prod	Debug in staging
Smoke test	Block prod	Rollback staging, investigate
Prod deploy	Alert on-call	Rollback immediately

Docker

# Multi-stage build — keep images small
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]

Rules:

Always pin base image versions (not latest)
Use .dockerignore — never ship node_modules, .git, .env
One process per container
Health check in Dockerfile: HEALTHCHECK CMD curl -f http://localhost:3000/health

Monitoring

What	Tool Options	Alert When
Uptime	UptimeRobot, Checkly	Down > 30 seconds
Errors	Sentry, Datadog	Error rate > 1%
Latency	Grafana, Datadog	p95 > 2 seconds
Resources	Cloud provider metrics	CPU > 80%, memory > 85%
Logs	Datadog, Axiom, CloudWatch	Error patterns, keywords

Rules:

Every alert must have a runbook (even a one-liner)
If an alert fires and needs no action, delete it — alert fatigue kills
Log structured JSON, not printf strings
Include request ID in every log line for tracing

Infrastructure Defaults

Decision	Default	Why
Hosting	Vercel / Railway / Fly.io	Zero-config, scales
Database	Managed Postgres (Supabase, Neon, RDS)	Don't manage your own DB
Cache	Upstash Redis	Serverless, no ops
Queue	Inngest, Trigger.dev, or SQS	Managed, retries built-in
Storage	S3 / R2 / Supabase Storage	Cheap, reliable
DNS	Cloudflare	Fast, free tier
Secrets	Environment variables via platform	Never in code or git

Incident Response

Detect — alert fires or user report
Acknowledge — someone owns it (within 5 min)
Mitigate — rollback, feature flag off, or scale up (fix the bleeding)
Investigate — root cause after bleeding stops
Fix — proper fix with tests
Postmortem — blameless, focus on systems not people

devops

DevOps — Infrastructure & Deployment

Deployment Checklist

CI/CD Pipeline

Docker

Monitoring

Infrastructure Defaults

Incident Response