health-check
Health Check - Liveness, Readiness & Dependency Probes
Codifies the project's three-tier health probe architecture (shallow liveness, readiness, deep dependency checks), Docker container healthchecks, cron-based periodic monitoring, route discovery verification, and the comprehensive system health script. Patterns align with Kubernetes probe conventions even when running outside K8s.
Description
Codifies liveness, readiness, and deep dependency health endpoints for NodeJS-Starter-V1's Next.js and FastAPI services, covering three-tier probe architecture, Docker healthchecks, cron-based monitoring, route discovery verification, and the system health script.
When to Apply
Positive Triggers
- Adding new health check endpoints or probes
- Integrating new dependencies that need health verification
- Configuring Docker healthchecks for containers
- Setting up periodic health monitoring via cron
- Implementing startup, liveness, or readiness probes
- Adding service dependency checks to existing endpoints
- User mentions: "health check", "liveness", "readiness", "probe", "heartbeat", "service health", "dependency check"
Negative Triggers
- Collecting application metrics (use metrics-collector instead)
- Adding structured log statements (use structured-logging instead)
- Designing dashboard UI for health status (use dashboard-patterns instead)
- Implementing graceful shutdown (use graceful-shutdown when available)
Core Directives
The Three Laws of Health Checks
- Three tiers, not one: Separate liveness (am I alive?), readiness (can I serve traffic?), and deep (are all dependencies healthy?). Never combine them.
- Parallel dependency checks: Check all dependencies concurrently via Promise.all or asyncio.gather. Never check sequentially; a slow database should not delay the Redis check.
- 503 for unhealthy: Return HTTP 200 for healthy/degraded, HTTP 503 for unhealthy. Load balancers and orchestrators use status codes, not response bodies.
Existing Project Infrastructure
Backend (FastAPI)
| Endpoint | Type | Location |
|---|---|---|
| GET /health | Liveness | apps/backend/src/api/routes/health.py |
| GET /ready | Readiness | apps/backend/src/api/routes/health.py |
| GET /api/agents/{id}/health | Agent health | apps/backend/src/api/routes/agent_dashboard.py |
Frontend (Next.js)
| Endpoint | Type | Location |
|---|---|---|
| GET /api/health | Shallow liveness | apps/web/app/api/health/route.ts |
| GET /api/health/deep | Deep dependency | apps/web/app/api/health/deep/route.ts |
| GET /api/health/routes | Route discovery | apps/web/app/api/health/routes/route.ts |
| GET /api/cron/health-check | Periodic cron | apps/web/app/api/cron/health-check/route.ts |
Docker
| Service | Command | Interval | Timeout | Retries |
|---|---|---|---|---|
| PostgreSQL | pg_isready -U starter_user -d starter_db | 10s | 5s | 5 |
| Redis | redis-cli ping | 10s | 5s | 5 |
System Script
scripts/health-check.ps1 — 6-phase comprehensive health check (prerequisites, database, backend, frontend, integration, summary) with exit code 0 (healthy) or 1 (unhealthy).
Health Status Model
All health endpoints use a three-state status:
| Status | HTTP Code | Meaning | Action |
|---|---|---|---|
| healthy | 200 | All systems operational | None |
| degraded | 200 | Functional but impaired | Monitor, alert |
| unhealthy | 503 | Cannot serve requests | Remove from load balancer |
Aggregation Rule
if any dependency is unhealthy → overall = unhealthy (503)
else if any dependency is degraded → overall = degraded (200)
else → overall = healthy (200)
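The aggregation rule can be sketched as one small pure function. This is a minimal illustration, not the project's implementation: the name aggregate_status is hypothetical, and treating any status other than unhealthy or degraded (including unchecked, or an empty list) as healthy is an assumption that follows the rule as written.

```python
from typing import Iterable, Tuple


def aggregate_status(statuses: Iterable[str]) -> Tuple[str, int]:
    """Collapse per-dependency statuses into (overall status, HTTP code)."""
    seen = set(statuses)
    if "unhealthy" in seen:
        return "unhealthy", 503  # any unhealthy dependency wins
    if "degraded" in seen:
        return "degraded", 200   # impaired but still serving
    return "healthy", 200        # assumption: everything else counts as healthy
```

Keeping the rule in one function means every endpoint derives its status code from the same logic instead of re-implementing the precedence.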
Probe Patterns
Tier 1: Liveness (Shallow)
Returns immediately with minimal computation. Used by load balancers and orchestrators to confirm the process is alive.
Backend (/health):
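from datetime import datetime, timezone
from fastapi import APIRouter

router = APIRouter()

@router.get("/health")
async def health_check() -> dict[str, str]:
    # Liveness only: no I/O and no dependency checks.
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
        "version": "0.1.0",
    }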
Frontend (/api/health):
interface HealthResponse {
  status: "healthy" | "degraded" | "unhealthy";
  timestamp: string;
  version: string;
  uptime: number;
  environment: string;
}
Rules: No database calls, no external service checks, no expensive computation. Must respond in under 50 ms.
Tier 2: Readiness
Confirms the service can accept and process requests. Checks that critical dependencies are reachable.
Backend (/ready):
from datetime import datetime, timezone

@router.get("/ready")
async def readiness_check() -> dict[str, str]:
    # Check database connectivity
    # Check Redis connectivity
    # Check AI provider availability
    return {"status": "ready", "timestamp": datetime.now(timezone.utc).isoformat()}
Rules: Check only fast, critical dependencies (database, cache). Timeout each check at 2–5 seconds. Do not check optional or slow services.
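One way to apply these rules on the Python side is sketched below, using only the standard library. The names is_ready and _passes and the 2-second default are assumptions for illustration; the check callables stand in for whatever database and cache pings the service actually performs.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[None]]


async def _passes(check: Check, timeout_s: float) -> bool:
    """Return True if the check completes within the timeout without raising."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return True
    except Exception:  # a timeout or any check failure means "not ready"
        return False


async def is_ready(checks: Dict[str, Check], timeout_s: float = 2.0) -> Dict[str, bool]:
    """Run all critical checks concurrently and report which ones passed."""
    names = list(checks)
    results = await asyncio.gather(*(_passes(checks[n], timeout_s) for n in names))
    return dict(zip(names, results))
```

The service is ready only if every value in the returned mapping is True; otherwise the endpoint would respond with 503. Because the checks run through asyncio.gather, total latency is bounded by the slowest check rather than the sum.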
Tier 3: Deep Dependency Check
Checks all dependencies in parallel with latency measurement. Used for debugging and monitoring dashboards, not for load balancer probes (too slow).
Frontend (/api/health/deep):
interface DependencyCheck {
  name: string;
  status: "healthy" | "degraded" | "unhealthy" | "unchecked";
  latency_ms: number | null;
  error: string | null;
  last_checked: string;
}
Each dependency checker follows this pattern:
- Record start time
- Attempt operation with timeout (AbortSignal.timeout(5000))
- Measure latency (Date.now() - start)
- Classify: healthy (success), degraded (slow or partial), unhealthy (error/timeout)
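For a FastAPI-side equivalent, the four steps above might look like the following sketch. The function name timed_check and the 1000 ms slow_ms threshold for "degraded" are assumptions; the source does not define a numeric cut-off for "slow".

```python
import asyncio
import time
from datetime import datetime, timezone
from typing import Any, Awaitable, Callable, Dict


async def timed_check(
    name: str,
    probe: Callable[[], Awaitable[Any]],
    timeout_s: float = 5.0,
    slow_ms: float = 1000.0,  # assumed cut-off between healthy and degraded
) -> Dict[str, Any]:
    """Run one dependency probe and classify the outcome."""
    result: Dict[str, Any] = {
        "name": name,
        "status": "unchecked",
        "latency_ms": None,
        "error": None,
        "last_checked": datetime.now(timezone.utc).isoformat(),
    }
    start = time.monotonic()  # 1. record start time
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)  # 2. attempt with timeout
        result["status"] = "healthy"
    except asyncio.TimeoutError:
        result["status"] = "unhealthy"
        result["error"] = f"timed out after {timeout_s}s"
    except Exception as exc:
        result["status"] = "unhealthy"
        result["error"] = type(exc).__name__  # class name only, no message or trace
    # 3. measure latency
    result["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
    # 4. classify: a successful but slow probe counts as degraded
    if result["status"] == "healthy" and result["latency_ms"] > slow_ms:
        result["status"] = "degraded"
    return result
```

Several such checks can be awaited together with asyncio.gather. Reporting only the exception class name keeps internal error details out of the response body.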
Checks run in parallel via Promise.all:
const [database, backend, verification] = await Promise.all([
  checkDatabase(),
  checkBackend(),
  checkVerificationSystem(),
]);
The summary aggregates results:
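// `checks` collects the results of the parallel dependency checks.
const checks = [database, backend, verification];
const summary = {
  total_checks: checks.length,
  passed: checks.filter(c => c.status === "healthy").length,
  failed: checks.filter(c => c.status === "unhealthy").length,
  degraded: checks.filter(c => c.status === "degraded").length,
};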
Docker Healthcheck Pattern
Docker Compose healthchecks run service-native commands via CMD or CMD-SHELL:
services:
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U starter_user -d starter_db"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
For application containers, use curl or wget against the liveness endpoint:
services:
  backend:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
start_period gives the application time to initialise before healthchecks begin. Use depends_on with condition: service_healthy to sequence container startup.
Cron-Based Monitoring
The project's /api/cron/health-check runs every 5 minutes, pings the backend, and logs results. Secured with CRON_SECRET bearer token.
Pattern for adding new periodic checks:
import { NextResponse } from "next/server";

// `logger` is the project's structured logger; the backend URL comes from the environment.
const backendUrl = process.env.BACKEND_URL;

export async function GET(request: Request) {
  // 1. Verify CRON_SECRET; reject outright if the secret is not configured
  const secret = process.env.CRON_SECRET;
  const authHeader = request.headers.get("authorization");
  if (!secret || authHeader !== `Bearer ${secret}`) {
    return new NextResponse("Unauthorized", { status: 401 });
  }
  // 2. Run checks with latency measurement; a timeout or network error counts as unhealthy
  const start = Date.now();
  let ok = false;
  try {
    const response = await fetch(`${backendUrl}/health`, {
      signal: AbortSignal.timeout(5000),
    });
    ok = response.ok;
  } catch {
    ok = false;
  }
  const latency = Date.now() - start;
  // 3. Log results
  logger.info("Health check cron", { backend: ok, latency });
  // 4. Alert if unhealthy
  if (!ok) { logger.error("Backend unhealthy"); }
  // 5. Return results
  return NextResponse.json({ status: ok ? "healthy" : "unhealthy" });
}
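If a FastAPI-side scheduled endpoint needs the same bearer-token guard, it might be sketched as follows. The helper name is_authorised is hypothetical. The check fails closed when the secret is unset, which matters because in a JavaScript template string an unset variable interpolates as the text "undefined" and would otherwise match a header of "Bearer undefined". The comparison uses hmac.compare_digest from the standard library so that it runs in constant time.

```python
import hmac
from typing import Optional


def is_authorised(auth_header: Optional[str], cron_secret: Optional[str]) -> bool:
    """Check an Authorization header against the configured cron secret."""
    if not cron_secret or not auth_header:
        return False  # fail closed when the secret or the header is missing
    expected = f"Bearer {cron_secret}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(auth_header.encode(), expected.encode())
```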
Route Health Verification
The /api/health/routes endpoint discovers all API routes by scanning the filesystem and optionally verifies each GET endpoint:
- Discovery: Recursively scan app/api/ for route.ts files
- Method detection: Parse file content for exported HTTP methods (GET, POST, PUT, PATCH, DELETE)
- Verification (optional ?verify=true): Send GET request to each endpoint with 5-second timeout
- Status: verified (200 OK), error (non-200 or timeout), unverified (not tested)
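The discovery and method-detection steps can be illustrated with a short standalone script, for example as a CI-side cross-check. This is a sketch under assumptions, not the endpoint's actual code: discover_routes and METHOD_RE are hypothetical names, the regular expression only recognises `export function` and `export const` declarations, and dynamic segments such as [id] are returned verbatim.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches e.g. "export async function GET" or "export const POST".
METHOD_RE = re.compile(
    r"export\s+(?:async\s+)?(?:function|const)\s+(GET|POST|PUT|PATCH|DELETE)\b"
)


def discover_routes(api_root: Path, prefix: str = "/api") -> Dict[str, List[str]]:
    """Map each route URL under api_root to the HTTP methods its route.ts exports."""
    routes: Dict[str, List[str]] = {}
    for route_file in sorted(api_root.rglob("route.ts")):
        rel = route_file.parent.relative_to(api_root).as_posix()
        url = prefix if rel == "." else f"{prefix}/{rel}"
        source = route_file.read_text(encoding="utf-8")
        routes[url] = sorted(set(METHOD_RE.findall(source)))
    return routes
```

Calling discover_routes(Path("apps/web/app/api")) would list every route with its exported methods; only routes that include GET are candidates for the optional verification step.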
Adding a New Dependency Check
When integrating a new service, add a checker following this template:
async function checkNewService(): Promise<DependencyCheck> {
  const start = Date.now();
  const result: DependencyCheck = {
    name: "service_name",
    status: "unchecked",
    latency_ms: null,
    error: null,
    last_checked: new Date().toISOString(),
  };
  try {
    // Service-specific check (e.g., ping, SELECT 1, PING)
    const response = await fetch(serviceUrl, {
      signal: AbortSignal.timeout(5000),
    });
    result.latency_ms = Date.now() - start;
    result.status = response.ok ? "healthy" : "degraded";
    if (!response.ok) result.error = `HTTP ${response.status}`;
  } catch (e) {
    result.latency_ms = Date.now() - start;
    result.status = "unhealthy";
    result.error = e instanceof Error ? e.message : "Unknown error";
  }
  return result;
}
Then add it to the Promise.all array in the deep health endpoint.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Database query in liveness probe | Probe fails when DB is slow, kills healthy process | Liveness = process alive only; DB check in readiness |
| Sequential dependency checks | Total latency = sum of all checks | Promise.all / asyncio.gather for parallel |
| 200 OK when unhealthy | Load balancer keeps routing traffic to broken instance | 503 for unhealthy, 200 for healthy/degraded |
| No timeout on dependency checks | Single hung dependency blocks entire health response | AbortSignal.timeout(5000) on every check |
| Exposing sensitive details in health response | Internal errors, stack traces leaked to public | Return status + latency only; log details server-side |
| No start_period in Docker healthcheck | Container marked unhealthy during boot | Set start_period to cover startup time |
| Hardcoded service URLs in health checks | Breaks across environments | Use environment variables (BACKEND_URL, etc.) |
Checklist for New Health Endpoints
Structure
- Three-tier separation (liveness, readiness, deep)
- healthy / degraded / unhealthy status values
- HTTP 200 for healthy/degraded, 503 for unhealthy
- Response includes timestamp and version
Dependencies
- Parallel checking via Promise.all or asyncio.gather
- 5-second timeout per dependency check
- Latency measurement per dependency
- Graceful handling of missing environment variables
Docker
- Container healthcheck using service-native command
- interval, timeout, retries configured
- start_period covers application boot time
- condition: service_healthy for dependent services
Monitoring
- Cron-based periodic checks for production
- CRON_SECRET authentication on cron endpoints
- Health check latency instrumented via metrics-collector
- Failures logged via structured-logging
Response Format
[AGENT_ACTIVATED]: Health Check
[PHASE]: {Design | Implementation | Review}
[STATUS]: {in_progress | complete}
{health check analysis or implementation guidance}
[NEXT_ACTION]: {what to do next}
Integration Points
Metrics Collector
- health_check_duration_ms histogram per dependency
- health_check_status gauge (1=healthy, 0.5=degraded, 0=unhealthy)
Structured Logging
- Info-level health check results (dependency, status, latency)
- Error-level alerts when dependencies become unhealthy
Error Taxonomy
- SYS_HEALTH_DEPENDENCY_UNAVAILABLE (503): critical dependency unreachable
- SYS_HEALTH_TIMEOUT (504): dependency check exceeded timeout
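A small sketch of how a failed dependency check could be mapped onto these two codes. The function name classify_failure is hypothetical, and treating every non-timeout exception as "dependency unavailable" is an assumption; the project's error taxonomy may draw the line differently.

```python
import asyncio
from typing import Tuple


def classify_failure(exc: BaseException) -> Tuple[str, int]:
    """Map an exception from a dependency check to (error code, HTTP status)."""
    # asyncio.TimeoutError is an alias of TimeoutError on Python 3.11+,
    # and a separate class before that, so both are listed.
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "SYS_HEALTH_TIMEOUT", 504
    return "SYS_HEALTH_DEPENDENCY_UNAVAILABLE", 503
```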
Cron Scheduler
- /api/cron/health-check runs every 5 minutes with CRON_SECRET auth
- Results can trigger alerting via notification system
Dashboard Patterns
- StatusPulse component for live dependency status indicators
- DataStrip for health check latency metrics
- Connection status mapped to spectral colours (emerald=healthy, amber=degraded, red=unhealthy)
Australian Localisation (en-AU)
- Spelling: initialise, serialise, analyse, optimise, colour, behaviour
- Date: ISO 8601 in responses; DD/MM/YYYY in dashboard display
- Timezone: AEST/AEDT for display; timestamps stored as UTC