a6-recipe-health-check

Overview

Health checks monitor upstream backend nodes and automatically remove unhealthy nodes from the load balancer pool. APISIX supports two types:

Active: APISIX periodically probes each node with HTTP/HTTPS/TCP requests
Passive: APISIX analyzes real traffic responses to detect failures

Use both together for the most robust setup.

When to Use

Automatically remove failing backend nodes from rotation
Detect and recover from backend failures without manual intervention
Ensure high availability across multiple backend instances
Monitor backend health status via the a6 CLI

Health Check Configuration Reference

Active Health Check

Field	Type	Default	Description
`checks.active.type`	string	`"http"`	Check type: `"http"`, `"https"`, or `"tcp"`
`checks.active.http_path`	string	`"/"`	HTTP path to probe
`checks.active.host`	string	—	Host header for HTTP probes
`checks.active.port`	integer	—	Override port for probing (default: use node port)
`checks.active.timeout`	number	`1`	Probe timeout in seconds
`checks.active.concurrency`	integer	`10`	Number of concurrent probes
`checks.active.https_verify_certificate`	boolean	`true`	Verify TLS certificate for HTTPS probes
`checks.active.req_headers`	array[string]	—	Additional request headers for probes
`checks.active.healthy.interval`	integer	`1`	Seconds between probes for healthy nodes
`checks.active.healthy.successes`	integer	`2`	Consecutive successes to mark node healthy
`checks.active.healthy.http_statuses`	array[integer]	`[200, 302]`	HTTP codes considered healthy
`checks.active.unhealthy.interval`	integer	`1`	Seconds between probes for unhealthy nodes
`checks.active.unhealthy.http_failures`	integer	`5`	Consecutive HTTP failures to mark unhealthy
`checks.active.unhealthy.tcp_failures`	integer	`2`	Consecutive TCP failures to mark unhealthy
`checks.active.unhealthy.timeouts`	integer	`3`	Consecutive timeouts to mark unhealthy
`checks.active.unhealthy.http_statuses`	array[integer]	`[429, 404, 500, 501, 502, 503, 504, 505]`	HTTP codes considered unhealthy

Passive Health Check

Field	Type	Default	Description
`checks.passive.type`	string	`"http"`	Check type: `"http"`, `"https"`, or `"tcp"`
`checks.passive.healthy.successes`	integer	`5`	Consecutive successes to mark healthy
`checks.passive.healthy.http_statuses`	array[integer]	`[200, 201, 202, ..., 399]`	HTTP codes considered healthy
`checks.passive.unhealthy.http_failures`	integer	`5`	Consecutive failures to mark unhealthy
`checks.passive.unhealthy.tcp_failures`	integer	`2`	Consecutive TCP failures to mark unhealthy
`checks.passive.unhealthy.timeouts`	integer	`7`	Consecutive timeouts to mark unhealthy
`checks.passive.unhealthy.http_statuses`	array[integer]	`[429, 500, 503]`	HTTP codes considered unhealthy

Step-by-Step: Configure Health Checks

1. Active HTTP health check

a6 upstream create -f - <<'EOF'
{
  "id": "backend",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1,
    "backend-3:8080": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": {
        "interval": 5,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 3,
        "http_failures": 3,
        "http_statuses": [500, 502, 503]
      }
    }
  }
}
EOF

APISIX probes /health on each node:

Every 5s for healthy nodes
Every 3s for unhealthy nodes
3 consecutive failures → node removed
2 consecutive successes → node restored

2. Passive health check (analyze real traffic)

a6 upstream create -f - <<'EOF'
{
  "id": "backend-passive",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1
  },
  "checks": {
    "passive": {
      "type": "http",
      "unhealthy": {
        "http_failures": 3,
        "http_statuses": [500, 502, 503],
        "timeouts": 3
      },
      "healthy": {
        "successes": 5,
        "http_statuses": [200, 201, 202, 203, 204]
      }
    }
  }
}
EOF

No probing — APISIX watches real traffic responses. After 3 consecutive 5xx errors, the node is removed. After 5 consecutive successes, it's restored.

Note: Passive-only health checks cannot recover a node that receives no traffic. Combine with active checks for full coverage.

3. Combined active + passive (recommended for production)

a6 upstream create -f - <<'EOF'
{
  "id": "production-backend",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1,
    "backend-3:8080": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": {
        "interval": 5,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 2,
        "http_failures": 3,
        "timeouts": 2,
        "http_statuses": [500, 502, 503, 504]
      }
    },
    "passive": {
      "type": "http",
      "unhealthy": {
        "http_failures": 3,
        "http_statuses": [500, 502, 503],
        "timeouts": 3
      },
      "healthy": {
        "successes": 3,
        "http_statuses": [200, 201, 204]
      }
    }
  }
}
EOF

4. Check upstream health status

# View health status of all nodes
a6 upstream health backend

Common Patterns

TCP health check (non-HTTP services)

{
  "checks": {
    "active": {
      "type": "tcp",
      "healthy": {
        "interval": 5,
        "successes": 2
      },
      "unhealthy": {
        "interval": 2,
        "tcp_failures": 3,
        "timeouts": 2
      }
    }
  }
}

HTTPS health check with certificate verification

{
  "checks": {
    "active": {
      "type": "https",
      "http_path": "/health",
      "https_verify_certificate": true,
      "healthy": {
        "interval": 10,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 5,
        "http_failures": 3
      }
    }
  }
}

Custom probe headers (for auth-protected health endpoints)

{
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/internal/health",
      "host": "health.internal",
      "req_headers": [
        "Authorization: Bearer health-check-token",
        "X-Health-Check: true"
      ],
      "healthy": {
        "interval": 10,
        "successes": 2
      },
      "unhealthy": {
        "interval": 5,
        "http_failures": 3
      }
    }
  }
}

Aggressive unhealthy detection (fast failover)

{
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "timeout": 2,
      "healthy": {
        "interval": 3,
        "successes": 1
      },
      "unhealthy": {
        "interval": 1,
        "http_failures": 1,
        "timeouts": 1
      }
    }
  }
}

Detects failures within 1 second and recovers within 3 seconds.

Config Sync Example

version: "1"
upstreams:
  - id: production-backend
    type: roundrobin
    nodes:
      "backend-1:8080": 1
      "backend-2:8080": 1
      "backend-3:8080": 1
    checks:
      active:
        type: http
        http_path: /health
        healthy:
          interval: 5
          successes: 2
          http_statuses: [200]
        unhealthy:
          interval: 2
          http_failures: 3
          timeouts: 2
          http_statuses: [500, 502, 503, 504]
      passive:
        type: http
        unhealthy:
          http_failures: 3
          http_statuses: [500, 502, 503]
          timeouts: 3
        healthy:
          successes: 3
          http_statuses: [200, 201, 204]
routes:
  - id: api
    uri: /api/*
    upstream_id: production-backend

Troubleshooting

Symptom	Cause	Fix
Health checks not running	No route references the upstream	Health checks only run for upstreams attached to at least one route
All nodes marked unhealthy	Health endpoint returns wrong status code	Verify `http_statuses` includes your health endpoint's response code
Node not recovering	Passive-only: no traffic reaches unhealthy node	Add active health checks for recovery
Probe hitting wrong endpoint	Default `http_path` is `/`	Set `http_path` to your actual health endpoint
TLS probe fails	Certificate verification fails	Set `https_verify_certificate: false` or fix certificates
Health checks too aggressive	Low thresholds with flaky endpoints	Increase `failures` threshold and `interval`
`a6 upstream health` shows no data	APISIX hasn't started health checks yet	Wait for the first probe interval to complete