a6-recipe-health-check

Overview

Health checks monitor upstream backend nodes and automatically remove unhealthy nodes from the load balancer pool. APISIX supports two types:

  • Active: APISIX periodically probes each node with HTTP/HTTPS/TCP requests
  • Passive: APISIX analyzes real traffic responses to detect failures

Use both together for the most robust setup.

When to Use

  • Automatically remove failing backend nodes from rotation
  • Detect and recover from backend failures without manual intervention
  • Ensure high availability across multiple backend instances
  • Monitor backend health status via the a6 CLI

Health Check Configuration Reference

Active Health Check

| Field | Type | Default | Description |
|---|---|---|---|
| checks.active.type | string | "http" | Check type: "http", "https", or "tcp" |
| checks.active.http_path | string | "/" | HTTP path to probe |
| checks.active.host | string | (none) | Host header for HTTP probes |
| checks.active.port | integer | (none) | Override port for probing (default: use node port) |
| checks.active.timeout | number | 1 | Probe timeout in seconds |
| checks.active.concurrency | integer | 10 | Number of concurrent probes |
| checks.active.https_verify_certificate | boolean | true | Verify TLS certificate for HTTPS probes |
| checks.active.req_headers | array[string] | (none) | Additional request headers for probes |
| checks.active.healthy.interval | integer | 1 | Seconds between probes for healthy nodes |
| checks.active.healthy.successes | integer | 2 | Consecutive successes to mark node healthy |
| checks.active.healthy.http_statuses | array[integer] | [200, 302] | HTTP codes considered healthy |
| checks.active.unhealthy.interval | integer | 1 | Seconds between probes for unhealthy nodes |
| checks.active.unhealthy.http_failures | integer | 5 | Consecutive HTTP failures to mark unhealthy |
| checks.active.unhealthy.tcp_failures | integer | 2 | Consecutive TCP failures to mark unhealthy |
| checks.active.unhealthy.timeouts | integer | 3 | Consecutive timeouts to mark unhealthy |
| checks.active.unhealthy.http_statuses | array[integer] | [429, 404, 500, 501, 502, 503, 504, 505] | HTTP codes considered unhealthy |

Passive Health Check

| Field | Type | Default | Description |
|---|---|---|---|
| checks.passive.type | string | "http" | Check type: "http", "https", or "tcp" |
| checks.passive.healthy.successes | integer | 5 | Consecutive successes to mark healthy |
| checks.passive.healthy.http_statuses | array[integer] | [200, 201, 202, ..., 399] | HTTP codes considered healthy |
| checks.passive.unhealthy.http_failures | integer | 5 | Consecutive HTTP failures to mark unhealthy |
| checks.passive.unhealthy.tcp_failures | integer | 2 | Consecutive TCP failures to mark unhealthy |
| checks.passive.unhealthy.timeouts | integer | 7 | Consecutive timeouts to mark unhealthy |
| checks.passive.unhealthy.http_statuses | array[integer] | [429, 500, 503] | HTTP codes considered unhealthy |
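
To see how the defaults in the reference tables combine with a partial configuration, here is a minimal sketch in Python. It assumes that unset fields simply fall back to the documented defaults; `deep_merge` and `user_active` are hypothetical names used only for illustration and are not part of APISIX or the a6 CLI.

```python
import copy

# Defaults copied from the active health check table above.
# Fields without a documented default (host, port, req_headers) are omitted.
ACTIVE_DEFAULTS = {
    "type": "http",
    "http_path": "/",
    "timeout": 1,
    "concurrency": 10,
    "https_verify_certificate": True,
    "healthy": {"interval": 1, "successes": 2, "http_statuses": [200, 302]},
    "unhealthy": {
        "interval": 1,
        "http_failures": 5,
        "tcp_failures": 2,
        "timeouts": 3,
        "http_statuses": [429, 404, 500, 501, 502, 503, 504, 505],
    },
}


def deep_merge(defaults, overrides):
    """Return a new dict: defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical user config: only a few fields are set explicitly.
user_active = {
    "http_path": "/health",
    "healthy": {"interval": 5},
    "unhealthy": {"http_failures": 3},
}

effective = deep_merge(ACTIVE_DEFAULTS, user_active)
print(effective["http_path"])
print(effective["healthy"])
print(effective["unhealthy"]["timeouts"])
```

In this sketch, `healthy.successes` and `healthy.http_statuses` keep their table defaults even though only `healthy.interval` was overridden, which is worth remembering when a probe returns a status such as 204 that is not in the default healthy list.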

Step-by-Step: Configure Health Checks

1. Active HTTP health check

a6 upstream create -f - <<'EOF'
{
  "id": "backend",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1,
    "backend-3:8080": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": {
        "interval": 5,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 3,
        "http_failures": 3,
        "http_statuses": [500, 502, 503]
      }
    }
  }
}
EOF

APISIX probes /health on each node:

  • Every 5s for healthy nodes
  • Every 3s for unhealthy nodes
  • 3 consecutive failures → node removed
  • 2 consecutive successes → node restored
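
The threshold behaviour in the list above can be sketched as a small state machine. This is a simplified model for illustration, not APISIX's actual implementation: it assumes a success resets the failure streak and a failure resets the success streak, and `simulate_active_checks` is a hypothetical helper.

```python
def simulate_active_checks(probe_results, failures_to_remove=3, successes_to_restore=2):
    """Replay probe outcomes (True = pass, False = fail).

    Returns the node status after each probe.
    """
    healthy = True
    fail_streak = 0
    success_streak = 0
    timeline = []
    for passed in probe_results:
        if passed:
            success_streak += 1
            fail_streak = 0
            # A removed node comes back only after enough consecutive successes.
            if not healthy and success_streak >= successes_to_restore:
                healthy = True
        else:
            fail_streak += 1
            success_streak = 0
            # A healthy node is removed only after enough consecutive failures.
            if healthy and fail_streak >= failures_to_remove:
                healthy = False
        timeline.append("healthy" if healthy else "unhealthy")
    return timeline


# Two failures followed by a success do not remove the node; three in a row do.
probes = [True, False, False, True, False, False, False, True, True]
timeline = simulate_active_checks(probes)
for passed, status in zip(probes, timeline):
    print("pass" if passed else "fail", "->", status)
```

In this model the node stays in rotation through the first two failures, is removed on the third consecutive failure, and returns after the second consecutive success.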

2. Passive health check (analyze real traffic)

a6 upstream create -f - <<'EOF'
{
  "id": "backend-passive",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1
  },
  "checks": {
    "passive": {
      "type": "http",
      "unhealthy": {
        "http_failures": 3,
        "http_statuses": [500, 502, 503],
        "timeouts": 3
      },
      "healthy": {
        "successes": 5,
        "http_statuses": [200, 201, 202, 203, 204]
      }
    }
  }
}
EOF

Passive checks send no probes: APISIX watches the responses to real traffic. After 3 consecutive responses with status 500, 502, or 503 (or 3 timeouts), the node is removed. After 5 consecutive successes, it is restored.

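Note: Passive-only health checks cannot recover a node that receives no traffic, because a removed node never gets the successful responses it needs to be restored. Combine with active checks for full coverage. Some APISIX versions may also reject a checks block that defines passive without active, so verify this example against your APISIX version.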
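
The recovery problem described in the note can be illustrated with a toy simulation. It is a sketch under stated assumptions, not a model of APISIX internals: requests are spread round-robin over healthy nodes only, and `backend-1` is assumed to fail its first three requests and then be fixed. `run_passive_only` is a hypothetical helper.

```python
def run_passive_only(total_requests, failures_to_remove=3, successes_to_restore=5):
    """Route requests round-robin over healthy nodes and apply passive rules."""
    nodes = ["backend-1", "backend-2"]
    healthy = {n: True for n in nodes}
    fail_streak = {n: 0 for n in nodes}
    success_streak = {n: 0 for n in nodes}
    served = {n: 0 for n in nodes}

    for turn in range(total_requests):
        pool = [n for n in nodes if healthy[n]]  # unhealthy nodes get no traffic
        node = pool[turn % len(pool)]
        served[node] += 1

        # Assumption: backend-1 fails its first 3 requests, then is fixed.
        status = 500 if node == "backend-1" and served[node] <= 3 else 200

        if status in (500, 502, 503):
            fail_streak[node] += 1
            success_streak[node] = 0
            if fail_streak[node] >= failures_to_remove:
                healthy[node] = False
        else:
            success_streak[node] += 1
            fail_streak[node] = 0
            if not healthy[node] and success_streak[node] >= successes_to_restore:
                healthy[node] = True  # never reached for a node with no traffic
    return healthy, served


healthy, served = run_passive_only(20)
print(healthy)
print(served)
```

Even though `backend-1` would answer 200 from its fourth request onward, it never receives that request in this model, so its success counter never moves and it stays out of rotation.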

3. Combined active + passive (recommended for production)

a6 upstream create -f - <<'EOF'
{
  "id": "production-backend",
  "type": "roundrobin",
  "nodes": {
    "backend-1:8080": 1,
    "backend-2:8080": 1,
    "backend-3:8080": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": {
        "interval": 5,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 2,
        "http_failures": 3,
        "timeouts": 2,
        "http_statuses": [500, 502, 503, 504]
      }
    },
    "passive": {
      "type": "http",
      "unhealthy": {
        "http_failures": 3,
        "http_statuses": [500, 502, 503],
        "timeouts": 3
      },
      "healthy": {
        "successes": 3,
        "http_statuses": [200, 201, 204]
      }
    }
  }
}
EOF

4. Check upstream health status

# View health status of all nodes
a6 upstream health backend

Common Patterns

TCP health check (non-HTTP services)

{
  "checks": {
    "active": {
      "type": "tcp",
      "healthy": {
        "interval": 5,
        "successes": 2
      },
      "unhealthy": {
        "interval": 2,
        "tcp_failures": 3,
        "timeouts": 2
      }
    }
  }
}

HTTPS health check with certificate verification

{
  "checks": {
    "active": {
      "type": "https",
      "http_path": "/health",
      "https_verify_certificate": true,
      "healthy": {
        "interval": 10,
        "successes": 2,
        "http_statuses": [200]
      },
      "unhealthy": {
        "interval": 5,
        "http_failures": 3
      }
    }
  }
}

Custom probe headers (for auth-protected health endpoints)

{
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/internal/health",
      "host": "health.internal",
      "req_headers": [
        "Authorization: Bearer health-check-token",
        "X-Health-Check: true"
      ],
      "healthy": {
        "interval": 10,
        "successes": 2
      },
      "unhealthy": {
        "interval": 5,
        "http_failures": 3
      }
    }
  }
}

Aggressive unhealthy detection (fast failover)

{
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "timeout": 2,
      "healthy": {
        "interval": 3,
        "successes": 1
      },
      "unhealthy": {
        "interval": 1,
        "http_failures": 1,
        "timeouts": 1
      }
    }
  }
}

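With these settings a healthy node is probed every 3 seconds and a single failure or timeout marks it unhealthy, so a failure is detected within roughly 3 seconds plus the probe timeout of up to 2 seconds. An unhealthy node is probed every second and a single success restores it, so recovery takes about 1 second.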
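
A rough way to estimate these timings from a configuration is sketched below. It assumes, as the reference table describes, that healthy nodes are probed at `healthy.interval` and unhealthy nodes at `unhealthy.interval`. It counts probe intervals plus one probe timeout and ignores scheduling jitter and response time. `estimate_timings` is a hypothetical helper, and the numbers are upper-bound estimates rather than guarantees.

```python
def estimate_timings(active):
    """Return (worst-case detection seconds, approximate recovery seconds)."""
    healthy = active["healthy"]
    unhealthy = active["unhealthy"]
    timeout = active.get("timeout", 1)  # documented default is 1 second

    # A node that is still marked healthy is probed at healthy.interval,
    # so N failures need N probes, and the last probe may run into the timeout.
    detect = healthy["interval"] * unhealthy["http_failures"] + timeout

    # A removed node is probed at unhealthy.interval until it has
    # accumulated enough consecutive successes.
    recover = unhealthy["interval"] * healthy["successes"]
    return detect, recover


aggressive = {
    "timeout": 2,
    "healthy": {"interval": 3, "successes": 1},
    "unhealthy": {"interval": 1, "http_failures": 1, "timeouts": 1},
}
step_one = {
    "healthy": {"interval": 5, "successes": 2},
    "unhealthy": {"interval": 3, "http_failures": 3},
}

aggressive_estimate = estimate_timings(aggressive)
step_one_estimate = estimate_timings(step_one)
print("aggressive:", aggressive_estimate)
print("step 1:", step_one_estimate)
```

Under this model, lowering `healthy.interval` shortens detection far more than lowering `unhealthy.interval`, which mainly affects how quickly a recovered node returns to rotation.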

Config Sync Example

version: "1"
upstreams:
  - id: production-backend
    type: roundrobin
    nodes:
      "backend-1:8080": 1
      "backend-2:8080": 1
      "backend-3:8080": 1
    checks:
      active:
        type: http
        http_path: /health
        healthy:
          interval: 5
          successes: 2
          http_statuses: [200]
        unhealthy:
          interval: 2
          http_failures: 3
          timeouts: 2
          http_statuses: [500, 502, 503, 504]
      passive:
        type: http
        unhealthy:
          http_failures: 3
          http_statuses: [500, 502, 503]
          timeouts: 3
        healthy:
          successes: 3
          http_statuses: [200, 201, 204]
routes:
  - id: api
    uri: /api/*
    upstream_id: production-backend
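
Because health checks only run for upstreams attached to a route (see Troubleshooting), it can help to check a sync file for upstreams that define checks but are never referenced. The sketch below assumes the YAML has already been loaded into a plain dict. The `staging-backend` entry and the `unreferenced_upstreams` helper are hypothetical and only illustrate the idea.

```python
def unreferenced_upstreams(config):
    """Return ids of upstreams that define checks but are used by no route."""
    referenced = {
        route["upstream_id"]
        for route in config.get("routes", [])
        if "upstream_id" in route
    }
    return [
        upstream["id"]
        for upstream in config.get("upstreams", [])
        if "checks" in upstream and upstream["id"] not in referenced
    ]


# Dict mirroring the structure of the sync file above (checks abbreviated).
config = {
    "version": "1",
    "upstreams": [
        {"id": "production-backend", "checks": {"active": {"type": "http"}}},
        # Hypothetical upstream that no route points at.
        {"id": "staging-backend", "checks": {"active": {"type": "http"}}},
    ],
    "routes": [
        {"id": "api", "uri": "/api/*", "upstream_id": "production-backend"},
    ],
}

orphans = unreferenced_upstreams(config)
print(orphans)
```

Any id reported here would have its `checks` block configured but, per the Troubleshooting table, would not actually be probed until a route references it.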

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Health checks not running | No route references the upstream | Health checks only run for upstreams attached to at least one route |
| All nodes marked unhealthy | Health endpoint returns wrong status code | Verify http_statuses includes your health endpoint's response code |
| Node not recovering | Passive-only: no traffic reaches unhealthy node | Add active health checks for recovery |
| Probe hitting wrong endpoint | Default http_path is / | Set http_path to your actual health endpoint |
| TLS probe fails | Certificate verification fails | Set https_verify_certificate: false or fix certificates |
| Health checks too aggressive | Low thresholds with flaky endpoints | Increase failures threshold and interval |
| a6 upstream health shows no data | APISIX hasn't started health checks yet | Wait for the first probe interval to complete |