mcp-health-checker
MCP Health Checker
What it does
MCP (Model Context Protocol) servers are how OpenClaw connects to external tools — but connections go stale silently. A crashed MCP server doesn't throw an error until the agent tries to use it, causing confusing mid-task failures.
MCP Health Checker proactively monitors all configured MCP connections. It pings servers, measures latency, tracks uptime history, and alerts you before a stale connection causes a problem.
Inspired by OpenLobster's MCP connection health monitoring and OAuth 2.1+PKCE token refresh tracking.
When to invoke
- Automatically every 6 hours (cron) — silent background health check
- Manually before starting a task that depends on MCP tools
- When an MCP tool call fails unexpectedly — diagnose the connection
- After restarting MCP servers — verify all connections restored
Health checks performed
| Check | What it tests | Severity on failure |
|---|---|---|
| REACHABLE | Server responds to connection probe | CRITICAL |
| LATENCY | Response time under threshold (default: 5s) | HIGH |
| STALE | Connection age exceeds max (default: 24h) | HIGH |
| TOOL_COUNT | Server exposes expected number of tools | MEDIUM |
| CONFIG_VALID | MCP config entry has required fields | MEDIUM |
| AUTH_EXPIRY | OAuth/API token approaching expiration | HIGH |
How to use
python3 check.py --ping # Ping all configured MCP servers
python3 check.py --ping --server <name> # Ping a specific server
python3 check.py --ping --timeout 3 # Custom timeout in seconds
python3 check.py --status # Last check summary from state
python3 check.py --history # Show past check results
python3 check.py --config # Validate MCP config entries
python3 check.py --format json # Machine-readable output
Cron wakeup behaviour
Every 6 hours:
- Read MCP server configuration from
~/.openclaw/config/(YAML/JSON) - For each configured server:
- Attempt connection probe (TCP or HTTP depending on transport)
- Measure response latency
- Check connection age against staleness threshold
- Verify tool listing matches expected count (if tracked)
- Check auth token expiry (if applicable)
- Update state with per-server health records
- Print summary: healthy / degraded / unreachable counts
- Exit 1 if any CRITICAL findings
Procedure
Step 1 — Run a health check
python3 check.py --ping
Review the output. Healthy servers show a green check. Degraded servers show latency warnings. Unreachable servers show a critical alert.
Step 2 — Diagnose a specific server
python3 check.py --ping --server filesystem
Detailed output for a single server: latency, last seen, tool count, auth status.
Step 3 — Validate configuration
python3 check.py --config
Checks that all MCP config entries have the required fields (command, args or url depending on transport type).
Step 4 — Review history
python3 check.py --history
Shows uptime trends over the last 20 checks. Spot servers that are intermittently failing.
State
Per-server health records and check history stored in ~/.openclaw/skill-state/mcp-health-checker/state.yaml.
Fields: last_check_at, servers list, check_history.
Notes
- Does not modify MCP configuration — read-only monitoring
- Connection probes use the same transport as the MCP server (stdio subprocess spawn or HTTP GET)
- For stdio servers: probes verify the process can start and respond to
initialize - For HTTP/SSE servers: probes send a health-check HTTP request
- Latency threshold configurable via
--timeout(default: 5s) - Staleness threshold configurable via
--max-age(default: 24h)