kubespray-troubleshooting
Kubespray Troubleshooting
Overview
Diagnose and fix common Kubespray deployment failures. Most failures stem from network misconfiguration, etcd issues, or stale state from previous attempts.
Core principle: Read the exact task name that failed, check logs on that specific node, then fix and re-run (Ansible is idempotent).
When to Use
- Deployment fails mid-playbook
- kubeadm join errors
- etcd health check timeouts
- Nodes stuck in NotReady state
- Certificate-related failures
Not for: Initial deployment setup (use kubespray-deployment), upgrades (use kubespray-operations), certificate renewal (use kubespray-certificates)
Quick Diagnostic Flow
Playbook failed
│
▼
┌─────────────────┐
│ Which task? │
└────────┬────────┘
│
┌────┼────┬────────────┐
│ │ │ │
▼ ▼ ▼ ▼
etcd join containerd other
│ │ │ │
▼ ▼ ▼ ▼
Check Check Check Check
etcd IP containerd Ansible
logs config status logs -vvv
| Task Failed | First Check | Command |
|---|---|---|
| etcd health | etcd logs | journalctl -u etcd -f |
| kubeadm join | IP configuration | Verify ip= in inventory |
| container-engine | containerd status | systemctl status containerd |
| download | Network/proxy | Check internet connectivity |
| any task | Ansible debug | Re-run with -vvv flag |
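When the table above does not pinpoint the cause, a verbose re-run limited to the failing host usually does. A minimal sketch, assuming the failing node is named k8s-ctr in your inventory (substitute your own host name):
# Confirm Ansible can still reach the host at all
ansible -i inventory/mycluster/inventory.ini k8s-ctr -m ping -b
# Re-run only against that host with maximum verbosity
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -vvv --limit k8s-ctr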
Problem: VirtualBox NAT IP (10.0.2.15)
Symptom:
error execution phase preflight: couldn't validate the identity of the API Server:
Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info":
dial tcp 10.0.2.15:6443: connect: connection refused
Cause: Kubespray detected VirtualBox NAT interface instead of host-only network.
Fix: Add explicit ip= to inventory:
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
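A fuller inventory sketch with an explicit host-only address on every node (host names and addresses here are illustrative; etcd_member_name follows Kubespray's sample inventory):
# inventory/mycluster/inventory.ini (sketch)
k8s-ctr ansible_host=192.168.10.10  ip=192.168.10.10  etcd_member_name=etcd1
k8s-w1  ansible_host=192.168.10.101 ip=192.168.10.101
k8s-w2  ansible_host=192.168.10.102 ip=192.168.10.102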
If the cluster was already deployed with the wrong IP, you must reset and redeploy:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Fix inventory, then:
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: etcd Health Check Failure
Symptom:
TASK [etcd : Configure | Wait for etcd cluster to be healthy]
fatal: [controller-0]: FAILED! => {"cmd": "etcdctl endpoint health"...
"dial tcp 192.168.10.100:2379: connect: connection refused"
Diagnose:
# On etcd node
systemctl status etcd
journalctl -u etcd -f
# Check if listening
ss -tlnp | grep 2379
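If the service is up but the playbook still reports it unhealthy, run the health check by hand with the cluster certificates. A sketch assuming Kubespray's default certificate directory /etc/ssl/etcd/ssl/ and a node named k8s-ctr (adjust the endpoint and file names to what actually exists in that directory):
# On the etcd node - manual health check over TLS
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.10.10:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \
  endpoint health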
Common causes:
- Wrong IP in etcd config - reset and redeploy with the correct ip=
- Certificate mismatch - check the certificates and permissions under /etc/ssl/etcd/ssl/
- Firewall blocking - ensure ports 2379/2380 are open (see the port check below)
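To rule out the firewall cause, a quick check from another etcd or control-plane node (addresses are illustrative; ufw applies to Ubuntu/Debian hosts):
# Can we reach etcd's client and peer ports from a peer node?
nc -vz 192.168.10.10 2379
nc -vz 192.168.10.10 2380
# On Ubuntu/Debian, check whether a host firewall is active
ufw status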
Fix for stale state:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: Nodes Stuck NotReady
Symptom: kubectl get nodes shows NotReady status
Diagnose:
# Check kubelet
systemctl status kubelet
journalctl -u kubelet -f
# Check CNI
ls /etc/cni/net.d/
ls /opt/cni/bin/
# Check node conditions
kubectl describe node <node-name>
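It is often quickest to check whether the CNI pods themselves are running. A sketch assuming the default Calico plugin (grep for your plugin's name if you chose another):
# CNI daemonset pods should be Running on every node
kubectl get pods -n kube-system -o wide | grep -E 'calico|flannel|cilium'
kubectl get daemonsets -n kube-system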
Common causes:
- CNI not installed - check that the network_plugin role completed
- containerd not running - systemctl restart containerd
- kubelet misconfigured - check /etc/kubernetes/kubelet-config.yaml (see the kubelet sketch below)
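If kubelet is the suspect, the VirtualBox NAT issue above can also surface here as a wrong --node-ip. A sketch, assuming Kubespray's kubelet environment file at /etc/kubernetes/kubelet.env (the path may differ by version):
# Verify kubelet advertises the host-only IP, not 10.0.2.15
grep -i node-ip /etc/kubernetes/kubelet.env
# Restart and watch recent logs after any config change
systemctl daemon-reload && systemctl restart kubelet
journalctl -u kubelet --since "5 min ago"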
Problem: "No hosts matched"
Symptom:
[WARNING]: Could not match supplied host pattern, ignoring: etcd
skipping: no hosts matched
Cause: Inventory path or syntax error
Fix:
# Use file path, not directory
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Verify inventory parses correctly
ansible -i inventory/mycluster/inventory.ini etcd --list-hosts
ansible -i inventory/mycluster/inventory.ini kube_control_plane --list-hosts
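"No hosts matched" frequently means a group name is missing or misspelled. A minimal sketch of the group layout recent Kubespray releases expect (host names are illustrative; older releases used hyphenated group names such as kube-master):
[kube_control_plane]
k8s-ctr

[etcd]
k8s-ctr

[kube_node]
k8s-w1
k8s-w2

[k8s_cluster:children]
kube_control_plane
kube_node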
Problem: Container Runtime Not Running
Symptom:
[ERROR CRI]: container runtime is not running:
"transport: Error while dialing dial unix /var/run/containerd/containerd.sock:
connect: no such file or directory"
Fix:
# Check containerd
systemctl status containerd
journalctl -u containerd
# Restart if needed
systemctl restart containerd
# Verify socket exists
ls -la /var/run/containerd/containerd.sock
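Once containerd is back up, a CRI-level check confirms kubelet can actually talk to it. A sketch using crictl (typically installed by Kubespray) against the default containerd socket:
# Query containerd through the CRI socket kubelet uses
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a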
Problem: Certificate Errors
Symptom:
x509: certificate has expired or is not yet valid
Diagnose:
# Check cert expiration
kubeadm certs check-expiration
# Check specific cert
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
Fix: See kubespray-certificates skill for renewal procedures.
Reset Procedure
When deployment is corrupted beyond repair:
# Full reset - removes all K8s components
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Confirm with "yes" when prompted
# After reset, verify clean state
systemctl status kubelet # should be inactive
ls /etc/kubernetes/ # should be empty/minimal
# Redeploy
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Note: Reset removes etcd data. All cluster state is lost.
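reset.yml occasionally leaves residue behind that can break the redeploy, so it is worth inspecting a few locations first. A sketch of the usual suspects (paths that do not exist are fine):
# Leftover state directories
ls /var/lib/etcd/ 2>/dev/null
ls /etc/cni/net.d/ 2>/dev/null
ls /var/lib/kubelet/ 2>/dev/null
# Stale CNI interfaces sometimes survive a reset
ip link show | grep -E 'cali|flannel|cni0'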
Log Locations
| Component | Log Command |
|---|---|
| etcd | journalctl -u etcd |
| kubelet | journalctl -u kubelet |
| containerd | journalctl -u containerd |
| API server | kubectl logs -n kube-system kube-apiserver-<node> |
| Ansible | Run with -vvv for debug output |
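Ansible itself keeps no persistent log by default, so capture the verbose output to a file during troubleshooting; the log file name below is illustrative:
# Keep a full debug transcript of the run
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -vvv 2>&1 | tee kubespray-run.log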
Re-running After Failure
Ansible is idempotent - safe to re-run after fixing issues:
# Re-run full playbook (skips completed tasks)
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Re-run specific tags only
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags etcd
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags network
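Two other flags shorten the retry loop: --limit to target only the node that failed, and --start-at-task to resume at the task named in the failure (copy the task name exactly as Ansible printed it; the name below is illustrative). Note that Kubespray generally needs facts from every host, so a --limit run may require a prior full fact-gathering pass:
# Re-run only on the node that failed
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --limit k8s-w1
# Resume from the failed task instead of replaying everything
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --start-at-task="etcd : Configure | Wait for etcd cluster to be healthy"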