# kubeadm Troubleshooting
## Overview
A systematic approach to diagnosing kubeadm cluster issues.

Core principle: trace symptoms to specific components. Most problems fall into one of three areas: certificates, networking, or kubelet configuration.
## When to Use

- `kubeadm init` or `kubeadm join` fails
- Nodes show `NotReady` status
- Pods stuck in `Pending` state
- Certificate-related errors
- kubelet crashlooping
- Control plane components failing
## Diagnostic Flowchart

```
Symptom
│
├─ kubeadm init fails ────────────▶ Check pre-flight errors
│                                   Check port conflicts
│                                   Check containerd status
│
├─ kubeadm join fails ────────────▶ Check token validity
│                                   Check network connectivity
│                                   Check firewall (port 6443)
│
├─ Node NotReady ─────────────────▶ Check CNI installation
│                                   Check kubelet logs
│                                   Check node conditions
│
├─ Pod Pending ───────────────────▶ If CoreDNS: install CNI
│                                   If other: check node resources
│                                   Check taints/tolerations
│
├─ Certificate error ─────────────▶ Check certificate expiry
│                                   Check SAN configuration
│                                   Check clock synchronization
│
└─ kubelet crashloop ─────────────▶ Check config.yaml exists
                                    Check containerd running
                                    Check cgroup driver match
```
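The file-based checks in the flowchart can be automated. Below is a minimal POSIX-sh sketch that only inspects the filesystem (no `systemctl` calls), so it is safe to run on any node; the function name and the optional root-prefix argument are illustrative assumptions added here to make it testable, not a kubeadm convention.

```shell
# kubeadm_triage: print a hint for each common failure precondition found.
# Optional $1 is a root prefix (illustrative, for testing); omit it on a real node.
kubeadm_triage() {
  root="${1:-}"
  hints=0
  # Missing kubelet config means kubeadm init/join never completed
  if [ ! -e "$root/var/lib/kubelet/config.yaml" ]; then
    echo "HINT: /var/lib/kubelet/config.yaml missing -> kubeadm init/join has not completed"
    hints=$((hints + 1))
  fi
  # Missing socket means containerd is not running
  if [ ! -S "$root/run/containerd/containerd.sock" ]; then
    echo "HINT: containerd socket missing -> check 'systemctl status containerd'"
    hints=$((hints + 1))
  fi
  # Empty CNI config dir means nodes will stay NotReady
  if [ -z "$(ls -A "$root/etc/cni/net.d" 2>/dev/null)" ]; then
    echo "HINT: no CNI config in /etc/cni/net.d -> node will stay NotReady until a CNI is installed"
    hints=$((hints + 1))
  fi
  echo "$hints potential issue(s) found"
}
```

Run it on the affected node and follow whichever hint prints; an empty CNI directory, for example, explains a persistent `NotReady` status by itself.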
## Quick Diagnostic Commands

```bash
# Cluster overview
kubectl get nodes -o wide
kubectl get pods -A
kubectl cluster-info

# Node details
kubectl describe node <node-name>

# Component health
kubectl get componentstatuses   # deprecated but sometimes useful
kubectl get --raw='/readyz?verbose'

# kubelet status
systemctl status kubelet
journalctl -u kubelet -f --no-pager

# containerd status
systemctl status containerd
crictl ps
crictl pods
crictl info

# Network
ss -tlnp | grep -E '6443|10250|2379'
ip route
```
## Issue: kubeadm init Fails

### Pre-flight Errors

```bash
# See what's failing
kubeadm init --dry-run 2>&1 | grep -E '\[ERROR\]|\[WARNING\]'
```
| Error | Fix |
|---|---|
| `[ERROR CRI]: container runtime not running` | `systemctl start containerd` |
| `[ERROR Swap]: running with swap on` | `swapoff -a` |
| `[ERROR Port-6443]: Port 6443 in use` | Previous cluster: `kubeadm reset` |
| `[ERROR DirAvailable]: /etc/kubernetes not empty` | `rm -rf /etc/kubernetes/*` |
| `[ERROR FileAvailable--etc-kubernetes-manifests-*]` | Clean previous manifests |
### Port Conflicts

```bash
# Check what's using required ports
ss -tlnp | grep -E '6443|2379|2380|10250|10259|10257'

# If left over from a previous cluster
kubeadm reset -f
rm -rf /etc/kubernetes /var/lib/kubelet /var/lib/etcd
iptables -F && iptables -t nat -F
```
### containerd Issues

```bash
# Check containerd is running
systemctl status containerd

# Check the CRI plugin is enabled (it should NOT appear in disabled_plugins)
grep disabled_plugins /etc/containerd/config.toml

# Check the socket exists
ls -la /run/containerd/containerd.sock

# Test with crictl
crictl info
```
## Issue: kubeadm join Fails

### Connection Refused

```bash
# From the worker node - test API server reachability
curl -k https://192.168.10.100:6443/healthz
# Expected: "ok"
# "Connection refused" means a firewall is in the way or the API server is not listening

# On the control plane - check the API server is listening
ss -tlnp | grep 6443
# Should show *:6443 or 0.0.0.0:6443

# Check the firewall
systemctl is-active firewalld
# If active:
firewall-cmd --list-ports
# Or disable it:
systemctl disable --now firewalld
```
### Token Expired

```bash
# On the control plane - check token validity
kubeadm token list

# If the list is empty or the token has expired:
kubeadm token create --print-join-command
```
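A mistyped token is a common cause of join failures that surface as authentication errors. kubeadm bootstrap tokens have the documented form `[a-z0-9]{6}.[a-z0-9]{16}`, so a quick shell check (sketch below; the helper name is hypothetical) catches truncated or garbled tokens before you ever touch the API server:

```shell
# valid_token: succeed only if $1 matches the kubeadm bootstrap token format
# <6 chars>.<16 chars>, lowercase alphanumerics only.
valid_token() {
  echo "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}
```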
### CA Hash Mismatch

```bash
# Regenerate the correct hash on the control plane
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
  openssl rsa -pubin -outform der 2>/dev/null | \
  openssl dgst -sha256 -hex | sed 's/^.* //'
```
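To compare against the `sha256:<hex>` value embedded in a join command, the pipeline above can be wrapped in a small helper (hypothetical name; assumes an RSA CA key, which is the kubeadm default, and that openssl is installed):

```shell
# ca_cert_hash: print the discovery hash (64 hex chars) for the CA cert at $1.
ca_cert_hash() {
  openssl x509 -pubkey -in "$1" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex | sed 's/^.* //'
}
```

`kubeadm join --discovery-token-ca-cert-hash` expects this value prefixed with `sha256:`.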
### TLS Bootstrap Timeout

```bash
# On the worker - watch kubelet logs during the join
journalctl -u kubelet -f

# Common causes:
# - Firewall blocking port 6443
# - Wrong advertise address (multi-NIC hosts)
# - Network routing issues
```
## Issue: Node NotReady

### Check Node Conditions

```bash
kubectl describe node <node-name> | grep -A 20 Conditions
```
| Condition | False Means |
|---|---|
| Ready | Node unhealthy |
| MemoryPressure | Memory OK |
| DiskPressure | Disk OK |
| PIDPressure | PIDs OK |
| NetworkUnavailable | CNI working |
### NetworkUnavailable = True (CNI Missing)

```bash
# Check whether a CNI is installed
ls /etc/cni/net.d/
# Empty = no CNI

# Install Flannel
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# Verify
kubectl get pods -n kube-flannel
kubectl get nodes   # Should become Ready
```
### kubelet Not Reporting

```bash
# On the NotReady node
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

# Common issues:
# - config.yaml missing: kubeadm join never completed
# - containerd socket missing: containerd not running
# - cgroup driver mismatch: SystemdCgroup not set
```
## Issue: Pods Stuck Pending

### CoreDNS Pending (No CNI)

```bash
kubectl get pods -n kube-system | grep coredns
# coredns-xxx   0/1   Pending

kubectl describe pod -n kube-system coredns-xxx
# Events: FailedScheduling - no nodes available to schedule

# Fix: install a CNI
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
```
Other Pods Pending
kubectl describe pod <pod-name> -n <namespace>
# Check Events section
# Common causes:
# - Insufficient resources
# - Node taints without tolerations
# - Node selector mismatch
# - PVC not bound
### Check Taints (Control-Plane-Only Cluster)

```bash
# The control plane has a NoSchedule taint by default
kubectl describe node | grep Taints

# To allow scheduling on the control plane (single-node cluster)
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```
## Issue: Certificate Errors

### Check Certificate Expiry

```bash
# All certificates
kubeadm certs check-expiration

# A specific certificate
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
```
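openssl can also answer "does this certificate expire within N days?" directly: `-checkend` takes a number of seconds and sets the exit code, which is convenient in a cron job. The wrapper below is a sketch with a hypothetical name:

```shell
# cert_expires_within_days: warn if the cert at $1 expires within $2 days.
# openssl x509 -checkend exits 0 when the cert is still valid at that time.
cert_expires_within_days() {
  if openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1; then
    echo "ok: valid for more than $2 days"
  else
    echo "WARNING: expires within $2 days"
  fi
}
```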
### Certificate SAN Issues

```bash
# Check API server certificate SANs
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"

# If a required SAN is missing, regenerate. Move the existing
# apiserver.crt/apiserver.key aside first - kubeadm skips certs that already exist:
kubeadm init phase certs apiserver --apiserver-cert-extra-sans=new.domain.com
```
### Renew Certificates

```bash
# Renew all certificates
kubeadm certs renew all

# Restart the control plane components so they pick up the new certs.
# stopp/rmp remove the pod sandboxes; kubelet recreates the static pods.
# (crictl stop takes container IDs, not pod IDs, so stopp/rmp is needed here.)
crictl pods --name 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd' -q \
  | xargs -r -I {} sh -c 'crictl stopp {} && crictl rmp {}'
```
### Clock Skew

```bash
# Compare time on all nodes
date

# Enable NTP
timedatectl set-ntp true
chronyc sources -v
```
## Issue: kubelet Crashlooping

### Before kubeadm init/join (Expected)

```bash
journalctl -u kubelet | grep "config.yaml"
# "failed to load kubelet config file, path: /var/lib/kubelet/config.yaml"
# This is NORMAL before kubeadm init or join:
# kubelet gets its config from kubeadm
```
### After kubeadm init/join

```bash
# Check the config exists
ls -la /var/lib/kubelet/config.yaml

# Check the kubeconfig exists
ls -la /etc/kubernetes/kubelet.conf

# Check containerd
systemctl status containerd

# Check the cgroup drivers match
grep cgroupDriver /var/lib/kubelet/config.yaml
# Should be: cgroupDriver: systemd
grep SystemdCgroup /etc/containerd/config.toml
# Should be: SystemdCgroup = true
```
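The two greps above can be folded into one check. In this sketch the file paths are parameters only so the function is testable; on a real node pass `/var/lib/kubelet/config.yaml` and `/etc/containerd/config.toml` (the helper name is illustrative):

```shell
# cgroup_drivers_match: compare kubelet's cgroupDriver ($1) against
# containerd's SystemdCgroup setting ($2). Both should be systemd.
cgroup_drivers_match() {
  kubelet_systemd=false
  containerd_systemd=false
  if grep -Eq '^[[:space:]]*cgroupDriver:[[:space:]]*systemd' "$1"; then
    kubelet_systemd=true
  fi
  if grep -Eq 'SystemdCgroup[[:space:]]*=[[:space:]]*true' "$2"; then
    containerd_systemd=true
  fi
  if [ "$kubelet_systemd" = "$containerd_systemd" ]; then
    echo "cgroup drivers match"
  else
    echo "MISMATCH: kubelet systemd=$kubelet_systemd, containerd SystemdCgroup=$containerd_systemd"
  fi
}
```

A `MISMATCH` here is the classic cause of a kubelet that starts, crashes, and restarts in a loop after an otherwise successful join.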
## Issue: etcd Problems

### Check etcd Health

```bash
# Using etcdctl with the etcd client certs
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# Check etcd logs
crictl logs $(crictl ps -q --name=etcd)
```
### etcd Data Corruption (Last Resort)

```bash
# WARNING: destroys all cluster state
kubeadm reset -f
rm -rf /var/lib/etcd
kubeadm init ...
```
## Full Reset Procedure

```bash
# On ALL nodes
kubeadm reset -f

# Clean directories
rm -rf /etc/kubernetes
rm -rf /var/lib/kubelet
rm -rf /var/lib/etcd
rm -rf ~/.kube

# Clean iptables
iptables -F
iptables -t nat -F
iptables -t mangle -F
iptables -X

# Clean CNI state
rm -rf /etc/cni/net.d
rm -rf /var/lib/cni

# Restart containerd
systemctl restart containerd

# Then re-init on the control plane
kubeadm init --config=kubeadm-init.yaml
```
## Log Collection Script

```bash
#!/bin/bash
LOGDIR="/tmp/k8s-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$LOGDIR"

echo "Collecting system info..."
hostnamectl > "$LOGDIR/hostnamectl.txt"
ip addr > "$LOGDIR/ip-addr.txt"
ip route > "$LOGDIR/ip-route.txt"
ss -tlnp > "$LOGDIR/ss-tlnp.txt"

echo "Collecting service status..."
systemctl status kubelet > "$LOGDIR/kubelet-status.txt" 2>&1
systemctl status containerd > "$LOGDIR/containerd-status.txt" 2>&1

echo "Collecting logs..."
journalctl -u kubelet --no-pager > "$LOGDIR/kubelet.log" 2>&1
journalctl -u containerd --no-pager > "$LOGDIR/containerd.log" 2>&1

echo "Collecting Kubernetes info..."
kubectl get nodes -o wide > "$LOGDIR/nodes.txt" 2>&1
kubectl get pods -A -o wide > "$LOGDIR/pods.txt" 2>&1
kubectl describe nodes > "$LOGDIR/nodes-describe.txt" 2>&1

echo "Collecting directory listings..."
ls -la /etc/kubernetes/ > "$LOGDIR/etc-kubernetes.txt" 2>&1
ls -la /var/lib/kubelet/ > "$LOGDIR/var-lib-kubelet.txt" 2>&1
ls -la /etc/cni/net.d/ > "$LOGDIR/cni-config.txt" 2>&1

echo "Collecting crictl info..."
crictl info > "$LOGDIR/crictl-info.txt" 2>&1
crictl ps -a > "$LOGDIR/crictl-ps.txt" 2>&1
crictl pods > "$LOGDIR/crictl-pods.txt" 2>&1

tar -czf "$LOGDIR.tar.gz" -C /tmp "$(basename "$LOGDIR")"
echo "Logs collected: $LOGDIR.tar.gz"
```
## Quick Reference: Component Locations

| Component | Config | Logs |
|---|---|---|
| kubelet | `/var/lib/kubelet/config.yaml` | `journalctl -u kubelet` |
| containerd | `/etc/containerd/config.toml` | `journalctl -u containerd` |
| API Server | `/etc/kubernetes/manifests/kube-apiserver.yaml` | `crictl logs <id>` |
| etcd | `/etc/kubernetes/manifests/etcd.yaml` | `crictl logs <id>` |
| Certificates | `/etc/kubernetes/pki/` | N/A |
| kubeconfig | `/etc/kubernetes/*.conf` | N/A |
| CNI | `/etc/cni/net.d/` | Varies by CNI |
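As a final sanity pass, the config paths in the table can be checked in one loop; any `MISSING` line on a node that supposedly joined points at an incomplete `kubeadm init`/`join` or a wiped directory (the helper name is illustrative):

```shell
# check_paths: report presence of the key kubeadm config locations.
check_paths() {
  for p in /var/lib/kubelet/config.yaml /etc/containerd/config.toml \
           /etc/kubernetes/manifests /etc/kubernetes/pki /etc/cni/net.d; do
    if [ -e "$p" ]; then echo "present: $p"; else echo "MISSING: $p"; fi
  done
}
```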