# GKE Troubleshooting

## Purpose
Systematically diagnose and resolve common GKE issues. This skill provides structured debugging workflows, common causes, and proven solutions for the most frequent problems encountered in production deployments.
## When to Use
Use this skill when you need to:
- Debug pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff status
- Troubleshoot networking issues (DNS failures, service connectivity)
- Fix Cloud SQL connection problems or IAM authentication errors
- Resolve Pub/Sub message processing issues
- Investigate resource exhaustion or scheduling failures
- Debug health probe failures
- Diagnose application crashes or startup issues
Trigger phrases: "pod not starting", "CrashLoopBackOff", "debug GKE issue", "Cloud SQL connection failed", "Pub/Sub not working", "pod pending".
## Quick Start

Quick diagnostic flow for any pod issue:

```bash
# 1. Check pod status
kubectl get pods -n wtr-supplier-charges

# 2. View detailed pod information
kubectl describe pod <pod-name> -n wtr-supplier-charges

# 3. Check logs
kubectl logs <pod-name> -n wtr-supplier-charges

# 4. Check logs from the previous container if it crashed
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# 5. Check events for scheduling issues
kubectl get events -n wtr-supplier-charges --sort-by='.lastTimestamp'

# 6. Check resource availability
kubectl top nodes
kubectl top pods -n wtr-supplier-charges
```
## Instructions

### Step 1: Identify the Pod Status

Understand what each pod status means:

```bash
kubectl get pods -n wtr-supplier-charges -o wide
```
| Status | Meaning | Action |
|---|---|---|
| Running | Pod is executing | Check logs if issues |
| Pending | Waiting to be scheduled | Check events, node resources |
| CrashLoopBackOff | App crashes repeatedly | Check logs, configuration |
| ImagePullBackOff | Can't pull image | Verify image, permissions |
| Completed | Pod ran successfully and exited | Normal for batch jobs |
| Error | Pod exited with error | Check logs |
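For scripted triage, the table above can be sketched as a small shell helper. This is a hypothetical convenience (`suggest_action` is not part of any tool), purely to illustrate mapping a status string to the next diagnostic step:

```bash
#!/usr/bin/env bash
# Map a pod status (as printed by `kubectl get pods`) to the
# suggested next action from the table above.
suggest_action() {
  case "$1" in
    Running)          echo "Check logs if issues" ;;
    Pending)          echo "Check events, node resources" ;;
    CrashLoopBackOff) echo "Check logs, configuration" ;;
    ImagePullBackOff) echo "Verify image, permissions" ;;
    Completed)        echo "Normal for batch jobs" ;;
    Error)            echo "Check logs" ;;
    *)                echo "Unknown status: describe the pod" ;;
  esac
}

suggest_action CrashLoopBackOff
# → Check logs, configuration
```

A wrapper could feed it live data, e.g. `suggest_action "$(kubectl get pod <pod-name> -n wtr-supplier-charges -o jsonpath='{.status.phase}')"`.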
### Step 2: Investigate Based on Status

#### Pod Status: ImagePullBackOff
**Diagnose:**

```bash
# Get the detailed error
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for "Failed to pull image" in the Events section
# Example: "Failed to pull image ... access denied"

# Check whether the image exists in the registry
gcloud artifacts docker images list \
  europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images
```
**Solutions:**

- **Image doesn't exist:**

  ```bash
  # Verify the image tag is correct
  kubectl get deployment supplier-charges-hub -n wtr-supplier-charges \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
  ```

- **Missing Artifact Registry permissions:**

  ```bash
  # Grant the Artifact Registry Reader role
  gcloud artifacts repositories add-iam-policy-binding \
    wtr-supplier-charges-container-images \
    --location=europe-west2 \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"
  ```

- **Private image registry authentication:**

  ```bash
  # Create an image pull secret
  kubectl create secret docker-registry regcred \
    --docker-server=europe-west2-docker.pkg.dev \
    --docker-username=_json_key \
    --docker-password="$(cat key.json)" \
    -n wtr-supplier-charges
  ```

  Then reference the secret in the deployment:

  ```yaml
  spec:
    imagePullSecrets:
      - name: regcred
  ```
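For context, a minimal deployment fragment with the pull secret wired in might look like the sketch below; the image tag is illustrative, and the names mirror those used elsewhere in this skill:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: supplier-charges-hub
  namespace: wtr-supplier-charges
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred        # the secret created above
      containers:
        - name: supplier-charges-hub-container
          # tag is illustrative -- use your pinned release tag
          image: europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images/supplier-charges-hub:latest
```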
#### Pod Status: CrashLoopBackOff

**Diagnose:**

```bash
# Check current logs
kubectl logs <pod-name> -n wtr-supplier-charges

# Check logs from the previous container (after a crash)
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# Check the liveness probe configuration
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Liveness"
```
**Common Causes:**

- **Application exits immediately:**

  ```bash
  # Check startup logs for Java/Spring Boot errors
  kubectl logs <pod-name> -n wtr-supplier-charges | head -50
  # Look for: ClassNotFoundException, ConfigurationException, connection errors
  ```

- **Liveness probe fails too early:**

  ```bash
  # Increase initialDelaySeconds from 20 to 60
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
  ```

- **Out of memory:**

  ```bash
  # Check memory usage
  kubectl top pod <pod-name> -n wtr-supplier-charges

  # Increase memory limits
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
  ```

- **Missing environment variables:**

  ```bash
  # Check which env vars are set
  kubectl exec <pod-name> -n wtr-supplier-charges -- env | sort

  # Verify ConfigMap/Secret values
  kubectl get configmap supplier-charges-hub-config -n wtr-supplier-charges -o yaml
  kubectl get secret db-credentials -n wtr-supplier-charges -o yaml
  ```
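Rather than patching the live deployment, the same probe and memory fixes can be captured in the manifest. A sketch, assuming a Spring Boot actuator endpoint on port 8080; the values are starting points, not tuned numbers:

```yaml
containers:
  - name: supplier-charges-hub-container
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"          # raised limit from the patch above
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness   # assumes Spring Boot actuator
        port: 8080
      initialDelaySeconds: 60  # give the JVM time to start
      periodSeconds: 10
      failureThreshold: 3
```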
#### Pod Status: Pending (Unschedulable)

**Diagnose:**

```bash
# Check events for scheduling messages
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for: "Insufficient memory", "Insufficient cpu", "PersistentVolumeClaim"

# Check node capacity
kubectl top nodes
kubectl describe nodes
```
**Solutions:**

- **Insufficient cluster resources:**

  ```bash
  # Scale the deployment down
  kubectl scale deployment supplier-charges-hub --replicas=1 -n wtr-supplier-charges
  # Or rely on autoscaling (if available);
  # GKE Autopilot provisions capacity automatically
  ```

- **Node affinity/taints preventing scheduling:**

  ```bash
  # Check node taints
  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

  # View the pod's node affinity/tolerations
  kubectl get pod <pod-name> -n wtr-supplier-charges -o yaml | grep -A 10 -B 2 "affinity\|toleration"
  ```

  Add a toleration to the deployment if needed:

  ```yaml
  spec:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "compute"
        effect: "NoSchedule"
  ```

- **PersistentVolumeClaim not bound:**

  ```bash
  # Check PVC status
  kubectl get pvc -n wtr-supplier-charges

  # If Pending, check the storage class
  kubectl get storageclass
  ```
### Step 3: Network and Connectivity Issues

#### DNS Resolution Failures

**Diagnose:**

```bash
# Test DNS from the pod
kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup postgres

# Test TCP connectivity to the service (Postgres does not speak HTTP,
# so probe the port rather than curl it)
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv postgres 5432
```
**Solutions:**

- **CoreDNS pods not running:**

  ```bash
  # Check CoreDNS
  kubectl get pods -n kube-system -l k8s-app=kube-dns

  # Restart CoreDNS if needed
  kubectl rollout restart deployment coredns -n kube-system
  ```

- **Service doesn't exist or is in another namespace:**

  ```bash
  # Verify the service exists
  kubectl get svc postgres -n wtr-supplier-charges
  ```

  Use the fully qualified DNS name when calling across namespaces:
  `service-name.namespace.svc.cluster.local`
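In-cluster service DNS names follow that fixed pattern, which can be sketched as a trivial helper (hypothetical, just to make the pattern concrete):

```bash
#!/usr/bin/env bash
# Build the fully qualified in-cluster DNS name for a Service.
fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

fqdn postgres wtr-supplier-charges
# → postgres.wtr-supplier-charges.svc.cluster.local
```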
#### Service Not Accessible

**Diagnose:**

```bash
# Check service endpoints
kubectl get endpoints supplier-charges-hub -n wtr-supplier-charges

# If empty, no pods match the selector; compare the selector with pod labels
kubectl get svc supplier-charges-hub -n wtr-supplier-charges -o yaml | grep -A 3 selector
kubectl get pods -n wtr-supplier-charges --show-labels
```
**Solutions:**

- **Pod labels don't match the service selector:**

  ```bash
  # Add/update labels on the deployment's pod template
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"metadata":{"labels":{"app":"supplier-charges-hub"}}}}}'
  ```

- **Pods not in Ready state:**

  ```bash
  # Check the readiness probe
  kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Readiness"

  # Check the health endpoint directly
  kubectl exec <pod-name> -n wtr-supplier-charges -- \
    curl localhost:8080/actuator/health/readiness
  ```
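A readiness probe for that endpoint can be declared like this; a sketch assuming the Spring Boot actuator endpoint on port 8080, with illustrative timings:

```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # assumes Spring Boot actuator
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
```

Until this probe passes, the pod is excluded from the service's endpoints, which is exactly the "empty endpoints" symptom diagnosed above.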
### Step 4: Database Connection Issues

**Diagnose:**

```bash
# Test connectivity to the Cloud SQL Proxy sidecar
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv localhost 5432

# Check Cloud SQL Proxy logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check application startup logs for DB connection errors
kubectl logs <pod-name> -c supplier-charges-hub-container -n wtr-supplier-charges | grep -i "database\|connection"
```
**Solutions:**

- **IAM authentication fails:**

  ```bash
  # Verify the Workload Identity binding
  kubectl get sa app-runtime -n wtr-supplier-charges -o yaml | grep iam.gke.io

  # Grant the cloudsql.client role
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

  # Check the IAM database user format: it is the service account email
  # with ".gserviceaccount.com" stripped, i.e. {name}@{project}.iam
  ```

- **Wrong connection string:**

  ```bash
  # Verify DB_CONNECTION_NAME uses the format project:region:instance
  kubectl get configmap db-config -n wtr-supplier-charges -o yaml
  # Expected value similar to: ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub
  ```

- **Cloud SQL Proxy not running:**

  ```bash
  # Check sidecar logs
  kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

  # Check sidecar resources
  kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 15 "cloud-sql-proxy"
  ```
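For reference, a Cloud SQL Auth Proxy sidecar declaration might look like the following sketch; the image tag and flags are illustrative (check the version your cluster pins), and the instance connection name is the one used elsewhere in this skill:

```yaml
# Sidecar container sketch for Cloud SQL Auth Proxy v2
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.0  # tag is illustrative
  args:
    - "--structured-logs"
    - "--port=5432"
    - "ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub"
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
```

The application then connects to `localhost:5432`, which is why the `nc -zv localhost 5432` check above is the first diagnostic step.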
### Step 5: Pub/Sub Issues

**Diagnose:**

```bash
# Check subscription state and backlog
gcloud pubsub subscriptions describe supplier-charges-incoming-sub \
  --project=ecp-wtr-supplier-charges-labs

# Check application Pub/Sub logs
kubectl logs <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "pubsub\|subscription"

# Test Pub/Sub connectivity from the pod (requires gcloud in the image)
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  gcloud pubsub topics list --project=ecp-wtr-supplier-charges-labs
```
**Solutions:**

- **Missing Pub/Sub permissions:**

  ```bash
  # Grant Pub/Sub roles
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/pubsub.subscriber"
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/pubsub.publisher"
  ```

- **High subscription backlog (messages not being consumed):**

  ```bash
  # Check that the pod is running
  kubectl get pods -n wtr-supplier-charges

  # Check application logs for processing errors
  kubectl logs -f <pod-name> -c supplier-charges-hub-container \
    -n wtr-supplier-charges | grep -i "error\|exception"

  # Increase the message processing timeout in application.yaml:
  # spring.cloud.gcp.pubsub.subscriber.max-ack-extension-period: 600
  ```

- **Message processing failures:**
  - Check for poison messages causing repeated failures
  - Review the Dead Letter Queue (DLQ) if one is configured
  - Implement retry logic with exponential backoff (see the Spring Cloud GCP documentation for retry configuration)
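The subscriber tuning mentioned above lives in the Spring configuration. A sketch of the relevant `application.yaml` fragment, assuming Spring Cloud GCP property names; the values are starting points, not recommendations:

```yaml
# application.yaml -- Pub/Sub subscriber tuning (illustrative values)
spring:
  cloud:
    gcp:
      project-id: ecp-wtr-supplier-charges-labs
      pubsub:
        subscriber:
          max-ack-extension-period: 600  # seconds; lets slow handlers finish before redelivery
          executor-threads: 4            # parallelism for message processing
```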
## Examples

See `examples/examples.md` for comprehensive examples including:

- Complete troubleshooting workflow
- Database connectivity debugging
- Pub/Sub debugging
## Requirements

- `kubectl` access to the cluster
- `gcloud` CLI configured
- Permissions to view pod logs and describe resources
- For database debugging: access to view Cloud SQL configuration
- For Pub/Sub debugging: access to view subscription details
## See Also
- gcp-gke-deployment-strategies - Understand deployment health checks
- gcp-gke-monitoring-observability - Monitor applications
- gcp-gke-workload-identity - Debug IAM/Workload Identity issues