# GKE Troubleshooting

## Purpose
Systematically diagnose and resolve common GKE issues. This skill provides structured debugging workflows, common causes, and proven solutions for the most frequent problems encountered in production deployments.
## When to Use
Use this skill when you need to:
- Debug pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff status
- Troubleshoot networking issues (DNS failures, service connectivity)
- Fix Cloud SQL connection problems or IAM authentication errors
- Resolve Pub/Sub message processing issues
- Investigate resource exhaustion or scheduling failures
- Debug health probe failures
- Diagnose application crashes or startup issues
Trigger phrases: "pod not starting", "CrashLoopBackOff", "debug GKE issue", "Cloud SQL connection failed", "Pub/Sub not working", "pod pending".
## Quick Start

Quick diagnostic flow for any pod issue:

```bash
# 1. Check pod status
kubectl get pods -n wtr-supplier-charges

# 2. View detailed pod information
kubectl describe pod <pod-name> -n wtr-supplier-charges

# 3. Check logs
kubectl logs <pod-name> -n wtr-supplier-charges

# 4. Check logs from the previous container if it crashed
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# 5. Check events for scheduling issues
kubectl get events -n wtr-supplier-charges --sort-by='.lastTimestamp'

# 6. Check resource availability
kubectl top nodes
kubectl top pods -n wtr-supplier-charges
```
## Instructions

### Step 1: Identify the Pod Status

Understand what each pod status means:

```bash
kubectl get pods -n wtr-supplier-charges -o wide
```
| Status | Meaning | Action |
|---|---|---|
| Running | Pod is executing | Check logs if issues |
| Pending | Waiting to be scheduled | Check events, node resources |
| CrashLoopBackOff | App crashes repeatedly | Check logs, configuration |
| ImagePullBackOff | Can't pull image | Verify image, permissions |
| Completed | Pod ran successfully and exited | Normal for batch jobs |
| Error | Pod exited with error | Check logs |
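For scripted triage, the table above can be sketched as a small shell helper. This is a hypothetical convenience (`suggest_action` is not part of any tool), purely to illustrate mapping a status string to the next diagnostic step:

```bash
#!/usr/bin/env bash
# Map a pod status (as printed by `kubectl get pods`) to the
# suggested next action from the table above.
suggest_action() {
  case "$1" in
    Running)          echo "Check logs if issues" ;;
    Pending)          echo "Check events, node resources" ;;
    CrashLoopBackOff) echo "Check logs, configuration" ;;
    ImagePullBackOff) echo "Verify image, permissions" ;;
    Completed)        echo "Normal for batch jobs" ;;
    Error)            echo "Check logs" ;;
    *)                echo "Unknown status: describe the pod" ;;
  esac
}

suggest_action CrashLoopBackOff
# → Check logs, configuration
```

A wrapper could feed it live data, e.g. `suggest_action "$(kubectl get pod <pod-name> -n wtr-supplier-charges -o jsonpath='{.status.phase}')"`.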
### Step 2: Investigate Based on Status

#### Pod Status: ImagePullBackOff
**Diagnose:**

```bash
# Get the detailed error
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for "Failed to pull image" in the Events section
# Example: "Failed to pull image ... access denied"

# Check whether the image exists in the registry
gcloud artifacts docker images list \
  europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images
```
**Solutions:**

- **Image doesn't exist:**

  ```bash
  # Verify the image tag is correct
  kubectl get deployment supplier-charges-hub -n wtr-supplier-charges \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
  ```

- **Missing Artifact Registry permissions:**

  ```bash
  # Grant the Artifact Registry Reader role
  gcloud artifacts repositories add-iam-policy-binding \
    wtr-supplier-charges-container-images \
    --location=europe-west2 \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"
  ```

- **Private image registry authentication:**

  ```bash
  # Create an image pull secret
  kubectl create secret docker-registry regcred \
    --docker-server=europe-west2-docker.pkg.dev \
    --docker-username=_json_key \
    --docker-password="$(cat key.json)" \
    -n wtr-supplier-charges
  ```

  Then reference the secret in the deployment:

  ```yaml
  spec:
    imagePullSecrets:
      - name: regcred
  ```
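For context, a minimal deployment fragment with the pull secret wired in might look like the sketch below; the image tag is illustrative, and the names mirror those used elsewhere in this skill:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: supplier-charges-hub
  namespace: wtr-supplier-charges
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred        # the secret created above
      containers:
        - name: supplier-charges-hub-container
          # tag is illustrative -- use your pinned release tag
          image: europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images/supplier-charges-hub:latest
```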
#### Pod Status: CrashLoopBackOff

**Diagnose:**

```bash
# Check current logs
kubectl logs <pod-name> -n wtr-supplier-charges

# Check logs from the previous container (after a crash)
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# Check the liveness probe configuration
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Liveness"
```
**Common Causes:**

- **Application exits immediately:**

  ```bash
  # Check startup logs for Java/Spring Boot errors
  kubectl logs <pod-name> -n wtr-supplier-charges | head -50
  # Look for: ClassNotFoundException, ConfigurationException, connection errors
  ```

- **Liveness probe fails too early:**

  ```bash
  # Increase initialDelaySeconds from 20 to 60
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
  ```

- **Out of memory:**

  ```bash
  # Check memory usage
  kubectl top pod <pod-name> -n wtr-supplier-charges

  # Increase memory limits
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
  ```

- **Missing environment variables:**

  ```bash
  # Check which env vars are set
  kubectl exec <pod-name> -n wtr-supplier-charges -- env | sort

  # Verify ConfigMap/Secret values
  kubectl get configmap supplier-charges-hub-config -n wtr-supplier-charges -o yaml
  kubectl get secret db-credentials -n wtr-supplier-charges -o yaml
  ```
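Rather than patching the live deployment, the same probe and memory fixes can be captured in the manifest. A sketch, assuming a Spring Boot actuator endpoint on port 8080; the values are starting points, not tuned numbers:

```yaml
containers:
  - name: supplier-charges-hub-container
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"          # raised limit from the patch above
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness   # assumes Spring Boot actuator
        port: 8080
      initialDelaySeconds: 60  # give the JVM time to start
      periodSeconds: 10
      failureThreshold: 3
```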
#### Pod Status: Pending (Unschedulable)

**Diagnose:**

```bash
# Check events for scheduling messages
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for: "Insufficient memory", "Insufficient cpu", "PersistentVolumeClaim"

# Check node capacity
kubectl top nodes
kubectl describe nodes
```
**Solutions:**

- **Insufficient cluster resources:**

  ```bash
  # Scale the deployment down
  kubectl scale deployment supplier-charges-hub --replicas=1 -n wtr-supplier-charges
  # Or rely on autoscaling (if available);
  # GKE Autopilot provisions capacity automatically
  ```

- **Node affinity/taints preventing scheduling:**

  ```bash
  # Check node taints
  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

  # View the pod's node affinity/tolerations
  kubectl get pod <pod-name> -n wtr-supplier-charges -o yaml | grep -A 10 -B 2 "affinity\|toleration"
  ```

  Add a toleration to the deployment if needed:

  ```yaml
  spec:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "compute"
        effect: "NoSchedule"
  ```

- **PersistentVolumeClaim not bound:**

  ```bash
  # Check PVC status
  kubectl get pvc -n wtr-supplier-charges

  # If Pending, check the storage class
  kubectl get storageclass
  ```
### Step 3: Network and Connectivity Issues

#### DNS Resolution Failures

**Diagnose:**

```bash
# Test DNS from the pod
kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup postgres

# Test TCP connectivity to the service (Postgres does not speak HTTP,
# so probe the port rather than curl it)
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv postgres 5432
```
**Solutions:**

- **CoreDNS pods not running:**

  ```bash
  # Check CoreDNS
  kubectl get pods -n kube-system -l k8s-app=kube-dns

  # Restart CoreDNS if needed
  kubectl rollout restart deployment coredns -n kube-system
  ```

- **Service doesn't exist or is in another namespace:**

  ```bash
  # Verify the service exists
  kubectl get svc postgres -n wtr-supplier-charges
  ```

  Use the fully qualified DNS name when calling across namespaces:
  `service-name.namespace.svc.cluster.local`
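In-cluster service DNS names follow that fixed pattern, which can be sketched as a trivial helper (hypothetical, just to make the pattern concrete):

```bash
#!/usr/bin/env bash
# Build the fully qualified in-cluster DNS name for a Service.
fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

fqdn postgres wtr-supplier-charges
# → postgres.wtr-supplier-charges.svc.cluster.local
```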
#### Service Not Accessible

**Diagnose:**

```bash
# Check service endpoints
kubectl get endpoints supplier-charges-hub -n wtr-supplier-charges

# If empty, no pods match the selector; compare the selector with pod labels
kubectl get svc supplier-charges-hub -n wtr-supplier-charges -o yaml | grep -A 3 selector
kubectl get pods -n wtr-supplier-charges --show-labels
```
**Solutions:**

- **Pod labels don't match the service selector:**

  ```bash
  # Add/update labels on the deployment's pod template
  kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
    -p '{"spec":{"template":{"metadata":{"labels":{"app":"supplier-charges-hub"}}}}}'
  ```

- **Pods not in Ready state:**

  ```bash
  # Check the readiness probe
  kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Readiness"

  # Check the health endpoint directly
  kubectl exec <pod-name> -n wtr-supplier-charges -- \
    curl localhost:8080/actuator/health/readiness
  ```
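A readiness probe for that endpoint can be declared like this; a sketch assuming the Spring Boot actuator endpoint on port 8080, with illustrative timings:

```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # assumes Spring Boot actuator
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
```

Until this probe passes, the pod is excluded from the service's endpoints, which is exactly the "empty endpoints" symptom diagnosed above.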
### Step 4: Database Connection Issues

**Diagnose:**

```bash
# Test connectivity to the Cloud SQL Proxy sidecar
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv localhost 5432

# Check Cloud SQL Proxy logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check application startup logs for DB connection errors
kubectl logs <pod-name> -c supplier-charges-hub-container -n wtr-supplier-charges | grep -i "database\|connection"
```
**Solutions:**

- **IAM authentication fails:**

  ```bash
  # Verify the Workload Identity binding
  kubectl get sa app-runtime -n wtr-supplier-charges -o yaml | grep iam.gke.io

  # Grant the cloudsql.client role
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

  # Check the IAM database user format: it is the service account email
  # with ".gserviceaccount.com" stripped, i.e. {name}@{project}.iam
  ```

- **Wrong connection string:**

  ```bash
  # Verify DB_CONNECTION_NAME uses the format project:region:instance
  kubectl get configmap db-config -n wtr-supplier-charges -o yaml
  # Expected value similar to: ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub
  ```

- **Cloud SQL Proxy not running:**

  ```bash
  # Check sidecar logs
  kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

  # Check sidecar resources
  kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 15 "cloud-sql-proxy"
  ```
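For reference, a Cloud SQL Auth Proxy sidecar declaration might look like the following sketch; the image tag and flags are illustrative (check the version your cluster pins), and the instance connection name is the one used elsewhere in this skill:

```yaml
# Sidecar container sketch for Cloud SQL Auth Proxy v2
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.0  # tag is illustrative
  args:
    - "--structured-logs"
    - "--port=5432"
    - "ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub"
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
```

The application then connects to `localhost:5432`, which is why the `nc -zv localhost 5432` check above is the first diagnostic step.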
### Step 5: Pub/Sub Issues

**Diagnose:**

```bash
# Check subscription state and backlog
gcloud pubsub subscriptions describe supplier-charges-incoming-sub \
  --project=ecp-wtr-supplier-charges-labs

# Check application Pub/Sub logs
kubectl logs <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "pubsub\|subscription"

# Test Pub/Sub connectivity from the pod (requires gcloud in the image)
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  gcloud pubsub topics list --project=ecp-wtr-supplier-charges-labs
```
**Solutions:**

- **Missing Pub/Sub permissions:**

  ```bash
  # Grant Pub/Sub roles
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/pubsub.subscriber"
  gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
    --role="roles/pubsub.publisher"
  ```

- **High subscription backlog (messages not being consumed):**

  ```bash
  # Check that the pod is running
  kubectl get pods -n wtr-supplier-charges

  # Check application logs for processing errors
  kubectl logs -f <pod-name> -c supplier-charges-hub-container \
    -n wtr-supplier-charges | grep -i "error\|exception"

  # Increase the message processing timeout in application.yaml:
  # spring.cloud.gcp.pubsub.subscriber.max-ack-extension-period: 600
  ```

- **Message processing failures:**
  - Check for poison messages causing repeated failures
  - Review the Dead Letter Queue (DLQ) if one is configured
  - Implement retry logic with exponential backoff (see the Spring Cloud GCP documentation for retry configuration)
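The subscriber tuning mentioned above lives in the Spring configuration. A sketch of the relevant `application.yaml` fragment, assuming Spring Cloud GCP property names; the values are starting points, not recommendations:

```yaml
# application.yaml -- Pub/Sub subscriber tuning (illustrative values)
spring:
  cloud:
    gcp:
      project-id: ecp-wtr-supplier-charges-labs
      pubsub:
        subscriber:
          max-ack-extension-period: 600  # seconds; lets slow handlers finish before redelivery
          executor-threads: 4            # parallelism for message processing
```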
## Examples

See `examples/examples.md` for comprehensive examples including:

- Complete troubleshooting workflow
- Database connectivity debugging
- Pub/Sub debugging
## Requirements

- `kubectl` access to the cluster
- `gcloud` CLI configured
- Permissions to view pod logs and describe resources
- For database debugging: access to view Cloud SQL configuration
- For Pub/Sub debugging: access to view subscription details
## See Also
- gcp-gke-deployment-strategies - Understand deployment health checks
- gcp-gke-monitoring-observability - Monitor applications
- gcp-gke-workload-identity - Debug IAM/Workload Identity issues