Google Kubernetes Engine (GKE)

Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.

When to Use

  • Running containerized microservices at scale with automatic scaling and healing
  • Workloads requiring fine-grained orchestration, service mesh, or custom scheduling
  • Teams already invested in Kubernetes tooling (Helm, Argo CD, Flux)
  • When Cloud Run's request-based model does not fit (long-running, stateful workloads)

Prerequisites

  • Google Cloud SDK (gcloud) and kubectl installed
  • APIs enabled: Kubernetes Engine, Compute Engine
  • IAM role roles/container.admin for cluster management
gcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl

Standard vs Autopilot

| Feature         | Standard                     | Autopilot                    |
|-----------------|------------------------------|------------------------------|
| Node management | You manage node pools        | Google manages nodes         |
| Pricing         | Pay per node (VM)            | Pay per pod resource request |
| GPU/TPU         | Full support                 | Supported (with limits)      |
| DaemonSets      | Allowed                      | Restricted                   |
| Best for        | Full control, specialized HW | Hands-off, cost-optimized    |

Create a Standard Cluster

gcloud container clusters create prod-cluster \
  --region=us-central1 --num-nodes=2 \
  --machine-type=e2-standard-4 --disk-size=100 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5 \
  --enable-autorepair --enable-autoupgrade \
  --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-ip-alias --enable-network-policy \
  --enable-shielded-nodes \
  --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
  --labels=env=production,team=platform

gcloud container clusters get-credentials prod-cluster --region=us-central1
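
A quick sanity check that kubectl now points at the new cluster (standard kubectl, nothing GKE-specific):

kubectl cluster-info
kubectl get nodes -o wide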

Create an Autopilot Cluster

gcloud container clusters create-auto autopilot-prod \
  --region=us-central1 --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --network=my-vpc --subnetwork=gke-subnet
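
As with the Standard cluster, fetch credentials before running kubectl against it:

gcloud container clusters get-credentials autopilot-prod --region=us-central1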

Node Pools

# High-memory pool with taint
gcloud container node-pools create highmem-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
  --num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-labels=workload=memory-intensive \
  --node-taints=dedicated=highmem:NoSchedule

# GPU pool
gcloud container node-pools create gpu-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule

# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=e2-standard-4 --spot \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule
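
Pods land on these tainted pools only if they tolerate the taint and, typically, select the matching node label. A minimal pod-spec fragment for the highmem pool above; the same pattern applies to the GPU and Spot pools with their respective taint keys:

# Pod template fragment (illustrative): target the highmem pool created above
spec:
  nodeSelector:
    workload: memory-intensive
  tolerations:
  - key: dedicated
    operator: Equal
    value: highmem
    effect: NoSchedule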

Workload Identity

# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
  iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com
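
One way to confirm the binding took effect is to run a throwaway pod under the KSA and list the active credential, which should report the GSA email (pod name and image here are illustrative):

kubectl run wi-test --namespace=myapp --rm -it --restart=Never \
  --image=google/cloud-sdk:slim \
  --overrides='{"spec": {"serviceAccountName": "app-ksa"}}' \
  -- gcloud auth list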

Deploying Workloads

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa
      containers:
      - name: web
        image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
        ports: [{ containerPort: 8080 }]
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits: { cpu: 500m, memory: 1Gi }
        readinessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
  selector: { app: web-app }
  ports: [{ port: 80, targetPort: 8080 }]
  type: ClusterIP
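
Apply the manifest and watch the rollout:

kubectl apply -f deployment.yaml
kubectl rollout status deployment/web-app --namespace=myapp
kubectl get pods --namespace=myapp -l app=web-app -o wide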

Ingress with Managed SSL

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "web-cert"
    kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
  domains: [app.example.com]

# Reserve the global static IP referenced by the ingress annotation
gcloud compute addresses create web-static-ip --global
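
Managed certificate provisioning can take a while (often tens of minutes) after DNS for app.example.com resolves to the reserved IP; progress can be checked with:

kubectl describe managedcertificate web-cert --namespace=myapp
kubectl get ingress web-ingress --namespace=myapp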

Terraform Configuration

resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  release_channel { channel = "REGULAR" }
  workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke.name

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes   = true
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  network_policy { enabled = true }
  logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus { enabled = true }
  }

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  initial_node_count = 2
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-balanced"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    metadata = { disable-legacy-endpoints = "true" }
  }
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
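
The configuration above references a project_id variable and a custom-mode VPC that are assumed to be defined elsewhere; a minimal sketch of those pieces:

variable "project_id" {
  type = string
}

resource "google_compute_network" "vpc" {
  name                    = "my-vpc"
  auto_create_subnetworks = false
}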

Common Operations

gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers
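
The autoscale command above can also be expressed declaratively as an autoscaling/v2 HorizontalPodAutoscaler kept in version control; a minimal equivalent:

# hpa.yaml: declarative form of the kubectl autoscale command above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70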

Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Pods stuck in Pending | No nodes with enough resources | Check autoscaler; add larger node pool; verify resource requests |
| ImagePullBackOff | Wrong image path or missing AR access | Verify image URL; grant roles/artifactregistry.reader to node SA |
| Workload Identity uses wrong account | KSA annotation missing | Re-annotate KSA; restart pods to pick up new token |
| Nodes NotReady | Disk/memory pressure or network issue | Run kubectl describe node; check taints and conditions |
| Ingress returns 502 | Backend pods failing health check | Verify readiness probe; check NEG health in Console |
| Cluster create quota error | Insufficient regional CPU/IP quota | Request quota increase in IAM & Admin > Quotas |
| Network policy not working | Not enabled on cluster | Recreate with --enable-network-policy or use Dataplane V2 |
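
Most of these symptoms can be narrowed down with a few standard commands before opening the Console:

kubectl describe pod <pod-name> --namespace=myapp             # events, scheduling failures, probe results
kubectl get events --namespace=myapp --sort-by=.lastTimestamp
kubectl describe node <node-name>                             # conditions, taints, allocatable resources
gcloud container operations list                              # recent cluster operations and their status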

Related Skills

  • gcp-networking - VPC, firewall rules, and load balancers for GKE clusters
  • terraform-gcp - Provision GKE clusters with Infrastructure as Code
  • gcp-compute - When workloads are better suited for VMs than containers
  • gcp-cloud-sql - Connecting GKE pods to Cloud SQL via sidecar proxy