GKE Reliability Skill

This skill provides workflows for configuring your GKE cluster and workloads for high availability and reliability.

Workflows

1. Verify Cluster High Availability

Check whether the cluster's control plane is regional and whether its nodes span multiple zones.

Command:

gcloud container clusters describe <cluster-name> --location <region-or-zone> --format="json(location,locations)"

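If location is a region (e.g., us-central1), the control plane is regional. If it is a zone (e.g., us-central1-a), the cluster is zonal and has a single control plane replica. If locations has multiple entries, nodes are spread across multiple zones.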

2. Configure Pod Disruption Budgets (PDB)

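PDBs limit how many pods of an application can be evicted at the same time during voluntary disruptions (like node upgrades or drains), so a minimum number of pods stays available. They do not protect against involuntary disruptions such as a node crash.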

Check existing PDBs:

kubectl get pdb -n <namespace>

Example Manifest:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2 # needs at least 3 replicas; with only 2, every voluntary eviction is blocked
  selector:
    matchLabels:
      app: my-app
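
When the replica count changes over time (for example, under autoscaling), expressing the budget as maxUnavailable may be easier to keep valid than a fixed minAvailable. The following is a minimal sketch that reuses the app: my-app label from the manifest above. A PDB accepts only one of minAvailable and maxUnavailable, so this is an alternative to that manifest rather than an addition to it.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Allow at most one matching pod to be evicted at a time,
  # regardless of how many replicas are currently running.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```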

3. Configure Health Probes

Ensure all production containers define liveness and readiness probes, and add a startup probe for applications that are slow to start.

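Readiness Probe: Determines when a container is ready to accept traffic. A pod that fails it is removed from Service endpoints but is not restarted.
Liveness Probe: Detects a container that is running but no longer working, so the kubelet can restart it.
Startup Probe: Holds off liveness and readiness checks until the app has finished starting, which keeps slow-starting containers from being restarted prematurely.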

Check workload probes:

kubectl get deployment <app-name> -n <namespace> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
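
If the check shows that probes are missing, a container spec along the following lines defines all three. This is a sketch under assumptions: the image name, port 8080, and the /healthz and /ready paths are hypothetical and need to match what the application actually serves, and the timing values are illustrative starting points only.

```yaml
# Pod template excerpt; image, port, and paths are hypothetical.
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 30 # tolerates up to 150s (30 x 5s) of startup time
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```

Keeping the readiness endpoint separate from the liveness endpoint lets the app stop receiving traffic (for example, while a dependency is down) without being restarted.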

4. Graceful Shutdown

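Ensure applications handle SIGTERM gracefully: stop accepting new requests, finish in-flight work, and exit. Set terminationGracePeriodSeconds (default is 30s) long enough for that to complete, because the container is killed with SIGKILL once the grace period expires.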
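
A pod spec excerpt along these lines extends the grace period and adds a short preStop delay so that load balancers and endpoints have time to stop routing to the pod before the process receives SIGTERM. This is a sketch under assumptions: the 60s and 10s values and the image name are hypothetical, and the exec hook assumes the container image ships a sleep binary.

```yaml
# Pod template excerpt; values and image are hypothetical.
spec:
  terminationGracePeriodSeconds: 60 # default is 30
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; assumes `sleep` exists in the image.
            command: ["sleep", "10"]
```

The time spent in the preStop hook counts against terminationGracePeriodSeconds, so the grace period needs to cover the hook plus the application's own shutdown time.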

5. Topology Spread Constraints

Ensure pods are spread across zones or nodes to avoid correlated failures.

Example Manifest excerpt:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
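
To spread across nodes as well as zones, a second constraint can be added with the kubernetes.io/hostname topology key. The sketch below makes zone spreading a hard requirement and node spreading best-effort. That split is one possible choice, not a recommendation from the source.

```yaml
# Pod template excerpt: hard zone spreading, best-effort node spreading.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
```

With DoNotSchedule, pods stay Pending when the constraint cannot be met (for example, when one zone has no capacity), whereas ScheduleAnyway only influences scoring.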

6. Maintenance Windows and Exclusions

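Configure maintenance windows to control when GKE can perform automated upgrades, so they avoid peak hours. Use maintenance exclusions to block upgrades entirely for a fixed period, such as a product launch.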

Command to set maintenance window:

gcloud container clusters update <cluster-name> \
    --region <region> \
    --maintenance-window-start <start-time> \
    --maintenance-window-end <end-time> \
    --maintenance-window-recurrence "FREQ=DAILY"

Best Practices

Regional Clusters: Always use regional clusters for production workloads to survive zone failures.
Probes for All Containers: Every container in a production pod should have at least a readiness probe.
PDBs for Critical Apps: Use PDBs so that automated node upgrades never evict too many replicas of a critical app at once.
Zone Spreading: Always use topologySpreadConstraints to ensure pods are distributed across zones, even in regional clusters.
Schedule Maintenance: Set maintenance windows to ensure upgrades happen during low-traffic periods.
