gke-reliability
GKE Reliability Skill
This skill provides workflows for configuring your GKE cluster and workloads for high availability and reliability.
Workflows
1. Verify Cluster High Availability
Check if the cluster is regional or has multi-zonal node pools.
Command:
gcloud container clusters describe <cluster-name> --region <region> --format="json(location, locations)"
If location is a region (e.g., us-central1), the control plane is regional.
If locations has multiple entries, nodes are spread across multiple zones.
2. Configure Pod Disruption Budgets (PDB)
PDBs ensure that a minimum number of pods are available during voluntary disruptions (like node upgrades).
Check existing PDBs:
kubectl get pdb -n <namespace>
Example Manifest:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-app-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: my-app
3. Configure Health Probes
Ensure all production containers have Liveness, Readiness, and optionally Startup probes.
- Readiness Probe: Determines when a container is ready to start accepting traffic.
- Liveness Probe: Determines when to restart a container.
- Startup Probe: Disables liveness and readiness checks until the app has started up.
Check workload probes:
kubectl get deployment <app-name> -n <namespace> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
4. Graceful Shutdown
Ensure applications handle SIGTERM signals gracefully and have an appropriate terminationGracePeriodSeconds set (default is 30s).
5. Topology Spread Constraints
Ensure pods are spread across zones or nodes to avoid correlated failures.
Example Manifest excerpt:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
labelSelector:
matchLabels:
app: my-app
6. Maintenance Windows and Exclusions
Configure when GKE can perform automated upgrades to avoid peak hours.
Command to set maintenance window:
gcloud container clusters update <cluster-name> \
--region <region> \
--maintenance-window-start <start-time> \
--maintenance-window-recurrence "FREQ=DAILY"
Best Practices
- Regional Clusters: Always use regional clusters for production workloads to survive zone failures.
- Probes for All Containers: Every container in a production pod should have at least a readiness probe.
- PDBs for Critical Apps: Use PDBs to prevent downtime during automated node upgrades.
- Zone Spreading: Always use
topologySpreadConstraintsto ensure pods are distributed across zones, even in regional clusters. - Schedule Maintenance: Set maintenance windows to ensure upgrades happen during low-traffic periods.
More from googlecloudplatform/gke-mcp
gke-backup-dr
Workflows for configuring Backup for GKE and disaster recovery.
2gke-storage
Guidance on managing storage in Google Kubernetes Engine (GKE) clusters.
2gke-app-onboarding
Workflows for containerizing and deploying applications to GKE for the first time.
2gke-workload-security
Workflows for auditing and hardening the security of GKE workloads.
2gke-cost-optimization
Guidance on optimizing costs for Google Kubernetes Engine (GKE) clusters.
2gke-networking-edge
Workflows for configuring edge networking, ingress, and security on GKE.
2