GKE Observability Skill

This skill provides workflows for ensuring your GKE cluster and workloads have adequate observability for production use.

Workflows

1. Audit Cluster Observability

Check if Cloud Logging and Cloud Monitoring are enabled on the cluster.

Command:

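gcloud container clusters describe <cluster-name> --region <region> --project <project-id> --format="json(loggingService, monitoringService, loggingConfig, monitoringConfig)"

Confirm that loggingService and monitoringService are set to something other than none (usually logging.googleapis.com/kubernetes and monitoring.googleapis.com/kubernetes). In loggingConfig and monitoringConfig, componentConfig.enableComponents lists which sources are collected, for example SYSTEM_COMPONENTS and WORKLOADS.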
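
The check above can be automated once the describe output is saved as JSON. The following is a minimal sketch, not an official tool: the function name audit_observability and the sample payload are illustrative assumptions, and the field names follow the cluster resource as printed by the describe command.

```python
import json


def audit_observability(cluster: dict) -> dict:
    """Summarize logging/monitoring status from `gcloud ... describe` JSON."""

    def components(section: str) -> list:
        # enableComponents lists the collected sources, e.g. SYSTEM_COMPONENTS.
        cfg = cluster.get(section) or {}
        return (cfg.get("componentConfig") or {}).get("enableComponents") or []

    def service_on(field: str) -> bool:
        # A missing field is treated as "not explicitly disabled".
        return cluster.get(field, "") != "none"

    logging_components = components("loggingConfig")
    monitoring_components = components("monitoringConfig")
    prometheus = (cluster.get("monitoringConfig") or {}).get("managedPrometheusConfig") or {}
    return {
        "logging_enabled": bool(logging_components) and service_on("loggingService"),
        "workload_logs": "WORKLOADS" in logging_components,
        "monitoring_enabled": bool(monitoring_components) and service_on("monitoringService"),
        "managed_prometheus": bool(prometheus.get("enabled")),
    }


# Hypothetical sample output, shaped like the describe command's JSON.
sample = json.loads("""
{
  "loggingService": "logging.googleapis.com/kubernetes",
  "monitoringService": "monitoring.googleapis.com/kubernetes",
  "loggingConfig": {"componentConfig": {"enableComponents": ["SYSTEM_COMPONENTS", "WORKLOADS"]}},
  "monitoringConfig": {
    "componentConfig": {"enableComponents": ["SYSTEM_COMPONENTS"]},
    "managedPrometheusConfig": {"enabled": true}
  }
}
""")
print(audit_observability(sample))
```

In practice the sample would be replaced by the parsed output of the describe command, for example loaded from a file with json.load.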

2. Enable Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is the recommended way to collect metrics from your applications.

Command to enable:

gcloud container clusters update <cluster-name> \
    --enable-managed-prometheus \
    --region <region>

Verify installation:

kubectl get pods -n gmp-system
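
Once managed collection is running, application metrics are scraped only for workloads matched by a PodMonitoring resource. The sketch below generates such a manifest as JSON, which kubectl accepts as well as YAML. The resource name, the app label, and the port name metrics are assumptions for illustration and must match your own workload.

```python
import json


def pod_monitoring(name: str, namespace: str, app_label: str,
                   port: str = "metrics", interval: str = "30s") -> dict:
    """Build a PodMonitoring manifest for Managed Service for Prometheus."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods carrying this label are scraped (label value is hypothetical).
            "selector": {"matchLabels": {"app": app_label}},
            # `port` is the named container port that serves /metrics.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


manifest = pod_monitoring("my-app-monitoring", "default", "my-app")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to kubectl apply -f - to create the resource.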

3. Workload Logging Verification

Ensure your workloads write logs to standard output (stdout) or standard error (stderr); Cloud Logging collects both streams automatically.

Check workload logs:

kubectl logs <pod-name> -n <namespace>

Ensure logs are in a structured format (like JSON) if possible, for easier querying.
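
As an illustration of structured logging, the sketch below emits one JSON object per line on stdout using only the Python standard library. Cloud Logging treats the severity and message fields of a JSON log line specially; the logger field and the names JsonFormatter and build_logger are assumptions made for this example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # Python level names (DEBUG, INFO, WARNING, ERROR, CRITICAL)
            # are valid Cloud Logging severities.
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "my-app") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = build_logger()
logger.error("connection refused")
```

Logs written this way are stored under jsonPayload rather than textPayload, so queries should target jsonPayload.message.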

4. Dashboards and Alerts

Recommend creating dashboards in Cloud Monitoring for key metrics:

CPU Utilization
Memory Utilization
Request Latency
Error Rate

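Set up alerting policies that fire when these metrics cross critical thresholds for a sustained period, for example a persistently high error rate or request latency.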
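
To make the idea of a threshold alert concrete, here is a conceptual sketch of a condition that fires only when a metric stays above a threshold for several consecutive samples, which avoids alerting on a single spike. This is not the Cloud Monitoring API; the function name, the sample values, and the 5% threshold are all illustrative assumptions.

```python
def breaches_for_duration(samples, threshold, min_consecutive):
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it otherwise.
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical per-minute error rates (fraction of failed requests).
error_rates = [0.01, 0.06, 0.07, 0.02, 0.08, 0.09, 0.10]
print(breaches_for_duration(error_rates, threshold=0.05, min_consecutive=3))
```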

5. Distributed Tracing

Enable distributed tracing to track requests across microservices.

Action: Recommend using OpenTelemetry in the application to send traces to Cloud Trace.
Benefit: Helps identify latency bottlenecks in distributed systems.
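
Tracing across microservices depends on every service forwarding the same trace ID. OpenTelemetry does this automatically, by default through the W3C Trace Context traceparent header. The standard-library sketch below only illustrates that header format; it is not the OpenTelemetry API, and the function names are assumptions.

```python
import re
import secrets

# version "00" - 32 hex trace ID - 16 hex span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent: str) -> str:
    """Header for an outgoing call: same trace ID, new span ID."""
    match = TRACEPARENT_RE.match(parent)
    if not match:
        # Malformed or missing header: start a new trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Because the trace ID is preserved on every hop, the spans from all services can be grouped into one trace.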

6. Continuous Profiling

Use continuous profiling to analyze application performance in production with low overhead.

Action: Recommend integrating the Cloud Profiler agent in your application code.
Benefit: Helps identify CPU and memory-consuming functions in production.
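
For a Python service, integrating the agent might look like the sketch below. It assumes the google-cloud-profiler package is installed and that the workload has permission to write profiles; the wrapper start_profiler and the service name are hypothetical. Failures are swallowed on purpose so that profiling can never block application startup.

```python
def start_profiler(service: str, version: str) -> bool:
    """Try to start the Cloud Profiler agent; return True on success."""
    try:
        import googlecloudprofiler  # provided by the google-cloud-profiler package
    except ImportError:
        return False
    try:
        # Starts a background thread that collects and uploads profiles.
        googlecloudprofiler.start(
            service=service,
            service_version=version,
            verbose=0,  # log errors only
        )
    except Exception as exc:  # broad by design: never fail startup
        print(f"Cloud Profiler not started: {exc}")
        return False
    return True


profiling_enabled = start_profiler("my-app", "1.0.0")
print(f"profiling enabled: {profiling_enabled}")
```

Call it as early as possible in the application's entry point so that startup work is profiled too.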

7. Querying Logs with LQL

Use Logging Query Language (LQL) in Cloud Logging to find specific logs.

Example LQL Queries:

Find error logs for a specific container:

resource.type="k8s_container"
resource.labels.container_name="my-app"
severity>=ERROR

Find logs with a specific message (for structured JSON logs, match on jsonPayload.message instead of textPayload):

resource.type="k8s_container"
textPayload:"connection refused"

8. Enable Control Plane Metrics

For Standard clusters, you can enable collection of metrics from the Kubernetes API server, scheduler, and controller manager.

Command:

gcloud container clusters update <cluster-name> \
    --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER \
    --region <region>

9. Enable Dataplane V2 Observability

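If the cluster uses GKE Dataplane V2, you can enable Dataplane V2 observability to collect network metrics and inspect traffic flows.

Command:

gcloud container clusters update <cluster-name> \
    --enable-dataplane-v2-metrics \
    --enable-dataplane-v2-flow-observability \
    --region <region>

The first flag exports Dataplane V2 network metrics; the second enables flow observability so you can inspect traffic flows between workloads.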

Best Practices

Structured Logging: Use JSON logging in your applications to make it easier to search and analyze logs in Cloud Logging.
Custom Metrics: Use Managed Service for Prometheus to expose and collect custom application metrics.
Full Pillars of Observability: Implement Tracing and Profiling in addition to Logs and Metrics for complete visibility.
Control Plane Metrics: Enable control plane metrics (if using Standard) to monitor the health of the API server and scheduler.
