gke-ai-troubleshooting-skill-creation-guide
Troubleshooting Skill Creation Guide
Use this guide to build high-quality troubleshooting skills that enable AI agents to diagnose complex failures in GKE workloads.
🏗️ Skill Structure Standard
Mandatory Components
- `SKILL.md`: The core diagnostic and resolution workflow.
- `README.md`: Public-facing overview and "When to use" guide.
- `references/failure_signatures.md`: Authentic log/metric signatures.
- `scripts/validate_queries.sh`: Automatic syntax validator for all queries.
- `TEST.md`: Manual verification plan for humans.
- `EVAL.textproto`: Evaluation suite for performance tracking.
Optional Components
- `BUILD`: Build definition.
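A structure check along these lines could catch a missing component before review. This is a minimal sketch: the helper name `missing_components` is hypothetical, and it assumes the mandatory paths are laid out relative to the skill directory exactly as listed above.

```python
from pathlib import Path

# Mandatory files named in this guide, relative to the skill directory.
MANDATORY_COMPONENTS = [
    "SKILL.md",
    "README.md",
    "references/failure_signatures.md",
    "scripts/validate_queries.sh",
    "TEST.md",
    "EVAL.textproto",
]


def missing_components(skill_dir):
    """Return the mandatory files that are absent from skill_dir (hypothetical helper)."""
    root = Path(skill_dir)
    return [rel for rel in MANDATORY_COMPONENTS if not (root / rel).is_file()]
```

An empty list means every mandatory component is present; `BUILD` is optional and deliberately not checked.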
🏷️ Naming Conventions
- Directory Name: MUST be `kebab-case` (e.g., `gke-ai-troubleshooting-tpu-vbar-oom`).
- Skill Name: MUST match the directory name.
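The two naming rules can be checked mechanically. The sketch below assumes kebab-case means lowercase alphanumeric words joined by single hyphens; the guide does not define it further, and the function names are hypothetical.

```python
import re

# Assumed definition of kebab-case: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name):
    """True when the whole name matches the assumed kebab-case pattern."""
    return KEBAB_CASE.fullmatch(name) is not None


def names_are_consistent(directory_name, skill_name):
    """True when the directory is kebab-case and the skill name matches it exactly."""
    return is_kebab_case(directory_name) and skill_name == directory_name
```

For example, `is_kebab_case("gke-ai-troubleshooting-tpu-vbar-oom")` passes, while names containing uppercase letters, underscores, or doubled hyphens are rejected.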
🔍 Diagnostic Workflow Standards
Step 0: Mandatory Context
Every skill MUST begin with a "Step 0" to acquire necessary context.
- Mandatory Fields: `<project_id>`, `<location>`, `<cluster_name>`, `<timestamp>`.
- Optional/Case-by-Case Fields: `<node_name>`, `<workload_name>`, `<workload_namespace>`, `<nodepool_name>`.
- Time Rule: Reject relative time (e.g., "5 minutes ago"). Require an absolute `<timestamp>` T and calculate a window of `[T - 30m]` to `[T + 30m]`.
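The time rule can be sketched as follows. This is one possible reading, not a prescribed implementation: it assumes the timestamp arrives as an ISO 8601 string with an explicit UTC offset (or a trailing `Z`), and the name `incident_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=30)


def incident_window(timestamp):
    """Return (T - 30m, T + 30m) in UTC for an absolute ISO 8601 timestamp.

    Raises ValueError for relative expressions such as "5 minutes ago".
    Requiring an explicit UTC offset is an assumption of this sketch.
    """
    text = timestamp.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        t = datetime.fromisoformat(text)
    except ValueError:
        raise ValueError(
            f"expected an absolute ISO 8601 timestamp, got {timestamp!r}"
        ) from None
    if t.tzinfo is None:
        raise ValueError("timestamp must include a UTC offset")
    t = t.astimezone(timezone.utc)
    return t - WINDOW, t + WINDOW
```

Calling `incident_window("2024-05-01T12:00:00Z")` yields the bounds 11:30 and 12:30 UTC on that day, which can then be substituted into the time filter of each query.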
Diagnostic Steps
- Explicit Queries: Every step MUST provide a ready-to-use Cloud Logging (LQL) or Cloud Monitoring (PromQL) query.
- Placeholder Syntax: Use angle brackets like `<project_id>` instead of curly braces for placeholders to avoid template resolution errors.
- Risk Categorization: Label every step as [Low Risk] (Read-only) or [High Risk] (Mutating/Destructive).
- Automation: Specify if the agent should proceed automatically or wait for user confirmation.
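To illustrate the placeholder convention, here is a sketch of resolving angle-bracket placeholders in a query template. The LQL filter is illustrative only and was not taken from a verified skill; `<start_time>` and `<end_time>` are hypothetical placeholders standing for the Step 0 window, and `render_query` is a hypothetical helper.

```python
import re

# Matches placeholders such as <project_id>; comparison operators like <= do not match.
PLACEHOLDER = re.compile(r"<([a-z_]+)>")

# Illustrative LQL filter only; a real skill must use queries verified against real incidents.
QUERY_TEMPLATE = (
    'resource.type="k8s_cluster" '
    'AND resource.labels.project_id="<project_id>" '
    'AND resource.labels.location="<location>" '
    'AND resource.labels.cluster_name="<cluster_name>" '
    'AND timestamp>="<start_time>" AND timestamp<="<end_time>"'
)


def render_query(template, context):
    """Replace every <name> placeholder with context[name]; fail loudly if one is missing."""
    def substitute(match):
        key = match.group(1)
        if key not in context:
            raise KeyError(f"missing value for placeholder <{key}>")
        return context[key]

    return PLACEHOLDER.sub(substitute, template)
```

Failing on an unresolved placeholder, rather than leaving it in the query, keeps a half-filled filter from being sent to Cloud Logging.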
🛠️ Accuracy & Validation
Zero Hallucination
- Never synthesize example logs or metrics.
- Source signatures from real incidents and anonymize where necessary.
- DO NOT EXTRAPOLATE: Only include steps and queries that were verified in the source conversation.
Security & Privacy
- No Raw Dumps: Do not instruct the agent to dump raw logs into shared spaces (bugs, chat).
- Signal Only: Instruct the agent to summarize findings and report only high-signal information (e.g., "Found specific error pattern X on node Y").
Automated Validation
- Every skill MUST include a script (at `scripts/validate_queries.sh`) that uses `query_logs` or `gcloud logging read ... --limit=1` to verify its LQL queries.
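The required script is a shell file, but the check it performs can be sketched in Python for illustration. The sketch assumes `gcloud` is installed and authenticated, and that it exits with a non-zero status when Cloud Logging rejects a filter; both function names are hypothetical.

```python
import subprocess


def build_validation_command(lql_filter, project_id):
    """Build the argv for a one-entry `gcloud logging read` against the given filter."""
    return [
        "gcloud", "logging", "read", lql_filter,
        f"--project={project_id}",
        "--limit=1",
        "--format=json",
    ]


def query_is_valid(lql_filter, project_id):
    """Run the command and treat exit code 0 as acceptance of the filter.

    Assumption: gcloud returns a non-zero exit code for a malformed filter.
    """
    result = subprocess.run(
        build_validation_command(lql_filter, project_id),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Passing the filter as a single argv element avoids shell quoting problems with the double quotes that LQL filters usually contain. A query that returns no entries still counts as valid here, since only the syntax is being checked.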
📋 Best Practices
- Conciseness: Keep instructions lean. Focus on "what to do" and "how to verify".
- Public Ready: Remove all internal notes, personal bookmarks, or project-specific jargon.
- Error Signatures: Explicitly link to `references/failure_signatures.md` in relevant diagnostic steps.