gke-ai-troubleshooting-skill-creation-guide

Installation

SKILL.md

Troubleshooting Skill Creation Guide

Use this guide to build high-quality troubleshooting skills that enable AI agents to diagnose complex failures in GKE workloads.

🏗️ Skill Structure Standard

Mandatory Components

SKILL.md: The core diagnostic and resolution workflow.
README.md: Public-facing overview and "When to use" guide.
references/failure_signatures.md: Authentic log/metric signatures.
scripts/validate_queries.sh: Automatic syntax validator for all queries.
TEST.md: Manual verification plan for humans.
EVAL.textproto: Evaluation suite for performance tracking.

Optional Components

BUILD: Build definition.

🏷️ Naming Conventions

Directory Name: MUST be kebab-case (e.g., gke-ai-troubleshooting-tpu-vbar-oom).
Skill Name: MUST match the directory name.

🔍 Diagnostic Workflow Standards

Step 0: Mandatory Context

Every skill MUST begin with a "Step 0" to acquire necessary context.

Mandatory Fields: <project_id>, <location>, <cluster_name>, <timestamp>.
Optional/Case-by-Case Fields: <node_name>, <workload_name>, <workload_namespace>, <nodepool_name>.
Time Rule: Reject relative time (e.g., "5 minutes ago"). Calculate a window of [T - 30m] to [T + 30m].

Diagnostic Steps

Explicit Queries: Every step MUST provide a ready-to-use Cloud Logging (LQL) or Cloud Monitoring (PromQL) query.
Placeholder Syntax: Use angle brackets like <project_id> instead of curly braces for placeholders to avoid template resolution errors.
Risk Categorization: Label every step as [Low Risk] (Read-only) or [High Risk] (Mutative/Destructive).
Automation: Specify if the agent should proceed automatically or wait for user confirmation.

🛠️ Accuracy & Validation

Zero Hallucination

Never synthesize example logs or metrics.
Source signatures from real incidents and anonymize where necessary.
DO NOT EXTRAPOLATE: Only include steps and queries that were verified in the source conversation.

Security & Privacy

No Raw Dumps: Do not instruct the agent to dump raw logs into shared spaces (bugs, chat).
Signal Only: Instruct the agent to summarize findings and report only high-signal information (e.g., "Found specific error pattern X on node Y").

Automated Validation

Every skill MUST include a script (at scripts/validate_queries.sh) that uses query_logs or gcloud logging read ... --limit=1 to verify its LQL queries.

📋 Best Practices

Conciseness: Keep instructions lean. Focus on "what to do" and "how to verify".
Public Ready: Remove all internal notes, personal bookmarks, or project-specific jargon.
Error Signatures: Explicitly link to references/failure_signatures.md in relevant diagnostic steps.

Related skills

More from googlecloudplatform/gke-mcp

Installs

2

Repository

googlecloudplat…/gke-mcp

GitHub Stars

148

First Seen

2 days ago

Security Audits

Gen Agent Trust HubPass