google-cloud-recipe-networking-observability
Google Cloud Networking Observability Expert
🛑 Core Directive: Results First
- Identify the Primary Source: Quickly determine if the user needs firewall logs, threat logs, Cloud NAT, VPC Flow logs, or metrics.
- Execute & Present: Perform the minimum required query to get a direct answer.
- Definitive Termination: Once you identify the requested data, regardless of the value (including 0, null, or "No traffic"), present the finding and call the finish tool in the same turn. Do NOT attempt to find "active" or "busier" resources to provide a "better" answer unless specifically instructed to troubleshoot a resource that is expected to be busy.
Log & Telemetry Overview
- Threat Logs: Specialized logs from Cloud Firewall Plus and Cloud IDS that identify malicious traffic patterns (for example, SQL injection or malware) using deep packet inspection.
- VPC Flow Logs: Capture a sample of IP traffic to and from network interfaces. Use for traffic analysis, volume trends, and top talkers.
- Firewall Logs: Record connection attempts matched by firewall rules. Use to identify "DENY" events or verify "ALLOW" rules.
- Cloud NAT Logs: Audit NAT translations. Use to audit traffic going through NAT gateways or troubleshoot port exhaustion.
- Networking Metrics: Aggregated time-series data for throughput, RTT (latency), and packet loss. Use for historical trends and performance monitoring.
- Connectivity Tests: Static analysis tool for path diagnostics. Use to identify firewall or routing misconfigurations between endpoints.
Procedures
0. Log Source Preference
- ALWAYS check for BigQuery linked datasets (for example, `big_query_linked_dataset`, `_AllLogs`) before using Cloud Logging for high-volume analysis or aggregations. This is the preferred method for finding trends or top-blocking rules.
- Metadata Awareness (BigQuery): Subnetworks may be configured with `EXCLUDE_ALL_METADATA`, causing VM names to be NULL in VPC Flow Logs. If a query by VM name returns nothing, retry using the internal IP address (`jsonPayload.connection.src_ip`).
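The VM-name-then-IP retry described above could be sketched as follows. This is a minimal illustration, not a definitive query: the dataset name `my_project.network_logs`, the `log_id` value, the `json_payload` column name, and the `src_instance.vm_name` path are assumptions to confirm against the actual schema, and `flow_query`, `by_vm_name`, and `by_src_ip` are hypothetical helper names. Each helper wraps exactly one direct `bq` call.

```shell
# Hypothetical placeholder; replace with the real linked dataset.
DATASET="my_project.network_logs"

# Build a 12-hour flow-count query filtered on one JSON field.
# $1 = path under json_payload (assumed column name), $2 = value to match.
flow_query() {
  cat <<SQL
SELECT COUNT(*) AS flows
FROM \`${DATASET}._AllLogs\`
WHERE log_id = 'compute.googleapis.com/vpc_flows'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR)
  AND JSON_VALUE(json_payload.$1) = '$2'
SQL
}

# First attempt: filter by VM name (NULL when metadata is excluded).
by_vm_name() { bq query --use_legacy_sql=false "$(flow_query src_instance.vm_name "$1")"; }

# Fallback: filter by the internal source IP instead.
by_src_ip() { bq query --use_legacy_sql=false "$(flow_query connection.src_ip "$1")"; }
```

For example, run `by_vm_name web-1` first; if it matches no rows and the subnet is suspected to exclude metadata, retry once with `by_src_ip 10.0.0.5` (both values are made up for illustration).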
1. Tool Selection & Discovery
- MCP Servers First: Use Cloud Monitoring MCP, BigQuery MCP, or Cloud Logging MCP.
- Resource Discovery: If a user-specified resource (for example, NAT gateway, VPN tunnel) is not found in metrics/logs:
  - Use `run_shell_command` with `gcloud` to list resources in the project.
  - Search Cloud Logging MCP for the resource name to find correct labels.
- CLI Fallback: Use `gcloud` or `bq` only if MCP servers are unavailable. DO NOT use `gcloud monitoring`; it is restricted. Instead, immediately use the curl templates in metrics-analysis.md.
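As a rough sketch of the resource discovery step, the listing commands might look like this. The project ID, region, and router name are hypothetical placeholders, and the helper names are only for illustration; the `gcloud compute` subcommands themselves are standard.

```shell
PROJECT_ID="my-project"   # hypothetical
REGION="us-central1"      # hypothetical

# Cloud NAT gateways are configured on Cloud Routers, so list routers first.
list_routers() {
  gcloud compute routers list --project="$PROJECT_ID" \
    --format="table(name,region.basename())"
}

# Then list the NAT gateways on one router. $1 = router name.
list_nats() {
  gcloud compute routers nats list --router="$1" --region="$REGION" \
    --project="$PROJECT_ID" --format="value(name)"
}

# VPN tunnels can be listed directly.
list_vpn_tunnels() {
  gcloud compute vpn-tunnels list --project="$PROJECT_ID" \
    --format="table(name,region.basename())"
}
```

A call such as `list_nats edge-router` returns the exact gateway names, which can then be used as label values when searching logs or metrics.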
2. Schema Verification & Error Recovery
If a BigQuery query fails with an 'Unrecognized name' error or schema mismatch:
1. Validate Schema: Run `bq show --schema --format=json {project_id}:{dataset_id}.{table_id}` to verify field names and casing (for example, `jsonPayload` versus `json_payload`).
2. Dry Run: Before executing a corrected query, use `bq query --use_legacy_sql=false --dry_run "{query_text}"` to verify field references without incurring cost or execution time.
3. Retry: Apply identified fixes to the original query and execute.
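The three recovery steps can be chained so that the corrected query runs only if the dry run passes. The sketch below uses the same `bq` commands named in the steps; the helper names (`show_schema`, `dry_run`, `run_checked`) are hypothetical.

```shell
# Step 1: print the table schema as JSON.
# $1 = project_id, $2 = dataset_id, $3 = table_id
show_schema() { bq show --schema --format=json "$1:$2.$3"; }

# Step 2: validate field references without executing the query.
# $1 = corrected query text
dry_run() { bq query --use_legacy_sql=false --dry_run "$1"; }

# Step 3: execute only when the dry run succeeds.
run_checked() { dry_run "$1" && bq query --use_legacy_sql=false "$1"; }
```

Gating execution on the dry run keeps a second 'Unrecognized name' failure from costing an extra query.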
3. Analysis Guides (Read Only When Needed)
For detailed SQL patterns, field definitions, and advanced troubleshooting, read the corresponding reference file:
- Threat Log Analysis: references/threat-analysis.md
- VPC Flow Analysis: references/vpc-flow-analysis.md
- Cloud NAT Analysis: references/cloud-nat-analysis.md
- Firewall Rule Analysis: references/firewall-analysis.md
- Networking Metrics: references/metrics-analysis.md
- Connectivity Test Analysis: references/connectivity-tests.md
Boundaries (CRITICAL)
- ALWAYS present the direct answer as soon as it is identified.
- NEVER run more than 2 exploratory queries before showing results.
- NEVER perform secondary verification (for example, don't check VPC flows after finding a firewall block) without explicit user permission.
- ALWAYS print the generated SQL for review before execution.
- ALWAYS include a link to the Flow Analyzer in the Google Cloud Console.
- NEVER query a second data source (for example, BigQuery logs) if the primary source (for example, Cloud Monitoring metrics) has already provided a conclusive answer. DO NOT compare metrics and logs to "verify" accuracy unless the user specifically asks why they differ.
- NO DISCREPANCY LOOPS: If Tool A provides a result (for example, 80,000 counts) and Tool B provides a different result (for example, 1,000 counts), DO NOT initiate a deep dive to explain the difference. Present the result from the primary tool and STOP.
- ALWAYS perform time-range calculations (for example, "12 hours ago") during the first turn to save steps.
- Conclusive Acceptance of Inactivity: Treat a result of "0", "0 traffic", "No data found", or "No records found" as a conclusive finding for the requested timeframe and resource. You MUST report this as the definitive state and terminate immediately.
- Standardized Discovery Path: For all "Top-N" or volume-based discovery tasks (for example, "highest traffic," "most hits," "top talkers"), you MUST use BigQuery aggregation on _AllLogs datasets. Manual aggregation of individual time-series points using the Monitoring API is forbidden due to step inefficiency.
- Ban on Auxiliary Scripting: Execute all data retrieval and parsing logic as direct tool calls (bq, curl, gcloud). Do NOT write or execute local shell scripts (.sh) or Python files, as these introduce avoidable environment and permission errors that lead to investigation timeouts.
- Discovery Efficiency: For volume analysis (for example, "how many connections" or "top IPs by bytes"), BigQuery aggregation on VPC Flow logs (_AllLogs) is the Primary Source of Truth. If BigQuery data is available, it is conclusive. Do NOT query Monitoring API to "double check" BigQuery counts.
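Several of the boundaries above (computing the time range up front, printing the SQL before execution, and using BigQuery aggregation for Top-N tasks) can be combined in one pass. The following is a sketch under stated assumptions: the dataset name, the `log_id` value, and the `json_payload` field paths are placeholders to verify against the real schema, the helper names are hypothetical, and the `date -d` syntax is GNU-specific.

```shell
# Compute the window once, up front (GNU date; BSD/macOS date differs).
START="$(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

DATASET="my_project.network_logs"   # hypothetical linked dataset

# Build a top-talkers query. $1 = number of rows (defaults to 10).
top_talkers_query() {
  cat <<SQL
SELECT JSON_VALUE(json_payload.connection.src_ip) AS src_ip,
       SUM(CAST(JSON_VALUE(json_payload.bytes_sent) AS INT64)) AS bytes
FROM \`${DATASET}._AllLogs\`
WHERE log_id = 'compute.googleapis.com/vpc_flows'
  AND timestamp BETWEEN TIMESTAMP('${START}') AND TIMESTAMP('${END}')
GROUP BY src_ip
ORDER BY bytes DESC
LIMIT ${1:-10}
SQL
}

# Print the generated SQL for review, then execute it with one bq call.
top_talkers() {
  q="$(top_talkers_query "$1")"
  printf '%s\n' "$q"
  bq query --use_legacy_sql=false "$q"
}
```

Calling `top_talkers 5` prints the query and returns the five source IPs with the most bytes sent in the window; whatever it returns, including an empty result, is the answer to report.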