gcloud-usage
GCP Observability Best Practices
Structured Logging
JSON Log Format
Use structured JSON logging for better queryability:
{
"severity": "ERROR",
"message": "Payment failed",
"httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
"labels": { "user_id": "123", "transaction_id": "abc" },
"timestamp": "2025-01-15T10:30:00Z"
}
Severity Levels
Use appropriate severity for filtering:
- DEBUG: Detailed diagnostic info
- INFO: Normal operations, milestones
- NOTICE: Normal but significant events
- WARNING: Potential issues, degraded performance
- ERROR: Failures that don't stop the service
- CRITICAL: Failures requiring immediate action
- ALERT: Person must take action immediately
- EMERGENCY: System is unusable
Log Filtering Queries
Common Filters
# By severity
severity >= WARNING
# By resource
resource.type="cloud_run_revision"
resource.labels.service_name="my-service"
# By time
timestamp >= "2025-01-15T00:00:00Z"
# By text content
textPayload =~ "error.*timeout"
# By JSON field
jsonPayload.user_id = "123"
# Combined
severity >= ERROR AND resource.labels.service_name="api"
Advanced Queries
# Regex matching
textPayload =~ "status=[45][0-9]{2}"
# Substring search
textPayload : "connection refused"
# Multiple values
severity = (ERROR OR CRITICAL)
Metrics vs Logs vs Traces
When to Use Each
Metrics: Aggregated numeric data over time
- Request counts, latency percentiles
- Resource utilization (CPU, memory)
- Business KPIs (orders/minute)
Logs: Detailed event records
- Error details and stack traces
- Audit trails
- Debugging specific requests
Traces: Request flow across services
- Latency breakdown by service
- Identifying bottlenecks
- Distributed system debugging
Alert Policy Design
Alert Best Practices
- Avoid alert fatigue: Only alert on actionable issues
- Use multi-condition alerts: Reduce noise from transient spikes
- Set appropriate windows: 5-15 min for most metrics
- Include runbook links: Help responders act quickly
Common Alert Patterns
Error rate:
- Condition: Error rate > 1% for 5 minutes
- Good for: Service health monitoring
Latency:
- Condition: P99 latency > 2s for 10 minutes
- Good for: Performance degradation detection
Resource exhaustion:
- Condition: Memory > 90% for 5 minutes
- Good for: Capacity planning triggers
Cost Optimization
Reducing Log Costs
- Exclusion filters: Drop verbose logs at ingestion
- Sampling: Log only percentage of high-volume events
- Shorter retention: Reduce default 30-day retention
- Downgrade logs: Route to cheaper storage buckets
Exclusion Filter Examples
# Exclude health checks
resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"
# Exclude debug logs in production
severity = DEBUG
Debugging Workflow
- Start with metrics: Identify when issues started
- Correlate with logs: Filter logs around problem time
- Use traces: Follow specific requests across services
- Check resource logs: Look for infrastructure issues
- Compare baselines: Check against known-good periods
More from fcakyon/claude-codex-settings
paper-search-usage
This skill should be used when user asks to "search for papers", "find research papers", "search arXiv", "search PubMed", "find academic papers", "search IEEE", "search Scopus", or "look up scientific literature".
91setup
This skill should be used when the user asks "how to setup GitHub CLI", "configure gh", "gh auth not working", "GitHub CLI connection failed", "gh CLI error", or needs help with GitHub authentication.
53tavily-usage
This skill should be used when user asks to "search the web", "fetch content from URL", "extract page content", "use Tavily search", "scrape this website", "get information from this link", or "web search for X".
48azure-usage
This skill should be used when user asks to "query Azure resources", "list storage accounts", "manage Key Vault secrets", "work with Cosmos DB", "check AKS clusters", "use Azure MCP", or interact with any Azure service.
48pr-workflow
This skill should be used when user asks to "create a PR", "make a pull request", "open PR for this branch", "submit changes as PR", "push and create PR", or runs /create-pr or /pr-creator commands.
47mongodb-usage
This skill should be used when user asks to "query MongoDB", "show database collections", "get collection schema", "list MongoDB databases", "search records in MongoDB", or "check database indexes".
47