# Production Troubleshooting
## Overview
Diagnose performance issues and errors in production/test environments using systematic investigation workflows with Sentry, kubectl, and Helm configuration analysis.
## When to Use This Skill
Use this skill when:
- User reports performance issues on test/production (not localhost)
- Need to investigate slow queries or high latency
- Debugging pod crashes or resource throttling
- Analyzing Sentry traces for errors
- Checking Kubernetes resource limits and configurations
## Investigation Workflow
Follow these steps in order when troubleshooting production issues:
### Step 1: Check Sentry Traces
Start with Sentry to identify slow queries and external API latency patterns.
Using Sentry MCP:
- Search for traces related to the reported issue
- Look for slow database queries (>500ms)
- Check external API call latency
- Identify error patterns and stack traces
What to look for:
- Database query times exceeding 500ms
- External API calls with high latency
- Repeated error patterns
- Performance degradation trends
### Step 2: Review Application Logs
Examine kubectl logs for timing information and error patterns.
Using agent-tools-k8s:
```shell
agent-tools-k8s logs --pod <pod-name> --env <env> --tail 200
```
Key log patterns to search for:
- `[Server]` - Server startup and initialization timing
- `[SSR]` - Server-side rendering timing
- `[tRPC]` - tRPC query execution timing
- `[DB Pool]` - Database connection pool status
- `ERROR` or `WARN` - Application errors and warnings
Common issues:
- Sequential API calls instead of parallel (`Promise.all`)
- Long DB connection acquisition times
- Slow SSR rendering
### Step 3: Check Pod Resource Usage
Verify CPU and memory usage to detect throttling.
Using agent-tools-k8s:
```shell
agent-tools-k8s top --env <env>
```
Warning signs:
- CPU usage >70% indicates potential throttling
- Memory usage >80% indicates potential OOM issues
- Consistent high utilization suggests under-provisioning
### Step 4: Review Pod Configuration
Check resource limits and Helm values to identify misconfigurations.
Using kubectl:
```shell
kubectl get pod <pod-name> -n <namespace> -o yaml
```
Key sections to check:
- `resources.limits.cpu` and `resources.limits.memory`
- `resources.requests.cpu` and `resources.requests.memory`
- Environment variables configuration
- Image version and tags
Helm values locations:
- web-app: `/kubernetes/helm/web-app/values.{test,prod}.yaml`
See `references/helm-values-locations.md` for the detailed Helm configuration structure.
## Common Causes & Solutions
### CPU/Memory Throttling
- Symptom: High CPU/memory usage (>70-80%)
- Solution: Increase resource limits in Helm values
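For example, a limits bump in a values file might look like the following sketch. The keys follow standard Kubernetes resource syntax; the numbers are placeholders to tune against observed usage, not recommendations:

```yaml
# Illustrative values only — size against actual usage from `top`.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```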
### Network Latency
- Symptom: Slow external API calls, DNS resolution delays
- Solution: Check network policies, verify DNS configuration, consider retry logic
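Where retry logic is appropriate, a minimal sketch might look like this. `withRetry` is a hypothetical helper, not an existing utility in the codebase:

```typescript
// Retry-with-exponential-backoff sketch (hypothetical helper).
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off exponentially: baseDelayMs, then 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Only retry operations that are safe to repeat (idempotent reads, not payment submissions), and cap attempts so a hard outage fails fast.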
### Database Connection Pool Issues
- Symptom: `[DB Pool]` errors, slow connection acquisition
- Solution: Review `idleTimeoutMillis` and pool size configuration
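Assuming the pool is node-postgres (`pg.Pool`), the relevant options look like the sketch below. The option names match pg's documented config; the values are illustrative starting points to tune against observed `[DB Pool]` log timings:

```typescript
// Hypothetical pool options for node-postgres (pg.Pool); values are
// illustrative, not taken from this codebase.
const poolConfig = {
  max: 20,                        // upper bound on open connections
  idleTimeoutMillis: 30_000,      // close clients idle for 30 s
  connectionTimeoutMillis: 5_000, // fail fast instead of queueing forever
};

// Wiring it up would look like: new Pool(poolConfig)
```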
### Sequential API Calls
- Symptom: Multiple API calls taking cumulative time
- Solution: Refactor to use `Promise.all()` for parallel execution
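The refactor can be sketched as follows — `fetchUser`, `fetchOrders`, and `fetchPrefs` are hypothetical stand-ins for the app's real calls, each resolving after ~100 ms to simulate network latency:

```typescript
// Simulated independent data fetches (hypothetical stand-ins).
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchUser() { await delay(100); return { name: "alice" }; }
async function fetchOrders() { await delay(100); return [1, 2]; }
async function fetchPrefs() { await delay(100); return { theme: "dark" }; }

// Before: sequential awaits — total latency is the sum (~300 ms).
async function loadSequential() {
  const user = await fetchUser();
  const orders = await fetchOrders();
  const prefs = await fetchPrefs();
  return { user, orders, prefs };
}

// After: parallel execution — total latency is the slowest call (~100 ms).
async function loadParallel() {
  const [user, orders, prefs] = await Promise.all([
    fetchUser(),
    fetchOrders(),
    fetchPrefs(),
  ]);
  return { user, orders, prefs };
}
```

This only applies when the calls are independent; if one call needs the result of another, they must stay sequential.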
## Resources
### kubectl commands
Common kubectl operations (use via agent-tools-k8s):
- `agent-tools-k8s logs --pod <pod> --env <env> --tail 200` - Extract and filter pod logs
- `agent-tools-k8s top --env <env>` - Show CPU/memory usage for pods
- `agent-tools-k8s describe --resource pod --name <pod> --env <env>` - Check resource limits and pod configuration
- `agent-tools-k8s kubectl --env <env> --cmd "get pods"` - Raw kubectl for anything else
### references/
- `helm-values-locations.md` - Detailed guide to Helm values file structure and locations
- `common-issues.md` - Catalog of common production issues and solutions