production-troubleshooting
Production Troubleshooting
Overview
Diagnose performance issues and errors in production/test environments using systematic investigation workflows with Sentry, kubectl, and Helm configuration analysis.
Prerequisites
Verify access to Sentry and Kubernetes tooling before troubleshooting.
- Prefer k8s-tool when available for environment-aware commands.
- Fall back to raw kubectl commands when k8s-tool is not installed or not configured in the current environment.
- Confirm namespace and target environment (test or prod) before running commands.
When to Use This Skill
Apply this skill when:
- Investigating incidents in test/production (not localhost)
- Troubleshooting slow endpoints, slow queries, or elevated latency
- Debugging pod crashes, restart loops, OOMKilled, or potential throttling
- Analyzing Sentry traces for failures or degraded transactions
- Validating Kubernetes resource limits and related Helm values
Investigation Workflow
Follow this symptom-driven workflow and confirm evidence before making changes.
Step 1: Triage by Primary Symptom
Choose the first investigation path based on the reported symptom.
- For pod crash/restart symptoms (CrashLoopBackOff, OOMKilled, frequent restarts): check pod status and logs first.
- For latency/slow endpoint symptoms: inspect traces first, then correlate with logs and pod state.
Step 2A: Inspect Pod Status and Logs (Crash/Restart Path)
Check pod health state before trace analysis when the incident is pod-centric.
Using k8s-tool (preferred):
k8s-tool describe --resource pod --name <pod-name> --env <env>
k8s-tool logs --pod <pod-name> --env <env> --tail 200
Fallback using kubectl:
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail 200
Look for restart reasons, termination messages, probe failures, and repeated startup errors.
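Once a container has restarted, its current logs may no longer contain the crash itself. The sketch below shows one way to read the previous instance's logs and its recorded termination reason with raw kubectl. The function names are hypothetical, and no equivalent k8s-tool subcommand is assumed.

```shell
# Hypothetical helpers; pass <pod-name> as $1 and <namespace> as $2.

# Print the reason each container last terminated (for example OOMKilled or Error).
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Show logs from the previous container instance, i.e. the one that exited.
previous_logs() {
  kubectl logs "$1" -n "$2" --previous --tail 200
}
```

An empty reason column usually means the container has not terminated since the pod was scheduled.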
Step 2B: Inspect Sentry Traces (Latency/Error Path)
Use Sentry to identify slow database calls, external latency, and transaction-level failures.
Using Sentry MCP:
- Search for traces related to the reported issue
- Look for slow database queries (for this project, >500ms is a useful baseline heuristic, not a universal threshold)
- Check external API call latency
- Identify error patterns and stack traces
What to look for:
- Database query times exceeding expected baseline (commonly ~500ms in this project)
- External API calls with high latency
- Repeated error patterns
- Performance degradation trends
Step 3: Review Application Logs
Examine kubectl logs for timing information and error patterns.
Using k8s-tool:
k8s-tool logs --pod <pod-name> --env <env> --tail 200
Key log patterns to search for:
- [Server] - Server startup and initialization timing
- [SSR] - Server-side rendering timing
- [tRPC] - tRPC query execution timing
- [DB Pool] - Database connection pool status
- ERROR or WARN - Application errors and warnings
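The prefixes and levels listed above can be pulled out of a log stream with a single grep. This is a minimal sketch: the function name is hypothetical, and it assumes the prefixes appear literally in each log line as written above.

```shell
# Hypothetical helper: keep only lines carrying the prefixes and levels listed above.
# Reads log text on stdin, so it works with k8s-tool or kubectl output alike.
filter_key_log_lines() {
  grep -E '\[(Server|SSR|tRPC|DB Pool)\]|ERROR|WARN'
}

# Example usage (placeholders as in the commands above):
# kubectl logs <pod-name> -n <namespace> --tail 200 | filter_key_log_lines
```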
Common issues:
- Sequential API calls instead of parallel (Promise.all)
- Long DB connection acquisition times
- Slow SSR rendering
Step 4: Check Pod Resource Usage
Verify CPU and memory usage to detect throttling.
Using k8s-tool:
k8s-tool top --env <env>
Warning signs:
- CPU usage >70% may indicate potential throttling
- Memory usage >80% may indicate elevated OOM risk
- Consistent high utilization suggests under-provisioning
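kubectl top reports absolute values (for example millicores and Mi) rather than percentages, so the warning levels above only become checkable once usage is compared with something. The sketch below assumes the percentages are meant relative to the configured limit, which the text does not state, and that the used and limit values have already been extracted in the same unit. Both function names are hypothetical.

```shell
# Hypothetical helper: percentage of a limit consumed. Both values must use the
# same unit (millicores for CPU, Mi for memory). Integer division, rounds down.
usage_percent() {
  used=$1
  limit=$2
  echo $(( used * 100 / limit ))
}

# Compare usage against the warning levels above (CPU >70%, memory >80%).
# $1 = resource kind (cpu|memory), $2 = used, $3 = limit
check_usage() {
  pct=$(usage_percent "$2" "$3")
  case "$1" in
    cpu)    threshold=70 ;;
    memory) threshold=80 ;;
    *)      echo "unknown resource: $1" >&2; return 2 ;;
  esac
  if [ "$pct" -gt "$threshold" ]; then
    echo "$1 at ${pct}% of limit: above ${threshold}% warning level"
  else
    echo "$1 at ${pct}% of limit: ok"
  fi
}
```

For example, check_usage cpu 380 500 reports 76% and flags it, while check_usage memory 400 512 reports 78% as ok.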
Step 5: Review Pod Configuration
Check resource limits and Helm values to identify misconfigurations.
Using kubectl:
kubectl get pod <pod-name> -n <namespace> -o yaml
Key sections to check:
- resources.limits.cpu and resources.limits.memory
- resources.requests.cpu and resources.requests.memory
- Environment variables configuration
- Image version and tags
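Reading the full YAML works, but a JSONPath query can narrow the output to the fields listed above. A minimal sketch with raw kubectl follows; the function name is hypothetical.

```shell
# Hypothetical helper: print name, image, requests, and limits for each
# container in a pod. Pass <pod-name> as $1 and <namespace> as $2.
pod_resources() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\t"}{.resources.requests}{"\t"}{.resources.limits}{"\n"}{end}'
}
```

Empty requests or limits columns indicate that the container has none set, which is itself worth noting during an investigation.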
Helm values locations:
- web-app: /kubernetes/helm/web-app/values.{test,prod}.yaml
See references/helm-values-locations.md for the detailed Helm configuration structure.
Step 6: Confirm Evidence Before Changing Configuration
Confirm that proposed fixes map to observed evidence before editing Helm values or code.
- Link each change to concrete evidence from traces, logs, pod events, or resource metrics.
- Prefer the smallest reversible change first.
- Re-check traces/logs after deployment to verify impact.
Common Causes & Solutions
CPU/Memory Throttling
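- Symptom: Sustained high CPU usage with degraded response times (CPU is throttled at its limit), or memory usage near its limit followed by OOMKilled restarts (memory is not throttled; the container is killed)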
- Confirm with evidence: Correlate resource metrics with throttling signals, restart events, and latency spikes
- Solution: Adjust resource requests/limits in Helm values only after confirmation
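One direct throttling signal is the container's cgroup cpu.stat file, which counts CFS periods (nr_periods) and the periods in which the container was throttled (nr_throttled). The sketch below parses that file from stdin. The function name is hypothetical, and the path in the usage comment assumes cgroup v2 and an image that ships cat. On cgroup v1 the file usually lives under /sys/fs/cgroup/cpu/ instead.

```shell
# Hypothetical helper: read cpu.stat text on stdin and print the percentage of
# CFS periods in which the container was throttled (rounded down).
throttled_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", throttled * 100 / periods
      else print "n/a"
    }'
}

# Example usage (assumes cgroup v2 and that cat exists in the image):
# kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat | throttled_percent
```

The counters are cumulative since the container started, so comparing two readings taken a few minutes apart gives a better picture of current throttling than a single value.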
Network Latency
- Symptom: Slow external API calls, DNS resolution delays
- Confirm with evidence: Validate slow spans and timed log entries for network-bound operations
- Solution: Check network policies, verify DNS configuration, and tune retry behavior where appropriate
Database Connection Pool Issues
- Symptom: [DB Pool] errors, slow connection acquisition
- Confirm with evidence: Match pool warnings with trace timing and connection wait patterns
- Solution: Review idleTimeoutMillis and pool size configuration
Sequential API Calls
- Symptom: Multiple API calls taking cumulative time
- Confirm with evidence: Verify sequential span ordering in traces or timestamped log sequence
- Solution: Refactor independent calls to run in parallel with Promise.all()
Resources
kubectl commands
Use these common operations with k8s-tool when available, or run equivalent raw kubectl commands as fallback:
- k8s-tool logs --pod <pod> --env <env> --tail 200 - Extract and filter pod logs
- k8s-tool top --env <env> - Show CPU/memory usage for pods
- k8s-tool describe --resource pod --name <pod> --env <env> - Check resource limits and pod configuration
- k8s-tool kubectl --env <env> --cmd "get pods" - Raw kubectl for anything else
references/
- helm-values-locations.md - Detailed guide to Helm values file structure and locations
- common-issues.md - Catalog of common production issues and solutions