python-web-troubleshooter
Python Web Service Troubleshooter
Structured diagnosis for Python web service (Gunicorn / uWSGI) performance and capacity issues.
Core principle: workers are buffers, replicas are scale-out, and latency is what actually determines capacity.
Step 1: Break Down Latency First
Before changing any config, collect these metrics to build a complete picture of where time is being spent:
| Layer | Key Metrics |
|---|---|
| Gunicorn | idle worker count vs working worker count |
| Nginx | upstream_response_time, upstream_connect_time (check via APM or access log dashboard) |
| Application | total request duration, queue wait time as % of total |
| Database | DB CPU, query latency, overall DB load, IO wait |
# Gunicorn idle vs busy workers — ps only lists processes, not request state.
# Use one of these approaches instead:
# Option A: If using a Prometheus exporter (e.g. prometheus-flask-exporter
# or a custom /metrics endpoint), query worker state directly:
# gunicorn_workers{state="idle"}
# gunicorn_workers{state="working"}
# Option B: Instrument via gunicorn server hooks in gunicorn.conf.py
#   from prometheus_client import Gauge
#   BUSY = Gauge('gunicorn_workers_busy', 'Busy workers')
#   def pre_request(worker, req): BUSY.inc()
#   def post_request(worker, req, environ, resp): BUSY.dec()
#   (track busy rather than idle so the gauge does not need to be seeded with
#   the worker count; a fuller sketch follows the DB query examples below)
# Option C: Gunicorn has no stats socket, but when started with --statsd-host
#   it emits StatsD metrics (worker count, request rate, request duration);
#   alternatively, read worker utilization from your APM (Datadog, New Relic, etc.)
# Nginx upstream_response_time and upstream_connect_time are best read from
# your APM (Datadog, New Relic, etc.) or your log aggregation dashboard —
# the access log format varies too much across setups to parse reliably here.
# PostgreSQL slow queries (requires the pg_stat_statements extension;
# on PostgreSQL < 13 the columns are mean_time / total_time)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
# MySQL: statements currently running for longer than 1s
# (for a historical view, enable and inspect the slow query log)
SELECT * FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep' AND TIME > 1
ORDER BY TIME DESC;
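To make Option B above concrete, here is a minimal gunicorn.conf.py sketch. It is a sketch under assumptions, not a drop-in: it assumes prometheus_client is installed, Gunicorn runs with Prometheus multiprocess mode enabled (PROMETHEUS_MULTIPROC_DIR set), and the app already serves a /metrics endpoint through multiprocess.MultiProcessCollector; none of that wiring is shown here.
# gunicorn.conf.py: count busy workers via server hooks (sketch, assumptions above)
from prometheus_client import Gauge

# Each worker keeps a local 0/1 value; livesum adds them across live workers,
# so the exported series equals the number of workers currently handling a request.
BUSY = Gauge(
    "gunicorn_workers_busy",
    "Workers currently handling a request",
    multiprocess_mode="livesum",
)

def pre_request(worker, req):
    BUSY.inc()

def post_request(worker, req, environ, resp):
    BUSY.dec()

# Idle workers = configured worker count - gunicorn_workers_busy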
Step 2: Symptom → Root Cause Decision Matrix
Match observed symptoms to the most likely root cause:
Symptom A: All workers busy, backlog growing, app CPU high
Root cause: Application-layer concurrency exhausted
Action: Add more app replicas (scale out), not more workers per replica
Symptom B: App requests queuing, but DB CPU is low
Root cause: DB IO instability / storage layer performance issue
Action: Check RDS IOPS metrics; verify provisioned IOPS on gp3 are sufficient for current load
Symptom C: DB CPU high, read-heavy workload
Root cause: Read load exceeding single instance capacity
Action: Evaluate adding a DB read replica
Symptom D: DB CPU high, write-heavy or high lock contention
Root cause: Write bottleneck — read replicas won't help here
Action: Optimize write-path SQL, upgrade DB instance size; consider sharding
Symptom E: A few specific endpoints are very slow (long tail), others are fine
Root cause: Individual slow handlers (PDF generation, complex queries, slow external calls) locking up workers
Action: Add workers as a buffer for the long tail; or make the endpoint async
Symptom F: Everything is slow system-wide
Root cause: Systemic high latency — this is not a concurrency problem
Action: Optimize slow queries, add indexes, reduce external calls; do not try to fix this by adding workers
Step 3: Capacity Math
Throughput formula
Stable throughput ≈ total workers ÷ avg request duration (seconds)
Example: 32 replicas × 4 workers = 128 workers, avg latency 4s
→ Stable throughput ≈ 128 ÷ 4 = 32 QPS
Impact of reducing latency vs adding workers
Avg latency 4s → 1s: 4× throughput gain (zero new workers needed)
Double the workers: 2× throughput gain (plus more DB pressure)
Reducing latency is always the higher-leverage move.
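To make the arithmetic easy to replay, here is a tiny throughput helper using the example figures above (a sketch; the numbers are illustrative, not measurements):
# stable throughput ≈ total workers ÷ avg request duration (seconds)
def stable_qps(replicas, workers_per_replica, avg_latency_s):
    return (replicas * workers_per_replica) / avg_latency_s

print(stable_qps(32, 4, 4.0))   # 32.0 QPS (the baseline example)
print(stable_qps(32, 4, 1.0))   # 128.0 QPS: 4x gain from cutting latency to 1s
print(stable_qps(64, 4, 4.0))   # 64.0 QPS: only 2x gain from doubling workers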
Step 4: Gunicorn Worker Count
Baseline formula
# General rule (reduce for CPU-bound, increase for IO-bound)
workers = (2 × CPU_cores) + 1
nproc # check CPU count
# Example: 4-core machine → workers = 9
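As one concrete way to apply the rule, a minimal gunicorn.conf.py sketch that derives the count from the machine at startup (the file name and the sync worker_class are assumptions of a typical setup, not requirements from this guide):
# gunicorn.conf.py: baseline worker count from the (2 x cores) + 1 rule
import multiprocessing

cores = multiprocessing.cpu_count()
workers = (2 * cores) + 1      # 9 on a 4-core machine, matching the example above
worker_class = "sync"          # Gunicorn's default; see "Patterns to avoid" below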
When to add workers (and when not to)
- Good fit: A small number of slow, infrequent endpoints — use workers as a buffer to absorb the long tail
- Bad fit: System-wide slowness or DB already under pressure — adding workers just sends more concurrent load to the DB
Patterns to avoid
| Pattern | Why it's harmful |
|---|---|
| Thread workers | Spikes DB concurrency; lock/IO contention worsens latency |
| gevent | Monkey-patching dependency traps; very hard to debug in production |
| Forking child processes inside workers | Non-linear memory growth, orphan processes, OOM risk |
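One way to see why thread workers spike DB concurrency: every thread can hold its own DB connection, so worst-case concurrent connections multiply. A quick back-of-the-envelope check, reusing the replica and worker counts from Step 3 (the 8 threads per worker is an assumed example):
# worst-case concurrent DB connections if every thread holds a connection
replicas, workers_per_replica, threads_per_worker = 32, 4, 8
print(replicas * workers_per_replica)                       # 128 with sync workers
print(replicas * workers_per_replica * threads_per_worker)  # 1024 with 8 threads per worker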
Step 5: Storage Layer Check (RDS gp3 IOPS)
If app requests are queuing but DB CPU is low, the bottleneck is likely IO. Check whether provisioned IOPS are being saturated:
# Check ReadIOPS and WriteIOPS vs your provisioned limit
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReadIOPS \
--dimensions Name=DBInstanceIdentifier,Value=<DB_INSTANCE_ID> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name WriteIOPS \
--dimensions Name=DBInstanceIdentifier,Value=<DB_INSTANCE_ID> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
# Also check DiskQueueDepth — sustained values above 1 suggest IO is the bottleneck
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DiskQueueDepth \
--dimensions Name=DBInstanceIdentifier,Value=<DB_INSTANCE_ID> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
If IOPS are hitting the provisioned ceiling or DiskQueueDepth is consistently above 1, increase the provisioned IOPS on the gp3 volume.
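If you would rather script the comparison than read raw CLI output, here is a minimal boto3 sketch that pulls the same CloudWatch metrics and compares them against the instance's provisioned IOPS. It assumes boto3 and AWS credentials are configured; <DB_INSTANCE_ID> stays a placeholder.
# Sketch: compare average IOPS usage over the last hour against provisioned IOPS
import datetime
import boto3

DB_ID = "<DB_INSTANCE_ID>"

rds = boto3.client("rds")
cw = boto3.client("cloudwatch")

# Provisioned IOPS as reported by RDS (may be absent on some storage configurations)
provisioned = rds.describe_db_instances(
    DBInstanceIdentifier=DB_ID)["DBInstances"][0].get("Iops")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)

def hourly_avg(metric_name):
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )["Datapoints"]
    return sum(p["Average"] for p in datapoints) / len(datapoints) if datapoints else 0.0

used = hourly_avg("ReadIOPS") + hourly_avg("WriteIOPS")
queue_depth = hourly_avg("DiskQueueDepth")
print(f"avg IOPS used: {used:.0f}  provisioned: {provisioned}  avg DiskQueueDepth: {queue_depth:.2f}")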
Step 6: Decision Order
1. Break down latency → find which layer is slow
↓
2. Fix latency → slow queries, indexes, reduce external calls, async offload
↓
3. Check storage layer → RDS IOPS / DiskQueueDepth; increase provisioned IOPS if saturated
↓
4. Tune workers → only for long-tail slow endpoints as a buffer
↓
5. Scale replicas → when app-layer concurrency is genuinely the bottleneck
↓
6. Add DB read replica → when read-heavy and primary is saturated
Checklist
[ ] Check Gunicorn idle vs working worker ratio
[ ] Check Nginx upstream_response_time and upstream_connect_time
[ ] Check DB slow queries (> 1s)
[ ] Check DB CPU and load
[ ] Check RDS ReadIOPS/WriteIOPS vs provisioned limit; check DiskQueueDepth
[ ] Calculate throughput ceiling: total workers ÷ avg request duration
[ ] Determine: long-tail problem or system-wide slowness?
[ ] Determine: bottleneck in app layer or DB layer?
Anti-Pattern Alerts
Death spiral #1 — Worker accumulation loop: Add workers → queue clears → DB gets hammered → latency rises → add more workers → memory pressure → OOM / 502
Death spiral #2 — Restart loop: DB IO saturates → DB latency spikes → workers get held longer → requests pile up → OOM → container restarts → pile-up continues
When you recognize either pattern, stop adding workers. Investigate storage layer and latency root cause instead.