# Deco Site Scaling Tuning (`deco-site-scaling-tuning`)
Analyze a site's Prometheus metrics to discover the optimal autoscaling parameters. This skill helps you find the CPU/concurrency threshold where latency degrades and recommends scaling configuration accordingly.
## When to Use This Skill
- A site is overscaled (too many pods for its traffic)
- A site oscillates between scaling up and down (panic mode loop)
- Need to switch scaling metric (concurrency vs CPU vs RPS)
- Need to find the right target value for a site
- After deploying scaling changes, to verify they're working
## Prerequisites

- `kubectl` access to the target cluster
- Prometheus accessible via port-forward (from `kube-prometheus-stack` in the `monitoring` namespace)
- Python 3 for analysis scripts
- At least 6 hours of metric history for meaningful analysis
- For direct latency data: queue-proxy PodMonitor must be applied (see Step 0)
## Quick Start

0. **ENABLE METRICS** → Apply the queue-proxy PodMonitor if not already done
1. **PORT-FORWARD** → `kubectl port-forward prometheus-pod 19090:9090`
2. **COLLECT DATA** → Run the analysis scripts against Prometheus
3. **ANALYZE** → Find the CPU threshold where latency degrades
4. **RECOMMEND** → Choose a scaling metric and target
5. **APPLY** → Use the `deco-site-deployment` skill to apply changes
6. **VERIFY** → Monitor for 1-2 hours after the change
## Files in This Skill

| File | Purpose |
|---|---|
| `SKILL.md` | Overview, methodology, analysis procedures |
| `analysis-scripts.md` | Ready-to-use Python scripts for Prometheus queries |
## Step 0: Enable Queue-Proxy Metrics (one-time)
Queue-proxy runs as a sidecar on every Knative pod and exposes request latency histograms. These are critical for precise tuning but are not scraped by default.
Apply this PodMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  podMetricsEndpoints:
    - port: http-usermetric
      path: /metrics
      interval: 15s
```

```bash
kubectl apply -f queue-proxy-podmonitor.yaml
# Wait 2-3 hours for data to accumulate before running latency analysis
```
Metrics unlocked by this PodMonitor:

- `revision_app_request_latencies_bucket` — request latency histogram (p50/p95/p99)
- `revision_app_request_latencies_sum` / `_count` — for average latency
- `revision_app_request_count` — request rate by response code
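Once data has accumulated, these histograms can be queried directly as a sanity check. A minimal sketch, assuming the `19090` port-forward from Step 1 and Knative's usual `namespace_name`/`revision_name` labels — verify the label names against your actual scrape; `sites-mysite` is a placeholder:

```python
import urllib.parse

PROM = "http://127.0.0.1:19090"  # port-forward from Step 1

def p95_latency_query(namespace):
    """PromQL: p95 request latency per revision over 5m windows."""
    return (
        "histogram_quantile(0.95, sum by (le, revision_name) ("
        f'rate(revision_app_request_latencies_bucket{{namespace_name="{namespace}"}}[5m])'
        "))"
    )

def instant_query_url(promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

# Placeholder namespace, for illustration only
url = instant_query_url(p95_latency_query("sites-mysite"))
print(url)
```

Fetch the URL and inspect `.data.result` for the per-revision p95 values.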
## Step 1: Establish Prometheus Connection

```bash
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n monitoring $PROM_POD 19090:9090 &

# Verify
curl -s "http://127.0.0.1:19090/api/v1/query?query=up" | jq '.status'
```
## Step 2: Collect Current State

Before analyzing, understand what the site is currently configured for.

### 2a. Read current autoscaler config

```bash
SITENAME="<sitename>"
NS="sites-${SITENAME}"

# Autoscaling annotations on the active revision
kubectl get rev -n $NS -o json | \
  jq '.items[] | select(.status.conditions[]? | .type == "Active" and .status == "True") |
    {name: .metadata.name, annotations: (.metadata.annotations | with_entries(select(.key | startswith("autoscaling"))))}'

# Global autoscaler defaults
kubectl get cm config-autoscaler -n knative-serving -o json | jq '.data | del(._example)'
```
### 2b. Current pod count and resources

```bash
kubectl get pods -n $NS --no-headers | wc -l
kubectl top pods -n $NS --no-headers | head -20
```
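The `kubectl top` output above can be summarized quickly before the full analysis. A small sketch — the pod names and numbers in `sample` are made up for illustration:

```python
def summarize_top(text):
    """Parse `kubectl top pods --no-headers` output; return (avg_m, max_m) CPU."""
    cpus = []
    for line in text.strip().splitlines():
        cols = line.split()
        cpus.append(int(cols[1].rstrip("m")))  # CPU column looks like "237m"
    return sum(cpus) / len(cpus), max(cpus)

# Illustrative output of `kubectl top pods -n $NS --no-headers`
sample = """\
site-00042-deployment-abc 237m 412Mi
site-00042-deployment-def 305m 398Mi
"""
print(summarize_top(sample))  # → (271.0, 305)
```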
## Step 3: Run Analysis

Use the scripts in `analysis-scripts.md`. The analysis follows this methodology:
### Methodology: Finding the Optimal CPU Target

**Goal**: Find the CPU level at which latency starts to degrade. This is your scaling target — keep pods below this CPU to maintain good latency.

**Approach**:

1. Collect CPU per pod, concurrency per pod, pod count, and (if available) request latency over 6-12 hours
2. Bucket the data by CPU range (0-200m, 200-300m, ..., 700m+)
3. For each bucket, compute avg/p95 concurrency per pod
4. Compute the "latency inflation factor" — how much concurrency increases beyond what the pod count reduction explains:

   `excess = (avg_conc_above_threshold / avg_conc_below_threshold) / (avg_pods_below / avg_pods_above)`

   - excess = 1.0 → concurrency increase fully explained by fewer pods (no latency degradation)
   - excess > 1.0 → latency is inflating concurrency (pods are slowing down)
   - The CPU level where excess crosses ~1.5x is your inflection point
5. If queue-proxy latency is available, directly plot avg latency vs CPU — the hockey-stick inflection is your target
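Step 4's inflation factor is simple arithmetic. A sketch with illustrative numbers (not from a real site):

```python
def excess_factor(conc_above, conc_below, pods_below, pods_above):
    """How much per-pod concurrency grew beyond what the pod-count drop explains.

    ~1.0: fewer pods fully explain the higher per-pod concurrency.
    >1.0: requests are also taking longer (latency degradation).
    """
    concurrency_ratio = conc_above / conc_below
    pod_ratio = pods_below / pods_above
    return concurrency_ratio / pod_ratio

# Example: above the candidate threshold, per-pod concurrency tripled (10 → 30)
# while pod count only halved (6 → 3): excess = 3.0 / 2.0 = 1.5, right at the
# inflection rule of thumb.
print(excess_factor(conc_above=30, conc_below=10, pods_below=6, pods_above=3))  # → 1.5
```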
### What to Look For

CPU vs concurrency per pod:

```
Low CPU (0-200m)       → Low conc/pod    → Pods are idle (overprovisioned)
Medium CPU (200-400m)  → Moderate conc   → Healthy range
★ INFLECTION ★         → Conc jumps      → Latency starting to degrade
High CPU (500m+)       → High conc/pod   → Pods overloaded, latency bad
```

The inflection point is where you want your scaling target.
### Decision Matrix

**IMPORTANT**: The CPU target is in millicores (not a percentage). E.g., `target: 400` means scale when CPU reaches 400m.
| Inflection CPU | Recommended metric | Target | Notes |
|---|---|---|---|
| < CPU request | CPU scaling | target = inflection value in millicores | Standard case |
| ~ CPU request | CPU scaling | target = CPU_request × 0.8 | Conservative |
| > CPU request (no limit) | CPU scaling | target = CPU_request × 0.8, increase CPU request | Need more CPU headroom |
| No clear inflection | Concurrency scaling | Keep current but tune target | CPU isn't the bottleneck |
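The matrix above can be read as a small decision function. A sketch: the 0.9-1.1 band used for "~ CPU request" and the `recommend` helper are assumptions for illustration, not part of the skill:

```python
def recommend(inflection_m, request_m):
    """Map (inflection, CPU request) in millicores to (metric, target, note).

    The 0.9/1.1 band approximating "~ CPU request" is an assumed cutoff.
    """
    if inflection_m is None:
        return ("concurrency", None, "CPU isn't the bottleneck; tune the current target")
    if inflection_m < 0.9 * request_m:
        return ("cpu", inflection_m, "standard case")
    if inflection_m <= 1.1 * request_m:
        return ("cpu", round(request_m * 0.8), "conservative")
    return ("cpu", round(request_m * 0.8), "increase the CPU request for headroom")

print(recommend(400, 1000))   # inflection well below a 1000m request
print(recommend(None, 1000))  # no clear inflection
```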
## Common Patterns

**Pattern: CPU-bound app (Deno SSR)**

- Baseline CPU: 200-300m (Deno runtime + V8 JIT)
- Inflection: 400-500m
- Recommendation: CPU scaling with target = inflection (e.g., 400 millicores)

**Pattern: IO-bound app (mostly external API calls)**

- CPU stays low even under high concurrency
- Inflection not visible in CPU
- Recommendation: keep concurrency scaling, tune the target

**Pattern: Oscillating (panic loop)**

- Symptoms: pods cycle between min and max
- Cause: concurrency scaling + a low target + the `scale-down-delay` ratchet
- Fix: switch to CPU scaling (breaks the latency→concurrency feedback loop)
## Step 4: Apply Changes

Use the `deco-site-deployment` skill to:

- Update the `state` secret with the new scaling config
- Redeploy on both clouds

Example for CPU-based scaling (target is in millicores):

```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "cpu",
    "target": 400
  }
')
```
## Step 5: Verify After Change

Monitor for 1-2 hours after applying changes:

```bash
# Watch pod count stabilize
watch -n 10 "kubectl get pods -n sites-<sitename> --no-headers | wc -l"

# Panic mode doesn't apply here: HPA has no panic mode, which is one of
# the advantages of CPU scaling over the Knative concurrency autoscaler

# Verify the HPA is active
kubectl get hpa -n sites-<sitename>

# Check HPA status
kubectl describe hpa -n sites-<sitename>
```
## Success Criteria
- Pod count stabilizes (no more oscillation)
- Avg CPU per pod stays below your target during normal traffic
- CPU crosses target only during genuine traffic spikes (and scales up proportionally)
- No panic mode events (HPA doesn't have panic mode)
- Latency stays acceptable (check with queue-proxy metrics if available)
## Rollback

If the new scaling is worse, revert by changing the `state` secret back to concurrency scaling:

```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "concurrency",
    "target": 15,
    "targetUtilizationPercentage": 70
  }
')
```
## Related Skills

- `deco-site-deployment` — Apply scaling changes and redeploy
- `deco-site-memory-debugging` — Debug memory issues on running pods
- `deco-incident-debugging` — Incident response and triage