performance-report-analysis
Performance Report Analyzer
Interprets performance test results, identifies bottlenecks, classifies findings by severity, and produces two structured reports — one technical, one for business stakeholders. This skill starts where testing ends.
Scope boundary: This skill covers analysis and communication of results after
a test has run. For planning which tests to run, sizing VUs, or setting SLAs, use
performance-testing-strategy instead.
Output Format
After completing Steps 1–3, deliver two artifacts:
- Technical Report — for engineers and QA: raw findings, root cause analysis, severity classification, and concrete remediation steps.
- Business Report — for stakeholders: business impact framing, risk rating, and recommended decisions — no raw percentile numbers.
Step 1 — Gather the Results
Do not jump straight to analysis. First ask the questions below, grouped logically. Wait for answers before proceeding. The goal is to understand the results from the perspective of someone who already ran the test — not to plan a new one.
Group A — Always ask (every analysis session)
Ask these first, even if the user has already pasted raw output:
- Tool and test type: Which tool generated this report? (k6, Gatling, Locust, JMeter, Artillery, other) — and what type of test was it? (Load, Stress, Spike, Endurance, Smoke)
- Did the test complete? Did it run the full configured duration and reach the target user count / RPS? If not, the results may be partial or unreliable.
- Who will read the report? Engineers and QA only? A manager or team lead? Business stakeholders or executives? — This determines whether to produce a technical report, a business report, or both.
Group B — Ask only if missing from the pasted output
Ask only what is not already present in the data the user shared:
- Percentiles: Do you have p95 and p99, or only the average/mean?
- SLA targets: What were the pass/fail thresholds? (e.g., p95 < 800ms, error rate < 1%) — If none were defined, ask: what response time would be unacceptable for your users?
- Error details: Did the errors have a specific message? (timeout, connection refused, 5xx, assertion failure) — The error type changes the diagnosis.
Group C — Ask when the output shows something suspicious
Ask these only when the analysis reveals a finding that needs more context:
- Infrastructure data: Was any monitoring active during the test? (CPU, memory, DB metrics, APM) — Without this, bottleneck hypotheses cannot be confirmed.
- Baseline: Do you have results from a previous run of the same test? — Required for regression detection.
- Infra events: Did anything unusual happen during the test? (deployment, restart, alert, external incident)
Rule: Never proceed to Step 2 when you have only averages/means (no percentiles) and no SLA targets. If the user cannot provide SLAs, use industry defaults (p95 < 1000ms, error rate < 1%) and state this assumption explicitly in the report.
Only load references/TOOL-REPORT-FORMATS.md when the user pastes raw output from k6, Gatling, Locust, JMeter, or Artillery and needs help reading or interpreting the specific fields, charts, or sections of that tool's report.
Only load references/BOTTLENECK-PATTERNS.md when the user asks to diagnose why a metric is degraded — CPU spikes, memory growth, slow queries, connection pool exhaustion, or third-party dependency slowness.
Only load references/REPORT-TEMPLATES.md when you are ready to produce the final technical or business report draft.
Step 2 — Analyze the Findings
2.0 Triage order — always start here
When results arrive, evaluate in this order. Stop and report if you find a blocker at any step:
- Error rate first — if errors exist at baseline load, it is a bug, not a capacity issue. Capacity numbers are invalid until the bug is fixed.
- Did the test reach its target load? — if VU count or RPS never reached the configured peak, the results are partial. State this limitation explicitly.
- Global p95 vs. SLA — is the most important SLA met or breached at steady state?
- Latency trend over time — was performance stable throughout the test, or did it degrade mid-run? (memory leak, warm-up, resource exhaustion)
- Per-endpoint breakdown — which specific endpoint is the source of the problem? Never report only global metrics if per-endpoint data is available.
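A minimal sketch of this triage order, assuming a generic summary dict; the field names (`baseline_error_rate_pct`, `peak_vus_reached`, and so on) are hypothetical and must be mapped to whatever your tool exports:

```python
def triage(r: dict) -> list[str]:
    """Apply the triage order above; returns blocking observations in priority order."""
    notes = []
    if r.get("baseline_error_rate_pct", 0) > 0:
        # Errors at baseline load are a bug; capacity numbers are invalid until fixed.
        return ["Errors present at baseline load: treat as a bug, not a capacity issue."]
    if r.get("peak_vus_reached", 0) < r.get("peak_vus_configured", 0):
        notes.append("Target load never reached: results are partial, state this limitation.")
    if r.get("p95_ms", 0) > r.get("sla_p95_ms", float("inf")):
        notes.append("Global p95 breaches the SLA at steady state.")
    if r.get("p95_trend_ms_per_min", 0) > 0:
        notes.append("Latency degraded over the run: check for leaks or resource exhaustion.")
    return notes
```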
2.1 SLA compliance check
For each metric, compare actual vs. target:
| Metric | Target | Actual | Status |
|---|---|---|---|
| p95 response time | < X ms | Y ms | PASS / FAIL |
| p99 response time | < X ms | Y ms | PASS / FAIL |
| Error rate | < X% | Y% | PASS / FAIL |
| Throughput | ≥ X RPS | Y RPS | PASS / FAIL |
Never average percentiles. p95 is the 95th percentile of all requests: a 1.2s p95 means 5% of requests took 1.2s or longer. It is a point on the distribution, not an average, so averaging percentiles across runs or endpoints produces a meaningless number.
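A small sketch of the compliance check, with a nearest-rank percentile helper for when only raw samples are available; the metric names and the pass direction per metric are illustrative:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile, computed from raw samples (never by averaging other percentiles)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def sla_check(actual: dict, targets: dict) -> list[tuple]:
    """Compare actuals vs. targets; throughput passes on >=, everything else passes on <=."""
    rows = []
    for metric, target in targets.items():
        value = actual[metric]
        passed = value >= target if metric == "throughput_rps" else value <= target
        rows.append((metric, target, value, "PASS" if passed else "FAIL"))
    return rows

# Example with the industry-default thresholds from Step 1:
# sla_check({"p95_ms": 1240, "error_rate_pct": 0.4}, {"p95_ms": 1000, "error_rate_pct": 1.0})
```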
2.2 Latency distribution analysis
Healthy vs. degraded signatures:
| Pattern | What it indicates |
|---|---|
| p50 ≈ p95 (tight spread) | Consistent, predictable performance |
| p95 >> p50 (wide spread, long tail) | Outliers — GC pauses, lock contention, cold cache, DB spikes |
| p50 rises with load | Saturation — system is queuing requests |
| p99 >> p95 | Occasional severe stalls — investigate retries, timeouts, external deps |
| p95 rises linearly as users scale | Expected — not a bug; validate it stays within SLA |
| p95 rises exponentially above a threshold | Breaking point — find the knee of the curve |
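These signatures can be detected mechanically. A sketch follows; the ratio cut-offs (1.5×, 2×, 3×) are illustrative defaults, not part of any standard:

```python
def distribution_signature(p50_ms: float, p95_ms: float, p99_ms: float) -> str:
    """Classify the latency spread from three percentiles of the same run."""
    if p95_ms <= p50_ms * 1.5:
        return "tight spread: consistent, predictable performance"
    if p99_ms >= p95_ms * 2:
        return "p99 >> p95: occasional severe stalls, check retries, timeouts, external deps"
    if p95_ms >= p50_ms * 3:
        return "long tail: outliers from GC pauses, lock contention, cold cache, or DB spikes"
    return "moderate spread: watch the trend as load increases"
```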
2.3 Error rate analysis
Classify errors before diagnosing:
- Timeout errors — requests that never completed: indicates saturation or slow downstream
- 4xx errors — client errors under load: often test data issues (expired tokens, missing data)
- 5xx errors — server-side failures: overload, unhandled exceptions, OOM crashes
- Connection errors — refused or reset: infrastructure limit (connection pool, firewall, max threads)
An error rate that grows with load (not present at baseline) is a capacity signal, not a bug signal. An error rate present even at low load is a bug — fix it before interpreting capacity numbers.
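A sketch of the classification step over raw error messages; the match strings are examples only and should be extended for your stack:

```python
def classify_error(message: str) -> str:
    msg = message.lower()
    if "timeout" in msg or "deadline exceeded" in msg:
        return "timeout"      # saturation or a slow downstream dependency
    if "connection refused" in msg or "connection reset" in msg:
        return "connection"   # infrastructure limit: pool, firewall, max threads
    if any(code in msg for code in ("500", "502", "503", "504")):
        return "5xx"          # server-side failure: overload, unhandled exception, OOM
    if any(code in msg for code in ("400", "401", "403", "404")):
        return "4xx"          # often a test-data issue: expired tokens, missing records
    return "other"
```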
2.4 Regression detection
When a baseline exists, always compute delta:
| Metric | Baseline | Current | Delta | Flag? |
|---|---|---|---|---|
| p95 | A ms | B ms | +X% | Flag if > 20% regression |
| Error rate | A% | B% | +X pp | Flag if any increase |
| Throughput | A RPS | B RPS | -X% | Flag if > 10% drop |
Regression thresholds (default, adjust to SLA):
- p95 or p99 regression > 20% → flag as degradation
- Error rate increase > 0.1 pp (from near-zero) → flag immediately
- Throughput drop > 10% at same load → flag as capacity regression
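A sketch that applies these default thresholds to a baseline/current pair (the field names are illustrative):

```python
def regression_flags(baseline: dict, current: dict) -> list[str]:
    """Compute deltas vs. baseline and apply the default regression thresholds above."""
    flags = []
    p95_delta = (current["p95_ms"] - baseline["p95_ms"]) / baseline["p95_ms"] * 100
    if p95_delta > 20:
        flags.append(f"p95 regressed {p95_delta:.0f}% (> 20% threshold)")
    err_delta_pp = current["error_rate_pct"] - baseline["error_rate_pct"]
    if err_delta_pp > 0.1:
        flags.append(f"error rate increased {err_delta_pp:.2f} pp")
    rps_drop = (baseline["rps"] - current["rps"]) / baseline["rps"] * 100
    if rps_drop > 10:
        flags.append(f"throughput dropped {rps_drop:.0f}% at the same load")
    return flags
```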
2.5 Per-endpoint analysis
Global metrics hide which endpoint is causing the problem. When per-endpoint data is available:
- Sort endpoints by p95 descending — the top 3 are your investigation targets
- Check whether failing endpoints share a pattern (same DB table, same external dep, same service)
- Distinguish: is one slow endpoint dragging up the global p95, or is the problem widespread?
- Report SLA compliance per endpoint when SLAs are defined per endpoint, not just globally
| Situation | What to report |
|---|---|
| 1 of 10 endpoints causes the SLA breach | Flag that endpoint specifically — global SLA breach is misleading without this context |
| All endpoints degrade together | Shared resource bottleneck — investigate DB, network, or infra |
| Only POST endpoints fail | Investigate write path: DB locks, validation, downstream writes |
| Only authenticated endpoints fail | Investigate auth middleware, token validation, session store |
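A sketch of the ranking and the "one offender vs. widespread" distinction; the 70% cut-off for "widespread" is an assumption, not a rule from this skill:

```python
def endpoint_triage(endpoints: list[dict], sla_p95_ms: float):
    """endpoints: dicts with 'name' and 'p95_ms' (illustrative field names)."""
    ranked = sorted(endpoints, key=lambda e: e["p95_ms"], reverse=True)
    breaching = [e for e in ranked if e["p95_ms"] > sla_p95_ms]
    if len(breaching) == 1:
        verdict = f"single offender: {breaching[0]['name']} drives the global breach"
    elif breaching and len(breaching) >= 0.7 * len(endpoints):
        verdict = "widespread degradation: suspect a shared resource (DB, network, infra)"
    else:
        verdict = "mixed: check whether the top offenders share a dependency"
    return ranked[:3], verdict
```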
2.6 Chart and graph interpretation
When the user shares or references visual reports (Gatling HTML, k6 dashboard, Locust charts):
| Chart type | What to look for | Red flags |
|---|---|---|
| Response time over time | Should be flat during steady state | Upward trend = leak or saturation; spikes = GC or retry storms |
| Active users / VUs over time | Should match the configured ramp profile | Plateau below target = test runner bottleneck or too-strict rate limiter |
| Requests per second | Should track with VU count | RPS plateaus while VUs increase = system saturated |
| Error rate over time | Should be near zero | Spike then recover = transient; growing = capacity limit approaching |
| Response time distribution | Should be right-skewed (few slow outliers) | Bimodal distribution = two distinct user populations or cache hit/miss split |
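When only a raw time series is available instead of a chart, the "upward trend" red flag can be approximated with a least-squares slope over the steady-state window. A sketch, assuming a list of (seconds, p95 ms) points:

```python
def p95_trend(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of p95 over time, in ms per minute; > 0 suggests a leak or saturation."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        return 0.0
    slope_per_second = sum((t - mean_t) * (y - mean_y) for t, y in points) / denom
    return slope_per_second * 60
```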
2.7 Spike test and warm-up analysis
Spike tests — three things to verify:
- Did the system reach the configured spike peak? (check VU/RPS chart)
- What was the error rate and p95 at peak? (the stress moment)
- How long did recovery take after load dropped? (recovery time = time to return to baseline p95 ± 10%)
Warm-up / cold start — when to exclude from SLA evaluation:
- If latency is high in the first 2–5 minutes then stabilizes → exclude warm-up window from SLA measurement
- Evaluate SLA only during steady-state (after the ramp-up phase completes)
- Note the warm-up duration in the report — it matters for autoscaling and readiness probe configuration
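Both checks can be sketched over the same kind of (seconds, p95 ms) time series; the ±10% recovery tolerance follows the definition in the spike checklist, and the data shape is an assumption:

```python
def recovery_time_s(points, spike_end_s: float, baseline_p95_ms: float, tol: float = 0.10):
    """Seconds from spike end until p95 returns to baseline ± tol; None if it never recovers."""
    for t, p95 in points:
        if t >= spike_end_s and p95 <= baseline_p95_ms * (1 + tol):
            return t - spike_end_s
    return None

def steady_state(points, warmup_s: float):
    """Drop the warm-up window before evaluating SLA compliance."""
    return [(t, p95) for t, p95 in points if t >= warmup_s]
```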
Step 3 — Classify Severity
Assign a severity to each finding before writing either report.
| Severity | Definition | Example |
|---|---|---|
| Critical | SLA breach; system unavailable or degraded for users in production | p95 > SLA × 2, error rate > 5% |
| High | SLA breach; significant user impact if deployed | p95 > SLA, error rate 1–5% |
| Medium | SLA met but trend is concerning; risk of future breach | p95 at 90% of SLA, rising with load |
| Low | Within SLA; minor optimization opportunity | p95 at 60% of SLA, no trend |
| Informational | Notable observation, no action required | Latency spike during GC, recovered immediately |
Every finding in the reports must carry a severity label.
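For latency and error findings, the table can be applied mechanically. A sketch whose cut-offs mirror the example column (purely observational notes stay Informational):

```python
def severity(p95_ms: float, sla_p95_ms: float, error_rate_pct: float) -> str:
    """Map a latency/error finding to a severity label per the table above."""
    ratio = p95_ms / sla_p95_ms
    if ratio > 2 or error_rate_pct > 5:
        return "Critical"
    if ratio > 1 or error_rate_pct >= 1:
        return "High"
    if ratio >= 0.9:
        return "Medium"
    return "Low"
```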
Step 4 — Technical Report
Structure the technical report as follows. Be specific: include actual numbers, tool output excerpts, and concrete next steps with owners.
## Performance Test Technical Report
**Date:** [date]
**Test type:** [Smoke / Load / Stress / Spike / Endurance]
**Tool:** [k6 / Gatling / Locust / JMeter / other]
**Environment:** [staging / perf / prod-clone]
**Load profile:** [X users, Y RPS, Z minutes duration]
---
### Executive Summary (3 sentences max)
[What was tested. Whether SLAs were met. Top finding in plain language.]
---
### SLA Compliance
| Metric | Target | Actual | Result |
|---|---|---|---|
| p95 response time | ... | ... | PASS/FAIL |
| p99 response time | ... | ... | PASS/FAIL |
| Error rate | ... | ... | PASS/FAIL |
| Throughput | ... | ... | PASS/FAIL |
---
### Findings
#### [CRITICAL/HIGH/MEDIUM/LOW] Finding 1 — [Short title]
**Observed:** [What the data shows, with exact numbers]
**Root cause hypothesis:** [Why this likely happened — infrastructure, code, config]
**Evidence:** [Specific metric, timestamp, or tool output that supports this]
**Recommended action:** [Concrete next step — code change, config tuning, infra scaling]
**Owner:** [Team or role responsible]
**Retest required:** Yes / No
#### [severity] Finding 2 — ...
[repeat for each finding]
---
### Regression vs. Baseline
| Metric | Baseline | Current | Delta | Status |
|---|---|---|---|---|
| p95 | — | — | — | — |
| Error rate | — | — | — | — |
[Note: omit section if no baseline exists]
---
### Infrastructure Observations
[CPU, memory, DB, network observations during the test. Flag any resource that peaked
above safe thresholds (CPU > 70%, memory upward drift, connection pool saturation).]
---
### Recommendations Summary
| Priority | Action | Owner | Target date |
|---|---|---|---|
| P1 | ... | ... | ... |
| P2 | ... | ... | ... |
---
### Test Conditions
[Document: environment, dataset size, any known limitations that affect validity of results]
Step 5 — Business Report
The business report translates technical findings into decisions and risk. Never include raw percentile numbers, tool names, or technical jargon. Translate every metric into user or business impact.
Translation table — metrics to business language
| Technical metric | Business translation |
|---|---|
| p95 = 1.8s (SLA: < 1s) | 1 in 20 users waits nearly 2× longer than acceptable |
| Error rate = 3% at peak | 3 out of every 100 transactions fail during busy periods |
| System breaks at 800 users | Current capacity is 60% of the expected peak of 1,300 users |
| p95 regression +40% vs. last release | The recent release made the slowest user experience significantly worse |
| Endurance: memory grows 2GB over 4h | If deployed, the service will require a restart every few hours to avoid outages |
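The p95 row of this table can be produced mechanically. A small sketch that mirrors the phrasing used above; adjust the wording to the audience:

```python
def business_phrase_p95(p95_ms: float, sla_ms: float) -> str:
    """Translate a p95 vs. SLA comparison into plain business language."""
    over_pct = (p95_ms / sla_ms - 1) * 100
    if over_pct <= 0:
        return "19 out of 20 users get a response within the agreed target."
    return f"1 in 20 users experiences a delay {over_pct:.0f}% longer than our target."
```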
## Performance Test — Business Summary
**Date:** [date]
**System:** [product or service name]
**Test conducted by:** [team]
---
### What Was Tested
[One paragraph. What system, what scenario, what load level — in plain language.
No tool names. Example: "We simulated 500 users purchasing products simultaneously,
representing the expected traffic during the upcoming sale."]
---
### Key Question: Is It Ready?
**Overall verdict:** [Ready to deploy / Not ready — risks identified / Ready with conditions]
[One paragraph summary of what this means for the business.]
---
### Risk Summary
| Risk | Impact | Likelihood | Recommended action |
|---|---|---|---|
| [plain-language risk] | High/Med/Low | High/Med/Low | [decision recommendation] |
---
### What Happens If We Deploy Now
[Honest assessment of the user-facing impact if the system goes live as-is.
Focus on conversion, user experience, or revenue impact where applicable.]
---
### What Needs to Happen Before Go-Live
[Bullet list of must-fix items in plain language. Each bullet should say what needs
to happen and why it matters to users or the business — not how to fix it technically.]
---
### What We Can Defer
[Low/informational findings that do not block launch but should be addressed post-launch.]
---
### Decision Required
[If there is a go/no-go decision pending, state it explicitly with a recommendation
and the tradeoff of each option.]
Common Mistakes in Result Interpretation
1. Using mean response time instead of percentiles
The mean hides outliers. A mean of 200ms is meaningless if p99 = 8s. Always lead with p95 and p99 for user-perceived performance.
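A toy illustration of how the mean hides the tail; the numbers are made up for the example:

```python
import statistics

latencies_ms = [100] * 98 + [8000] * 2     # 98 fast requests, 2 eight-second stalls
print(statistics.mean(latencies_ms))        # 258.0 ms: looks acceptable
print(sorted(latencies_ms)[98])             # nearest-rank p99 = 8000 ms: the real user pain
```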
2. Ignoring error rate growth pattern
An error rate that jumps from 0% at 100 users to 5% at 200 users is a capacity signal. An error rate of 5% at every load level is a bug. These require different fixes.
3. Declaring success because SLAs are met "on average"
SLAs must be evaluated per percentile, not as averages. "p95 is fine" and "average is fine" are not the same statement.
4. Comparing results across different environments
A staging result and a production-sized environment result are not comparable. Always document environment differences and state explicitly what the result does and does not prove.
5. Writing a business report with technical numbers
Saying "p95 = 1,240ms exceeds our SLA of 800ms" to a business stakeholder produces no action. Say "1 in 20 users experiences a delay 55% longer than our target."
6. No baseline — calling first run a pass
Without a baseline, you can only say "SLAs are met." You cannot say performance has not regressed. Always capture and store first-run results as the baseline for future comparisons.
7. Stress test breaking point declared without recovery verification
Finding the breaking point is only half the stress test. Confirm the system recovers after load is removed. A system that crashes and stays crashed is far more dangerous than one that degrades gracefully.
8. Analyzing warm-up period as steady state
High p95 values in the first 2–5 minutes of a test are often JIT compilation, lazy connection initialization, or DNS resolution — not application performance. Always exclude the ramp-up/warm-up window when evaluating SLA compliance.
9. Reporting only global metrics when per-endpoint data is available
A global p95 of 900ms could mean all 10 endpoints are slow (infrastructure problem) or one endpoint is returning 9,000ms (specific bug). Always break down to endpoint level before drawing conclusions or making recommendations.