benchmark-runner
SKILL.md
Benchmark Runner
Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."
Reference Files
| File | Contents | Load When |
|---|---|---|
references/metric-selection.md |
Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always |
references/test-case-design.md |
Representative input selection, scale variation, edge case coverage, warmup strategies | Always |
references/environment-capture.md |
Hardware/software context recording, reproducibility requirements, variance control | Always |
references/statistical-rigor.md |
Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |
Prerequisites
- Clear candidates to compare (at least 2)
- Access to run or observe the candidates (code, API, or existing results)
- Representative workload definition
Workflow
Phase 1: Define Scope
- What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
- What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
- What is the decision context? — Why does this comparison matter? This determines which metrics are most important.
Phase 2: Select Metrics
Choose metrics that match the decision context:
| Metric Category | Specific Metrics | When Important |
|---|---|---|
| Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls |
| Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming |
| Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments |
| Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs |
| Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons |
| Startup | Time to first operation, cold start | Serverless, CLI tools |
Select 2-4 metrics. More than 4 makes comparison tables unreadable.
Phase 3: Design Test Cases
Create a matrix of inputs that reveal performance characteristics:
- Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
- Representative data — Use realistic inputs, not synthetic best-case data.
- Edge cases — Empty input, maximum size, adversarial input.
- Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.
Phase 4: Specify Environment
Record everything needed to reproduce the results:
- Hardware — CPU model, core count, RAM size, GPU model (if applicable)
- Software — OS version, language runtime version, dependency versions
- Configuration — Thread count, batch size, connection pool size, cache settings
- Isolation — What else was running? Background processes affect results.
Phase 5: Structure Results
Produce comparison tables with clear winners per metric, followed by tradeoff analysis.
Output Format
# Benchmark: {Descriptive Title}
**Date:** {YYYY-MM-DD}
**Hardware:** {CPU}, {RAM}, {GPU if applicable}
**Software:** {runtime versions}
**Configuration:** {key settings that affect results}
## Candidates
| # | Candidate | Version | Configuration |
|---|-----------|---------|---------------|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |
## Test Cases
| # | Name | Input Size | Description | Warmup | Iterations |
|---|------|------------|-------------|--------|------------|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |
## Results
### Latency (ms, lower is better)
| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|-----------|---------------------|---------------------|--------|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |
### Memory (MB, lower is better)
| Test Case | A (Peak) | B (Peak) | Winner |
|-----------|----------|----------|--------|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |
## Analysis
### Overall Winner
**{Candidate}** wins on {N} of {M} metrics across all test cases.
### Tradeoff Summary
- **Choose A when:** {conditions where A is the better choice}
- **Choose B when:** {conditions where B is the better choice}
### Caveats
- {Limitation of this benchmark}
- {Condition under which results may differ}
## Reproduction
```bash
# Environment setup
{commands to recreate the environment}
# Run benchmark
{commands to execute the benchmark}
## Configuring Scope
| Mode | Candidates | Depth | When to Use |
|------|-----------|-------|-------------|
| `quick` | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| `standard` | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| `rigorous` | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |
## Calibration Rules
1. **Measure, don't guess.** Intuition about performance is unreliable. "Obviously
faster" is not a benchmark result.
2. **Apples to apples.** Candidates must be compared under identical conditions.
Different hardware, configuration, or input data invalidates the comparison.
3. **Report variance, not just means.** A mean of 50ms with std dev of 100ms is not
the same as a mean of 50ms with std dev of 2ms. Always report spread.
4. **Warm up before measuring.** First-run performance includes JIT, cache warming,
and connection setup. Exclude warmup iterations from results.
5. **Representative inputs only.** Benchmarking with synthetic best-case input is
misleading. Use data that resembles production workloads.
6. **State the winner per metric, not overall.** "A is better" is lazy. "A has lower
latency; B uses less memory" is useful.
## Error Handling
| Problem | Resolution |
|---------|------------|
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |
## When NOT to Benchmark
Push back if:
- The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
- The candidates are fundamentally different tools (comparing a database to a message queue)
- The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
- Results from others already exist and conditions match — link to existing benchmarks instead
Weekly Installs
12
Repository
mathews-tom/pra…s-skillsGitHub Stars
11
First Seen
13 days ago
Security Audits
Installed on
opencode12
gemini-cli12
claude-code12
github-copilot12
codex12
kimi-cli12