# Perf Skill

**Quick Ref:** `/perf profile <target>` | `/perf bench <target>` | `/perf compare <baseline> <candidate>` | `/perf optimize <target>`
**YOU MUST EXECUTE THIS WORKFLOW.** Do not just describe it.
Performance profiling, benchmarking, regression detection, and optimization recommendations for any language runtime. Produces actionable metrics, not vague advice.
## Modes

| Mode | Command | Purpose |
|---|---|---|
| Profile | `/perf profile <target>` | Profile execution, find hotspots |
| Benchmark | `/perf bench <target>` | Create or run benchmarks |
| Compare | `/perf compare <baseline> <candidate>` | Compare two runs for regressions |
| Optimize | `/perf optimize <target>` | Analyze and apply optimizations |

If no mode is specified, default to `profile`.
## Step 0: Detect Language and Tooling

Identify the language/runtime from file extensions, `go.mod`, `package.json`, `pyproject.toml`, `Cargo.toml`, or explicit user input. Select the profiling stack:

| Language | Benchmarking | CPU Profile | Memory Profile | Comparison |
|---|---|---|---|---|
| Go | `go test -bench` | `go tool pprof` (cpu) | `go tool pprof` (alloc) | `benchstat` |
| Python | `pytest-benchmark`, `timeit` | `cProfile`, `py-spy` | `memory_profiler`, `tracemalloc` | manual diff |
| Node | `benchmark.js`, `vitest bench` | `--prof`, `clinic.js` | `--heap-prof`, `0x` | manual diff |
| Rust | `criterion`, `cargo bench` | `cargo flamegraph` | `heaptrack`, DHAT | `critcmp` |
| Shell | `hyperfine` | `time`, `strace` | N/A | `hyperfine` built-in |

Check which tools are actually installed. If a preferred tool is missing, fall back to standard-library alternatives before asking the user to install anything.
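The tool check and fallback can be sketched in a few lines of Python; `pick_tool` is a hypothetical helper, not part of any of the tools above:

```python
import shutil

def pick_tool(candidates: list[str], fallback: str) -> str:
    """Return the first installed tool from a preference-ordered list,
    or the standard-library fallback if none is on PATH."""
    for tool in candidates:
        if shutil.which(tool):  # path to the executable, or None
            return tool
    return fallback

# Prefer py-spy for CPU profiling, fall back to the always-available cProfile.
cpu_profiler = pick_tool(["py-spy"], fallback="cProfile")
print(cpu_profiler)
```

The same pattern covers the other rows of the table (e.g. `hyperfine` falling back to `time`).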
## Step 1: Establish Baseline

Run existing benchmarks first. If none exist, create them.

### 1a. Find Existing Benchmarks
```bash
# Go
grep -r "func Benchmark" --include="*_test.go" -l .
# Python
find . -name "test_*" -exec grep -l "benchmark\|@pytest.mark.benchmark" {} +
# Rust
grep -r "#\[bench\]" --include="*.rs" -l .
# Node
find . -name "*.bench.*" -o -name "*.benchmark.*"
```
### 1b. Run or Create Benchmarks

If benchmarks exist for the target, run them and capture output. If none exist, write benchmarks covering the target function or module.

Benchmark requirements:
- Measure wall-clock time (ops/sec or ns/op)
- Measure memory allocations (bytes/op, allocs/op)
- Run enough iterations for statistical stability (Go: `-benchtime=3s -count=5`)
- Record latency percentiles where applicable: p50, p95, p99

Save raw baseline output to `.agents/perf/baseline-YYYY-MM-DD.txt`.
### 1c. Go Benchmark Template

```go
func BenchmarkTargetFunction(b *testing.B) {
    // Setup outside the loop so it is not measured
    input := prepareInput()
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        TargetFunction(input)
    }
}
```
### 1d. Python Benchmark Template

```python
import pytest

@pytest.mark.benchmark(group="target")
def test_target_benchmark(benchmark):
    input_data = prepare_input()
    result = benchmark(target_function, input_data)
    assert result is not None
```
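If `pytest-benchmark` is unavailable, a rough standard-library fallback with `timeit` looks like this; `prepare_input` and `target_function` are stand-ins defined here only for illustration:

```python
import timeit
import statistics

def prepare_input():
    return list(range(1000))

def target_function(data):
    return sum(data)

# Prepare once, outside the timed call, so setup is not measured.
data = prepare_input()

# Several repeats; the best run is least affected by scheduler noise.
runs = timeit.repeat(lambda: target_function(data), repeat=5, number=1000)
per_call = [r / 1000 for r in runs]
print(f"best: {min(per_call)*1e9:.0f} ns/op, "
      f"median: {statistics.median(per_call)*1e9:.0f} ns/op")
```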
## Step 2: Profile and Identify Hotspots

### CPU Profiling

Find functions consuming the most CPU time.

Go:
```bash
go test -bench=BenchmarkTarget -cpuprofile=cpu.prof ./...
go tool pprof -top cpu.prof
go tool pprof -text -cum cpu.prof  # cumulative view
```

Python:
```bash
python -m cProfile -s cumulative target_script.py
# Or for running processes:
py-spy top --pid <PID>
py-spy record -o profile.svg --pid <PID>
```
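When only part of a program should be profiled, `cProfile` can also be driven programmatically; `hot_function` below is an illustrative stand-in workload:

```python
import cProfile
import io
import pstats

def hot_function():
    # Stand-in CPU-bound workload
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Sort by cumulative time and print the top 10 entries
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```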
### Memory Profiling

Find allocation hotspots and potential leaks.

Go:
```bash
go test -bench=BenchmarkTarget -memprofile=mem.prof ./...
go tool pprof -top -alloc_space mem.prof
```

Python:
```bash
python -m memory_profiler target_script.py
# Or with tracemalloc in code:
# tracemalloc.start(); ...; snapshot = tracemalloc.take_snapshot()
```
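The `tracemalloc` one-liner above expands to roughly this sketch, which ranks allocations by source line; the list comprehension is a stand-in workload:

```python
import tracemalloc

tracemalloc.start()

# Stand-in allocation-heavy workload
data = [str(i) * 10 for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
# Group allocations by source line and show the top 5
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

tracemalloc.stop()
```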
### I/O Profiling

Identify blocking operations in hot paths.
- Check for synchronous file I/O, network calls, or database queries inside loops.
- Look for missing connection pooling, unbuffered writes, or serial HTTP calls that could be concurrent.

### Hotspot Summary

After profiling, produce a ranked list:

```text
HOTSPOTS (by cumulative CPU time):
1. pkg/engine.Process     42.3% (1.2s)  — main processing loop
2. pkg/engine.parseRecord 28.1% (0.8s)  — record deserialization
3. pkg/io.ReadBatch       15.7% (0.45s) — disk reads
```
## Step 3: Analyze and Recommend
Classify each finding by estimated impact:
| Impact | Criteria | Action |
|---|---|---|
| High | >20% of total time or >50% of allocations | Fix immediately |
| Medium | 5-20% of total time or notable allocation waste | Fix in this session |
| Low | <5% of total time, minor inefficiency | Log for later |
### Common Anti-Patterns

Check the profiled code against these known performance killers:
- Unnecessary allocations — allocating inside hot loops, string concatenation in loops (use `strings.Builder`, `[]byte`, or `io.StringWriter`)
- N+1 queries — database call per item instead of a batch query
- Missing caching — recomputing expensive results that rarely change
- Blocking I/O in hot path — synchronous network/disk calls where async or buffered I/O would work
- Excessive copying — passing large structs by value, copying slices instead of slicing
- Suboptimal data structures — linear search where a map lookup works, unbounded slice growth without pre-allocation
- Lock contention — mutex held across I/O or long computation
- Regex compilation in loops — compile once, reuse the compiled pattern
- Reflection in hot paths — replace with code generation or type switches
- Unbuffered channels — causing goroutine scheduling overhead in Go
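Two of these patterns in a minimal Python sketch — string concatenation in a loop versus a single `join`, and a regex compiled once instead of inside the loop (all names are illustrative):

```python
import re

words = ["alpha", "beta", "gamma"] * 1000

# Anti-pattern: repeated string concatenation builds a new string each pass
def concat_slow(items):
    out = ""
    for w in items:
        out += w + ","
    return out

# Fix: build the result once with join
def concat_fast(items):
    return ",".join(items) + ","

# Fix for regex-in-loop: compile once at module level, reuse the pattern
VOWEL = re.compile(r"[aeiou]")

def count_vowel_words(items):
    return sum(1 for w in items if VOWEL.search(w))

assert concat_slow(words) == concat_fast(words)
```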
For each finding, state:
- What: the specific code location and pattern
- Why: how it hurts performance (with numbers from profiling)
- Fix: concrete code change recommendation
## Step 4: Optimize (optimize mode only)

Critical rule: ONE optimization at a time.

For each optimization:
1. Describe the change before making it
2. Apply the single change
3. Re-run the benchmark suite
4. Compare results against baseline using `benchstat` (Go) or a manual diff
5. Keep or revert — only keep changes that measurably improve metrics
6. Commit with message format: `perf(<scope>): <description> (+X% throughput)` or `perf(<scope>): <description> (-X% latency)`
### Acceptance Criteria

- Improvement must be statistically significant (p < 0.05 for `benchstat`, or a >5% consistent change for manual comparison)
- No correctness regressions — all existing tests must still pass
- No readability destruction for marginal gains (a <2% improvement does not justify obfuscated code)
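For the manual-comparison path, one plausible reading of the ">5% consistent change" rule is that every candidate run must land on the same side of the threshold; a sketch with made-up timings:

```python
def manual_verdict(baseline_runs, candidate_runs, threshold=0.05):
    """Compare each candidate run against the baseline median; require
    all runs to agree before declaring anything but NOISE."""
    base = sorted(baseline_runs)[len(baseline_runs) // 2]  # median
    deltas = [(c - base) / base for c in candidate_runs]
    if all(d < -threshold for d in deltas):
        return "IMPROVEMENT"
    if all(d > threshold for d in deltas):
        return "REGRESSION"
    return "NOISE"

print(manual_verdict([100, 102, 101], [90, 91, 89]))  # → IMPROVEMENT
```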
### Optimization Order

Apply optimizations in this order (highest expected impact first):
1. Algorithmic improvements (O(n^2) to O(n log n), etc.)
2. Allocation reduction (pre-allocate, pool, reuse buffers)
3. I/O batching and buffering
4. Caching and memoization
5. Concurrency improvements (parallelize independent work)
6. Micro-optimizations (only if profiling confirms they matter)
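Caching and memoization (item 4) can be as small as a decorator when the function is pure; `expensive` is an illustrative stand-in, and the counter exists only to show the cache working:

```python
from functools import lru_cache

call_count = 0

# Memoize an expensive pure function; safe only when the result depends
# solely on the arguments and rarely changes.
@lru_cache(maxsize=1024)
def expensive(n: int) -> int:
    global call_count
    call_count += 1
    return sum(i * i for i in range(n))

expensive(10_000)
expensive(10_000)  # served from cache; the body runs once
print(call_count)  # → 1
```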
## Step 5: Output Report

Write the report to `.agents/perf/YYYY-MM-DD-perf-<target>.md`.

### Report Template
```markdown
# Performance Report: <target>

Date: YYYY-MM-DD
Mode: <profile|bench|compare|optimize>
Language: <detected>

## Summary
<1-2 sentence summary of findings>

## Baseline Metrics
| Metric | Value |
|--------|-------|
| ops/sec | ... |
| ns/op | ... |
| B/op | ... |
| allocs/op | ... |
| p50 latency | ... |
| p95 latency | ... |
| p99 latency | ... |

## Hotspots
<ranked list from Step 2>

## Findings
<classified findings from Step 3>

## Optimizations Applied (if optimize mode)
| Change | Before | After | Improvement |
|--------|--------|-------|-------------|
| ... | ... | ... | +X% |

## After Metrics (if optimize mode)
<same table as baseline, with new values>

## Recommendations
<remaining opportunities not addressed in this session>
```
## Compare Mode Details

When running `/perf compare <baseline> <candidate>`:
- Locate or re-run benchmarks for both versions
- Use language-native comparison tools:
  - Go: `benchstat baseline.txt candidate.txt`
  - Rust: `critcmp baseline candidate`
  - Other: side-by-side table with percentage deltas
- Flag regressions (>5% slower or >10% more allocations) as REGRESSION
- Flag improvements (>5% faster or >10% fewer allocations) as IMPROVEMENT
- Flag statistically insignificant changes as NOISE
Output a summary table:
COMPARISON: baseline vs candidate
| Benchmark | Baseline | Candidate | Delta | Verdict |
|-----------|----------|-----------|-------|---------|
| BenchmarkProcess | 1.2ms | 0.9ms | -25% | IMPROVEMENT |
| BenchmarkParse | 450ns | 480ns | +6.7% | REGRESSION |
| BenchmarkIO | 3.1ms | 3.0ms | -3.2% | NOISE |
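For the "manual diff" languages, rows like the ones above can be generated with a short script; the benchmark names and timings below are made up:

```python
# Benchmark name -> mean time in nanoseconds (illustrative numbers)
baseline = {"process": 1_200_000, "parse": 450}
candidate = {"process": 900_000, "parse": 480}

for name in baseline:
    delta = (candidate[name] - baseline[name]) / baseline[name]
    if delta > 0.05:
        verdict = "REGRESSION"
    elif delta < -0.05:
        verdict = "IMPROVEMENT"
    else:
        verdict = "NOISE"
    print(f"| {name} | {baseline[name]} | {candidate[name]} | "
          f"{delta:+.1%} | {verdict} |")
```

Note this applies the time-delta thresholds only; allocation deltas would need a second pair of columns.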
## Edge Cases

- No benchmarks and no clear target: run `/complexity` first to identify hot paths, then benchmark those.
- Flaky benchmarks: increase iteration count, pin to a single core (`GOMAXPROCS=1`), close competing processes.
- Cannot install profiling tools: fall back to `time` for wall-clock and manual instrumentation for allocation counts.
- Target is a CLI command: use `hyperfine` for wall-clock benchmarking across any language.
## See Also

- `complexity` — find high-complexity code to target
- `standards` — language-specific optimization patterns
- `vibe` — validate optimized code quality