Performance Engineering

Evidence-based performance optimization: measure → profile → optimize → validate.

<when_to_use>

Profiling slow code paths or bottlenecks
Identifying memory leaks or excessive allocations
Optimizing latency-critical operations (P95, P99)
Benchmarking competing implementations
Database query optimization
Reducing CPU usage in hot paths
Improving throughput (RPS, ops/sec)

NOT for: premature optimization, optimization without measurement, guessing at bottlenecks

</when_to_use>

<iron_law>

NO OPTIMIZATION WITHOUT MEASUREMENT

Required workflow:

Measure baseline performance with realistic workload
Profile to identify actual bottleneck
Optimize the bottleneck (not what you think is slow)
Measure again to verify improvement
Document gains and tradeoffs

Optimizing unmeasured code wastes time and introduces bugs.

</iron_law>

Use TodoWrite to track optimization process:

Phase 1: Establishing baseline

content: "Establish performance baseline with realistic workload"
activeForm: "Establishing performance baseline"

Phase 2: Profiling bottlenecks

content: "Profile code to identify actual bottlenecks"
activeForm: "Profiling code to identify bottlenecks"

Phase 3: Analyzing root cause

content: "Analyze profiling data to determine root cause"
activeForm: "Analyzing profiling data"

Phase 4: Implementing optimization

content: "Implement targeted optimization for identified bottleneck"
activeForm: "Implementing optimization"

Phase 5: Validating improvement

content: "Measure performance gains and verify no regressions"
activeForm: "Validating performance improvement"

Key Performance Indicators

Latency (response time):

P50 (median) — typical case
P95 — most users
P99 — tail latency
P99.9 — outliers
TTFB — time to first byte
TTLB — time to last byte

Throughput:

RPS — requests per second
ops/sec — operations per second
bytes/sec — data transfer rate
queries/sec — database throughput

Memory:

Heap usage — allocated memory
GC frequency — how often collections run
GC duration — pause (stop-the-world) time per collection
Allocation rate — memory churn
Resident set size (RSS) — physical memory held by the process

CPU:

CPU time — total compute
Wall time — elapsed time
Hot paths — frequently executed code
Time complexity — algorithmic efficiency
CPU utilization — percentage used

Always measure:

Before optimization (baseline)
After optimization (improvement)
Under realistic load (not toy data)
Multiple runs (account for variance)
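
The latency percentiles listed above can be computed directly from raw samples collected over multiple runs. A minimal sketch using the nearest-rank method; the `percentile` helper and the sample values are illustrative assumptions, not part of any library:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. Assumes 0 < p <= 100.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical latencies in milliseconds from ten runs; one run is an outlier.
const latenciesMs = [12, 15, 11, 14, 13, 95, 12, 16, 13, 14];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
console.log(`P50=${p50}ms P95=${p95}ms`); // P50=13ms P95=95ms
```

The single slow run barely moves the median but dominates P95, which is why tail percentiles are tracked separately from the typical case.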

<profiling_tools>

TypeScript/Bun

Built-in timing:

console.time('operation')
// ... code to measure
console.timeEnd('operation')

// High precision
const start = Bun.nanoseconds()
// ... code to measure
const elapsed = Bun.nanoseconds() - start
console.log(`Took ${elapsed / 1_000_000}ms`)

Performance API:

// Marks are looked up by name, so their return values are not needed
performance.mark('start')
// ... code to measure
performance.mark('end')
performance.measure('operation', 'start', 'end')
const measure = performance.getEntriesByName('operation')[0]
console.log(`Duration: ${measure.duration}ms`)
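
The timing primitives above capture a single run, which is noisy. One way to account for variance is to discard a few warm-up runs and report the median of the rest. A sketch under that assumption; `medianRunMs` and its parameters are hypothetical names, and the injectable `now` clock exists only to make the helper testable:

```typescript
// Runs `fn` several times and returns the median duration in milliseconds.
function medianRunMs(
  fn: () => unknown,
  runs = 10,
  warmup = 3,
  now: () => number = () => performance.now(),
): number {
  if (runs < 1) throw new Error("runs must be >= 1");
  // Warm-up runs let JIT compilation and caches settle; results are discarded.
  for (let i = 0; i < warmup; i++) fn();

  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    durations.push(now() - start);
  }
  durations.sort((a, b) => a - b);
  const mid = Math.floor(durations.length / 2);
  // Even count: average the two middle values.
  return durations.length % 2 === 1
    ? durations[mid]
    : (durations[mid - 1] + durations[mid]) / 2;
}

// Example workload (hypothetical): sum one million integers.
const medianMs = medianRunMs(() => {
  let sum = 0;
  for (let i = 0; i < 1_000_000; i++) sum += i;
  return sum;
});
console.log(`median: ${medianMs.toFixed(3)}ms`);
```

The median is less sensitive to a stray GC pause than the mean. For rigorous comparisons a dedicated benchmarking tool is still preferable.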

Memory profiling:

Chrome DevTools → Memory tab → heap snapshots
Node.js --inspect flag + Chrome DevTools
process.memoryUsage() for RSS/heap tracking

CPU profiling:

Chrome DevTools → Performance tab → record session
Node.js --prof flag + node --prof-process
Flamegraphs for visualization

Rust

Benchmarking:

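// benches/my_benchmark.rs
// Cargo.toml needs a [[bench]] entry with `harness = false` for criterion_main! to work
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use my_crate::my_function; // hypothetical: the function under test

fn benchmark_function(c: &mut Criterion) {
    c.bench_function("my_function", |b| {
        // black_box keeps the compiler from constant-folding the input
        b.iter(|| my_function(black_box(42)))
    });
}

criterion_group!(benches, benchmark_function);
criterion_main!(benches);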

Profiling:

cargo bench — criterion benchmarks
perf record + perf report — Linux profiling
cargo flamegraph — visual flamegraphs
cargo bloat — binary size analysis
valgrind --tool=callgrind — detailed profiling
heaptrack — memory profiling

Instrumentation:

use std::time::Instant;

let start = Instant::now();
// ... code to measure
let duration = start.elapsed();
println!("Took: {:?}", duration);

</profiling_tools>

<optimization_patterns>

Algorithm Improvements

Time complexity:

O(n²) → O(n log n) — sorting, searching
O(n) → O(log n) — binary search, trees
O(n) → O(1) — hash maps, memoization
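
As an illustration of the last row, replacing a linear scan inside a loop with a hash-based lookup turns an O(n × m) intersection into O(n + m) on average. A small sketch; the function names and input arrays are made up for the example:

```typescript
// O(n * m): for every id in `a`, scan all of `b`.
function intersectQuadratic(a: number[], b: number[]): number[] {
  return a.filter((x) => b.includes(x));
}

// O(n + m) on average: build a Set once, then each lookup is O(1).
function intersectWithSet(a: number[], b: number[]): number[] {
  const seen = new Set(b);
  return a.filter((x) => seen.has(x));
}

const idsA = [1, 2, 3, 4, 5];
const idsB = [4, 5, 6, 7];
console.log(intersectQuadratic(idsA, idsB)); // [ 4, 5 ]
console.log(intersectWithSet(idsA, idsB)); // [ 4, 5 ]
```

Both versions return the same result. Whether the Set version is actually faster for small inputs is something to measure, since building the Set has its own cost.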

Space-time tradeoffs:

Cache computed results (memoization)
Precompute expensive operations
Index data for faster lookup
Use hash maps for O(1) access
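
Memoization is the simplest of these tradeoffs to sketch: spend memory on a cache to skip repeated computation. The example below assumes a pure, single-argument function; `memoize`, `slowSquare`, and the call counter are illustrative names:

```typescript
// Wraps a single-argument pure function with a Map-backed cache.
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<A, R>();
  return (arg: A): R => {
    if (cache.has(arg)) return cache.get(arg) as R;
    const result = fn(arg);
    cache.set(arg, result);
    return result;
  };
}

// Stand-in for an expensive function; the counter shows how often it really runs.
let slowSquareCalls = 0;
function slowSquare(n: number): number {
  slowSquareCalls++;
  return n * n;
}

const fastSquare = memoize(slowSquare);
fastSquare(12); // computed
fastSquare(12); // served from the cache
console.log(`underlying calls: ${slowSquareCalls}`); // underlying calls: 1
```

The cache here is unbounded, so memory grows with the number of distinct arguments. A size limit or eviction policy is usually needed for long-running processes.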

Memory Optimization

Reduce allocations:

// Bad: creates new array each iteration
for (const item of items) {
  const results = []
  results.push(process(item))
}

// Good: reuse array
const results = []
for (const item of items) {
  results.push(process(item))
}

// Bad: allocates String every time
fn format_user(name: &str) -> String {
    format!("User: {}", name)
}

// Good: reuses buffer
fn format_user(name: &str, buf: &mut String) {
    buf.clear();
    buf.push_str("User: ");
    buf.push_str(name);
}

Memory pooling:

Reuse expensive objects (connections, buffers)
Object pools for frequently allocated types
Arena allocators for batch allocations
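
An object pool can be sketched in a few lines. This is a minimal single-threaded version with no size cap; the `Pool` class and the 64 KiB buffer size are assumptions made for illustration:

```typescript
// A minimal object pool: release() returns objects for reuse instead of
// leaving them for the garbage collector.
class Pool<T> {
  private free: T[] = [];
  private create: () => T;
  private reset: (item: T) => void;
  created = 0; // how many objects were actually allocated

  constructor(create: () => T, reset: (item: T) => void) {
    this.create = create;
    this.reset = reset;
  }

  acquire(): T {
    const item = this.free.pop();
    if (item !== undefined) return item;
    this.created++;
    return this.create();
  }

  release(item: T): void {
    this.reset(item); // clear state so the next user gets a clean object
    this.free.push(item);
  }
}

// Hypothetical use: reuse 64 KiB scratch buffers.
const bufferPool = new Pool<Uint8Array>(
  () => new Uint8Array(64 * 1024),
  (buf) => buf.fill(0),
);

const first = bufferPool.acquire();
first[0] = 42;
bufferPool.release(first);
const second = bufferPool.acquire(); // same buffer, zeroed by reset
console.log(second === first, bufferPool.created); // true 1
```

Pooling only pays off when allocation shows up in a profile. It also adds a new failure mode: using an object after releasing it.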

Lazy evaluation:

Compute only when needed
Stream processing vs loading all data
Iterators over materialized collections
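
Generators are one way to get lazy evaluation in TypeScript: each stage pulls one item at a time, so nothing is materialized until it is needed. A sketch with hypothetical helper names (`naturals`, `mapLazy`, `filterLazy`, `take`):

```typescript
// Lazily yields 0, 1, 2, ... with no upper bound.
function* naturals(): Generator<number> {
  let n = 0;
  while (true) yield n++;
}

// Lazy map, filter and take: each pulls one item at a time from its source.
function* mapLazy<T, U>(src: Iterable<T>, f: (x: T) => U): Generator<U> {
  for (const x of src) yield f(x);
}
function* filterLazy<T>(src: Iterable<T>, keep: (x: T) => boolean): Generator<T> {
  for (const x of src) if (keep(x)) yield x;
}
function* take<T>(src: Iterable<T>, count: number): Generator<T> {
  if (count <= 0) return;
  let taken = 0;
  for (const x of src) {
    yield x;
    if (++taken >= count) return; // stop before pulling another item
  }
}

// The source is infinite, but only as many values as needed are produced.
const firstEvenSquares = [
  ...take(filterLazy(mapLazy(naturals(), (n) => n * n), (n) => n % 2 === 0), 3),
];
console.log(firstEvenSquares); // [ 0, 4, 16 ]
```

An eager version built on `Array.prototype.map` and `filter` would have to allocate every intermediate array in full, which is impossible for an unbounded source and wasteful for a large one.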

I/O Optimization

Batching:

Batch API calls (1 request vs 100)
Batch database writes (bulk insert)
Batch file operations (single write vs many)
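
Batching usually comes down to splitting work into fixed-size chunks and sending each chunk in one call. A sketch in which `chunk` is a hypothetical helper and `bulkInsert` stands in for a real bulk API:

```typescript
// Splits `items` into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical sink: each call stands in for one network round trip.
let roundTrips = 0;
let rowsWritten = 0;
function bulkInsert(batch: number[]): void {
  roundTrips++;
  rowsWritten += batch.length;
}

const rows = Array.from({ length: 250 }, (_, i) => i);
for (const batch of chunk(rows, 100)) bulkInsert(batch);
console.log(`${rowsWritten} rows in ${roundTrips} round trips`); // 250 rows in 3 round trips
```

The right batch size depends on payload limits and latency of the target system, so treat 100 as a placeholder to tune by measurement.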

Caching:

Cache expensive computations
Cache database queries (Redis, in-memory)
Cache API responses (HTTP caching)
Invalidate stale cache entries
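
Invalidation is the hard part of caching. A time-to-live (TTL) is one common policy: entries expire after a fixed age and are dropped on the next read. A minimal in-memory sketch; `TtlCache` is an illustrative class, and the injectable clock is an assumption that keeps the example deterministic:

```typescript
class TtlCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // stale: invalidate on read
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Fake clock so expiry can be shown without waiting.
let clockMs = 0;
const cache = new TtlCache<string, number>(1000, () => clockMs);
cache.set("user:1", 42);
const fresh = cache.get("user:1"); // within the TTL
clockMs = 1500;
const stale = cache.get("user:1"); // past the TTL
console.log(fresh, stale); // 42 undefined
```

Entries that are never read again are never evicted in this sketch, so a production cache would also need a size bound or a periodic sweep.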

Async I/O:

Non-blocking operations (async/await)
Concurrent requests (Promise.all, tokio::spawn)
Connection pooling (reuse connections)
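
The difference between sequential and concurrent requests can be sketched with `Promise.all`. Here `fakeFetch` is a hypothetical stand-in for a network call that resolves after a fixed delay:

```typescript
// Resolves with `value` after `ms` milliseconds; stands in for a network call.
function fakeFetch<T>(value: T, ms: number): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Sequential: each request waits for the previous one, so total time is
// roughly the sum of all delays.
async function fetchSequential(ids: number[]): Promise<number[]> {
  const out: number[] = [];
  for (const id of ids) out.push(await fakeFetch(id * 10, 20));
  return out;
}

// Concurrent: all requests start immediately, so total time is roughly the
// slowest single delay. Promise.all preserves input order in its result.
function fetchConcurrent(ids: number[]): Promise<number[]> {
  return Promise.all(ids.map((id) => fakeFetch(id * 10, 20)));
}

const concurrentResult = fetchConcurrent([1, 2, 3]);
concurrentResult.then((values) => console.log(values)); // [ 10, 20, 30 ]
```

Unbounded `Promise.all` over a large input can overwhelm the downstream service, so real code often caps the number of requests in flight.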

Database Optimization

Query optimization:

Add indexes for common queries
Use EXPLAIN/EXPLAIN ANALYZE
Avoid N+1 queries (use joins or batch loading)
Select only needed columns
Filter at database level (WHERE vs client filter)
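
The N+1 problem is easiest to see by counting queries. The sketch below uses a made-up in-memory "database" (`authorsTable`, `findAuthorById`, `findAuthorsByIds`) in place of a real driver; only the query counts matter:

```typescript
type Post = { id: number; authorId: number };
type Author = { id: number; name: string };

// Hypothetical in-memory table plus a counter for issued queries.
const authorsTable: Author[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
];
let queryCount = 0;
function findAuthorById(id: number): Author | undefined {
  queryCount++; // stands in for: SELECT ... WHERE id = ?
  return authorsTable.find((a) => a.id === id);
}
function findAuthorsByIds(ids: number[]): Author[] {
  queryCount++; // stands in for: SELECT ... WHERE id IN (...)
  const wanted = new Set(ids);
  return authorsTable.filter((a) => wanted.has(a.id));
}

const posts: Post[] = [
  { id: 10, authorId: 1 },
  { id: 11, authorId: 2 },
  { id: 12, authorId: 1 },
];

// N+1 pattern: one author lookup per post.
queryCount = 0;
const namesNPlusOne = posts.map((p) => findAuthorById(p.authorId)?.name);
const nPlusOneQueries = queryCount;

// Batched: one lookup for all distinct author ids, then join in memory.
queryCount = 0;
const authorIds = Array.from(new Set(posts.map((p) => p.authorId)));
const byId = new Map(findAuthorsByIds(authorIds).map((a) => [a.id, a] as const));
const namesBatched = posts.map((p) => byId.get(p.authorId)?.name);
const batchedQueries = queryCount;

console.log(nPlusOneQueries, batchedQueries); // 3 1
```

Both approaches produce the same names. The batched version issues a constant number of queries no matter how many posts are loaded.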

Schema design:

Normalize to reduce duplication
Denormalize for read-heavy workloads
Partition large tables
Use appropriate data types

Connection management:

Connection pooling (don't create per request)
Prepared statements (avoid re-parsing SQL on every call)
Transaction batching (reduce round trips)

</optimization_patterns>

Loop: Measure → Profile → Analyze → Optimize → Validate

Define performance goal — target metric (e.g., P95 < 100ms)
Establish baseline — measure current performance under realistic load
Profile systematically — identify actual bottleneck (not guesses)
Analyze root cause — understand why code is slow
Design optimization — plan targeted improvement
Implement optimization — make focused change
Measure improvement — verify gains, check for regressions
Document results — record baseline, optimization, gains, tradeoffs

At each step:

Document measurements with methodology
Note profiling tool output
Track optimization attempts (what worked/failed)
Update performance documentation

Before declaring optimization complete:

Check gains:

✓ Measured improvement meets target?
✓ Improvement statistically significant?
✓ Tested under realistic load?
✓ Multiple runs confirm consistency?
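
One rough way to check the first two items is to compare baseline and optimized samples with Welch's t statistic. This is a sketch, not a full significance test: the sample timings are invented, and turning the statistic into a p-value requires the t distribution, which is left out here:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1); needs at least two samples.
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic for two independent samples with unequal variances.
function welchT(a: number[], b: number[]): number {
  const standardError = Math.sqrt(
    sampleVariance(a) / a.length + sampleVariance(b) / b.length,
  );
  return (mean(a) - mean(b)) / standardError;
}

// Hypothetical timings in milliseconds from five runs each.
const baselineMs = [102, 98, 101, 99, 100];
const optimizedMs = [81, 79, 80, 82, 78];

const speedup = mean(baselineMs) / mean(optimizedMs);
const tStat = welchT(baselineMs, optimizedMs);
console.log(`speedup: ${speedup}x, t = ${tStat.toFixed(1)}`); // speedup: 1.25x, t = 20.0
```

A large absolute t value suggests the difference is unlikely to be run-to-run noise. With few samples or heavily skewed timings, more runs or a non-parametric test may be a better fit.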

Check regressions:

✓ No degradation in other metrics?
✓ Memory usage still acceptable?
✓ Code complexity still manageable?
✓ Tests still pass?

Check documentation:

✓ Baseline measurements recorded?
✓ Optimization approach explained?
✓ Gains quantified with numbers?
✓ Tradeoffs documented?

ALWAYS:

Measure before optimizing (baseline)
Profile to find actual bottleneck
Use realistic workload (not toy data)
Measure multiple runs (account for variance)
Document baseline and improvements
Check for regressions in other metrics
Consider readability vs performance tradeoff
Verify statistical significance

NEVER:

Optimize without measuring first
Guess at bottleneck without profiling
Benchmark with unrealistic data
Trust single-run measurements
Skip documentation of results
Sacrifice correctness for speed
Optimize without clear performance goal
Ignore algorithmic improvements

Methodology:

benchmarking.md — rigorous benchmarking methodology

Related skills:

codebase-analysis — evidence-based investigation (foundation)
debugging-and-diagnosis — structured bug investigation
typescript-dev — correctness before performance
