performance-engineering
Performance Engineering
Evidence-based performance optimization: measure → profile → optimize → validate.
<when_to_use>
- Profiling slow code paths or bottlenecks
- Identifying memory leaks or excessive allocations
- Optimizing latency-critical operations (P95, P99)
- Benchmarking competing implementations
- Database query optimization
- Reducing CPU usage in hot paths
- Improving throughput (RPS, ops/sec)
NOT for: premature optimization, optimization without measurement, guessing at bottlenecks
</when_to_use>
<iron_law>
NO OPTIMIZATION WITHOUT MEASUREMENT
Required workflow:
- Measure baseline performance with realistic workload
- Profile to identify actual bottleneck
- Optimize the bottleneck (not what you think is slow)
- Measure again to verify improvement
- Document gains and tradeoffs
Optimizing unmeasured code wastes time and introduces bugs.
</iron_law>
Use TodoWrite to track the optimization process:
Phase 1: Establishing baseline
- content: "Establish performance baseline with realistic workload"
- activeForm: "Establishing performance baseline"
Phase 2: Profiling bottlenecks
- content: "Profile code to identify actual bottlenecks"
- activeForm: "Profiling code to identify bottlenecks"
Phase 3: Analyzing root cause
- content: "Analyze profiling data to determine root cause"
- activeForm: "Analyzing profiling data"
Phase 4: Implementing optimization
- content: "Implement targeted optimization for identified bottleneck"
- activeForm: "Implementing optimization"
Phase 5: Validating improvement
- content: "Measure performance gains and verify no regressions"
- activeForm: "Validating performance improvement"
Key Performance Indicators
Latency (response time):
- P50 (median) — typical case
- P95 — most users
- P99 — tail latency
- P99.9 — outliers
- TTFB — time to first byte
- TTLB — time to last byte
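The percentiles listed above can be derived from raw latency samples. The following is a minimal sketch using the nearest-rank method; the `percentile` helper and the sample values are illustrative assumptions, not part of any library.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. `percentile` is a hypothetical helper.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Illustrative latencies in milliseconds (made-up values with one outlier).
const latenciesMs = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14];

const summary = {
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
console.log(summary); // { p50: 13, p95: 250, p99: 250 }
```

Note how a single slow request leaves the median untouched but dominates the tail percentiles, which is why P95 and P99 are tracked separately from P50.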
Throughput:
- RPS — requests per second
- ops/sec — operations per second
- bytes/sec — data transfer rate
- queries/sec — database throughput
Memory:
- Heap usage — allocated memory
- GC frequency — how often garbage collection runs
- GC duration — stop-the-world pause time
- Allocation rate — memory churn
- Resident set size (RSS) — physical memory held by the process
CPU:
- CPU time — total compute
- Wall time — elapsed time
- Hot paths — frequently executed code
- Time complexity — algorithmic efficiency
- CPU utilization — percentage used
Always measure:
- Before optimization (baseline)
- After optimization (improvement)
- Under realistic load (not toy data)
- Multiple runs (account for variance)
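One way to follow the measurement rules above is a small harness that warms up, runs the workload several times, and reports the spread rather than a single number. This is a rough sketch: the `bench` helper, the run counts, and the placeholder workload are all assumptions for illustration.

```typescript
// Hypothetical helper: time `fn` over several runs after a warm-up phase
// and report min, median, and max in milliseconds.
function bench(fn: () => void, runs = 30, warmup = 5) {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT settle before timing
  const timesMs: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    timesMs.push(performance.now() - start);
  }
  timesMs.sort((a, b) => a - b);
  return {
    runs,
    minMs: timesMs[0],
    medianMs: timesMs[Math.floor(timesMs.length / 2)],
    maxMs: timesMs[timesMs.length - 1],
  };
}

// Placeholder workload; substitute the real code path and realistic input.
const input = Array.from({ length: 50_000 }, (_, i) => (i * 7919) % 10_007);
const baseline = bench(() => {
  [...input].sort((a, b) => a - b);
});
console.log(baseline);
```

Running the same harness before and after a change gives a baseline and an improvement figure measured the same way.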
<profiling_tools>
TypeScript/Bun
Built-in timing:
console.time('operation')
// ... code to measure
console.timeEnd('operation')
// High precision
const start = Bun.nanoseconds()
// ... code to measure
const elapsed = Bun.nanoseconds() - start
console.log(`Took ${elapsed / 1_000_000}ms`)
Performance API:
performance.mark('start')
// ... code to measure
performance.mark('end')
// measure() records the elapsed time between the two named marks
performance.measure('operation', 'start', 'end')
const measure = performance.getEntriesByName('operation')[0]
console.log(`Duration: ${measure.duration}ms`)
Memory profiling:
- Chrome DevTools → Memory tab → heap snapshots
- Node.js --inspect flag + Chrome DevTools
- process.memoryUsage() for RSS/heap tracking
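As a rough illustration of RSS/heap tracking with `process.memoryUsage()`, the sketch below samples memory before and after a placeholder workload. The `memorySnapshot` helper and the workload are assumptions; the readings are noisy because garbage collection can run at any time.

```typescript
// Sample RSS and used heap, converted to MiB. `memorySnapshot` is a hypothetical helper.
function memorySnapshot() {
  const { rss, heapUsed } = process.memoryUsage();
  const toMiB = (bytes: number) => bytes / (1024 * 1024);
  return { rssMiB: toMiB(rss), heapUsedMiB: toMiB(heapUsed) };
}

const before = memorySnapshot();
// Placeholder allocation-heavy workload; the array is kept alive on purpose.
const retained = Array.from({ length: 200_000 }, (_, i) => ({ id: i, label: `item-${i}` }));
const after = memorySnapshot();

console.log(`retained ${retained.length} objects`);
console.log(`heapUsed delta: ${(after.heapUsedMiB - before.heapUsedMiB).toFixed(1)} MiB`);
console.log(`rss delta: ${(after.rssMiB - before.rssMiB).toFixed(1)} MiB`);
```

Treat the deltas as a coarse signal; heap snapshots in DevTools are better suited to finding what is actually retained.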
CPU profiling:
- Chrome DevTools → Performance tab → record session
- Node.js --prof flag + node --prof-process
- Flamegraphs for visualization
Rust
Benchmarking:
// benches/my_benchmark.rs (set `harness = false` for this [[bench]] in Cargo.toml)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder for the function under test
fn my_function(n: u64) -> u64 {
    n * 2
}

fn benchmark_function(c: &mut Criterion) {
    c.bench_function("my_function", |b| {
        b.iter(|| my_function(black_box(42)))
    });
}

criterion_group!(benches, benchmark_function);
criterion_main!(benches);
Profiling:
- cargo bench — criterion benchmarks
- perf record + perf report — Linux profiling
- cargo flamegraph — visual flamegraphs
- cargo bloat — binary size analysis
- valgrind --tool=callgrind — detailed profiling
- heaptrack — memory profiling
Instrumentation:
use std::time::Instant;
let start = Instant::now();
// ... code to measure
let duration = start.elapsed();
println!("Took: {:?}", duration);
</profiling_tools>
<optimization_patterns>
Algorithm Improvements
Time complexity:
- O(n²) → O(n log n) — sorting, searching
- O(n) → O(log n) — binary search, trees
- O(n) → O(1) — hash maps, memoization
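To make the last transition concrete, here is a small sketch that replaces a repeated linear scan with a `Map` index. The `Order` and `User` types and both join functions are hypothetical examples.

```typescript
interface Order { id: number; userId: number }
interface User { id: number; name: string }

// O(n * m): scans the whole user list for every order.
function joinSlow(orders: Order[], users: User[]): string[] {
  return orders.map((o) => users.find((u) => u.id === o.userId)?.name ?? "unknown");
}

// O(n + m): build a Map index once, then each lookup is O(1) on average.
function joinFast(orders: Order[], users: User[]): string[] {
  const nameById = new Map<number, string>();
  for (const u of users) nameById.set(u.id, u.name);
  return orders.map((o) => nameById.get(o.userId) ?? "unknown");
}

const users: User[] = [{ id: 1, name: "Ada" }, { id: 2, name: "Linus" }];
const orders: Order[] = [
  { id: 10, userId: 2 },
  { id: 11, userId: 1 },
  { id: 12, userId: 3 },
];
console.log(joinFast(orders, users)); // [ 'Linus', 'Ada', 'unknown' ]
```

Both functions return the same result; the difference only shows up in a profile once the inputs are large, so measure before and after the swap.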
Space-time tradeoffs:
- Cache computed results (memoization)
- Precompute expensive operations
- Index data for faster lookup
- Use hash maps for O(1) access
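A minimal memoization sketch, assuming a pure single-argument function. The `memoize` helper and `slowSquare` are illustrative names, and the cache here is unbounded, which is the memory side of the tradeoff.

```typescript
// Hypothetical memoize helper for pure functions of one argument.
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<A, R>();
  return (arg: A): R => {
    if (cache.has(arg)) return cache.get(arg) as R;
    const result = fn(arg);
    cache.set(arg, result);
    return result;
  };
}

let calls = 0;
// Stand-in for an expensive pure computation.
const slowSquare = (n: number): number => {
  calls++;
  return n * n;
};

const fastSquare = memoize(slowSquare);
fastSquare(12);
fastSquare(12); // served from the cache
console.log(calls); // 1
```

In long-running processes a size limit or eviction policy is usually needed so the cache does not become a leak of its own.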
Memory Optimization
Reduce allocations:
// Bad: creates new array each iteration
for (const item of items) {
  const results = []
  results.push(process(item))
}

// Good: reuse array
const results = []
for (const item of items) {
  results.push(process(item))
}

// Bad: allocates String every time
fn format_user(name: &str) -> String {
    format!("User: {}", name)
}

// Good: reuses buffer
fn format_user(name: &str, buf: &mut String) {
    buf.clear();
    buf.push_str("User: ");
    buf.push_str(name);
}
Memory pooling:
- Reuse expensive objects (connections, buffers)
- Object pools for frequently allocated types
- Arena allocators for batch allocations
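The pooling idea above might look like the following sketch for fixed-size buffers. `BufferPool` is a hypothetical class, not a library API; real pools usually also cap their size.

```typescript
// Hypothetical pool: hand out Uint8Arrays and take them back instead of
// allocating a fresh buffer for every operation.
class BufferPool {
  private free: Uint8Array[] = [];
  private size: number;
  allocations = 0; // how many buffers were actually allocated

  constructor(size: number) {
    this.size = size;
  }

  acquire(): Uint8Array {
    const buf = this.free.pop();
    if (buf) return buf;
    this.allocations++;
    return new Uint8Array(this.size);
  }

  release(buf: Uint8Array): void {
    buf.fill(0); // clear so stale data never leaks to the next user
    this.free.push(buf);
  }
}

const pool = new BufferPool(4096);
for (let i = 0; i < 1000; i++) {
  const buf = pool.acquire();
  buf[0] = i % 256; // placeholder work
  pool.release(buf);
}
console.log(pool.allocations); // 1
```

Whether this beats plain allocation depends on the runtime and buffer size, so it is worth benchmarking both variants before adopting a pool.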
Lazy evaluation:
- Compute only when needed
- Stream processing vs loading all data
- Iterators over materialized collections
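As a sketch of lazy evaluation, generators compute values on demand, so nothing runs past the point where the consumer stops. All helper names below are illustrative.

```typescript
// Infinite source: only safe because consumers pull values lazily.
function* naturals(): Generator<number> {
  let n = 1;
  while (true) yield n++;
}

function* map<T, U>(source: Iterable<T>, fn: (x: T) => U): Generator<U> {
  for (const x of source) yield fn(x);
}

function* filter<T>(source: Iterable<T>, pred: (x: T) => boolean): Generator<T> {
  for (const x of source) if (pred(x)) yield x;
}

// Materialize only the first `count` values, then stop pulling.
function take<T>(source: Iterable<T>, count: number): T[] {
  const out: T[] = [];
  if (count <= 0) return out;
  for (const x of source) {
    out.push(x);
    if (out.length >= count) break;
  }
  return out;
}

let evaluated = 0;
const squares = map(naturals(), (n) => {
  evaluated++;
  return n * n;
});
const firstEvenSquares = take(filter(squares, (s) => s % 2 === 0), 3);
console.log(firstEvenSquares, evaluated); // [ 4, 16, 36 ] 6
```

Only six squares are computed to produce three results; an eager pipeline over a materialized array would have processed every element first.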
I/O Optimization
Batching:
- Batch API calls (1 request vs 100)
- Batch database writes (bulk insert)
- Batch file operations (single write vs many)
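The batching items above share one shape: group work into chunks and pay the per-call overhead once per chunk. A minimal sketch, where `chunk` and `bulkInsert` are hypothetical helpers and `bulkInsert` merely counts round trips in place of a real bulk write:

```typescript
// Split items into fixed-size chunks so each chunk can go out as one request.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Stand-in for a real bulk insert or batch API call.
let roundTrips = 0;
let rowsWritten = 0;
function bulkInsert(batch: number[]): void {
  roundTrips++; // one round trip per batch, regardless of batch size
  rowsWritten += batch.length;
}

const rows = Array.from({ length: 250 }, (_, i) => i);
for (const batch of chunk(rows, 100)) bulkInsert(batch);
console.log({ roundTrips, rowsWritten }); // { roundTrips: 3, rowsWritten: 250 }
```

The batch size of 100 is arbitrary here; the right value depends on payload limits and latency, and is best found by measuring.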
Caching:
- Cache expensive computations
- Cache database queries (Redis, in-memory)
- Cache API responses (HTTP caching)
- Invalidate stale cache entries
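One simple invalidation strategy is a time-to-live per entry. The sketch below is an assumption-laden illustration: `TtlCache` is a hypothetical class, and the clock is injectable so expiry can be exercised without waiting.

```typescript
// Hypothetical in-memory cache with time-based invalidation.
class TtlCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // stale: invalidate on read
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

let clock = 0; // fake clock in milliseconds
const cache = new TtlCache<string, number>(1000, () => clock);
cache.set("user:1:score", 42);
clock = 999;
console.log(cache.get("user:1:score")); // 42
clock = 1000;
console.log(cache.get("user:1:score")); // undefined (expired)
```

TTL expiry trades freshness for simplicity; data that must never be stale needs explicit invalidation on write instead.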
Async I/O:
- Non-blocking operations (async/await)
- Concurrent requests (Promise.all, tokio::spawn)
- Connection pooling (reuse connections)
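To illustrate the difference between sequential and concurrent requests, the sketch below uses a simulated I/O call. `fakeFetch` and the 50 ms delay are assumptions standing in for a real network or database call.

```typescript
// Simulated I/O call; a real one would be a fetch or a database query.
function fakeFetch(id: number, delayMs = 50): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`item-${id}`), delayMs));
}

// Sequential: total time is roughly the sum of all delays.
async function loadSequential(ids: number[]): Promise<string[]> {
  const out: string[] = [];
  for (const id of ids) out.push(await fakeFetch(id));
  return out;
}

// Concurrent: all requests are in flight at once, so total time is roughly
// the slowest single delay. Promise.all keeps results in input order.
async function loadConcurrent(ids: number[]): Promise<string[]> {
  return Promise.all(ids.map((id) => fakeFetch(id)));
}

async function main(): Promise<void> {
  const ids = [1, 2, 3, 4];
  let start = Date.now();
  await loadSequential(ids);
  const sequentialMs = Date.now() - start;
  start = Date.now();
  const items = await loadConcurrent(ids);
  const concurrentMs = Date.now() - start;
  console.log(items, { sequentialMs, concurrentMs });
}

const done = main();
```

Unbounded concurrency can overwhelm a downstream service, so for large input lists a concurrency limit is usually worth adding.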
Database Optimization
Query optimization:
- Add indexes for common queries
- Use EXPLAIN/EXPLAIN ANALYZE to inspect query plans
- Avoid N+1 queries (use joins or batch loading)
- Select only needed columns
- Filter at database level (WHERE vs client filter)
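The N+1 pattern mentioned above is easiest to see by counting queries. This sketch uses a hypothetical in-memory "database" (`authorsTable`, `findAuthorById`, `findAuthorsByIds`) in place of a real client; the SQL in the comments is only indicative.

```typescript
interface Post { id: number; authorId: number }
interface Author { id: number; name: string }

// Hypothetical in-memory table and query counter standing in for a real database.
const authorsTable: Author[] = [{ id: 1, name: "Ada" }, { id: 2, name: "Grace" }];
let queryCount = 0;

function findAuthorById(id: number): Author | undefined {
  queryCount++; // like: SELECT id, name FROM authors WHERE id = ?
  return authorsTable.find((a) => a.id === id);
}

function findAuthorsByIds(ids: number[]): Author[] {
  queryCount++; // like: SELECT id, name FROM authors WHERE id IN (...)
  return authorsTable.filter((a) => ids.includes(a.id));
}

const posts: Post[] = [
  { id: 1, authorId: 1 },
  { id: 2, authorId: 2 },
  { id: 3, authorId: 1 },
];

// N+1 pattern: one author query per post.
queryCount = 0;
const namesNPlusOne = posts.map((p) => findAuthorById(p.authorId)?.name);
const nPlusOneQueries = queryCount;

// Batch loading: one query for all distinct author ids, then join in memory.
queryCount = 0;
const authorIds = [...new Set(posts.map((p) => p.authorId))];
const byId = new Map(findAuthorsByIds(authorIds).map((a) => [a.id, a] as const));
const namesBatched = posts.map((p) => byId.get(p.authorId)?.name);
const batchedQueries = queryCount;

console.log({ nPlusOneQueries, batchedQueries }); // { nPlusOneQueries: 3, batchedQueries: 1 }
```

With a real database each query is a network round trip, so the gap grows with the number of rows.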
Schema design:
- Normalize to reduce duplication
- Denormalize for read-heavy workloads
- Partition large tables
- Use appropriate data types
Connection management:
- Connection pooling (don't create per request)
- Prepared statements (avoid re-parsing the same SQL)
- Transaction batching (reduce round trips)
</optimization_patterns>
Loop: Measure → Profile → Analyze → Optimize → Validate
- Define performance goal — target metric (e.g., P95 < 100ms)
- Establish baseline — measure current performance under realistic load
- Profile systematically — identify actual bottleneck (not guesses)
- Analyze root cause — understand why code is slow
- Design optimization — plan targeted improvement
- Implement optimization — make focused change
- Measure improvement — verify gains, check for regressions
- Document results — record baseline, optimization, gains, tradeoffs
At each step:
- Document measurements with methodology
- Note profiling tool output
- Track optimization attempts (what worked/failed)
- Update performance documentation
Before declaring optimization complete:
Check gains:
- ✓ Measured improvement meets target?
- ✓ Improvement statistically significant?
- ✓ Tested under realistic load?
- ✓ Multiple runs confirm consistency?
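For the significance check, one rough approach is Welch's t statistic over the before and after samples. This is a sketch under stated assumptions: the timings are made up, the helpers are hypothetical, and it computes only the statistic, not a p-value.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Sample variance (n - 1 in the denominator).
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic for two independent samples with unequal variances.
function welchT(a: number[], b: number[]): number {
  const standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / standardError;
}

// Made-up timings in milliseconds, for illustration only.
const beforeMs = [102, 98, 105, 101, 99, 103, 100, 104];
const afterMs = [91, 89, 94, 90, 92, 88, 93, 91];

const t = welchT(beforeMs, afterMs);
console.log(`mean before: ${mean(beforeMs)}ms, mean after: ${mean(afterMs)}ms, t = ${t.toFixed(2)}`);
```

As a rule of thumb, an absolute t value well above roughly 2 suggests the difference is unlikely to be run-to-run noise at these sample sizes; a statistics library is the better choice when a proper p-value or confidence interval is needed.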
Check regressions:
- ✓ No degradation in other metrics?
- ✓ Memory usage still acceptable?
- ✓ Code complexity still manageable?
- ✓ Tests still pass?
Check documentation:
- ✓ Baseline measurements recorded?
- ✓ Optimization approach explained?
- ✓ Gains quantified with numbers?
- ✓ Tradeoffs documented?
ALWAYS:
- Measure before optimizing (baseline)
- Profile to find actual bottleneck
- Use realistic workload (not toy data)
- Measure multiple runs (account for variance)
- Document baseline and improvements
- Check for regressions in other metrics
- Consider readability vs performance tradeoff
- Verify statistical significance
NEVER:
- Optimize without measuring first
- Guess at bottleneck without profiling
- Benchmark with unrealistic data
- Trust single-run measurements
- Skip documentation of results
- Sacrifice correctness for speed
- Optimize without clear performance goal
- Ignore algorithmic improvements
Methodology:
- benchmarking.md — rigorous benchmarking methodology
Related skills:
- codebase-analysis — evidence-based investigation (foundation)
- debugging-and-diagnosis — structured bug investigation
- typescript-dev — correctness before performance