# Performance Optimization Skill
Apply these principles when optimizing code for performance. Focus on the critical ~3% of code where performance truly matters; there, even a 12% improvement, easily obtained, is never considered marginal in engineering.
## Core Philosophy

- **Measure First**: Never optimize without profiling data
- **Estimate Costs**: Back-of-envelope calculations before implementation
- **Avoid Work**: The fastest code is code that doesn't run
- **Reduce Allocations**: Memory allocation is often the hidden bottleneck
- **Cache Locality**: Memory access patterns dominate modern performance
## Reference Latency Numbers
Use these for back-of-envelope calculations:
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1KB (Snappy) | 3 us |
| SSD random read (4KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |
## Optimization Techniques (Priority Order)

### 1. Algorithmic Improvements (Highest Impact)

Always check algorithm complexity first:

- O(N²) → O(N log N): roughly 50,000x faster for N = 1M (10¹² ops vs. ~2×10⁷)
- O(N) → O(1): unbounded improvement as N grows
Common patterns:
- Replace nested loops with hash lookups
- Replace sorted-list intersection with hash-table lookups
- Process in reverse post-order to eliminate per-element checks
- Replace interval trees (O(log N)) with hash maps (O(1)) when ranges aren't needed
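The first pattern can be sketched in a few lines; the function names are illustrative, not from any library. Replacing the inner list scan with a hash set turns O(N·M) into O(N+M):

```python
def common_items_slow(a, b):
    # O(len(a) * len(b)): the inner `in` scans list b on every iteration
    return [x for x in a if x in b]

def common_items_fast(a, b):
    # O(len(a) + len(b)): build a hash set once, then each lookup is O(1)
    b_set = set(b)
    return [x for x in a if x in b_set]
```

Both return the same result; only the lookup structure changes.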
### 2. Reduce Memory Allocations

Allocation is expensive (roughly 25-100 ns each, plus GC pressure):

```python
# BAD: allocates a new list on every call
def process(items):
    result = []  # new allocation
    for item in items:
        result.append(transform(item))
    return result

# GOOD: pre-allocate, or reuse a caller-supplied buffer
def process(items, out=None):
    if out is None:
        out = [None] * len(items)
    for i, item in enumerate(items):
        out[i] = transform(item)
    return out
```
Techniques:
- Pre-size containers with `reserve()` or a known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objects
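A minimal sketch of the buffer-reuse technique, with an illustrative checksum task (the function name and the 1-byte checksum are assumptions for the example): one `bytearray` is cleared and refilled instead of being recreated per iteration.

```python
def checksums(messages):
    # One buffer reused across all iterations: clear instead of recreate
    buf = bytearray()
    out = []
    for chunks in messages:
        buf.clear()
        for chunk in chunks:
            buf += chunk          # append bytes into the reused buffer
        out.append(sum(buf) & 0xFF)  # trivial 1-byte checksum
    return out
```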
### 3. Compact Data Structures

Minimize memory footprint and cache lines touched. (The layouts below are C-style; Rust reorders fields automatically unless you opt into `#[repr(C)]`.)

```rust
// BAD: 24 bytes due to padding
#[repr(C)]
struct ItemBad {
    flag: bool,  // 1 byte + 7 padding
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes + 4 padding
}

// GOOD: 16 bytes with reordering
#[repr(C)]
struct ItemGood {
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes
    flag: bool,  // 1 byte + 3 padding
}
```
Techniques:
- Reorder struct fields by size (largest first)
- Use smaller numeric types when range permits (u8 vs i32)
- Replace 64-bit pointers with 32-bit indices
- Keep hot fields together, move cold fields to separate struct
- Use bit vectors for small-domain sets
- Flatten nested maps: `Map<A, Map<B, C>>` → `Map<(A, B), C>`
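The bit-vector technique can be sketched in Python by packing a small-domain set into the bits of one integer (the `BitSet` class here is illustrative, not a standard-library type):

```python
class BitSet:
    """Set of small non-negative ints stored as bits of a single integer."""

    def __init__(self, items=()):
        self.bits = 0
        for i in items:
            self.bits |= 1 << i   # set bit i

    def add(self, i):
        self.bits |= 1 << i

    def __contains__(self, i):
        return (self.bits >> i) & 1 == 1
```

One machine word covers a 64-element domain; membership, union (`|`), and intersection (`&`) all become single integer operations.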
### 4. Fast Paths for Common Cases

Optimize the common case without hurting rare cases:

```python
# BAD: always takes the slow generic path
def parse_varint(data):
    return generic_varint_parser(data)

# GOOD: fast path for the common 1-byte case
def parse_varint(data):
    if data[0] < 128:  # single byte - ~90% of cases
        return data[0], 1
    return generic_varint_parser(data)  # rare multi-byte case
```
Techniques:
- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic code
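A self-contained variation on the same idea, with the single-digit case assumed to dominate the workload (`fast_atoi` is an illustrative name, not a library function):

```python
def fast_atoi(s):
    # Fast path: a single ASCII digit, assumed to be the common case
    if len(s) == 1 and "0" <= s <= "9":
        return ord(s) - ord("0")
    # Rare case (multi-digit, signed): fall back to the general parser
    return int(s)
```

Note the fast path gives exactly the same answer as the general path; a fast path that changes semantics is a bug, not an optimization.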
### 5. Precompute Expensive Information

Trade memory for compute when beneficial:

```python
# BAD: recomputes membership on every access
def is_vowel(char):
    return char.lower() in 'aeiou'

# GOOD: lookup table (assumes single-byte / Latin-1 input)
VOWEL_TABLE = [c.lower() in 'aeiou' for c in (chr(i) for i in range(256))]

def is_vowel(char):
    return VOWEL_TABLE[ord(char)]
```
Techniques:
- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly inside
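For caching expensive computed properties, the standard library already covers the common case; a minimal sketch with `functools.lru_cache` (the `word_signature` function is illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def word_signature(word):
    # Computed once per distinct input, then served from the cache
    return "".join(sorted(word.lower()))
```

Repeated calls with the same argument hit the cache instead of re-sorting.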
### 6. Bulk/Batch APIs

Amortize fixed costs across multiple operations:

```python
# BAD: N round trips
for item in items:
    result = db.lookup(item)

# GOOD: 1 round trip
results = db.lookup_many(items)
```
Design APIs that support:
- Batch lookups instead of individual fetches
- Vectorized operations over loops
- Streaming interfaces for large datasets
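When a backend caps batch size, a thin chunking wrapper keeps the round-trip count at ⌈N / batch_size⌉; this sketch assumes a hypothetical `lookup_batch` callable that maps a list of keys to a dict:

```python
def lookup_many_chunked(lookup_batch, keys, batch_size=100):
    # One backend call per chunk of keys instead of one call per key
    results = {}
    for i in range(0, len(keys), batch_size):
        results.update(lookup_batch(keys[i:i + batch_size]))
    return results
```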
### 7. Avoid Unnecessary Work

```python
# BAD: always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # always runs
    if config.needs_expensive:
        use(expensive)

# GOOD: defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # only when needed
        use(expensive)
```
Techniques:
- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot paths
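Loop-invariant hoisting deserves its own sketch, since it differs from lazy evaluation; the angle-scaling task here is an illustrative assumption:

```python
import math

def scale_slow(values, degrees):
    # Recomputes the identical conversion factor on every iteration
    return [v * math.cos(math.radians(degrees)) for v in values]

def scale_fast(values, degrees):
    # Hoist the loop-invariant factor out of the loop
    factor = math.cos(math.radians(degrees))
    return [v * factor for v in values]
```

Compilers often hoist invariants for you, but interpreted languages and calls with possible side effects usually need it done by hand.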
### 8. Help the Compiler/Runtime

Lower-level optimizations, when profiling shows the need:

```rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }

// Copy to a local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
    let f = *factor;  // compiler knows this won't change
    for x in data {
        *x *= f;
    }
}
```
Techniques:
- Use raw pointers/indices instead of iterators in critical loops
- Hand-unroll very hot loops (4+ iterations)
- Move slow-path code to separate non-inlined functions
- Avoid abstractions that hide costs in hot paths
### 9. Reduce Synchronization
Minimize lock contention and atomic operations:
- Default to thread-compatible (external sync) not thread-safe
- Sample statistics (1-in-32) instead of tracking everything
- Batch updates to shared state
- Use thread-local storage for per-thread data
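The 1-in-32 sampling idea can be sketched as follows; the class name and the assumption that events are uniform enough for scaled-up counts to be accurate are both illustrative:

```python
import random

class SampledCounter:
    """Record roughly 1 in RATE events; scale up on read."""
    RATE = 32

    def __init__(self):
        self.samples = 0

    def record(self):
        # One cheap RNG check instead of a contended update per event
        if random.randrange(self.RATE) == 0:
            self.samples += 1

    def estimate(self):
        return self.samples * self.RATE
```

The estimate has sampling error (~sqrt of the sample count), which is usually acceptable for monitoring statistics.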
## Profiling Workflow

1. **Identify hotspots**: use a profiler (pprof, perf, py-spy)
2. **Measure baseline**: write a benchmark before optimizing
3. **Estimate improvement**: calculate the expected gain
4. **Implement the change**: focus on one optimization at a time
5. **Verify the improvement**: rerun the benchmark, confirm the gain
6. **Check for regressions**: ensure no other code paths slowed down
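The baseline and verification steps can lean on the standard library; a minimal sketch using `timeit` (the `bench` helper is illustrative):

```python
import timeit

def bench(fn, *args, repeat=5, number=1000):
    # Best-of-N wall time per call, in microseconds; min() filters
    # out interference from other processes
    runs = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(runs) / number * 1e6

baseline_us = bench(sorted, list(range(1000)))
```

Record `baseline_us` before changing anything, then rerun the same call after each optimization to confirm the gain.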
## When the Profile Is Flat (No Clear Hotspots)
- Pursue multiple small 1-2% improvements collectively
- Look for restructuring opportunities higher in call stack
- Profile allocations specifically (often hidden cost)
- Gather hardware performance counters (cache misses, branch mispredicts)
- Consider if the code is already well-optimized
## Anti-Patterns to Avoid
- Premature optimization in non-critical code
- Optimizing without measurement - changes based on intuition
- Micro-optimizing while ignoring algorithmic complexity
- Copying when moving would suffice
- Growing containers one element at a time without amortized capacity growth (quadratic total copying)
- Allocating in loops when reuse is possible
- String formatting in hot paths (use pre-built templates)
- Regex when simple string matching suffices
## Estimation Template

Before optimizing, estimate:

```
Operation cost:      ___ ns/us/ms
Frequency:           ___ times per second/request
Total time:          cost × frequency = ___
Improvement target:  ___% reduction
Expected new time:   ___
Is this worth it?    [ ] Yes  [ ] No
```
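Filled in for a hypothetical case (every number below is an assumption for illustration): 50,000 main-memory lookups per request, with batching expected to eliminate 80% of them.

```python
cost_ns = 100          # assumed: one main-memory reference per lookup
frequency = 50_000     # assumed: lookups per request
total_us = cost_ns * frequency / 1000   # 5,000 us = 5 ms per request
reduction = 0.80       # target: batching removes 80% of lookups
new_us = total_us * (1 - reduction)     # ~1,000 us = ~1 ms per request
```

A 4 ms saving per request is clearly worth pursuing; the same arithmetic on a function costing 50 ns total would say no.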