# Performance Profiling

## Purpose
Systematically measure and analyze application performance using profiling tools to identify bottlenecks, hot paths, memory leaks, and inefficient operations.
## When to Use
- Investigating slow operations or high latency
- Optimizing resource usage (CPU, memory, I/O)
- Diagnosing performance degradation
- Before and after performance improvements
- Capacity planning and scalability testing
## Key Capabilities

- **CPU Profiling**: Identify time-consuming functions and hot paths
- **Memory Profiling**: Detect leaks, excessive allocation, and memory patterns
- **I/O Analysis**: Find slow database queries, file operations, and network calls
## Approach
1. **Establish Baseline**
   - Measure current performance metrics
   - Document expected vs. actual performance
   - Identify performance requirements (SLAs)
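A baseline can be as simple as timing the operation repeatedly and recording latency percentiles. The sketch below uses only the standard library; `handle_request` is a hypothetical stand-in for the real operation under test.

```python
import time
import statistics

def handle_request():
    # Placeholder for the real operation being measured
    time.sleep(0.01)

def measure_baseline(fn, runs=20):
    """Time `fn` repeatedly; report latency percentiles in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(len(samples) * 0.95) - 1],
        "max_ms": samples[-1],
    }

baseline = measure_baseline(handle_request)
print(baseline)
```

Recording percentiles rather than a single run matters: the p95/p99 tail is usually what violates an SLA, not the median.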
2. **Select Profiling Tools**
   - Python: cProfile, memory_profiler, py-spy, line_profiler
   - Node.js: built-in profiler, clinic.js, 0x
   - Java: JProfiler, VisualVM, YourKit
   - Go: pprof, trace
   - Database: EXPLAIN, query logs, slow query log
   - System: perf, strace, iostat, vmstat
3. **Collect Profiling Data**
   - Run the application under realistic load
   - Capture CPU profiles (flame graphs)
   - Capture memory snapshots
   - Record I/O operations
   - Monitor system metrics
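For memory snapshots in Python, the standard library's `tracemalloc` module is one low-overhead option. A minimal sketch, where `build_payload` is a made-up stand-in for real application work:

```python
import tracemalloc

def build_payload(n=10_000):
    # Stand-in for application work that allocates memory
    return [{"id": i, "name": f"user-{i}"} for i in range(n)]

tracemalloc.start()
payload = build_payload()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group allocations by source line and show the top offenders
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Taking a snapshot before and after a suspect operation and diffing them (`snapshot_b.compare_to(snapshot_a, "lineno")`) narrows leaks down to specific lines.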
4. **Analyze Results**
   - Identify functions consuming the most CPU time
   - Find memory allocation hotspots
   - Locate slow database queries (e.g., N+1 problems)
   - Detect blocking I/O operations
   - Review call graphs and flame graphs
5. **Prioritize Optimizations**
   - Focus on the biggest bottlenecks first
   - Consider effort vs. impact
   - Measure before and after each improvement
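One way to sanity-check effort vs. impact is Amdahl's law: overall speedup is capped by the fraction of total time the optimized component actually accounts for. A quick estimate, using illustrative fractions:

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when `fraction` of total time
    becomes `local_speedup` times faster."""
    return 1 / ((1 - fraction) + fraction / local_speedup)

# If database queries take 85% of total time, a 500x query speedup
# yields roughly a 6.6x overall improvement
print(round(overall_speedup(0.85, 500), 1))  # → 6.6

# If serialization takes only 8%, even a 10x speedup there barely helps
print(round(overall_speedup(0.08, 10), 2))   # → 1.08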
## Example
**Context:** Profiling a slow Python web API endpoint
### Step 1: Baseline Measurement
```bash
# Measure endpoint response time
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/api/users
# Result: Total time: 2.8 seconds (Target: <500ms)
```
### Step 2: CPU Profiling
```python
# profile_endpoint.py
import cProfile
import pstats
from io import StringIO

from app import app  # the Flask application under test

def profile_request():
    profiler = cProfile.Profile()
    profiler.enable()

    # Execute the slow endpoint
    response = app.test_client().get('/api/users')

    profiler.disable()

    # Generate report
    s = StringIO()
    ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
    ps.print_stats(20)  # Top 20 functions
    print(s.getvalue())

profile_request()
```
**CPU Profile Results:**

```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    2.756    2.756 views.py:45(get_users)
      500    1.200    0.002    2.450    0.005 database.py:89(get_user_details)
     5000    0.850    0.000    0.850    0.000 {method 'execute' of 'sqlite3.Cursor'}
      500    0.300    0.001    0.300    0.001 serializers.py:22(serialize_user)
        1    0.150    0.150    0.150    0.150 {method 'fetchall' of 'sqlite3.Cursor'}
```
**Analysis:**

- `get_user_details()` is called 500 times → N+1 query problem
- Database queries account for ~85% of total time
- Each call is fast (~0.005s cumulative), but 500 of them ≈ 2.45s total
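The cost structure of an N+1 pattern is easy to reproduce with the standard library's `sqlite3` (the schema and data below are invented for illustration; the real app goes through an ORM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_details (user_id INTEGER, bio TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user-{i}") for i in range(500)])
conn.executemany("INSERT INTO user_details VALUES (?, ?)",
                 [(i, f"bio-{i}") for i in range(500)])

# N+1 pattern: one query for users, then one query per user
queries = 1
users = conn.execute("SELECT id, name FROM users").fetchall()
for user_id, _ in users:
    conn.execute("SELECT bio FROM user_details WHERE user_id = ?",
                 (user_id,)).fetchone()
    queries += 1
print(f"N+1 pattern: {queries} queries")  # 501 round trips

# Joined version: everything in a single query
rows = conn.execute("""
    SELECT u.id, u.name, d.bio
    FROM users u JOIN user_details d ON d.user_id = u.id
""").fetchall()
print(f"Joined version: 1 query, {len(rows)} rows")
```

With a remote database, each of those 501 round trips also pays network latency, which is why the pattern hurts far more in production than against a local SQLite file.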
### Step 3: Database Query Analysis
```python
# Original code (N+1 problem)
def get_users():
    users = User.query.all()  # 1 query
    results = []
    for user in users:
        # N queries (one per user)
        user_details = UserDetail.query.filter_by(user_id=user.id).first()
        results.append({
            'user': user,
            'details': user_details
        })
    return results
```
### Step 4: Memory Profiling
```python
from memory_profiler import profile

@profile
def get_users():
    users = User.query.all()
    results = []
    for user in users:
        user_details = UserDetail.query.filter_by(user_id=user.id).first()
        results.append({
            'user': user,
            'details': user_details
        })
    return results
```
**Memory Profile Results:**

```text
Line #    Mem usage    Increment   Line Contents
================================================
    45     50.2 MiB     50.2 MiB   def get_users():
    46     75.5 MiB     25.3 MiB       users = User.query.all()
    47     75.5 MiB      0.0 MiB       results = []
    48    125.8 MiB     50.3 MiB       for user in users:
    49    125.8 MiB      0.0 MiB           user_details = UserDetail.query...
    50    125.8 MiB      0.0 MiB           results.append(...)
    51    125.8 MiB      0.0 MiB       return results
```
**Analysis:** Loading 500 users with their details allocates ~75 MiB (25.3 MiB for the user objects plus 50.3 MiB accumulated across the per-user detail queries)
### Step 5: Flame Graph Analysis
```bash
# Generate a flame graph (visual)
py-spy record -o profile.svg --duration 30 -- python app.py
```
**Flame Graph Shows:**
- 87% time in database queries
- 8% time in serialization
- 5% time in framework overhead
**Optimization Applied:**
```python
# Optimized code (single query with join)
from sqlalchemy.orm import joinedload

def get_users():
    # Use eager loading to fetch users and details in one query
    users = User.query.options(
        joinedload(User.details)
    ).all()
    results = []
    for user in users:
        results.append({
            'user': user,
            'details': user.details  # Already loaded, no extra query
        })
    return results
```
### Step 6: Verify Improvement
```bash
# Re-measure endpoint response time
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/api/users
# Result: Total time: 0.18 seconds (~94% improvement)
```
**Expected Result:**

- Identified the N+1 query as the primary bottleneck
- Reduced 501 queries (1 + 500) to a single joined query
- Improved response time from 2.8s to 0.18s (~94%)
- Eliminated per-row query overhead by fetching users and details in one round trip
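Beyond a one-off `curl`, the before/after comparison can be scripted so regressions are caught automatically. A hedged sketch, where `slow_version` and `fast_version` are stand-ins for the unoptimized and optimized code paths (here, a brute-force loop vs. a closed-form sum of squares):

```python
import timeit

def slow_version():
    # Stand-in for the N+1 path: do the work row by row
    return sum(i * i for i in range(50_000))

def fast_version():
    # Stand-in for the joined-query path: same answer, one cheap step
    n = 50_000 - 1
    return n * (n + 1) * (2 * n + 1) // 6  # closed-form sum of squares

assert slow_version() == fast_version()  # same result, different cost
slow = timeit.timeit(slow_version, number=20)
fast = timeit.timeit(fast_version, number=20)
print(f"slow: {slow:.4f}s  fast: {fast:.4f}s  speedup: {slow / fast:.0f}x")
```

Checking that both versions return identical results before comparing timings guards against the classic trap of "optimizing" code into a faster but wrong answer.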
## Best Practices
- ✅ Profile in production-like environment with realistic data
- ✅ Focus on user-facing operations first
- ✅ Use flame graphs for visual understanding
- ✅ Profile both CPU and memory together
- ✅ Measure before and after every optimization
- ✅ Profile under load (not just single requests)
- ✅ Keep profiling data for comparison over time
- ✅ Look for low-hanging fruit (N+1 queries, missing indexes)
- ✅ Consider statistical profiling for production (low overhead)
- ❌ Avoid: Optimizing without measuring first
- ❌ Avoid: Micro-optimizations that don't impact overall performance
- ❌ Avoid: Profiling only in development (profile staging/production)
- ❌ Avoid: Ignoring the 80/20 rule (fix biggest bottlenecks first)