Performance Profiling

Analyze system and application performance using comprehensive profiling techniques including Linux kernel-level tools (perf, ftrace, eBPF, SystemTap), application-level profiling, bottleneck identification, and optimization recommendations to improve system responsiveness, throughput, and resource efficiency.

When to use me

Use this skill when:

Application performance is slow or degrading
System resource utilization is high
Identifying CPU, memory, I/O, or network bottlenecks
Optimizing application response times
Debugging performance regressions
Capacity planning and resource sizing
Comparing performance before/after changes
Analyzing production performance issues
Creating performance baselines
Tuning system and application parameters

What I do

1. System-Level Profiling

CPU profiling: Analyze CPU usage, context switches, interrupts, scheduler latency
Memory profiling: Analyze memory usage, page faults, swapping, memory leaks
I/O profiling: Analyze disk I/O, file system performance, storage latency
Network profiling: Analyze network throughput, latency, packet loss, connections
Kernel profiling: Analyze kernel functions, system calls, interrupt handlers

2. Application-Level Profiling

Application CPU usage: Profile application-specific CPU consumption
Memory allocation: Track heap allocations, garbage collection, memory leaks
Function timing: Measure function execution times and call frequencies
Database query profiling: Analyze SQL query performance and optimization
API endpoint profiling: Measure API response times and throughput

3. Tool Integration

Linux perf: CPU profiling, hardware performance counters, tracepoints
eBPF/BCC: Dynamic tracing, custom performance instrumentation
Ftrace: Kernel function tracing, event tracing, latency measurements
SystemTap: System-wide tracing and profiling
Application profilers: Language-specific profiling tools
Container profiling: Docker, Kubernetes performance analysis

4. Bottleneck Identification

Hot spot detection: Identify frequently executed code paths
Resource contention: Detect lock contention, CPU starvation, I/O wait
Latency analysis: Measure and analyze latency distributions
Scalability analysis: Identify scalability limits and bottlenecks
Anomaly detection: Detect performance anomalies and regressions

5. Optimization Recommendations

Code optimizations: Suggest algorithmic improvements, caching strategies
Configuration tuning: Recommend system and application tuning parameters
Architecture improvements: Suggest architectural changes for performance
Resource allocation: Recommend optimal resource allocation strategies
Monitoring setup: Recommend performance monitoring configurations

6. Visualization & Reporting

Flame graphs: Generate CPU and memory flame graphs for visualization
Heat maps: Create latency heat maps for time-series analysis
Performance dashboards: Create real-time performance dashboards
Trend analysis: Analyze performance trends over time
Comparison reports: Compare performance across versions/environments

Profiling Tools Covered

Linux Kernel-Level Tools

perf: Linux performance events for CPU profiling, hardware counters
eBPF/BCC: Extended Berkeley Packet Filter for dynamic tracing
bpftrace: High-level tracing language for eBPF
Ftrace: Linux kernel internal tracer for function tracing
SystemTap: System-wide tracing and profiling framework
LTTng: Linux Trace Toolkit next generation
ktap: Lightweight kernel tracing

Application-Level Tools

Java: JProfiler, YourKit, VisualVM, Async Profiler
Python: cProfile, py-spy, Scalene, line_profiler
Node.js: clinic.js, 0x, node --prof, v8-profiler
Go: pprof, trace, delve, gops
Ruby: ruby-prof, stackprof, rbspy
.NET: dotnet-counters, dotnet-trace, PerfView
PHP: Xdebug, Blackfire, Tideways
C/C++: gprof, Valgrind, Intel VTune, perf

System Monitoring Tools

top/htop: Process monitoring
vmstat: Virtual memory statistics
iostat: I/O statistics
netstat/ss: Network statistics
sar: System activity reporter
dstat: Versatile resource statistics
nmon: Nigel's performance monitor

Visualization Tools

FlameGraph: CPU and memory flame graphs
perfetto: System tracing and performance visualization
grafana: Performance dashboard visualization
prometheus: Time-series monitoring and alerting
jaeger: Distributed tracing visualization

Analysis Techniques

CPU Profiling with perf

# Sample CPU usage for 30 seconds
perf record -F 99 -ag -- sleep 30

# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

# Analyze hardware performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./application

# Trace system calls
perf trace -e syscalls:sys_enter_* ./application

eBPF Tracing with BCC

from bcc import BPF

# eBPF program to trace function calls
bpf_text = """
#include <uapi/linux/ptrace.h>

struct data_t {
    u64 timestamp;
    u32 pid;
    char comm[TASK_COMM_LEN];
    u64 duration_ns;
};

BPF_HASH(start, u32);
BPF_PERF_OUTPUT(events);

int trace_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    
    start.update(&pid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid);
    
    if (tsp == 0) {
        return 0;
    }
    
    u64 duration = bpf_ktime_get_ns() - *tsp;
    
    struct data_t data = {};
    data.timestamp = bpf_ktime_get_ns();
    data.pid = pid;
    data.duration_ns = duration;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    
    events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&pid);
    
    return 0;
}
"""

# Attach to function entry and return
bpf = BPF(text=bpf_text)
bpf.attach_uprobe(name="application", sym="function_name", fn_name="trace_entry")
bpf.attach_uretprobe(name="application", sym="function_name", fn_name="trace_return")

Memory Leak Detection

# Monitor memory allocations
valgrind --leak-check=full --show-leak-kinds=all ./application

# Track heap allocations with eBPF
/usr/share/bcc/tools/memleak -p $(pidof application)

# Analyze memory usage over time
cat /proc/$(pidof application)/smaps | grep -i pss | awk '{total+=$2} END {print total}'

# Monitor garbage collection (Java)
jstat -gc $(pidof java) 1s

Latency Analysis

def analyze_latency_distribution(latency_samples):
    """
    Analyze latency distribution and identify outliers.
    """
    import numpy as np
    from scipy import stats
    
    latencies = np.array(latency_samples)
    
    analysis = {
        'count': len(latencies),
        'mean': np.mean(latencies),
        'median': np.median(latencies),
        'p90': np.percentile(latencies, 90),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'std_dev': np.std(latencies),
        'min': np.min(latencies),
        'max': np.max(latencies),
        'outliers': []
    }
    
    # Identify outliers using IQR method
    q1 = np.percentile(latencies, 25)
    q3 = np.percentile(latencies, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers = latencies[(latencies < lower_bound) | (latencies > upper_bound)]
    analysis['outliers'] = outliers.tolist()
    analysis['outlier_percentage'] = len(outliers) / len(latencies) * 100
    
    return analysis

Examples

# Profile CPU usage for 60 seconds
npm run performance-profiling:cpu -- --duration 60 --output cpu-profile.json

# Generate flame graph
npm run performance-profiling:flamegraph -- --pid $(pidof application) --output flamegraph.svg

# Analyze memory leaks
npm run performance-profiling:memory -- --application myapp --leak-check

# Trace database queries
npm run performance-profiling:database -- --database postgresql --duration 300

# Profile API endpoints
npm run performance-profiling:api -- --endpoints "/api/*" --duration 60 --output api-performance.json

# Compare performance before/after changes
npm run performance-profiling:compare -- --before baseline.json --after new-version.json --output comparison.json

# Analyze system resource usage
npm run performance-profiling:system -- --metrics cpu,memory,disk,network --duration 300

# Create performance dashboard
npm run performance-profiling:dashboard -- --metrics all --interval 1s --duration 3600

# Detect bottlenecks in microservices
npm run performance-profiling:microservices -- --services auth,payment,notification --duration 600

# Optimize configuration based on profiling
npm run performance-profiling:optimize -- --profile profile.json --output optimizations.md

# Monitor production performance
npm run performance-profiling:monitor -- --production --alert-threshold p95:200ms

Output format

Performance Profiling Report:

Performance Profiling Report
────────────────────────────
System: payment-processing-service
Analysis Date: 2026-02-26
Duration: 300 seconds
Profiling Tools: perf, eBPF, Application Profiler

Executive Summary:
⚠️ Performance issues detected: 3 critical, 2 warnings
✅ System resources: Within normal limits
📊 Overall performance score: 72/100

Critical Issues:
1. ❌ Database query bottleneck (Severity: Critical)
   • Query: SELECT * FROM transactions WHERE user_id = ?
   • Average latency: 450ms (p95: 1200ms)
   • Frequency: 1200 executions/minute
   • Root cause: Missing index on user_id column
   • Impact: 40% of API latency
   • Recommendation: Add index on transactions.user_id

2. ❌ Memory leak in cache service (Severity: Critical)
   • Service: redis-cache-service
   • Memory growth: 2MB/minute
   • Total leaked: 120MB over 1 hour
   • Pattern: Cache entries not expired properly
   • Recommendation: Implement TTL and LRU eviction

3. ❌ CPU contention in payment processor (Severity: Critical)
   • Function: processPayment() in payment-service
   • CPU usage: 85% during peak
   • Bottleneck: Cryptographic operations
   • Recommendation: Implement caching or hardware acceleration

Warnings:
1. ⚠️ API endpoint latency degradation (Severity: Warning)
   • Endpoint: POST /api/v1/payments
   • p95 latency increase: 150ms → 320ms (+113%)
   • Timeframe: Last 7 days
   • Recommendation: Profile endpoint and optimize

2. ⚠️ Garbage collection pauses (Severity: Warning)
   • Application: notification-service (Java)
   • GC pauses: 45ms average, 120ms max
   • Frequency: Every 30 seconds
   • Recommendation: Tune JVM garbage collector

System Resource Analysis:
┌────────────────────┬────────────┬────────────┬────────────┐
│ Resource           │ Usage      │ Threshold │ Status     │
├────────────────────┼────────────┼────────────┼────────────┤
│ CPU                │ 65%        │ 80%       ✅ Normal     │
│ Memory             │ 72%        │ 85%       ✅ Normal     │
│ Disk I/O           │ 45%        │ 70%       ✅ Normal     │
│ Network            │ 38%        │ 60%       ✅ Normal     │
│ Database Connections│ 85%       │ 90%       ⚠️ Warning    │
└────────────────────┴────────────┴────────────┴────────────┘

Application Performance:
• API Response Times:
  - p50: 85ms ✅
  - p95: 320ms ⚠️
  - p99: 1200ms ❌
  - Success Rate: 99.8% ✅

• Database Performance:
  - Query Cache Hit Rate: 65% ⚠️
  - Average Query Time: 85ms ✅
  - Slow Queries (>100ms): 12% ⚠️
  - Connection Pool Usage: 85% ⚠️

• Cache Performance:
  - Redis Hit Rate: 92% ✅
  - Cache Latency: 3ms ✅
  - Memory Usage: 78% ⚠️
  - Eviction Rate: 5% ✅

Flame Graph Analysis:
• Hot Functions:
  1. processPayment() - 35% CPU time
  2. validateTransaction() - 22% CPU time
  3. updateDatabase() - 18% CPU time
  4. sendNotification() - 8% CPU time
  5. logActivity() - 5% CPU time

• Optimization Opportunities:
  1. Cache validation results (potential 15% improvement)
  2. Batch database updates (potential 10% improvement)
  3. Async notifications (potential 8% improvement)

Memory Analysis:
• Heap Usage: 2.4GB
• Stack Usage: 320MB
• Native Memory: 450MB
• Garbage Collection:
  - Young GC: 45ms every 30s
  - Full GC: 120ms every 5min
  - Throughput: 98.5%

I/O Analysis:
• Disk Read: 45MB/s (average)
• Disk Write: 28MB/s (average)
• File Descriptors: 1250/4096 (31%)
• Network Throughput:
  - Inbound: 85Mbps
  - Outbound: 120Mbps
  - Connections: 850 active

Bottleneck Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Bottleneck Timeline (Last 60 minutes)                       │
│                                                             │
│ 00:00 ┼───────┬──────────────┬─────────────┬────────────── │
│       │ CPU   │ Database     │ Memory      │ Network       │
│ 15:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ ███   │ █████████    │ ███         │ ██            │
│ 30:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ █████ │ ████████████ │ █████       │ ███           │
│ 45:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ ██████│ █████████████│ ███████     │ ████          │
│ 60:00 ┼───────┴──────────────┴─────────────┴────────────── │
│      0%                   50%                   100%       │
└─────────────────────────────────────────────────────────────┘

Optimization Recommendations:
1. Immediate (High Impact):
   • Add database index on transactions.user_id
   • Implement cache TTL for redis-cache-service
   • Optimize processPayment() cryptographic operations

2. Short-term (Medium Impact):
   • Implement connection pooling for database
   • Add query caching for frequent queries
   • Batch database writes where possible

3. Long-term (Architectural):
   • Implement read replicas for database
   • Add CDN for static assets
   • Implement circuit breakers for external services

Performance Metrics Baseline:
• CPU Usage: < 70% target
• Memory Usage: < 80% target
• API p95 Latency: < 200ms target
• Database Query Time: < 100ms target
• Cache Hit Rate: > 90% target

Monitoring Configuration:
• Alert on: p95 latency > 200ms
• Alert on: CPU usage > 80% for 5 minutes
• Alert on: Memory usage > 85%
• Alert on: Error rate > 1%
• Dashboard: Real-time performance metrics

Next Steps:
1. Implement database index (estimate: 2 hours)
2. Fix memory leak in cache service (estimate: 4 hours)
3. Optimize payment processor CPU usage (estimate: 8 hours)
4. Deploy optimizations with feature flags
5. Monitor performance for 24 hours
6. Schedule performance regression tests

JSON Output Format:

{
  "analysis": {
    "system": "payment-processing-service",
    "analysis_date": "2026-02-26",
    "duration_seconds": 300,
    "profiling_tools": ["perf", "ebpf", "application_profiler"],
    "overall_score": 72
  },
  "critical_issues": [
    {
      "id": "issue-db-001",
      "description": "Database query bottleneck",
      "severity": "critical",
      "component": "database",
      "metric": "query_latency",
      "average_value": 450,
      "p95_value": 1200,
      "unit": "ms",
      "frequency": "1200 executions/minute",
      "root_cause": "Missing index on user_id column",
      "impact": "40% of API latency",
      "recommendation": "Add index on transactions.user_id",
      "estimated_effort_hours": 2,
      "priority": "high"
    },
    {
      "id": "issue-memory-001",
      "description": "Memory leak in cache service",
      "severity": "critical",
      "component": "cache",
      "metric": "memory_growth",
      "average_value": 2,
      "unit": "MB/minute",
      "total_leaked": 120,
      "total_leaked_unit": "MB",
      "timeframe": "1 hour",
      "pattern": "Cache entries not expired properly",
      "recommendation": "Implement TTL and LRU eviction",
      "estimated_effort_hours": 4,
      "priority": "high"
    }
  ],
  "system_resources": {
    "cpu": {
      "usage_percentage": 65,
      "threshold": 80,
      "status": "normal",
      "breakdown": {
        "user": 45,
        "system": 20,
        "iowait": 8,
        "steal": 2
      }
    },
    "memory": {
      "usage_percentage": 72,
      "threshold": 85,
      "status": "normal",
      "breakdown": {
        "heap": 2400,
        "stack": 320,
        "native": 450,
        "cached": 1200
      }
    },
    "disk_io": {
      "usage_percentage": 45,
      "threshold": 70,
      "status": "normal",
      "read_mbps": 45,
      "write_mbps": 28
    },
    "network": {
      "usage_percentage": 38,
      "threshold": 60,
      "status": "normal",
      "inbound_mbps": 85,
      "outbound_mbps": 120,
      "connections": 850
    }
  },
  "application_performance": {
    "api_response_times": {
      "p50_ms": 85,
      "p95_ms": 320,
      "p99_ms": 1200,
      "success_rate": 99.8
    },
    "database_performance": {
      "query_cache_hit_rate": 65,
      "average_query_time_ms": 85,
      "slow_queries_percentage": 12,
      "connection_pool_usage": 85
    },
    "cache_performance": {
      "hit_rate": 92,
      "latency_ms": 3,
      "memory_usage_percentage": 78,
      "eviction_rate": 5
    }
  },
  "flame_graph_analysis": {
    "hot_functions": [
      {
        "function": "processPayment",
        "cpu_percentage": 35,
        "optimization_opportunity": "Cache validation results"
      },
      {
        "function": "validateTransaction",
        "cpu_percentage": 22,
        "optimization_opportunity": "Batch validation"
      }
    ],
    "optimization_opportunities": [
      {
        "description": "Cache validation results",
        "estimated_improvement": 15,
        "effort_hours": 8
      },
      {
        "description": "Batch database updates",
        "estimated_improvement": 10,
        "effort_hours": 6
      }
    ]
  },
  "optimization_recommendations": {
    "immediate": [
      "Add database index on transactions.user_id",
      "Implement cache TTL for redis-cache-service",
      "Optimize processPayment() cryptographic operations"
    ],
    "short_term": [
      "Implement connection pooling for database",
      "Add query caching for frequent queries",
      "Batch database writes where possible"
    ],
    "long_term": [
      "Implement read replicas for database",
      "Add CDN for static assets",
      "Implement circuit breakers for external services"
    ]
  },
  "performance_baseline": {
    "cpu_usage_target": 70,
    "memory_usage_target": 80,
    "api_p95_latency_target": 200,
    "database_query_time_target": 100,
    "cache_hit_rate_target": 90
  },
  "next_steps": [
    {
      "action": "Implement database index",
      "estimate_hours": 2,
      "priority": "high"
    },
    {
      "action": "Fix memory leak in cache service",
      "estimate_hours": 4,
      "priority": "high"
    },
    {
      "action": "Optimize payment processor CPU usage",
      "estimate_hours": 8,
      "priority": "medium"
    }
  ]
}

Performance Dashboard:

Performance Dashboard
────────────────────
Status: ACTIVE
Last Update: 2026-02-26 19:45:00
Update Interval: 1 second

Real-time Metrics:
┌────────────────────┬────────────┬────────────┬────────────┐
│ Metric             │ Current    │ 1min Avg   │ Trend      │
├────────────────────┼────────────┼────────────┼────────────┤
│ CPU Usage          │ 65%        │ 62%        │ ↗️ Rising   │
│ Memory Usage       │ 72%        │ 71%        │ → Stable   │
│ API Latency (p95)  │ 320ms      │ 310ms      ↗️ Rising     │
│ Database Latency   │ 85ms       │ 82ms       → Stable     │
│ Cache Hit Rate     │ 92%        │ 91%        ↘️ Falling    │
│ Error Rate         │ 0.2%       │ 0.3%       ↘️ Falling    │
└────────────────────┴────────────┴────────────┴────────────┘

Alerts:
• ⚠️  API p95 latency above threshold (200ms): 320ms
• ✅  CPU usage within limits
• ✅  Memory usage within limits
• ⚠️  Database connections approaching limit (85%)

Hotspots:
1. processPayment(): 35% CPU (🔥 Hot)
2. validateTransaction(): 22% CPU (⚠️ Warm)
3. updateDatabase(): 18% CPU (⚠️ Warm)

Resource Utilization Trend:
CPU:    ████████████████████████████████████░░░░ 65%
Memory: ██████████████████████████████████████░░ 72%
Disk:   █████████████████████░░░░░░░░░░░░░░░░░░░ 45%
Network:████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 38%

Recent Events:
• 19:40: Database query slowdown detected
• 19:35: Cache miss rate increased by 15%
• 19:30: API latency spike (p95: 450ms)
• 19:25: Memory usage increased by 2%

Recommendations:
1. Add index on transactions.user_id (pending)
2. Implement cache TTL (in progress)
3. Optimize payment processor (planned)

Performance Score: 72/100
Status: Needs Improvement

Notes

Profile in production-like environments for accurate results
Use appropriate sampling rates to balance overhead and accuracy
Compare against baselines to identify regressions
Monitor profiling overhead to avoid affecting production performance
Use flame graphs for visual bottleneck identification
Combine multiple tools for comprehensive analysis
Profile representative workloads that match production usage
Consider security implications of profiling in production
Document profiling methodology for reproducibility
Automate performance regression testing in CI/CD pipelines

performance-profiling