Performance Profiling

Analyze system and application performance with Linux kernel-level tools (perf, ftrace, eBPF, SystemTap) and application-level profilers, identify bottlenecks, and recommend optimizations that improve responsiveness, throughput, and resource efficiency.

When to use me

Use this skill when:

  • Application performance is slow or degrading
  • System resource utilization is high
  • Identifying CPU, memory, I/O, or network bottlenecks
  • Optimizing application response times
  • Debugging performance regressions
  • Capacity planning and resource sizing
  • Comparing performance before/after changes
  • Analyzing production performance issues
  • Creating performance baselines
  • Tuning system and application parameters

What I do

1. System-Level Profiling

  • CPU profiling: Analyze CPU usage, context switches, interrupts, scheduler latency
  • Memory profiling: Analyze memory usage, page faults, swapping, memory leaks
  • I/O profiling: Analyze disk I/O, file system performance, storage latency
  • Network profiling: Analyze network throughput, latency, packet loss, connections
  • Kernel profiling: Analyze kernel functions, system calls, interrupt handlers

2. Application-Level Profiling

  • Application CPU usage: Profile application-specific CPU consumption
  • Memory allocation: Track heap allocations, garbage collection, memory leaks
  • Function timing: Measure function execution times and call frequencies
  • Database query profiling: Analyze SQL query performance and optimization
  • API endpoint profiling: Measure API response times and throughput

3. Tool Integration

  • Linux perf: CPU profiling, hardware performance counters, tracepoints
  • eBPF/BCC: Dynamic tracing, custom performance instrumentation
  • Ftrace: Kernel function tracing, event tracing, latency measurements
  • SystemTap: System-wide tracing and profiling
  • Application profilers: Language-specific profiling tools
  • Container profiling: Docker, Kubernetes performance analysis

4. Bottleneck Identification

  • Hot spot detection: Identify frequently executed code paths
  • Resource contention: Detect lock contention, CPU starvation, I/O wait
  • Latency analysis: Measure and analyze latency distributions
  • Scalability analysis: Identify scalability limits and bottlenecks
  • Anomaly detection: Detect performance anomalies and regressions

5. Optimization Recommendations

  • Code optimizations: Suggest algorithmic improvements, caching strategies
  • Configuration tuning: Recommend system and application tuning parameters
  • Architecture improvements: Suggest architectural changes for performance
  • Resource allocation: Recommend optimal resource allocation strategies
  • Monitoring setup: Recommend performance monitoring configurations

6. Visualization & Reporting

  • Flame graphs: Generate CPU and memory flame graphs for visualization
  • Heat maps: Create latency heat maps for time-series analysis
  • Performance dashboards: Create real-time performance dashboards
  • Trend analysis: Analyze performance trends over time
  • Comparison reports: Compare performance across versions/environments

Profiling Tools Covered

Linux Kernel-Level Tools

  • perf: Linux performance events for CPU profiling, hardware counters
  • eBPF/BCC: Extended Berkeley Packet Filter for dynamic tracing
  • bpftrace: High-level tracing language for eBPF
  • Ftrace: Linux kernel internal tracer for function tracing
  • SystemTap: System-wide tracing and profiling framework
  • LTTng: Linux Trace Toolkit next generation
  • ktap: Lightweight kernel tracing

Application-Level Tools

  • Java: JProfiler, YourKit, VisualVM, Async Profiler
  • Python: cProfile, py-spy, Scalene, line_profiler
  • Node.js: clinic.js, 0x, node --prof, v8-profiler
  • Go: pprof, trace, delve, gops
  • Ruby: ruby-prof, stackprof, rbspy
  • .NET: dotnet-counters, dotnet-trace, PerfView
  • PHP: Xdebug, Blackfire, Tideways
  • C/C++: gprof, Valgrind, Intel VTune, perf
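For the Python entry in the list above, cProfile ships with the standard library, so a first profile needs no extra installation. The following is a minimal sketch; `slow_sum` and `profile_call` are hypothetical names used only for illustration, and the workload stands in for real application code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload: sum of squares computed in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers of expensive functions appear first
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 200_000)
    print(value)
    print(report)
```

cProfile instruments every call, so its overhead is higher than that of a sampling profiler such as py-spy; it is usually better suited to development than to production.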

System Monitoring Tools

  • top/htop: Process monitoring
  • vmstat: Virtual memory statistics
  • iostat: I/O statistics
  • netstat/ss: Network statistics
  • sar: System activity reporter
  • dstat: Versatile resource statistics
  • nmon: Nigel's performance monitor

Visualization Tools

  • FlameGraph: CPU and memory flame graphs
  • perfetto: System tracing and performance visualization
  • grafana: Performance dashboard visualization
  • prometheus: Time-series monitoring and alerting
  • jaeger: Distributed tracing visualization

Analysis Techniques

CPU Profiling with perf

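# Sample all CPUs at 99 Hz with call graphs for 30 seconds (writes perf.data)
perf record -F 99 -ag -- sleep 30

# Generate flame graph (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository and must be on PATH)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg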

# Analyze hardware performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./application

# Trace system calls (quote the glob so the shell does not expand it)
perf trace -e 'syscalls:sys_enter_*' ./application
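Before rendering a flame graph, it can help to rank functions by self time straight from the folded stacks. The sketch below assumes the folded format that stackcollapse-perf.pl emits (semicolon-separated frames followed by a sample count); `hot_leaf_functions` is a hypothetical helper, and the frame names in the usage example are illustrative, not real measurements.

```python
from collections import Counter


def hot_leaf_functions(folded_lines, top=5):
    """Return [(function, percent_of_samples)] for the most frequent leaf frames.

    Each input line is expected in folded-stack form: "frame1;frame2;leaf COUNT".
    """
    self_samples = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip lines that do not match the expected format
        leaf = stack.split(";")[-1]
        self_samples[leaf] += int(count)
    total = sum(self_samples.values())
    if total == 0:
        return []
    return [(name, 100.0 * n / total) for name, n in self_samples.most_common(top)]


if __name__ == "__main__":
    # Illustrative input; in practice read the lines from a file produced by
    # `perf script | stackcollapse-perf.pl > out.folded`
    sample = [
        "app;main;handle_request;parse_json 30",
        "app;main;handle_request;query_db 50",
        "app;main;idle 20",
    ]
    for name, pct in hot_leaf_functions(sample):
        print(f"{name}: {pct:.1f}%")
```

Only the leaf frame is counted, so the result reflects self time rather than inclusive time; a function that is expensive only through its callees will not rank high here.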

eBPF Tracing with BCC

from bcc import BPF

# eBPF program that measures how long each call to the probed function takes
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>  // TASK_COMM_LEN

struct data_t {
    u64 timestamp;
    u32 pid;
    char comm[TASK_COMM_LEN];
    u64 duration_ns;
};

BPF_HASH(start, u32);
BPF_PERF_OUTPUT(events);

int trace_entry(struct pt_regs *ctx) {
    // The lower 32 bits of pid_tgid hold the thread ID, so each thread is timed separately
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 tid = pid_tgid;
    u64 *tsp = start.lookup(&tid);

    if (tsp == 0) {
        return 0;  // no matching entry was recorded for this thread
    }

    u64 now = bpf_ktime_get_ns();

    struct data_t data = {};
    data.timestamp = now;
    data.pid = pid_tgid >> 32;  // upper 32 bits hold the process ID (TGID)
    data.duration_ns = now - *tsp;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&tid);

    return 0;
}
"""

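# Attach to function entry and return.
# "application" must be the path to the target binary (or a library name) and
# "function_name" a symbol in it; both are placeholders here.
bpf = BPF(text=bpf_text)
bpf.attach_uprobe(name="application", sym="function_name", fn_name="trace_entry")
bpf.attach_uretprobe(name="application", sym="function_name", fn_name="trace_return")

# Print each event as it arrives until interrupted with Ctrl-C
def print_event(cpu, data, size):
    event = bpf["events"].event(data)
    print(f"{event.comm.decode('utf-8', 'replace')} pid={event.pid} "
          f"duration={event.duration_ns / 1e6:.3f} ms")

bpf["events"].open_perf_buffer(print_event)
while True:
    try:
        bpf.perf_buffer_poll()
    except KeyboardInterrupt:
        break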

Memory Leak Detection

# Detect leaks with Valgrind (slows the program considerably; best suited to test environments)
valgrind --leak-check=full --show-leak-kinds=all ./application

# Track heap allocations with eBPF
/usr/share/bcc/tools/memleak -p $(pidof -s application)

# Sum proportional set size (PSS) across all mappings; repeat periodically to see growth over time
awk '/^Pss:/ {total += $2} END {print total " kB"}' /proc/$(pidof -s application)/smaps

# Monitor garbage collection (Java)
jstat -gc $(pidof -s java) 1s
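A single PSS reading says little about a leak; the growth rate across several readings is what matters. The sketch below sums the `Pss:` fields of a smaps dump and fits a least-squares slope to timestamped samples. The helper names (`pss_kib`, `read_pss_kib`, `growth_per_minute`) are hypothetical, and reading `/proc/<pid>/smaps` assumes Linux and sufficient permissions.

```python
def pss_kib(smaps_text):
    """Sum the 'Pss:' fields (in kB) of a /proc/<pid>/smaps dump."""
    total = 0
    for line in smaps_text.splitlines():
        # Match only the exact 'Pss:' field, not variants such as 'Pss_Anon:'
        if line.startswith("Pss:"):
            total += int(line.split()[1])
    return total


def read_pss_kib(pid):
    """Read the current PSS of a process (Linux only, needs read access)."""
    with open(f"/proc/{pid}/smaps") as f:
        return pss_kib(f.read())


def growth_per_minute(samples):
    """Least-squares slope of (seconds, kib) samples, returned in kB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 60.0
```

A steadily positive slope under a steady workload is a hint rather than proof of a leak; caches that are still warming up produce the same pattern for a while.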

Latency Analysis

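def analyze_latency_distribution(latency_samples):
    """
    Analyze latency distribution and identify outliers.

    Returns summary statistics, percentiles, and the samples that fall
    outside 1.5x the interquartile range.
    """
    import numpy as np

    latencies = np.asarray(latency_samples, dtype=float)
    if latencies.size == 0:
        raise ValueError("latency_samples must not be empty")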
    
    analysis = {
        'count': len(latencies),
        'mean': np.mean(latencies),
        'median': np.median(latencies),
        'p90': np.percentile(latencies, 90),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'std_dev': np.std(latencies),
        'min': np.min(latencies),
        'max': np.max(latencies),
        'outliers': []
    }
    
    # Identify outliers using IQR method
    q1 = np.percentile(latencies, 25)
    q3 = np.percentile(latencies, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers = latencies[(latencies < lower_bound) | (latencies > upper_bound)]
    analysis['outliers'] = outliers.tolist()
    analysis['outlier_percentage'] = len(outliers) / len(latencies) * 100
    
    return analysis
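Percentiles summarize a distribution, but a histogram shows its shape, including the multi-modal patterns that often indicate a slow path. The following sketch buckets latencies by powers of two using only the standard library; the function names and the millisecond unit are assumptions made for illustration.

```python
def log2_histogram(latencies_ms):
    """Count samples per power-of-two bucket: [1, 2), [2, 4), [4, 8), ... ms.

    Values below 1 ms are counted in the first bucket.
    """
    buckets = {}
    for value in latencies_ms:
        if value < 0:
            raise ValueError("latency must be non-negative")
        # bit_length of the integer part gives the power-of-two bucket index
        exponent = max(int(value), 1).bit_length() - 1
        buckets[exponent] = buckets.get(exponent, 0) + 1
    return buckets


def render_histogram(buckets, width=40):
    """Render the buckets as text rows with proportional bars."""
    if not buckets:
        return ""
    peak = max(buckets.values())
    rows = []
    for exponent in range(min(buckets), max(buckets) + 1):
        count = buckets.get(exponent, 0)
        bar = "#" * round(width * count / peak)
        low, high = 2 ** exponent, 2 ** (exponent + 1)
        rows.append(f"{low:>6} -> {high:<6} ms | {count:>5} | {bar}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Illustrative samples, not real measurements
    print(render_histogram(log2_histogram([0.5, 1, 3, 3.9, 8, 450])))
```

Logarithmic buckets keep the output short even when the slowest requests are hundreds of times slower than the median.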

Examples

# Profile CPU usage for 60 seconds
npm run performance-profiling:cpu -- --duration 60 --output cpu-profile.json

# Generate flame graph
npm run performance-profiling:flamegraph -- --pid $(pidof application) --output flamegraph.svg

# Analyze memory leaks
npm run performance-profiling:memory -- --application myapp --leak-check

# Trace database queries
npm run performance-profiling:database -- --database postgresql --duration 300

# Profile API endpoints
npm run performance-profiling:api -- --endpoints "/api/*" --duration 60 --output api-performance.json

# Compare performance before/after changes
npm run performance-profiling:compare -- --before baseline.json --after new-version.json --output comparison.json

# Analyze system resource usage
npm run performance-profiling:system -- --metrics cpu,memory,disk,network --duration 300

# Create performance dashboard
npm run performance-profiling:dashboard -- --metrics all --interval 1s --duration 3600

# Detect bottlenecks in microservices
npm run performance-profiling:microservices -- --services auth,payment,notification --duration 600

# Optimize configuration based on profiling
npm run performance-profiling:optimize -- --profile profile.json --output optimizations.md

# Monitor production performance
npm run performance-profiling:monitor -- --production --alert-threshold p95:200ms
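The before/after comparison above boils down to a per-metric percentage change with a regression threshold. The sketch below shows that core step under stated assumptions: the source does not specify the layout of `baseline.json` or `new-version.json`, so a flat mapping of metric name to value (lower is better) is assumed, and `compare_profiles` is a hypothetical function, not part of the skill's scripts.

```python
def compare_profiles(before, after, regression_threshold_pct=10.0):
    """Compare two {metric: value} mappings where lower values are better.

    Returns {metric: {"before", "after", "change_pct", "regression"}} for
    metrics present in both profiles.
    """
    report = {}
    for metric in sorted(before.keys() & after.keys()):
        old, new = before[metric], after[metric]
        if old == 0:
            continue  # percentage change is undefined for a zero baseline
        change_pct = (new - old) / old * 100.0
        report[metric] = {
            "before": old,
            "after": new,
            "change_pct": round(change_pct, 1),
            "regression": change_pct > regression_threshold_pct,
        }
    return report


if __name__ == "__main__":
    # Illustrative values only
    baseline = {"p50_ms": 80, "p95_ms": 150}
    candidate = {"p50_ms": 85, "p95_ms": 320}
    for metric, row in compare_profiles(baseline, candidate).items():
        print(metric, row)
```

The 10% default threshold is arbitrary; in practice it should sit above the run-to-run noise of the benchmark being compared.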

Output format

Performance Profiling Report:

Performance Profiling Report
────────────────────────────
System: payment-processing-service
Analysis Date: 2026-02-26
Duration: 300 seconds
Profiling Tools: perf, eBPF, Application Profiler

Executive Summary:
⚠️ Performance issues detected: 3 critical, 2 warnings
✅ System resources: Within normal limits
📊 Overall performance score: 72/100

Critical Issues:
1. ❌ Database query bottleneck (Severity: Critical)
   • Query: SELECT * FROM transactions WHERE user_id = ?
   • Average latency: 450ms (p95: 1200ms)
   • Frequency: 1200 executions/minute
   • Root cause: Missing index on user_id column
   • Impact: 40% of API latency
   • Recommendation: Add index on transactions.user_id

2. ❌ Memory leak in cache service (Severity: Critical)
   • Service: redis-cache-service
   • Memory growth: 2MB/minute
   • Total leaked: 120MB over 1 hour
   • Pattern: Cache entries not expired properly
   • Recommendation: Implement TTL and LRU eviction

3. ❌ CPU contention in payment processor (Severity: Critical)
   • Function: processPayment() in payment-service
   • CPU usage: 85% during peak
   • Bottleneck: Cryptographic operations
   • Recommendation: Implement caching or hardware acceleration

Warnings:
1. ⚠️ API endpoint latency degradation (Severity: Warning)
   • Endpoint: POST /api/v1/payments
   • p95 latency increase: 150ms → 320ms (+113%)
   • Timeframe: Last 7 days
   • Recommendation: Profile endpoint and optimize

2. ⚠️ Garbage collection pauses (Severity: Warning)
   • Application: notification-service (Java)
   • GC pauses: 45ms average, 120ms max
   • Frequency: Every 30 seconds
   • Recommendation: Tune JVM garbage collector

System Resource Analysis:
┌──────────────────────┬────────────┬────────────┬────────────┐
│ Resource             │ Usage      │ Threshold  │ Status     │
├──────────────────────┼────────────┼────────────┼────────────┤
│ CPU                  │ 65%        │ 80%        │ ✅ Normal  │
│ Memory               │ 72%        │ 85%        │ ✅ Normal  │
│ Disk I/O             │ 45%        │ 70%        │ ✅ Normal  │
│ Network              │ 38%        │ 60%        │ ✅ Normal  │
│ Database Connections │ 85%        │ 90%        │ ⚠️ Warning │
└──────────────────────┴────────────┴────────────┴────────────┘

Application Performance:
• API Response Times:
  - p50: 85ms ✅
  - p95: 320ms ⚠️
  - p99: 1200ms ❌
  - Success Rate: 99.8% ✅

• Database Performance:
  - Query Cache Hit Rate: 65% ⚠️
  - Average Query Time: 85ms ✅
  - Slow Queries (>100ms): 12% ⚠️
  - Connection Pool Usage: 85% ⚠️

• Cache Performance:
  - Redis Hit Rate: 92% ✅
  - Cache Latency: 3ms ✅
  - Memory Usage: 78% ⚠️
  - Eviction Rate: 5% ✅

Flame Graph Analysis:
• Hot Functions:
  1. processPayment() - 35% CPU time
  2. validateTransaction() - 22% CPU time
  3. updateDatabase() - 18% CPU time
  4. sendNotification() - 8% CPU time
  5. logActivity() - 5% CPU time

• Optimization Opportunities:
  1. Cache validation results (potential 15% improvement)
  2. Batch database updates (potential 10% improvement)
  3. Async notifications (potential 8% improvement)

Memory Analysis:
• Heap Usage: 2.4GB
• Stack Usage: 320MB
• Native Memory: 450MB
• Garbage Collection:
  - Young GC: 45ms every 30s
  - Full GC: 120ms every 5min
  - Throughput: 98.5%

I/O Analysis:
• Disk Read: 45MB/s (average)
• Disk Write: 28MB/s (average)
• File Descriptors: 1250/4096 (31%)
• Network Throughput:
  - Inbound: 85Mbps
  - Outbound: 120Mbps
  - Connections: 850 active

Bottleneck Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Bottleneck Timeline (Last 60 minutes)                       │
│                                                             │
│ 00:00 ┼───────┬──────────────┬─────────────┬────────────── │
│       │ CPU   │ Database     │ Memory      │ Network       │
│ 15:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ ███   │ █████████    │ ███         │ ██            │
│ 30:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ █████ │ ████████████ │ █████       │ ███           │
│ 45:00 ┼───────┼──────────────┼─────────────┼────────────── │
│       │ ██████│ █████████████│ ███████     │ ████          │
│ 60:00 ┼───────┴──────────────┴─────────────┴────────────── │
│      0%                   50%                   100%       │
└─────────────────────────────────────────────────────────────┘

Optimization Recommendations:
1. Immediate (High Impact):
   • Add database index on transactions.user_id
   • Implement cache TTL for redis-cache-service
   • Optimize processPayment() cryptographic operations

2. Short-term (Medium Impact):
   • Implement connection pooling for database
   • Add query caching for frequent queries
   • Batch database writes where possible

3. Long-term (Architectural):
   • Implement read replicas for database
   • Add CDN for static assets
   • Implement circuit breakers for external services

Performance Metrics Baseline:
• CPU Usage: < 70% target
• Memory Usage: < 80% target
• API p95 Latency: < 200ms target
• Database Query Time: < 100ms target
• Cache Hit Rate: > 90% target

Monitoring Configuration:
• Alert on: p95 latency > 200ms
• Alert on: CPU usage > 80% for 5 minutes
• Alert on: Memory usage > 85%
• Alert on: Error rate > 1%
• Dashboard: Real-time performance metrics

Next Steps:
1. Implement database index (estimate: 2 hours)
2. Fix memory leak in cache service (estimate: 4 hours)
3. Optimize payment processor CPU usage (estimate: 8 hours)
4. Deploy optimizations with feature flags
5. Monitor performance for 24 hours
6. Schedule performance regression tests
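The alert thresholds in the Monitoring Configuration part of the sample report include one rule with a duration ("for 5 minutes"). A minimal sketch of evaluating such rules follows; it assumes one sample per minute, so a 5-minute rule becomes five consecutive samples. `AlertRule`, `firing_alerts`, and the metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    sustained_samples: int = 1  # consecutive samples above threshold needed to fire


def firing_alerts(rules, series):
    """Return the metrics whose most recent samples all exceed the rule threshold.

    `series` maps metric name -> list of samples, oldest first.
    """
    firing = []
    for rule in rules:
        recent = series.get(rule.metric, [])[-rule.sustained_samples:]
        enough_data = len(recent) == rule.sustained_samples
        if enough_data and all(v > rule.threshold for v in recent):
            firing.append(rule.metric)
    return firing


if __name__ == "__main__":
    rules = [
        AlertRule("api_p95_ms", 200),
        AlertRule("cpu_pct", 80, sustained_samples=5),
        AlertRule("memory_pct", 85),
        AlertRule("error_rate_pct", 1),
    ]
    # Illustrative series, one sample per minute
    series = {
        "api_p95_ms": [310, 320],
        "cpu_pct": [82, 85, 65, 81, 83],
        "memory_pct": [72],
        "error_rate_pct": [0.2],
    }
    print(firing_alerts(rules, series))
```

Requiring consecutive samples keeps a short CPU spike from paging anyone, at the cost of a delay equal to the sustain window.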

JSON Output Format:

{
  "analysis": {
    "system": "payment-processing-service",
    "analysis_date": "2026-02-26",
    "duration_seconds": 300,
    "profiling_tools": ["perf", "ebpf", "application_profiler"],
    "overall_score": 72
  },
  "critical_issues": [
    {
      "id": "issue-db-001",
      "description": "Database query bottleneck",
      "severity": "critical",
      "component": "database",
      "metric": "query_latency",
      "average_value": 450,
      "p95_value": 1200,
      "unit": "ms",
      "frequency": "1200 executions/minute",
      "root_cause": "Missing index on user_id column",
      "impact": "40% of API latency",
      "recommendation": "Add index on transactions.user_id",
      "estimated_effort_hours": 2,
      "priority": "high"
    },
    {
      "id": "issue-memory-001",
      "description": "Memory leak in cache service",
      "severity": "critical",
      "component": "cache",
      "metric": "memory_growth",
      "average_value": 2,
      "unit": "MB/minute",
      "total_leaked": 120,
      "total_leaked_unit": "MB",
      "timeframe": "1 hour",
      "pattern": "Cache entries not expired properly",
      "recommendation": "Implement TTL and LRU eviction",
      "estimated_effort_hours": 4,
      "priority": "high"
    }
  ],
  "system_resources": {
    "cpu": {
      "usage_percentage": 65,
      "threshold": 80,
      "status": "normal",
      "breakdown": {
        "user": 45,
        "system": 20,
        "iowait": 8,
        "steal": 2
      }
    },
    "memory": {
      "usage_percentage": 72,
      "threshold": 85,
      "status": "normal",
      "breakdown": {
        "heap": 2400,
        "stack": 320,
        "native": 450,
        "cached": 1200
      }
    },
    "disk_io": {
      "usage_percentage": 45,
      "threshold": 70,
      "status": "normal",
      "read_mbps": 45,
      "write_mbps": 28
    },
    "network": {
      "usage_percentage": 38,
      "threshold": 60,
      "status": "normal",
      "inbound_mbps": 85,
      "outbound_mbps": 120,
      "connections": 850
    }
  },
  "application_performance": {
    "api_response_times": {
      "p50_ms": 85,
      "p95_ms": 320,
      "p99_ms": 1200,
      "success_rate": 99.8
    },
    "database_performance": {
      "query_cache_hit_rate": 65,
      "average_query_time_ms": 85,
      "slow_queries_percentage": 12,
      "connection_pool_usage": 85
    },
    "cache_performance": {
      "hit_rate": 92,
      "latency_ms": 3,
      "memory_usage_percentage": 78,
      "eviction_rate": 5
    }
  },
  "flame_graph_analysis": {
    "hot_functions": [
      {
        "function": "processPayment",
        "cpu_percentage": 35,
        "optimization_opportunity": "Cache validation results"
      },
      {
        "function": "validateTransaction",
        "cpu_percentage": 22,
        "optimization_opportunity": "Batch validation"
      }
    ],
    "optimization_opportunities": [
      {
        "description": "Cache validation results",
        "estimated_improvement": 15,
        "effort_hours": 8
      },
      {
        "description": "Batch database updates",
        "estimated_improvement": 10,
        "effort_hours": 6
      }
    ]
  },
  "optimization_recommendations": {
    "immediate": [
      "Add database index on transactions.user_id",
      "Implement cache TTL for redis-cache-service",
      "Optimize processPayment() cryptographic operations"
    ],
    "short_term": [
      "Implement connection pooling for database",
      "Add query caching for frequent queries",
      "Batch database writes where possible"
    ],
    "long_term": [
      "Implement read replicas for database",
      "Add CDN for static assets",
      "Implement circuit breakers for external services"
    ]
  },
  "performance_baseline": {
    "cpu_usage_target": 70,
    "memory_usage_target": 80,
    "api_p95_latency_target": 200,
    "database_query_time_target": 100,
    "cache_hit_rate_target": 90
  },
  "next_steps": [
    {
      "action": "Implement database index",
      "estimate_hours": 2,
      "priority": "high"
    },
    {
      "action": "Fix memory leak in cache service",
      "estimate_hours": 4,
      "priority": "high"
    },
    {
      "action": "Optimize payment processor CPU usage",
      "estimate_hours": 8,
      "priority": "medium"
    }
  ]
}
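A report in this JSON shape is straightforward to consume programmatically. As a sketch, the function below picks the critical issues that fit an effort budget, cheapest first. It relies only on fields shown in the sample (`critical_issues`, `id`, `recommendation`, `estimated_effort_hours`); `quick_wins` itself is a hypothetical helper.

```python
import json


def quick_wins(report_json, max_effort_hours=4):
    """List critical issues that fit within an effort budget, cheapest first."""
    report = json.loads(report_json)
    issues = [
        issue
        for issue in report.get("critical_issues", [])
        # Issues without an estimate are treated as too expensive to be quick wins
        if issue.get("estimated_effort_hours", float("inf")) <= max_effort_hours
    ]
    issues.sort(key=lambda issue: issue["estimated_effort_hours"])
    return [(issue["id"], issue.get("recommendation", "")) for issue in issues]


if __name__ == "__main__":
    # Trimmed copy of the sample report above
    sample = """
    {
      "critical_issues": [
        {"id": "issue-memory-001", "recommendation": "Implement TTL and LRU eviction",
         "estimated_effort_hours": 4},
        {"id": "issue-db-001", "recommendation": "Add index on transactions.user_id",
         "estimated_effort_hours": 2}
      ]
    }
    """
    for issue_id, recommendation in quick_wins(sample):
        print(issue_id, "->", recommendation)
```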

Performance Dashboard:

Performance Dashboard
────────────────────
Status: ACTIVE
Last Update: 2026-02-26 19:45:00
Update Interval: 1 second

Real-time Metrics:
┌────────────────────┬────────────┬────────────┬────────────┐
│ Metric             │ Current    │ 1min Avg   │ Trend      │
├────────────────────┼────────────┼────────────┼────────────┤
│ CPU Usage          │ 65%        │ 62%        │ ↗️ Rising   │
│ Memory Usage       │ 72%        │ 71%        │ → Stable   │
│ API Latency (p95)  │ 320ms      │ 310ms      │ ↗️ Rising   │
│ Database Latency   │ 85ms       │ 82ms       │ → Stable   │
│ Cache Hit Rate     │ 92%        │ 91%        │ ↘️ Falling  │
│ Error Rate         │ 0.2%       │ 0.3%       │ ↘️ Falling  │
└────────────────────┴────────────┴────────────┴────────────┘

Alerts:
• ⚠️  API p95 latency above threshold (200ms): 320ms
• ✅  CPU usage within limits
• ✅  Memory usage within limits
• ⚠️  Database connections approaching limit (85%)

Hotspots:
1. processPayment(): 35% CPU (🔥 Hot)
2. validateTransaction(): 22% CPU (⚠️ Warm)
3. updateDatabase(): 18% CPU (⚠️ Warm)

Resource Utilization Trend:
CPU:     ██████████████████████████░░░░░░░░░░░░░░ 65%
Memory:  █████████████████████████████░░░░░░░░░░░ 72%
Disk:    ██████████████████░░░░░░░░░░░░░░░░░░░░░░ 45%
Network: ███████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 38%

Recent Events:
• 19:40: Database query slowdown detected
• 19:35: Cache miss rate increased by 15%
• 19:30: API latency spike (p95: 450ms)
• 19:25: Memory usage increased by 2%

Recommendations:
1. Add index on transactions.user_id (pending)
2. Implement cache TTL (in progress)
3. Optimize payment processor (planned)

Performance Score: 72/100
Status: Needs Improvement
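Utilization bars like those in the Resource Utilization Trend part of the sample dashboard are easy to generate so that the filled portion always matches the percentage. A minimal sketch, assuming a 40-character bar; `render_bar` is a hypothetical helper.

```python
def render_bar(percent, width=40):
    """Render a utilization bar whose filled part is proportional to percent."""
    percent = min(max(percent, 0.0), 100.0)  # clamp out-of-range readings
    filled = round(width * percent / 100.0)
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    # Values taken from the sample dashboard above
    for label, value in [("CPU", 65), ("Memory", 72), ("Disk", 45), ("Network", 38)]:
        print(f"{label + ':':<9}{render_bar(value)} {value}%")
```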

Notes

  • Profile in production-like environments for accurate results
  • Use appropriate sampling rates to balance overhead and accuracy
  • Compare against baselines to identify regressions
  • Monitor profiling overhead to avoid affecting production performance
  • Use flame graphs for visual bottleneck identification
  • Combine multiple tools for comprehensive analysis
  • Profile representative workloads that match production usage
  • Consider security implications of profiling in production
  • Document profiling methodology for reproducibility
  • Automate performance regression testing in CI/CD pipelines