
ln-811-performance-profiler

SKILL.md

Paths: File paths (shared/, references/, ../ln-*) are relative to skills repo root. If not found at CWD, locate this SKILL.md directory and go up one level for repo root.


Type: L3 Worker
Category: 8XX Optimization

Runtime profiler that executes the optimization target, measures multiple metrics (CPU, memory, I/O, time), instruments code for per-function breakdown, and produces a standardized performance map from real data.


Overview

| Aspect | Details |
|---|---|
| Input | Problem statement: target (file/endpoint/pipeline) + observed metric |
| Output | Performance map (multi-metric, per-function), suspicion stack, bottleneck classification |
| Pattern | Discover test → Baseline run → Static analysis → Deep profile → Performance map → Report |

Workflow

Phases: Test Discovery → Baseline Run → Static Analysis → Deep Profile → Performance Map → Report


Phase 0: Test Discovery/Creation

MANDATORY READ: Load shared/references/ci_tool_detection.md for test framework detection.
MANDATORY READ: Load shared/references/benchmark_generation.md for auto-generating benchmarks when none exist.

Find or create commands that exercise the optimization target. Two outputs: test_command (profiling/measurement) and e2e_test_command (functional safety gate).

Step 1: Discover test_command

| Priority | Method | Action |
|---|---|---|
| 1 | User-provided | User specifies test command or API endpoint |
| 2 | Discover existing E2E test | Grep test files for target entry point (stop at first match) |
| 3 | Create test script | Generate per shared/references/benchmark_generation.md to .optimization/{slug}/profile_test.sh |

E2E discovery protocol (stop at first match):

| Priority | Method | How |
|---|---|---|
| 1 | Route-based search | Grep e2e/integration test files for entry point route |
| 2 | Function-based search | Grep for entry point function name |
| 3 | Module-based search | Grep for import of entry point module |

Test creation (if no existing test found):

| Target Type | Generated Script |
|---|---|
| API endpoint | `curl -w "%{time_total}" -o /dev/null -s {endpoint}` |
| Function | Stack-specific benchmark per shared/references/benchmark_generation.md |
| Pipeline | Full pipeline invocation with test input |

Step 2: Discover e2e_test_command

If test_command came from E2E discovery (Step 1 priority 2): e2e_test_command = test_command.

Otherwise, run E2E discovery protocol again (same 3-priority table) to find a separate functional safety test.

If not found: e2e_test_command = null, log: WARNING: No e2e test covers {entry_point}. Full test suite serves as functional gate.

Output

| Field | Description |
|---|---|
| test_command | Command for profiling/measurement |
| e2e_test_command | Command for functional safety gate (may equal test_command, or null) |
| e2e_test_source | Discovery method: user / route / function / module / none |

Phase 1: Baseline Run (Multi-Metric)

Run test_command with system-level profiling. Capture simultaneously:

| Metric | How to Capture | When |
|---|---|---|
| Wall time | `time` wrapper or test harness | Always |
| CPU time (user+sys) | `/usr/bin/time -v` or language profiler | Always |
| Memory peak (RSS) | `/usr/bin/time -v` (Max RSS) or `tracemalloc` / `process.memoryUsage()` | Always |
| I/O bytes | `/usr/bin/time -v` or structured logs | If I/O suspected |
| HTTP round-trips | Count from structured logs or application metrics | If network I/O in call graph |
| GPU utilization | `nvidia-smi --query-gpu` | Only if CUDA/GPU detected in stack |

Baseline Protocol

| Parameter | Value |
|---|---|
| Runs | 3 |
| Metric | Median |
| Warm-up | 1 discarded run |
| Output | baseline — multi-metric snapshot |
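The protocol above can be sketched in Python (a minimal sketch, assuming a shell-runnable test_command; capturing CPU and memory alongside wall time, e.g. via `/usr/bin/time -v` or `tracemalloc`, is stack-specific and omitted here):

```python
import statistics
import subprocess
import time

def baseline_wall_time(test_command, warmup=1, runs=3):
    """Run test_command per the baseline protocol: discard warm-up
    run(s), return the median wall time in ms over the measured runs."""
    def run_once():
        start = time.perf_counter()
        subprocess.run(test_command, shell=True, check=True,
                       capture_output=True)
        return (time.perf_counter() - start) * 1000
    for _ in range(warmup):
        run_once()                     # warm-up, discarded
    return statistics.median(run_once() for _ in range(runs))
```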

Phase 2: Static Analysis → Instrumentation Points

MANDATORY READ: Load bottleneck_classification.md

Trace the call chain from code and build the suspicion stack. Purpose: guide WHERE to instrument in Phase 3.

Step 1: Trace Call Chain

Starting from entry point, trace depth-first (max depth 5). At each step, READ the full function body.

Cross-service tracing: If service_topology is available from coordinator and a step makes an HTTP/gRPC call to another service whose code is accessible:

| Situation | Action |
|---|---|
| HTTP call to service with code in submodule/monorepo | Follow into that service's handler: resolve route → trace handler code (depth resets to 0 for the new service) |
| HTTP call to service without accessible code | Classify as External, record latency estimate |
| gRPC/message queue to known service | Same as HTTP — follow into handler if code accessible |
Record service: "{service_name}" on each step to track which service owns it. The performance_map steps tree can span multiple services.

Depth-First Rule: If code of the called service is accessible — ALWAYS profile INSIDE. NEVER classify an accessible service as "External/slow" without profiling its internals. "Slow" is a symptom, not a diagnosis.

5 Whys for each bottleneck: Before reporting a bottleneck, chain "why?" until you reach config/architecture level:

  1. "What is slow?" → alignment service (5.9s)
  2. "Why?" → 6 pairs × ~1s each
  3. "Why ~1s per pair?" → O(n²) mwmf computation
  4. "Why O(n²)?" → library default, not production config
  5. "Why default?" → matching_methods not configured → root cause = config

Step 2: Classify & Suspicion Scan

For each step, classify by type (CPU, I/O-DB, I/O-Network, I/O-File, Architecture, External, Cache) and scan for performance concerns.

Suspicion checklist (a minimum, not a limit):

| Category | What to Look For |
|---|---|
| Connection management | Client created per-request? Missing pooling? Missing reuse? |
| Data flow | Data read multiple times? Over-fetching? Unnecessary transforms? |
| Async patterns | Sync I/O in async context? Sequential awaits without data dependency? |
| Resource lifecycle | Unclosed connections? Temp files? Memory accumulation in loop? |
| Configuration | Hardcoded timeouts? Default pool sizes? Missing batch size config? |
| Redundant work | Same validation at multiple layers? Same data loaded twice? |
| Architecture | N+1 in loop? Batch API unused? Cache infra unused? Sequential-when-parallel? |
| (open) | Anything else spotted — checklist does not limit findings |
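As an illustration, the "N+1 in loop" check in the Architecture category targets the shape below (`fetch_user` and `fetch_users` are hypothetical stand-ins for a real query layer; the counter stands in for round-trips counted from logs or metrics):

```python
calls = {"n": 0}  # stand-in for round-trip counting

def fetch_user(uid):
    """Hypothetical per-id query: one round-trip each."""
    calls["n"] += 1
    return {"id": uid}

def fetch_users(uids):
    """Hypothetical batch query: one round-trip total."""
    calls["n"] += 1
    return [{"id": u} for u in uids]

ids = [1, 2, 3, 4, 5]

users = [fetch_user(i) for i in ids]   # N+1 shape: one call per item
n_plus_one_calls = calls["n"]          # 5 round-trips

calls["n"] = 0
users = fetch_users(ids)               # batched alternative
batched_calls = calls["n"]             # 1 round-trip
```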

Step 2b: Suspicion Deduplication

MANDATORY READ: Load shared/references/output_normalization.md

After generating suspicions across all call chain steps, normalize and deduplicate per §1-§2:

  • Normalize suspicion descriptions (replace specific values with placeholders)
  • Group identical suspicions across different steps → merge into single entry with affected_steps: [list]
  • Example: "Missing connection pooling" found in steps 1.1, 1.2, 1.3 → one suspicion with affected_steps: ["1.1", "1.2", "1.3"]
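A sketch of this merge step (the placeholder rules below are illustrative only; the authoritative normalization rules are in shared/references/output_normalization.md):

```python
import re
from collections import defaultdict

def normalize(desc):
    """Replace specific values with placeholders so identical suspicions
    from different steps collapse to one key (illustrative rules)."""
    desc = re.sub(r"\d+(\.\d+)?\s*(ms|s|mb|kb)\b", "{value}", desc,
                  flags=re.I)
    desc = re.sub(r"\d+", "{n}", desc)
    return desc.lower().strip()

def dedupe(suspicions):
    """suspicions: list of (step_id, description) → merged entries."""
    groups = defaultdict(list)
    for step_id, desc in suspicions:
        groups[normalize(desc)].append(step_id)
    return [{"description": d, "affected_steps": steps}
            for d, steps in groups.items()]

# Same finding at three steps merges into one entry
merged = dedupe([("1.1", "Missing connection pooling"),
                 ("1.2", "Missing connection pooling"),
                 ("1.3", "Missing connection pooling")])
# → one entry with affected_steps ["1.1", "1.2", "1.3"]
```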

Step 3: Verify & Map to Instrumentation Points

```
FOR each suspicion:
  1. VERIFY: follow code to confirm or dismiss
  2. VERDICT: CONFIRMED → map to instrumentation point | DISMISSED → log reason
  3. For each CONFIRMED suspicion, identify:
     - function to wrap with timing
     - I/O call to count
     - memory allocation to track
```

Profiler Selection (per stack)

| Stack | Non-invasive Profiler | Invasive (if non-invasive insufficient) |
|---|---|---|
| Python | py-spy, cProfile | `time.perf_counter()` decorators |
| Node.js | clinic, `--prof` | `console.time()` wrappers |
| Go | pprof (built-in) | Usually not needed |
| .NET | dotnet-trace | Stopwatch wrappers |
| Rust | cargo flamegraph | `std::time::Instant` |

Stack detection: per shared/references/ci_tool_detection.md.


Phase 3: Deep Profile

Profiler Hierarchy (escalate as needed)

| Level | Tool Examples | What It Shows | When to Use |
|---|---|---|---|
| 1 | py-spy, cProfile, pprof, dotnet-trace | Function-level hotspots | Always — first pass |
| 2 | line_profiler, per-line timing | Line-level timing in hotspot function | Hotspot function found but cause unclear |
| 3 | tracemalloc, memory_profiler | Per-line memory allocation | Memory metrics abnormal in baseline |

Step 1: Non-Invasive Profiling (preferred)

Run test_command with Level 1 profiler to get per-function breakdown without code changes.
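On a Python stack, for example, the Level 1 pass can run in-process with the stdlib profiler (py-spy attaches to a running process instead; `profile_entry` is a hypothetical helper, not part of the skill):

```python
import cProfile
import io
import pstats

def profile_entry(fn, *args, top=10):
    """Run fn under cProfile and return (result, per-function report)
    sorted by cumulative time: function-level hotspots, no source edits."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

Usage sketch: `result, report = profile_entry(entry_fn, test_input)`, then scan the report for the dominant `cumtime` entries.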

Step 2: Escalation Decision

After Level 1 profiler run, evaluate result against suspicion stack from Phase 2:

| Profiler Result | Action |
|---|---|
| Hotspot function identified, time breakdown confirms suspicions | DONE — proceed to Phase 4 |
| Hotspot identified but internal cause unclear (CPU vs I/O inside one function) | Escalate to Level 2 (line-level timing) |
| Memory baseline abnormal (peak or delta) | Escalate to Level 3 (memory profiler) |
| Multiple suspicions unresolved — profiler granularity insufficient | Go to Step 3 (targeted instrumentation) |
| Profiler unavailable or overhead > 20% of wall time | Go to Step 3 (targeted instrumentation) |

Step 3: Targeted Instrumentation (proactive)

Add timing/logging along the call stack at instrumentation points identified in Phase 2 Step 3:

1. FOR each CONFIRMED suspicion without measured data:
   - Add timing wrapper around target function/I/O call
   - Add counter for I/O round-trips if network/DB suspected
   - (cross-service: instrument in the correct service's codebase)
2. Re-run test_command (3 runs, median)
3. Collect per-function measurements from logs
4. Record list of instrumented files (may span multiple services)

| Instrumentation Type | When | Example |
|---|---|---|
| Timing wrapper | Always for unresolved suspicions | `time.perf_counter()` around function call |
| I/O call counter | Network or DB bottleneck suspected | Count HTTP requests, DB queries in loop |
| Memory snapshot | Memory accumulation suspected | `tracemalloc.get_traced_memory()` before/after |

KEEP instrumentation in place. The executor reuses it for the post-optimization per-function comparison, then cleans up after the strike. Report instrumented_files in the output.
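A minimal Python sketch of the two most common instrumentation types (names such as `timed`, `io_calls`, and `mt_translate` are illustrative; the log lines are what gets collected on re-run):

```python
import functools
import logging
import time

log = logging.getLogger("profiler")
io_calls = {"http": 0, "db": 0}   # I/O round-trip counters

def timed(label):
    """Timing wrapper: logs wall time per call so per-function numbers
    can be collected from logs on re-run (3 runs, median)."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                ms = (time.perf_counter() - start) * 1000
                log.info("PROFILE %s %.1fms", label, ms)
        return wrapper
    return deco

@timed("mt_translate")            # hypothetical suspect from the call chain
def mt_translate(segment):
    io_calls["http"] += 1         # count the suspected round-trip
    return segment.upper()
```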


Phase 4: Build Performance Map

Standardized format — feeds into .optimization/{slug}/context.md for downstream consumption.

```yaml
performance_map:
  test_command: "uv run pytest tests/e2e/test_example.py -s"
  baseline:
    wall_time_ms: 7280
    cpu_time_ms: 850
    memory_peak_mb: 256
    memory_delta_mb: 45
    io_read_bytes: 1200000
    io_write_bytes: 500000
    http_round_trips: 13
  steps:                          # service field present only in multi-service topology
    - id: "1"
      function: "process_job"
      location: "app/services/job_processor.py:45"
      service: "api"             # optional — which service owns this step
      wall_time_ms: 7200
      time_share_pct: 99
      type: "function_call"
      children:
        - id: "1.1"
          function: "translate_binary"
          wall_time_ms: 7100
          type: "function_call"
          children:
            - id: "1.1.1"
              function: "tikal_extract"
              service: "tikal"   # cross-service: code traced into submodule
              wall_time_ms: 2800
              type: "http_call"
              http_round_trips: 1
            - id: "1.1.2"
              function: "mt_translate"
              service: "mt-engine"
              wall_time_ms: 3500
              type: "http_call"
              http_round_trips: 13
  bottleneck_classification: "I/O-Network"
  bottleneck_detail: "13 sequential HTTP calls to MT service (3500ms)"
  top_bottlenecks:
    - { step: "1.1.2", type: "I/O-Network", share: "48%" }
    - { step: "1.1.1", type: "I/O-Network", share: "38%" }
```
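The top_bottlenecks entries can be derived mechanically from the steps tree. A sketch (leaf step types are reported as-is; mapping them onto the classification taxonomy follows bottleneck_classification.md):

```python
def top_bottlenecks(perf_map, n=3):
    """Rank leaf steps of the performance map by wall time;
    share = leaf wall_time_ms / baseline wall_time_ms."""
    total = perf_map["baseline"]["wall_time_ms"]
    leaves = []

    def walk(step):
        children = step.get("children", [])
        if not children:
            leaves.append(step)        # leaves carry the measured cost
        for child in children:
            walk(child)

    for step in perf_map["steps"]:
        walk(step)
    leaves.sort(key=lambda s: s["wall_time_ms"], reverse=True)
    return [{"step": s["id"], "type": s["type"],
             "share_pct": round(100 * s["wall_time_ms"] / total)}
            for s in leaves[:n]]
```

On the example map above this yields step 1.1.2 at 48% and step 1.1.1 at 38%, matching the listed top_bottlenecks.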

Phase 5: Report

Report Structure

```yaml
profile_result:
  entry_point_info:
    type: <string>                     # "api_endpoint" | "function" | "pipeline"
    location: <string>                 # file:line
    route: <string|null>               # API route (if endpoint)
    function: <string>                 # Entry point function name
  performance_map: <object>            # Full map from Phase 4
  bottleneck_classification: <string>  # Primary bottleneck type
  bottleneck_detail: <string>          # Human-readable description
  top_bottlenecks:
    - step, type, share, description
  optimization_hints:                  # CONFIRMED suspicions only (Phase 2)
    - hint with evidence
  suspicion_stack:                     # Full audit trail (confirmed + dismissed)
    - category: <string>
      location: <string>
      description: <string>
      verdict: <string>               # "confirmed" | "dismissed"
      evidence: <string>
      verification_note: <string>
  e2e_test:
    command: <string|null>             # E2E safety test command (from Phase 0)
    source: <string>                   # user / route / function / module / none
  instrumented_files: [<string>]       # Files with active instrumentation (empty if non-invasive only)
  wrong_tool_indicators: []            # Empty = proceed, non-empty = exit
```

Wrong Tool Indicators

| Indicator | Condition |
|---|---|
| external_service_no_alternative | 90%+ of measured time in external service, no batch/cache/parallel path |
| within_industry_norm | Measured time within expected range for operation type |
| infrastructure_bound | Bottleneck is hardware (measured via system metrics) |
| already_optimized | Code already uses best patterns (confirmed by suspicion scan) |

Error Handling

| Error | Recovery |
|---|---|
| Cannot resolve entry point | Block: "file/function not found at {path}" |
| Test command fails on unmodified code | Block: "test fails before profiling — fix test first" |
| Profiler not available for stack | Fall back to invasive instrumentation (Phase 3 Step 3) |
| Instrumentation breaks tests | Revert immediately: `git checkout -- .` |
| Call chain too deep (> 5 levels) | Stop at depth 5, note truncation |
| Cannot classify step type | Default to "Unknown", use measured time |
| No I/O detected (pure CPU) | Classify as CPU, focus on algorithm profiling |

References

  • bottleneck_classification.md — classification taxonomy
  • latency_estimation.md — latency heuristics (fallback for static-only mode)
  • shared/references/ci_tool_detection.md — stack/tool detection
  • shared/references/benchmark_generation.md — benchmark templates per stack

Definition of Done

  • Test command discovered or created for optimization target
  • E2E safety test discovered (or documented as unavailable)
  • Baseline measured: wall time, CPU, memory (3 runs, median)
  • Call graph traced and function bodies read
  • Suspicion stack built: each suspicion verified and mapped to instrumentation point
  • Deep profile completed (non-invasive preferred, invasive if needed)
  • Instrumented files reported (cleanup deferred to executor)
  • Performance map built in standardized format (real measurements)
  • Top 3 bottlenecks identified from measured data
  • Wrong tool indicators evaluated from real metrics
  • optimization_hints contain only CONFIRMED suspicions with measurement evidence
  • Report returned to coordinator

Version: 3.0.0, Last Updated: 2026-03-15
