linux-perf

SKILL.md

Linux perf

Purpose

Guide agents through perf for CPU profiling: sampling, hardware counter measurement, hotspot identification, and integration with flamegraph generation.

Triggers

  • "Which function is consuming the most CPU?"
  • "How do I measure cache misses / IPC?"
  • "How do I use perf to find hotspots?"
  • "How do I generate a flamegraph from perf data?"
  • "perf shows [unknown] or [kernel] frames"

Workflow

1. Prerequisites

# Install
sudo apt install linux-perf    # Debian/Ubuntu (version-matched)
sudo dnf install perf          # Fedora/RHEL

# Check permissions
# By default perf requires root or paranoid level ≤ 1
cat /proc/sys/kernel/perf_event_paranoid
# 2 = only CPU stats (not kernel), 1 = user+kernel, 0 = all, -1 = no restrictions

# Temporarily lower (session only)
sudo sysctl -w kernel.perf_event_paranoid=1

# Persistent
echo 'kernel.perf_event_paranoid=1' | sudo tee /etc/sysctl.d/99-perf.conf
sudo sysctl -p /etc/sysctl.d/99-perf.conf

Compile the target with debug symbols for useful frame data:

gcc -g -O2 -fno-omit-frame-pointer -o prog main.c
# -fno-omit-frame-pointer: essential for frame-pointer-based unwinding
# Alternative: compile with DWARF CFI and use --call-graph=dwarf

2. perf stat — quick counters

# Basic hardware counters
perf stat ./prog

# With specific events
perf stat -e cache-misses,cache-references,instructions,cycles,branch-misses ./prog

# Wall-clock comparison: N runs
perf stat -r 5 ./prog

# Attach to existing process
perf stat -p 12345 sleep 10

Interpret perf stat output:

  • IPC (instructions per cycle) < 1.0: memory-bound or stalled pipeline
  • cache-miss rate > 5%: significant cache pressure
  • branch-miss rate > 5%: branch predictor struggling

3. perf record — sampling

# Default: sample at 1000 Hz (cycles event)
perf record -g ./prog

# Specify frequency
perf record -F 999 -g ./prog

# Specific event
perf record -e cache-misses -g ./prog

# Attach to running process
perf record -F 999 -g -p 12345 sleep 30

# Off-CPU profiling (time spent waiting)
perf record -e sched:sched_switch -ag sleep 10

# DWARF call graphs (better for binaries without frame pointers)
perf record -F 999 --call-graph=dwarf ./prog

# Save to named file
perf record -o myapp.perf.data -g ./prog

4. perf report — interactive analysis

perf report                          # reads perf.data
perf report -i myapp.perf.data
perf report --no-children            # self time only (not cumulative)
perf report --sort comm,dso,sym      # sort by fields
perf report --stdio                  # non-interactive text output

Navigation in TUI:

  • Enter — expand a symbol
  • a — annotate (show assembly with hit counts)
  • s — show source (needs debug info)
  • d — filter by DSO (library)
  • t — filter by thread
  • ? — help

5. perf annotate — hot instructions

# Show assembly with hit percentages
perf annotate sym_name

# From report: press 'a' on a symbol
# Or directly:
perf annotate -i perf.data --symbol=hot_function --stdio

High hit count on a mov or vmovdqa suggests a cache miss at that load.

6. perf top — live profiling

# Live top, like 'top' but for functions
sudo perf top -g

# Filter by process
sudo perf top -p 12345

7. Feed into flamegraphs

# Generate perf script output
perf script > out.perf

# Use Brendan Gregg's FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Open flamegraph.svg in browser

See skills/profilers/flamegraphs for reading flamegraphs and interpreting results.

8. Common issues

Problem Cause Fix
Permission denied perf_event_paranoid too high Lower paranoid level or run with sudo
[unknown] frames Missing frame pointers or debug info Recompile with -fno-omit-frame-pointer or use --call-graph=dwarf
[kernel] everywhere Kernel symbols not visible Use sudo perf record; install linux-image-$(uname -r)-dbgsym
No kallsyms Kernel symbols unavailable `echo 0
Empty report for short program Program exits too fast Use -F 9999 or instrument longer workload
DWARF unwinding slow Large DWARF stack Limit with --call-graph dwarf,512

9. Useful events

# List all available events
perf list

# Common hardware events
cycles
instructions
cache-references
cache-misses
branch-instructions
branch-misses
stalled-cycles-frontend
stalled-cycles-backend

# Software events
context-switches
cpu-migrations
page-faults

# Tracepoints (requires root)
sched:sched_switch
syscalls:sys_enter_read

For a counter reference and interpretation guide, see references/events.md.

Related skills

  • Use skills/profilers/flamegraphs for SVG flamegraph generation and reading
  • Use skills/profilers/valgrind for cache simulation and memory profiling
  • Use skills/compilers/gcc or skills/compilers/clang for PGO from perf data (AutoFDO)
Weekly Installs
1
First Seen
Today
Installed on
mcpjam1
claude-code1
replit1
junie1
windsurf1
zencoder1