hardware-counters
Hardware Performance Counters
Purpose
Guide agents through hardware performance counter analysis: collecting PMU events with perf stat -e, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with perf annotate.
Triggers
- "How do I measure cache miss rate with perf?"
- "How do I count branch mispredictions?"
- "How do I compute IPC (instructions per clock) with perf?"
- "How do I use the PAPI library for hardware counters?"
- "How do I see which source lines cause the most cache misses?"
- "How do I measure memory bandwidth with performance counters?"
Workflow
1. perf stat — basic counter collection
# Basic hardware event summary
perf stat ./prog
# Output:
# Performance counter stats for './prog':
#
# 1,234,567,890 instructions
# 456,789,012 cycles
# 12,345,678 cache-misses # 1.23 % of all cache refs
# 23,456,789 branch-misses # 2.34 % of all branches
#
# 0.456789012 seconds time elapsed
# Derived metrics (computed from the output)
# IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
# CPI = cycles / instructions ≈ 0.37
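The derived metrics above can also be computed in a small script rather than by hand. This is a minimal sketch that assumes the raw counts have already been read off the `perf stat` output; the function names `ipc` and `cpi` are illustrative helpers, not part of perf.

```shell
#!/bin/sh
# ipc INSTRUCTIONS CYCLES -> instructions per cycle, two decimals
# (hypothetical helper; pass the counts without thousands separators)
ipc() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (cyc + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", ins / cyc
    }'
}

# cpi INSTRUCTIONS CYCLES -> cycles per instruction, two decimals
cpi() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (ins + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", cyc / ins
    }'
}

# Counts taken from the sample output above
ipc 1234567890 456789012
cpi 1234567890 456789012
```

Guarding against a zero denominator matters in practice, because a counter that was never scheduled can report 0.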
2. Specifying PMU events with -e
# Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog
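# L1 and last-level cache events (generic aliases)
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
./prog
# perf has no generic L2 alias; use the CPU-specific names reported by
# `perf list | grep -i l2` (for example l2_rqsts.* on many Intel cores)
# Memory bandwidth (Intel): uncore IMC events are counted system-wide, so add -a
# (usually needs root or a permissive perf_event_paranoid setting)
perf stat -a -e \
uncore_imc/cas_count_read/,\
uncore_imc/cas_count_write/ \
-- ./prog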
# TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog
# Branch misprediction rate
perf stat -e branches,branch-misses ./prog
# Rate = branch-misses / branches × 100%
# Available events (varies by CPU)
perf list hardware # generic hardware events
perf list cache # cache events
perf list pmu # raw PMU events for your CPU
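Ratios such as the branch misprediction rate can be derived from perf's machine-readable output. The sketch below assumes the field order that `perf stat -x,` uses when no interval or per-CPU options are given (counter value first, event name third); `rate_from_csv` and `sample_csv` are hypothetical helper names, and the sample counts are made up for illustration.

```shell
#!/bin/sh
# rate_from_csv NUMERATOR_EVENT DENOMINATOR_EVENT
# Reads `perf stat -x,` lines on stdin and prints numerator/denominator as a percentage.
rate_from_csv() {
    awk -F, -v num="$1" -v den="$2" '
        $3 == num { n = $1 }
        $3 == den { d = $1 }
        END {
            # d + 0 forces a numeric test, so "<not supported>" counts as 0
            if (d + 0 > 0) printf "%.2f\n", 100 * n / d
            else print "n/a"
        }'
}

# Illustrative input in the assumed CSV layout (values are invented)
sample_csv() {
    cat <<'EOF'
100000000,,branches,456789012,100.00,,
2340000,,branch-misses,456789012,100.00,2.34,of all branches
EOF
}

sample_csv | rate_from_csv branch-misses branches
```

With a real workload, perf writes its statistics to stderr, so a pipeline along the lines of `perf stat -x, -e branches,branch-misses ./prog 2>&1 >/dev/null | rate_from_csv branch-misses branches` feeds the helper. The same function works for `L1-dcache-load-misses` over `L1-dcache-loads`.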
3. Key metrics and thresholds
| Metric | Formula | Healthy | Concerning |
|---|---|---|---|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | — | L3 MPKI > 10 = memory bound |
# Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
# MPKI = LLC-load-misses / (instructions / 1000)
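The MPKI formula can be wrapped in a helper so it is applied the same way across runs. This is a sketch only: `mpki` is a hypothetical function name, and the counts in the example call are invented rather than measured.

```shell
#!/bin/sh
# mpki MISSES INSTRUCTIONS -> misses per 1000 instructions, two decimals
mpki() {
    awk -v miss="$1" -v ins="$2" 'BEGIN {
        if (ins + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", miss / (ins / 1000)
    }'
}

# Invented example: 25 million LLC load misses over 1 billion instructions
mpki 25000000 1000000000
```

A result of 25.00 would sit above the "L3 MPKI > 10" line in the table, which points toward a memory-bound workload.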
4. Raw PMU events (CPU-specific)
For events not in the generic aliases, use raw event codes:
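# Intel: look up codes with perf list or in the Intel SDM
# Format: rXXYY where XX=umask, YY=event code (both in hex)
perf stat -e r0124 ./prog # umask 0x01, event 0x24 (example only; the meaning is CPU-specific)
# ocperf.py (part of Andi Kleen's pmu-tools) wraps perf and accepts Intel's symbolic event names
git clone https://github.com/andikleen/pmu-tools
./pmu-tools/ocperf.py list | grep -i "mem_load"
# Recent perf versions also accept the symbolic names directly
perf list | grep -i "mem_load"
perf stat -e mem_load_retired.l3_miss ./prog # use the name exactly as perf list shows it on your CPU
# libpfm4 can also map event names to raw codes; see its example programs (showevtinfo, check_events)
# AMD: similar approach, with codes taken from the processor's documentation
perf stat -e r04041 ./prog # example AMD raw event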
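The rXXYY layout described above can be assembled mechanically from a umask and an event code. This is a minimal sketch that covers only that simple two-byte form (no cmask, edge, or invert bits); `raw_event` is a hypothetical helper name.

```shell
#!/bin/sh
# raw_event UMASK EVENT_CODE -> perf raw event string in rXXYY form
# Arguments may be written as hex (0x..) or decimal.
raw_event() {
    printf 'r%02x%02x\n' "$1" "$2"
}

# umask 0x01, event code 0x24, matching the r0124 example above
raw_event 0x01 0x24
```

perf also accepts a more explicit spelling of the same event, `cpu/event=0x24,umask=0x01/`, which is often easier to read than the packed form.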
5. Source-level annotation with perf record/annotate
# Record with hardware events
perf record -e LLC-load-misses -g ./prog
# Annotate: show source lines sorted by cache miss count
perf annotate --stdio
# Interactive (requires debug symbols)
perf report
# Press 'a' on a function to annotate it
# Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40
# Example annotate output (illustrative):
# Percent | Source code
# 3.12 | for (int i = 0; i < N; i++)
# 45.23 | sum += data[i]; ← cache miss here (strided access)
6. PAPI — Portable API for hardware counters
PAPI provides a portable C API across different CPU architectures:
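#include <papi.h>
#include <stdio.h>

/* Placeholder workload: replace with the code you want to measure */
static void do_work(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++) sum += i * 0.5;
}

int main(void) {
    int events[] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_TCM, PAPI_BR_MSP};
    long long values[4];
    int event_set = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    /* PAPI_start_counters/PAPI_stop_counters were removed in PAPI 6.0,
       so this uses the low-level event set API instead */
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_events(event_set, events, 4) != PAPI_OK ||
        PAPI_start(event_set) != PAPI_OK) {
        fprintf(stderr, "PAPI event set setup failed (check papi_avail)\n");
        return 1;
    }
    // --- Code to measure ---
    do_work();
    // -----------------------
    if (PAPI_stop(event_set, values) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop failed\n");
        return 1;
    }
    printf("Instructions: %lld\n", values[0]);
    printf("Cycles: %lld\n", values[1]);
    if (values[1] > 0)
        printf("IPC: %.2f\n", (double)values[0] / (double)values[1]);
    printf("L2 cache misses: %lld\n", values[2]);
    printf("Branch mispred: %lld\n", values[3]);
    return 0;
}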
# Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi
# Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3" # native events with "L3"
Common PAPI presets:
| Preset | Event |
|---|---|
| PAPI_TOT_INS | Total instructions |
| PAPI_TOT_CYC | Total cycles |
| PAPI_L1_DCM | L1 data cache misses |
| PAPI_L2_TCM | L2 total cache misses |
| PAPI_L3_TCM | L3 total cache misses |
| PAPI_BR_MSP | Branch mispredictions |
| PAPI_TLB_DM | Data TLB misses |
| PAPI_FP_INS | Floating point instructions |
| PAPI_VEC_INS | Vector/SIMD instructions |
7. Intel PCM (Performance Counter Monitor)
# Intel PCM: system-wide counters; usually needs root or equivalent access to MSRs / perf events
git clone https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build
# Measure memory bandwidth
./build/bin/pcm-memory 1 # sample every 1 second
# Core utilization + IPC
./build/bin/pcm 1
# Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
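pcm-memory prints bandwidth directly. When only CAS counts are available, for example from the `uncore_imc` events in step 2, bandwidth can be estimated by hand. The sketch below assumes each CAS command moves one 64-byte cache line and that the count is unscaled; some perf versions already scale these events to MiB, in which case the multiplication by 64 must be dropped. `cas_bandwidth_gbs` is a hypothetical helper name and the example numbers are invented.

```shell
#!/bin/sh
# cas_bandwidth_gbs CAS_COUNT SECONDS -> estimated bandwidth in GB/s (10^9 bytes)
# Assumption: one CAS command transfers 64 bytes.
cas_bandwidth_gbs() {
    awk -v cas="$1" -v secs="$2" 'BEGIN {
        if (secs + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", cas * 64 / secs / 1e9
    }'
}

# Invented example: 1 billion read CAS commands over 2 seconds
cas_bandwidth_gbs 1000000000 2.0
```

Summing the read and write results gives total DRAM traffic, which can then be compared against the platform's theoretical peak.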
Related skills
- Use skills/profilers/intel-vtune-amd-uprof for guided microarchitecture analysis
- Use skills/profilers/linux-perf for perf record/report and flamegraph generation
- Use skills/low-level-programming/cpu-cache-opt for applying cache optimization patterns
- Use skills/low-level-programming/simd-intrinsics for improving FLOPS/cycle metrics