hardware-counters
SKILL.md
Hardware Performance Counters
Purpose
Guide agents through hardware performance counter analysis: collecting PMU events with perf stat -e, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with perf annotate.
Triggers
- "How do I measure cache miss rate with perf?"
- "How do I count branch mispredictions?"
- "How do I compute IPC (instructions per clock) with perf?"
- "How do I use the PAPI library for hardware counters?"
- "How do I see which source lines cause the most cache misses?"
- "How do I measure memory bandwidth with performance counters?"
Workflow
1. perf stat — basic counter collection
# Basic hardware event summary
perf stat ./prog
# Output:
# Performance counter stats for './prog':
#
# 1,234,567,890 instructions
# 456,789,012 cycles
# 12,345,678 cache-misses # 1.23 % of all cache refs
# 23,456,789 branch-misses # 2.34 % of all branches
#
# 0.456789012 seconds time elapsed
# Derived metrics (computed from the output)
# IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
# CPI = cycles / instructions ≈ 0.37
2. Specifying PMU events with -e
# Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog
# L1/L2/L3 cache events
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
L2-loads,L2-load-misses,\
LLC-loads,LLC-load-misses \
./prog
# Memory bandwidth (Intel)
perf stat -e \
uncore_imc/cas_count_read/,\
uncore_imc/cas_count_write/ \
./prog
# TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog
# Branch misprediction rate
perf stat -e branches,branch-misses ./prog
# Rate = branch-misses / branches × 100%
# Available events (varies by CPU)
perf list hardware # generic hardware events
perf list cache # cache events
perf list pmu # raw PMU events for your CPU
3. Key metrics and thresholds
| Metric | Formula | Healthy | Concerning |
|---|---|---|---|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | — | L3 MPKI > 10 = memory bound |
# Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
# MPKI = LLC-load-misses / (instructions / 1000)
4. Raw PMU events (CPU-specific)
For events not in the generic aliases, use raw event codes:
# Intel: use perf list or look up in Intel SDM
# Format: rXXYY where XX=umask, YY=event code
perf stat -e r0124 ./prog # example Intel raw event
# List Intel events with ocperf (OpenCL Perf Events)
pip install ocperf
ocperf.py list | grep "mem_load"
# Use libpfm4 for event names
pfm_ls | grep "MEM_LOAD"
perf stat -e $(pfm_ls | grep "MEM_LOAD_RETIRED.L3_MISS") ./prog
# AMD: similar approach
perf stat -e r04041 ./prog # AMD raw event
5. Source-level annotation with perf record/annotate
# Record with hardware events
perf record -e LLC-load-misses -g ./prog
# Annotate: show source lines sorted by cache miss count
perf annotate --stdio
# Interactive (requires debug symbols)
perf report
# Press 'a' on a function to annotate it
# Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40
# Example annotate output:
# Percent | Source code
# 45.23 | for (int i = 0; i < N; i++)
# 3.12 | sum += data[i]; ← cache miss here (strided access)
6. PAPI — Portable API for hardware counters
PAPI provides a portable C API across different CPU architectures:
#include <papi.h>
#include <stdio.h>
int main(void) {
int Events[] = {PAPI_TOT_INS, PAPI_TOT_CYC,
PAPI_L2_TCM, PAPI_BR_MSP};
long long values[4];
if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
fprintf(stderr, "PAPI init failed\n");
return 1;
}
PAPI_start_counters(Events, 4);
// --- Code to measure ---
do_work();
// -----------------------
PAPI_stop_counters(values, 4);
printf("Instructions: %lld\n", values[0]);
printf("Cycles: %lld\n", values[1]);
printf("IPC: %.2f\n", (double)values[0]/values[1]);
printf("L2 cache misses: %lld\n", values[2]);
printf("Branch mispred: %lld\n", values[3]);
return 0;
}
# Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi
# Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3" # native events with "L3"
Common PAPI presets:
| Preset | Event |
|---|---|
PAPI_TOT_INS |
Total instructions |
PAPI_TOT_CYC |
Total cycles |
PAPI_L1_DCM |
L1 data cache misses |
PAPI_L2_TCM |
L2 total cache misses |
PAPI_L3_TCM |
L3 total cache misses |
PAPI_BR_MSP |
Branch mispredictions |
PAPI_TLB_DM |
Data TLB misses |
PAPI_FP_INS |
Floating point instructions |
PAPI_VEC_INS |
Vector/SIMD instructions |
7. Intel PCM (Performance Counter Monitor)
# Intel PCM — system-wide counters, no root required on modern kernels
git clone https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build
# Measure memory bandwidth
./build/bin/pcm-memory 1 # sample every 1 second
# Core utilization + IPC
./build/bin/pcm 1
# Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
Related skills
- Use
skills/profilers/intel-vtune-amd-uproffor guided microarchitecture analysis - Use
skills/profilers/linux-perffor perf record/report and flamegraph generation - Use
skills/low-level-programming/cpu-cache-optfor applying cache optimization patterns - Use
skills/low-level-programming/simd-intrinsicsfor improving FLOPS/cycle metrics
Weekly Installs
12
Repository
mohitmishra786/…v-skillsGitHub Stars
26
First Seen
10 days ago
Security Audits
Installed on
opencode12
gemini-cli12
github-copilot12
codex12
kimi-cli12
cursor12