pgo
PGO (Profile-Guided Optimisation)
Purpose
Guide agents through the full PGO workflow: instrument build → representative workload → collect profile → optimised build, covering both GCC and Clang, plus BOLT for post-link optimisation.
Triggers
- "How do I use PGO to speed up my binary?"
- "What is profile-guided optimization and when should I use it?"
- "How do I use -fprofile-generate and -fprofile-use?"
- "My -O3 build isn't fast enough, what next?"
- "How does BOLT differ from PGO?"
- "How do I collect representative profile data?"
Workflow
1. When to use PGO
Is -O3 -march=native already applied?
  no  → apply standard optimisation first
  yes → is workload branch-heavy or has irregular call patterns?
    yes → PGO will likely help 5-30%
    no  → PGO may not help; profile first with linux-perf
PGO helps most with:
- Large binaries with many cold/hot code paths (compilers, databases, servers)
- Branch-heavy code where static prediction is wrong
- Function call-heavy code where inlining decisions improve with profile data
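One rough way to judge whether a workload is branch-heavy is the branch-miss rate reported by perf stat. The sketch below assumes perf stat's CSV mode (-x,), where the counter value is the first field and the event name is the third. branch_miss_pct is a hypothetical helper name, and event names can differ on some CPUs.

```shell
# Hypothetical helper: read `perf stat -x,` CSV on stdin and print the
# branch-miss rate as a percentage, or "n/a" if no branches were counted.
branch_miss_pct() {
  awk -F, '
    $3 ~ /branch-misses/ { misses = $1 }
    $3 ~ /branches/      { branches = $1 }
    END {
      if (branches + 0 == 0) { print "n/a"; exit }
      printf "%.2f\n", 100 * misses / branches
    }'
}

# Usage (perf stat writes its counters to stderr, so stderr goes to the pipe):
#   perf stat -x, -e branches,branch-misses ./prog < workload.input 2>&1 >/dev/null | branch_miss_pct
```

A high miss rate only suggests that profile data may help; it does not predict the size of the gain.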
2. GCC PGO workflow
# Step 1: Build with instrumentation
gcc -O2 -fprofile-generate -fprofile-dir=./pgo-data \
prog.c -o prog_instr
# Step 2: Run with representative workload(s)
./prog_instr < workload1.input
./prog_instr < workload2.input
# Generates .gcda files in ./pgo-data/
# Step 3: Build optimised binary using profile
gcc -O2 -fprofile-use -fprofile-dir=./pgo-data \
-fprofile-correction \
prog.c -o prog_pgo
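-fprofile-correction: profiles from multi-threaded programs can contain inconsistent counts because counter updates are not synchronised. With this flag GCC smooths such inconsistencies with heuristics instead of reporting an error, so include it whenever the workload is parallel or nondeterministic.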
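Step 3 is only meaningful if step 2 actually wrote counter files. A minimal guard, assuming the ./pgo-data directory from the commands above, could look like this (have_gcda is a hypothetical helper name):

```shell
# Hypothetical guard: succeed only if DIR contains at least one .gcda file.
have_gcda() {
  [ -n "$(find "$1" -type f -name '*.gcda' 2>/dev/null | head -n 1)" ]
}

# Usage between step 2 and step 3:
#   have_gcda ./pgo-data || { echo "no profile data in ./pgo-data" >&2; exit 1; }
```

The search is recursive because GCC may place .gcda files in subdirectories of the profile directory.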
3. Clang PGO workflow (instrumentation-based)
# Step 1: Instrument build
clang -O2 -fprofile-instr-generate prog.c -o prog_instr
# Step 2: Run workload (generates default.profraw)
./prog_instr < workload.input
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instr # per-PID files for parallel runs
# Step 3: Merge raw profiles
llvm-profdata merge -output=prog.profdata *.profraw
# Step 4: Optimised build
clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog_pgo
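The -fprofile-instr-generate flag used above selects Clang's front-end instrumentation. IR-level instrumentation uses -fprofile-generate and -fprofile-use=<file> instead, and is the variant the Clang documentation recommends when the goal is optimisation rather than coverage. Clang also supports SamplePGO (sampling-based, no instrumentation overhead).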
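When several workloads are run, it helps to send every raw profile to one directory so the merge step picks them all up. The sketch below assumes an instrumented binary that reads its workload on stdin, as in the steps above. run_workloads and the prog-%p.profraw name are illustrative; %p is the runtime's per-process placeholder in LLVM_PROFILE_FILE.

```shell
# Hypothetical helper: run BIN once per input file, writing one raw
# profile per process into OUTDIR.
run_workloads() {
  bin=$1
  outdir=$2
  shift 2
  mkdir -p "$outdir" || return 1
  for input in "$@"; do
    # %p is expanded to the process ID by the profile runtime, not the shell
    LLVM_PROFILE_FILE="$outdir/prog-%p.profraw" "$bin" < "$input" || return 1
  done
}

# Usage:
#   run_workloads ./prog_instr ./profiles workload1.input workload2.input
#   llvm-profdata merge -output=prog.profdata ./profiles/*.profraw
```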
4. Clang SamplePGO (sampling, no instrumentation)
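# Step 1: Build with line-table debug info (needed to map samples to source) and frame pointers
clang -O2 -gline-tables-only -fdebug-info-for-profiling -fno-omit-frame-pointer prog.c -o prog
# Step 2: Sample with perf (-b records branch stacks and needs LBR-capable hardware)
perf record -b -e cycles:u ./prog < workload.input
perf script -F ip,brstack > perf.script # optional: llvm-profgen can also read perf.data via --perfdata
# Step 3: Convert perf data
llvm-profgen --binary=./prog --perfscript=perf.script \
--output=prog.profdata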
# Step 4: Optimised build
clang -O2 -fprofile-sample-use=prog.profdata prog.c -o prog_spgo
SamplePGO is ideal for production profiling without instrumentation overhead.
5. CMake integration
option(PGO_INSTRUMENT "Build with PGO instrumentation" OFF)
option(PGO_USE "Build with PGO profile data" OFF)
if(PGO_INSTRUMENT)
add_compile_options(-fprofile-instr-generate)
add_link_options(-fprofile-instr-generate)
endif()
if(PGO_USE)
add_compile_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
add_link_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
endif()
Build script:
# Phase 1: instrument
cmake -S . -B build-pgo-instr -DPGO_INSTRUMENT=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo-instr -j$(nproc)
# Collect profile
./build-pgo-instr/prog < workload.input
llvm-profdata merge -output=prog.profdata *.profraw
# Phase 2: optimised
cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo -j$(nproc)
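Because the merge step globs *.profraw, raw profiles left over from an earlier instrumented build would be merged along with the fresh ones. A minimal sketch of a clean-up step, assuming the profiles sit directly in the given directory (clean_profraw is a hypothetical helper name, and -maxdepth assumes GNU or BSD find):

```shell
# Hypothetical helper: delete raw profiles left over from earlier runs in DIR
# (non-recursive), so the merge step only sees data from the current build.
clean_profraw() {
  find "$1" -maxdepth 1 -type f -name '*.profraw' -exec rm -f {} +
}

# Usage before the "Collect profile" step:
#   clean_profraw .
```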
6. BOLT (post-link binary optimisation)
BOLT reorders functions and basic blocks in the final binary based on profile data, improving instruction cache locality. Applied on top of a PGO build, it can yield a further 5-15%, depending on the workload.
# Step 1: Build with relocation support
clang -O2 -Wl,--emit-relocs prog.c -o prog
# Step 2: Collect profile with perf
perf record -e cycles:u -b ./prog < workload.input
perf2bolt prog -p perf.data -o prog.fdata
# Or use instrumented BOLT
llvm-bolt prog -instrument -instrumentation-file=prog.fdata -o prog.instr
./prog.instr < workload.input
# Writes prog.fdata (without -instrumentation-file the default is /tmp/prof.fdata)
# Step 3: Apply BOLT optimisation
llvm-bolt prog -data prog.fdata -o prog.bolt \
-reorder-blocks=ext-tsp \
-reorder-functions=hfsort \
-split-functions \
-split-all-cold \
-dyno-stats
7. Verifying PGO impact
# Compare the baseline build against the PGO build on the same workload
perf stat ./prog_baseline < workload.input
perf stat ./prog_pgo < workload.input
# Check which functions are hot in each
perf record ./prog_pgo < workload.input
perf report --stdio | head -30
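To turn the two perf stat runs into a single number, the elapsed times can be compared directly. The helper below is a sketch: time_reduction_pct is a hypothetical name, and it assumes you pass the two elapsed times in seconds, read off the perf stat output (perf stat -r N repeats a run and reports the mean, which reduces noise).

```shell
# Hypothetical helper: percentage reduction in elapsed time,
# baseline build versus PGO build (both in seconds).
time_reduction_pct() {
  awk -v base="$1" -v pgo="$2" 'BEGIN {
    if (base <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * (base - pgo) / base
  }'
}

# Usage:
#   time_reduction_pct 10 8    # prints 20.0
```

A negative result means the PGO build was slower, which usually points to an unrepresentative training workload.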
For full workflow details and Clang vs GCC profile format notes, see references/pgo-workflow.md.
Related skills
- Use skills/compilers/gcc for GCC flag context
- Use skills/compilers/clang for Clang PGO and SamplePGO setup
- Use skills/profilers/linux-perf for collecting SamplePGO perf data
- Use skills/profilers/flamegraphs to identify hot paths before applying PGO