pgo
PGO (Profile-Guided Optimisation)
Purpose
Guide agents through the full PGO workflow: instrument build → representative workload → collect profile → optimised build, covering both GCC and Clang, plus BOLT for post-link optimisation.
Triggers
- "How do I use PGO to speed up my binary?"
- "What is profile-guided optimization and when should I use it?"
- "How do I use
-fprofile-generateand-fprofile-use?" - "My
-O3build isn't fast enough — what next?" - "How does BOLT differ from PGO?"
- "How do I collect representative profile data?"
Workflow
1. When to use PGO
Is -O3 -march=native already applied?
no → apply standard optimisation first
yes → is workload branch-heavy or has irregular call patterns?
yes → PGO will likely help 5-30%
no → PGO may not help; profile first with linux-perf
PGO helps most with:
- Large binaries with many cold/hot code paths (compilers, databases, servers)
- Branch-heavy code where static prediction is wrong
- Function call-heavy code where inlining decisions improve with profile data
2. GCC PGO workflow
# Step 1: Build with instrumentation
gcc -O2 -fprofile-generate -fprofile-dir=./pgo-data \
prog.c -o prog_instr
# Step 2: Run with representative workload(s)
./prog_instr < workload1.input
./prog_instr < workload2.input
# Generates .gcda files in ./pgo-data/
# Step 3: Build optimised binary using profile
gcc -O2 -fprofile-use -fprofile-dir=./pgo-data \
-fprofile-correction \
prog.c -o prog_pgo
-fprofile-correction: handles profile count inconsistencies from parallel or nondeterministic runs. Always include it.
3. Clang PGO workflow (IR-based, preferred)
# Step 1: Instrument build
clang -O2 -fprofile-instr-generate prog.c -o prog_instr
# Step 2: Run workload (generates default.profraw)
./prog_instr < workload.input
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instr # per-PID files for parallel runs
# Step 3: Merge raw profiles
llvm-profdata merge -output=prog.profdata *.profraw
# Step 4: Optimised build
clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog_pgo
Clang's IR PGO is more accurate than GCC's and supports SamplePGO (sampling-based, no instrumentation overhead).
4. Clang SamplePGO (sampling, no instrumentation)
# Step 1: Build with frame pointers for accurate stacks
clang -O2 -fno-omit-frame-pointer prog.c -o prog
# Step 2: Sample with perf
perf record -b -e cycles:u ./prog < workload.input
perf script -F ip,brstack > perf.script # or use perf2bolt
# Step 3: Convert perf data
llvm-profgen --binary=./prog --perf-script=perf.script \
--output=prog.profdata
# Step 4: Optimised build
clang -O2 -fprofile-sample-use=prog.profdata prog.c -o prog_spgo
SamplePGO is ideal for production profiling without instrumentation overhead.
5. CMake integration
option(PGO_INSTRUMENT "Build with PGO instrumentation" OFF)
option(PGO_USE "Build with PGO profile data" OFF)
if(PGO_INSTRUMENT)
add_compile_options(-fprofile-instr-generate)
add_link_options(-fprofile-instr-generate)
endif()
if(PGO_USE)
add_compile_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
add_link_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
endif()
Build script:
# Phase 1: instrument
cmake -S . -B build-pgo-instr -DPGO_INSTRUMENT=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo-instr -j$(nproc)
# Collect profile
./build-pgo-instr/prog < workload.input
llvm-profdata merge -output=prog.profdata *.profraw
# Phase 2: optimised
cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo -j$(nproc)
6. BOLT (post-link binary optimisation)
BOLT reorders functions and basic blocks in the final binary based on profile data, improving instruction cache locality. Works after PGO for additional 5-15%.
# Step 1: Build with relocation support
clang -O2 -Wl,--emit-relocs prog.c -o prog
# Step 2: Collect profile with perf
perf record -e cycles:u -b ./prog < workload.input
perf2bolt prog -p perf.data -o prog.fdata
# Or use instrumented BOLT
llvm-bolt prog -instrument -o prog.instr
./prog.instr < workload.input
# Generates /tmp/prof.fdata
# Step 3: Apply BOLT optimisation
llvm-bolt prog -data prog.fdata -o prog.bolt \
-reorder-blocks=ext-tsp \
-reorder-functions=hfsort \
-split-functions \
-split-all-cold \
-dyno-stats
7. Verifying PGO impact
# Compare perf of instrumented vs PGO build
perf stat ./prog_baseline < workload.input
perf stat ./prog_pgo < workload.input
# Check which functions are hot in each
perf record ./prog_pgo < workload.input
perf report --stdio | head -30
For full workflow details and Clang vs GCC profile format notes, see references/pgo-workflow.md.
Related skills
- Use
skills/compilers/gccfor GCC flag context - Use
skills/compilers/clangfor Clang PGO and SamplePGO setup - Use
skills/profilers/linux-perffor collecting SamplePGO perf data - Use
skills/profilers/flamegraphsto identify hot paths before applying PGO