intel-vtune-amd-uprof
Intel VTune & AMD uProf
Purpose
Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.
Triggers
- "How do I use Intel VTune to profile my code?"
- "What are pipeline stalls and how do I reduce them?"
- "How do I analyze memory bandwidth with VTune?"
- "What is the roofline model and how do I use it?"
- "How do I use AMD uProf as a free alternative to VTune?"
- "My code has good cache hit rates but is still slow"
Workflow
1. VTune setup (free Community Edition)
# Download Intel VTune Profiler (Community Edition — free)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
# After installing on Linux, load the VTune environment into the current shell
source /opt/intel/oneapi/vtune/latest/env/vars.sh
# CLI usage
vtune -collect hotspots ./prog
vtune -collect uarch-exploration ./prog
vtune -collect memory-access ./prog
# View results in GUI
vtune-gui &
# File → Open Result → select the .vtune file inside the result directory
# Or use amplxe-cl (the CLI name in older, pre-2020 releases)
amplxe-cl -collect hotspots -r result ./prog
amplxe-cl -report hotspots -r result
2. Analysis types
| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step — find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspot — why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, FLOPS, memory-bound metrics | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
3. Hotspot analysis
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog
# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20
# CLI output example:
# Function      CPU Time  Module
# compute_fft   4.532s    libfft.so
# matrix_mult   2.108s    prog
# parse_input   0.234s    prog
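To decide which function deserves attention first, it can help to turn absolute CPU times into shares of the total. The sketch below is a small helper, not part of VTune: `hotspot_share` is a hypothetical name, and the input format (one `function seconds` pair per line) is an assumption, so real report output would need to be trimmed to those two columns first.

```shell
# Print each function's share of total CPU time.
# Assumed input on stdin: one "function_name seconds" pair per line.
hotspot_share() {
  awk '{ name[NR] = $1; t[NR] = $2 + 0; total += $2 }
       END {
         if (total <= 0) exit 1          # nothing to report
         for (i = 1; i <= NR; i++)
           printf "%s %.1f%%\n", name[i], 100 * t[i] / total
       }'
}

# Figures copied from the CLI output example above.
printf 'compute_fft 4.532\nmatrix_mult 2.108\nparse_input 0.234\n' | hotspot_share
```

With the sample figures, `compute_fft` accounts for roughly two thirds of the measured time (65.9%), so it is the natural first target for microarchitecture exploration.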
Build with debug info for meaningful symbols:
gcc -O2 -g ./prog.c -o prog # symbols visible in VTune
gcc -O2 -g -fno-omit-frame-pointer ./prog.c -o prog # frame pointers give more reliable call stacks
4. Microarchitecture exploration — pipeline stalls
vtune -collect uarch-exploration -r micro_result ./prog
vtune -report summary -r micro_result
Key metrics to examine:
| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Branch mispredictions | < 5% |
| Front-End Bound | Instruction fetch/decode bottleneck | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |
Pipeline Analysis (Top-Down Methodology):
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch resteers)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → stalls although data is in L1 (e.g. DTLB misses, blocked store forwarding)
    │   ├── L2 Bound → loads that miss L1 and are served by L2
    │   ├── L3 Bound → loads that miss L2 and are served by L3
    │   └── DRAM Bound → loads served by main memory (latency or bandwidth limited)
    └── Core Bound → execution-unit (ALU/port) limited
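The "good value" thresholds from the metrics table can be applied mechanically once the summary percentages are known. The following is a minimal sketch under stated assumptions: `topdown_flags` is a hypothetical helper, the three arguments are percentages typed in by hand from the summary report, and the sample numbers are invented for illustration.

```shell
# List the top-down categories that exceed the thresholds in the table above.
# Arguments (percent of pipeline slots): bad_speculation front_end_bound back_end_bound
topdown_flags() {
  awk -v bs="$1" -v fe="$2" -v be="$3" 'BEGIN {
    if (bs + 0 > 5)  print "Bad Speculation"
    if (fe + 0 > 15) print "Front-End Bound"
    if (be + 0 > 30) print "Back-End Bound"
  }'
}

# Hypothetical summary values, not measured data.
topdown_flags 3.1 8.4 52.0
```

For these sample values only the back-end threshold is exceeded, which points toward the Memory Bound and Core Bound branches of the tree as the next place to look.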
5. Memory access analysis
# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog
# Key output sections:
# - Memory Bound: % time waiting for memory
# - LLC (Last Level Cache) Miss Rate
# - DRAM Bandwidth: GB/s achieved vs theoretical peak
# - NUMA: cross-socket accesses (for multi-socket systems)
Reading DRAM bandwidth:
DRAM Bandwidth: 18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization: 36% — likely not DRAM-bound
If DRAM-bound: optimize data layout (AoS → SoA), reduce working set, improve spatial locality.
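The utilization figure in the reading above is simply achieved bandwidth divided by theoretical peak. A short sketch of that arithmetic, where `bw_utilization` is a hypothetical helper name and both inputs are assumed to be in GB/s:

```shell
# Percentage of theoretical peak DRAM bandwidth actually achieved.
# Arguments: achieved_GBps peak_GBps
bw_utilization() {
  awk -v got="$1" -v peak="$2" 'BEGIN {
    if (peak + 0 <= 0) exit 1            # avoid dividing by zero
    printf "%.0f\n", 100 * got / peak
  }'
}

# Values from the example reading above: prints 36
bw_utilization 18.4 51.2
```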
6. AMD uProf — free alternative for AMD CPUs
# Download AMD uProf
# https://www.amd.com/en/developer/uprof.html
# CLI profiling
AMDuProfCLI collect --config tbp ./prog # time-based profiling
AMDuProfCLI collect --config assess ./prog # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog # memory access
# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html
# Open GUI
AMDuProf &
AMD uProf metrics map to VTune equivalents:
- Retired Instructions → IPC analysis
- Branch Mispredictions → Bad Speculation
- L1/L2/L3 Cache Misses → Memory Bound levels
- Data Cache Accesses → Cache efficiency
7. Roofline model
The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:
Performance (GFLOP/s)
  |
  |              ________________  peak compute
  |             /   compute-bound
  |            /
  |           /
  |          /  memory-bandwidth-bound
  |         /
  +---------------------------------→
     Arithmetic Intensity (FLOP/Byte)
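# VTune HPC characterization: FLOPS, vectorization, and memory-bound metrics
vtune -collect hpc-performance -r hpc_result ./prog
# Roofline charts are produced by Intel Advisor rather than VTune:
advisor --collect=roofline --project-dir=./advisor_result -- ./prog
# Then: advisor-gui → open the project → Roofline view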
# For manual calculation:
# Arithmetic Intensity = FLOPs executed / bytes moved to and from memory
# Peak FLOPS = sockets × cores_per_socket × freq × FLOPs_per_cycle_per_core
# Peak BW = from hardware spec (e.g., 51.2 GB/s for DDR4-3200 dual channel)
# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog # memory bandwidth
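The manual calculation can be sketched as a few small helpers. The function names are hypothetical, and the machine in the example (1 socket, 8 cores, 4.0 GHz, 16 FLOPs per cycle per core) is an invented configuration chosen only to make the arithmetic easy to follow; the 51.2 GB/s bandwidth is the DDR4-3200 dual-channel figure quoted above.

```shell
# Peak compute in GFLOP/s. Arguments: sockets cores_per_socket freq_GHz flops_per_cycle_per_core
peak_gflops() {
  awk -v s="$1" -v c="$2" -v f="$3" -v w="$4" 'BEGIN { printf "%.1f\n", s * c * f * w }'
}

# Ridge point in FLOP/Byte: the intensity at which the two roofs meet.
# Arguments: peak_GFLOPs peak_GBps
ridge_point() {
  awk -v peak="$1" -v bw="$2" 'BEGIN { printf "%.2f\n", peak / bw }'
}

# Attainable GFLOP/s = min(peak compute, arithmetic intensity × peak bandwidth).
# Arguments: peak_GFLOPs peak_GBps intensity_FLOP_per_Byte
roofline_attainable() {
  awk -v peak="$1" -v bw="$2" -v ai="$3" 'BEGIN {
    mem = ai * bw
    printf "%.1f\n", (mem < peak) ? mem : peak
  }'
}

peak_gflops 1 8 4.0 16               # hypothetical machine: 512.0
ridge_point 512 51.2                 # 10.00
roofline_attainable 512 51.2 0.25    # memory-bound kernel: 12.8
roofline_attainable 512 51.2 20      # compute-bound kernel: 512.0
```

A kernel whose arithmetic intensity lies below the ridge point sits under the sloped part of the roof, so improving data reuse pays off before vectorization does; above the ridge point the opposite holds.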
Related skills
- Use skills/profilers/hardware-counters for raw PMU event collection with perf stat
- Use skills/profilers/linux-perf for perf-based profiling on Linux
- Use skills/low-level-programming/cpu-cache-opt for memory access pattern optimization
- Use skills/low-level-programming/simd-intrinsics for vectorization to increase FLOPS