Intel VTune & AMD uProf
Purpose
Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.
Triggers
- "How do I use Intel VTune to profile my code?"
- "What are pipeline stalls and how do I reduce them?"
- "How do I analyze memory bandwidth with VTune?"
- "What is the roofline model and how do I use it?"
- "How do I use AMD uProf as a free alternative to VTune?"
- "My code has good cache hit rates but is still slow"
Workflow
1. VTune setup (free Community Edition)
# Download Intel VTune Profiler (Community Edition — free)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
# After installing on Linux, load the VTune environment into the current shell
source /opt/intel/oneapi/vtune/latest/env/vars.sh
# CLI usage
vtune -collect hotspots ./prog
vtune -collect uarch-exploration ./prog
vtune -collect memory-access ./prog
# View results in GUI
vtune-gui &
# File → Open Result → select the .vtune file inside the result directory
# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/
2. Analysis types
| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step — find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspot — why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, memory, roofline | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
3. Hotspot analysis
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog
# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20
# CLI output example:
# Function CPU Time Module
# compute_fft 4.532s libfft.so
# matrix_mult 2.108s prog
# parse_input 0.234s prog
Build with debug info for meaningful symbols:
gcc -O2 -g ./prog.c -o prog # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog # better stacks
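Once the report is exported, ranking functions by their share of total CPU time makes the first optimization target obvious. The following is a minimal sketch that assumes the report has the three columns shown in the example above (Function, CPU Time, Module) and is tab-delimited; real VTune output may use different column names or a different delimiter, so both are parameters or assumptions here. The sample data is the example from this section, not a measured result.

```python
import csv
import io

# Sample text mirroring the example report in this section. Column names and
# the tab delimiter are assumptions; adjust them to match the real export.
SAMPLE_REPORT = (
    "Function\tCPU Time\tModule\n"
    "compute_fft\t4.532s\tlibfft.so\n"
    "matrix_mult\t2.108s\tprog\n"
    "parse_input\t0.234s\tprog\n"
)


def rank_hotspots(report_text, delimiter="\t"):
    """Return (function, seconds, share_of_total) tuples, hottest first."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    # Accept both "4.532" and "4.532s" in the CPU Time column.
    rows = [(r["Function"], float(r["CPU Time"].rstrip("s"))) for r in reader]
    total = sum(seconds for _, seconds in rows)
    if total == 0:
        return []
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(name, seconds, seconds / total) for name, seconds in ranked]


if __name__ == "__main__":
    for name, seconds, share in rank_hotspots(SAMPLE_REPORT):
        print(f"{name:<14}{seconds:>8.3f}s{share:>8.1%}")
```

With the sample data, compute_fft accounts for roughly two thirds of the total, so it is the function to examine with the microarchitecture analysis in the next step.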
4. Microarchitecture exploration — pipeline stalls
vtune -collect uarch-exploration -r micro_result ./prog
vtune -report summary -r micro_result
Key metrics to examine:
| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Branch mispredictions | < 5% |
| Front-End Bound | Instruction decode bottleneck | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |
Pipeline Analysis (Top-Down Methodology):
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions, machine clears)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch resteers)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → stalls even though data is in L1 (e.g., DTLB misses, store-forwarding issues)
    │   ├── L2 Bound → L1 misses served by L2
    │   ├── L3 Bound → L2 misses served by L3
    │   └── DRAM Bound → L3 misses served by main memory (latency or bandwidth limited)
    └── Core Bound → execution-unit (ALU/compute) pressure
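The top-level breakdown can be triaged mechanically before drilling into the tree. The sketch below is illustrative only: it takes the four level-1 fractions as they might be read off a summary report, compares them with the rule-of-thumb limits from the "Good value" column above, and lists the categories that exceed their limit, worst first. The function names and the sample numbers are hypothetical, and the thresholds are guidelines rather than hard rules.

```python
# Rule-of-thumb limits copied from the "Good value" column in this section.
LIMITS = {
    "Bad Speculation": 0.05,
    "Front-End Bound": 0.15,
    "Back-End Bound": 0.30,
}


def ipc(instructions_retired, cycles):
    """Instructions per clock; CPI is simply the reciprocal."""
    return instructions_retired / cycles


def triage(breakdown):
    """Return the categories exceeding their limit, largest excess first.

    `breakdown` maps the four level-1 categories (Retiring, Bad Speculation,
    Front-End Bound, Back-End Bound) to fractions that sum to about 1.0.
    """
    total = sum(breakdown.values())
    if abs(total - 1.0) > 0.02:
        raise ValueError(f"level-1 fractions should sum to ~1.0, got {total:.3f}")
    excess = [
        (name, breakdown.get(name, 0.0) - limit)
        for name, limit in LIMITS.items()
        if breakdown.get(name, 0.0) > limit
    ]
    return [name for name, _ in sorted(excess, key=lambda item: item[1], reverse=True)]


if __name__ == "__main__":
    # Hypothetical numbers for illustration, not a measured profile.
    sample = {
        "Retiring": 0.35,
        "Bad Speculation": 0.08,
        "Front-End Bound": 0.10,
        "Back-End Bound": 0.47,
    }
    print("IPC:", ipc(3.0e9, 2.0e9))
    print("Investigate first:", triage(sample))
```

For the hypothetical sample, Back-End Bound has the largest excess, so the next step would be the Memory Bound versus Core Bound split underneath it.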
5. Memory access analysis
# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog
# Key output sections:
# - Memory Bound: % time waiting for memory
# - LLC (Last Level Cache) Miss Rate
# - DRAM Bandwidth: GB/s achieved vs theoretical peak
# - NUMA: cross-socket accesses (for multi-socket systems)
Reading DRAM bandwidth:
DRAM Bandwidth: 18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization: 36% — likely not DRAM-bound
If DRAM-bound: optimize data layout (AoS → SoA), reduce working set, improve spatial locality.
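The utilization figure above is plain arithmetic, and the theoretical peak can be derived from the memory configuration. A small sketch, assuming a 64-bit (8-byte) data bus per channel, which is typical for DDR4 but should be checked against the actual platform; the function names are illustrative:

```python
def peak_dram_bandwidth_gbs(mega_transfers_per_sec, channels, bus_bytes=8):
    """Theoretical peak in GB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_sec * 1e6 * bus_bytes * channels / 1e9


def utilization(achieved_gbs, peak_gbs):
    """Fraction of the theoretical peak that the workload actually reached."""
    return achieved_gbs / peak_gbs


if __name__ == "__main__":
    # DDR4-3200, dual channel, as in the example above.
    peak = peak_dram_bandwidth_gbs(3200, channels=2)
    print(f"Peak: {peak:.1f} GB/s")
    print(f"Utilization: {utilization(18.4, peak):.0%}")
```

Sustained bandwidth on real hardware stays below the theoretical peak, so a measured peak from a streaming benchmark is a better denominator when one is available.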
6. AMD uProf — free alternative for AMD CPUs
# Download AMD uProf
# https://www.amd.com/en/developer/uprof.html
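# CLI profiling (-o sets the output directory; each run creates a session subdirectory inside it)
AMDuProfCLI collect --config tbp -o /tmp/uprof_result ./prog # time-based profiling
AMDuProfCLI collect --config assess -o /tmp/uprof_result ./prog # microarchitecture assessment
AMDuProfCLI collect --config memory -o /tmp/uprof_result ./prog # memory access
# Generate report (point -i at the session subdirectory that collect reported)
AMDuProfCLI report -i /tmp/uprof_result/<session-dir>/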
# Open GUI
AMDuProf &
AMD uProf metrics map to VTune equivalents:
- Retired Instructions → IPC analysis
- Branch Mispredictions → Bad Speculation
- L1/L2/L3 Cache Misses → Memory Bound levels
- Data Cache Accesses → Cache efficiency
7. Roofline model
The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:
Performance (GFLOP/s)
     |            _______________
Peak |           /
Perf |          /   compute bound
     |         /
     |        /
     |       /   memory bandwidth bound
     |      /
     +------------------------------→
          Arithmetic Intensity (FLOP/Byte)
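# VTune HPC Performance Characterization: FLOPS, vectorization, and memory bandwidth (the roofline inputs)
vtune -collect hpc-performance -r roofline_result ./prog
# Then: VTune GUI → Summary view for GFLOPS and memory-bound metrics
# For an automatically drawn roofline chart, Intel Advisor provides: advisor --collect=roofline --project-dir=./advisor_result -- ./prog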
# For manual calculation:
# Arithmetic Intensity = FLOPs executed / bytes moved to and from memory
# Peak FLOP/s = sockets × cores_per_socket × freq × FLOPs_per_cycle_per_core
# Peak BW = from hardware spec (e.g., 51.2 GB/s for DDR4-3200 dual channel: 3200 MT/s × 8 B × 2 channels)
# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog # memory bandwidth
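The manual calculation above reduces to one formula: attainable performance is the smaller of peak compute and arithmetic intensity times peak bandwidth. The sketch below works through it for a hypothetical machine (1 socket, 8 cores, 3.0 GHz, 16 double-precision FLOPs per cycle per core); those hardware numbers are assumptions chosen for illustration, and only the 51.2 GB/s bandwidth comes from the example in this section.

```python
def peak_gflops(sockets, cores_per_socket, freq_ghz, flops_per_cycle_per_core):
    """Peak compute in GFLOP/s from the hardware parameters."""
    return sockets * cores_per_socket * freq_ghz * flops_per_cycle_per_core


def attainable_gflops(arithmetic_intensity, peak_compute_gflops, peak_bw_gbs):
    """Roofline ceiling: min(compute roof, bandwidth slope x intensity)."""
    return min(peak_compute_gflops, arithmetic_intensity * peak_bw_gbs)


def ridge_point(peak_compute_gflops, peak_bw_gbs):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_compute_gflops / peak_bw_gbs


def classify(arithmetic_intensity, peak_compute_gflops, peak_bw_gbs):
    """Kernels left of the ridge point are limited by memory bandwidth."""
    if arithmetic_intensity < ridge_point(peak_compute_gflops, peak_bw_gbs):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    # Hypothetical machine; replace with values from the real CPU's spec.
    peak = peak_gflops(sockets=1, cores_per_socket=8, freq_ghz=3.0,
                       flops_per_cycle_per_core=16)
    bandwidth = 51.2  # GB/s, DDR4-3200 dual channel
    for ai in (0.25, 10.0):
        print(ai, classify(ai, peak, bandwidth),
              attainable_gflops(ai, peak, bandwidth))
```

Comparing the achieved GFLOP/s of a kernel (for example from likwid-perfctr) against `attainable_gflops` at its arithmetic intensity shows how much headroom is left under the relevant roof.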
Related skills
- Use skills/profilers/hardware-counters for raw PMU event collection with perf stat
- Use skills/profilers/linux-perf for perf-based profiling on Linux
- Use skills/low-level-programming/cpu-cache-opt for memory access pattern optimization
- Use skills/low-level-programming/simd-intrinsics for vectorization to increase FLOPS