Rust Profiling
Purpose
Guide agents through Rust performance profiling: flamegraphs via cargo-flamegraph, binary size analysis, monomorphization bloat measurement, Criterion microbenchmarks, and interpreting profiling results with inlined Rust frames.
Triggers
- "How do I generate a flamegraph for a Rust program?"
- "My Rust binary is huge — how do I find what's causing it?"
- "How do I write Criterion benchmarks?"
- "How do I measure monomorphization bloat?"
- "Rust performance is worse than expected — how do I profile it?"
- "How do I use perf with Rust?"
Workflow
1. Build for profiling
# Release with debug symbols (needed for readable profiles)
# Cargo.toml:
[profile.release-with-debug]
inherits = "release"
debug = true
cargo build --profile release-with-debug
# Or, as a one-off: enable debug info for the release profile via an environment variable
CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release
2. Flamegraphs with cargo-flamegraph
# Install
cargo install flamegraph
# Linux: uses perf (requires perf_event_paranoid ≤ 1)
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
cargo flamegraph --bin myapp -- arg1 arg2
# macOS: uses DTrace (requires sudo)
sudo cargo flamegraph --bin myapp -- arg1 arg2
# Profile tests
cargo flamegraph --test mytest -- test_filter
# Profile benchmarks
cargo flamegraph --bench mybench -- --bench
# Output
# Generates flamegraph.svg in current directory
# Open in browser: firefox flamegraph.svg
Custom flamegraph options:
# More samples
cargo flamegraph --freq 1000 --bin myapp
# Discard stderr output (this does not filter threads; it only silences diagnostics)
cargo flamegraph --bin myapp -- args 2>/dev/null
# Using perf directly for more control
perf record -g -F 999 ./target/release-with-debug/myapp args
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg
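To try the commands above on something small, a toy workload with an obvious hotspot can help. The following is a minimal sketch, not part of the original skill: the names `checksum`, `build_input`, and `ROUNDS` are hypothetical. `#[inline(never)]` is used so that each function is likely to show up as its own frame in the flamegraph rather than being folded into `main` by inlining.

```rust
use std::hint::black_box;

// Hypothetical knob: raise this if the run is too short to collect enough samples.
const ROUNDS: usize = 50;

// Hypothetical hot function. #[inline(never)] asks the compiler to keep it as a
// separate function so it appears as a distinct frame in the profile.
#[inline(never)]
fn checksum(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

// Hypothetical setup function that builds a deterministic input buffer.
#[inline(never)]
fn build_input(len: usize) -> Vec<u8> {
    (0..len).map(|i| (i % 251) as u8).collect()
}

fn main() {
    let data = build_input(1 << 20);
    let mut total = 0u64;
    for _ in 0..ROUNDS {
        // black_box keeps the optimizer from hoisting or removing the repeated call.
        total = total.wrapping_add(checksum(black_box(&data)));
    }
    println!("{total}");
}
```

In the resulting flamegraph, most of the width under `main` would be expected to sit in `checksum`. Removing the `#[inline(never)]` attributes is a quick way to see how inlining changes the picture.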
3. Binary size analysis with cargo-bloat
# Install
cargo install cargo-bloat
# Show top functions by size
cargo bloat --release -n 20
# Show per-crate size breakdown
cargo bloat --release --crates
# Include only specific crate
cargo bloat --release --filter myapp
# Compare before/after a change
cargo bloat --release --crates > before.txt
# make changes
cargo bloat --release --crates > after.txt
diff before.txt after.txt
Typical output:
File .text Size Crate Name
2.4% 3.0% 47.0KiB std <std macros>
1.8% 2.3% 35.5KiB myapp myapp::heavy_module::process
1.2% 1.5% 23.1KiB serde serde::de::...
4. Monomorphization bloat with cargo-llvm-lines
# Install
cargo install cargo-llvm-lines
# Show LLVM IR line counts (proxy for monomorphization)
cargo llvm-lines --release | head -40
# Filter to your crate only
cargo llvm-lines --release | grep '^myapp'
Typical output:
Lines Copies Function name
85330 1 [LLVM passes]
7761 92 core::fmt::write
4672 11 myapp::process::<impl MyTrait for T>
3201 47 <alloc::vec::Vec<T> as core::ops::Drop>::drop
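A high Copies count means the function was instantiated once per concrete type it is used with (monomorphization). To reduce the duplicated code, keep the generic part thin and move the body into a non-generic inner function: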
// Before: generic, gets monomorphized for every T
fn process<T: AsRef<[u8]>>(data: T) -> usize {
do_work(data.as_ref())
}
// After: thin generic wrapper + concrete inner
fn process<T: AsRef<[u8]>>(data: T) -> usize {
fn inner(data: &[u8]) -> usize { do_work(data) }
inner(data.as_ref())
}
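The wrapper pattern above can be exercised end to end as follows. This is a sketch under assumptions: `do_work` is not defined in the original, so a hypothetical stand-in that counts non-zero bytes is used here. Each distinct `T` still produces its own copy of `process`, but that copy contains only the `as_ref()` call and a jump to the single `inner`.

```rust
// Hypothetical stand-in for the real work; the original does not define do_work.
fn do_work(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b != 0).count()
}

// Thin generic wrapper: instantiated once per T, but its body is tiny.
pub fn process<T: AsRef<[u8]>>(data: T) -> usize {
    // Non-generic inner function: compiled once, shared by every instantiation.
    fn inner(data: &[u8]) -> usize {
        do_work(data)
    }
    inner(data.as_ref())
}

fn main() {
    let from_str = process("abc"); // T = &str
    let from_string = process(String::from("abc")); // T = String
    let from_vec = process(vec![1u8, 0, 2]); // T = Vec<u8>
    let from_array = process([0u8; 4]); // T = [u8; 4]
    println!("{from_str} {from_string} {from_vec} {from_array}");
}
```

Re-running `cargo llvm-lines` before and after such a change is the way to confirm whether the Lines total for the function actually dropped.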
5. Criterion microbenchmarks
# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
[[bench]]
name = "my_bench"
harness = false
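// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
// Assumption: `process` is the function under test, exported by your library crate (called `myapp` here).
use myapp::process;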
fn bench_process(c: &mut Criterion) {
// Simple benchmark
c.bench_function("process 1000 items", |b| {
let data: Vec<i32> = (0..1000).collect();
b.iter(|| process(black_box(&data))) // black_box prevents optimization
});
}
fn bench_sizes(c: &mut Criterion) {
let mut group = c.benchmark_group("process_sizes");
for size in [100, 1000, 10000].iter() {
let data: Vec<i32> = (0..*size).collect();
group.bench_with_input(
BenchmarkId::from_parameter(size),
&data,
|b, data| b.iter(|| process(black_box(data))),
);
}
group.finish();
}
criterion_group!(benches, bench_process, bench_sizes);
criterion_main!(benches);
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench my_bench
# Run with filter
cargo bench -- process_sizes
# Compare with baseline (save/load)
cargo bench -- --save-baseline before
# make changes
cargo bench -- --baseline before
# View HTML report (open is macOS; use xdg-open on Linux)
open target/criterion/report/index.html
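To see what `black_box` contributes in the benchmarks above, here is a std-only sketch of a hand-rolled timing loop. It is not Criterion and not a replacement for it (no warm-up, no outlier handling, no statistics). `process` and `time_it` are hypothetical names introduced for illustration, and `std::hint::black_box` is used in place of Criterion's re-export.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical function under test; the original leaves `process` undefined.
fn process(data: &[i32]) -> i64 {
    data.iter().map(|&x| x as i64).sum()
}

// Hypothetical helper: runs `f` `iters` times and returns the mean time per call.
// `iters` must be non-zero, because the total is divided by it.
fn time_it<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() {
    let data: Vec<i32> = (0..1000).collect();
    let per_iter = time_it(10_000, || {
        // black_box on the input hides the data from constant folding;
        // black_box on the result keeps the call from being optimized away.
        black_box(process(black_box(&data)));
    });
    println!("process 1000 items: {per_iter:?} per iteration");
}
```

Without the two `black_box` calls, an optimizing build may compute the sum once or drop the call entirely, which is the failure mode the Criterion example guards against.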
6. perf with Rust (Linux)
# Record
perf record -g ./target/release-with-debug/myapp args
perf record -g -F 999 ./target/release-with-debug/myapp args # higher freq
# Report
perf report # interactive TUI
perf report --stdio --no-call-graph | head -40 # text
# Annotate specific function
perf annotate myapp::hot_function
# stat (quick counters)
perf stat ./target/release/myapp args
Rust-specific perf tips:
- Build with debug = "line-tables-only" for faster builds with line-level attribution (debug = 1 is the slightly heavier "limited" level)
- Use RUSTFLAGS="-C force-frame-pointers=yes" for better call graphs without DWARF unwinding
- Disable ASLR for reproducible addresses: setarch $(uname -m) -R ./myapp
7. heaptrack / DHAT for allocations
# heaptrack (Linux)
heaptrack ./target/release/myapp args
heaptrack_print heaptrack.myapp.*.zst | head -50
# DHAT via Valgrind
valgrind --tool=dhat ./target/debug/myapp args
# Open dhat-out.* with dh_view.html
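When an external tool is not available, a rough in-process allocation count can serve as a sanity check. The sketch below wraps the system allocator using the standard `GlobalAlloc` trait. It is far less informative than heaptrack or DHAT (no call stacks, no lifetimes), and `CountingAlloc` and `allocations_during` are hypothetical names introduced for illustration.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counters, updated on every allocation.
static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

// Hypothetical wrapper that counts allocations and forwards to the system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Hypothetical helper: returns (allocation count, bytes requested) while `f` runs.
// Counts are process-wide, so allocations from other threads are included too.
fn allocations_during<F: FnOnce()>(f: F) -> (usize, usize) {
    let (a0, b0) = (ALLOCS.load(Ordering::Relaxed), BYTES.load(Ordering::Relaxed));
    f();
    (
        ALLOCS.load(Ordering::Relaxed) - a0,
        BYTES.load(Ordering::Relaxed) - b0,
    )
}

fn main() {
    let (count, bytes) = allocations_during(|| {
        let v: Vec<u64> = Vec::with_capacity(16);
        // black_box keeps the optimizer from eliding the unused allocation.
        std::hint::black_box(&v);
    });
    println!("{count} allocations, {bytes} bytes requested");
}
```

If the counts look surprising, that is the point at which the heaptrack or DHAT commands above become worth running, since they attribute allocations to call sites.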
For flamegraph setup and Criterion configuration, see references/cargo-flamegraph-setup.md.
Related skills
- Use skills/rust/rustc-basics for build configuration (debug symbols, profiles)
- Use skills/profilers/linux-perf for perf fundamentals
- Use skills/profilers/flamegraphs for reading and interpreting flamegraph SVGs
- Use skills/profilers/valgrind for allocation profiling with massif/DHAT