# Go Performance Best Practices

Comprehensive performance optimization guide for Go codebases. Contains 41 rules across 8 categories, with real-world benchmarks, BOMvault-specific examples, and proven optimization patterns from 10+ years of production experience.
## When to Apply

Reference these guidelines when:

- Writing or refactoring Go code
- Tuning latency, throughput, allocation rate, or GC behavior
- Investigating performance regressions
- Reviewing code for performance issues
- Debugging memory leaks or goroutine leaks
- Optimizing containerized services (ECS, Kubernetes)
## The Performance Optimization Workflow

### Phase 1: Measure First (Don't Guess)

Never optimize without data. The #1 mistake is optimizing based on intuition.

```bash
# Step 1: Establish a baseline with benchmarks
go test -bench=. -benchmem -count=5 ./... | tee baseline.txt

# Step 2: Generate a CPU profile for hot paths
go test -bench=BenchmarkCriticalPath -cpuprofile=cpu.prof
go tool pprof -http=:8080 cpu.prof

# Step 3: Generate a heap profile for allocations
go test -bench=BenchmarkCriticalPath -memprofile=heap.prof
go tool pprof -http=:8080 heap.prof

# Step 4: Check allocation counts (they correlate with latency)
go tool pprof -alloc_objects heap.prof
```
Key pprof views:

| View | Use For |
|---|---|
| `top` | Quick ranking of hot functions |
| `list funcname` | Line-by-line attribution |
| `web` | Visual call graph |
| `flame` | Flame graph for deep call stacks |
| `peek funcname` | Callers and callees |
### Phase 2: Identify the Bottleneck

Use the right profile for the right problem:
| Symptom | Profile Type | pprof Flag |
|---|---|---|
| High CPU usage | CPU | -cpuprofile |
| High memory usage | Heap (inuse) | -memprofile + -inuse_space |
| High allocation rate / GC pressure | Heap (alloc) | -memprofile + -alloc_objects |
| Goroutine leaks | Goroutine | runtime/pprof.Lookup("goroutine") |
| Lock contention | Mutex | -mutexprofile |
| Blocking operations | Block | -blockprofile |
Quick diagnosis commands:

```bash
# CPU: what's using the most cycles?
go tool pprof -top cpu.prof

# Memory: what's consuming the most heap?
go tool pprof -top -inuse_space heap.prof

# Allocations: what's creating the most objects?
go tool pprof -top -alloc_objects heap.prof

# Compare before/after
go tool pprof -base baseline.prof optimized.prof
```
### Phase 3: Apply Targeted Optimization

Match the symptom to the optimization category:
| Symptom | Category | Key Rules |
|---|---|---|
| CPU-bound | Work Avoidance | work-cache-*, work-short-circuit-* |
| Memory-bound | Allocation | alloc-preallocate-*, alloc-copy-to-avoid-retention |
| GC pauses | GC Tuning | gc-set-gomemlimit, gc-use-sync-pool |
| I/O latency | I/O | io-buffered-io, io-reuse-http-client |
| Lock contention | Concurrency | conc-reduce-lock-contention, conc-use-atomics |
| Goroutine explosion | Concurrency | conc-limit-goroutines, conc-bounded-channels |
### Phase 4: Verify Improvement

```bash
# Run the benchmarks again
go test -bench=. -benchmem -count=5 ./... | tee optimized.txt

# Compare results
benchstat baseline.txt optimized.txt

# Verify no regressions in other benchmarks
```

Success criteria:

- Measurable improvement (not just "feels faster")
- No regressions in other areas
- Code remains readable and maintainable
- Changes are justified by data
## Common Optimization Scenarios

### Scenario 1: High Latency / Slow Response Times

Symptoms: P99 latency spikes, slow API responses, timeouts

Diagnosis:

```bash
# CPU profile during slow requests
curl 'http://localhost:8080/debug/pprof/profile?seconds=30' > cpu.prof
go tool pprof -http=:8080 cpu.prof
```
Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| JSON encoding | `encoding/json` in top | Use `json.NewEncoder` streaming; consider jsoniter |
| Regex compilation | `regexp.Compile` in hot path | Cache compiled regex at init |
| Slice/map scanning | Loops in profile | Convert to map lookup |
| String concatenation | `+` operator in loops | Use `strings.Builder` |
| Excessive logging | Logger in top | Reduce log level in hot path |
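The `strings.Builder` fix from the table, as a minimal sketch (the `joinIDs` helper is illustrative, not from the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// joinIDs concatenates with strings.Builder: one growable buffer
// instead of a new string allocation on every += iteration.
func joinIDs(ids []string) string {
	var b strings.Builder
	b.Grow(len(ids) * 8) // optional: preallocate a rough estimate
	for i, id := range ids {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteString(id)
	}
	return b.String()
}

func main() {
	fmt.Println(joinIDs([]string{"CVE-1", "CVE-2", "CVE-3"})) // → CVE-1, CVE-2, CVE-3
}
```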
### Scenario 2: High Memory Usage / OOM Kills

Symptoms: Container OOM-killed, memory growing over time, swap thrashing

Diagnosis:

```bash
# Heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -inuse_space -top heap.prof

# Check for memory leaks (growing allocations)
go tool pprof -alloc_space -top heap.prof
```
Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Large slice retention | `append` with small subslices | `copy()` to a new slice |
| Unbounded caches | Map growing without eviction | Add LRU/TTL eviction |
| `io.ReadAll` on large files | Large `[]byte` allocations | Stream with `io.Copy` |
| String/`[]byte` conversions | `runtime.stringtoslicebyte` | Stay in one domain |
| Goroutine leaks | Goroutine count growing | Check context cancellation |
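The slice-retention fix in miniature: a subslice of a large buffer keeps the entire backing array alive, so copy out the bytes you need. A sketch with a hypothetical `header` helper:

```go
package main

import "fmt"

// header returns the first n bytes without pinning the full buffer:
// a plain subslice (payload[:n]) would keep the whole backing array
// reachable, so we copy into a right-sized slice and let the original
// be garbage-collected.
func header(payload []byte, n int) []byte {
	out := make([]byte, n)
	copy(out, payload[:n])
	return out
}

func main() {
	big := make([]byte, 10<<20) // pretend this is a 10 MB SBOM
	h := header(big, 16)
	fmt.Println(len(h), cap(h)) // → 16 16 (small cap: nothing retains big)
}
```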
### Scenario 3: High GC Pressure / CPU Spent in GC

Symptoms: `gc_pause_seconds` high, `runtime.mallocgc` in CPU profile

Diagnosis:

```bash
# Check GC stats
GODEBUG=gctrace=1 ./myservice 2>&1 | head -20

# Allocation profile
go tool pprof -alloc_objects -top heap.prof
```
Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Many small allocations | High `alloc_objects` | Use `sync.Pool` |
| Creating slices in loops | `make([]T, ...)` in hot path | Preallocate or pool |
| `fmt.Sprintf` in hot path | `fmt.*` allocations | Use `strconv` |
| Interface boxing | `interface{}` conversions | Use generics or concrete types |
| Not setting `GOMEMLIMIT` | Frequent GC cycles | Set `GOMEMLIMIT` to 80-90% of container memory |
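The `sync.Pool` fix combined with the reset-before-reuse discipline, as a minimal sketch (the `render` function is illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool amortizes buffer allocations across calls.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render formats a line using a pooled buffer. Reset before use so
// stale contents from a previous caller never leak into this result.
func render(name string, count int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	fmt.Fprintf(buf, "%s: %d findings", name, count)
	return buf.String() // String copies, so it is safe after Put
}

func main() {
	fmt.Println(render("license-enricher", 3)) // → license-enricher: 3 findings
}
```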
### Scenario 4: Goroutine Leaks / Count Growing

Symptoms: Goroutine count increases over time, eventual resource exhaustion

Diagnosis:

```bash
# Goroutine profile (full stacks)
curl 'http://localhost:8080/debug/pprof/goroutine?debug=2' > goroutine.txt
head -100 goroutine.txt

# Count by state
curl 'http://localhost:8080/debug/pprof/goroutine?debug=1' | head -50
```
Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Blocked channel receive | `chan receive` in stack | Add timeout or close the channel |
| HTTP client without timeout | `net/http.(*persistConn).readLoop` | Set a client timeout |
| Ticker not stopped | `time.Tick` in stack | Use `time.NewTicker` + `defer Stop()` |
| Context not cancelled | `context.Background()` everywhere | Pass and check a context |
| Worker pool leak | Workers waiting on a closed channel | Proper shutdown signaling |
### Scenario 5: Lock Contention / Serialized Execution

Symptoms: CPU not fully utilized, goroutines blocked on a mutex

Diagnosis:

```bash
# Mutex profile (must be enabled via runtime.SetMutexProfileFraction)
curl http://localhost:8080/debug/pprof/mutex > mutex.prof
go tool pprof -top mutex.prof

# Block profile (must be enabled via runtime.SetBlockProfileRate)
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof -top block.prof
```
Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Global mutex | Single lock dominating the mutex profile | Shard by key |
| Write lock for reads | `sync.Mutex` on a read-heavy map | Use `sync.RWMutex` |
| Lock held during I/O | I/O calls while holding a lock | Release the lock before I/O |
| Atomic operations on a struct | `atomic.Value` for config | Use `atomic.Pointer[T]` |
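The shard-by-key fix in miniature: split one hot mutex into N independent shards so goroutines touching different keys stop serializing. A sketch with a hypothetical `shardedCounter` (16 shards, FNV-1a hashing):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardedCounter replaces one global mutex with 16 independent ones;
// contention drops because only keys that hash to the same shard
// compete for the same lock.
type shardedCounter struct {
	shards [16]struct {
		mu sync.Mutex
		m  map[string]int
	}
}

func newShardedCounter() *shardedCounter {
	c := &shardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

func (c *shardedCounter) shard(key string) *struct {
	mu sync.Mutex
	m  map[string]int
} {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%16]
}

func (c *shardedCounter) Inc(key string) {
	s := c.shard(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *shardedCounter) Get(key string) int {
	s := c.shard(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := newShardedCounter()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Inc("cve-lookups") }()
	}
	wg.Wait()
	fmt.Println(c.Get("cve-lookups")) // → 100
}
```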
## BOMvault Service Optimization Guide

### License Enricher

Profile: CPU-bound, high allocation rate from parsing

Key optimizations:

- Cache compiled SPDX license regex patterns at init
- Pool `bytes.Buffer` instances for license text processing
- Preallocate the `AffectedPackages` slice based on typical size
- Stream large license files instead of `io.ReadAll`

```go
// BOMvault license-enricher pattern
var (
	spdxRegex = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9.-]*$`)
	bufPool   = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (e *Enricher) ProcessLicense(data []byte) (*License, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	// ... use buf for processing
}
```
### Vulnerability Enricher

Profile: I/O-bound (NVD API), memory spikes from CVE data

Key optimizations:

- Reuse an `http.Client` with connection pooling
- Stream JSON responses for large CVE feeds
- Set `GOMEMLIMIT` to 80% of container memory
- Use a map for CVE ID lookups instead of slice scanning
- Batch database inserts (100-500 per batch)

```go
// BOMvault vulnerability-enricher pattern
var nvdClient = &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	},
}

type CVEIndex struct {
	byID map[string]*CVE // O(1) lookup
}
```
### Graph Ingest

Profile: Memory-bound, large SBOM processing

Key optimizations:

- Stream SBOM JSON parsing with `json.Decoder`
- Copy component slices to avoid retaining the entire SBOM
- Use `GOMEMLIMIT` as a soft memory limit
- Bounded worker pool for parallel component processing
- Context timeouts for database operations

```go
// BOMvault graph-ingest pattern
func (g *GraphIngest) ProcessSBOM(ctx context.Context, r io.Reader) error {
	dec := json.NewDecoder(r) // Stream, don't ReadAll

	// Bounded parallelism: the semaphore caps in-flight goroutines,
	// and the WaitGroup ensures every worker finishes before we return.
	sem := make(chan struct{}, 10)
	var wg sync.WaitGroup
	for dec.More() {
		var component Component
		if err := dec.Decode(&component); err != nil {
			return err
		}
		sem <- struct{}{}
		wg.Add(1)
		go func(c Component) {
			defer wg.Done()
			defer func() { <-sem }()
			g.processComponent(ctx, c)
		}(component)
	}
	wg.Wait()
	return nil
}
```
### Alert Writer

Profile: I/O-bound (SARIF generation), batch processing

Key optimizations:

- Precompute report templates at startup
- Batch writes to reduce syscalls
- Pool buffers for SARIF report generation
- Use `strings.Builder` for alert message construction

```go
// BOMvault alert-writer pattern
var (
	reportTemplates = template.Must(template.ParseGlob("templates/*.html"))
	bufPool         = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (w *AlertWriter) GenerateSARIF(findings []*Finding) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Grow(len(findings) * 500) // Estimate final size
	defer bufPool.Put(buf)
	// Batch writes into buf, then emit once. The pooled buffer's bytes
	// must not escape after Put, so return a copy of buf.Bytes().
}
```
## Rule Categories by Priority
| Priority | Category | Impact | Prefix |
|---|---|---|---|
| 1 | Measurement & Profiling | CRITICAL | prof- |
| 2 | Allocation & Data Structures | HIGH | alloc- |
| 3 | Strings, Bytes & Encoding | HIGH | bytes- |
| 4 | Concurrency & Synchronization | HIGH | conc- |
| 5 | GC & Memory Limits | HIGH | gc- |
| 6 | I/O & Networking | HIGH | io- |
| 7 | Runtime & Scheduling | MEDIUM | rt- |
| 8 | Work Avoidance & Caching | MEDIUM | work- |
## Quick Reference

### 1. Measurement & Profiling (CRITICAL)

| Rule | Impact | When to Apply |
|---|---|---|
| `prof-use-testing-benchmarks` | Foundation | Always benchmark before optimizing |
| `prof-report-allocs` | Foundation | When allocation rate matters |
| `prof-benchmark-timers` | Foundation | When setup skews results |
| `prof-cpu-profile` | Foundation | CPU-bound workloads |
| `prof-heap-profile` | Foundation | Memory issues, GC pressure |
### 2. Allocation & Data Structures (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| `alloc-preallocate-slices` | 2-10x | Known size, append loops |
| `alloc-preallocate-maps` | 2-5x | Known cardinality |
| `alloc-copy-to-avoid-retention` | Memory leak | Subslices of large arrays |
| `alloc-use-copy-builtin` | 2-3x | Slice-to-slice moves |
| `alloc-avoid-string-byte-conv` | 2x | Frequent conversions |
| `alloc-use-zero-value-buffers` | Minor | Buffer initialization |
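`alloc-preallocate-slices` in one picture: when the final length is known, `make` with capacity avoids the grow-and-copy cycles of repeated `append`. A sketch with a hypothetical `collectIDs` helper:

```go
package main

import "fmt"

// collectIDs preallocates the result slice: one allocation up front
// instead of log(n) reallocations as append outgrows its capacity.
func collectIDs(findings []int) []string {
	ids := make([]string, 0, len(findings)) // single allocation
	for _, f := range findings {
		ids = append(ids, fmt.Sprintf("F-%d", f))
	}
	return ids
}

func main() {
	ids := collectIDs([]int{1, 2, 3})
	fmt.Println(len(ids), cap(ids), ids[0]) // → 3 3 F-1
}
```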
### 3. Strings, Bytes & Encoding (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| `bytes-use-strings-builder` | 100-1000x | String concatenation loops (vs `+` operator) |
| `bytes-use-bytes-buffer` | 10-100x | Byte accumulation |
| `bytes-grow-when-known` | 2-5x | Known final size |
| `bytes-avoid-fmt-in-hot-path` | 5-10x | Number formatting |
| `bytes-precompile-regexp` | 10-100x | Regex in hot path |
### 4. Concurrency & Synchronization (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| `conc-limit-goroutines` | Stability | Unbounded parallelism |
| `conc-bounded-channels` | 2-5x | Burst absorption |
| `conc-use-context-cancel` | Resource safety | Long-running operations |
| `conc-reduce-lock-contention` | 2-10x | Mutex in profile |
| `conc-use-atomics` | 5-10x | Simple counters |
| `conc-pass-context` | Resource safety | All API boundaries |
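`conc-limit-goroutines` and `conc-use-atomics` combined in one sketch: a semaphore channel caps in-flight goroutines, and an atomic counter replaces a mutex-guarded int (the `processAll` function is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll fans out n tasks with at most `limit` running at once
// and counts completions with an atomic instead of a mutex.
func processAll(n, limit int) int64 {
	var processed atomic.Int64
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while `limit` workers are in flight
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			processed.Add(1) // lock-free increment
		}()
	}
	wg.Wait()
	return processed.Load()
}

func main() {
	fmt.Println(processAll(20, 4)) // → 20
}
```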
### 5. GC & Memory Limits (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| `gc-set-gomemlimit` | OOM prevention | Containerized apps |
| `gc-tune-gogc` | CPU/memory tradeoff | GC overhead visible |
| `gc-use-sync-pool` | 10-50x | Short-lived buffers |
| `gc-reset-before-put` | Memory leak | Pooled objects with refs |
| `gc-avoid-pooling-large` | Memory | Large objects (>32KB) |
### 6. I/O & Networking (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| `io-buffered-io` | 10x | Unbuffered file I/O |
| `io-stream-large-bodies` | O(1) memory | Large HTTP bodies |
| `io-reuse-http-client` | 7-10x | Multiple HTTP requests |
| `io-tune-transport` | 2-5x | High-concurrency HTTP |
| `io-set-timeouts` | Stability | All HTTP servers/clients |
### 7. Runtime & Scheduling (MEDIUM)

| Rule | Impact | When to Apply |
|---|---|---|
| `rt-avoid-busy-loop` | 100x CPU | Polling loops |
| `rt-stop-tickers` | Resource leak | `time.NewTicker` usage |
| `rt-set-gomaxprocs` | Container CPU | Docker/ECS/K8s |
| `rt-use-timeout-contexts` | Stability | External calls |
### 8. Work Avoidance & Caching (MEDIUM)

| Rule | Impact | When to Apply |
|---|---|---|
| `work-cache-compiled-regex` | 10-100x | Regex in request path |
| `work-cache-lookups` | O(1) vs O(n) | Repeated containment checks |
| `work-batch-small-writes` | 3-10x | Many small writes |
| `work-precompute-templates` | 10-100x | Template in request path |
| `work-short-circuit-common` | 2-10x | Common trivial inputs |
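`work-cache-compiled-regex` and `work-cache-lookups` together in one sketch (the CVE pattern and ecosystem set are illustrative, not from the codebase):

```go
package main

import (
	"fmt"
	"regexp"
)

// Compile once at package init, not on every request.
var cveRE = regexp.MustCompile(`^CVE-\d{4}-\d{4,}$`)

// A set gives O(1) membership checks versus scanning a slice
// on every request.
var allowedEcosystems = map[string]struct{}{
	"npm": {}, "pypi": {}, "golang": {},
}

func isValidCVE(id string) bool { return cveRE.MatchString(id) }

func isAllowed(eco string) bool {
	_, ok := allowedEcosystems[eco]
	return ok
}

func main() {
	fmt.Println(isValidCVE("CVE-2024-12345"), isAllowed("npm"), isAllowed("maven")) // → true true false
}
```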
## Decision Trees

### "My service is slow"

```
Is it CPU-bound? (CPU near 100%)
├── Yes → Profile CPU
│   ├── Hot function is I/O → Check io-* rules
│   ├── Hot function is encoding → Check bytes-* rules
│   ├── Hot function is your code → Check work-* rules
│   └── Hot function is GC → Check gc-* rules
└── No → Profile for blocking
    ├── Mutex contention → Check conc-reduce-lock-contention
    ├── Channel blocking → Check conc-bounded-channels
    ├── Network I/O → Check io-* rules
    └── Disk I/O → Check io-buffered-io
```
### "My service uses too much memory"

```
Is memory growing over time?
├── Yes (leak) →
│   ├── Goroutine count growing → Check context cancellation
│   ├── Map growing → Add eviction/TTL
│   ├── Slice retention → Use copy() for subslices
│   └── Pooled object refs → Reset before Put
└── No (steady but high) →
    ├── Large allocations → Stream instead of ReadAll
    ├── Many small allocations → Use sync.Pool
    ├── High peak usage → Set GOMEMLIMIT
    └── Buffer reallocation → Preallocate with known size
```
### "My service has GC problems"

```
Is GC taking too much CPU?
├── Yes →
│   ├── Many objects → Pool short-lived objects
│   ├── Large heap → Set GOMEMLIMIT higher
│   └── Frequent cycles → Increase GOGC (200-400)
└── No, but pauses are long →
    ├── Large heap → Reduce allocation rate
    └── Pointer-heavy structures → Consider flat arrays
```
## Profiling Cheat Sheet

### Enable pprof in Production

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // Registers /debug/pprof handlers on the default mux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of app
}
```
### Common pprof Commands

```bash
# Interactive mode
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'
go tool pprof http://localhost:6060/debug/pprof/heap

# Web UI (recommended)
go tool pprof -http=:8080 cpu.prof

# Command-line analysis
go tool pprof -top cpu.prof
go tool pprof -list=FunctionName cpu.prof
go tool pprof -png -output=profile.png cpu.prof

# Compare profiles
go tool pprof -base before.prof after.prof

# Allocation analysis
go tool pprof -alloc_objects heap.prof   # Count of allocations
go tool pprof -alloc_space heap.prof     # Bytes allocated
go tool pprof -inuse_objects heap.prof   # Current live objects
go tool pprof -inuse_space heap.prof     # Current memory usage
```
### Benchmark Commands

```bash
# Run all benchmarks
go test -bench=. -benchmem ./...

# Run a specific benchmark
go test -bench=BenchmarkProcess -benchmem

# Multiple runs for statistical significance
go test -bench=. -benchmem -count=10 | tee results.txt

# Compare results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt

# Generate profiles from benchmarks
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -memprofile=mem.prof
```
## Profile-Guided Optimization (PGO)

Go 1.21+ supports PGO, typically yielding a 2-7% performance improvement in production workloads.

### PGO Workflow

```bash
# Step 1: Collect a production CPU profile (30+ seconds recommended)
curl 'http://localhost:6060/debug/pprof/profile?seconds=60' > default.pgo

# Step 2: Place the profile in the main package directory
cp default.pgo ./cmd/myservice/default.pgo

# Step 3: Build with PGO (default.pgo is auto-detected)
go build ./cmd/myservice

# Step 4: Verify PGO was applied
go build -gcflags="-d=pgo" ./cmd/myservice 2>&1 | grep "PGO"
```

Best practices:

- Collect profiles under realistic production load
- Re-collect profiles periodically (weekly/monthly)
- PGO improves inlining and devirtualization decisions
- Works best for CPU-bound workloads
### PGO Impact by Workload Type

| Workload Type | Expected Improvement | Notes |
|---|---|---|
| HTTP services | 2-4% | Helps with routing, JSON, template code |
| gRPC services | 3-5% | Protocol-buffer encoding benefits |
| CLI tools | 2-3% | Shorter startup time |
| Computation-heavy | 5-7% | Best for math, parsing, encoding |
## Go 1.24 Features (January 2025+)

Go 1.24 introduces significant runtime improvements:

### Swiss Tables for Maps

Maps now use Swiss Tables internally, roughly 10% faster on average:

```go
// No code changes required - automatic in Go 1.24+
m := make(map[string]int) // Uses Swiss Tables internally
```

Impact: Lookup and iteration can be 10-30% faster depending on workload.
### testing.B.Loop for Benchmarks

New idiomatic benchmark pattern (Go 1.24+):

```go
// Go 1.23 and earlier
func BenchmarkProcess(b *testing.B) {
	for i := 0; i < b.N; i++ {
		process()
	}
}

// Go 1.24+ (preferred)
func BenchmarkProcess(b *testing.B) {
	for b.Loop() {
		process()
	}
}
```

Benefits: avoids common benchmark-timer mistakes, cleaner syntax.
## Version Compatibility Table

| Feature | Minimum Go Version | Impact |
|---|---|---|
| Generics | 1.18 | Type-safe pools |
| `GOMEMLIMIT` | 1.19 | OOM prevention |
| PGO | 1.21 | 2-7% |
| `maps` stdlib package | 1.21 | Clone, Keys |
| `slices` stdlib package | 1.21 | Sort, Clone |
| `sync.OnceFunc` | 1.21 | Lazy init |
| `cmp` package | 1.21 | Generic compare |
| `log/slog` | 1.21 | Structured logs |
| Swiss Tables (maps) | 1.24 | ~10% faster maps |
| `testing.B.Loop` | 1.24 | Cleaner benchmarks |
## References

- Effective Go
- Go Performance Wiki
- pprof Documentation
- A Guide to the Go Garbage Collector
- High Performance Go Workshop
- Go Memory Model
- Profile-Guided Optimization
- Go 1.24 Release Notes

## Full Compiled Document

For the complete guide with all rules expanded, see AGENTS.md.