Go Performance Best Practices

Comprehensive performance optimization guide for Go codebases. Contains 41 rules across 8 categories with real-world benchmarks, BOMvault-specific examples, and proven optimization patterns from 10+ years of production experience.

When to Apply

Reference these guidelines when:

  • Writing or refactoring Go code
  • Tuning latency, throughput, allocation rate, or GC behavior
  • Investigating performance regressions
  • Reviewing code for performance issues
  • Debugging memory leaks or goroutine leaks
  • Optimizing containerized services (ECS, Kubernetes)

The Performance Optimization Workflow

Phase 1: Measure First (Don't Guess)

Never optimize without data. The #1 mistake is optimizing based on intuition.

# Step 1: Establish baseline with benchmarks
go test -bench=. -benchmem -count=5 ./... | tee baseline.txt

# Step 2: Generate CPU profile for hot paths
go test -bench=BenchmarkCriticalPath -cpuprofile=cpu.prof
go tool pprof -http=:8080 cpu.prof

# Step 3: Generate heap profile for allocations
go test -bench=BenchmarkCriticalPath -memprofile=heap.prof
go tool pprof -http=:8080 heap.prof

# Step 4: Check allocation counts (correlates with latency)
go tool pprof -alloc_objects heap.prof

Key pprof views:

| View | Use for |
|------|---------|
| top | Quick ranking of hot functions |
| list funcname | Line-by-line attribution |
| web | Visual call graph |
| flamegraph | Flame graph for deep call stacks (web UI) |
| peek funcname | Callers and callees |

Phase 2: Identify the Bottleneck

Use the right profile for the right problem:

| Symptom | Profile type | pprof flag |
|---------|--------------|------------|
| High CPU usage | CPU | -cpuprofile |
| High memory usage | Heap (inuse) | -memprofile + -inuse_space |
| High allocation rate / GC pressure | Heap (alloc) | -memprofile + -alloc_objects |
| Goroutine leaks | Goroutine | runtime/pprof.Lookup("goroutine") |
| Lock contention | Mutex | -mutexprofile |
| Blocking operations | Block | -blockprofile |

Quick diagnosis commands:

# CPU: What's using the most cycles?
go tool pprof -top cpu.prof

# Memory: What's consuming the most heap?
go tool pprof -top -inuse_space heap.prof

# Allocations: What's creating the most objects?
go tool pprof -top -alloc_objects heap.prof

# Compare before/after
go tool pprof -base baseline.prof optimized.prof

Phase 3: Apply Targeted Optimization

Match the symptom to the optimization category:

| Symptom | Category | Key rules |
|---------|----------|-----------|
| CPU-bound | Work Avoidance | work-cache-*, work-short-circuit-* |
| Memory-bound | Allocation | alloc-preallocate-*, alloc-copy-to-avoid-retention |
| GC pauses | GC Tuning | gc-set-gomemlimit, gc-use-sync-pool |
| I/O latency | I/O | io-buffered-io, io-reuse-http-client |
| Lock contention | Concurrency | conc-reduce-lock-contention, conc-use-atomics |
| Goroutine explosion | Concurrency | conc-limit-goroutines, conc-bounded-channels |

Phase 4: Verify Improvement

# Run benchmark again
go test -bench=. -benchmem -count=5 ./... | tee optimized.txt

# Compare results
benchstat baseline.txt optimized.txt

# Verify no regressions in other benchmarks

Success criteria:

  • Measurable improvement (not just "feels faster")
  • No regressions in other areas
  • Code remains readable and maintainable
  • Changes are justified by data

Common Optimization Scenarios

Scenario 1: High Latency / Slow Response Times

Symptoms: P99 latency spikes, slow API responses, timeouts

Diagnosis:

# CPU profile during slow requests
curl "http://localhost:8080/debug/pprof/profile?seconds=30" > cpu.prof
go tool pprof -http=:8081 cpu.prof  # use a port the service isn't already bound to

Common causes and fixes:

| Cause | Indicator | Fix |
|-------|-----------|-----|
| JSON encoding | encoding/json in top | Use json.NewEncoder streaming; consider jsoniter |
| Regex compilation | regexp.Compile in hot path | Cache compiled regex at init |
| Slice/map scanning | Loops in profile | Convert to map lookup |
| String concatenation | + operator in loops | Use strings.Builder |
| Excessive logging | Logger in top | Reduce log level in hot path |

Scenario 2: High Memory Usage / OOM Kills

Symptoms: Container OOM killed, memory growing over time, swap thrashing

Diagnosis:

# Heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -inuse_space -top heap.prof

# Check for memory leaks (growing allocations)
go tool pprof -alloc_space -top heap.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|-------|-----------|-----|
| Large slice retention | append with small subslices | copy() to a new slice |
| Unbounded caches | Map growing without eviction | Add LRU/TTL eviction |
| io.ReadAll on large files | Large []byte allocations | Stream with io.Copy |
| String/[]byte conversions | runtime.stringtoslicebyte | Stay in one domain |
| Goroutine leaks | Goroutine count growing | Check context cancellation |

Scenario 3: High GC Pressure / CPU Spent in GC

Symptoms: gc_pause_seconds high, runtime.mallocgc in CPU profile

Diagnosis:

# Check GC stats
GODEBUG=gctrace=1 ./myservice 2>&1 | head -20

# Allocation profile
go tool pprof -alloc_objects -top heap.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|-------|-----------|-----|
| Many small allocations | High alloc_objects | Use sync.Pool |
| Creating slices in loops | make([]T, ...) in hot path | Preallocate or pool |
| fmt.Sprintf in hot path | fmt.* allocations | Use strconv |
| Interface boxing | interface{} conversions | Use generics or concrete types |
| Not setting GOMEMLIMIT | Frequent GC cycles | Set GOMEMLIMIT to 80-90% of container memory |

Scenario 4: Goroutine Leaks / Count Growing

Symptoms: Goroutine count increases over time, eventual resource exhaustion

Diagnosis:

# Goroutine profile (full stacks)
curl "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutine.txt
head -100 goroutine.txt

# Count by state
curl "http://localhost:8080/debug/pprof/goroutine?debug=1" | head -50

Common causes and fixes:

| Cause | Indicator | Fix |
|-------|-----------|-----|
| Blocked channel receive | chan receive in stack | Add timeout or close channel |
| HTTP client no timeout | net/http.(*persistConn).readLoop | Set client timeout |
| Ticker not stopped | time.Tick in stack | Use time.NewTicker + defer Stop() |
| Context not cancelled | context.Background() everywhere | Pass and check context |
| Worker pool leak | Workers blocked on a channel that is never closed | Proper shutdown signaling |

Scenario 5: Lock Contention / Serialized Execution

Symptoms: CPU not fully utilized, goroutines blocked on mutex

Diagnosis:

# Mutex profile (must be enabled)
curl http://localhost:8080/debug/pprof/mutex > mutex.prof
go tool pprof -top mutex.prof

# Block profile
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof -top block.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|-------|-----------|-----|
| Global mutex | Single lock in mutex profile | Shard by key |
| Write lock for reads | sync.Mutex on read-heavy map | Use sync.RWMutex |
| Lock held during I/O | I/O calls while holding lock | Release lock before I/O |
| Atomic operations on struct | atomic.Value for config | Use atomic.Pointer[T] |

BOMvault Service Optimization Guide

License Enricher

Profile: CPU-bound, high allocation rate from parsing

Key optimizations:

  • Cache compiled SPDX license regex patterns at init
  • Pool bytes.Buffer for license text processing
  • Preallocate slice for AffectedPackages based on typical size
  • Stream large license files instead of io.ReadAll

// BOMvault license-enricher pattern
var (
    spdxRegex = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9.-]*$`)
    bufPool   = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (e *Enricher) ProcessLicense(data []byte) (*License, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)
    // ... use buf for processing
}

Vulnerability Enricher

Profile: I/O-bound (NVD API), memory spikes from CVE data

Key optimizations:

  • Reuse http.Client with connection pooling
  • Stream JSON responses for large CVE feeds
  • Set GOMEMLIMIT to 80% of container memory
  • Use map for CVE ID lookups instead of slice scanning
  • Batch database inserts (100-500 per batch)

// BOMvault vulnerability-enricher pattern
var nvdClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    },
}

type CVEIndex struct {
    byID map[string]*CVE  // O(1) lookup
}

Graph Ingest

Profile: Memory-bound, large SBOM processing

Key optimizations:

  • Stream SBOM JSON parsing with json.Decoder
  • Copy component slices to avoid retaining entire SBOM
  • Use GOMEMLIMIT with soft memory limit
  • Bounded worker pool for parallel component processing
  • Context timeouts for database operations

// BOMvault graph-ingest pattern
func (g *GraphIngest) ProcessSBOM(ctx context.Context, r io.Reader) error {
    dec := json.NewDecoder(r)  // Stream, don't ReadAll

    // Bounded parallelism
    sem := make(chan struct{}, 10)
    var wg sync.WaitGroup

    for dec.More() {
        var component Component
        if err := dec.Decode(&component); err != nil {
            return err
        }

        sem <- struct{}{}
        wg.Add(1)
        go func(c Component) {
            defer wg.Done()
            defer func() { <-sem }()
            g.processComponent(ctx, c)
        }(component)
    }
    wg.Wait() // wait for in-flight workers before returning
    return nil
}

Alert Writer

Profile: I/O-bound (SARIF generation), batch processing

Key optimizations:

  • Precompute report templates at startup
  • Batch writes to reduce syscalls
  • Pool buffers for SARIF report generation
  • Use strings.Builder for alert message construction

// BOMvault alert-writer pattern
var (
    reportTemplates = template.Must(template.ParseGlob("templates/*.html"))
    bufPool         = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (w *AlertWriter) GenerateSARIF(findings []*Finding) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    buf.Grow(len(findings) * 500)  // Estimate size
    defer bufPool.Put(buf)

    // Batch write to buffer, then single Write to output
}

Rule Categories by Priority

| Priority | Category | Impact | Prefix |
|----------|----------|--------|--------|
| 1 | Measurement & Profiling | CRITICAL | prof- |
| 2 | Allocation & Data Structures | HIGH | alloc- |
| 3 | Strings, Bytes & Encoding | HIGH | bytes- |
| 4 | Concurrency & Synchronization | HIGH | conc- |
| 5 | GC & Memory Limits | HIGH | gc- |
| 6 | I/O & Networking | HIGH | io- |
| 7 | Runtime & Scheduling | MEDIUM | rt- |
| 8 | Work Avoidance & Caching | MEDIUM | work- |

Quick Reference

1. Measurement & Profiling (CRITICAL)

| Rule | Impact | When to apply |
|------|--------|---------------|
| prof-use-testing-benchmarks | Foundation | Always benchmark before optimizing |
| prof-report-allocs | Foundation | When allocation rate matters |
| prof-benchmark-timers | Foundation | When setup skews results |
| prof-cpu-profile | Foundation | CPU-bound workloads |
| prof-heap-profile | Foundation | Memory issues, GC pressure |

2. Allocation & Data Structures (HIGH)

| Rule | Impact | When to apply |
|------|--------|---------------|
| alloc-preallocate-slices | 2-10x | Known size, append loops |
| alloc-preallocate-maps | 2-5x | Known cardinality |
| alloc-copy-to-avoid-retention | Memory leak | Subslices of large arrays |
| alloc-use-copy-builtin | 2-3x | Slice-to-slice moves |
| alloc-avoid-string-byte-conv | 2x | Frequent conversions |
| alloc-use-zero-value-buffers | Minor | Buffer initialization |

3. Strings, Bytes & Encoding (HIGH)

| Rule | Impact | When to apply |
|------|--------|---------------|
| bytes-use-strings-builder | 100-1000x | String concatenation loops (vs + operator) |
| bytes-use-bytes-buffer | 10-100x | Byte accumulation |
| bytes-grow-when-known | 2-5x | Known final size |
| bytes-avoid-fmt-in-hot-path | 5-10x | Number formatting |
| bytes-precompile-regexp | 10-100x | Regex in hot path |

4. Concurrency & Synchronization (HIGH)

| Rule | Impact | When to apply |
|------|--------|---------------|
| conc-limit-goroutines | Stability | Unbounded parallelism |
| conc-bounded-channels | 2-5x | Burst absorption |
| conc-use-context-cancel | Resource safety | Long-running operations |
| conc-reduce-lock-contention | 2-10x | Mutex in profile |
| conc-use-atomics | 5-10x | Simple counters |
| conc-pass-context | Resource safety | All API boundaries |

5. GC & Memory Limits (HIGH)

| Rule | Impact | When to apply |
|------|--------|---------------|
| gc-set-gomemlimit | OOM prevention | Containerized apps |
| gc-tune-gogc | CPU/memory tradeoff | GC overhead visible |
| gc-use-sync-pool | 10-50x | Short-lived buffers |
| gc-reset-before-put | Memory leak | Pooled objects with refs |
| gc-avoid-pooling-large | Memory | Large objects (>32KB) |

6. I/O & Networking (HIGH)

| Rule | Impact | When to apply |
|------|--------|---------------|
| io-buffered-io | 10x | Unbuffered file I/O |
| io-stream-large-bodies | O(1) memory | Large HTTP bodies |
| io-reuse-http-client | 7-10x | Multiple HTTP requests |
| io-tune-transport | 2-5x | High concurrency HTTP |
| io-set-timeouts | Stability | All HTTP servers/clients |

7. Runtime & Scheduling (MEDIUM)

| Rule | Impact | When to apply |
|------|--------|---------------|
| rt-avoid-busy-loop | 100x CPU | Polling loops |
| rt-stop-tickers | Resource leak | time.NewTicker usage |
| rt-set-gomaxprocs | Container CPU | Docker/ECS/K8s |
| rt-use-timeout-contexts | Stability | External calls |

8. Work Avoidance & Caching (MEDIUM)

| Rule | Impact | When to apply |
|------|--------|---------------|
| work-cache-compiled-regex | 10-100x | Regex in request path |
| work-cache-lookups | O(1) vs O(n) | Repeated containment checks |
| work-batch-small-writes | 3-10x | Many small writes |
| work-precompute-templates | 10-100x | Template in request path |
| work-short-circuit-common | 2-10x | Common trivial inputs |

Decision Trees

"My service is slow"

Is it CPU-bound? (CPU near 100%)
├── Yes → Profile CPU
│   ├── Hot function is I/O → Check io-* rules
│   ├── Hot function is encoding → Check bytes-* rules
│   ├── Hot function is your code → Check work-* rules
│   └── Hot function is GC → Check gc-* rules
└── No → Profile for blocking
    ├── Mutex contention → Check conc-reduce-lock-contention
    ├── Channel blocking → Check conc-bounded-channels
    ├── Network I/O → Check io-* rules
    └── Disk I/O → Check io-buffered-io

"My service uses too much memory"

Is memory growing over time?
├── Yes (leak) →
│   ├── Goroutine count growing → Check context cancellation
│   ├── Map growing → Add eviction/TTL
│   ├── Slice retention → Use copy() for subslices
│   └── Pooled object refs → Reset before Put
└── No (steady but high) →
    ├── Large allocations → Stream instead of ReadAll
    ├── Many small allocations → Use sync.Pool
    ├── High peak usage → Set GOMEMLIMIT
    └── Buffer reallocation → Preallocate with known size

"My service has GC problems"

Is GC taking too much CPU?
├── Yes →
│   ├── Many objects → Pool short-lived objects
│   ├── Large heap → Set GOMEMLIMIT higher
│   └── Frequent cycles → Increase GOGC (200-400)
└── No, but pauses are long →
    ├── Large heap → Reduce allocation rate
    └── Pointer-heavy structures → Consider flat arrays

Profiling Cheat Sheet

Enable pprof in Production

import _ "net/http/pprof"

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of app
}

Common pprof Commands

# Interactive mode
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap

# Web UI (recommended)
go tool pprof -http=:8080 cpu.prof

# Command-line analysis
go tool pprof -top cpu.prof
go tool pprof -list=FunctionName cpu.prof
go tool pprof -png -output=profile.png cpu.prof

# Compare profiles
go tool pprof -base before.prof after.prof

# Allocation analysis
go tool pprof -alloc_objects heap.prof  # Count of allocations
go tool pprof -alloc_space heap.prof    # Bytes allocated
go tool pprof -inuse_objects heap.prof  # Current live objects
go tool pprof -inuse_space heap.prof    # Current memory usage

Benchmark Commands

# Run all benchmarks
go test -bench=. -benchmem ./...

# Run specific benchmark
go test -bench=BenchmarkProcess -benchmem

# Multiple runs for statistical significance
go test -bench=. -benchmem -count=10 | tee results.txt

# Compare results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt

# Generate profiles from benchmarks
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -memprofile=mem.prof

Profile-Guided Optimization (PGO)

Go 1.21+ supports PGO for 2-7% performance improvement in production workloads.

PGO Workflow

# Step 1: Collect a production CPU profile (30+ seconds recommended)
curl "http://localhost:6060/debug/pprof/profile?seconds=60" > default.pgo

# Step 2: Place profile in package directory
cp default.pgo ./cmd/myservice/default.pgo

# Step 3: Build with PGO (auto-detects default.pgo)
go build ./cmd/myservice

# Step 4: Verify PGO was applied (the profile path is recorded in build info)
go version -m ./myservice | grep pgo

Best practices:

  • Collect profiles under realistic production load
  • Re-collect profiles periodically (weekly/monthly)
  • PGO improves inlining and devirtualization decisions
  • Works best for CPU-bound workloads

PGO Impact by Workload Type

| Workload type | Expected improvement | Notes |
|---------------|----------------------|-------|
| HTTP services | 2-4% | Helps with routing, JSON, template code |
| gRPC services | 3-5% | Protocol buffer encoding benefits |
| CLI tools | 2-3% | Shorter startup time |
| Computation-heavy | 5-7% | Best for math, parsing, encoding |

Go 1.24 Features (February 2025+)

Go 1.24 introduces significant runtime improvements:

Swiss Tables for Maps

Maps now use Swiss Tables internally for ~10% faster operations on average:

// No code changes required - automatic in Go 1.24+
m := make(map[string]int)  // Uses Swiss Tables internally

Impact: Lookup and iteration 10-30% faster depending on workload.

testing.B.Loop for Benchmarks

New idiomatic benchmark pattern (Go 1.24+):

// Go 1.23 and earlier
func BenchmarkProcess(b *testing.B) {
    for i := 0; i < b.N; i++ {
        process()
    }
}

// Go 1.24+ (preferred)
func BenchmarkProcess(b *testing.B) {
    for b.Loop() {
        process()
    }
}

Benefits: Avoids common mistakes with benchmark timers, cleaner syntax.

Version Compatibility Table

| Feature | Minimum Go version | Impact |
|---------|--------------------|--------|
| Generics | 1.18 | Type-safe pools |
| GOMEMLIMIT | 1.19 | OOM prevention |
| PGO | 1.21 | 2-7% |
| maps stdlib package | 1.21 | Clone, Keys |
| slices stdlib package | 1.21 | Sort, Clone |
| sync.OnceFunc | 1.21 | Lazy init |
| cmp package | 1.21 | Generic compare |
| log/slog | 1.21 | Structured logs |
| Swiss Tables (maps) | 1.24 | ~10% faster maps |
| testing.B.Loop | 1.24 | Cleaner benchmarks |

References

Full Compiled Document

For the complete guide with all rules expanded: AGENTS.md
