dotnet-ci-benchmarking

SKILL.md

dotnet-ci-benchmarking

Continuous benchmarking guidance for detecting performance regressions in CI pipelines. Covers baseline file management with BenchmarkDotNet JSON exporters, GitHub Actions workflows for artifact-based baseline comparison, regression detection patterns with configurable thresholds, and alerting strategies for performance degradation.

Version assumptions: BenchmarkDotNet v0.14+ for JSON export, GitHub Actions runner environment. Examples use actions/upload-artifact@v4 and actions/download-artifact@v4.

Scope

  • Baseline file management with BenchmarkDotNet JSON exporters
  • GitHub Actions workflows for artifact-based baseline comparison
  • Regression detection with configurable thresholds
  • Alerting strategies for performance degradation

Out of scope

  • BenchmarkDotNet setup and benchmark class design -- see [skill:dotnet-benchmarkdotnet]
  • Performance architecture patterns -- see [skill:dotnet-performance-patterns]
  • Profiling tools (dotnet-counters, dotnet-trace, dotnet-dump) -- see [skill:dotnet-profiling]
  • OpenTelemetry metrics and distributed tracing -- see [skill:dotnet-observability]
  • Composable CI/CD workflow design -- see [skill:dotnet-gha-patterns]

Cross-references: [skill:dotnet-benchmarkdotnet] for benchmark class setup and JSON exporter configuration, [skill:dotnet-observability] for correlating benchmark regressions with runtime metrics changes, [skill:dotnet-gha-patterns] for composable workflow patterns (reusable workflows, composite actions, matrix builds).


Baseline File Management

BenchmarkDotNet JSON Export

BenchmarkDotNet's JSON exporter produces machine-readable results for automated comparison. Configure the exporter in benchmark classes:


using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Exporters.Json;

[JsonExporterAttribute.Full]
[MemoryDiagnoser]
public class CriticalPathBenchmarks
{
    [Benchmark(Baseline = true)]
    public void ProcessOrder() { /* ... */ }

    [Benchmark]
    public void ProcessOrderOptimized() { /* ... */ }
}

```text

Or configure via custom config for all benchmark classes:

```csharp

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Exporters.Json;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = ManualConfig.Create(DefaultConfig.Instance)
    .AddJob(Job.ShortRun)  // fewer iterations for CI speed
    .AddExporter(JsonExporter.Full)
    .WithArtifactsPath("./benchmark-results");

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);

```json

### JSON Export Structure

The exported JSON file (`*-report-full.json`) contains structured benchmark results:

```json

{
  "Title": "CriticalPathBenchmarks",
  "Benchmarks": [
    {
      "FullName": "MyApp.Benchmarks.CriticalPathBenchmarks.ProcessOrder",
      "Statistics": {
        "Mean": 1234.5678,
        "Median": 1230.1234,
        "StandardDeviation": 15.234,
        "StandardError": 4.812
      },
      "Memory": {
        "BytesAllocatedPerOperation": 1024,
        "Gen0Collections": 0.0012,
        "Gen1Collections": 0,
        "Gen2Collections": 0
      }
    }
  ]
}

```text

Key fields for regression comparison:

| Field                               | Purpose                                         |
| ----------------------------------- | ----------------------------------------------- |
| `Statistics.Mean`                   | Average execution time (nanoseconds)            |
| `Statistics.Median`                 | Middle execution time (more robust to outliers) |
| `Statistics.StandardDeviation`      | Measurement variability                         |
| `Memory.BytesAllocatedPerOperation` | GC allocation per operation                     |

### Baseline Storage Strategies

| Strategy                         | Pros                                   | Cons                                                            | Best For                                |
| -------------------------------- | -------------------------------------- | --------------------------------------------------------------- | --------------------------------------- |
| Git-committed baseline file      | Versioned, auditable, no external deps | Repo size grows; must update deliberately                       | Small benchmark suites, stable hardware |
| GitHub Actions artifacts         | No repo bloat; automatic retention     | 90-day default retention; cross-workflow access requires tokens | Large benchmark suites, shared runners  |
| External storage (S3/Azure Blob) | Unlimited history; cross-repo sharing  | Extra infrastructure; credential management                     | Multi-repo benchmark comparison         |

This skill focuses on the **GitHub Actions artifact** strategy as the default. For composable workflow patterns and
reusable actions, see [skill:dotnet-gha-patterns].

---

## GitHub Actions Benchmark Workflow

### Basic Benchmark Workflow

```yaml

name: Benchmarks

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'
  workflow_dispatch:

permissions:
  contents: read
  actions: read # required for artifact download

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run benchmarks
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json

      - name: Upload benchmark results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90

```text

### Baseline Comparison Workflow

This workflow downloads the baseline from a previous run and compares against current results:

```yaml

name: Benchmark Regression Check

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'

permissions:
  contents: read
  actions: read

env:
  BENCHMARK_PROJECT: benchmarks/MyBenchmarks.csproj
  RESULTS_DIR: benchmarks/BenchmarkDotNet.Artifacts/results

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Download baseline results
        uses: actions/download-artifact@v4
        with:
          name: benchmark-baseline
          path: ./baseline-results
        continue-on-error: true
        id: download-baseline

      - name: Run benchmarks
        run: dotnet run -c Release --project ${{ env.BENCHMARK_PROJECT }} -- --exporters json

      - name: Compare with baseline
        if: steps.download-baseline.outcome == 'success'
        shell: bash
        run: |
          set -euo pipefail
          python3 scripts/compare-benchmarks.py \
            --baseline ./baseline-results \
            --current "${{ env.RESULTS_DIR }}" \
            --threshold 10 \
            --output benchmark-comparison.md

      - name: Upload current results as new baseline
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-baseline
          path: ${{ env.RESULTS_DIR }}/
          retention-days: 90
          overwrite: true

      - name: Upload comparison report
        if: steps.download-baseline.outcome == 'success'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-comparison-${{ github.sha }}
          path: benchmark-comparison.md
          retention-days: 30

```markdown

**Key design decisions:**

- `continue-on-error: true` on baseline download handles first-run (no baseline exists yet)
- Baseline is only updated from `main` branch merges to prevent PR branches from polluting the baseline
- `overwrite: true` replaces the previous baseline artifact

For converting these inline workflows into reusable `workflow_call` patterns, see [skill:dotnet-gha-patterns].

---

## Regression Detection Patterns

### Threshold-Based Comparison

Compare current benchmark results against baseline using percentage thresholds. A regression is flagged when the current
mean exceeds the baseline mean by more than the configured threshold:

```python

#!/usr/bin/env python3
"""compare-benchmarks.py -- Detect benchmark regressions from BenchmarkDotNet JSON exports."""

import json
import sys
from pathlib import Path

def load_benchmarks(results_dir: str) -> dict:
    """Load benchmark results from BenchmarkDotNet JSON export files."""
    benchmarks = {}
    for json_file in Path(results_dir).glob("*-report-full.json"):
        with open(json_file) as f:
            data = json.load(f)
        for bm in data.get("Benchmarks", []):
            name = bm["FullName"]
            benchmarks[name] = {
                "mean": bm["Statistics"]["Mean"],
                "median": bm["Statistics"]["Median"],
                "stddev": bm["Statistics"]["StandardDeviation"],
                "allocated": bm.get("Memory", {}).get("BytesAllocatedPerOperation", 0),
            }
    return benchmarks

def compare(baseline_dir: str, current_dir: str, threshold_pct: float) -> list:
    """Compare current results against baseline. Returns list of regressions."""
    baseline = load_benchmarks(baseline_dir)
    current = load_benchmarks(current_dir)
    regressions = []

    for name, curr in current.items():
        if name not in baseline:
            continue  # new benchmark, no comparison possible
        base = baseline[name]
        if base["mean"] == 0:
            continue  # avoid division by zero

        time_change_pct = ((curr["mean"] - base["mean"]) / base["mean"]) * 100
        alloc_change = curr["allocated"] - base["allocated"]

        if time_change_pct > threshold_pct:
            regressions.append({
                "name": name,
                "baseline_mean": base["mean"],
                "current_mean": curr["mean"],
                "change_pct": time_change_pct,
                "alloc_change": alloc_change,
            })

    return regressions

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Compare BenchmarkDotNet results")
    parser.add_argument("--baseline", required=True, help="Path to baseline results directory")
    parser.add_argument("--current", required=True, help="Path to current results directory")
    parser.add_argument("--threshold", type=float, default=10.0,
                        help="Regression threshold percentage (default: 10)")
    parser.add_argument("--output", default="comparison.md", help="Output markdown file")
    args = parser.parse_args()

    regressions = compare(args.baseline, args.current, args.threshold)

    with open(args.output, "w") as f:
        if regressions:
            f.write("## Benchmark Regressions Detected\n\n")
            f.write("| Benchmark | Baseline (ns) | Current (ns) | Change | Alloc Delta |\n")
            f.write("|-----------|--------------|-------------|--------|-------------|\n")
            for r in regressions:
                f.write(f"| `{r['name']}` | {r['baseline_mean']:.2f} | "
                        f"{r['current_mean']:.2f} | +{r['change_pct']:.1f}% | "
                        f"{r['alloc_change']:+d} B |\n")
            f.write(f"\nThreshold: {args.threshold}%\n")
        else:
            f.write("## Benchmark Results\n\nNo regressions detected ")
            f.write(f"(threshold: {args.threshold}%).\n")

    if regressions:
        print(f"REGRESSION: {len(regressions)} benchmark(s) exceeded "
              f"{args.threshold}% threshold", file=sys.stderr)
        sys.exit(1)

```text

### Choosing Thresholds

| Environment                   | Suggested Threshold | Rationale                                                    |
| ----------------------------- | ------------------- | ------------------------------------------------------------ |
| Dedicated benchmark hardware  | 5%                  | Low noise floor; small regressions are signal                |
| GitHub Actions shared runners | 10-15%              | Shared runners introduce 5-10% variance from noisy neighbors |
| Self-hosted runners           | 5-10%               | More stable than shared, but still monitor variance          |

**Calibrate thresholds empirically:** Run the same benchmark suite 5-10 times on your CI environment without code
changes. The maximum observed variance sets your noise floor. Set the threshold above this noise floor (typically 2x the
observed variance).

### Allocation Regression Detection

Memory allocation regressions are more reliable signals than timing regressions because allocations are deterministic
(not affected by noisy neighbors):

```python

# Add to the compare function:
if alloc_change > 0:
    regressions.append({
        "name": name,
        "type": "allocation",
        "baseline_alloc": base["allocated"],
        "current_alloc": curr["allocated"],
        "alloc_change": alloc_change,
    })

```text

Use allocation changes as a **hard gate** (zero tolerance for new allocations in zero-alloc paths) and timing changes as
a **soft gate** (warning with threshold).

---

## Alerting Strategies

### PR Comment with Regression Summary

Post benchmark comparison results as a PR comment for reviewer visibility:

```yaml

- name: Comment PR with results
  if: steps.download-baseline.outcome == 'success' && github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const body = fs.readFileSync('benchmark-comparison.md', 'utf8');
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: body
      });

```text

### Fail the Build on Regression

Exit with non-zero status from the comparison script to fail the GitHub Actions job. This prevents merging PRs that
introduce performance regressions:

```yaml

- name: Check for regressions
  if: steps.download-baseline.outcome == 'success'
  shell: bash
  run: |
    set -euo pipefail
    python3 scripts/compare-benchmarks.py \
      --baseline ./baseline-results \
      --current "${{ env.RESULTS_DIR }}" \
      --threshold 10
    # Script exits non-zero if regressions found -- fails the job

```text

For required status checks and branch protection integration with benchmark gates, see [skill:dotnet-gha-patterns].

### Trend Tracking

For long-term trend analysis beyond single-PR comparison, upload results to a persistent store and track metrics over
time:

| Approach                           | Tool                                          | Complexity                            |
| ---------------------------------- | --------------------------------------------- | ------------------------------------- |
| GitHub Actions artifacts           | Built-in, 90-day retention                    | Low -- artifact download/upload only  |
| GitHub Pages with benchmark-action | `benchmark-action/github-action-benchmark@v1` | Medium -- auto-generates trend charts |
| External time-series DB            | InfluxDB, Prometheus + Grafana                | High -- full observability stack      |

The simplest approach for most projects is the artifact-based baseline comparison shown in this skill. Graduate to trend
tracking when you need historical regression analysis across many releases.

---

## CI-Specific BenchmarkDotNet Configuration

### ShortRun for CI Speed

Full benchmark runs take 10-30+ minutes. Use `Job.ShortRun` in CI to reduce iteration counts while retaining regression
detection capability:

```csharp

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class CiConfig : ManualConfig
{
    public CiConfig()
    {
        AddJob(Job.ShortRun
            .WithWarmupCount(3)
            .WithIterationCount(5)
            .WithInvocationCount(1));

        AddExporter(BenchmarkDotNet.Exporters.Json.JsonExporter.Full);
    }
}

```json

Apply conditionally based on environment:

```csharp

var config = Environment.GetEnvironmentVariable("CI") is not null
    ? new CiConfig()
    : DefaultConfig.Instance;

BenchmarkRunner.Run<CriticalPathBenchmarks>(config);

```text

### Filtering Benchmarks for CI

Run only critical-path benchmarks in CI to reduce pipeline duration:

```bash

# Run only benchmarks in the "Critical" category
dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- \
  --filter *Critical* --exporters json

```bash

```csharp

[BenchmarkCategory("Critical")]
[MemoryDiagnoser]
[JsonExporterAttribute.Full]
public class CriticalPathBenchmarks
{
    [Benchmark]
    public void ProcessOrder() { /* ... */ }
}

[BenchmarkCategory("Extended")]
[MemoryDiagnoser]
public class ExtendedBenchmarks
{
    [Benchmark]
    public void RareCodePath() { /* ... */ }
}

```text

Run `Critical` benchmarks on every PR; run `Extended` benchmarks on a nightly schedule.

### Nightly Benchmark Schedule

```yaml

name: Nightly Benchmarks (Full Suite)

on:
  schedule:
    - cron: '0 3 * * *' # 3 AM UTC daily
  workflow_dispatch:

jobs:
  benchmark-full:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run full benchmark suite
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json
        # No --filter: runs all benchmarks including Extended category

      - name: Upload full results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-full-${{ github.run_number }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90

```text

For scheduled workflow patterns and matrix builds across TFMs, see [skill:dotnet-gha-patterns].

---

## Agent Gotchas

1. **Use `Job.ShortRun` in CI, not `Job.Default`** -- default benchmark jobs run many iterations for statistical
   precision, taking 10-30+ minutes per benchmark class. CI pipelines need faster feedback with `ShortRun` (3 warmup, 5
   iteration).
2. **Set threshold above measured noise floor** -- shared CI runners introduce 5-10% timing variance from noisy
   neighbors. A 5% threshold on shared runners produces false positives. Calibrate by running the same code multiple
   times and measuring variance.
3. **Use allocation changes as hard gates** -- allocation counts are deterministic and unaffected by runner noise. A
   zero-to-nonzero allocation change is always a real regression, unlike timing variations.
4. **Only update baselines from main branch** -- if PR branches can update the baseline, a regression in one PR becomes
   the new baseline, masking it from subsequent comparisons.
5. **Always set `set -euo pipefail` in bash steps** -- without `pipefail`, a regression detection script that exits
   non-zero in a pipeline (e.g., `script | tee`) does not fail the GitHub Actions step.
6. **Handle missing baselines gracefully** -- the first CI run has no baseline to compare against. Use
   `continue-on-error: true` on the baseline download step and skip comparison when no baseline exists.
7. **Export JSON, not just Markdown** -- Markdown reports are human-readable but not machine-parseable for automated
   regression detection. Always include `[JsonExporterAttribute.Full]` or `JsonExporter.Full` in the config.
Weekly Installs
1
First Seen
11 days ago
Installed on
amp1
cline1
opencode1
cursor1
kimi-cli1
codex1