SLOs, SLIs, and Error Budgets

Define and measure reliability in terms that matter to users.

Terminology

| Term | Definition | Example |
|------|------------|---------|
| SLI | Service Level Indicator - What you measure | 99.2% of requests succeed |
| SLO | Service Level Objective - Your target | 99.9% availability |
| SLA | Service Level Agreement - Contract with customers | 99.5% with refund clause |
| Error Budget | Allowed unreliability (100% - SLO) | 0.1% = 43 min/month downtime |

Common SLI Types

Availability

availability = successful_requests / total_requests

# Prometheus query
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
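
The same ratio can be checked offline from raw request counts, for example when validating a dashboard. This is a minimal Python sketch; the function name and the sample counts are illustrative assumptions, and treating an idle window as fully available is a policy choice rather than a rule.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded in the window."""
    if total_requests == 0:
        # Assumption: a window with no traffic does not count against the SLO.
        return 1.0
    return successful_requests / total_requests


# Hypothetical counts for one 30-day window.
sli = availability_sli(successful_requests=999_200, total_requests=1_000_000)
print(f"availability SLI: {sli:.2%}")
```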

Latency

latency_sli = requests_under_threshold / total_requests

# Example: 99% of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
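
Prometheus histogram buckets are cumulative, so the `le="0.2"` bucket already counts every request that finished in 200ms or less. The sketch below mirrors that calculation in Python under stated assumptions: the bucket boundaries and counts are made up, and the threshold must coincide with an existing bucket boundary.

```python
def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests that completed within threshold_s seconds.

    cumulative_buckets maps each upper bound (in seconds) to the number of
    requests at or below that bound, like Prometheus `le` buckets.
    The float("inf") entry must be present and holds the total count.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    if threshold_s not in cumulative_buckets:
        raise KeyError(f"no bucket boundary at {threshold_s}s")
    return cumulative_buckets[threshold_s] / total


# Hypothetical cumulative counts for one window.
buckets = {0.1: 900, 0.2: 992, 0.5: 999, float("inf"): 1000}
sli = latency_sli(buckets, 0.2)
print(f"{sli:.1%} of requests under 200ms, meets 99% target: {sli >= 0.99}")
```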

Throughput

throughput_sli = successful_operations / attempted_operations

# Example: Batch jobs
sum(job_succeeded_total) / sum(job_attempted_total)

Freshness

freshness_sli = fresh_data_requests / total_requests

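# Example: Data updated within 1 minute
# `bool` makes the comparison return 0 or 1 per series instead of filtering,
# so sum() counts the fresh series rather than adding up their ages
sum(data_age_seconds < bool 60) / count(data_age_seconds)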

Choosing SLO Targets

The Nines Table

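| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|--------------|---------------|----------------|---------------|
| 99% | 3.65 days | 7.31 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.65 hours | 50.4 min |
| 99.9% | 8.77 hours | 43.8 min | 10.1 min |
| 99.95% | 4.38 hours | 21.9 min | 5.04 min |
| 99.99% | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec |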
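
The table values can be reproduced with a few lines of arithmetic. This sketch assumes an average year of 365.25 days and a month of one twelfth of that, which is what the figures above appear to use; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days
MINUTES_PER_WEEK = 7 * 24 * 60


def allowed_downtime_minutes(availability_pct: float) -> dict[str, float]:
    """Allowed downtime in minutes per year, month, and week."""
    unavailable = 1 - availability_pct / 100
    per_year = unavailable * MINUTES_PER_YEAR
    return {
        "year": per_year,
        "month": per_year / 12,
        "week": unavailable * MINUTES_PER_WEEK,
    }


for pct in (99, 99.9, 99.99):
    d = allowed_downtime_minutes(pct)
    print(f"{pct}%: {d['year']:.1f} min/year, "
          f"{d['month']:.1f} min/month, {d['week']:.2f} min/week")
```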

Guidelines for Setting SLOs

1. Start with user expectations
   - What do users actually need?
   - What are they getting today?

2. Consider dependencies
   - Your SLO can't exceed that of the dependencies on your critical path
   - If the database offers 99.9%, you can't promise 99.99% without redundancy or graceful degradation

3. Start conservative, tighten later
   - It is easier to tighten an SLO than to loosen one
   - Build confidence before committing

4. Different SLOs for different tiers
   - Premium customers: 99.99%
   - Free tier: 99.5%
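
Guideline 2 can be made concrete with a rough calculation. The sketch below assumes every dependency is required for a request to succeed and that failures are independent, so availabilities multiply; real systems with redundancy, caching, or correlated failures will behave differently. The function name and figures are illustrative.

```python
from math import prod


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability when every dependency is required."""
    return own * prod(dependencies)


# Hypothetical service: 99.99% on its own, behind a 99.9% database
# and a 99.95% authentication service.
ceiling = serial_availability(0.9999, [0.999, 0.9995])
print(f"best achievable availability: {ceiling:.3%}")
```

Under these assumptions the result is lower than the weakest dependency, which is why a 99.99% target is out of reach behind a 99.9% database.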

Error Budget

Calculating Error Budget

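Error Budget = (1 - SLO) × Time Period

Example for 99.9% SLO:
- Monthly budget (30-day month) = 0.1% × 43,200 minutes = 43.2 minutes
- Weekly budget = 0.1% × 10,080 minutes = 10.08 minutes

The Nines Table lists 43.8 min/month because it uses an average month (365.25 days / 12 ≈ 30.44 days).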
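
The formula translates directly into code. In this minimal sketch the function names and the 33 minutes of consumed downtime are hypothetical; the budget is time-based, as in the example above.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime in minutes for the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget(slo: float, window_days: int,
                     consumed_minutes: float) -> tuple[float, float]:
    """Return (minutes remaining, fraction of budget remaining)."""
    budget = error_budget_minutes(slo, window_days)
    remaining = budget - consumed_minutes
    return remaining, remaining / budget


print(f"30-day budget: {error_budget_minutes(0.999, 30):.1f} min")
print(f"7-day budget:  {error_budget_minutes(0.999, 7):.2f} min")

# Hypothetical month with 33 minutes of downtime so far.
minutes_left, fraction_left = remaining_budget(0.999, 30, consumed_minutes=33.0)
print(f"remaining: {minutes_left:.1f} min ({fraction_left:.0%} of budget)")
```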

Error Budget Policies

# Example Error Budget Policy
error_budget_policy:
  healthy: # >50% budget remaining
    - Continue feature development
    - Normal deployment velocity

  caution: # 25-50% budget remaining
    - Reduce deployment frequency
    - Prioritize reliability work
    - Review recent incidents

  critical: # <25% budget remaining
    - Freeze non-critical deployments
    - All hands on reliability
    - Daily error budget review

  exhausted: # 0% budget remaining
    - Emergency-only deployments
    - Postmortem all incidents
    - Leadership escalation
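
A policy like this is easier to enforce when the status is computed rather than judged by eye. The sketch below maps the fraction of budget remaining to the four states; how the exact boundaries (25% and 50%) are assigned is an assumption, since the policy comments leave them open.

```python
def budget_status(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to a policy state."""
    if remaining_fraction <= 0:
        return "exhausted"
    if remaining_fraction < 0.25:
        return "critical"
    if remaining_fraction <= 0.50:
        return "caution"  # assumption: exactly 25% and 50% count as caution
    return "healthy"


for fraction in (0.80, 0.40, 0.24, 0.0):
    print(f"{fraction:.0%} remaining -> {budget_status(fraction)}")
```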

Error Budget Visualization

Error Budget: January 2026

SLO: 99.9% availability
Budget: 43.2 minutes

Consumption:
Week 1: ████░░░░░░░░░░░░░░░░  8 min (INC-121)
Week 2: ██░░░░░░░░░░░░░░░░░░  3 min
Week 3: ████████░░░░░░░░░░░░ 15 min (INC-125, INC-126)
Week 4: ████░░░░░░░░░░░░░░░░  7 min (INC-128)
        ────────────────────
Total:  ████████████████░░░░ 33 min consumed

Remaining: 10.2 min (24% of budget)
Status: 🔴 CRITICAL (<25% of budget remaining)

SLO Implementation

Step 1: Define SLIs

# slo-config.yaml
slis:
  - name: availability
    description: Proportion of successful HTTP requests
    query: |
      sum(rate(http_requests_total{status!~"5.."}[{{window}}]))
      /
      sum(rate(http_requests_total[{{window}}]))

  - name: latency_p99
    description: Proportion of HTTP requests served in under 200ms (a 0.99 target is equivalent to p99 < 200ms)
    query: |
      sum(rate(http_request_duration_seconds_bucket{le="0.2"}[{{window}}]))
      /
      sum(rate(http_request_duration_seconds_count[{{window}}]))

Step 2: Set SLO Targets

slos:
  - name: api-availability
    sli: availability
    target: 0.999  # 99.9%
    window: 30d

  - name: api-latency
    sli: latency_p99
    target: 0.99   # 99% of requests under 200ms
    window: 30d

Step 3: Configure Alerts

# Alert at different burn rates
alerts:
  - name: SLOBurnRateCritical
    slo: api-availability
    burn_rate: 14.4  # Exhausts monthly budget in 2 days
    window: 1h
    severity: critical

  - name: SLOBurnRateWarning
    slo: api-availability
    burn_rate: 6     # Exhausts monthly budget in 5 days
    window: 6h
    severity: warning
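
Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below works through the numbers in the alert comments; the function names are illustrative, and a 30-day window is assumed.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    return window_days / rate


def error_rate_threshold(rate: float, slo: float) -> float:
    """Error rate that corresponds to a given burn rate."""
    return rate * (1 - slo)


def budget_consumed_fraction(rate: float, alert_window_hours: float,
                             window_days: float = 30) -> float:
    """Share of the budget spent during one alert window at this burn rate."""
    return rate * alert_window_hours / (window_days * 24)


for rate, hours in ((14.4, 1), (6, 6)):
    print(f"burn rate {rate}: budget gone in {days_to_exhaustion(rate):.1f} days, "
          f"fires above {error_rate_threshold(rate, 0.999):.2%} errors, "
          f"{budget_consumed_fraction(rate, hours):.0%} of budget per {hours}h window")
```

Under these assumptions, the critical alert fires once 2% of the monthly budget is spent within one hour, and the warning alert once 5% is spent within six hours.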

SLO Review Process

Weekly Review

## SLO Weekly Review - Week 4, January 2026

### Summary
| SLO | Target | Actual | Status |
|-----|--------|--------|--------|
| Availability | 99.9% | 99.92% | 🟡 |
| Latency P99 | <200ms | 187ms | 🟢 |
| Error Rate | <0.1% | 0.08% | 🟢 |

### Error Budget
- Consumed this week: 7 minutes
- Remaining this month: 10.2 minutes (24%)
- Projected end-of-month: 5 minutes (12%)

### Incidents
- INC-128: 7 min downtime (database failover)

### Actions
- [ ] Review INC-128 postmortem action items
- [ ] Freeze non-critical deploys (budget below 25%, per the error budget policy)

Quarterly Review

## SLO Quarterly Review - Q1 2026

### SLO Performance
| SLO | Target | Q1 Actual | Trend |
|-----|--------|-----------|-------|
| Availability | 99.9% | 99.95% | ↗️ |
| Latency | <200ms | 178ms | ↗️ |

### Error Budget Utilization
- January: 76% consumed
- February: 45% consumed
- March: 23% consumed
- Average: 48% consumed ✓

### Recommendations
1. Consider tightening availability SLO to 99.95%
2. Add latency SLO for P50 (currently unmeasured)
3. Review alerting thresholds based on budget consumption

Best Practices

DO

  • Base SLOs on user experience, not internal metrics
  • Start with fewer SLOs and add as needed
  • Review and adjust SLOs quarterly
  • Use error budgets to balance reliability and velocity
  • Document SLO decisions and rationale

DON'T

  • Set SLOs higher than dependencies allow
  • Create SLOs for every metric
  • Ignore error budget policies
  • Set SLOs without stakeholder buy-in
  • Treat SLOs as unchangeable

SLO Maturity Model

| Level | Characteristics |
|-------|-----------------|
| L1: Ad-hoc | No formal SLOs, react to incidents |
| L2: Defined | SLOs documented, basic monitoring |
| L3: Measured | SLIs tracked, dashboards exist |
| L4: Managed | Error budgets enforced, policies in place |
| L5: Optimized | SLOs drive prioritization, continuous improvement |
First Seen
Feb 5, 2026