database-reliability
Database Reliability Engineering
Use When
- Database reliability engineering: SLOs for databases, operational runbooks, change management, capacity planning, backup verification, incident response, and monitoring strategies for production MySQL. Use when setting up production database...
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to `database-reliability` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references/` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Operability | Database SLO record | Markdown doc per skill-composition-standards/references/slo-template.md covering query latency, availability, and replication-lag SLOs | docs/data/db-slo.md |
| Operability | Backup verification log | Markdown doc tracking restore-test runs and recovery-time measurements | docs/data/backup-verify-log.md |
References
- Use the `references/` directory for deep detail after reading the core workflow below.
Core Philosophy
Databases are not special snowflakes. Treat every node as cattle: replaceable, automated, and monitored. The DBRE role is engineering, not firefighting. Toil (manual, repetitive, automatable work that scales linearly with growth) is the enemy — eliminate it.
1. Database SLOs
Availability Targets
| Tier | SLO | Downtime/Year | Downtime/Week |
|---|---|---|---|
| Standard | 99.9% | 8.7 hours | 10.08 minutes |
| High | 99.95% | 4.4 hours | 5.04 minutes |
| Critical | 99.99% | 52 minutes | 1 minute |
Sample SLO: 99.9% availability averaged over one week; no single incident >10.08 minutes; downtime declared when >5% of users affected; one annual 4-hour maintenance window (2 weeks notice, <10% users).
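The downtime columns in the table above follow directly from the SLO percentage. A minimal sketch (the helper name is illustrative, not part of this skill) converts an availability target into its downtime budget:

```python
def downtime_budget(slo_pct: float) -> dict:
    """Convert an availability SLO (e.g. 99.9) into allowed downtime."""
    error_fraction = 1 - slo_pct / 100
    return {
        # 365-day year and 7-day week, matching the table above
        "per_year_hours": round(error_fraction * 365 * 24, 2),
        "per_week_minutes": round(error_fraction * 7 * 24 * 60, 2),
    }

print(downtime_budget(99.9))  # {'per_year_hours': 8.76, 'per_week_minutes': 10.08}
```

Running it for 99.95 and 99.99 reproduces the High and Critical rows as well.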
Latency SLOs
Never use averages — they hide tail latency. Measure percentiles over 1-minute windows and require the targets below to hold for 99% of requests.
- p50 < 5ms (primary-key lookups)
- p95 < 50ms (indexed queries)
- p99 < 200ms (joins/aggregations)
- max 500ms (circuit-breaker threshold)
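Percentiles can be computed from raw latency samples with the standard library alone; a minimal sketch (function name is illustrative):

```python
import statistics

def latency_percentiles(samples_ms):
    """Percentile summary of raw per-query latencies in milliseconds."""
    # statistics.quantiles with n=100 returns 99 cut points: q[i] ~ p(i+1)
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98], "max": max(samples_ms)}

# synthetic samples: 1..100 ms
print(latency_percentiles(list(range(1, 101))))
```

In production you would feed this a rolling 1-minute window of samples rather than a static list.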
Replication and Connection SLOs
Replication lag: <5s normal | alert 10s | critical 30s
Connections: <80% of max_connections (alert), >20% headroom required
Slow query rate: <1% of total queries
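The thresholds above can be encoded as small alert-level helpers; a sketch (names and return values are illustrative, not a required interface):

```python
def replication_lag_level(lag_s: float) -> str:
    """Map replication lag to the skill's <5s / 10s / 30s thresholds."""
    if lag_s >= 30:
        return "critical"
    if lag_s >= 10:
        return "alert"
    return "ok"

def connection_level(active: int, max_connections: int) -> str:
    # alert above 80% utilisation, i.e. below the required 20% headroom
    return "alert" if active / max_connections >= 0.80 else "ok"

print(replication_lag_level(12))      # alert
print(connection_level(130, 151))     # alert (~86% utilised)
```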
Error Budget
Track weekly. 30% consumed by Tuesday → create ticket. 70% consumed with 3+ days left → freeze non-critical deployments. 99.9% SLO = 604.8 seconds budget per week.
SELECT ROUND(100 * SUM(duration_seconds) / 604.8, 1) AS budget_pct_used
FROM downtime_log
WHERE week_start = CURDATE() - INTERVAL WEEKDAY(CURDATE()) DAY;
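The weekly policy can be expressed as a decision helper. This is a simplified sketch of the rules stated above (the "by Tuesday" condition is approximated by days remaining in the week):

```python
def budget_policy(pct_used: float, days_left_in_week: int) -> str:
    """Weekly error-budget policy: freeze at 70% with time left, ticket at 30%."""
    if pct_used >= 70 and days_left_in_week >= 3:
        return "freeze non-critical deployments"
    if pct_used >= 30:
        return "create ticket"
    return "within budget"

print(budget_policy(75, 4))  # freeze non-critical deployments
print(budget_policy(35, 5))  # create ticket
```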
Platform SRE for Databases
Database SLIs
Concrete SLIs a production DB must track:
- Query P95 latency (ms) — `histogram_quantile(0.95, sum(rate(mysql_query_duration_seconds_bucket[5m])) by (le))` for MySQL; `pg_stat_statements` for PostgreSQL
- Replication lag (seconds) — MySQL `Seconds_Behind_Master`, PostgreSQL `pg_last_wal_replay_lsn` vs primary `pg_current_wal_lsn`
- Connection pool utilisation — `sum(pool_active_connections) / sum(pool_max_connections)` per service
- Deadlock rate — MySQL `SHOW ENGINE INNODB STATUS` deadlock counter, PostgreSQL `deadlocks` in `pg_stat_database`
- Error rate — query errors per minute, grouped by error class
Database SLOs
Example SLO targets for a production SaaS DB:
- Query P95 ≤ 100ms for 99.5% of requests over 28 days
- Availability ≥ 99.95% (downtime ≤ 20.16 minutes per 28d)
- Replication lag P95 ≤ 5 seconds over 28d
- Error budget: `(1 - 0.995) × 28d = 3h 22min` of query-latency-breach time allowed per 28d
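The budget arithmetic checks out; a quick sketch reproduces the 3h 22min figure:

```python
budget_hours = (1 - 0.995) * 28 * 24           # 0.5% of 28 days = 3.36 hours
hours = int(budget_hours)                      # 3
minutes = round((budget_hours - hours) * 60)   # 0.36 h = 21.6 min -> 22
print(f"{hours}h {minutes}min")                # 3h 22min
```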
Error Budget Tracking
How to track against the budget:
- Replication lag budget: allow max 1 hour of lag > 30 seconds per month. Alert at 50% burn.
- Query latency budget: alert at 2% budget burn in 1 hour (fast burn) or 10% in 6 hours (slow burn)
- Publish a weekly SLO report (Grafana snapshot) to the engineering channel
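The fast/slow thresholds above correspond to multiwindow burn rates. A sketch of the conversion, assuming the 28-day period used elsewhere in this skill:

```python
def burn_rate(budget_fraction_consumed: float, window_hours: float,
              period_hours: float = 28 * 24) -> float:
    """Burn rate 1.0 means consuming exactly the full budget over the period."""
    return budget_fraction_consumed * period_hours / window_hours

print(round(burn_rate(0.02, 1), 2))   # fast burn: 13.44x the sustainable rate
print(round(burn_rate(0.10, 6), 2))   # slow burn: 11.2x the sustainable rate
```

Both alerts fire at roughly a 11–13x burn rate, which is in line with common multiwindow burn-rate alerting practice.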
Sample SLO definition YAML (Sloth-compatible):
version: "prometheus/v1"
service: "mysql-primary"
slos:
  - name: "query-latency-p95"
    objective: 99.5
    description: "P95 query latency under 100ms"
    sli:
      events:
        # Sloth expects error events, not good events: count queries that
        # did NOT complete within 100ms (total minus the le="0.1" bucket).
        error_query: "sum(rate(mysql_query_duration_seconds_count[5m])) - sum(rate(mysql_query_duration_seconds_bucket{le='0.1'}[5m]))"
        total_query: "sum(rate(mysql_query_duration_seconds_count[5m]))"
    alerting:
      name: MySQLQueryLatencyP95
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
Chaos Engineering for Databases
Controlled drills to run quarterly in staging (never prod without a rollback plan):
- Primary failover drill — promote replica, measure RTO, verify application auto-reconnect
- Connection pool exhaustion — open connections until pool is saturated, confirm circuit breaker trips
- Replica lag injection — `SELECT pg_wal_replay_pause()` on a PostgreSQL replica (resume with `pg_wal_replay_resume()`) to create real replay lag; confirm read traffic falls back to the primary
- Disk full simulation — fill test volume to 95%, confirm alerts fire and read-only mode kicks in
- Query killer test — run a runaway `SELECT SLEEP(300)` and confirm the query killer terminates it after 30s
Record every drill in a runbook with observed RTO/RPO.
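The query-killer drill assumes a killer that scans the process list. Its selection logic might look like this pure-logic sketch — field names mirror MySQL `SHOW PROCESSLIST` columns, but the helper itself is hypothetical:

```python
def queries_to_kill(processlist, max_runtime_s=30):
    """Pick connection IDs whose queries have exceeded the runtime limit."""
    return [row["id"] for row in processlist
            if row["command"] == "Query"
            and row["time"] > max_runtime_s
            and row["info"]]  # skip rows with no statement text

procs = [
    {"id": 7, "command": "Query", "time": 301, "info": "SELECT SLEEP(300)"},
    {"id": 8, "command": "Sleep", "time": 900, "info": None},  # idle, not a query
    {"id": 9, "command": "Query", "time": 2,   "info": "SELECT 1"},
]
print(queries_to_kill(procs))  # [7]
```

A real killer would then issue `KILL QUERY <id>` for each returned ID and log the action.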
Blameless Database Postmortem Template
# Postmortem: <incident title>
- Date: YYYY-MM-DD
- Severity: SEV1/2/3/4
- Duration: HH:MM (start → resolved)
- Services affected:
- Authors:
## Timeline (UTC)
- HH:MM first alert fired
- HH:MM on-call acknowledged
- HH:MM root cause identified
- HH:MM mitigation applied
- HH:MM service restored
## Root Cause (5-Whys)
1. Why did service degrade? Query latency jumped
2. Why did query latency jump? Missing index on new column
3. Why was the index missing? Migration added column without CREATE INDEX
4. Why did migration ship without the index? Review did not catch it
5. Why did review not catch it? No migration-review checklist in PR template
## Action Items
- [ ] Add migration-review checklist to PR template — owner, JIRA-XXXX, due YYYY-MM-DD
- [ ] Run pt-online-schema-change to add the missing index — owner, JIRA-XXXX, due YYYY-MM-DD
## What Went Well / What Went Poorly
- Went well: alert fired within 2 min; runbook was up to date
- Went poorly: no schema-change CI gate; post-deploy smoke test did not exercise the new query
Toil Identification and Automation
Tasks that must be automated away (anything repeated > 2 times per month):
- Backup verification — cron job that restores yesterday's backup to a scratch instance and runs a checksum query
- Index maintenance — `pt-online-schema-change` scheduled via Ansible for large-table DDL
- Stale connection cleanup — `pg_terminate_backend` on connections idle > 1 hour with open transactions
- Log rotation and retention — `logrotate` configured via Ansible role
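Backup-verification automation needs to know which backups are overdue for a restore test. A small sketch (names are illustrative; `now` is passed explicitly so the check is deterministic and testable):

```python
from datetime import datetime, timedelta

def overdue_restore_tests(last_verified, now, max_age_days=7):
    """Return backup names whose last successful restore test is too old."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, ts in last_verified.items() if ts < cutoff)

now = datetime(2024, 6, 15)
log = {
    "orders-db": datetime(2024, 6, 14),  # verified yesterday
    "users-db": datetime(2024, 6, 1),    # two weeks stale
}
print(overdue_restore_tests(log, now))  # ['users-db']
```

A cron job would feed this from the backup-verify log and restore each overdue backup to a scratch instance.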
Runbook Structure
Every DB runbook must follow this structure:
# Runbook: <alert name>
- Severity: SEV1/2/3
- Trigger: <PromQL or alert condition>
- SLA to acknowledge: < 5 min for SEV1
## Investigation
1. <command to run first>
2. <metric or log to check>
3. <decision tree>
## Resolution
1. <primary mitigation>
2. <rollback steps if mitigation fails>
## Escalation
- Primary on-call → Secondary → DBRE lead → VP Engineering
- External: DB vendor support (ticket URL)
## Related
- Grafana dashboard URL
- Recent postmortem links
- Vendor docs
Additional Guidance
Extended guidance for database-reliability was moved to references/skill-deep-dive.md to keep this entrypoint compact and fast to load.
Use that deep dive for:
2. Toil Reduction — What to Automate
3. Change Management for Databases
4. Backup Verification Runbook
5. Monitoring Pyramid
6. Alert Fatigue Prevention
7. Capacity Planning
8. Incident Response Runbook
9. Connection Exhaustion Response
10. Replication Failure Recovery
11. Security Incident Response
12. Planned Maintenance Checklist
13. Chaos Engineering for Databases

Additional deep-dive sections continue in the reference file.