SRE Operations — Site Reliability Engineering & Operational Excellence Specialist
Role
The SRE Operations specialist fuses Site Reliability Engineering, ITIL 4, Six Sigma quality disciplines, and Global Delivery Framework principles into a unified operational excellence program. This skill ensures security is embedded into reliability engineering, not bolted on afterward.
Phase 1 — SRE Security Integration Framework
SRE Error Budget with Security Dimension:
Traditional SRE:
Error Budget = 100% - SLO (e.g., 99.9% SLO → 0.1% error budget ≈ 43.8 min per average month, or 43.2 min per 30-day month)
Security-Extended SRE:
Error Budget dimensions:
1. Availability budget (uptime)
2. Security incident budget (time spent in security incident per month)
3. Vulnerability debt budget (ratio of unpatched critical CVEs to total critical CVEs identified)
4. Compliance drift budget (% time in non-compliant state)
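The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `BudgetDimension` class, the 30-day default window, and every sample number are assumptions made for the example.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: float = 30.0) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (100.0 - slo_percent) / 100.0 * window_days * MINUTES_PER_DAY


@dataclass
class BudgetDimension:
    name: str
    budget: float    # allowed amount for the window, in the dimension's own unit
    consumed: float  # amount used so far, same unit

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; a negative value means overspent."""
        return 1.0 - self.consumed / self.budget


# Illustrative numbers only; none of these come from a real service.
dimensions = [
    BudgetDimension("availability (min)", error_budget_minutes(99.9), 12.0),
    BudgetDimension("security incident time (min)", 60.0, 75.0),
    BudgetDimension("vulnerability debt (unpatched critical CVE ratio)", 0.05, 0.02),
    BudgetDimension("compliance drift (% time non-compliant)", 1.0, 0.4),
]

for dim in dimensions:
    status = "OVERSPENT" if dim.remaining_fraction < 0 else "ok"
    print(f"{dim.name}: {dim.remaining_fraction:.0%} remaining ({status})")
```

Treating each dimension as its own budget keeps an overspend in one area (here, the hypothetical security incident time) visible even when availability looks healthy.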
Security SLOs (Service Level Objectives):
- Patch deployment time for critical CVEs: SLO = 100% within 24h
- MFA enforcement: SLO = 100% of privileged users at all times
- Certificate validity: SLO = 0 expired certs in production at any time
- Backup restore success: SLO = 100% successful restore test monthly
- Incident detection (MTTD): SLO = 95% of P1 incidents detected within 1h
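Time-bounded SLOs such as the patch and MTTD targets above reduce to one question: what fraction of events finished within the limit? The sketch below shows one way to evaluate that. The function names and sample durations are hypothetical, and treating an empty period as compliant is an assumption that a real program may want to revisit.

```python
from datetime import timedelta


def fraction_within(durations: list[timedelta], limit: timedelta) -> float:
    """Fraction of events that completed at or under the limit."""
    if not durations:
        return 1.0  # assumption: a period with no events counts as compliant
    return sum(d <= limit for d in durations) / len(durations)


def slo_met(durations: list[timedelta], limit: timedelta, target: float) -> bool:
    """True when the observed fraction reaches the SLO target (0.0 to 1.0)."""
    return fraction_within(durations, limit) >= target


# Hypothetical sample data for one month.
p1_detection_times = [timedelta(minutes=m) for m in (20, 45, 90, 10)]
critical_patch_times = [timedelta(hours=h) for h in (6, 23)]

mttd_ok = slo_met(p1_detection_times, timedelta(hours=1), 0.95)
patch_ok = slo_met(critical_patch_times, timedelta(hours=24), 1.0)
print(f"MTTD SLO met: {mttd_ok}; patch SLO met: {patch_ok}")
```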
Reliability vs. Security trade-off framework:
When reliability and security conflict (e.g., emergency patch requires downtime):
1. Assess security risk (CVSS score, active exploitation in wild)
2. Assess reliability impact (planned outage duration, affected users)
3. Apply risk matrix:
- Critical CVE + active exploitation → security wins; schedule emergency maintenance
- High-severity CVE + no known exploitation → negotiate a maintenance window within 7 days
- Medium/Low-severity CVE → schedule in the next regular maintenance window
4. Document decision in change management record
5. Inform CISO and product owner of decision and rationale
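The risk matrix in step 3 can be encoded as a small decision function so that the outcome is consistent and auditable. This is a sketch under stated assumptions: the matrix does not say what to do with a critical CVE that is not being exploited, or a high-severity CVE that is, so the example escalates those cases to a human rather than guessing.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def patch_decision(severity: Severity, actively_exploited: bool) -> str:
    """Map a CVE to a scheduling decision following the matrix above."""
    if severity is Severity.CRITICAL and actively_exploited:
        return "emergency maintenance"
    if severity is Severity.HIGH and not actively_exploited:
        return "maintenance window within 7 days"
    if severity in (Severity.MEDIUM, Severity.LOW):
        return "next regular maintenance window"
    # Assumption: the matrix does not cover critical-but-unexploited or
    # high-and-exploited CVEs, so this sketch escalates them to a human.
    return "escalate for manual risk review"


print(patch_decision(Severity.CRITICAL, actively_exploited=True))
print(patch_decision(Severity.HIGH, actively_exploited=True))
```

The returned string would typically be copied into the change management record described in step 4.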
Phase 2 — ITIL 4 Service Management Integration
ITIL 4 Service Value Chain with security embedding:
Plan: Security requirements in service strategy; risk assessment
Improve: Security metrics in the continual improvement register (CIR)
Engage: Security SLAs in customer agreements; vendor security requirements
Design and transition: Security architecture review in service design; threat modeling
Obtain/build: Vendor security assessment before software/service procurement
Deliver and support: Security gates in release pipeline; CAB security sign-off; security incident integration with service desk; problem management
Key ITIL 4 practices with security controls:
| ITIL Practice | Security Integration |
|---|---|
| Incident Management | Security incidents follow IR playbook; severity aligned to P1-P4 |
| Problem Management | Security root cause analysis; known error database includes security issues |
| Change Enablement (formerly Change Management) | CAB includes security reviewer; emergency changes require ECAB |
| Release Management | Security sign-off gate before production release |
| Configuration Management | CMDB includes security attributes (patch level, encryption status, owner) |
| Service Level Management | Security SLOs in SLA; breach triggers executive notification |
| Availability Management | DR/BCP tested with security scenarios |
| Capacity Management | Security tooling capacity included in planning |
| Supplier Management | Vendor security assessment in procurement process |
| Knowledge Management | Security runbooks in ITSM knowledge base |
Phase 3 — Six Sigma for Security Quality
DMAIC applied to security processes:
Define — Problem Statement:
Security defect = any security control that fails to perform its intended function
Examples:
- MFA bypass due to misconfiguration
- Unencrypted backup discovered
- Privileged account not deprovisioned after termination
- Certificate expired in production
Project Charter must include:
- Problem statement (measurable)
- Business case (cost of failure: breach cost, regulatory fine, reputational)
- Goal: target defect rate (e.g., "reduce critical patch SLA breach from 15% to 0%")
- Scope: systems, processes, teams in scope
- Timeline and team
Measure — Baseline Security Metrics:
Defect rates to measure:
- Critical vulnerability patch SLA breach rate: [X%] (target: 0%)
- Access review completion rate: [X%] (target: 100%)
- Phishing simulation click rate: [X%] (target: <3%)
- Security training completion: [X%] (target: 100%)
- Change-induced security incidents: [N/month] (target: 0)
- Mean Time to Detect: [Xh] (target: <1h)
- Mean Time to Respond: [Xh] (target: <4h)
Measurement system analysis:
- Confirm data is accurate and consistent
- Define operational definitions for each metric
- Establish data collection cadence and ownership
Analyze — Root Cause:
Tools:
- Fishbone (Ishikawa) diagram: people, process, technology, environment
- 5-Whys analysis for specific defects
- Pareto chart: rank causes by frequency to find the few (often about 20%) that produce most (about 80%) of defects
- Control charts: identify special vs. common cause variation
- FMEA: Failure Mode and Effects Analysis for security processes
Common root causes in security:
- Process: no defined process; process exists but not followed
- People: lack of training; no accountability; unclear ownership
- Technology: tool misconfiguration; integration gaps; outdated tooling
- Environment: complexity; legacy systems; rapid change pace
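A Pareto analysis of recorded defect causes needs little more than a frequency count. The sketch below ranks causes and stops once the cumulative share reaches a threshold. The cause labels and the 0.8 default are illustrative assumptions, not data from any real program.

```python
from collections import Counter


def pareto(causes: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Most frequent causes, in order, until the cumulative share reaches threshold.

    Each entry is (cause, count, cumulative_share).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    vital_few = []
    cumulative = 0
    for cause, n in counts.most_common():
        cumulative += n
        vital_few.append((cause, n, cumulative / total))
        if cumulative / total >= threshold:
            break
    return vital_few


# Hypothetical root causes recorded for ten security defects.
recorded = (
    ["tool misconfiguration"] * 6
    + ["unclear ownership"] * 2
    + ["legacy system"]
    + ["lack of training"]
)

for cause, count, share in pareto(recorded):
    print(f"{cause}: {count} defects, cumulative {share:.0%}")
```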
Improve — Solutions:
Security process improvement examples:
- Automate patch deployment (reduce human error)
- Implement automated access reviews (reduce manual effort)
- Integrate security training into onboarding (ensure coverage)
- Add pre-commit hooks for secrets detection (shift-left)
- Implement policy-as-code for compliance (eliminate configuration drift)
Pilot → Measure → Full deployment
Always validate improvement with data before full rollout.
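As an illustration of the shift-left item above, a secrets check can be as simple as scanning staged text against a few regular expressions. The three rules below are deliberately minimal and are assumptions for the example; a production hook would normally rely on a maintained scanner with a much larger rule set. The sample key is the placeholder value used in AWS documentation.

```python
import re

# Illustrative rules only; a real hook would use a maintained scanner.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded-password": re.compile(
        r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every rule match, 1-indexed."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


sample = (
    'region = "eu-west-1"\n'
    'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    'password = "hunter2"\n'
)
for line_number, rule_name in scan_text(sample):
    print(f"line {line_number}: possible secret ({rule_name})")
```

Wired into a pre-commit hook, a non-empty result from `scan_text` would cause the hook to exit with a non-zero status and block the commit.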
Control — Sustain Improvements:
Control plan elements:
- Process owner: named individual accountable for maintaining improvement
- Control chart: ongoing measurement of key metric
- Corrective action: defined response if metric exceeds control limits
- Audit schedule: periodic verification the process is followed
- Documentation: updated runbooks, policies, and procedures
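For a metric expressed as a defect proportion (for example, the share of critical patches that missed their SLA each month), the control chart element can be sketched as a p-chart with three-sigma limits. The function names and sample counts are hypothetical. Note that this simplified version computes the center line from all periods, whereas common practice is to derive limits from a stable baseline period.

```python
import math


def p_chart_limits(
    defects: list[int], sample_sizes: list[int]
) -> tuple[float, list[tuple[float, float]]]:
    """Center line and per-period (lower, upper) three-sigma limits."""
    p_bar = sum(defects) / sum(sample_sizes)
    limits = []
    for n in sample_sizes:
        margin = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - margin), min(1.0, p_bar + margin)))
    return p_bar, limits


def out_of_control(defects: list[int], sample_sizes: list[int]) -> list[int]:
    """Indices of periods whose defect proportion falls outside its limits."""
    _, limits = p_chart_limits(defects, sample_sizes)
    return [
        i
        for i, (d, n, (lower, upper)) in enumerate(zip(defects, sample_sizes, limits))
        if not lower <= d / n <= upper
    ]


# Hypothetical monthly data: SLA breaches out of 100 critical patches each month.
breaches = [5, 4, 6, 5, 20]
patches = [100, 100, 100, 100, 100]
print("periods needing corrective action:", out_of_control(breaches, patches))
```

A flagged period signals special-cause variation and should trigger the corrective action defined in the control plan.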
Phase 4 — Global Delivery Framework (GDF)
Security requirements for globally distributed teams:
Follow-the-Sun Security Coverage:
- SOC coverage: 24×7 via geographic rotation (Americas/EMEA/APAC)
- Incident response: on-call rotation covers all time zones; <15 min response SLA
- Security approvals: defined approval authority at each geography for time-sensitive decisions
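A follow-the-sun rotation can be checked mechanically: every UTC hour should map to exactly one region. The shift boundaries below are hypothetical and would differ per organization; the sketch only shows the lookup and the gap check.

```python
from datetime import datetime, timezone

# Hypothetical shift boundaries in UTC hours: [start, end).
SHIFTS = {
    "APAC": (0, 8),
    "EMEA": (8, 16),
    "Americas": (16, 24),
}


def on_call_region(moment: datetime) -> str:
    """Region whose shift covers the given timezone-aware moment."""
    hour = moment.astimezone(timezone.utc).hour
    for region, (start, end) in SHIFTS.items():
        if start <= hour < end:
            return region
    raise ValueError(f"no region covers UTC hour {hour}")


def coverage_gaps() -> list[int]:
    """UTC hours that are not covered by exactly one region."""
    return [
        hour
        for hour in range(24)
        if sum(start <= hour < end for start, end in SHIFTS.values()) != 1
    ]


incident_time = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
print("route incident to:", on_call_region(incident_time))
print("coverage gaps or overlaps at UTC hours:", coverage_gaps())
```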
Data Residency & Sovereignty:
- Identify data residency and transfer requirements per jurisdiction (GDPR: EU/EEA; PIPL: China; DPDP Act: India)
- Enforce data residency via cloud region controls so that regulated data never leaves its mandated geography
- Cross-border data transfers: legal basis documented (SCCs, BCRs, adequacy decision)
- Data localization compliance mapped to each delivery location
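A residency audit can be sketched as a comparison between where each dataset is stored and the regions permitted for its jurisdiction. The mapping and the dataset inventory below are hypothetical; the actual set of permitted regions is a legal determination, not something code can decide.

```python
# Hypothetical mapping; real requirements need legal review per jurisdiction.
ALLOWED_REGIONS = {
    "EU": {"eu-west-1", "eu-central-1"},
    "China": {"cn-north-1"},
    "India": {"ap-south-1"},
}


def residency_violations(datasets: list[dict]) -> list[str]:
    """Describe every dataset stored outside the regions allowed for it."""
    violations = []
    for ds in datasets:
        allowed = ALLOWED_REGIONS.get(ds["jurisdiction"])
        if allowed is None:
            violations.append(
                f"{ds['name']}: unknown jurisdiction {ds['jurisdiction']}"
            )
        elif ds["region"] not in allowed:
            violations.append(
                f"{ds['name']}: stored in {ds['region']}, allowed {sorted(allowed)}"
            )
    return violations


# Hypothetical inventory, e.g. exported from a CMDB or cloud asset listing.
inventory = [
    {"name": "eu-customer-db", "jurisdiction": "EU", "region": "eu-west-1"},
    {"name": "in-payroll", "jurisdiction": "India", "region": "us-east-1"},
]
for problem in residency_violations(inventory):
    print(problem)
```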
Access Control by Geography:
- Principle of least privilege applied at geographic level
- Access to sensitive systems requires VPN + MFA + location-based conditional access
- Offshore access to crown jewel systems: requires documented business justification + CISO approval
- No export-controlled data (ITAR/EAR) accessible to restricted-country teams without license
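The access rules above combine into a short policy check. The sketch below is a simplified model with hypothetical field names. It covers the VPN, MFA, location, and crown jewel rules only, and deliberately leaves export control (ITAR/EAR) out because that decision depends on licensing facts this example cannot represent.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    on_vpn: bool
    mfa_passed: bool
    location_allowed: bool  # result of location-based conditional access
    crown_jewel: bool = False  # target is a crown jewel system
    offshore: bool = False
    business_justification: bool = False
    ciso_approved: bool = False


def evaluate(request: AccessRequest) -> tuple[bool, str]:
    """Return (allowed, reason) for a sensitive-system access request."""
    if not (request.on_vpn and request.mfa_passed and request.location_allowed):
        return False, "requires VPN, MFA, and an allowed location"
    if request.crown_jewel and request.offshore:
        if not (request.business_justification and request.ciso_approved):
            return False, "offshore crown jewel access needs justification and CISO approval"
    return True, "allowed"


request = AccessRequest(
    on_vpn=True, mfa_passed=True, location_allowed=True,
    crown_jewel=True, offshore=True, business_justification=True,
)
print(evaluate(request))
```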
Vendor & Third-Party Security (in global delivery context):
- All delivery partners assessed via standardized security questionnaire (SIG Lite minimum)
- Annual reassessment; immediate reassessment on security incident at vendor
- Right-to-audit clauses in all vendor contracts
- Shared responsibility matrix defined for all vendor relationships
- SOC 2 Type II or equivalent required for critical vendors
Global Security Training Delivery:
- Security training localized by language and cultural context
- Phishing simulations run across all geographies
- Regional compliance modules (GDPR for EMEA; PIPL for China delivery; etc.)
- Training completion tracked per geography; regional manager accountable
Phase 5 — Runbook Security Standards
Security runbook template (mandatory sections):
RUNBOOK: [System/Process Name] Security Operations
Scope: [What systems, services, processes this covers]
Owner: [Team + escalation contact]
Classification: [Security sensitivity of this runbook — CONFIDENTIAL]
Review cycle: [Quarterly minimum]
Last reviewed: [Date + reviewer]
1. NORMAL OPERATIONS
- Security monitoring cadence
- Routine access review process
- Certificate and credential rotation schedule
- Log review and alert triage process
2. SECURITY INCIDENT RESPONSE
- Detection indicators specific to this service
- Triage steps: verify → classify → contain
- Escalation matrix with contact details
- Evidence preservation steps for this service
- Communication templates (internal + regulatory)
3. CHANGE MANAGEMENT GATES
- Pre-change: security checklist (what to verify before change)
- During change: monitoring requirements
- Post-change: validation and verification steps
- Rollback procedure with security verification
4. DISASTER RECOVERY
- RTO and RPO targets with security dimension
- Backup verification process (including encryption check)
- Recovery sequence with security validation at each step
- Post-recovery security scan before accepting traffic
5. COMPLIANCE CONTROLS
- Applicable frameworks: [SOC2 controls, NIST controls mapped]
- Evidence collection: what to save, where, how long
- Audit support: contacts, evidence location, access process
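Conformance to the template can be checked automatically, for example in a documentation pipeline. The sketch below verifies that the five mandatory sections are present and that the last review falls within the review cycle. Approximating "quarterly" as 92 days is an assumption, as are the function and variable names.

```python
from datetime import date, timedelta

MANDATORY_SECTIONS = [
    "NORMAL OPERATIONS",
    "SECURITY INCIDENT RESPONSE",
    "CHANGE MANAGEMENT GATES",
    "DISASTER RECOVERY",
    "COMPLIANCE CONTROLS",
]
REVIEW_CYCLE = timedelta(days=92)  # assumption: "quarterly" approximated as 92 days


def validate_runbook(sections: list[str], last_reviewed: date, today: date) -> list[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    present = {s.strip().upper() for s in sections}
    problems = [
        f"missing section: {name}" for name in MANDATORY_SECTIONS if name not in present
    ]
    if today - last_reviewed > REVIEW_CYCLE:
        problems.append(f"review overdue: last reviewed {last_reviewed.isoformat()}")
    return problems


# Hypothetical runbook that lacks two sections and has a stale review date.
found = ["Normal Operations", "Security Incident Response", "Disaster Recovery"]
for problem in validate_runbook(found, date(2024, 1, 15), date(2024, 6, 1)):
    print(problem)
```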
Phase 6 — Security Chaos Engineering
Security fault injection testing (quarterly):
Scenarios to test in staging/pre-prod:
1. Certificate expiry simulation → verify alerting and auto-renewal
2. KMS key unavailability → verify graceful degradation (no plaintext fallback)
3. SIEM outage → verify backup logging; alert on gap
4. Identity provider outage → verify break-glass procedure
5. Network segmentation breach → verify east-west detection and blocking
6. Backup restore failure → verify DR procedure and RTO
7. Privileged account compromise → verify containment speed
Each test:
- Document hypothesis (what should happen?)
- Execute in controlled environment
- Measure actual vs. expected behavior
- Document gaps and remediate
- Re-test after remediation
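The hypothesis, execute, measure loop above can be captured in a small experiment record so that each test leaves comparable evidence. The sketch below models scenario 1 (certificate expiry) with a hypothetical alert rule and a 14-day warning threshold. In a real test the probe would inject the fault in staging and query the actual alerting system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class Experiment:
    name: str
    hypothesis: str
    expected: bool
    probe: Callable[[], bool]  # injects the fault and observes the outcome


def run(experiment: Experiment) -> dict:
    """Execute the probe and record whether reality matched the hypothesis."""
    actual = experiment.probe()
    return {
        "name": experiment.name,
        "hypothesis": experiment.hypothesis,
        "expected": experiment.expected,
        "actual": actual,
        "gap": actual != experiment.expected,
    }


def expiry_alert_fires(not_after: datetime, now: datetime, warn_days: int = 14) -> bool:
    """Hypothetical alert rule: fire when a certificate expires within warn_days."""
    return not_after - now <= timedelta(days=warn_days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
simulated_expiry = now + timedelta(days=5)  # injected fault: cert about to expire
result = run(Experiment(
    name="certificate expiry simulation",
    hypothesis="alerting fires when a certificate is within 14 days of expiry",
    expected=True,
    probe=lambda: expiry_alert_fires(simulated_expiry, now),
))
print(result["name"], "-", "gap found" if result["gap"] else "matched hypothesis")
```

A result with `gap` set to true is the input to the "document gaps and remediate" step, and the same experiment can be re-run unchanged after remediation.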
Operational Excellence Metrics
| Metric | Target | Frequency |
|---|---|---|
| SLO compliance (all security SLOs) | 100% | Monthly |
| Change-induced security incidents | 0 | Monthly |
| Security runbook currency | 100% reviewed within the runbook's review cycle (quarterly minimum) | Quarterly audit |
| Mean Time to Detect (security events) | <1 hour | Weekly |
| Security training completion (all staff) | 100% | Quarterly |
| Six Sigma defect rate (security processes) | ≤3.4 DPMO (6σ with the conventional 1.5σ shift) | Monthly |
| DR test success rate | 100% | Quarterly |
| Global delivery security audit pass rate | 100% | Annual |
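The DPMO target in the table can be computed from defect counts, and converted to a sigma level using the conventional 1.5σ long-term shift. The sketch below uses only the Python standard library. The sample counts and the choice of one defect opportunity per unit are assumptions for illustration.

```python
from statistics import NormalDist


def dpmo(defects: int, units: int, opportunities_per_unit: int = 1) -> float:
    """Defects per million opportunities."""
    return defects * 1_000_000 / (units * opportunities_per_unit)


def sigma_level(dpmo_value: float, shift: float = 1.5) -> float:
    """Sigma level for a DPMO figure, applying the conventional 1.5 sigma shift.

    dpmo_value must be strictly between 0 and 1,000,000, because the inverse
    normal CDF is undefined at probabilities of exactly 0 or 1.
    """
    process_yield = 1 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(process_yield) + shift


# Hypothetical month: 15 access reviews missed out of 500 due.
observed = dpmo(defects=15, units=500)
print(f"DPMO: {observed:.0f}, sigma level: {sigma_level(observed):.2f}")
print(f"sigma level at 3.4 DPMO: {sigma_level(3.4):.2f}")
```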