Production Readiness

A systematic checklist for verifying that a service is ready for production deployment.

When to Use This Skill

Preparing a new service for production launch
Conducting a go-live readiness review
Building a production deployment checklist
Assessing service maturity
Performing a pre-launch security review

Quick Readiness Assessment

Copy and complete this checklist:

Production Readiness Assessment:
Service: _______________
Date: _________________
Reviewer: _____________

Reliability:     [ ] Pass  [ ] Partial  [ ] Fail
Observability:   [ ] Pass  [ ] Partial  [ ] Fail
Security:        [ ] Pass  [ ] Partial  [ ] Fail
Operations:      [ ] Pass  [ ] Partial  [ ] Fail
Documentation:   [ ] Pass  [ ] Partial  [ ] Fail

Overall Status:  [ ] Ready  [ ] Conditional  [ ] Not Ready
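The checklist does not say how the five category results combine into the overall status. The sketch below shows one possible rule, stated here as an assumption: any Fail means Not Ready, any Partial means Conditional, and otherwise the service is Ready. The names `Result` and `overall_status` are illustrative.

```python
from enum import Enum


class Result(Enum):
    PASS = "Pass"
    PARTIAL = "Partial"
    FAIL = "Fail"


def overall_status(results: dict[str, Result]) -> str:
    """Combine per-category results into an overall status.

    Assumed rule (not defined by the checklist itself):
    any Fail -> Not Ready, any Partial -> Conditional, else Ready.
    """
    values = list(results.values())
    if any(v is Result.FAIL for v in values):
        return "Not Ready"
    if any(v is Result.PARTIAL for v in values):
        return "Conditional"
    return "Ready"


assessment = {
    "Reliability": Result.PASS,
    "Observability": Result.PARTIAL,
    "Security": Result.PASS,
    "Operations": Result.PASS,
    "Documentation": Result.PASS,
}
status = overall_status(assessment)  # "Conditional" under the assumed rule
```

Teams with stricter standards might treat a Partial in Security as blocking; adjust the rule to match local policy.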

Reliability Checklist

SLOs Defined

SLO Checklist:
- [ ] Availability SLO defined (e.g., 99.9%)
- [ ] Latency SLO defined (e.g., p99 < 200ms)
- [ ] Error rate SLO defined (e.g., < 0.1%)
- [ ] SLOs documented and communicated
- [ ] Error budget policy established
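An error budget follows directly from an availability SLO: it is the fraction of time (or requests) that is allowed to fail. The sketch below works through the arithmetic, assuming a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (can go negative)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # no traffic yet, so nothing meaningful to report
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
minutes = error_budget_minutes(0.999)

# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
remaining = budget_remaining(0.999, 1_000_000, 250)
```

An error budget policy typically states what happens as `remaining` approaches zero, for example pausing feature releases.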

Fault Tolerance

Fault Tolerance:
- [ ] No single points of failure
- [ ] Graceful degradation implemented
- [ ] Circuit breakers for dependencies
- [ ] Retry logic with exponential backoff
- [ ] Timeouts configured for all external calls
- [ ] Rate limiting in place
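As an illustration of the retry item above, here is a minimal sketch of retry logic with exponential backoff and full jitter. The function name, defaults, and the set of retryable exceptions are assumptions; real services should retry only errors known to be transient and only idempotent operations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay cap doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the cap so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. Pair retries with per-call timeouts, otherwise a hung dependency never reaches the retry path.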

Capacity

Capacity Planning:
- [ ] Load tested to 2x expected peak
- [ ] Auto-scaling configured (if applicable)
- [ ] Resource limits set (CPU, memory)
- [ ] Connection pool sizes appropriate
- [ ] Queue capacity sufficient

Data Resilience

Data Protection:
- [ ] Backups configured and running on schedule
- [ ] Backup restoration tested end to end
- [ ] Data replication in place
- [ ] RPO/RTO defined and achievable
- [ ] No data loss on service restart

Observability Checklist

Metrics

Metrics:
- [ ] RED metrics exposed (Rate, Errors, Duration)
- [ ] Resource metrics available (CPU, memory, disk)
- [ ] Business metrics tracked
- [ ] Dependency health metrics
- [ ] Custom metrics for key operations
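To make the RED item concrete, the sketch below records Rate, Errors, and Duration per operation in memory. It is a standard-library illustration only; `RedMetrics` is a hypothetical class, and a real service would export these values through its metrics library of choice.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RedMetrics:
    """In-memory RED metrics: request count, error count, durations."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations = defaultdict(list)  # seconds per request

    @contextmanager
    def track(self, operation: str):
        """Wrap one request; counts it, times it, and records failures."""
        start = time.perf_counter()
        self.requests[operation] += 1
        try:
            yield
        except Exception:
            self.errors[operation] += 1
            raise
        finally:
            self.durations[operation].append(time.perf_counter() - start)

    def error_ratio(self, operation: str) -> float:
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0


metrics = RedMetrics()
with metrics.track("get_order"):
    pass  # handle the request here
```

Dividing the request count by the observation window gives the rate; the duration list feeds latency percentiles such as p99.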

Logging

Logging:
- [ ] Structured logging (JSON)
- [ ] Request/trace IDs in all logs
- [ ] Log levels appropriate (no excessive DEBUG)
- [ ] Sensitive data not logged
- [ ] Log retention configured
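A minimal sketch of structured JSON logging with a request ID, using only the standard `logging` module. The field names (`request_id`, `timestamp`, and so on) and the `JsonFormatter` class are assumptions; match them to whatever schema your log pipeline expects.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated via `extra={"request_id": ...}` at the call site.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")  # hypothetical service logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # keeps DEBUG noise out of production logs

logger.info("order created", extra={"request_id": "req-123"})
```

Because the formatter emits only an explicit set of fields, it also reduces the risk of accidentally logging sensitive attributes.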

Tracing

Distributed Tracing:
- [ ] Trace context propagated
- [ ] Spans for external calls
- [ ] Key operations instrumented
- [ ] Sampling rate configured
- [ ] Trace storage/retention set
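Trace context propagation can be sketched with the W3C `traceparent` header, which carries a version, a trace ID, a parent span ID, and flags. The example below assumes that header format and shows how a service might continue an incoming trace or start a new one; `TraceContext` and `child_context` are illustrative names, and marking new traces as sampled (`01`) is an assumption. Tracing libraries normally handle this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str

    def header(self) -> str:
        """Value to send as the outgoing `traceparent` header."""
        return f"00-{self.trace_id}-{self.span_id}-{self.flags}"


def parse_traceparent(value: Optional[str]) -> Optional[TraceContext]:
    match = _TRACEPARENT.fullmatch(value or "")
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, flags)


def child_context(incoming: Optional[str]) -> TraceContext:
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    parent = parse_traceparent(incoming)
    if parent is None:
        # No valid context: start a new trace (sampled flag is an assumption).
        return TraceContext(secrets.token_hex(16), secrets.token_hex(8), "01")
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.flags)
```

Each outgoing call to a dependency would send `child_context(incoming_header).header()` so that spans across services share one trace ID.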

Alerting

Alerts:
- [ ] SLO-based alerts configured
- [ ] Alert thresholds tuned (not noisy)
- [ ] Runbooks linked to alerts
- [ ] Escalation paths defined
- [ ] On-call rotation assigned
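One common way to build SLO-based alerts is to alert on burn rate: how fast the error budget is being spent relative to the rate that would exactly exhaust it over the SLO window. The sketch below assumes a two-window check and a paging threshold of 14.4 (a burn rate that spends about 2% of a 30-day budget in one hour); both the threshold and the function names are assumptions to tune per service.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - slo)."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the problem is significant;
    the short window (e.g. 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


# With a 99.9% SLO, a 2% error ratio is a burn rate of about 20.
page = should_page(0.02, 0.03, slo=0.999)
```

Requiring both windows keeps alerts from firing on brief spikes and from lingering after recovery, which helps with the "not noisy" item above.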

Dashboards

Dashboards:
- [ ] Service health dashboard exists
- [ ] Key metrics visualized
- [ ] Dashboard accessible to team
- [ ] Dependencies shown
- [ ] Historical data available

Security Checklist

Authentication & Authorization

Auth:
- [ ] Authentication required for all endpoints
- [ ] Authorization checks implemented
- [ ] Service-to-service auth configured
- [ ] No hardcoded credentials
- [ ] Secrets in secret manager
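To illustrate the "no hardcoded credentials" item, the sketch below reads a secret at startup and fails fast if it is missing. It assumes the secret manager injects secrets into the process environment; `DB_PASSWORD`, `require_secret`, and `MissingSecretError` are hypothetical names.

```python
import os
from typing import Mapping


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is absent."""


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the secret value, or fail fast with a message that names
    the variable but never prints a value."""
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"required secret {name} is not set")
    return value


# At startup (assumes DB_PASSWORD is injected by the platform):
# db_password = require_secret("DB_PASSWORD")
```

Failing at startup turns a missing secret into an obvious deployment error instead of a runtime failure on the first request.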

Network Security

Network:
- [ ] TLS for all connections
- [ ] Network policies/firewall rules
- [ ] Internal services not publicly exposed
- [ ] Egress traffic controlled
- [ ] DDoS protection (if public)

Data Security

Data:
- [ ] Sensitive data encrypted at rest
- [ ] PII handling documented
- [ ] Data retention policy applied
- [ ] Audit logging for sensitive operations
- [ ] GDPR/compliance requirements met

Vulnerability Management

Vulnerabilities:
- [ ] Dependencies scanned for CVEs
- [ ] Container images scanned
- [ ] No critical vulnerabilities
- [ ] Security review completed
- [ ] Penetration testing (if required)

Operations Checklist

Deployment

Deployment:
- [ ] CI/CD pipeline configured
- [ ] Deployment is automated
- [ ] Rollback procedure documented
- [ ] Rollback tested
- [ ] Blue-green or canary supported
- [ ] Feature flags for risky changes
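A feature flag for a risky change is often rolled out to a percentage of users. The sketch below shows one way to do that with deterministic hashing, so a given user gets the same answer on every request; `is_enabled` and the flag name are hypothetical, and production systems usually rely on a flag service rather than hand-rolled code.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets per flag."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Hashing flag and user together gives each flag an independent split.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


# Gradual rollout: raise the percentage as confidence grows.
use_new_path = is_enabled("new-checkout", "user-42", rollout_percent=10)
```

Setting the percentage back to 0 disables the change without a redeploy, which complements the rollback items above.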

Runbooks

Runbooks:
- [ ] Startup/shutdown procedures
- [ ] Common troubleshooting steps
- [ ] Escalation procedures
- [ ] Disaster recovery steps
- [ ] Maintenance procedures

On-Call

On-Call Readiness:
- [ ] On-call rotation scheduled
- [ ] Team trained on service
- [ ] Escalation paths clear
- [ ] Contact information current
- [ ] Handoff procedures defined

Documentation Checklist

Documentation:
- [ ] Architecture diagram current
- [ ] API documentation complete
- [ ] README with setup instructions
- [ ] Dependencies documented
- [ ] Configuration documented
- [ ] Known issues/limitations listed

Rollback Plan

Every production deployment needs a rollback plan:

Rollback Plan:
- Rollback trigger: [Conditions that trigger a rollback]
- Rollback method: [How to roll back: automated or manual]
- Rollback time: [Expected time to complete]
- Data considerations: [Any data migration concerns]
- Verification: [How to verify the rollback succeeded]
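The plan template above can also be captured as data so that the trigger is checked consistently during a launch. The sketch below is one possible encoding; the field names, thresholds, and example values are assumptions, not part of the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPlan:
    max_error_ratio: float       # trigger: error ratio above this
    max_p99_latency_ms: float    # trigger: p99 latency above this
    method: str                  # "automated" or "manual"
    expected_minutes: int        # expected time to complete
    data_considerations: str     # e.g. migration concerns
    verification: str            # how success is confirmed

    def should_roll_back(self, error_ratio: float, p99_latency_ms: float) -> bool:
        """True when either trigger condition is exceeded."""
        return (
            error_ratio > self.max_error_ratio
            or p99_latency_ms > self.max_p99_latency_ms
        )


# Hypothetical values for illustration only.
plan = RollbackPlan(
    max_error_ratio=0.01,
    max_p99_latency_ms=500.0,
    method="automated",
    expected_minutes=10,
    data_considerations="schema change is backward compatible",
    verification="error ratio and p99 return to pre-deploy baseline",
)
decision = plan.should_roll_back(error_ratio=0.03, p99_latency_ms=180.0)
```

Agreeing on numeric triggers before launch avoids debating thresholds in the middle of an incident.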

Pre-Launch Final Checklist

Complete immediately before go-live:

Final Pre-Launch:
- [ ] All checklist items above addressed
- [ ] Stakeholders notified of launch
- [ ] War room/incident channel ready
- [ ] Key personnel available
- [ ] Monitoring dashboards open
- [ ] Rollback ready to execute
- [ ] Communication templates prepared

Common Blockers

See references/common-blockers.md for typical issues that block production readiness.
