engineering-devops-automator
DevOps & Infrastructure Guide
Overview
This guide covers infrastructure automation, CI/CD pipeline development, deployment strategies, monitoring, and cloud operations. Use it when provisioning infrastructure, building pipelines, setting up observability, managing secrets, or planning disaster recovery.
First 10 Minutes
- Inventory the delivery surface before proposing changes: CI config, infrastructure directories, runtime manifests, Dockerfiles, and observability config.
- Run the existing validation commands before editing anything. If the repo has no validation path for infra changes, add one as part of the task.
- Use scripts/analyze_deployment_risk.py on the repo root to summarize CI, Docker, Terraform, and Kubernetes signals before proposing rollout changes.
- Identify the rollback path for the current deploy system. If you cannot explain how to revert the change in under 5 minutes, the rollout plan is incomplete.
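The inventory step can be sketched in a few lines of Python. This is not the contents of scripts/analyze_deployment_risk.py, which is not shown here; the function name and the glob patterns below are assumptions chosen for illustration and should be adjusted to the repo's actual layout.

```python
from pathlib import Path

# Hypothetical patterns for each delivery-surface category (assumption, not from the guide).
SIGNALS = {
    "ci": [".github/workflows/*.yml", ".github/workflows/*.yaml", ".gitlab-ci.yml", "Jenkinsfile"],
    "docker": ["**/Dockerfile", "**/Dockerfile.*", "**/docker-compose*.yml"],
    "terraform": ["**/*.tf"],
    "kubernetes": ["**/k8s/**/*.yaml", "**/k8s/**/*.yml", "**/Chart.yaml"],
}


def inventory_delivery_surface(repo_root: str) -> dict:
    """Return {category: sorted relative file paths} for files matching SIGNALS."""
    root = Path(repo_root)
    found = {}
    for category, patterns in SIGNALS.items():
        matches = set()
        for pattern in patterns:
            for path in root.glob(pattern):
                if path.is_file():
                    matches.add(path.relative_to(root).as_posix())
        found[category] = sorted(matches)
    return found
```

An empty list for a category is itself a signal: for example, Terraform files with no CI config suggests infrastructure changes are applied by hand.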
Refuse or Escalate
- Refuse "just push it" requests when there is no rollback path, no health signal, or no way to test the change outside production.
- Escalate before changing production state if the plan includes database replacement, Terraform destroys, state moves, certificate rotation, or security group broadening without a compensating control.
- Escalate when the repo mixes multiple deployment systems and ownership boundaries are unclear. Untangling that is a separate task.
- Do not recommend Kubernetes by default. If the workload is a single service with simple networking and predictable scale, stay with the simpler runtime.
Infrastructure Decision Rules
Provisioning
- Use Terraform with remote state (S3 + DynamoDB lock) so every resource is version-controlled and safe from concurrent modifications.
- Use Terraform workspaces or directory-per-environment layout with shared modules to catch drift between staging and production.
- For staging and other non-production environments, use the same Terraform modules as production with variable overrides -- never create infrastructure via the cloud console.
CI/CD Pipelines
- Structure as discrete stages (lint, test, build, scan, deploy) with explicit dependencies so security failures block deployment.
- Deployment strategy: blue-green for zero-downtime + instant rollback, canary for gradual traffic shifting with metric-based promotion, rolling when simplicity matters and brief mixed-version traffic is acceptable.
- Automate any manual step performed more than twice; delete the manual runbook entry to prevent drift.
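The stage-gating rule can be illustrated with a small sketch: stages run in a fixed order, and the first failure blocks everything after it, so a failed scan means deploy never runs. The run_pipeline function and the callable-per-stage shape are hypothetical; a real pipeline would express the same dependencies in its CI system's own config.

```python
from typing import Callable, Dict, List, Tuple

# Stage order taken from the rule above; each stage depends on all earlier ones.
STAGE_ORDER = ["lint", "test", "build", "scan", "deploy"]


def run_pipeline(stages: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped."""
    results = []
    blocked = False
    for name in STAGE_ORDER:
        if blocked:
            results.append((name, "skipped"))
            continue
        ok = stages[name]()
        results.append((name, "passed" if ok else "failed"))
        if not ok:
            blocked = True
    return results
```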
Containerization
- Use multi-stage Docker builds with distroless or Alpine final images to minimize attack surface.
- CI must run Trivy (or equivalent) and fail on CRITICAL/HIGH findings before merge.
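As a sketch of the scan gate, the function below filters a parsed scan report for blocking severities. It assumes the report follows the shape of Trivy's JSON output (a top-level Results list whose entries carry a Target and an optional Vulnerabilities list with VulnerabilityID and Severity fields); verify that shape against the Trivy version in use. Trivy can also enforce the gate directly through its --severity and --exit-code flags, which is usually simpler than post-processing.

```python
from typing import List

# Severities that should block a merge, per the rule above.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict) -> List[str]:
    """Return one description per CRITICAL/HIGH vulnerability in the report."""
    findings = []
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                findings.append(
                    f'{vuln.get("VulnerabilityID")} ({vuln.get("Severity")}) in {result.get("Target")}'
                )
    return findings
```

A CI step would exit non-zero when the returned list is non-empty.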
Monitoring and Reliability
- Instrument the four golden signals (latency, traffic, errors, saturation); alert on symptoms, not causes.
- Every alert must link to a runbook; alerts without runbooks get deleted or converted to dashboard metrics within one sprint.
- Enforce structured JSON logging; ship to centralized system (ELK, Loki) with compliance-aligned retention.
- Configure liveness probes so an unhealthy container is detected and restarted within about 30 seconds; set a PodDisruptionBudget to preserve availability during voluntary disruptions such as node drains.
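For the structured logging rule, a minimal sketch using only Python's standard logging module is shown below. The field names (timestamp, level, logger, message) are an assumed schema, not one prescribed by this guide; align them with whatever the central log system expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_json_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (attached once per name)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Writing one JSON object per line to stdout or stderr lets the shipper (for ELK or Loki) parse records without custom multiline rules.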
Disaster Recovery
- Automate failover with runbooks tested quarterly; an untested DR plan is no plan.
Cost Optimization
- Review cloud utilization monthly; downsize any instance averaging below 20% CPU over 14 days.
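The downsizing rule reduces to a simple check. The sketch below assumes one average CPU percentage per day, oldest first; the function name and input shape are illustrative, and the data would come from whatever metrics source is in use.

```python
from statistics import mean
from typing import List


def should_downsize(daily_avg_cpu_percent: List[float],
                    threshold: float = 20.0,
                    window_days: int = 14) -> bool:
    """True if the mean CPU over the most recent window is below the threshold."""
    if len(daily_avg_cpu_percent) < window_days:
        # Not enough history to apply the rule; do not recommend a change.
        return False
    recent = daily_avg_cpu_percent[-window_days:]
    return mean(recent) < threshold
```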
Secrets Management
- Store secrets in Vault or AWS Secrets Manager with automated rotation (max 90-day TTL); inject at runtime.
- Never commit secrets to source control or bake them into images.
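A rotation audit against the 90-day limit can be sketched as follows. The mapping of secret name to last-rotation time is a hypothetical input; in practice it would be built from the secret store's metadata API.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Maximum secret age from the rule above.
MAX_SECRET_AGE = timedelta(days=90)


def secrets_due_for_rotation(last_rotated: Dict[str, datetime],
                             now: Optional[datetime] = None) -> List[str]:
    """Return names of secrets whose last rotation is older than MAX_SECRET_AGE."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, rotated_at in last_rotated.items()
        if now - rotated_at > MAX_SECRET_AGE
    )
```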
Network Security
- Default all security groups and NACLs to deny-all inbound; open only required ports/CIDRs; prune monthly.
Compliance
- Generate automated audit logs recording deployer, commit SHA, and approval; store immutably for retention period.
Incident Response Protocol
- Severity 1 (site down, data loss risk): Assemble incident team within 5 minutes. First action: mitigate (rollback, failover, scale up), not diagnose. Communicate status to stakeholders within 15 minutes. Post-mortem within 48 hours.
- Severity 2 (degraded performance, partial outage): On-call engineer responds within 15 minutes. Check recent deploys (roll back if less than 1 hour old), infrastructure alerts (CPU, memory, disk), and dependency health (downstream services, databases). Communicate status within 30 minutes.
- Severity 3 (minor issue, workaround exists): Log the issue, create a ticket, fix in next sprint. No immediate response required.
- Rollback decision: If the issue started within 4 hours after a deploy, roll back first and investigate second. If the issue is not correlated with a deploy, escalate to the relevant service team.
- Communication template: "We are aware of [impact description]. [X users / Y% of traffic] are affected. We are [current action]. Next update in [time]."
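The rollback decision and the communication template can be captured in two small helpers. This is a sketch with hypothetical function names; the 4-hour window comes from the rollback rule above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Deploys older than this are not treated as the likely cause.
ROLLBACK_WINDOW = timedelta(hours=4)


def next_action(issue_started: datetime, last_deploy: Optional[datetime]) -> str:
    """Return "rollback" if the issue began within the window after a deploy, else "escalate"."""
    if last_deploy is not None:
        elapsed = issue_started - last_deploy
        if timedelta(0) <= elapsed <= ROLLBACK_WINDOW:
            return "rollback"
    return "escalate"


def status_update(impact: str, affected: str, current_action: str, next_update: str) -> str:
    """Fill in the stakeholder communication template."""
    return (
        f"We are aware of {impact}. {affected} are affected. "
        f"We are {current_action}. Next update in {next_update}."
    )
```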
Cost Estimation Formulas
- Compute (EC2/GCE): monthly_cost = instance_hourly_rate * 730 * instance_count. Reserved instances save 30-60% for steady-state workloads (commit for 1 year).
- Storage (S3/GCS): monthly_cost = storage_GB * $0.023 + (GET_requests / 1000) * $0.0004 + (PUT_requests / 1000) * $0.005. Request prices are quoted per 1,000 requests. Enable lifecycle policies: move to Infrequent Access after 30 days, Glacier after 90 days.
- Database (RDS/Cloud SQL): monthly_cost = instance_hourly_rate * 730 + storage_GB * $0.115 + IOPS_provisioned * $0.10. Multi-AZ doubles the instance cost.
- Data transfer: On AWS, the first 100 GB/month out to the internet is free, then about $0.09/GB. Inter-AZ: $0.01/GB each direction. Cross-region: about $0.02/GB. Data transfer is the hidden cost; monitor it.
- Kubernetes (EKS/GKE): cluster_cost = control_plane ($73/month EKS) + node_instance_costs + data_transfer. Spot/preemptible nodes save 60-90% for fault-tolerant workloads.
- Rule of thumb: If the cloud bill exceeds $5k/month, commission a FinOps review. If it exceeds $50k/month, automate cost anomaly detection with AWS Cost Anomaly Detection or similar.
- The dollar rates above are illustrative list prices; actual rates vary by region, provider, and storage or instance class.
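The compute, storage, and database formulas translate directly into code. The sketch below takes the rates as parameters with the figures from this section as defaults, and assumes object-storage request prices are quoted per 1,000 requests; treat every default as a placeholder to replace with current pricing for the target region.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the formulas above


def compute_monthly_cost(hourly_rate: float, instance_count: int) -> float:
    """Compute cost: hourly rate * 730 * instance count."""
    return hourly_rate * HOURS_PER_MONTH * instance_count


def storage_monthly_cost(storage_gb: float, get_requests: int = 0, put_requests: int = 0,
                         gb_rate: float = 0.023,
                         get_rate_per_1k: float = 0.0004,
                         put_rate_per_1k: float = 0.005) -> float:
    """Object storage cost: capacity plus per-1,000-request charges."""
    return (storage_gb * gb_rate
            + get_requests / 1000 * get_rate_per_1k
            + put_requests / 1000 * put_rate_per_1k)


def database_monthly_cost(hourly_rate: float, storage_gb: float, provisioned_iops: int = 0,
                          multi_az: bool = False,
                          gb_rate: float = 0.115, iops_rate: float = 0.10) -> float:
    """Database cost: instance hours (doubled for Multi-AZ) + storage + provisioned IOPS."""
    instance = hourly_rate * HOURS_PER_MONTH * (2 if multi_az else 1)
    return instance + storage_gb * gb_rate + provisioned_iops * iops_rate
```

For example, three instances at $0.10/hour come to 0.10 * 730 * 3 = $219 per month before any reservation discount.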
Service Selection Decision Trees
- Compute: Lambda/Cloud Functions if <15 min execution, <10GB memory, and request-driven. ECS/Cloud Run for containerized services with consistent traffic. EKS/GKE only if running >10 services with complex networking requirements.
- Database: RDS PostgreSQL for <10TB relational. DynamoDB for key-value at >100k QPS. ElastiCache Redis for caching and session storage. Aurora if you need PostgreSQL compatibility with automatic multi-AZ failover.
- Queue/Messaging: SQS for simple async jobs. SNS + SQS for fan-out. EventBridge for event routing with filtering rules. Kafka (MSK) only for streaming >10k msg/sec with replay.
- Storage: S3 for objects. EFS for shared filesystem (NFS). EBS for block storage (database volumes). Choose storage class based on access frequency.
- CDN: CloudFront for AWS-native. Cloudflare for multi-cloud or DDoS-heavy. Use CDN for all static assets and any API response cacheable for >5 seconds.
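The compute branch of the decision tree can be written as an ordered set of checks. This is a sketch: the function name, parameters, and return labels are illustrative, and the thresholds are the ones stated in the Compute rule above.

```python
def choose_compute(max_execution_minutes: float, memory_gb: float, request_driven: bool,
                   service_count: int, complex_networking: bool) -> str:
    """Pick a compute tier, preferring the simplest runtime that fits."""
    # Functions (Lambda / Cloud Functions): short, small, request-driven work.
    if request_driven and max_execution_minutes < 15 and memory_gb < 10:
        return "functions"
    # Kubernetes (EKS / GKE): only for many services with complex networking.
    if service_count > 10 and complex_networking:
        return "kubernetes"
    # Default: managed containers (ECS / Cloud Run).
    return "containers"
```

Checking the simplest option first mirrors the guide's stance of not recommending Kubernetes by default.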
Self-Verification Protocol
After any infrastructure or pipeline change, verify:
- Terraform: Run terraform plan and read every line of the diff. If the plan shows any destroy or replace on a production resource, stop and verify intent.
- CI/CD pipeline: Trigger a full pipeline run on a non-production branch. Verify every stage passes. Check that security scan gates actually block on findings (deliberately introduce a known CVE to test).
- Monitoring: After setting up alerts, trigger each alert manually (spike CPU, kill a health check, fill disk). Verify the alert fires within the expected time window and reaches the correct channel.
- Disaster recovery: After configuring backups, perform a restore to a test environment. Verify data integrity. If you cannot restore, you do not have backups — you have a false sense of security.
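- Secrets: Verify no secrets appear in: CI/CD logs (mask variables), Docker image layers (docker history), Terraform output (mark values sensitive = true; this only redacts CLI output, and the values are still written to state, so encrypt the state backend and restrict access to it), or git history (git log -p | grep -i password).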
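The Terraform check in the list above can be partly automated as a backstop to reading the diff. The sketch below assumes a plan exported with terraform show -json, where each entry in resource_changes has an address and a change.actions list, and where a replacement appears as a delete paired with a create. The function name is hypothetical.

```python
from typing import List, Tuple


def risky_changes(plan: dict) -> List[Tuple[str, str]]:
    """Return (address, kind) for every resource the plan would destroy or replace."""
    risky = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A delete paired with a create is a replacement; a lone delete is a destroy.
            kind = "replace" if "create" in actions else "destroy"
            risky.append((rc.get("address", "<unknown>"), kind))
    return risky
```

A non-empty result is a reason to stop and verify intent, not a substitute for reviewing the full plan.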
Failure Recovery
- Terraform state drift: Run terraform plan to see the drift. If drift is in a non-critical resource, run terraform apply to reconcile. If drift is in a critical resource (database, load balancer), investigate who/what changed it manually and reconcile carefully. Never blindly terraform apply when state shows unexpected changes.
- CI/CD pipeline broken: Check the last successful run. Diff the pipeline config between last success and current failure. Common causes: expired secrets/tokens, dependency version bump, runner image update, or rate limiting from a registry.
- Container OOM-killed in production: Check kubectl describe pod for the OOM event. Increase memory limits if under-provisioned. If memory usage grows linearly over time, the application has a memory leak: fix the app, not the limits.
- Certificate expiry: Automate renewal with cert-manager (Kubernetes) or ACM (AWS). Set alerts for 30, 14, and 7 days before expiry. If expired: renew immediately, check all services using the cert, verify they pick up the new cert (may need pod restart).
- Disk full: Identify what filled it: logs (rotate and compress), Docker images (prune unused), database WAL (check replication lag), or temp files. Fix the root cause; expanding the disk is a temporary measure.
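The certificate alert tiers from the list above (30, 14, and 7 days) can be sketched as a single function. The name and return strings are illustrative; the certificate's expiry time would come from whatever inventory or TLS probe is already in place.

```python
from datetime import datetime
from typing import Optional

# Alert tiers in days, checked from most to least urgent.
ALERT_THRESHOLDS_DAYS = (7, 14, 30)


def expiry_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the most urgent alert tier for a certificate, or None if none applies."""
    if not_after <= now:
        return "expired"
    remaining_days = (not_after - now).days
    for threshold in ALERT_THRESHOLDS_DAYS:
        if remaining_days <= threshold:
            return f"expires within {threshold} days"
    return None
```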
Scripts
- scripts/validate_dockerfile.sh -- Check a Dockerfile against common best practices: multi-stage builds, USER instruction, HEALTHCHECK, no latest tags, COPY over ADD, and .dockerignore presence. Run with --help for usage.
- scripts/check_services.sh -- Check TCP connectivity and HTTP response for a list of host:port pairs. Reports status, latency, and HTTP status code. Run with --help for usage.
Code Examples
See CI/CD Pipeline Guide for a full GitHub Actions pipeline with security scanning, container build, and blue-green deployment with smoke tests.
See Infrastructure & Monitoring Guide for Terraform (launch template, ASG, ALB, CloudWatch alarm) and Prometheus configuration with alert rules.
Workflow
Step 1: Infrastructure Assessment
- Audit existing infrastructure, deployment process, and monitoring gaps.
- Map application dependencies and scaling requirements.
- Identify security and compliance requirements for the target environment.
Step 2: Pipeline Design
- Design CI/CD pipeline with security scanning integration.
- Plan deployment strategy (blue-green, canary, rolling).
- Create infrastructure as code templates.
- Design monitoring and alerting strategy.
Step 3: Implementation
- Set up CI/CD pipelines with automated testing.
- Implement infrastructure as code with version control.
- Configure monitoring, logging, and alerting systems.
- Create disaster recovery and backup automation.
Step 4: Optimization and Maintenance
- Monitor system performance and optimize resources.
- Implement cost optimization strategies.
- Create automated security scanning and compliance reporting.
- Build self-healing systems with automated recovery.
Deliverables
- Deployment strategy with explicit rollback steps, health gates, and ownership.
- Infrastructure change summary listing stateful resources, blast radius, and approval points.
- CI/CD plan covering lint, test, build, scan, deploy, and post-deploy verification.
- Monitoring and alert checklist tied to the changed services, not a generic dashboard wishlist.
References
- CI/CD Pipeline Guide -- GitHub Actions pipeline with security scanning, container build, and blue-green deployment.
- Infrastructure & Monitoring Guide -- Terraform (launch template, ASG, ALB, CloudWatch alarm) and Prometheus configuration.
- Kubernetes Patterns -- Production Deployment, HPA, PDB, ConfigMap/Secret mounting, Ingress with TLS, CronJob, and Helm values.
- Docker Best Practices -- Multi-stage Dockerfiles (Node.js, Python, Go), .dockerignore, Docker Compose, and Trivy scanning.
- Monitoring & Observability -- Structured logging, Prometheus metrics, Grafana dashboard, alert rules, OpenTelemetry tracing, and health checks.
- Incident Triage -- Repo-first production incident flow, rollback decision tree, and evidence capture checklist.
- Deployment Rollback Guide -- Canary, blue-green, rolling, schema-change, and feature-flag rollback patterns.