DevOps & Infrastructure Guide

Overview

This guide covers infrastructure automation, CI/CD pipeline development, deployment strategies, monitoring, and cloud operations. Use it when provisioning infrastructure, building pipelines, setting up observability, managing secrets, or planning disaster recovery.

First 10 Minutes

  • Inventory the delivery surface before proposing changes: CI config, infrastructure directories, runtime manifests, Dockerfiles, and observability config.
  • Run the existing validation commands before editing anything. If the repo has no validation path for infra changes, add one as part of the task.
  • Run scripts/analyze_deployment_risk.py against the repo root to summarize CI, Docker, Terraform, and Kubernetes signals before proposing rollout changes.
  • Identify the rollback path for the current deploy system. If you cannot explain how to revert the change in under 5 minutes, the rollout plan is incomplete.
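
The inventory step above can be sketched in a few lines of Python. This is not the contents of scripts/analyze_deployment_risk.py; it is a minimal illustration, and the glob patterns in `SIGNALS` and the function name `inventory_delivery_surface` are assumptions to adapt to the repo's actual layout.

```python
from pathlib import Path

# Hypothetical patterns per signal; extend them to match the repo being audited.
SIGNALS = {
    "ci": [".github/workflows/*.yml", ".github/workflows/*.yaml",
           ".gitlab-ci.yml", "Jenkinsfile"],
    "docker": ["**/Dockerfile", "**/Dockerfile.*", "**/docker-compose*.yml"],
    "terraform": ["**/*.tf"],
    "kubernetes": ["**/k8s/**/*.yaml", "**/kustomization.yaml",
                   "**/helm/**/Chart.yaml"],
}


def inventory_delivery_surface(repo_root: str) -> dict[str, list[str]]:
    """Return delivery-related files under repo_root, grouped by signal."""
    root = Path(repo_root)
    found: dict[str, list[str]] = {}
    for signal, patterns in SIGNALS.items():
        paths = set()
        for pattern in patterns:
            for path in root.glob(pattern):
                if path.is_file():
                    paths.add(path.relative_to(root).as_posix())
        found[signal] = sorted(paths)
    return found


if __name__ == "__main__":
    for signal, paths in inventory_delivery_surface(".").items():
        print(f"{signal}: {len(paths)} file(s)")
        for p in paths:
            print(f"  {p}")
```

An empty list for a signal is itself a finding: for example, Terraform files with no CI config suggests infra changes have no validation path yet.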

Refuse or Escalate

  • Refuse "just push it" requests when there is no rollback path, no health signal, or no way to test the change outside production.
  • Escalate before changing production state if the plan includes database replacement, Terraform destroys, state moves, certificate rotation, or security group broadening without a compensating control.
  • Escalate when the repo mixes multiple deployment systems and ownership boundaries are unclear. Untangling that is a separate task.
  • Do not recommend Kubernetes by default. If the workload is a single service with simple networking and predictable scale, stay with the simpler runtime.

Infrastructure Decision Rules

Provisioning

  • Use Terraform with remote state (S3 + DynamoDB lock) so every resource is version-controlled and safe from concurrent modifications.
  • Use Terraform workspaces or a directory-per-environment layout with shared modules to catch drift between staging and production.
  • Build non-production environments from the same Terraform modules as production, with variable overrides. Never create infrastructure via the cloud console.

CI/CD Pipelines

  • Structure pipelines as discrete stages (lint, test, build, scan, deploy) with explicit dependencies so security failures block deployment.
  • Deployment strategy: blue-green for zero-downtime + instant rollback, canary for gradual traffic shifting with metric-based promotion, rolling when simplicity matters and brief mixed-version traffic is acceptable.
  • Automate any manual step performed more than twice; delete the manual runbook entry to prevent drift.
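
To make "explicit dependencies" concrete, here is a small sketch of a stage graph in which a failed scan prevents deploy from running. The `STAGES` mapping and `run_pipeline` helper are illustrative names, not part of any CI system; real pipelines express the same idea in their own config (for example `needs:` in GitHub Actions).

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
STAGES = {
    "lint": set(),
    "test": {"lint"},
    "build": {"test"},
    "scan": {"build"},
    "deploy": {"scan"},
}


def run_pipeline(stages, runners):
    """Run stages in dependency order; skip anything downstream of a failure.

    runners maps a stage name to a zero-argument callable returning True on success.
    """
    results = {}
    for stage in TopologicalSorter(stages).static_order():
        # A stage runs only if every dependency passed.
        if any(results.get(dep) != "passed" for dep in stages[stage]):
            results[stage] = "skipped"
            continue
        results[stage] = "passed" if runners[stage]() else "failed"
    return results
```

With a runner for "scan" that returns False, "deploy" ends up "skipped" rather than running against an unscanned image.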

Containerization

  • Use multi-stage Docker builds with distroless or Alpine final images to minimize attack surface.
  • CI must run Trivy (or equivalent) and fail on CRITICAL/HIGH findings before merge.
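
Trivy can enforce this gate on its own with `--severity CRITICAL,HIGH --exit-code 1`. When the pipeline needs to post-process findings (for example, to list them in a PR comment), a sketch like the one below can read the JSON report. It assumes the report layout produced by `trivy image --format json`, with a top-level `Results` list whose entries carry `Vulnerabilities`; verify that against the Trivy version in use. The function name is hypothetical.

```python
import json

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(report_json: str) -> list[tuple[str, str, str]]:
    """Return (target, vulnerability_id, severity) for findings that should fail CI."""
    report = json.loads(report_json)
    findings = []
    # "Results" and "Vulnerabilities" may be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                findings.append((
                    result.get("Target", ""),
                    vuln.get("VulnerabilityID", ""),
                    vuln["Severity"],
                ))
    return findings
```

A CI step would exit non-zero whenever the returned list is non-empty.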

Monitoring and Reliability

  • Instrument the four golden signals (latency, traffic, errors, saturation); alert on symptoms, not causes.
  • Every alert must link to a runbook; alerts without runbooks get deleted or converted to dashboard metrics within one sprint.
  • Enforce structured JSON logging; ship logs to a centralized system (ELK, Loki) with compliance-aligned retention.
  • Configure liveness probes so a hung container is restarted within about 30 seconds; set a PodDisruptionBudget to keep enough replicas available during voluntary disruptions such as node drains and upgrades.
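
For services written in Python, structured JSON logging can be sketched with the standard `logging` module alone. The field names (`ts`, `level`, `logger`, `message`) and the `context` attribute are assumptions; align them with whatever schema the log pipeline expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields passed as extra={"context": {...}} are merged in.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("order placed", extra={"context": {"request_id": "r-123"}})
```

One JSON object per line keeps the output easy to parse for shippers that read container stdout.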

Disaster Recovery

  • Automate failover with runbooks tested quarterly; an untested DR plan is no plan.

Cost Optimization

  • Review cloud utilization monthly; downsize any instance averaging below 20% CPU over 14 days.
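
The downsizing rule is simple enough to automate. The sketch below assumes daily average CPU percentages have already been exported per instance (for example from CloudWatch); the input shape and function name are hypothetical.

```python
from statistics import mean


def downsize_candidates(cpu_by_instance, threshold_pct=20.0, window_days=14):
    """Return instance IDs whose mean CPU over the last window_days is below threshold.

    cpu_by_instance maps an instance ID to a list of daily average CPU
    percentages, oldest first.
    """
    candidates = []
    for instance_id, daily_cpu in cpu_by_instance.items():
        window = daily_cpu[-window_days:]
        if len(window) < window_days:
            continue  # not enough history to judge; leave the instance alone
        if mean(window) < threshold_pct:
            candidates.append(instance_id)
    return sorted(candidates)
```

CPU alone can mislead for memory-bound or bursty workloads, so treat the output as a review list rather than an automatic resize trigger.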

Secrets Management

  • Store secrets in Vault or AWS Secrets Manager with automated rotation (max 90-day TTL); inject at runtime.
  • Never commit secrets to source control or bake them into images.
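
A rotation audit against the 90-day limit might look like the following sketch. It assumes the last-rotation timestamps have already been fetched from the secrets backend as timezone-aware datetimes; the function name and input shape are illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_SECRET_AGE = timedelta(days=90)


def overdue_secrets(last_rotated, now=None):
    """Return names of secrets whose last rotation is older than MAX_SECRET_AGE.

    last_rotated maps a secret name to a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, rotated_at in last_rotated.items()
        if now - rotated_at > MAX_SECRET_AGE
    )
```

Running this on a schedule and alerting on a non-empty result catches secrets whose automated rotation has silently stopped.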

Network Security

  • Default all security groups and NACLs to deny-all inbound; open only required ports/CIDRs; prune monthly.

Compliance

  • Generate automated audit logs recording deployer, commit SHA, and approval; store them immutably for the required retention period.

Incident Response Protocol

  • Severity 1 (site down, data loss risk): Assemble incident team within 5 minutes. First action: mitigate (rollback, failover, scale up), not diagnose. Communicate status to stakeholders within 15 minutes. Post-mortem within 48 hours.
  • Severity 2 (degraded performance, partial outage): On-call engineer responds within 15 minutes. Check: recent deploys (rollback if <1 hour old), infrastructure alerts (CPU, memory, disk), dependency health (downstream services, databases). Communicate status within 30 minutes.
  • Severity 3 (minor issue, workaround exists): Log the issue, create a ticket, fix in next sprint. No immediate response required.
  • Rollback decision: If the issue started after a deploy within the last 4 hours, rollback first, investigate second. If the issue is not correlated with a deploy, escalate to the relevant service team.
  • Communication template: "We are aware of [impact description]. [X users / Y% of traffic] are affected. We are [current action]. Next update in [time]."
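
The rollback rule and the communication template above can be encoded so that on-call tooling applies them consistently. This is a sketch; the function names are hypothetical, and treating "no known deploy" as an escalation is an assumption.

```python
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(hours=4)


def first_action(issue_started, last_deploy):
    """Return "rollback" if the issue began within 4 hours after a deploy, else "escalate"."""
    if last_deploy is not None:
        since_deploy = issue_started - last_deploy
        if timedelta(0) <= since_deploy <= ROLLBACK_WINDOW:
            return "rollback"
    return "escalate"


def status_update(impact, affected, action, next_update):
    """Fill in the stakeholder communication template."""
    return (
        f"We are aware of {impact}. {affected} are affected. "
        f"We are {action}. Next update in {next_update}."
    )
```

For example, an issue that starts one hour after a deploy yields "rollback", while one that starts five hours after yields "escalate".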

Cost Estimation Formulas

  • Compute (EC2/GCE): monthly_cost = instance_hourly_rate * 730 * instance_count. Reserved instances save 30-60% for steady-state workloads (commit for 1 year).
  • Storage (S3/GCS): monthly_cost = storage_GB * $0.023 + (GET_requests / 1,000) * $0.0004 + (PUT_requests / 1,000) * $0.005. Request prices are per 1,000 requests. Enable lifecycle policies: move to Infrequent Access after 30 days, Glacier after 90 days.
  • Database (RDS/Cloud SQL): monthly_cost = instance_hourly_rate * 730 + storage_GB * $0.115 + IOPS_provisioned * $0.10. Multi-AZ doubles the instance cost.
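  • Data transfer: On AWS, the first 100 GB/month out to the internet is free (it was 1 GB/month before December 2021); after that, budget about $0.09/GB out to the internet. Inter-AZ: $0.01/GB each direction. Cross-region: $0.02/GB. Data transfer is the hidden cost, so monitor it.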
  • Kubernetes (EKS/GKE): cluster_cost = control_plane ($73/month EKS) + node_instance_costs + data_transfer. Spot/preemptible nodes save 60-90% for fault-tolerant workloads.
  • Rule of thumb: If the cloud bill exceeds $5k/month, commission a FinOps review. If it exceeds $50k/month, automate cost anomaly detection with AWS Cost Anomaly Detection or similar.
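
The formulas above translate directly into a small estimator. The unit prices are the list-price figures quoted in this section and vary by region and over time; the sketch assumes the S3 request prices apply per 1,000 requests, and all function names are illustrative.

```python
HOURS_PER_MONTH = 730


def compute_monthly(hourly_rate, instance_count):
    """EC2/GCE: hourly rate * 730 hours * instance count."""
    return hourly_rate * HOURS_PER_MONTH * instance_count


def storage_monthly(storage_gb, get_requests=0, put_requests=0):
    """S3/GCS: storage plus request charges (request prices are per 1,000 requests)."""
    return (
        storage_gb * 0.023
        + get_requests / 1000 * 0.0004
        + put_requests / 1000 * 0.005
    )


def database_monthly(hourly_rate, storage_gb, provisioned_iops=0, multi_az=False):
    """RDS/Cloud SQL: instance + storage + provisioned IOPS; Multi-AZ doubles the instance cost."""
    instance = hourly_rate * HOURS_PER_MONTH * (2 if multi_az else 1)
    return instance + storage_gb * 0.115 + provisioned_iops * 0.10


if __name__ == "__main__":
    total = (
        compute_monthly(hourly_rate=0.10, instance_count=3)
        + storage_monthly(storage_gb=1000, get_requests=1_000_000, put_requests=100_000)
        + database_monthly(hourly_rate=0.20, storage_gb=100, provisioned_iops=1000, multi_az=True)
    )
    print(f"Estimated monthly cost: ${total:,.2f}")
```

The example rates (0.10 and 0.20 per hour) are placeholders; substitute the actual instance prices before relying on the total.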

Service Selection Decision Trees

  • Compute: Lambda/Cloud Functions if <15 min execution, <10GB memory, and request-driven. ECS/Cloud Run for containerized services with consistent traffic. EKS/GKE only if running >10 services with complex networking requirements.
  • Database: RDS PostgreSQL for <10TB relational. DynamoDB for key-value at >100k QPS. ElastiCache Redis for caching and session storage. Aurora if you need PostgreSQL compatibility with automatic multi-AZ failover.
  • Queue/Messaging: SQS for simple async jobs. SNS + SQS for fan-out. EventBridge for event routing with filtering rules. Kafka (MSK) only for streaming >10k msg/sec with replay.
  • Storage: S3 for objects. EFS for shared filesystem (NFS). EBS for block storage (database volumes). Choose storage class based on access frequency.
  • CDN: CloudFront for AWS-native. Cloudflare for multi-cloud or DDoS-heavy. Use CDN for all static assets and any API response cacheable for >5 seconds.
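
As an illustration, the compute branch of these decision trees can be written as a function. The thresholds come from the bullet above; the order in which the checks run and the `Workload` fields are assumptions, since the text does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_runtime_min: float
    memory_gb: float
    request_driven: bool
    service_count: int = 1
    complex_networking: bool = False


def choose_compute(w: Workload) -> str:
    """Pick a compute platform using the thresholds from the compute decision tree."""
    # Assumption: a large, network-heavy fleet is evaluated first.
    if w.service_count > 10 and w.complex_networking:
        return "EKS/GKE"
    if w.request_driven and w.max_runtime_min < 15 and w.memory_gb < 10:
        return "Lambda/Cloud Functions"
    # Default for containerized services with consistent traffic.
    return "ECS/Cloud Run"
```

Encoding the rules this way makes the defaults reviewable: Kubernetes is reached only through an explicit condition, never as a fallback.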

Self-Verification Protocol

After any infrastructure or pipeline change, verify:

  • Terraform: Run terraform plan and read every line of the diff. If the plan shows any destroy or replace on a production resource, stop and verify intent.
  • CI/CD pipeline: Trigger a full pipeline run on a non-production branch. Verify every stage passes. Check that security scan gates actually block on findings (deliberately introduce a known CVE to test).
  • Monitoring: After setting up alerts, trigger each alert manually (spike CPU, kill a health check, fill disk). Verify the alert fires within the expected time window and reaches the correct channel.
  • Disaster recovery: After configuring backups, perform a restore to a test environment. Verify data integrity. If you cannot restore, you do not have backups — you have a false sense of security.
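  • Secrets: Verify no secrets appear in: CI/CD logs (mask variables), Docker image layers (docker history), Terraform plan and apply output (mark variables and outputs sensitive = true), Terraform state (sensitive = true only redacts CLI output; values are still stored in plaintext in state, so encrypt the backend and restrict access), or git history (git log -p | grep -i password).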
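
Reading every line of a plan does not scale to large diffs, so a script can flag the dangerous lines first. The sketch below assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan`, and that each entry in `resource_changes` carries a `change.actions` list; the function name is hypothetical.

```python
import json


def destructive_changes(plan_json: str) -> list[tuple[str, str]]:
    """Return (resource_address, kind) for every destroy or replace in a plan.

    kind is "replace" when the resource is deleted and recreated,
    and "destroy" when it is only deleted.
    """
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((rc["address"], kind))
    return flagged
```

A CI job for production workspaces could fail, or require an extra approval, whenever the returned list is non-empty.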

Failure Recovery

  • Terraform state drift: Run terraform plan to see the drift. If drift is in a non-critical resource, run terraform apply to reconcile. If drift is in a critical resource (database, load balancer), investigate who or what changed it manually and reconcile carefully. Never blindly run terraform apply when the plan shows unexpected changes.
  • CI/CD pipeline broken: Check the last successful run. Diff the pipeline config between last success and current failure. Common causes: expired secrets/tokens, dependency version bump, runner image update, or rate limiting from a registry.
  • Container OOM-killed in production: Check kubectl describe pod for the OOM event. Increase memory limits if under-provisioned. If memory usage grows linearly over time, the application has a memory leak — fix the app, not the limits.
  • Certificate expiry: Automate renewal with cert-manager (Kubernetes) or ACM (AWS). Set alerts for 30, 14, and 7 days before expiry. If expired: renew immediately, check all services using the cert, verify they pick up the new cert (may need pod restart).
  • Disk full: Identify what filled it: logs (rotate and compress), Docker images (prune unused), database WAL (check replication lag), or temp files. Fix the root cause; expanding the disk is a temporary measure.
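
The 30/14/7-day certificate alert ladder can be sketched as a pure function over the certificate's expiry timestamp. Fetching that timestamp (from ACM, cert-manager, or a TLS handshake) is left out; the function name and the returned labels are illustrative.

```python
from datetime import datetime, timezone

ALERT_DAYS = (30, 14, 7)


def expiry_alert(not_after, now=None):
    """Return the tightest alert threshold crossed, "expired", or None.

    not_after is the certificate's timezone-aware expiry datetime.
    """
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).days
    if days_left < 0:
        return "expired"
    crossed = [d for d in ALERT_DAYS if days_left <= d]
    return f"expires-within-{min(crossed)}d" if crossed else None
```

Returning the tightest threshold lets the alert escalate in severity as expiry approaches instead of firing the same message three times.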

Scripts

  • scripts/validate_dockerfile.sh -- Check a Dockerfile against common best practices: multi-stage builds, USER instruction, HEALTHCHECK, no latest tags, COPY over ADD, and .dockerignore presence. Run with --help for usage.
  • scripts/check_services.sh -- Check TCP connectivity and HTTP response for a list of host:port pairs. Reports status, latency, and HTTP status code. Run with --help for usage.

Code Examples

See CI/CD Pipeline Guide for a full GitHub Actions pipeline with security scanning, container build, and blue-green deployment with smoke tests.

See Infrastructure & Monitoring Guide for Terraform (launch template, ASG, ALB, CloudWatch alarm) and Prometheus configuration with alert rules.

Workflow

Step 1: Infrastructure Assessment

  • Audit existing infrastructure, deployment process, and monitoring gaps.
  • Map application dependencies and scaling requirements.
  • Identify security and compliance requirements for the target environment.

Step 2: Pipeline Design

  • Design CI/CD pipeline with security scanning integration.
  • Plan deployment strategy (blue-green, canary, rolling).
  • Create infrastructure as code templates.
  • Design monitoring and alerting strategy.

Step 3: Implementation

  • Set up CI/CD pipelines with automated testing.
  • Implement infrastructure as code with version control.
  • Configure monitoring, logging, and alerting systems.
  • Create disaster recovery and backup automation.

Step 4: Optimization and Maintenance

  • Monitor system performance and optimize resources.
  • Implement cost optimization strategies.
  • Create automated security scanning and compliance reporting.
  • Build self-healing systems with automated recovery.

Deliverables

  • Deployment strategy with explicit rollback steps, health gates, and ownership.
  • Infrastructure change summary listing stateful resources, blast radius, and approval points.
  • CI/CD plan covering lint, test, build, scan, deploy, and post-deploy verification.
  • Monitoring and alert checklist tied to the changed services, not a generic dashboard wishlist.

References

  • CI/CD Pipeline Guide -- GitHub Actions pipeline with security scanning, container build, and blue-green deployment.
  • Infrastructure & Monitoring Guide -- Terraform (launch template, ASG, ALB, CloudWatch alarm) and Prometheus configuration.
  • Kubernetes Patterns -- Production Deployment, HPA, PDB, ConfigMap/Secret mounting, Ingress with TLS, CronJob, and Helm values.
  • Docker Best Practices -- Multi-stage Dockerfiles (Node.js, Python, Go), .dockerignore, Docker Compose, and Trivy scanning.
  • Monitoring & Observability -- Structured logging, Prometheus metrics, Grafana dashboard, alert rules, OpenTelemetry tracing, and health checks.
  • Incident Triage -- Repo-first production incident flow, rollback decision tree, and evidence capture checklist.
  • Deployment Rollback Guide -- Canary, blue-green, rolling, schema-change, and feature-flag rollback patterns.