Senior DevOps Engineer
Overview
Design, build, and maintain production infrastructure and deployment pipelines. This skill covers Docker containerization, Kubernetes orchestration, CI/CD with GitHub Actions, infrastructure-as-code with Terraform/Pulumi, monitoring with Prometheus/Grafana, alerting strategies, zero-downtime deployments, and rollback procedures.
Phase 1: Infrastructure Design
- Define deployment topology (single server, cluster, multi-region)
- Choose containerization strategy (Docker, Buildpacks)
- Select orchestration platform (Kubernetes, ECS, Cloud Run)
- Plan networking (load balancers, DNS, TLS)
- Design secret management approach
STOP — Present infrastructure design to user for approval before implementation.
Infrastructure Decision Table
| Scale | Topology | Orchestration | Recommended |
|---|---|---|---|
| Hobby / MVP | Single server | Docker Compose | Railway, Fly.io |
| Startup (< 100k users) | Small cluster | ECS, Cloud Run | AWS ECS, GCP Cloud Run |
| Growth (100k - 1M users) | Multi-AZ cluster | Kubernetes | EKS, GKE |
| Enterprise (1M+ users) | Multi-region | Kubernetes + service mesh | EKS/GKE + Istio |
| Compliance-heavy | Dedicated/private cloud | Kubernetes | Self-managed K8s |
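The decision table can also be read as a simple lookup. The sketch below encodes its rows. The `HOBBY_MAX_USERS` cut-off is an assumption (the table gives no user count for the Hobby / MVP row), and `recommend_tier` is a hypothetical helper name, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    scale: str
    topology: str
    orchestration: str


# Assumption: the table does not say where "Hobby / MVP" ends.
HOBBY_MAX_USERS = 1_000


def recommend_tier(users: int, compliance_heavy: bool = False) -> Tier:
    """Return the decision-table row that matches an expected user count."""
    if users < 0:
        raise ValueError("users must be non-negative")
    if compliance_heavy:
        # Compliance requirements override scale in this sketch.
        return Tier("Compliance-heavy", "Dedicated/private cloud", "Kubernetes")
    if users < HOBBY_MAX_USERS:
        return Tier("Hobby / MVP", "Single server", "Docker Compose")
    if users < 100_000:
        return Tier("Startup", "Small cluster", "ECS, Cloud Run")
    if users < 1_000_000:
        return Tier("Growth", "Multi-AZ cluster", "Kubernetes")
    return Tier("Enterprise", "Multi-region", "Kubernetes + service mesh")
```

Treat the result as a starting point for the design review, not a final answer; team size and operational maturity can justify a different row.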
Phase 2: Pipeline Implementation
- Build CI pipeline (lint, test, build, security scan)
- Build CD pipeline (deploy to staging, production)
- Configure environment-specific settings
- Set up artifact registry (container images, packages)
- Implement deployment strategy (blue-green, canary, rolling)
STOP — Validate pipeline config syntax and present for review.
Phase 3: Observability
- Deploy monitoring stack (Prometheus, Grafana)
- Configure alerting rules and escalation
- Set up log aggregation
- Implement distributed tracing
- Create runbooks for common incidents
STOP — Verify monitoring covers all critical services before declaring complete.
Dockerfile Best Practices
FROM node:20-alpine AS base
WORKDIR /app

FROM base AS deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile --prod

FROM base AS build-deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile

FROM build-deps AS builder
COPY . .
RUN pnpm build

FROM base AS runner
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 app && \
    adduser --system --uid 1001 app
USER app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
    CMD wget -qO- http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/server.js"]
Key Dockerfile Rules
| Rule | Why |
|---|---|
| Multi-stage builds | Minimize image size |
| .dockerignore file | Exclude node_modules, .git, tests |
| Non-root user | Security hardening |
| Specific base image versions | Reproducible builds |
| Layer ordering (deps before src) | Cache efficiency |
| HEALTHCHECK instruction | Container health monitoring |
| No secrets in build args/layers | Prevent credential leaks |
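Some of these rules can be checked mechanically in CI. The following is a minimal sketch, not a replacement for a real Dockerfile linter: `check_dockerfile` is a hypothetical helper that covers only three rules (pinned base image, non-root user, HEALTHCHECK), and it approximates "final stage user" by the last USER instruction in the file.

```python
def check_dockerfile(text: str) -> list[str]:
    """Return rule violations found in a Dockerfile's text (three rules only)."""
    lines = [
        ln.strip()
        for ln in text.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    problems: list[str] = []
    stage_names: set[str] = set()

    for ln in lines:
        parts = ln.split()
        if parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        image = args[0]
        is_stage_ref = image in stage_names or image == "scratch"
        if not is_stage_ref and (":" not in image or image.endswith(":latest")):
            problems.append(f"unpinned base image: {image}")
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])

    # Approximation: the last USER instruction decides the runtime user.
    users = [ln.split()[1] for ln in lines if ln.upper().startswith("USER ")]
    if not users or users[-1] in ("root", "0"):
        problems.append("final stage runs as root")

    if not any(ln.upper().startswith("HEALTHCHECK") for ln in lines):
        problems.append("no HEALTHCHECK instruction")
    return problems
```

An empty list means none of the three checks failed; it says nothing about the rules the sketch does not cover, such as secrets in build args.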
Docker Compose Patterns
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/app
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
GitHub Actions Workflow
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm typecheck
      - run: pnpm test -- --coverage

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx audit-ci --moderate
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: HIGH,CRITICAL

  build-and-push:
    needs: [lint-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: echo "Deploying ${{ github.sha }}"
Terraform / Pulumi Patterns
Terraform Structure
modules/
  vpc/
    main.tf, variables.tf, outputs.tf
  ecs/
    main.tf, variables.tf, outputs.tf
environments/
  staging/
    main.tf, terraform.tfvars
  production/
    main.tf, terraform.tfvars
Key IaC Rules
| Rule | Why |
|---|---|
| Remote state backend (S3 + DynamoDB) | Shared state, locking |
| State locking | Prevent concurrent modifications |
| Environment-specific variable files | Separation of concerns |
| Module versioning | Reproducible shared infra |
| terraform plan in CI | Catch issues before apply |
| Drift detection on schedule | Detect manual changes |
| Tag all resources | Ownership, cost allocation |
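The "tag all resources" rule can be enforced as part of the plan step in CI. The sketch below assumes the plan has been exported with `terraform show -json <planfile>`, whose output lists planned changes under `resource_changes` with an `address` and a `change.after` object. The required tag keys in `REQUIRED_TAGS` are a hypothetical policy, and `untagged_resources` is an illustrative helper name.

```python
import json

# Hypothetical tagging policy; replace with the team's real tag keys.
REQUIRED_TAGS = frozenset({"owner", "cost-center"})


def untagged_resources(
    plan_json: str, required: frozenset[str] = REQUIRED_TAGS
) -> dict[str, list[str]]:
    """Map resource address -> missing tag keys for resources in a plan."""
    plan = json.loads(plan_json)
    missing: dict[str, list[str]] = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after") or {}
        # Skip deletions (after is null) and resource types without tags.
        if "tags" not in after:
            continue
        absent = required - set((after["tags"] or {}).keys())
        if absent:
            missing[rc["address"]] = sorted(absent)
    return missing
```

A CI job could fail the build when the returned mapping is non-empty and print the addresses so the author knows which resources to fix.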
Monitoring (Prometheus + Grafana)
USE Method (Resources)
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | cpu_usage_percent | cpu_throttled | n/a |
| Memory | memory_usage_bytes | oom_kills | n/a |
| Disk | disk_usage_percent | io_wait | disk_errors |
| Network | bytes_total | queue_length | errors_total |
RED Method (Services)
- Rate: requests per second
- Errors: failed requests per second (or as a fraction of Rate)
- Duration: latency distribution (p50, p95, p99)
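As an illustration of what the three RED signals measure, here is a small sketch that derives them from raw request samples. In practice Prometheus computes these from counters and histograms; the `Request` record, the nearest-rank percentile, and the "status >= 500 counts as an error" rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_s: float
    status: int


def percentile(sorted_values: list[float], q_percent: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty list."""
    # Integer ceiling of q_percent * n / 100, at least rank 1.
    rank = max(1, -(-q_percent * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration over one observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": errors / window_s,
        "error_ratio": errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```

Reporting percentiles rather than a mean keeps slow outliers visible, which is why the Duration signal lists p50, p95, and p99.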
Alerting Rules
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
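The `for: 5m` clause means the expression must stay true for five minutes before the alert fires, and a single evaluation below the threshold resets the timer. A minimal sketch of that behavior, using Prometheus's state names (inactive, pending, firing); the `ForClauseAlert` class is a hypothetical model, not Prometheus code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForClauseAlert:
    threshold: float
    for_seconds: float
    pending_since: Optional[float] = None

    def observe(self, now: float, value: float) -> str:
        """Feed one rule evaluation; return 'inactive', 'pending', or 'firing'."""
        if value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"
```

For the HighErrorRate rule this would be `ForClauseAlert(threshold=0.05, for_seconds=300)`: a brief spike stays pending and never pages anyone, which is one way to avoid flapping.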
Alerting Best Practices
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Reduces noise, focuses on impact |
| Every alert has a runbook link | Enables fast response |
| Tiered severity | critical=page, warning=ticket, info=log |
| Aggregate before alerting | Avoid flapping |
| Review and prune quarterly | Prevent alert fatigue |
Zero-Downtime Deployment Strategies
| Strategy | How It Works | Risk | Rollback Speed |
|---|---|---|---|
| Rolling | Replace instances one at a time | Low | Medium |
| Blue-Green | Switch traffic between two environments | Low | Instant |
| Canary | Route small % to new version, gradually increase | Very Low | Instant |
| Feature Flags | Deploy code dark, enable via flag | Very Low | Instant |
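The canary row can be sketched as a loop that raises the traffic share step by step and sends everything back to the stable version on the first bad reading. `set_traffic_percent` and `error_ratio` are hypothetical callbacks standing in for the load balancer or service mesh API and the metrics query; the 5% error limit is likewise an assumption.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    error_ratio: Callable[[], float],
    max_error_ratio: float = 0.05,
) -> bool:
    """Shift traffic to the new version in steps; return False if rolled back."""
    for percent in steps:
        set_traffic_percent(percent)
        # A real rollout would wait here to collect enough traffic.
        if error_ratio() > max_error_ratio:
            # Instant rollback: route all traffic to the old version.
            set_traffic_percent(0)
            return False
    return True
```

Called with steps such as `[5, 25, 50, 100]`, the loop limits the blast radius of a bad release to the share of traffic at the step where the problem appears.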
Rollback Procedures
- Automated: health check fails -> automatic rollback
- Manual: kubectl rollout undo deployment/app
- Database: forward-only migrations with backward compatibility
- Config: revert via secret manager version
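The automated path can be sketched as a post-deploy gate that runs the same `kubectl rollout undo` command when health checks keep failing. `check_health` is a hypothetical callback (for example, an HTTP probe of the /health endpoint), and the attempt count and delay are assumptions; the `run` parameter exists only so the command can be replaced in tests.

```python
import subprocess
import time
from typing import Callable


def rollback_if_unhealthy(
    check_health: Callable[[], bool],
    deployment: str = "deployment/app",
    attempts: int = 3,
    delay_s: float = 5.0,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Roll back the deployment if every health check fails; return True if so."""
    for attempt in range(attempts):
        if check_health():
            return False  # healthy: keep the new release
        if attempt < attempts - 1:
            time.sleep(delay_s)
    # All attempts failed: revert to the previous revision.
    run(["kubectl", "rollout", "undo", deployment], check=True)
    return True
```

Requiring several consecutive failures avoids rolling back on a single slow response during startup.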
Database Migration Safety
| Rule | Rationale |
|---|---|
| Migrations must be backward compatible | Old code + new schema must work |
| Never rename/drop columns in same deploy | Two-phase change required |
| Two-phase: add column -> deploy -> remove old | Zero-downtime schema evolution |
| Always test rollback of each migration | Ensure reversibility |
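A small sketch of the two-phase pattern, using an in-memory SQLite database so it runs anywhere. The `users` table and the `fullname` to `display_name` rename are hypothetical. Only the first phase (add and backfill) is executed; the second phase, dropping `fullname`, belongs to a later deploy once no running code reads it.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Phase 1: add the new column next to the old one and backfill it."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute(
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
    )


def old_code_insert(conn: sqlite3.Connection, name: str) -> None:
    """Old release: only knows about the fullname column."""
    conn.execute("INSERT INTO users (fullname) VALUES (?)", (name,))


def new_code_read(conn: sqlite3.Connection) -> list[str]:
    """New release: prefers display_name, falls back for rows from old code."""
    rows = conn.execute(
        "SELECT COALESCE(display_name, fullname) FROM users ORDER BY id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT NOT NULL)"
)
old_code_insert(conn, "Ada")
expand(conn)
old_code_insert(conn, "Grace")  # old code still works against the new schema
names = new_code_read(conn)
```

Because the new column is nullable and the old column is untouched, both releases can run side by side during a rolling deploy, which is what "old code + new schema must work" requires.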
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | What to Do Instead |
|---|---|---|
| Manual production deployments | No audit trail, error-prone | Automate via CI/CD |
| Shared or hardcoded secrets | Security breach risk | Use secrets manager |
| No rollback plan before deploying | Stuck if deploy fails | Document rollback before every deploy |
| latest tag for production images | Non-reproducible | Pin specific version tags |
| Running containers as root | Security vulnerability | Use non-root user in Dockerfile |
| Alert fatigue from non-actionable alerts | Real issues get missed | Alert on symptoms, tune thresholds |
| Skipping staging environment | Bugs found in production | Always deploy to staging first |
| Snowflake servers with manual config | Cannot reproduce, cannot scale | Infrastructure as code |
| Monitoring without alerting | Nobody notices problems | Wire alerts to monitoring |
Key Principles
- Infrastructure as code — no manual changes to production
- Immutable infrastructure — replace, do not patch
- Cattle, not pets — servers are disposable
- Shift left security — scan early in pipeline
- Least privilege — minimal permissions everywhere
- Automate everything that runs more than twice
- Test the disaster recovery plan regularly
Documentation Lookup (Context7)
Use mcp__context7__resolve-library-id then mcp__context7__query-docs for up-to-date docs. Returned docs override memorized knowledge.
- docker: for Dockerfile syntax, compose configuration, or multi-stage builds
- kubernetes: for resource manifests, kubectl commands, or Helm charts
- terraform: for provider configuration, resource blocks, or state management
Integration Points
| Skill | Integration |
|---|---|
| deployment | Provides higher-level deploy pipeline orchestration |
| security-review | Security scan stage in CI pipeline |
| planning | Infrastructure changes are planned like features |
| verification-before-completion | Post-deploy verification gate |
| finishing-a-development-branch | Merge triggers deployment pipeline |
| mcp-builder | MCP servers need containerization and deployment |
Skill Type
FLEXIBLE — Adapt tooling and patterns to the project's cloud provider, team size, and operational maturity. The principles (IaC, immutability, observability) are constant; the specific tools are interchangeable.