Senior DevOps Engineer

The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.

Quick Start

# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose

Tools Overview

Tool	Input	Output
`pipeline_generator.py`	Project path	CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI)
`terraform_scaffolder.py`	Target path + provider	Terraform module structure with state config
`deployment_manager.py`	Target path + strategy	Deployment plan with health checks and rollback

All tools support --json for machine-readable output and --output / -o for file writing.

Workflow 1: Containerize and Deploy

Step 1 -- Build a production Dockerfile.

The agent generates multi-stage Dockerfiles following this pattern:

# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]

Validation checkpoint: Image builds with docker build -t app:test . and docker run --rm app:test returns healthy.

Step 2 -- Deploy to Kubernetes.

The agent creates a Deployment with probes, resource limits, and security context:

spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10

Probe decision:

startupProbe: Slow-starting apps (JVM, model loading). Prevents liveness from killing during startup.
livenessProbe: Detects deadlocks. Keep simple -- do not check downstream dependencies.
readinessProbe: Controls traffic routing. Include dependency checks here.

Validation checkpoint: kubectl get pods -l app=myapp shows all pods Running and Ready.

Workflow 2: Infrastructure as Code with Terraform

Step 1 -- Scaffold the module structure.

python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose

The agent produces:

infrastructure/
  modules/
    vpc/         # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/     # main.tf, terraform.tfvars, backend.tf
    production/

Step 2 -- Configure remote state.

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Step 3 -- Run drift detection in CI.

terraform plan -detailed-exitcode -out=plan.tfplan
# Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected

Validation checkpoint: terraform plan shows no unexpected changes. Drift alerts fire within 24 hours.

Key rules:

One state file per environment per component (blast radius control)
Never store state locally or in git
Run terraform plan in CI, terraform apply only after approval
Use directories for environment separation, modules for shared logic

Workflow 3: CI/CD Pipeline Design

python scripts/pipeline_generator.py /path/to/project --platform github-actions --json

The agent generates pipelines following these principles:

Fail fast -- lint and unit tests before expensive integration tests
Cache aggressively -- node_modules, Docker layers, pip packages
Immutable artifacts -- build once, deploy the same artifact everywhere
Gate promotions -- manual approval or smoke tests before production
Parallel execution -- independent test suites and security scans run concurrently

Example: GitHub Actions with matrix testing and deployment gates

jobs:
  test:
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait

  deploy-production:
    needs: deploy-staging
    environment: production  # requires manual approval

Validation checkpoint: Pipeline runs in under 15 minutes. All stages produce exit code 0.

Deployment Strategy Selection

Strategy	Risk	Rollback Speed	Infra Cost	Best For
Rolling	Medium	Minutes	1x	Stateless services, internal APIs
Blue-Green	Low	Seconds	2x	Mission-critical, zero-downtime
Canary	Low	Seconds	1.1x	User-facing, gradual validation
Feature Flags	Lowest	Instant	1x	Granular control, A/B testing

Canary promotion ladder:

Deploy at 5% traffic. Monitor error rate and latency for 10 min.
Promote to 25%. Monitor 10 min.
Promote to 50%. Monitor 15 min.
Promote to 100%.
Automated rollback if error rate exceeds baseline by 2x at any step.

Monitoring Essentials

Every service dashboard includes the Four Golden Signals:

Latency -- P50, P90, P99 response times
Traffic -- Requests per second by endpoint and status code
Errors -- 5xx rate, 4xx rate, application error codes
Saturation -- CPU, memory, connection pool, queue depth

SLO targets (example):

Service	SLI	SLO	Error Budget
API Gateway	Successful requests / Total	99.9% (43.8 min/month downtime)	0.1%
API Latency	Requests < 500ms / Total	P99 < 500ms	1%

When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.

Anti-Patterns

Monolithic state -- one Terraform state for everything. Split by component and environment.
latest tag in production -- always use specific image tags.
Secrets in image layers -- inject at runtime via environment or mounted secrets. Verify with docker history --no-trunc.
No resource limits -- every container needs CPU/memory limits to prevent noisy-neighbor attacks.
Manual deployments -- automate with approval gates instead.

Troubleshooting

Problem	Cause	Solution
Terraform state lock stuck	Interrupted `terraform apply` left DynamoDB lock	`terraform force-unlock <LOCK_ID>` after confirming no apply running
Pods in `CrashLoopBackOff`	Failing health checks or missing config/secrets	`kubectl logs <pod>`, verify ConfigMaps/Secrets, increase `startupProbe.failureThreshold`
Docker builds slow (10+ min)	Layer cache invalidated by early COPY of changing files	Copy dependency manifests before source; use BuildKit cache mounts
Helm upgrade fails "another operation in progress"	Previous release in pending/failed state	`helm history <release>`, then `helm rollback <release> <last-good>`
Canary shows healthy but users report errors	Metrics aggregated across all pods mask canary errors	Use per-revision metric labels; configure Istio/Nginx to tag canary traffic

References

Guide	Path	Content
CI/CD Pipeline Guide	`references/cicd_pipeline_guide.md`	Pipeline patterns, platform comparisons, optimization
Infrastructure as Code	`references/infrastructure_as_code.md`	Terraform patterns, module design, state management
Deployment Strategies	`references/deployment_strategies.md`	Strategy details, rollback procedures, traffic management

See also: references/kubernetes_patterns.md for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns. references/cloud_platform_guide.md for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.

Integration Points

Skill	Integration
`senior-secops`	Security scanning in CI/CD, container image scanning, compliance checks
`senior-architect`	Infrastructure design decisions, service topology
`senior-backend`	Application containerization, health endpoints, config management
`code-reviewer`	Terraform plan review, pipeline config review
`incident-commander`	Incident escalation, postmortem, rollback procedures

Last Updated: April 2026 Version: 2.1.0

senior-devops