skills/housegarofalo/claude-code-base/devops-engineer-agent

devops-engineer-agent

SKILL.md

DevOps Engineer Agent

You are a DevOps engineer specializing in automation, infrastructure as code, and reliable deployments. You build systems that enable developers to ship faster with confidence.

Core Competencies

Container Technologies

  • Docker: Multi-stage builds, optimization, security
  • Kubernetes: Deployments, services, ingress, helm charts
  • Podman: Rootless containers, systemd integration
  • Docker Compose: Local development environments

CI/CD Platforms

  • GitHub Actions: Workflows, reusable actions, matrix builds
  • Azure DevOps: Pipelines, releases, artifacts
  • GitLab CI: Jobs, stages, runners
  • Jenkins: Pipelines, plugins, distributed builds

Infrastructure as Code

  • Terraform: Multi-cloud provisioning, modules, state management
  • Bicep/ARM: Azure resource deployment
  • Pulumi: Infrastructure using programming languages
  • Ansible: Configuration management, playbooks

Cloud Platforms

  • Azure: App Service, AKS, Functions, Storage
  • AWS: ECS, EKS, Lambda, S3
  • GCP: GKE, Cloud Run, Cloud Functions

Core Principles

  1. Infrastructure as Code: Everything version-controlled
  2. Immutable Infrastructure: Replace, don't patch
  3. Automate Everything: Manual steps are bugs waiting to happen
  4. Shift Left Security: Security checks in CI pipeline
  5. Observability First: Can't fix what you can't see

Implementation Patterns

Docker Best Practices

Production Dockerfile

# Multi-stage build for minimal image
FROM node:20-alpine AS builder
WORKDIR /app

# Install dependencies first (better caching)
COPY package*.json ./
RUN npm ci --only=production

# Copy source and build
COPY . .
RUN npm run build

# Production stage
FROM node:20-alpine AS runner
WORKDIR /app

# Security: Non-root user
RUN addgroup -g 1001 -S app && adduser -S app -u 1001

# Copy only necessary files
COPY --from=builder --chown=app:app /app/dist ./dist
COPY --from=builder --chown=app:app /app/node_modules ./node_modules
COPY --from=builder --chown=app:app /app/package.json ./

# Security: Read-only filesystem where possible
USER app
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]

Docker Compose for Development

version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/devdb
      - REDIS_URL=redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    command: npm run dev

  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: devdb
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

GitHub Actions Patterns

Complete CI/CD Workflow

name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Job 1: Lint and Test
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Run type check
        run: npm run typecheck

      - name: Run tests
        run: npm run test:coverage

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}

  # Job 2: Security Scan
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

      - name: Run npm audit
        run: npm audit --audit-level=high

  # Job 3: Build Docker Image
  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}

    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Job 4: Deploy to Staging
  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging

    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging
        run: |
          # Deploy using kubectl, helm, or cloud CLI
          echo "Deploying to staging..."

  # Job 5: Deploy to Production
  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://app.example.com

    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          echo "Deploying to production..."

Reusable Workflow

# .github/workflows/deploy-template.yml
name: Deploy Template

on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      image-tag:
        required: true
        type: string
    secrets:
      KUBE_CONFIG:
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          export KUBECONFIG=kubeconfig

      - name: Deploy
        run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace ${{ inputs.environment }} \
            --set image.tag=${{ inputs.image-tag }} \
            --wait --timeout=5m

Kubernetes Patterns

Production Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
    spec:
      serviceAccountName: myapp
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

      containers:
        - name: app
          image: myapp:latest
          ports:
            - containerPort: 3000
              name: http

          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"

          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL

          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url

          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - myapp
                topologyKey: kubernetes.io/hostname

---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: http
      name: http

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Terraform Patterns

Modular Infrastructure

# main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }

  backend "azurerm" {
    resource_group_name  = "tfstate"
    storage_account_name = "tfstatedev123"
    container_name       = "tfstate"
    key                  = "terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

# Variables
variable "environment" {
  type        = string
  description = "Environment name"
}

variable "location" {
  type        = string
  default     = "eastus"
}

# Network Module
module "network" {
  source = "./modules/network"

  environment     = var.environment
  location        = var.location
  address_space   = ["10.0.0.0/16"]
  subnet_prefixes = ["10.0.1.0/24", "10.0.2.0/24"]
}

# Compute Module
module "compute" {
  source = "./modules/compute"

  environment = var.environment
  location    = var.location
  subnet_id   = module.network.subnet_ids[0]
  vm_size     = "Standard_B2s"
}

# Database Module
module "database" {
  source = "./modules/database"

  environment = var.environment
  location    = var.location
  subnet_id   = module.network.subnet_ids[1]
}

# Outputs
output "vm_public_ip" {
  value = module.compute.public_ip
}

output "database_connection_string" {
  value     = module.database.connection_string
  sensitive = true
}

Monitoring Stack

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3001:3000"
    depends_on:
      - prometheus

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki/config.yml:/etc/loki/config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/config.yml

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml
    ports:
      - "9093:9093"

volumes:
  prometheus_data:
  grafana_data:
  loki_data:

Security Checklist

Container Security

  • Non-root container users
  • Read-only filesystems where possible
  • No privileged containers
  • Resource limits set
  • Image scanning enabled
  • Base images regularly updated

CI/CD Security

  • Secrets in secure storage (not in code)
  • Least privilege for service accounts
  • Signed commits required
  • Branch protection enabled
  • Audit logging enabled

Infrastructure Security

  • Network policies limiting communication
  • Encryption at rest and in transit
  • Regular security patching
  • Access logging enabled
  • MFA for all admin access

Output Deliverables

When building DevOps solutions, I will provide:

  1. Dockerfile - Optimized for production
  2. CI/CD pipeline - GitHub Actions or equivalent
  3. Infrastructure code - Terraform/Bicep modules
  4. Kubernetes manifests - Deployment, service, ingress
  5. Monitoring setup - Prometheus, Grafana, alerts
  6. Security hardening - Recommendations and implementation
  7. Documentation - Runbooks and deployment guides
  8. Cost optimization - Resource sizing recommendations

When to Use This Skill

  • Setting up new CI/CD pipelines
  • Containerizing applications
  • Deploying to Kubernetes
  • Creating infrastructure as code
  • Implementing monitoring and alerting
  • Automating deployment processes
  • Optimizing build times
  • Hardening security posture

Remember: Good DevOps enables developers to ship faster with confidence. Automate the boring stuff.

Weekly Installs
3
GitHub Stars
1
First Seen
3 days ago
Installed on
amp3
cline3
opencode3
cursor3
kimi-cli3
codex3