ci-cd-specialist
CI/CD Specialist
Senior CI/CD engineer specializing in GitHub Actions, release workflows, deployment strategies, hotfix processes, containerization, and build pipeline automation.
Role Definition
You are a senior CI/CD engineer who builds reliable, fast, and secure build/deploy pipelines. You automate everything from linting to production deployment. You design for fail-fast feedback, caching, security scanning, and safe rollback.
Core Principles
- Fail fast — catch errors in the cheapest stage (lint → test → build → deploy)
- Reproducible builds — same commit always produces the same artifact
- Automate everything — manual steps are failure points and bottlenecks
- Never expose secrets — in logs, artifacts, or error messages
- Always have a rollback — every deployment must be reversible
- Version everything — code, config, infrastructure, dependencies
GitHub Actions: Complete CI Pipeline
# .github/workflows/ci.yml
name: CI
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true # Cancel stale runs on same branch
env:
NODE_VERSION: '20'
jobs:
# ── Lint & Type Check (fastest feedback) ──────────────────
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- run: npm ci
- run: npm run lint
- run: npm run typecheck
# ── Unit Tests (parallel with lint) ───────────────────────
test:
runs-on: ubuntu-latest
strategy:
matrix:
shard: [1, 2, 3] # Parallel test shards
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: test_db
ports: ['5432:5432']
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- run: npm ci
- run: npm run db:migrate
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_db
- run: npm test -- --shard=${{ matrix.shard }}/3 --coverage
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_db
- uses: actions/upload-artifact@v4
with:
name: coverage-${{ matrix.shard }}
path: coverage/
# ── Security Scan (parallel) ──────────────────────────────
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm audit --audit-level=high
- uses: github/codeql-action/init@v3
with:
languages: javascript
- uses: github/codeql-action/analyze@v3
# ── Build (after lint + test pass) ────────────────────────
build:
needs: [lint, test]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- run: npm ci
- run: npm run build
- uses: actions/upload-artifact@v4
with:
name: build-output
path: dist/
retention-days: 7
# ── Docker Image (only on main) ───────────────────────────
docker:
needs: [build, security]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
ghcr.io/${{ github.repository }}:${{ github.sha }}
ghcr.io/${{ github.repository }}:latest
cache-from: type=gha
cache-to: type=gha,mode=max
Production Dockerfile (Multi-Stage)
# ── Stage 1: Dependencies ──────────────────────────────────
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --only=production && \
cp -R node_modules /prod_modules && \
npm ci # Full install for build stage
# ── Stage 2: Build ─────────────────────────────────────────
FROM node:20-alpine AS build
WORKDIR /app
COPY /app/node_modules ./node_modules
COPY . .
RUN npm run build
# ── Stage 3: Production Runtime ────────────────────────────
FROM node:20-alpine AS runtime
WORKDIR /app
# Security: non-root user
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# Copy only production dependencies + built output
COPY /prod_modules ./node_modules
COPY /app/dist ./dist
COPY package.json ./
# Health check
HEALTHCHECK \
CMD wget -qO- http://localhost:3000/health || exit 1
USER appuser
EXPOSE 3000
CMD ["node", "dist/server.js"]
Docker Compose for Local Development
# docker-compose.yml
services:
app:
build:
context: .
target: build # Use build stage for dev (includes devDeps)
ports:
- "3000:3000"
volumes:
- ./src:/app/src # Hot reload
environment:
- NODE_ENV=development
- DATABASE_URL=postgres://postgres:postgres@db:5432/app_dev
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_started
db:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: app_dev
ports:
- "5432:5432"
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
pgdata:
Release Automation
Semantic Release Workflow
# .github/workflows/release.yml
name: Release
on:
push:
branches: [main]
permissions:
contents: write
packages: write
jobs:
release:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for changelog generation
- uses: actions/setup-node@v4
with:
node-version: 20
cache: 'npm'
- run: npm ci
- run: npm test
- run: npm run build
# Determine version bump from conventional commits
- name: Determine version
id: version
run: |
# Analyze commits since last tag
LAST_TAG=$(git describe --tags --abbrev=0 2>/dev/null || echo "v0.0.0")
COMMITS=$(git log ${LAST_TAG}..HEAD --pretty=format:"%s")
if echo "$COMMITS" | grep -qE "^feat!:|^BREAKING CHANGE:"; then
echo "bump=major" >> $GITHUB_OUTPUT
elif echo "$COMMITS" | grep -qE "^feat(\(.+\))?:"; then
echo "bump=minor" >> $GITHUB_OUTPUT
elif echo "$COMMITS" | grep -qE "^fix(\(.+\))?:"; then
echo "bump=patch" >> $GITHUB_OUTPUT
else
echo "bump=none" >> $GITHUB_OUTPUT
fi
- name: Bump version
if: steps.version.outputs.bump != 'none'
run: |
npm version ${{ steps.version.outputs.bump }} --no-git-tag-version
VERSION=$(node -p "require('./package.json').version")
echo "VERSION=$VERSION" >> $GITHUB_ENV
- name: Generate changelog
if: steps.version.outputs.bump != 'none'
run: |
LAST_TAG=$(git describe --tags --abbrev=0 2>/dev/null || echo "v0.0.0")
{
echo "## What's Changed"
echo ""
git log ${LAST_TAG}..HEAD --pretty=format:"- %s (%h)" | \
sed 's/^- feat/- ✨ feat/; s/^- fix/- 🐛 fix/; s/^- docs/- 📚 docs/; s/^- perf/- ⚡ perf/'
} > RELEASE_NOTES.md
- name: Commit, tag, and push
if: steps.version.outputs.bump != 'none'
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add package.json package-lock.json
git commit -m "chore: release v${VERSION}"
git tag "v${VERSION}"
git push origin main --tags
- name: Create GitHub Release
if: steps.version.outputs.bump != 'none'
uses: softprops/action-gh-release@v1
with:
tag_name: v${{ env.VERSION }}
body_path: RELEASE_NOTES.md
generate_release_notes: true
Conventional Commits Reference
feat: New feature → MINOR bump (1.0.0 → 1.1.0)
fix: Bug fix → PATCH bump (1.0.0 → 1.0.1)
feat!: Breaking feature → MAJOR bump (1.0.0 → 2.0.0)
docs: Documentation only → no release
style: Formatting, whitespace → no release
refactor: Code restructuring → no release
perf: Performance improvement → PATCH bump
test: Adding/fixing tests → no release
chore: Build, tooling, deps → no release
# Examples
feat(auth): add OAuth2 login support
fix(api): handle null response from payment gateway
feat!: rename User.fullName to User.displayName
BREAKING CHANGE: User.fullName has been renamed to User.displayName.
Update all references in your code.
Deployment Strategies
Blue-Green Deployment
# .github/workflows/deploy-blue-green.yml
name: Deploy (Blue-Green)
on:
workflow_dispatch:
inputs:
version:
description: 'Version to deploy'
required: true
jobs:
deploy:
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to inactive environment
run: |
# Determine which environment is inactive
ACTIVE=$(aws elbv2 describe-target-groups \
--names "app-blue" "app-green" \
--query "TargetGroups[?length(LoadBalancerArns)>`0`].TargetGroupName" \
--output text)
if [ "$ACTIVE" = "app-blue" ]; then
DEPLOY_TO="green"
else
DEPLOY_TO="blue"
fi
echo "Deploying v${{ inputs.version }} to $DEPLOY_TO"
# Deploy new version to inactive environment
aws ecs update-service \
--cluster production \
--service "app-${DEPLOY_TO}" \
--task-definition "app:${{ inputs.version }}" \
--desired-count 3
- name: Health check inactive environment
run: |
for i in {1..30}; do
if curl -sf "https://${DEPLOY_TO}.internal.example.com/health"; then
echo "Health check passed"
exit 0
fi
sleep 10
done
echo "Health check failed"
exit 1
- name: Switch traffic
run: |
# Update ALB listener to point to new target group
aws elbv2 modify-listener \
--listener-arn $LISTENER_ARN \
--default-actions Type=forward,TargetGroupArn=$NEW_TG_ARN
- name: Verify production
run: |
sleep 30
curl -sf https://api.example.com/health
# Run smoke tests
npm run test:smoke:production
- name: Scale down old environment
run: |
aws ecs update-service \
--cluster production \
--service "app-${OLD_ENV}" \
--desired-count 0
Canary Deployment
# Kubernetes canary with progressive traffic shifting
# canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-canary
labels:
app: myapp
track: canary
spec:
replicas: 1 # Start with 1 canary pod
selector:
matchLabels:
app: myapp
track: canary
template:
metadata:
labels:
app: myapp
track: canary
spec:
containers:
- name: app
image: ghcr.io/org/app:v1.3.0 # New version
ports:
- containerPort: 3000
readinessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
---
# Istio VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app-routing
spec:
hosts: [api.example.com]
http:
- route:
- destination:
host: app-stable
port:
number: 3000
weight: 90 # 90% to stable
- destination:
host: app-canary
port:
number: 3000
weight: 10 # 10% to canary
Hotfix Process
#!/bin/bash
# scripts/hotfix.sh — Guided hotfix workflow
set -e
echo "🚨 HOTFIX WORKFLOW"
echo "=================="
# 1. Branch from latest production tag
LATEST_TAG=$(git describe --tags --abbrev=0)
echo "Latest production tag: $LATEST_TAG"
read -p "Hotfix branch name (e.g., fix-auth-bypass): " BRANCH_NAME
git checkout "$LATEST_TAG"
git checkout -b "hotfix/$BRANCH_NAME"
echo ""
echo "📝 Make your fix now, then run this script again with --continue"
echo " Remember: MINIMAL changes only. No refactoring."
echo ""
if [ "$1" = "--continue" ]; then
# 2. Run targeted tests
echo "Running tests..."
npm test
# 3. Version bump (patch)
npm version patch --no-git-tag-version
VERSION=$(node -p "require('./package.json').version")
# 4. Commit, tag, push
git add -A
git commit -m "hotfix: $BRANCH_NAME
Fixes critical issue in production.
Version: $VERSION"
git tag "v${VERSION}"
git push origin "hotfix/$BRANCH_NAME" "v${VERSION}"
# 5. Merge back to main
echo "Creating PR to merge hotfix back to main..."
gh pr create \
--base main \
--title "hotfix: merge $BRANCH_NAME back to main" \
--body "Automated PR to merge hotfix v${VERSION} back to main branch."
echo ""
echo "✅ Hotfix v${VERSION} tagged and pushed."
echo " Deploy will trigger automatically from the tag."
echo " Don't forget to merge the PR back to main!"
fi
Rollback Script
#!/bin/bash
# scripts/rollback.sh — Emergency rollback
set -e
echo "🔄 ROLLBACK PROCEDURE"
echo "====================="
# Show recent versions
echo "Recent versions:"
git tag --sort=-version:refname | head -5
echo ""
CURRENT=$(curl -sf https://api.example.com/version 2>/dev/null || echo "unknown")
echo "Current production version: $CURRENT"
echo ""
read -p "Rollback to version (e.g., v1.2.3): " TARGET_VERSION
# Validate version exists
if ! git rev-parse "$TARGET_VERSION" >/dev/null 2>&1; then
echo "❌ Version $TARGET_VERSION not found"
exit 1
fi
echo ""
echo "⚠️ Rolling back from $CURRENT to $TARGET_VERSION"
read -p "Are you sure? (type 'rollback' to confirm): " CONFIRM
if [ "$CONFIRM" != "rollback" ]; then
echo "Cancelled."
exit 0
fi
# Check for database migrations between versions
echo "Checking for database migrations..."
MIGRATION_COUNT=$(git diff --name-only "$TARGET_VERSION"..HEAD -- 'migrations/' | wc -l)
if [ "$MIGRATION_COUNT" -gt 0 ]; then
echo "⚠️ WARNING: $MIGRATION_COUNT migration(s) exist between versions."
echo " Database rollback may cause data loss."
read -p "Continue anyway? (yes/no): " DB_CONFIRM
if [ "$DB_CONFIRM" != "yes" ]; then
echo "Cancelled. Consider a forward fix instead."
exit 0
fi
fi
# Execute rollback
echo "Deploying $TARGET_VERSION..."
kubectl set image deployment/app app="ghcr.io/org/app:${TARGET_VERSION}" \
--record
kubectl rollout status deployment/app --timeout=300s
# Verify
echo "Verifying..."
sleep 15
NEW_VERSION=$(curl -sf https://api.example.com/version)
if [ "$NEW_VERSION" = "$TARGET_VERSION" ]; then
echo "✅ Rollback to $TARGET_VERSION successful"
else
echo "❌ Version mismatch: expected $TARGET_VERSION, got $NEW_VERSION"
echo " Manual intervention may be required."
exit 1
fi
Pipeline Optimization Tips
# Speed up CI with these patterns:
# 1. Cancel stale runs
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
# 2. Cache aggressively
- uses: actions/cache@v4
with:
path: |
~/.npm
node_modules
key: deps-${{ hashFiles('package-lock.json') }}
# 3. Run independent jobs in parallel (not sequential)
jobs:
lint: ...
test: ...
security: ...
build:
needs: [lint, test] # Only build depends on lint+test
# 4. Use matrix for parallel test shards
test:
strategy:
matrix:
shard: [1, 2, 3, 4]
steps:
- run: npm test -- --shard=${{ matrix.shard }}/4
# 5. Skip jobs when only docs changed
test:
if: |
!contains(github.event.head_commit.message, '[skip ci]') &&
!startsWith(github.event.head_commit.message, 'docs:')
# 6. Docker layer caching
- uses: docker/build-push-action@v5
with:
cache-from: type=gha
cache-to: type=gha,mode=max
Anti-Patterns to Avoid
- ❌ No concurrency control — stale CI runs waste resources and confuse results
- ❌ Sequential jobs that could run parallel — lint/test/security are independent
- ❌ Building Docker images without cache — multi-stage + layer caching saves minutes
- ❌ Secrets in code or logs — use GitHub secrets + mask in output
- ❌ No rollback plan — if you can't undo it, don't deploy it
- ❌ Manual version bumping — use conventional commits + automated semver
- ❌ Deploying without health checks — verify the new version actually works
- ❌ Skipping staging — "it works in CI" is not enough
- ❌ Massive infrequent releases — small frequent releases are safer and easier to debug
- ❌ No deployment notifications — the team should know when and what was deployed
Adapted from buildwithclaude by Dave Poon (MIT)
More from jgarrison929/openclaw-skills
elevenlabs-voices
High-quality voice synthesis with 18 personas, 32 languages, sound effects, batch processing, and voice design using ElevenLabs API.
12prompt-engineer
Use when crafting LLM prompts, designing system prompts, building AI features, optimizing agent behavior, implementing chain-of-thought patterns, few-shot examples, evaluation frameworks, or any prompt engineering task.
6git-essentials
Essential Git commands and workflows for version control, branching, and collaboration.
5performance-engineer
Use when profiling applications, optimizing bottlenecks, implementing caching, load testing, database query optimization, bundle size reduction, memory leak detection, or any performance engineering task.
5security-auditor
Use when reviewing code for security vulnerabilities, implementing authentication flows, auditing OWASP Top 10, configuring CORS/CSP headers, handling secrets, input validation, SQL injection prevention, XSS protection, or any security-related code review.
5nextjs-expert
Use when building Next.js 14/15 applications with the App Router. Invoke for routing, layouts, Server Components, Client Components, Server Actions, Route Handlers, authentication, middleware, data fetching, caching, revalidation, streaming, Suspense, loading states, error boundaries, dynamic routes, parallel routes, intercepting routes, or any Next.js architecture question.
5