devops-engineer
DevOps Engineer
CI/CD pipeline design, optimization, and deployment strategy. Six modes: generate workflows, generate individual actions, optimize build times, design deployment strategies, review existing pipelines, and debug CI failures.
Scope: CI/CD pipelines and deployment automation only. NOT for infrastructure provisioning (infrastructure-coder), application code, monitoring setup, or database migrations (database-architect).
Canonical Vocabulary
Use these terms exactly throughout all modes:
| Term | Definition |
|---|---|
| workflow | A CI/CD pipeline definition file (.github/workflows/*.yml, .gitlab-ci.yml) |
| job | A named unit of work within a workflow containing one or more steps |
| step | A single action within a job (run command, uses action) |
| stage | A logical grouping of jobs (build, test, deploy) |
| artifact | Build output passed between jobs or stages |
| cache | Dependency/build cache persisted across runs to reduce build time |
| matrix | Parameterized job expansion across multiple configurations |
| concurrency group | Mutual exclusion mechanism preventing parallel runs |
| environment | Deployment target with protection rules (staging, production) |
| promotion | Moving artifacts through environments (dev -> staging -> prod) |
| rollback | Reverting a deployment to a previous known-good state |
| canary | Incremental traffic shift to new version (1% -> 5% -> 25% -> 100%) |
| blue/green | Two identical environments with instant traffic switch |
| rolling | Gradual instance-by-instance replacement |
| gate | Manual or automated approval checkpoint before deployment proceeds |
| runner | Execution environment for CI/CD jobs (GitHub-hosted, self-hosted) |
| reusable workflow | Callable workflow template invoked from other workflows |
| composite action | Multi-step action packaged as a single reusable unit |
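The canary definition above implies a stepwise rollout loop with a rollback on failure. A minimal sketch, assuming two hypothetical callbacks supplied by the caller, `set_traffic` and `is_healthy`; neither is part of this skill, and a real rollout would also wait and observe between steps:

```python
from typing import Callable, Sequence

# Traffic percentages from the canary definition (1% -> 5% -> 25% -> 100%).
CANARY_STEPS: Sequence[int] = (1, 5, 25, 100)


def run_canary(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> bool:
    """Shift traffic step by step; roll back to 0% on the first failed check.

    set_traffic and is_healthy are hypothetical hooks. In practice they would
    wrap a load balancer or service mesh API and a health endpoint.
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy():
            set_traffic(0)  # rollback: route all traffic to the previous version
            return False
    return True
```

Passing the hooks in keeps the loop independent of any particular traffic-management tool.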
Dispatch
| $ARGUMENTS | Mode |
|---|---|
| `pipeline <requirements>` | Generate: new CI/CD workflow from requirements |
| `action <description>` | Action: GitHub Action step/job generation |
| `optimize <workflow>` | Optimize: pipeline build time optimization |
| `deploy <strategy>` | Deploy: deployment strategy design |
| `review <workflow>` | Review: audit existing pipeline |
| `debug <logs>` | Debug: analyze CI failure logs |
| Natural language about CI/CD | Auto-detect appropriate mode |
| Empty | Show mode menu with examples |
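The dispatch table can be read as a keyword lookup on the first token of `$ARGUMENTS`. A minimal sketch under that assumption; the function name `dispatch` and the `"Menu"` / `"Auto-detect"` labels are illustrative, and the actual auto-detection logic for natural language is not specified here:

```python
MODES = {
    "pipeline": "Generate",
    "action": "Action",
    "optimize": "Optimize",
    "deploy": "Deploy",
    "review": "Review",
    "debug": "Debug",
}


def dispatch(arguments: str) -> tuple[str, str]:
    """Return (mode, remaining arguments) for a raw $ARGUMENTS string."""
    parts = arguments.split(maxsplit=1)
    if not parts:
        return ("Menu", "")  # empty input: show the mode menu
    mode = MODES.get(parts[0].lower())
    if mode is None:
        # Natural language: hand the full text to auto-detection (not shown).
        return ("Auto-detect", arguments.strip())
    rest = parts[1].strip() if len(parts) > 1 else ""
    return (mode, rest)
```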
Mode 1: Generate (pipeline)
Design and generate CI/CD workflow files from requirements.
Steps
- Gather requirements -- language, framework, test suite, deployment targets, branch strategy
- Select platform -- GitHub Actions (default), GitLab CI, or both
- Load patterns -- read `references/github-actions-patterns.md` or `references/gitlab-ci-patterns.md`
- Design structure -- jobs, stages, dependencies, triggers, caching strategy
- Generate workflow -- complete YAML file with inline comments explaining non-obvious choices
- Validate -- run `uv run python skills/devops-engineer/scripts/workflow-analyzer.py <file>` on generated output
Output
Complete workflow YAML file written to the appropriate location.
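To illustrate the shape of a generated file, the sketch below renders a minimal GitHub Actions workflow as text. It is not the skill's generator: `render_ci_workflow` is a hypothetical helper, the trigger and runner choices are assumptions, and the caller must supply the real commit SHA for `actions/checkout` since none is hard-coded.

```python
import json


def render_ci_workflow(checkout_sha: str, test_command: str) -> str:
    """Return a minimal GitHub Actions workflow as YAML text.

    checkout_sha must be the full 40-character commit SHA of actions/checkout,
    looked up by the caller. The test command is emitted as a double-quoted
    YAML string (JSON string syntax is valid YAML).
    """
    if len(checkout_sha) != 40 or any(c not in "0123456789abcdef" for c in checkout_sha):
        raise ValueError("checkout_sha must be a full 40-character lowercase hex SHA")
    return f"""\
name: ci
on:
  push:
    branches: [main]
  pull_request:

# Explicit least-privilege token permissions instead of the defaults.
permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 15  # bound runaway jobs
    steps:
      - uses: actions/checkout@{checkout_sha}  # pinned to a full SHA
      - run: {json.dumps(test_command)}
"""
```

The output still needs the validation step above; this sketch only shows the inline-comment and pinning conventions in one place.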
Mode 2: Action (action)
Generate individual GitHub Action steps or jobs.
- Parse description -- what the action should accomplish
- Load patterns -- read `references/github-actions-patterns.md`
- Generate -- step or job YAML with correct `uses`, `with`, `env` configuration
- Context check -- if an existing workflow is referenced, read it and integrate the new action
Output: YAML snippet ready for insertion into a workflow file.
Mode 3: Optimize (optimize)
Analyze and optimize pipeline build times.
Analysis
- Analyze -- run `uv run python skills/devops-engineer/scripts/workflow-analyzer.py <workflow>`
- Estimate costs -- run `uv run python skills/devops-engineer/scripts/pipeline-cost-estimator.py <workflow>`
- Load techniques -- read `references/pipeline-optimization.md`
Optimization Opportunities
- Identify opportunities:
  - Missing caches (dependency, build artifact, Docker layer)
  - Sequential jobs that could run in parallel
  - Missing matrix strategy for multi-version testing
  - Unnecessary full checkouts (use sparse-checkout or shallow clone)
  - Redundant steps across jobs
  - Missing path filters for selective runs
  - Oversized runner for lightweight tasks
- Present plan -- ranked optimization recommendations with estimated time savings
- Implement -- apply approved optimizations to the workflow file
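The "present plan" step calls for recommendations ranked by estimated time savings. A minimal sketch of one way to represent and order them; the `Recommendation` type and its field names are hypothetical, and the estimates would come from the analysis step rather than from this code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    title: str
    minutes_saved_per_run: float  # estimate; every recommendation must carry one


def rank(recommendations: list[Recommendation]) -> list[Recommendation]:
    """Order recommendations by estimated savings, largest first."""
    return sorted(recommendations, key=lambda r: r.minutes_saved_per_run, reverse=True)
```

Making the estimate a required field means a recommendation without one cannot be constructed.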
Mode 4: Deploy (deploy)
Design deployment strategies with rollback plans.
- Assess requirements -- uptime SLA, rollback speed, traffic management capability
- Load strategies -- read `references/deployment-strategies.md`
- Recommend strategy -- blue/green, canary, or rolling based on requirements
| Factor | Blue/Green | Canary | Rolling |
|---|---|---|---|
| Rollback speed | Instant | Fast | Slow |
| Resource cost | 2x | 1.1-1.5x | 1x |
| Risk exposure | None (pre-switch) | Gradual | Gradual |
| Complexity | Medium | High | Low |
| Best for | Critical services | High-traffic APIs | Cost-sensitive apps |
- Generate -- deployment workflow with health checks, gates, and rollback triggers
- Document -- runbook with rollback procedure and escalation path
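The comparison table can be turned into a first-pass decision rule. The sketch below is one possible reading of it, not a definitive policy: the function name and its three boolean inputs are assumptions, and a real recommendation should weigh the full requirements gathered in the assessment step.

```python
def recommend_strategy(
    needs_instant_rollback: bool,
    can_split_traffic: bool,
    can_afford_double_capacity: bool,
) -> str:
    """Pick a deployment strategy from coarse requirements.

    Mirrors the table: blue/green gives instant rollback at roughly 2x
    resource cost, canary needs traffic management and adds complexity,
    and rolling is the low-cost, low-complexity default.
    """
    if needs_instant_rollback and can_afford_double_capacity:
        return "blue/green"
    if can_split_traffic:
        return "canary"
    return "rolling"
```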
Mode 5: Review (review)
Audit an existing CI/CD pipeline for issues and improvements.
Audit Process
- Read workflow -- parse the target workflow file(s)
- Analyze -- run `uv run python skills/devops-engineer/scripts/workflow-analyzer.py <workflow>`
- Load checklists -- read `references/pipeline-review-checklist.md`
Evaluation Dimensions
- Evaluate dimensions:
  - Security: secrets management, permissions scope, unpinned actions, script injection
  - Reliability: retry logic, timeout configuration, concurrency handling
  - Performance: caching, parallelization, selective triggers
  - Maintainability: DRY (reusable workflows/composite actions), readability, documentation
  - Cost: runner selection, unnecessary matrix combinations, artifact retention
- Present findings -- categorized by severity (critical/warning/info) with fix recommendations
- Implement -- apply approved fixes
Mode 6: Debug (debug)
Analyze CI failure logs to identify root causes and fixes.
- Ingest logs -- read the provided log file or inline content. For large logs (>500 lines): keep the first 50 and last 200 lines, then sample middle sections around error patterns
- Parse errors -- run `uv run python skills/devops-engineer/scripts/log-parser.py <logfile>`
- Load triage protocol -- read `references/ci-failure-triage.md`
- Classify failures by category:
| Category | Examples | Common Fixes |
|---|---|---|
| dependency | Version conflict, missing package, registry timeout | Pin versions, add retry, use cache |
| build | Compilation error, type error, out of memory | Fix code, increase runner memory |
| test | Assertion failure, flaky test, timeout | Fix test, add retry for flaky, increase timeout |
| lint | Format violation, rule violation | Run formatter, update config |
| deploy | Permission denied, health check fail, resource limit | Fix permissions, check config, scale resources |
- Trace root cause -- follow error chain to the originating failure
- Recommend fix -- specific actionable steps with code/config changes
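The log-ingestion rule in the first step (over 500 lines: first 50, last 200, plus middle sections around error patterns) can be sketched as below. The error regex and the `context` window size are assumptions for illustration; this is not the behavior of `log-parser.py`, which is not described here.

```python
import re

# Assumed error markers; a real triage protocol may use a richer set.
ERROR_PATTERN = re.compile(r"error|failed|exception|fatal", re.IGNORECASE)


def sample_log(lines: list[str], context: int = 2) -> list[str]:
    """Reduce a large log to its head, error-centred middle windows, and tail."""
    if len(lines) <= 500:
        return list(lines)  # small logs are kept whole
    head_end = 50
    tail_start = len(lines) - 200
    keep = set(range(head_end)) | set(range(tail_start, len(lines)))
    for i in range(head_end, tail_start):
        if ERROR_PATTERN.search(lines[i]):
            # Keep a few lines of context on each side, clamped to the middle.
            lo = max(head_end, i - context)
            hi = min(tail_start, i + context + 1)
            keep.update(range(lo, hi))
    return [lines[i] for i in sorted(keep)]
```

Collecting indices in a set keeps overlapping error windows from duplicating lines and preserves the original order.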
Reference Files
Load ONE reference at a time. Do not preload all references into context.
| File | Content | Read When |
|---|---|---|
| `references/github-actions-patterns.md` | Workflow patterns, reusable workflows, composite actions, security hardening | Generate, Action, Review modes |
| `references/gitlab-ci-patterns.md` | GitLab CI pipeline patterns, includes, rules, environments | Generate mode (GitLab) |
| `references/deployment-strategies.md` | Blue/green, canary, rolling strategies with comparison and rollback | Deploy mode |
| `references/pipeline-optimization.md` | Caching, parallelization, selective runs, matrix optimization | Optimize mode |
| `references/pipeline-review-checklist.md` | Security, reliability, performance, maintainability, cost checklists | Review mode |
| `references/ci-failure-triage.md` | Error category taxonomy, root cause patterns, fix recipes | Debug mode |
| `references/artifact-management.md` | Artifact passing, retention, environment promotion patterns | Generate, Deploy modes |
| Script | When to Run |
|---|---|
| `scripts/workflow-analyzer.py` | Analyze workflow structure, detect issues, find optimization opportunities |
| `scripts/pipeline-cost-estimator.py` | Estimate CI minutes and identify cost savings |
| `scripts/log-parser.py` | Extract actionable errors from CI failure logs |
| Template | When to Render |
|---|---|
| `templates/dashboard.html` | After analysis -- inject pipeline health data into the dashboard |
Critical Rules
- Never generate workflows with unpinned third-party actions -- always use full SHA pins (`uses: actions/checkout@<sha>`)
- Never use `pull_request_target` with `actions/checkout` of PR head -- script injection risk
- Always set explicit `permissions` block -- never rely on default (overly broad) permissions
- Never hardcode secrets in workflow files -- use `${{ secrets.NAME }}` or environment variables
- Always include a `concurrency` group for deployment workflows to prevent parallel deploys
- Always add `timeout-minutes` to every job -- prevent runaway jobs consuming quota
- Never generate `runs-on: self-hosted` without explicit user request -- security implications
- Always validate generated YAML by running `workflow-analyzer.py` before presenting
- Deployment workflows must include health checks and rollback triggers
- Debug mode must truncate/sample large logs (>500 lines) before analysis -- do not load entire CI logs into context
- Review mode is read-only until user approves fixes (approval gate)
- Load ONE reference file at a time -- do not preload all references into context
- Every optimization recommendation must include estimated time savings
- Generated workflows must include inline comments explaining non-obvious configuration choices
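A few of the rules above (SHA pinning, explicit `permissions`, `timeout-minutes`) lend themselves to a quick text check. The sketch below is a rough illustration using regular expressions rather than a YAML parse. It is not `workflow-analyzer.py`: the `check_rules` name is hypothetical, it only looks for a top-level `permissions` key, and it only confirms that `timeout-minutes` appears somewhere, not on every job.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def check_rules(workflow_text: str) -> list[str]:
    """Return rule violations found by simple text matching."""
    problems = []
    for ref in USES.findall(workflow_text):
        # Local actions and docker:// references are out of scope for this sketch.
        if ref.startswith(("./", "docker://")):
            continue
        if not FULL_SHA.search(ref):
            problems.append(f"unpinned action: {ref}")
    if not re.search(r"^permissions:", workflow_text, re.MULTILINE):
        problems.append("missing top-level permissions block")
    if "timeout-minutes:" not in workflow_text:
        problems.append("no timeout-minutes found")
    return problems
```

An empty list means none of these three checks fired; it does not mean the workflow satisfies the remaining rules.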