qcsd-production-swarm
QCSD Production Swarm v1.0
Post-release production health assessment and QCSD feedback loop closure.
Overview
The Production Swarm assesses release health in the live production environment using DORA metrics, incident RCA, defect prediction, and cross-phase feedback loops. It renders a HEALTHY / DEGRADED / CRITICAL decision and is the only QCSD phase with dual responsibility: assessing current production health AND closing the feedback loop back to Ideation and Refinement phases.
QCSD Phase Positioning
| Phase | Swarm | Decision | When |
|---|---|---|---|
| Ideation | qcsd-ideation-swarm | GO / CONDITIONAL / NO-GO | PI/Sprint Planning |
| Refinement | qcsd-refinement-swarm | READY / CONDITIONAL / NOT-READY | Sprint Refinement |
| Development | qcsd-development-swarm | SHIP / CONDITIONAL / HOLD | During Sprint |
| Verification | qcsd-cicd-swarm | RELEASE / REMEDIATE / BLOCK | Pre-Release / CI-CD |
| Production | qcsd-production-swarm | HEALTHY / DEGRADED / CRITICAL | Post-Release |
Parameters
- TELEMETRY_DATA: Path to production telemetry, incident reports, and DORA metrics (required)
- RELEASE_ID: Release identifier for tracking (optional)
- OUTPUT_FOLDER: Where to save reports (default: ${PROJECT_ROOT}/Agentic QCSD/production/)
- SLA_DEFINITIONS: Path to SLA/SLO target definitions (optional)
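As a minimal sketch, the parameters could be validated and defaulted as follows. The `SwarmParams` class and `resolve_params` function are hypothetical names used for illustration only; they are not part of the skill itself.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SwarmParams:
    """Hypothetical container for the four documented parameters."""
    telemetry_data: Path                    # TELEMETRY_DATA (required)
    release_id: Optional[str] = None        # RELEASE_ID (optional)
    output_folder: Optional[Path] = None    # OUTPUT_FOLDER (defaulted in resolve_params)
    sla_definitions: Optional[Path] = None  # SLA_DEFINITIONS (optional)


def resolve_params(raw: dict, project_root: str) -> SwarmParams:
    """Reject a missing TELEMETRY_DATA and apply the documented OUTPUT_FOLDER default."""
    telemetry = raw.get("TELEMETRY_DATA")
    if not telemetry:
        raise ValueError("TELEMETRY_DATA is required")

    # Documented default: ${PROJECT_ROOT}/Agentic QCSD/production/
    default_out = Path(project_root) / "Agentic QCSD" / "production"
    out = raw.get("OUTPUT_FOLDER")
    sla = raw.get("SLA_DEFINITIONS")

    return SwarmParams(
        telemetry_data=Path(telemetry),
        release_id=raw.get("RELEASE_ID"),
        output_folder=Path(out) if out else default_out,
        sla_definitions=Path(sla) if sla else None,
    )
```

Failing fast on the one required parameter keeps a missing telemetry path from surfacing later, after agents have already been spawned.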
ENFORCEMENT RULES - READ FIRST
| Rule | Enforcement |
|---|---|
| E1 | You MUST spawn ALL THREE core agents in Step 2. No exceptions. |
| E2 | You MUST put all parallel Task calls in a SINGLE message. |
| E3 | You MUST STOP and WAIT after each batch. No proceeding early. |
| E4 | You MUST spawn conditional agents if flags are TRUE. No skipping. |
| E5 | You MUST apply HEALTHY/DEGRADED/CRITICAL logic exactly as specified in Step 5. |
| E6 | You MUST generate the full report structure. No abbreviated versions. |
| E7 | Each agent MUST read its reference files before analysis. |
| E8 | You MUST run BOTH feedback agents in Step 8 SEQUENTIALLY. Always. Both agents. |
| E9 | You MUST execute Step 7 learning persistence. No skipping. |
PROHIBITED BEHAVIORS:
- Summarizing instead of spawning agents
- Skipping agents "for brevity"
- Proceeding before background tasks complete
- Providing your own analysis instead of spawning specialists
- Omitting report sections or using placeholder text
Step Execution Protocol
This skill uses a micro-file step architecture. Each step is a self-contained file loaded one at a time to avoid "lost in the middle" context degradation.
Execute steps sequentially by reading each step file with the Read tool.
Steps
- Flag Detection -- steps/01-flag-detection.md -- Retrieve CI/CD signals, detect telemetry source, evaluate all 7 flags
- Core Agents -- steps/02-core-agents.md -- Spawn qe-metrics-optimizer, qe-defect-predictor, qe-root-cause-analyzer in parallel
- Batch 1 Results -- steps/03-batch1-results.md -- Wait for core agents, extract all metrics
- Conditional Agents -- steps/04-conditional-agents.md -- Spawn flagged conditional agents in parallel
- Decision Synthesis -- steps/05-decision-synthesis.md -- Apply HEALTHY/DEGRADED/CRITICAL logic
- Report Generation -- steps/06-report-generation.md -- Generate executive summary and full report
- Learning Persistence -- steps/07-learning-persistence.md -- Store findings to memory, save persistence record
- Feedback Loop -- steps/08-feedback-loop.md -- Run learning coordinator then transfer specialist (sequential)
- Final Output -- steps/09-final-output.md -- Display completion summary with all scores
Execution Instructions
- Use the Read tool to load the current step file (e.g., Read({ file_path: ".claude/skills/qcsd-production-swarm/steps/01-flag-detection.md" }))
- Execute the step's instructions completely
- Verify all success criteria are met before proceeding
- Pass the step's output as context to the next step
- If a step fails, halt and report the failure point -- do not skip ahead
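The execution instructions above can be sketched as a simple loop. This is an illustration under stated assumptions, not the orchestrator's actual implementation: `execute_step` is a hypothetical callback that runs one step file and returns a success flag plus that step's output, and `StepFailed` is a hypothetical exception name.

```python
from typing import Callable, Dict, List, Tuple

STEP_FILES: List[str] = [
    "steps/01-flag-detection.md",
    "steps/02-core-agents.md",
    "steps/03-batch1-results.md",
    "steps/04-conditional-agents.md",
    "steps/05-decision-synthesis.md",
    "steps/06-report-generation.md",
    "steps/07-learning-persistence.md",
    "steps/08-feedback-loop.md",
    "steps/09-final-output.md",
]

# Hypothetical callback: runs one step file, returns (success, output).
StepRunner = Callable[[str, Dict], Tuple[bool, Dict]]


class StepFailed(RuntimeError):
    """Raised at the first failing step so that later steps never run."""


def run_steps(execute_step: StepRunner) -> Dict:
    """Run every step in order, passing accumulated output forward."""
    context: Dict = {}
    for number, step_file in enumerate(STEP_FILES, start=1):
        ok, output = execute_step(step_file, context)
        if not ok:
            # Halt and report the failure point; do not skip ahead.
            raise StepFailed(f"step {number} failed: {step_file}")
        # The step's output becomes context for the next step.
        context = {**context, **output}
    return context
```

Raising on the first failure, rather than collecting errors, mirrors the rule that a failed step halts the run.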
Resume Support
To resume from a specific step, specify --from-step N and the orchestrator will skip to step N. Ensure you have the required prerequisite data from prior steps.
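A minimal sketch of how a --from-step value might be parsed and validated, using Python's standard argparse module. Only the flag name comes from this document; the `steps_to_run` function, the range check, and the default of 1 are assumptions made for illustration.

```python
import argparse
from typing import List, Optional

TOTAL_STEPS = 9  # steps/01 through steps/09


def steps_to_run(argv: Optional[List[str]] = None) -> List[int]:
    """Return the step numbers to execute, honoring --from-step N."""
    parser = argparse.ArgumentParser(prog="qcsd-production-swarm")
    parser.add_argument("--from-step", type=int, default=1, metavar="N")
    args = parser.parse_args(argv)

    # Assumption: an out-of-range N is rejected rather than clamped.
    if not 1 <= args.from_step <= TOTAL_STEPS:
        parser.error(f"--from-step must be between 1 and {TOTAL_STEPS}")

    return list(range(args.from_step, TOTAL_STEPS + 1))
```

For example, `steps_to_run(["--from-step", "5"])` yields steps 5 through 9. The sketch does not check that prerequisite data from steps 1 to 4 exists; that remains the caller's responsibility.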
Agent Inventory
| Agent | Type | Domain | Batch |
|---|---|---|---|
| qe-metrics-optimizer | Core (always) | learning-optimization | 1 |
| qe-defect-predictor | Core (always) | defect-intelligence | 1 |
| qe-root-cause-analyzer | Core (always) | defect-intelligence | 1 |
| qe-chaos-engineer | Conditional (HAS_INFRASTRUCTURE_CHANGE) | chaos-resilience | 2 |
| qe-performance-tester | Conditional (HAS_PERFORMANCE_SLA) | chaos-resilience | 2 |
| qe-regression-analyzer | Conditional (HAS_REGRESSION_RISK) | defect-intelligence | 2 |
| qe-pattern-learner | Conditional (HAS_RECURRING_INCIDENTS) | defect-intelligence | 2 |
| qe-middleware-validator | Conditional (HAS_MIDDLEWARE) | enterprise-integration | 2 |
| qe-sap-rfc-tester | Conditional (HAS_SAP_INTEGRATION) | enterprise-integration | 2 |
| qe-sod-analyzer | Conditional (HAS_AUTHORIZATION) | enterprise-integration | 2 |
| qe-learning-coordinator | Feedback (always, sequential) | learning-optimization | 3 |
| qe-transfer-specialist | Feedback (always, sequential) | learning-optimization | 3 |
Total: 12 agents (3 core + 7 conditional + 2 feedback)
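The inventory above can be expressed as a lookup from flags to agents. The sketch below uses the agent and flag names from the table; the `plan_batches` function and its dictionary-of-flags input are hypothetical, and the actual spawning is left to the step files.

```python
from typing import Dict, List

CORE_AGENTS: List[str] = [
    "qe-metrics-optimizer",
    "qe-defect-predictor",
    "qe-root-cause-analyzer",
]

# Flag name -> conditional agent, as listed in the inventory table.
CONDITIONAL_AGENTS: Dict[str, str] = {
    "HAS_INFRASTRUCTURE_CHANGE": "qe-chaos-engineer",
    "HAS_PERFORMANCE_SLA": "qe-performance-tester",
    "HAS_REGRESSION_RISK": "qe-regression-analyzer",
    "HAS_RECURRING_INCIDENTS": "qe-pattern-learner",
    "HAS_MIDDLEWARE": "qe-middleware-validator",
    "HAS_SAP_INTEGRATION": "qe-sap-rfc-tester",
    "HAS_AUTHORIZATION": "qe-sod-analyzer",
}

# Run in this order: learning coordinator first, then transfer specialist.
FEEDBACK_AGENTS: List[str] = [
    "qe-learning-coordinator",
    "qe-transfer-specialist",
]


def plan_batches(flags: Dict[str, bool]) -> Dict[int, List[str]]:
    """Map batch number to the agents to spawn for the given flag values."""
    conditional = [
        agent
        for flag, agent in CONDITIONAL_AGENTS.items()
        if flags.get(flag, False)  # a missing flag is treated as FALSE (assumption)
    ]
    return {1: list(CORE_AGENTS), 2: conditional, 3: list(FEEDBACK_AGENTS)}
```

Batches 1 and 3 never depend on the flags, which reflects rules E1 and E8: the core and feedback agents always run.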
Quality Gate Thresholds
| Metric | HEALTHY | DEGRADED | CRITICAL |
|---|---|---|---|
| DORA Score | >= 0.7 | 0.4 - 0.69 | < 0.4 |
| SLA Compliance | >= 99% | 95 - 98.9% | < 95% |
| Incident Severity | P3/P4/NONE | P2 | P0/P1 |
| Defect Trend | declining/stable | stable (density > 2) | increasing + density > 5 |
| RCA Completeness | >= 80% | 50 - 79% | < 50% |
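As a rough illustration, four of the thresholds above can be written as rating functions. Several points are assumptions rather than part of this document: values that fall between the printed bands (for example a DORA score of 0.695) are rated DEGRADED, the Defect Trend row is omitted because the table does not cover every trend and density combination, and the worst-case roll-up in `worst_of` is a placeholder only. The authoritative decision logic is the one specified in Step 5.

```python
from typing import Iterable

HEALTHY, DEGRADED, CRITICAL = "HEALTHY", "DEGRADED", "CRITICAL"
_SEVERITY_ORDER = {HEALTHY: 0, DEGRADED: 1, CRITICAL: 2}


def rate_dora(score: float) -> str:
    """DORA Score: >= 0.7 healthy, >= 0.4 degraded, otherwise critical."""
    if score >= 0.7:
        return HEALTHY
    return DEGRADED if score >= 0.4 else CRITICAL


def rate_sla(compliance_pct: float) -> str:
    """SLA Compliance: >= 99% healthy, >= 95% degraded, otherwise critical."""
    if compliance_pct >= 99:
        return HEALTHY
    return DEGRADED if compliance_pct >= 95 else CRITICAL


def rate_incident(worst_severity: str) -> str:
    """Incident Severity: P0/P1 critical, P2 degraded, P3/P4/NONE healthy."""
    sev = worst_severity.upper()
    if sev in ("P0", "P1"):
        return CRITICAL
    if sev == "P2":
        return DEGRADED
    if sev in ("P3", "P4", "NONE"):
        return HEALTHY
    raise ValueError(f"unknown severity: {worst_severity!r}")


def rate_rca(completeness_pct: float) -> str:
    """RCA Completeness: >= 80% healthy, >= 50% degraded, otherwise critical."""
    if completeness_pct >= 80:
        return HEALTHY
    return DEGRADED if completeness_pct >= 50 else CRITICAL


def worst_of(ratings: Iterable[str]) -> str:
    """Assumed roll-up: the most severe individual rating wins."""
    return max(ratings, key=_SEVERITY_ORDER.__getitem__)
```

A call such as `worst_of([rate_dora(0.82), rate_sla(99.5), rate_incident("P2")])` returns DEGRADED under this assumed roll-up, because the single P2 incident outweighs the two healthy metrics.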
Report Filename Mapping
| Agent | Report Filename | Step |
|---|---|---|
| qe-metrics-optimizer | 02-dora-metrics.md | 2 |
| qe-defect-predictor | 03-defect-prediction.md | 2 |
| qe-root-cause-analyzer | 04-root-cause-analysis.md | 2 |
| qe-chaos-engineer | 05-chaos-resilience.md | 4 |
| qe-performance-tester | 06-performance-sla.md | 4 |
| qe-regression-analyzer | 07-regression-analysis.md | 4 |
| qe-pattern-learner | 08-pattern-analysis.md | 4 |
| Learning Persistence | 09-learning-persistence.json | 7 |
| qe-middleware-validator | 10-middleware-health.md | 4 |
| qe-sap-rfc-tester | 11-sap-health.md | 4 |
| qe-sod-analyzer | 12-sod-compliance.md | 4 |
| Feedback agents | 13-feedback-loops.md | 8 |
| Synthesis | 01-executive-summary.md | 6 |
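The mapping above lends itself to a small lookup plus a completeness check before final output. The filenames come from the table; the `report_path` and `missing_reports` helpers are hypothetical names, shown only as a sketch.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Producer -> report filename, copied from the mapping table.
REPORT_FILES: Dict[str, str] = {
    "Synthesis": "01-executive-summary.md",
    "qe-metrics-optimizer": "02-dora-metrics.md",
    "qe-defect-predictor": "03-defect-prediction.md",
    "qe-root-cause-analyzer": "04-root-cause-analysis.md",
    "qe-chaos-engineer": "05-chaos-resilience.md",
    "qe-performance-tester": "06-performance-sla.md",
    "qe-regression-analyzer": "07-regression-analysis.md",
    "qe-pattern-learner": "08-pattern-analysis.md",
    "Learning Persistence": "09-learning-persistence.json",
    "qe-middleware-validator": "10-middleware-health.md",
    "qe-sap-rfc-tester": "11-sap-health.md",
    "qe-sod-analyzer": "12-sod-compliance.md",
    "Feedback agents": "13-feedback-loops.md",
}


def report_path(output_folder: str, producer: str) -> Path:
    """Full path of the report a given producer is expected to write."""
    return Path(output_folder) / REPORT_FILES[producer]


def missing_reports(output_folder: str, producers: Iterable[str]) -> List[str]:
    """Filenames that the given producers should have written but did not."""
    return [
        REPORT_FILES[p]
        for p in producers
        if not report_path(output_folder, p).exists()
    ]
```

Passing only the producers that actually ran (core, flagged conditional, and feedback) avoids reporting conditional reports as missing when their flag was FALSE.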
Execution Model Options
| Model | When to Use | Agent Spawn |
|---|---|---|
| Task Tool (PRIMARY) | Claude Code sessions | Task({ subagent_type, run_in_background: true }) |
| MCP Tools | MCP server available | fleet_init({}) / task_submit({}) |
| CLI | Terminal/scripts | swarm init / agent spawn |
Key Principle
Production health is measured by outcomes, not intentions. This swarm provides evidence-based production assessment and closes the QCSD feedback loop.