end-to-end-orchestrator
End-to-End Orchestrator
Overview
end-to-end-orchestrator provides a single-command, complete development workflow, coordinating all 5 multi-ai skills from research through production deployment.
Purpose: Transform "I want feature X" into production-ready code through automated skill coordination
Pattern: Workflow-based (5-stage pipeline with quality gates)
Key Innovation: Automatic orchestration of research → planning → implementation → testing → verification with failure recovery and quality gates
The Complete Pipeline:
```
Input: Feature description
↓
1. Research (multi-ai-research) [optional]
↓ [Quality Gate: Research complete]
2. Planning (multi-ai-planning)
↓ [Quality Gate: Plan ≥90/100]
3. Implementation (multi-ai-implementation)
↓ [Quality Gate: Tests pass, coverage ≥80%]
4. Testing (multi-ai-testing)
↓ [Quality Gate: Coverage ≥95%, verified]
5. Verification (multi-ai-verification)
↓ [Quality Gate: Score ≥90/100, all layers pass]
Output: Production-ready code
```
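In practice the whole pipeline is started with one request. The exact wording is flexible; a hypothetical invocation (feature description and options are illustrative) might look like:

```
Use end-to-end-orchestrator to implement user authentication with OAuth 2.0

Mode: standard (all 5 stages)
Quality target: ≥90/100
```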
When to Use
Use end-to-end-orchestrator when:
- You are implementing a complete feature (not a quick fix)
- You want an automated workflow (not manual skill chaining)
- Production quality is required (all gates must pass)
- Time optimization matters (parallel execution where possible)
- You need failure recovery (automatic retry/rollback)
When NOT to Use:
- Quick fixes (<30 minutes)
- Exploratory work (uncertain requirements)
- Manual control preferred (step through each phase)
Prerequisites
Required
- All 5 multi-ai skills installed:
- multi-ai-research
- multi-ai-planning
- multi-ai-implementation
- multi-ai-testing
- multi-ai-verification
Optional
- agent-memory-system (for learning from past work)
- hooks-manager (for automation)
- Gemini CLI, Codex CLI (for tri-AI research)
Complete Workflow
Stage 1: Research (Optional)
Purpose: Ground implementation in proven patterns
Process:
1. Determine if Research Needed:

```javascript
// Check if objective is familiar
const similarWork = await recallMemory({ type: 'episodic', query: objective });

if (similarWork.length === 0) {
  // Unfamiliar domain → research needed
  needsResearch = true;
} else {
  // Familiar → can skip research, use past learnings
  needsResearch = false;
}
```

2. Execute Research (if needed):

```
Use multi-ai-research for "[domain] implementation patterns and best practices"
```

What It Provides:
- Claude research: Official docs, codebase patterns
- Gemini research: Web best practices, latest trends
- Codex research: GitHub patterns, code examples
- Quality: ≥95/100 with 100% citations
3. Quality Gate: Research Complete:

✅ Research findings documented
✅ Patterns identified (minimum 2)
✅ Best practices extracted (minimum 3)
✅ Quality score ≥95/100

If Fail: Research incomplete → retry research OR proceed without (user decides)
Outputs:
- Research findings (.analysis/ANALYSIS_FINAL.md)
- Patterns and best practices
- Implementation recommendations
Time: 30-60 minutes (can skip if familiar domain)
Next: Proceed to Stage 2
Stage 2: Planning
Purpose: Create agent-executable plan with quality ≥90/100
Process:
1. Load Research Context (if research done):

```javascript
let context = "";
if (researchDone) {
  context = await readFile('.analysis/ANALYSIS_FINAL.md');
}
```

2. Invoke Planning:

```
Use multi-ai-planning to create plan for [objective]

${context ? `Research findings available in: .analysis/ANALYSIS_FINAL.md` : ''}

Create comprehensive plan following 6-step workflow.
```

What It Does:
- Analyzes objective
- Hierarchical decomposition (8-15 tasks)
- Maps dependencies, identifies parallel
- Plans verification for all tasks
- Scores quality (0-100)
3. Quality Gate: Plan Approved:

✅ Plan created
✅ Quality score ≥90/100
✅ All tasks have verification
✅ Dependencies mapped
✅ No circular dependencies

If Fail (score <90):
- Review gap analysis
- Apply recommended fixes
- Re-verify
- Retry up to 2 times
- If still <90: Escalate to human review
4. Save Plan to Shared State:

```bash
# Save for next stage
cp plans/[plan-id]/plan.json .multi-ai-context/plan.json
```
Outputs:
- plan.json (machine-readable)
- PLAN.md (human-readable)
- COORDINATION.md (execution guide)
- Quality ≥90/100
Time: 1.5-3 hours
Next: Proceed to Stage 3
Stage 3: Implementation
Purpose: Execute plan with TDD, produce working code
Process:
1. Load Plan:

```javascript
const plan = JSON.parse(readFile('.multi-ai-context/plan.json'));
console.log(`📋 Loaded plan: ${plan.objective}`);
console.log(`   Tasks: ${plan.tasks.length}`);
console.log(`   Estimated: ${plan.metadata.estimated_total_hours} hours`);
```

2. Invoke Implementation:

```
Use multi-ai-implementation following plan in .multi-ai-context/plan.json

Execute all 6 steps:
1. Explore & gather context
2. Plan architecture (plan already created, refine as needed)
3. Implement incrementally with TDD
4. Coordinate multi-agent (if parallel tasks)
5. Integration & E2E testing
6. Quality verification before commit

Success criteria from plan.
```

What It Does:
- Explores codebase (progressive disclosure)
- Implements incrementally (<200 lines per commit)
- Test-driven development (tests first)
- Multi-agent coordination for parallel tasks
- Continuous testing during implementation
- Doom loop prevention (max 3 retries)
3. Quality Gate: Implementation Complete:

✅ All plan tasks implemented
✅ All tests passing
✅ Coverage ≥80% (gate), ideally ≥95%
✅ No regressions
✅ Doom loop avoided (< max retries)

If Fail:
- Identify failing task
- Retry with different approach
- If 3 failures: Escalate to human
- Save state for recovery
4. Save Implementation State:

```bash
# Save for next stage
echo '{
  "status": "implemented",
  "files_changed": [...],
  "tests_run": 95,
  "tests_passed": 95,
  "coverage": 87,
  "commits": ["abc123", "def456"]
}' > .multi-ai-context/implementation-status.json
```
Outputs:
- Working code
- Tests passing
- Coverage ≥80%
- Commits created
Time: 3-10 hours (varies by complexity)
Next: Proceed to Stage 4
Stage 4: Testing (Independent Verification)
Purpose: Verify tests are comprehensive and prevent gaming
Process:
1. Load Implementation Context:

```javascript
const implStatus = JSON.parse(
  readFile('.multi-ai-context/implementation-status.json')
);
console.log(`🧪 Testing implementation:`);
console.log(`   Files changed: ${implStatus.files_changed.length}`);
console.log(`   Current coverage: ${implStatus.coverage}%`);
```

2. Invoke Independent Testing:

```
Use multi-ai-testing independent verification workflow

Verify:
- Tests in: tests/
- Code in: src/
- Specifications in: .multi-ai-context/plan.json

Workflows to execute:
1. Test quality verification (independent agent)
2. Coverage validation (≥95% target)
3. Edge case discovery (AI-powered)
4. Multi-agent ensemble scoring (if critical feature)

Score test quality (0-100).
```

What It Does:
- Independent verification (separate agent from impl)
- Checks tests match specifications (not just what code does)
- Generates additional edge case tests
- Multi-agent ensemble for quality scoring
- Prevents tests from overfitting to the implementation
3. Quality Gate: Testing Verified:

✅ Test quality score ≥90/100
✅ Coverage ≥95% (target achieved)
✅ Independent verification passed
✅ No test gaming detected
✅ Edge cases covered

If Fail:
- Review test quality issues
- Generate additional tests
- Re-verify
- Max 2 retries, then escalate
4. Save Testing State:

```bash
echo '{
  "status": "tested",
  "test_quality_score": 92,
  "coverage": 96,
  "tests_total": 112,
  "edge_cases": 23,
  "gaming_detected": false
}' > .multi-ai-context/testing-status.json
```
Outputs:
- Test quality ≥90/100
- Coverage ≥95%
- Independent verification passed
Time: 1-3 hours
Next: Proceed to Stage 5
Stage 5: Verification (Multi-Layer QA)
Purpose: Final quality assurance before production
Process:
1. Load All Context:

```javascript
const plan = JSON.parse(readFile('.multi-ai-context/plan.json'));
const implStatus = JSON.parse(readFile('.multi-ai-context/implementation-status.json'));
const testStatus = JSON.parse(readFile('.multi-ai-context/testing-status.json'));

console.log(`🔍 Final verification:`);
console.log(`   Objective: ${plan.objective}`);
console.log(`   Implementation: ${implStatus.status}`);
console.log(`   Testing: ${testStatus.coverage}% coverage`);
```

2. Invoke Multi-Layer Verification:

```
Use multi-ai-verification for complete quality check

Verify:
- Code: src/
- Tests: tests/
- Plan: .multi-ai-context/plan.json

Execute all 5 layers:
1. Rules-based (linting, types, schema, SAST)
2. Functional (tests, coverage, examples)
3. Visual (if UI: screenshots, a11y)
4. Integration (E2E, API compatibility)
5. Quality scoring (LLM-as-judge, 0-100)

All 5 quality gates must pass.
```

What It Does:
- Runs all 5 verification layers
- Each layer is independent
- LLM-as-judge for holistic assessment
- Agent-as-a-Judge can execute tools to verify claims
- Multi-agent ensemble for critical features
3. Quality Gate: Production Ready:

✅ Layer 1 (Rules): PASS
✅ Layer 2 (Functional): PASS, coverage 96%
✅ Layer 3 (Visual): PASS or SKIPPED
✅ Layer 4 (Integration): PASS
✅ Layer 5 (Quality): 92/100 ≥90
✅ ALL GATES PASSED → PRODUCTION APPROVED

If Fail:
- Review gap analysis from failed layer
- Apply recommended fixes
- Re-verify from failed layer (not all 5)
- Max 2 retries per layer
- If still failing: Escalate to human
4. Generate Final Report:

```markdown
# Feature Implementation Complete

**Objective**: [from plan]

## Pipeline Execution Summary

### Stage 1: Research
- Status: ✅ Complete
- Quality: 97/100
- Time: 52 minutes

### Stage 2: Planning
- Status: ✅ Complete
- Quality: 94/100
- Tasks: 23
- Time: 1.8 hours

### Stage 3: Implementation
- Status: ✅ Complete
- Files changed: 15
- Lines added: 847
- Commits: 12
- Time: 6.2 hours

### Stage 4: Testing
- Status: ✅ Complete
- Test quality: 92/100
- Coverage: 96%
- Tests: 112
- Time: 1.5 hours

### Stage 5: Verification
- Status: ✅ Complete
- Quality score: 92/100
- All layers: PASS
- Time: 1.2 hours

## Final Metrics
- **Total Time**: 11.3 hours
- **Quality**: 92/100
- **Coverage**: 96%
- **Status**: ✅ PRODUCTION READY

## Commits
- abc123: feat: Add database schema
- def456: feat: Implement OAuth integration
- [... 10 more ...]

## Next Steps
- Create PR for team review
- Deploy to staging
- Production release
```
5. Save to Memory (if agent-memory-system available):

```javascript
await storeMemory({
  type: 'episodic',
  event: {
    description: `Complete implementation: ${objective}`,
    outcomes: {
      total_time: 11.3,
      quality_score: 92,
      test_coverage: 96,
      stages_completed: 5
    },
    learnings: extractedDuringPipeline
  }
});
```
Outputs:
- Production-ready code
- Comprehensive final report
- Commits created
- PR ready (if requested)
- Memory saved for future learning
Time: 30-90 minutes
Result: ✅ PRODUCTION READY
Failure Recovery
Failure Handling at Each Stage
Stage Fails → Recovery Strategy:
Research Fails:
- Retry with different sources
- Skip research (use memory if available)
- Escalate to human if critical gap
Planning Fails (score <90):
- Review gap analysis
- Apply fixes automatically if possible
- Retry planning (max 2 attempts)
- Escalate if still <90
Implementation Fails:
- Identify failing task
- Automatic rollback to last checkpoint
- Retry with alternative approach
- Doom loop prevention (max 3 retries)
- Escalate with full error context
Testing Fails (coverage <80% or quality <90):
- Generate additional tests for gaps
- Retry verification
- Max 2 retries
- Escalate with coverage report
Verification Fails (score <90 or layer fails):
- Apply auto-fixes for Layer 1-2 issues
- Manual fixes needed for Layers 3-5
- Re-verify from failed layer (not all 5)
- Max 2 retries per layer
- Escalate with quality report
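The per-stage retry-and-escalate loop can be sketched as follows. This is a minimal illustration only: `runStage` is a hypothetical function that executes one pipeline stage and reports its gate result, and the retry limits mirror the strategies above; the orchestrator's real implementation may differ.

```javascript
const fs = require('fs');

// Retry budgets per stage (from the recovery strategies above)
const MAX_RETRIES = { planning: 2, implementation: 3, testing: 2, verification: 2 };

async function runStageWithRecovery(stage, runStage) {
  const historyPath = '.multi-ai-context/failure-history.json';
  const history = fs.existsSync(historyPath)
    ? JSON.parse(fs.readFileSync(historyPath, 'utf8'))
    : {};

  for (let attempt = 1; attempt <= (MAX_RETRIES[stage] ?? 2); attempt++) {
    const result = await runStage(stage, attempt);
    if (result.gatePassed) return result;

    // Record the failure so doom loops can be detected across attempts/sessions
    history[stage] = [...(history[stage] ?? []), { attempt, error: result.error }];
    fs.writeFileSync(historyPath, JSON.stringify(history, null, 2));
  }

  // Retries exhausted → escalate to human with full context
  return { status: 'escalated', stage, attempts: history[stage] };
}
```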
Escalation Protocol
When to Escalate to Human:
- Any stage fails 3 times (doom loop)
- Planning quality <80 after 2 retries
- Implementation doom loop detected
- Verification score <80 after 2 retries
- Budget exceeded (if cost tracking enabled)
- Circular dependency detected
- Irrecoverable error (file system, permissions)
Escalation Format:

```markdown
# ⚠️ ESCALATION REQUIRED
**Stage**: Implementation (Stage 3)
**Failure**: Doom loop detected (3 failed attempts)
## Context
- Objective: Implement user authentication
- Failing Task: 2.2.2 Token generation
- Error: Tests fail with "undefined userId" repeatedly
## Attempts Made
1. Attempt 1: Added userId to payload → Same error
2. Attempt 2: Changed payload structure → Same error
3. Attempt 3: Different JWT library → Same error
## Root Cause Analysis
- Tests expect `user.id` but implementation uses `user.userId`
- Mismatch in data model between test and implementation
- Auto-fix failed 3 times
## Recommended Actions
1. Review test specifications vs. implementation
2. Align data model (user.id vs. user.userId)
3. Manual intervention required
## State Saved
- Checkpoint: checkpoint-003 (before attempts)
- Rollback available: `git checkout checkpoint-003`
- Continue after fix: Resume from Task 2.2.2
```
Parallel Execution Optimization
Identifying Parallel Opportunities
From Plan:

```javascript
const plan = JSON.parse(readFile('.multi-ai-context/plan.json'));

// Plan identifies parallel groups
const parallelGroups = plan.parallel_groups;

// Example:
// Group 1: Tasks 2.1, 2.2, 2.3 (independent)
// Can execute in parallel
```
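For reference, a parallel group entry in plan.json might have roughly this shape. This is illustrative only; the actual schema is produced by multi-ai-planning, and the `reason` field is a hypothetical annotation.

```json
{
  "parallel_groups": [
    {
      "group_id": "pg2",
      "tasks": ["2.1", "2.2", "2.3"],
      "reason": "Independent tasks with no shared files or dependencies"
    }
  ]
}
```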
Executing Parallel Tasks
Pattern:

```javascript
// Stage 3: Implementation with parallel tasks
const parallelGroup = plan.parallel_groups.find(g => g.group_id === 'pg2');

// Spawn parallel implementation agents (assumes the Task tool for spawning subagents)
const results = await Promise.all(
  parallelGroup.tasks.map(taskId => {
    const task = plan.tasks.find(t => t.id === taskId);
    return Task({
      description: `Implement ${task.description}`,
      prompt: `Implement task ${task.id}: ${task.description}

Specifications from plan:
${JSON.stringify(task, null, 2)}

Success criteria:
${task.verification.success_criteria.join('\n')}

Write implementation and tests.
Report completion status.`
    });
  })
);

// Verify all parallel tasks completed
const allSucceeded = results.every(r => r.status === 'complete');

if (allSucceeded) {
  // Proceed to integration
} else {
  // Handle failures
}
```
Time Savings: 20-40% faster than sequential execution
State Management
Cross-Skill State Sharing
Shared Context Directory: .multi-ai-context/
Standard Files:
```
.multi-ai-context/
├── research-findings.json        # From multi-ai-research
├── plan.json                     # From multi-ai-planning
├── implementation-status.json    # From multi-ai-implementation
├── testing-status.json           # From multi-ai-testing
├── verification-report.json      # From multi-ai-verification
├── pipeline-state.json           # Orchestrator state
└── failure-history.json          # For doom loop detection
```
Benefits:
- Skills don't duplicate work
- Later stages read earlier outputs
- Failure recovery knows full state
- Memory can be saved from shared state
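A minimal helper for reading and writing this shared state might look like the following sketch. File names match the directory listing above; the function names are illustrative, not the orchestrator's actual API.

```javascript
const fs = require('fs');
const path = require('path');

const CONTEXT_DIR = '.multi-ai-context';

// Read a stage's output from the shared context directory (null if not yet written)
function readStageState(name) {
  const file = path.join(CONTEXT_DIR, `${name}.json`);
  return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : null;
}

// Persist a stage's output so later stages (and failure recovery) can use it
function writeStageState(name, state) {
  fs.mkdirSync(CONTEXT_DIR, { recursive: true });
  fs.writeFileSync(path.join(CONTEXT_DIR, `${name}.json`), JSON.stringify(state, null, 2));
}

// Example: Stage 4 reads the implementation output, then records its own result
const implStatus = readStageState('implementation-status');
writeStageState('testing-status', { status: 'tested', coverage: 96 });
```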
Progress Tracking
Real-Time Progress:

```json
{
"pipeline_id": "pipeline_20250126_1200",
"objective": "Implement user authentication",
"started_at": "2025-01-26T12:00:00Z",
"current_stage": 3,
"stages": [
{
"stage": 1,
"name": "Research",
"status": "complete",
"duration_minutes": 52,
"quality": 97
},
{
"stage": 2,
"name": "Planning",
"status": "complete",
"duration_minutes": 108,
"quality": 94
},
{
"stage": 3,
"name": "Implementation",
"status": "in_progress",
"started_at": "2025-01-26T13:48:00Z",
"tasks_total": 23,
"tasks_complete": 15,
"tasks_remaining": 8,
"percent_complete": 65
},
{
"stage": 4,
"name": "Testing",
"status": "pending"
},
{
"stage": 5,
"name": "Verification",
"status": "pending"
}
],
"estimated_completion": "2025-01-26T20:00:00Z",
"quality_target": 90,
"current_quality_estimate": 92
}
```
Query Progress:

```bash
# Check current status
cat .multi-ai-context/pipeline-state.json | jq '.current_stage, .stages[2].percent_complete'
# Output: 3 and 65 (Stage 3, 65% complete)
```
Workflow Modes
Standard Mode (Full Pipeline)
All 5 Stages:
Research → Planning → Implementation → Testing → Verification
Time: 8-20 hours
Quality: Maximum (all gates, ≥90)
Use For: Production features, complex implementations
Fast Mode (Skip Research)
4 Stages (familiar domains):
Planning → Implementation → Testing → Verification
Time: 6-15 hours
Quality: High (all gates except research)
Use For: Familiar domains, time-sensitive features
Quick Mode (Essential Gates Only)
Implementation + Basic Verification:
Planning → Implementation → Testing (basic) → Verification (Layers 1-2 only)
Time: 3-8 hours
Quality: Good (essential gates only)
Use For: Internal tools, prototypes
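Mode selection amounts to choosing which stages (and which verification layers) to run. A simple sketch, assuming hypothetical configuration names; the orchestrator's real mode handling may differ:

```javascript
// Which stages run in each mode (stage numbers match the pipeline above)
const MODES = {
  standard: { stages: [1, 2, 3, 4, 5], verificationLayers: [1, 2, 3, 4, 5] },
  fast:     { stages: [2, 3, 4, 5],    verificationLayers: [1, 2, 3, 4, 5] },
  quick:    { stages: [2, 3, 4, 5],    verificationLayers: [1, 2] }
};

function selectMode({ familiarDomain, prototype }) {
  if (prototype) return 'quick';        // internal tools, prototypes
  if (familiarDomain) return 'fast';    // skip research, keep all gates
  return 'standard';                    // full pipeline, maximum quality
}
```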
Best Practices
1. Always Run Planning Stage
Even for "simple" features - planning quality ≥90 prevents issues
2. Use Memory to Skip Research
If similar work done before, recall patterns instead of researching
3. Monitor Progress
Check .multi-ai-context/pipeline-state.json to track progress
4. Trust the Quality Gates
If gate fails, there's a real issue - don't skip fixes
5. Save State Frequently
Each stage completion saves state (enables recovery)
6. Review Final Report
Complete understanding of what was built and quality achieved
Integration Points
With All 5 Multi-AI Skills
Coordinates:
- multi-ai-research (Stage 1)
- multi-ai-planning (Stage 2)
- multi-ai-implementation (Stage 3)
- multi-ai-testing (Stage 4)
- multi-ai-verification (Stage 5)
Provides:
- Automatic skill invocation
- Quality gate enforcement
- Failure recovery
- State management
- Progress tracking
- Final reporting
With agent-memory-system
Before Pipeline:
- Recall similar past work
- Load learned patterns
- Skip research if memory sufficient
After Pipeline:
- Save complete episode to memory
- Extract learnings
- Update procedural patterns
- Improve estimation accuracy
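A sketch of the pre-pipeline memory check, using the recallMemory call shown in Stage 1 and assuming recalled episodes carry the outcomes recorded at the end of past pipelines:

```javascript
// Recall similar past episodes before starting the pipeline
const pastEpisodes = await recallMemory({ type: 'episodic', query: objective });

if (pastEpisodes.length > 0) {
  // Familiar territory: reuse learnings, refine the time estimate, skip Stage 1
  const avgHours =
    pastEpisodes.reduce((sum, e) => sum + e.outcomes.total_time, 0) / pastEpisodes.length;
  console.log(`Similar work averaged ${avgHours.toFixed(1)} hours; skipping research`);
  needsResearch = false;
} else {
  needsResearch = true;
}
```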
With hooks-manager
Session Hooks:
- SessionStart: Load pipeline state
- SessionEnd: Save pipeline progress
- PostToolUse: Track stage completions
Notification Hooks:
- Send telemetry on stage completions
- Alert on gate failures
- Track quality scores
Quick Reference
The 5-Stage Pipeline
| Stage | Skill | Time | Quality Gate | Output |
|---|---|---|---|---|
| 1 | multi-ai-research | 30-60m | ≥95/100 | Research findings |
| 2 | multi-ai-planning | 1.5-3h | ≥90/100 | Executable plan |
| 3 | multi-ai-implementation | 3-10h | Tests pass, ≥80% cov | Working code |
| 4 | multi-ai-testing | 1-3h | ≥95% cov, quality ≥90 | Verified tests |
| 5 | multi-ai-verification | 1-3h | ≥90/100, all layers | Production ready |
Total: 8-20 hours → Production-ready feature
Workflow Modes
| Mode | Stages | Time | Quality | Use For |
|---|---|---|---|---|
| Standard | All 5 | 8-20h | Maximum | Production features |
| Fast | 2-5 (skip research) | 6-15h | High | Familiar domains |
| Quick | 2-5 (basic gates) | 3-8h | Good | Internal tools |
Quality Gates
- Research: ≥95/100, patterns identified
- Planning: ≥90/100, all tasks verifiable
- Implementation: Tests pass, coverage ≥80%
- Testing: Quality ≥90/100, coverage ≥95%
- Verification: ≥90/100, all 5 layers pass
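Taken together, the gates can be checked mechanically from the shared state files. A minimal sketch using the status files listed under State Management; the verification-report.json field names are assumptions for illustration.

```javascript
const fs = require('fs');

const read = f => JSON.parse(fs.readFileSync(`.multi-ai-context/${f}`, 'utf8'));

// Thresholds from the quality gates above
function allGatesPass() {
  const impl = read('implementation-status.json');
  const test = read('testing-status.json');
  const verify = read('verification-report.json'); // field names assumed

  return (
    impl.tests_passed === impl.tests_run &&   // Implementation: all tests pass
    impl.coverage >= 80 &&                    // Implementation: coverage ≥80%
    test.test_quality_score >= 90 &&          // Testing: quality ≥90/100
    test.coverage >= 95 &&                    // Testing: coverage ≥95%
    verify.quality_score >= 90 &&             // Verification: ≥90/100
    verify.layers.every(l => l.status === 'PASS' || l.status === 'SKIPPED')
  );
}
```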
end-to-end-orchestrator provides complete automation from feature description to production-ready code, coordinating all 5 multi-ai skills with quality gates, failure recovery, and state management, delivering enterprise-grade development workflows in a single command.
For examples, see examples/. For failure recovery, see Failure Recovery section.