# Multi-AI Verification (`multi-ai-verification`)

## Overview
multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
**Purpose**: Multi-layer independent verification ensuring production-ready quality

**Pattern**: Task-based (5 independent verification operations, one per layer)

**Key Innovation**: 5-layer pyramid (95% automated at the base → 0% at the apex) with independent verification preventing bias and test gaming
**Core Principles** (validated by tri-AI research):
- **Multi-Layer Defense**: 5 layers catch different types of issues
- **Independent Verification**: the verifying agent is separate from implementation and testing
- **Progressive Automation**: automate what can be automated (95% → 0% across the layers)
- **Quality Scoring**: objective 0-100 scoring with a ≥90 threshold
- **Actionable Feedback**: 100% of feedback is specific and actionable (What/Where/Why/How/Priority)

**Quality Gates**: All 5 layers must pass for production approval
## When to Use
Use multi-ai-verification when:
- Final quality check before commit/deployment
- Independent code review (preventing bias)
- Security verification (OWASP, vulnerabilities)
- Comprehensive QA (all layers)
- Test quality verification (prevent gaming)
- Production readiness validation
## Prerequisites

### Required
- Code to verify (implementation complete)
- Tests available (for functional verification)
- Quality standards defined
### Recommended
- multi-ai-testing - For generating/running tests
- multi-ai-implementation - For implementing fixes
### Tools Available
- Linters (ESLint, Pylint)
- Type checkers (TypeScript, mypy)
- Coverage tools (c8, pytest-cov)
- Security scanners (Semgrep, Bandit)
- Test frameworks (Jest, pytest)
## The 5-Layer Verification Pyramid

```
            Layer 5: Quality Scoring
           (LLM-as-judge, 0-20% automated)

          Layer 4: Integration
         (E2E, system, 20-30% automated)

        Layer 3: Visual
       (UI, screenshots, 30-50% automated)

      Layer 2: Functional
     (tests, coverage, 60-80% automated)

    Layer 1: Rules-Based
   (linting, types, schema, 95% automated)
```

**Principle**: Fail fast at the automated layers (cheap, fast) before the expensive LLM-as-judge evaluation.
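A minimal sketch of that fail-fast sequencing, assuming each layer is wrapped in a function that reports pass/fail plus issues; `LayerResult` and the `runLayerN` names are illustrative, not part of the skill:

```ts
interface LayerResult {
  layer: number;
  pass: boolean;
  issues: string[]; // actionable What/Where/Why/How/Priority items
}

type LayerFn = () => Promise<LayerResult>;

// Run the layers in pyramid order and stop at the first failure, so the
// cheap automated layers screen out problems before the costly LLM judge.
async function runGates(layers: LayerFn[]): Promise<LayerResult[]> {
  const results: LayerResult[] = [];
  for (const run of layers) {
    const result = await run();
    results.push(result);
    if (!result.pass) break; // failed gate → gap analysis → fix → re-verify
  }
  return results;
}

// Usage, assuming runLayer1..runLayer5 wrap the five operations below:
// const results = await runGates([runLayer1, runLayer2, runLayer3, runLayer4, runLayer5]);
// const approved = results.length === 5 && results.every((r) => r.pass);
```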
## Verification Operations

### Operation 1: Rules-Based Verification (Layer 1)

**Purpose**: Automated validation of code structure, formatting, and types

**Automation**: 95% | **Speed**: seconds (fast feedback) | **Confidence**: high (deterministic)
**Process**:

1. **Schema Validation** (if applicable):

   ```bash
   # Validate JSON/YAML against schemas
   ajv validate -s plan.schema.json -d plan.json
   ajv validate -s task.schema.json -d "tasks/*.json"
   ```

2. **Linting**:

   ```bash
   # JavaScript/TypeScript
   npx eslint "src/**/*.{ts,tsx,js,jsx}"

   # Python
   pylint src/**/*.py

   # Expected: zero linting errors
   ```

3. **Type Checking**:

   ```bash
   # TypeScript
   npx tsc --noEmit

   # Python
   mypy src/

   # Expected: zero type errors
   ```

4. **Format Validation**:

   ```bash
   # Check formatting
   npx prettier --check "src/**/*.{ts,tsx}"

   # Or auto-fix
   npx prettier --write "src/**/*.{ts,tsx}"
   ```

5. **Security Scanning (SAST)**:

   ```bash
   # Static security analysis
   semgrep --config=auto src/

   # Or for Python
   bandit -r src/

   # Check for:
   # - Hardcoded secrets
   # - SQL injection risks
   # - XSS vulnerabilities
   # - Insecure dependencies
   ```

6. **Generate Layer 1 Report**:

   ```markdown
   # Layer 1: Rules-Based Verification

   ## Schema Validation
   ✅ plan.json validates
   ✅ All task files validate

   ## Linting
   ✅ 0 linting errors
   ⚠️ 3 warnings (non-blocking)

   ## Type Checking
   ✅ 0 type errors

   ## Formatting
   ✅ All files formatted correctly

   ## Security Scan (SAST)
   ✅ No critical vulnerabilities
   ⚠️ 1 medium: weak password-hashing rounds (bcrypt)

   **Layer 1 Status**: ✅ PASS (0 critical issues)
   **Issues to Address**: 1 medium security issue
   ```
**Outputs**:
- Lint report (errors/warnings)
- Type check results
- Schema validation results
- Security scan findings
- Layer 1 status (PASS/FAIL)

**Validation**:
- All automated checks run
- Results documented
- Critical issues = 0 for PASS
- Actionable feedback for warnings

**Time Estimate**: 15-30 minutes (mostly automated)

**Gate 1**: ✅ PASS if no critical issues (warnings acceptable)
### Operation 2: Functional Verification (Layer 2)

**Purpose**: Validate functionality through test execution and coverage

**Automation**: 60-80% | **Speed**: minutes (medium feedback) | **Confidence**: high (measurable outcomes)
**Process**:

1. **Execute Complete Test Suite**:

   ```bash
   # Run all tests with coverage
   npm test -- --coverage --verbose

   # Capture results:
   # - Tests passed/failed
   # - Coverage metrics
   # - Execution time
   ```

2. **Validate Example Code** from documentation (see the sketch after this list):

   ```bash
   # Extract examples from SKILL.md
   # Execute each example automatically
   # Verify outputs match expected
   # Target: ≥90% of examples work
   ```

3. **Check Coverage**:

   ```markdown
   # Coverage Report

   **Line Coverage**: 87% ✅ (gate: ≥80%)
   **Branch Coverage**: 82% ✅
   **Function Coverage**: 92% ✅
   **Path Coverage**: 74% (informational; not gated)

   **Gate Status**: PASS ✅ (line/branch/function all ≥80%)

   **Uncovered Code**:
   - src/admin/legacy.ts: 23% (low priority)
   - src/utils/deprecated.ts: 15% (deprecated, ok)
   ```

4. **Regression Testing** (for updates):

   ```bash
   # Compare before/after
   git diff main...feature --stat

   # Run all tests
   npm test

   # Verify: no new failures (regression prevention)
   ```

5. **Performance Validation**:

   ```bash
   # Run performance tests
   npm run test:performance

   # Check response times
   # Verify: within acceptable ranges
   ```

6. **Generate Layer 2 Report**:

   ```markdown
   # Layer 2: Functional Verification

   ## Test Execution
   ✅ 245/245 tests passing (100%)
   ⏱️ Execution time: 8.3 seconds

   ## Coverage
   ✅ Line: 87% (gate: ≥80%)
   ✅ Branch: 82%
   ✅ Function: 92%

   ## Example Validation
   ✅ 18/20 examples work (90%)
   ❌ 2 examples fail (outdated)

   ## Regression
   ✅ All existing tests still pass

   ## Performance
   ✅ All endpoints <200ms

   **Layer 2 Status**: ✅ PASS
   **Issues**: 2 outdated examples (update docs)
   ```
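Step 2 above is described only as comments; a minimal sketch of what automated example validation could look like, assuming examples live in fenced `js` blocks in SKILL.md and can run standalone under Node (the regex and the `node --eval` harness are illustrative choices):

```ts
import { readFileSync } from "node:fs";
import { execFileSync } from "node:child_process";

const FENCE = "`".repeat(3); // avoids a literal triple backtick in this source

// Pull the body of every fenced `js` block out of the markdown source.
function extractExamples(markdown: string): string[] {
  const pattern = new RegExp(`${FENCE}js\\n([\\s\\S]*?)${FENCE}`, "g");
  return [...markdown.matchAll(pattern)].map((m) => m[1]);
}

// Run each example in a child Node process; a non-zero exit counts as a failure.
function validateExamples(path: string): { passed: number; total: number } {
  const examples = extractExamples(readFileSync(path, "utf8"));
  let passed = 0;
  for (const code of examples) {
    try {
      execFileSync("node", ["--eval", code], { stdio: "pipe" });
      passed++;
    } catch {
      // Failing example: record it and keep sweeping the rest.
    }
  }
  return { passed, total: examples.length };
}

const { passed, total } = validateExamples("SKILL.md");
console.log(`${passed}/${total} examples work`); // gate: ≥90%
```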
**Outputs**:
- Test execution results
- Coverage report
- Example validation results
- Regression check
- Performance metrics
- Layer 2 status

**Validation**:
- All tests executed
- Coverage meets the gate (≥80%)
- Examples validated (≥90% pass)
- No regressions
- Performance acceptable

**Time Estimate**: 30-60 minutes

**Gate 2**: ✅ PASS if all tests pass and coverage ≥80%
### Operation 3: Visual Verification (Layer 3)

**Purpose**: Validate UI appearance, layout, and accessibility (for UI features)

**Automation**: 30-50% | **Speed**: minutes to hours | **Confidence**: medium (subjective elements)
**Process**:

1. **Screenshot Generation**:

   ```bash
   # Generate screenshots of the UI
   # (set `screenshot: 'on'` in playwright.config, then run the suite)
   npx playwright test

   # Or manually:
   # - Open the application
   # - Capture screenshots of key views
   ```

2. **Visual Comparison** (if a previous version exists):

   ```bash
   # Compare against baseline
   npx playwright test --update-snapshots=missing

   # Or use Percy/Chromatic for visual regression
   npx percy snapshot screenshots/
   ```

3. **Layout Validation** (see the Playwright sketch after this list):

   ```markdown
   # Visual Checklist

   ## Layout
   - [ ] Components positioned correctly
   - [ ] Spacing/margins match mockup
   - [ ] Alignment proper
   - [ ] No overlapping elements

   ## Styling
   - [ ] Colors match design system
   - [ ] Typography correct (fonts, sizes)
   - [ ] Icons/images display properly

   ## Responsiveness
   - [ ] Mobile view (320px-480px): ✅
   - [ ] Tablet view (768px-1024px): ✅
   - [ ] Desktop view (>1024px): ✅
   ```

4. **Accessibility Testing**:

   ```bash
   # Automated accessibility scan (axe-core CLI)
   npx @axe-core/cli http://localhost:3000

   # Check WCAG compliance
   npx pa11y http://localhost:3000

   # Manual checks:
   # - Keyboard navigation
   # - Screen reader compatibility
   # - Color contrast ratios
   ```

5. **Generate Layer 3 Report**:

   ```markdown
   # Layer 3: Visual Verification

   ## Screenshot Comparison
   ✅ Login page matches mockup
   ✅ Dashboard layout correct
   ⚠️ Profile page: avatar alignment off by 5px

   ## Responsiveness
   ✅ Mobile: all components visible
   ✅ Tablet: layout adapts correctly
   ✅ Desktop: full functionality

   ## Accessibility
   ✅ WCAG 2.1 AA compliance
   ✅ Keyboard navigation works
   ⚠️ 2 color contrast warnings (non-critical)

   **Layer 3 Status**: ✅ PASS (minor issues acceptable)
   **Issues**: avatar alignment (cosmetic), contrast warnings
   ```
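A minimal Playwright sketch for the responsiveness pass above, assuming the app is served locally at http://localhost:3000; the `/dashboard` route and exact viewport sizes are illustrative stand-ins for the checklist breakpoints:

```ts
import { test } from "@playwright/test";

// One screenshot per checklist breakpoint, for baseline or Percy-style comparison.
const viewports = [
  { name: "mobile", width: 375, height: 667 },   // 320px-480px band
  { name: "tablet", width: 768, height: 1024 },  // 768px-1024px band
  { name: "desktop", width: 1440, height: 900 }, // >1024px
];

for (const vp of viewports) {
  test(`dashboard renders at ${vp.name} width`, async ({ page }) => {
    await page.setViewportSize({ width: vp.width, height: vp.height });
    await page.goto("http://localhost:3000/dashboard");
    await page.screenshot({
      path: `screenshots/dashboard-${vp.name}.png`,
      fullPage: true,
    });
  });
}
```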
**Outputs**:
- Screenshots of the UI
- Visual comparison results
- Responsiveness validation
- Accessibility report
- Layer 3 status

**Validation**:
- Screenshots captured
- Visual comparison done (if applicable)
- Layout validated
- Responsiveness tested
- Accessibility checked
- No critical visual issues

**Time Estimate**: 30-90 minutes (skip if there is no UI)

**Gate 3**: ✅ PASS if there are no critical visual/a11y issues
### Operation 4: Integration Verification (Layer 4)

**Purpose**: Validate system-level integration, data flow, and API compatibility

**Automation**: 20-30% | **Speed**: hours (complex) | **Confidence**: medium-high
**Process**:

1. **Component Integration Tests**:

   ```bash
   # Run the integration test suite
   npm test -- tests/integration/

   # Verify components work together:
   # - Database ↔ API
   # - API ↔ Frontend
   # - Frontend ↔ User
   ```

2. **Data Flow Validation**:

   ```markdown
   # Data Flow Verification

   **Flow 1: User Registration**
   Frontend form → API endpoint → Validation → Database → Email service
   ✅ Data flows correctly
   ✅ No data loss
   ✅ Transactions atomic

   **Flow 2: Authentication**
   Login request → API → Database lookup → Token generation → Response
   ✅ Token generated correctly
   ✅ Session stored
   ✅ Response includes token
   ```

3. **API Integration Tests**:

   ```bash
   # Test all API endpoints
   npm run test:api

   # Verify:
   # - All endpoints respond
   # - Status codes correct
   # - Response formats match spec
   # - Error handling works
   ```

4. **End-to-End Workflow Tests**:

   ```javascript
   // Complete user journeys
   test('Complete registration and login flow', async () => {
     // 1. Register new user
     const registerResponse = await api.post('/register', userData);
     expect(registerResponse.status).toBe(201);

     // 2. Confirm email
     const confirmResponse = await api.get(confirmLink);
     expect(confirmResponse.status).toBe(200);

     // 3. Login
     const loginResponse = await api.post('/login', credentials);
     expect(loginResponse.status).toBe(200);
     expect(loginResponse.data.token).toBeDefined();

     // 4. Access protected resource
     const profileResponse = await api.get('/profile', {
       headers: { Authorization: `Bearer ${loginResponse.data.token}` }
     });
     expect(profileResponse.status).toBe(200);
   });
   ```

5. **Dependency Compatibility**:

   ```bash
   # Check external dependencies
   npm audit

   # Check for breaking changes
   npm outdated

   # Verify integration with services:
   # - Database connection
   # - Redis/cache
   # - External APIs
   ```

6. **Generate Layer 4 Report**:

   ```markdown
   # Layer 4: Integration Verification

   ## Component Integration
   ✅ 12/12 integration tests passing
   ✅ All components integrate correctly

   ## Data Flow
   ✅ All 5 data flows validated
   ✅ No data loss or corruption

   ## API Integration
   ✅ All 15 endpoints functional
   ✅ Response formats correct
   ✅ Error handling works

   ## E2E Workflows
   ✅ 8/8 user journeys complete successfully
   ✅ No workflow breaks

   ## Dependencies
   ✅ 0 critical vulnerabilities
   ⚠️ 2 moderate (non-blocking)

   **Layer 4 Status**: ✅ PASS
   ```
**Outputs**:
- Integration test results
- Data flow validation
- API compatibility report
- E2E workflow results
- Dependency audit
- Layer 4 status

**Validation**:
- Integration tests pass
- Data flows validated
- APIs integrate correctly
- E2E workflows function
- Dependencies secure

**Time Estimate**: 45-90 minutes

**Gate 4**: ✅ PASS if all integration tests pass and no critical dependency issues remain
### Operation 5: Quality Scoring (Layer 5)

**Purpose**: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns

**Automation**: 0-20% | **Speed**: hours (expensive) | **Confidence**: medium (requires judgment)
**Process**:

1. **Spawn Independent Quality Assessor** (Agent-as-a-Judge):

   Key: use a different model family if possible (prevents self-preference bias).

   ```javascript
   const qualityAssessment = await task({
     description: "Assess code quality holistically",
     prompt: `Evaluate code quality in src/ and tests/.

   DO NOT read implementation conversation history.

   You have access to tools:
   - Read files
   - Execute tests
   - Run linters
   - Query database (if needed)

   Assess 5 dimensions (score each /20):

   1. CORRECTNESS (/20):
      - Logic correctness
      - Edge case handling
      - Error handling completeness
      - Security considerations

   2. FUNCTIONALITY (/20):
      - Meets all requirements
      - User workflows work
      - Performance acceptable
      - No regressions

   3. QUALITY (/20):
      - Code maintainability
      - Best practices followed
      - Anti-patterns avoided
      - Documentation complete

   4. INTEGRATION (/20):
      - Components integrate smoothly
      - API contracts correct
      - Data flow works
      - Backward compatible

   5. SECURITY (/20):
      - No vulnerabilities
      - Input validation
      - Authentication/authorization
      - Data protection

   TOTAL: /100 (sum of 5 dimensions)

   For each dimension, provide:
   - Score (/20)
   - Strengths (what's good)
   - Weaknesses (what needs improvement)
   - Evidence (file:line references)
   - Recommendations (specific, actionable)

   Write comprehensive report to: quality-assessment.md`
   });
   ```
2. **Multi-Agent Ensemble** (for critical features):

   A 3-5 agent voting committee (aggregation helpers are sketched after this step):

   ```javascript
   // Spawn 3 independent quality assessors
   const [judge1, judge2, judge3] = await Promise.all([
     task({ description: "Quality Judge 1", prompt: assessmentPrompt }),
     task({ description: "Quality Judge 2", prompt: assessmentPrompt }),
     task({ description: "Quality Judge 3", prompt: assessmentPrompt })
   ]);

   // Aggregate scores
   const scores = {
     correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
     functionality: median([...]),
     quality: median([...]),
     integration: median([...]),
     security: median([...])
   };
   const totalScore = sum(Object.values(scores)); // total /100

   // Check variance
   const totalScores = [judge1.total, judge2.total, judge3.total];
   const variance = max(totalScores) - min(totalScores);

   if (variance > 15) {
     // High disagreement → spawn 2 more judges (total 5)
     // Use the 5-agent ensemble for the final score
   }

   // Final score: median of 3 or 5
   ```
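   The ensemble snippet above leaves `median`, `sum`, `max`, and `min` undefined; a minimal sketch of those aggregation helpers over plain numeric arrays:

   ```ts
   // Median of a sorted copy; averages the two middle values for even-length input.
   function median(xs: number[]): number {
     const s = [...xs].sort((a, b) => a - b);
     const mid = Math.floor(s.length / 2);
     return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
   }

   const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);
   const max = (xs: number[]): number => Math.max(...xs);
   const min = (xs: number[]): number => Math.min(...xs);

   // Example: judges total 86, 91, 88 → median 88, spread 5 (<15, so no extra judges).
   console.log(median([86, 91, 88]), max([86, 91, 88]) - min([86, 91, 88]));
   ```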
3. **Calibration Against Rubric**:

   ```markdown
   # Scoring Calibration

   ## Correctness: 18/20 (Excellent)
   **20**: Zero errors, all edge cases handled perfectly
   **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
   **15**: 1-2 significant edge cases missing
   **10**: Some logic errors present
   **0**: Major functionality broken

   **Evidence**: All tests pass; edge cases covered except a timezone DST edge case (minor)

   ## Functionality: 19/20 (Excellent)
   [Similar rubric with evidence]

   ## Quality: 17/20 (Good)
   [Similar rubric with evidence]

   ## Integration: 18/20 (Excellent)
   [Similar rubric with evidence]

   ## Security: 16/20 (Good)
   [Similar rubric with evidence]

   **Total**: 88/100 ⚠️ (below the ≥90 gate)
   ```
4. **Gap Analysis** (if score <90):

   ```markdown
   # Quality Gap Analysis

   **Current Score**: 88/100
   **Target**: ≥90/100
   **Gap**: 2 points

   ## Critical Gaps (Blocking Approval)
   None

   ## High Priority (Should Fix for ≥90)
   1. **Security: Weak bcrypt rounds**
      - **What**: bcrypt using 10 rounds (outdated)
      - **Where**: src/auth/hash.ts:15
      - **Why**: Current standard is 12-14 rounds
      - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
      - **Priority**: High
      - **Impact**: +2 points → 90/100

   ## Medium Priority
   1. **Quality: Missing JSDoc for 3 functions**
      - Impact: +1 point → 91/100

   **Recommendation**: Fix the high-priority issue to reach the ≥90 threshold
   **Estimated Effort**: 15 minutes
   ```
5. **Generate Comprehensive Quality Report**:

   ```markdown
   # Layer 5: Quality Scoring Report

   ## Executive Summary
   **Total Score**: 88/100 ⚠️ (below the ≥90 gate)
   **Status**: NEEDS MINOR REVISION

   ## Dimension Scores
   - Correctness: 18/20 ⭐⭐⭐⭐⭐
   - Functionality: 19/20 ⭐⭐⭐⭐⭐
   - Quality: 17/20 ⭐⭐⭐⭐
   - Integration: 18/20 ⭐⭐⭐⭐⭐
   - Security: 16/20 ⭐⭐⭐⭐

   ## Strengths
   1. Comprehensive test coverage (87%)
   2. All functionality working correctly
   3. Clean integration with all components
   4. Good error handling

   ## Weaknesses
   1. Bcrypt rounds below current standard (security)
   2. Missing documentation for helper functions (quality)
   3. One timezone edge case not handled (correctness)

   ## Recommendations (Prioritized)

   ### Priority 1 (High - Needed for ≥90)
   1. Increase bcrypt rounds: 10 → 12
      - File: src/auth/hash.ts:15
      - Effort: 5 min
      - Impact: +2 points

   ### Priority 2 (Medium - Nice to Have)
   1. Add JSDoc to helper functions
      - Files: src/utils/validation.ts
      - Effort: 30 min
      - Impact: +1 point
   2. Handle timezone DST edge case
      - File: src/auth/tokens.ts:78
      - Effort: 20 min
      - Impact: +1 point

   **Next Steps**: Apply the Priority 1 fix, then re-verify to reach ≥90
   ```
**Outputs**:
- Quality score (0-100) with dimension breakdown
- Calibration against the rubric
- Gap analysis
- Prioritized recommendations (Critical/High/Medium/Low)
- Evidence-based feedback (file:line references)
- Action plan to reach ≥90

**Validation**:
- All 5 dimensions scored
- Scores calibrated against the rubric
- Evidence provided for each score
- Gap analysis if <90
- Recommendations actionable
- Ensemble used for critical features (optional)

**Time Estimate**: 60-120 minutes (the ensemble adds 30-60 min)

**Gate 5**: ✅ PASS if total score ≥90/100
## Quality Gates Summary

**All 5 gates must pass** for production approval:

```
Gate 1: Rules Pass ✅        (linting, types, schema, security)
        ↓
Gate 2: Tests Pass ✅        (all tests, coverage ≥80%)
        ↓
Gate 3: Visual OK ✅         (UI validated, a11y checked)
        ↓
Gate 4: Integration OK ✅    (E2E works, APIs integrate)
        ↓
Gate 5: Quality ≥90 ✅       (LLM-as-judge score ≥90/100)
        ↓
✅ PRODUCTION APPROVED
```

**If any gate fails**:

```
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
```
## Appendix A: Independence Protocol

### How Verification Independence is Maintained

**Verification agent spawning**:

```javascript
// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

DO NOT read prior conversation history.

Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md

Verify against specifications ONLY (not implementation decisions).

Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks

Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
```
**Bias Prevention Checklist**:
- Specifications written BEFORE implementation
- Verification agent prompt has no implementation context
- Agent evaluates against specs, not what code does
- Fresh context (via Task tool)
- Different model family used (if possible)
**Validation of Independence**:

```markdown
## Independence Audit

**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims

**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications

**If warnings appear**: Re-verify with a stronger independence prompt
```
## Appendix B: Operational Scoring Rubrics

### Complete Rubrics for All 5 Dimensions

#### Correctness (/20)
- **20 (Perfect)**: Zero logic errors, all edge cases handled, security perfect
- **18 (Excellent)**: 1 minor edge case missing, otherwise flawless
- **15 (Good)**: 2-3 edge cases missing, no critical errors
- **12 (Acceptable)**: Some edge cases missing, 1 minor logic issue
- **10 (Needs Work)**: Multiple edge cases missing or 1 significant logic error
- **5 (Poor)**: Major logic errors present
- **0 (Broken)**: Critical functionality broken

#### Functionality (/20)
- **20**: All requirements met, exceeds expectations
- **18**: All requirements met, well implemented
- **15**: All requirements met, basic implementation
- **12**: 1 requirement partially missing
- **10**: 2+ requirements partially missing
- **5**: Several requirements not met
- **0**: Core functionality missing

#### Quality (/20)
- **20**: Exceptional code quality, best practices exemplified
- **18**: High quality, follows best practices
- **15**: Good quality, minor style issues
- **12**: Acceptable quality, several style issues
- **10**: Below standard, needs refactoring
- **5**: Poor quality, significant issues
- **0**: Unmaintainable code

#### Integration (/20)
- **20**: Perfect integration, all touch points verified
- **18**: Excellent integration, minor docs needed
- **15**: Good integration, all major points work
- **12**: Acceptable, 1-2 integration issues
- **10**: Integration issues present
- **5**: Multiple integration problems
- **0**: Does not integrate

#### Security (/20)
- **20**: Passes all security scans, OWASP compliant, hardened
- **18**: Passes scans, 1 minor non-critical issue
- **15**: Passes, 2-3 minor issues
- **12**: 1 medium security issue
- **10**: Multiple medium issues
- **5**: 1 critical issue present
- **0**: Multiple critical vulnerabilities
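For repeatable calibration, each rubric can be encoded as a score-anchor lookup; a minimal sketch, where the nearest-anchor snapping rule is an illustrative choice rather than part of the skill:

```ts
// Score anchors for one dimension; descriptors mirror the Correctness rubric above.
const correctnessRubric: Record<number, string> = {
  20: "Zero logic errors, all edge cases handled, security perfect",
  18: "1 minor edge case missing, otherwise flawless",
  15: "2-3 edge cases missing, no critical errors",
  12: "Some edge cases missing, 1 minor logic issue",
  10: "Multiple edge cases missing or 1 significant logic error",
  5: "Major logic errors present",
  0: "Critical functionality broken",
};

// Snap a raw judge score to the nearest defined anchor for calibration.
function calibrate(
  score: number,
  rubric: Record<number, string>,
): { anchor: number; descriptor: string } {
  const anchors = Object.keys(rubric).map(Number);
  const anchor = anchors.reduce((best, a) =>
    Math.abs(a - score) < Math.abs(best - score) ? a : best,
  );
  return { anchor, descriptor: rubric[anchor] };
}

// calibrate(17, correctnessRubric) → { anchor: 18, descriptor: "1 minor edge case missing, ..." }
```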
## Appendix C: Technical Foundation

### Verification Tools

**Linting**:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)

**Type checking**:
- TypeScript compiler (tsc)
- mypy (Python)

**Security (SAST)**:
- Semgrep (multi-language)
- Bandit (Python)
- npm audit (JavaScript)

**Visual testing**:
- Playwright (screenshots, visual regression)
- Percy/Chromatic (visual diff)
- axe-core (accessibility)

**Coverage**:
- c8/nyc (JavaScript)
- pytest-cov (Python)
### Cost Controls

**Budget caps**:
- LLM-as-judge: $50/month
- Ensemble verification: $20/month
- Total verification: $70/month

**Optimization** (the first two heuristics are sketched below):
- Cache quality scores for 24h (same code → same score)
- Skip Layer 5 for changes <50 lines
- Use the ensemble (3-5 agents) only for critical features
- Use cheaper models for pre-filtering (e.g. Haiku for Layers 1-2)
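A minimal sketch of those two optimizations, assuming the Layer 5 score can be keyed on a content hash of the verified sources; the in-memory `Map` and the `runJudge` callback are illustrative stand-ins for whatever store and judge invocation are actually used:

```ts
import { createHash } from "node:crypto";

interface CachedScore { score: number; at: number }
const scoreCache = new Map<string, CachedScore>(); // stand-in for a real store
const DAY_MS = 24 * 60 * 60 * 1000;

// Same code → same score: key the Layer 5 result on a hash of the sources.
function cacheKey(sources: string[]): string {
  return createHash("sha256").update(sources.join("\n")).digest("hex");
}

async function scoreWithControls(
  sources: string[],
  changedLines: number,
  runJudge: () => Promise<number>,
): Promise<number | "skipped"> {
  if (changedLines < 50) return "skipped"; // skip Layer 5 for small changes

  const key = cacheKey(sources);
  const hit = scoreCache.get(key);
  if (hit && Date.now() - hit.at < DAY_MS) return hit.score; // 24h cache

  const score = await runJudge(); // the expensive LLM-as-judge call
  scoreCache.set(key, { score, at: Date.now() });
  return score;
}
```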
## Quick Reference

### The 5 Layers
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
**Total**: 3-6 hours for a complete 5-layer verification
### Quality Thresholds
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
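These bands map directly onto a small verdict helper; a trivial sketch:

```ts
type Verdict = "excellent" | "good" | "acceptable" | "poor";

// Map a 0-100 quality score to the threshold bands above.
function verdict(score: number): Verdict {
  if (score >= 90) return "excellent";  // production-ready
  if (score >= 80) return "good";       // needs minor improvements
  if (score >= 70) return "acceptable"; // needs work before production
  return "poor";                        // significant rework required
}

console.log(verdict(88)); // "good" → run gap analysis, fix, re-verify
```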
### Gates

**All 5 must pass**:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.