
Multi-AI Verification

Overview

multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.

Purpose: Multi-layer independent verification ensuring production-ready quality

Pattern: Task-based (5 independent verification operations, one per layer)

Key Innovation: 5-layer pyramid (95% automated at base → 0% at apex) with independent verification preventing bias and test gaming

Core Principles (validated by tri-AI research):

  1. Multi-Layer Defense - 5 layers catch different types of issues
  2. Independent Verification - Separate agent from implementation/testing
  3. Progressive Automation - Automate what can be automated (95% → 0%)
  4. Quality Scoring - Objective 0-100 scoring with ≥90 threshold
  5. Actionable Feedback - 100% of feedback is specific and actionable (What/Where/Why/How/Priority)

Quality Gates: All 5 layers must pass for production approval


When to Use

Use multi-ai-verification when:

  • Final quality check before commit/deployment
  • Independent code review (preventing bias)
  • Security verification (OWASP, vulnerabilities)
  • Comprehensive QA (all layers)
  • Test quality verification (prevent gaming)
  • Production readiness validation

Prerequisites

Required

  • Code to verify (implementation complete)
  • Tests available (for functional verification)
  • Quality standards defined

Recommended

  • multi-ai-testing - For generating/running tests
  • multi-ai-implementation - For implementing fixes

Tools Available

  • Linters (ESLint, Pylint)
  • Type checkers (TypeScript, mypy)
  • Coverage tools (c8, pytest-cov)
  • Security scanners (Semgrep, Bandit)
  • Test frameworks (Jest, pytest)

The 5-Layer Verification Pyramid

                /\
               /  \
              / L5 \      Layer 5: Quality Scoring (LLM-as-Judge, 0-20% automated)
             /------\
            /   L4   \    Layer 4: Integration (E2E, System, 20-30% automated)
           /----------\
          /     L3     \  Layer 3: Visual (UI, Screenshots, 30-50% automated)
         /--------------\
        /       L2       \  Layer 2: Functional (Tests, Coverage, 60-80% automated)
       /------------------\
      /         L1         \  Layer 1: Rules-Based (Linting, Types, Schema, 95% automated)
     /----------------------\

Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
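
A minimal sketch of that ordering, assuming each layer exposes a runner that returns a pass/fail result (runLayer1 through runLayer5 are hypothetical wrappers around the five operations below):

// Fail-fast pipeline: cheap automated layers gate the expensive
// LLM-as-judge evaluation at the apex.
async function verify(target) {
  const layers = [runLayer1, runLayer2, runLayer3, runLayer4, runLayer5];
  for (const [i, runLayer] of layers.entries()) {
    const result = await runLayer(target);
    if (!result.pass) {
      // Stop at the first failing layer; return its issues for gap analysis
      return { approved: false, failedLayer: i + 1, issues: result.issues };
    }
  }
  return { approved: true };
}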


Verification Operations

Operation 1: Rules-Based Verification (Layer 1)

Purpose: Automated validation of code structure, formatting, types

Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)

Process:

  1. Schema Validation (if applicable):

    # Validate JSON/YAML against schemas
    ajv validate -s plan.schema.json -d plan.json
    ajv validate -s task.schema.json -d "tasks/*.json"
    
  2. Linting:

    # JavaScript/TypeScript
    npx eslint "src/**/*.{ts,tsx,js,jsx}"
    
    # Python
    pylint src/**/*.py
    
    # Expected: Zero linting errors
    
  3. Type Checking:

    # TypeScript
    npx tsc --noEmit
    
    # Python
    mypy src/
    
    # Expected: Zero type errors
    
  4. Format Validation:

    # Check formatting
    npx prettier --check "src/**/*.{ts,tsx}"
    
    # Or auto-fix
    npx prettier --write "src/**/*.{ts,tsx}"
    
  5. Security Scanning (SAST):

    # Static security analysis (Semgrep; install via pip or brew)
    semgrep --config=auto src/
    
    # Or for Python
    bandit -r src/
    
    # Check for:
    # - Hardcoded secrets
    # - SQL injection risks
    # - XSS vulnerabilities
    # - Insecure dependencies
    
  6. Generate Layer 1 Report:

    # Layer 1: Rules-Based Verification
    
    ## Schema Validation
    ✅ plan.json validates
    ✅ All task files validate
    
    ## Linting
    ✅ 0 linting errors
    ⚠️ 3 warnings (non-blocking)
    
    ## Type Checking
    ✅ 0 type errors
    
    ## Formatting
    ✅ All files formatted correctly
    
    ## Security Scan (SAST)
    ✅ No critical vulnerabilities
    ⚠️ 1 medium: Weak password hashing rounds (bcrypt)
    
    **Layer 1 Status**: ✅ PASS (0 critical issues)
    **Issues to Address**: 1 medium security issue
    

Outputs:

  • Lint report (errors/warnings)
  • Type check results
  • Schema validation results
  • Security scan findings
  • Layer 1 status (PASS/FAIL)

Validation:

  • All automated checks run
  • Results documented
  • Critical issues = 0 for PASS
  • Actionable feedback for warnings

Time Estimate: 15-30 minutes (mostly automated)

Gate 1: ✅ PASS if no critical issues (warnings acceptable)
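
Gate 1 can itself be scripted. A minimal sketch (Node.js; assumes the tools above are installed and configured):

// Run Layer 1 checks in sequence; fail fast on the first non-zero exit code
import { execSync } from "node:child_process";

const checks = [
  ["Linting", `npx eslint "src/**/*.{ts,tsx}"`],
  ["Type checking", "npx tsc --noEmit"],
  ["Formatting", `npx prettier --check "src/**/*.{ts,tsx}"`],
  ["Security scan", "semgrep --config=auto src/"]
];

for (const [name, cmd] of checks) {
  try {
    execSync(cmd, { stdio: "inherit" });
    console.log(`✅ ${name} passed`);
  } catch {
    console.error(`❌ ${name} failed; Gate 1 blocked`);
    process.exit(1);
  }
}
console.log("Gate 1: PASS");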


Operation 2: Functional Verification (Layer 2)

Purpose: Validate functionality through test execution and coverage

Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)

Process:

  1. Execute Complete Test Suite:

    # Run all tests with coverage
    npm test -- --coverage --verbose
    
    # Capture results
    # - Tests passed/failed
    # - Coverage metrics
    # - Execution time
    
  2. Validate Example Code (from documentation; see the extraction sketch at the end of this operation):

    # Extract examples from SKILL.md
    # Execute each example automatically
    # Verify outputs match expected
    
    # Target: ≥90% examples work
    
  3. Check Coverage:

    # Coverage Report
    
    **Line Coverage**: 87% ✅ (gate: ≥80%)
    **Branch Coverage**: 82% ✅
    **Function Coverage**: 92% ✅
    **Path Coverage**: 74% ⚠️ (informational, not gated)
    
    **Gate Status**: PASS ✅ (line/branch/function all ≥80%)
    
    **Uncovered Code**:
    - src/admin/legacy.ts: 23% (low priority)
    - src/utils/deprecated.ts: 15% (deprecated, ok)
    
  4. Regression Testing (for updates):

    # Compare before/after
    git diff main...feature --stat
    
    # Run all tests
    npm test
    
    # Verify: No new failures (regression prevention)
    
  5. Performance Validation:

    # Run performance tests
    npm run test:performance
    
    # Check response times
    # Verify: Within acceptable ranges
    
  6. Generate Layer 2 Report:

    # Layer 2: Functional Verification
    
    ## Test Execution
    ✅ 245/245 tests passing (100%)
    ⏱️ Execution time: 8.3 seconds
    
    ## Coverage
    ✅ Line: 87% (gate: ≥80%)
    ✅ Branch: 82%
    ✅ Function: 92%
    
    ## Example Validation
    ✅ 18/20 examples work (90%)
    ❌ 2 examples fail (outdated)
    
    ## Regression
    ✅ All existing tests still pass
    
    ## Performance
    ✅ All endpoints <200ms
    
    **Layer 2 Status**: ✅ PASS
    **Issues**: 2 outdated examples (update docs)
    

Outputs:

  • Test execution results
  • Coverage report
  • Example validation results
  • Regression check
  • Performance metrics
  • Layer 2 status

Validation:

  • All tests executed
  • Coverage meets gate (≥80%)
  • Examples validated (≥90%)
  • No regressions
  • Performance acceptable

Time Estimate: 30-60 minutes

Gate 2: ✅ PASS if tests pass + coverage ≥80%
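
A minimal sketch of the example-validation step above, assuming shell examples fenced in the markdown source and the ≥90% pass target:

// Extract fenced shell examples from a markdown doc and execute each one
import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

const doc = readFileSync("SKILL.md", "utf8");
const examples = [...doc.matchAll(/```(?:bash|sh)\n([\s\S]*?)```/g)].map((m) => m[1]);

let passed = 0;
for (const code of examples) {
  try {
    execSync(code, { stdio: "pipe" });
    passed++;
  } catch {
    // Record the failing example for the Layer 2 report
  }
}
console.log(`${passed}/${examples.length} examples work`);
if (examples.length && passed / examples.length < 0.9) process.exit(1); // gate: ≥90%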


Operation 3: Visual Verification (Layer 3)

Purpose: Validate UI appearance, layout, accessibility (for UI features)

Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)

Process:

  1. Screenshot Generation:

    # Generate screenshots of the UI during Playwright runs
    # (set screenshot: 'on' in playwright.config)
    npx playwright test
    
    # Or manually:
    # Open the application
    # Capture screenshots of key views
    
  2. Visual Comparison (if a previous version exists; see the test sketch at the end of this operation):

    # Compare against stored baselines (toHaveScreenshot assertions)
    npx playwright test
    
    # Refresh baselines after intentional UI changes
    npx playwright test --update-snapshots
    
    # Or use Percy/Chromatic for visual regression
    npx percy snapshot screenshots/
    
  3. Layout Validation:

    # Visual Checklist
    
    ## Layout
    - [ ] Components positioned correctly
    - [ ] Spacing/margins match mockup
    - [ ] Alignment proper
    - [ ] No overlapping elements
    
    ## Styling
    - [ ] Colors match design system
    - [ ] Typography correct (fonts, sizes)
    - [ ] Icons/images display properly
    
    ## Responsiveness
    - [ ] Mobile view (320px-480px): ✅
    - [ ] Tablet view (768px-1024px): ✅
    - [ ] Desktop view (>1024px): ✅
    
  4. Accessibility Testing:

    # Automated accessibility scan (against the running app)
    npx @axe-core/cli http://localhost:3000
    
    # Check WCAG compliance
    npx pa11y http://localhost:3000
    
    # Manual checks:
    # - Keyboard navigation
    # - Screen reader compatibility
    # - Color contrast ratios
    
  5. Generate Layer 3 Report:

    # Layer 3: Visual Verification
    
    ## Screenshot Comparison
    ✅ Login page matches mockup
    ✅ Dashboard layout correct
    ⚠️ Profile page: Avatar alignment off by 5px
    
    ## Responsiveness
    ✅ Mobile: All components visible
    ✅ Tablet: Layout adapts correctly
    ✅ Desktop: Full functionality
    
    ## Accessibility
    ✅ WCAG 2.1 AA compliance
    ✅ Keyboard navigation works
    ⚠️ 2 color contrast warnings (non-critical)
    
    **Layer 3 Status**: ✅ PASS (minor issues acceptable)
    **Issues**: Avatar alignment (cosmetic), contrast warnings
    

Outputs:

  • Screenshots of UI
  • Visual comparison results
  • Responsiveness validation
  • Accessibility report
  • Layer 3 status

Validation:

  • Screenshots captured
  • Visual comparison done (if applicable)
  • Layout validated
  • Responsiveness tested
  • Accessibility checked
  • No critical visual issues

Time Estimate: 30-90 minutes (skip if no UI)

Gate 3: ✅ PASS if no critical visual/a11y issues
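
For the visual comparison step (step 2 above), a minimal Playwright test looks like this, assuming @playwright/test is configured and the app runs on localhost:3000:

// Visual regression: compares the page against a stored baseline screenshot
import { test, expect } from "@playwright/test";

test("dashboard matches baseline", async ({ page }) => {
  await page.goto("http://localhost:3000/dashboard");
  // Fails when pixel diffs exceed the threshold;
  // run with --update-snapshots to refresh the baseline intentionally
  await expect(page).toHaveScreenshot("dashboard.png", { maxDiffPixelRatio: 0.01 });
});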


Operation 4: Integration Verification (Layer 4)

Purpose: Validate system-level integration, data flow, API compatibility

Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High

Process:

  1. Component Integration Tests:

    # Run integration test suite
    npm test -- tests/integration/
    
    # Verify components work together
    # - Database ↔ API
    # - API ↔ Frontend
    # - Frontend ↔ User
    
  2. Data Flow Validation:

    # Data Flow Verification
    
    **Flow 1: User Registration**
    Frontend form → API endpoint → Validation → Database → Email service
    ✅ Data flows correctly
    ✅ No data loss
    ✅ Transactions atomic
    
    **Flow 2: Authentication**
    Login request → API → Database lookup → Token generation → Response
    ✅ Token generated correctly
    ✅ Session stored
    ✅ Response includes token
    
  3. API Integration Tests (see the endpoint sketch at the end of this operation):

    # Test all API endpoints
    npm run test:api
    
    # Verify:
    # - All endpoints respond
    # - Status codes correct
    # - Response formats match spec
    # - Error handling works
    
  4. End-to-End Workflow Tests:

    // Complete user journeys
    test('Complete registration and login flow', async () => {
      // 1. Register new user
      const registerResponse = await api.post('/register', userData);
      expect(registerResponse.status).toBe(201);
    
      // 2. Confirm email
      const confirmResponse = await api.get(confirmLink);
      expect(confirmResponse.status).toBe(200);
    
      // 3. Login
      const loginResponse = await api.post('/login', credentials);
      expect(loginResponse.status).toBe(200);
      expect(loginResponse.data.token).toBeDefined();
    
      // 4. Access protected resource
      const profileResponse = await api.get('/profile', {
        headers: { Authorization: `Bearer ${loginResponse.data.token}` }
      });
      expect(profileResponse.status).toBe(200);
    });
    
  5. Dependency Compatibility:

    # Check external dependencies work
    npm audit
    
    # Check for breaking changes
    npm outdated
    
    # Verify integration with services
    # - Database connection
    # - Redis/cache
    # - External APIs
    
  6. Generate Layer 4 Report:

    # Layer 4: Integration Verification
    
    ## Component Integration
    ✅ 12/12 integration tests passing
    ✅ All components integrate correctly
    
    ## Data Flow
    ✅ All 5 data flows validated
    ✅ No data loss or corruption
    
    ## API Integration
    ✅ All 15 endpoints functional
    ✅ Response formats correct
    ✅ Error handling works
    
    ## E2E Workflows
    ✅ 8/8 user journeys complete successfully
    ✅ No workflow breaks
    
    ## Dependencies
    ✅ 0 critical vulnerabilities
    ⚠️ 2 moderate (non-blocking)
    
    **Layer 4 Status**: ✅ PASS
    

Outputs:

  • Integration test results
  • Data flow validation
  • API compatibility report
  • E2E workflow results
  • Dependency audit
  • Layer 4 status

Validation:

  • Integration tests pass
  • Data flows validated
  • APIs integrate correctly
  • E2E workflows function
  • Dependencies secure

Time Estimate: 45-90 minutes

Gate 4: ✅ PASS if all integration tests pass, no critical dependencies
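
A minimal sketch of the API integration step (step 3 above); the endpoints and expected statuses are illustrative, and the service is assumed to run on localhost:3000:

// Verify each endpoint responds with the expected status code
const endpoints = [
  { method: "GET", path: "/health", expectStatus: 200 },
  { method: "POST", path: "/login", body: { email: "x@example.com", password: "wrong" }, expectStatus: 401 }
];

for (const ep of endpoints) {
  const res = await fetch(`http://localhost:3000${ep.path}`, {
    method: ep.method,
    headers: { "Content-Type": "application/json" },
    body: ep.body ? JSON.stringify(ep.body) : undefined
  });
  if (res.status !== ep.expectStatus) {
    throw new Error(`${ep.method} ${ep.path}: expected ${ep.expectStatus}, got ${res.status}`);
  }
}
console.log("All endpoint checks passed");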


Operation 5: Quality Scoring (Layer 5)

Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns

Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)

Process:

  1. Spawn Independent Quality Assessor (Agent-as-a-Judge):

    Key: Use different model family if possible (prevent self-preference bias)

    const qualityAssessment = await task({
      description: "Assess code quality holistically",
      prompt: `Evaluate code quality in src/ and tests/.
    
      DO NOT read implementation conversation history.
    
      You have access to tools:
      - Read files
      - Execute tests
      - Run linters
      - Query database (if needed)
    
      Assess 5 dimensions (score each /20):
    
      1. CORRECTNESS (/20):
         - Logic correctness
         - Edge case handling
         - Error handling completeness
         - Security considerations
    
      2. FUNCTIONALITY (/20):
         - Meets all requirements
         - User workflows work
         - Performance acceptable
         - No regressions
    
      3. QUALITY (/20):
         - Code maintainability
         - Best practices followed
         - Anti-patterns avoided
         - Documentation complete
    
      4. INTEGRATION (/20):
         - Components integrate smoothly
         - API contracts correct
         - Data flow works
         - Backward compatible
    
      5. SECURITY (/20):
         - No vulnerabilities
         - Input validation
         - Authentication/authorization
         - Data protection
    
      TOTAL: /100 (sum of 5 dimensions)
    
      For each dimension, provide:
      - Score (/20)
      - Strengths (what's good)
      - Weaknesses (what needs improvement)
      - Evidence (file:line references)
      - Recommendations (specific, actionable)
    
      Write comprehensive report to: quality-assessment.md`
    });
    
  2. Multi-Agent Ensemble (for critical features):

    3-5 Agent Voting Committee:

    // Helper: median of a numeric array (robust to one outlier judge)
    const median = (xs) => {
      const s = [...xs].sort((a, b) => a - b);
      const mid = Math.floor(s.length / 2);
      return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
    };
    
    // Spawn 3 independent quality assessors in parallel
    const [judge1, judge2, judge3] = await Promise.all([
      task({description: "Quality Judge 1", prompt: assessmentPrompt}),
      task({description: "Quality Judge 2", prompt: assessmentPrompt}),
      task({description: "Quality Judge 3", prompt: assessmentPrompt})
    ]);
    const judges = [judge1, judge2, judge3];
    
    // Aggregate per-dimension scores via the median
    const dimensions = ["correctness", "functionality", "quality", "integration", "security"];
    const scores = Object.fromEntries(
      dimensions.map((d) => [d, median(judges.map((j) => j[d]))])
    );
    
    const totalScore = Object.values(scores).reduce((a, b) => a + b, 0); // Total /100
    
    // Check variance across the judges' totals
    const totals = judges.map((j) => j.total);
    const variance = Math.max(...totals) - Math.min(...totals);
    
    if (variance > 15) {
      // High disagreement → spawn 2 more judges (total 5)
      // and use the 5-agent ensemble for the final score
    }
    
    // Final score: median of 3 (or 5) judge totals
    
  3. Calibration Against Rubric:

    # Scoring Calibration
    
    ## Correctness: 18/20 (Excellent)
    **20**: Zero errors, all edge cases handled perfectly
    **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
    **15**: 1-2 significant edge cases missing
    **10**: Some logic errors present
    **0**: Major functionality broken
    
    **Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)
    
    ## Functionality: 19/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Quality: 17/20 (Good)
    [Similar rubric with evidence]
    
    ## Integration: 18/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Security: 16/20 (Good)
    [Similar rubric with evidence]
    
    **Total**: 88/100 ⚠️ (Below ≥90 gate)
    
  4. Gap Analysis (if <90):

    # Quality Gap Analysis
    
    **Current Score**: 88/100
    **Target**: ≥90/100
    **Gap**: 2 points
    
    ## Critical Gaps (Blocking Approval)
    None
    
    ## High Priority (Should Fix for ≥90)
    1. **Security: Weak bcrypt rounds**
       - **What**: bcrypt using 10 rounds (outdated)
       - **Where**: src/auth/hash.ts:15
       - **Why**: Current standard is 12-14 rounds
       - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
       - **Priority**: High
       - **Impact**: +2 points → 90/100
    
    ## Medium Priority
    1. **Quality: Missing JSDoc for 3 functions**
       - Impact: +1 point → 91/100
    
    **Recommendation**: Fix high priority issue to reach ≥90 threshold
    **Estimated Effort**: 15 minutes
    
  5. Generate Comprehensive Quality Report:

    # Layer 5: Quality Scoring Report
    
    ## Executive Summary
    **Total Score**: 88/100 ⚠️ (Below ≥90 gate)
    **Status**: NEEDS MINOR REVISION
    
    ## Dimension Scores
    - Correctness: 18/20 ⭐⭐⭐⭐⭐
    - Functionality: 19/20 ⭐⭐⭐⭐⭐
    - Quality: 17/20 ⭐⭐⭐⭐
    - Integration: 18/20 ⭐⭐⭐⭐⭐
    - Security: 16/20 ⭐⭐⭐⭐
    
    ## Strengths
    1. Comprehensive test coverage (87%)
    2. All functionality working correctly
    3. Clean integration with all components
    4. Good error handling
    
    ## Weaknesses
    1. Bcrypt rounds below current standard (security)
    2. Missing documentation for helper functions (quality)
    3. One timezone edge case not handled (correctness)
    
    ## Recommendations (Prioritized)
    
    ### Priority 1 (High - Needed for ≥90)
    1. Increase bcrypt rounds: 10 → 12
       - File: src/auth/hash.ts:15
       - Effort: 5 min
       - Impact: +2 points
    
    ### Priority 2 (Medium - Nice to Have)
    1. Add JSDoc to helper functions
       - Files: src/utils/validation.ts
       - Effort: 30 min
       - Impact: +1 point
    
    2. Handle timezone DST edge case
       - File: src/auth/tokens.ts:78
       - Effort: 20 min
       - Impact: +1 point
    
    **Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
    

Outputs:

  • Quality score (0-100) with dimension breakdown
  • Calibrated against rubric
  • Gap analysis
  • Prioritized recommendations (Critical/High/Medium/Low)
  • Evidence-based feedback (file:line references)
  • Action plan to reach ≥90

Validation:

  • All 5 dimensions scored
  • Scores calibrated against rubric
  • Evidence provided for each score
  • Gap analysis if <90
  • Recommendations actionable
  • Ensemble used for critical features (optional)

Time Estimate: 60-120 minutes (ensemble adds 30-60 min)

Gate 5: ✅ PASS if total score ≥90/100


Quality Gates Summary

All 5 Gates Must Pass for production approval:

Gate 1: Rules Pass ✅
   ↓ (Linting, types, schema, security)

Gate 2: Tests Pass ✅
   ↓ (All tests, coverage ≥80%)

Gate 3: Visual OK ✅
   ↓ (UI validated, a11y checked)

Gate 4: Integration OK ✅
   ↓ (E2E works, APIs integrate)

Gate 5: Quality ≥90 ✅
   ↓ (LLM-as-judge score ≥90/100)

✅ PRODUCTION APPROVED

If Any Gate Fails:

Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
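
A sketch of that loop, reusing the verify() runner from the pyramid section; analyzeGaps and applyFixes are hypothetical helpers, and iterations are capped to bound cost:

let attempt = 0;
let result = await verify(target);
while (!result.approved && attempt < 3) {
  const gaps = await analyzeGaps(result);  // What/Where/Why/How/Priority items
  await applyFixes(gaps);                  // e.g., hand off to multi-ai-implementation
  result = await verify(target);           // re-verify from Layer 1
  attempt++;
}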

Appendix A: Independence Protocol

How Verification Independence is Maintained

Verification Agent Spawning:

// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

  DO NOT read prior conversation history.

  Review:
  - Code: src/**/*.ts
  - Tests: tests/**/*.test.ts
  - Specs: specs/requirements.md

  Verify against specifications ONLY (not implementation decisions).

  Use tools:
  - Read files to inspect code
  - Run tests to verify functionality
  - Execute linters for quality checks

  Score quality (0-100) with evidence.
  Write report to: independent-verification.md`
});

Bias Prevention Checklist:

  • Specifications written BEFORE implementation
  • Verification agent prompt has no implementation context
  • Agent evaluates against specs, not what code does
  • Fresh context (via Task tool)
  • Different model family used (if possible)

Validation of Independence:

## Independence Audit

**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims

**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications

**If Warning**: Re-verify with stronger independence prompt

Appendix B: Operational Scoring Rubrics

Complete Rubrics for All 5 Dimensions

Correctness (/20)

20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken

Functionality (/20)

20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing

Quality (/20)

20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code

Integration (/20)

20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate

Security (/20)

20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities


Appendix C: Technical Foundation

Verification Tools

Linting:

  • ESLint (JavaScript/TypeScript)
  • Pylint/Ruff (Python)

Type Checking:

  • TypeScript compiler (tsc)
  • mypy (Python)

Security (SAST):

  • Semgrep (multi-language)
  • Bandit (Python)
  • npm audit (JavaScript)

Visual Testing:

  • Playwright (screenshot, visual regression)
  • Percy/Chromatic (visual diff)
  • axe-core (accessibility)

Coverage:

  • c8/nyc (JavaScript)
  • pytest-cov (Python)

Cost Controls

Budget Caps:

  • LLM-as-judge: $50/month
  • Ensemble verification: $20/month
  • Total verification: $70/month

Optimization:

  • Cache quality scores for 24h (same code → same score; see the sketch after this list)
  • Skip Layer 5 for changes <50 lines
  • Use ensemble (3-5 agents) only for critical features
  • Use cheaper models for pre-filtering (Haiku for Layer 1-2)
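
A sketch of that score cache, keyed by a content hash of the verified files so identical code reuses the prior score (in-memory Map for illustration; a real setup would persist it):

import { createHash } from "node:crypto";

const cache = new Map(); // contentHash → { score, expiresAt }
const TTL_MS = 24 * 60 * 60 * 1000; // 24h

function cachedScore(fileContents, computeScore) {
  const hash = createHash("sha256").update(fileContents.join("\n")).digest("hex");
  const hit = cache.get(hash);
  if (hit && hit.expiresAt > Date.now()) return hit.score; // same code → same score
  const score = computeScore(fileContents);
  cache.set(hash, { score, expiresAt: Date.now() + TTL_MS });
  return score;
}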

Quick Reference

The 5 Layers

Layer | Purpose         | Automation | Time    | Tools
------|-----------------|------------|---------|--------------------------
1     | Rules-based     | 95%        | 15-30m  | Linters, types, SAST
2     | Functional      | 60-80%     | 30-60m  | Test execution, coverage
3     | Visual          | 30-50%     | 30-90m  | Screenshots, a11y
4     | Integration     | 20-30%     | 45-90m  | E2E, API tests
5     | Quality Scoring | 0-20%      | 60-120m | LLM-as-judge, ensemble

Total: 3-6 hours for complete 5-layer verification

Quality Thresholds

  • ≥90: ✅ Excellent (production-ready)
  • 80-89: ⚠️ Good (needs minor improvements)
  • 70-79: ❌ Acceptable (needs work before production)
  • <70: ❌ Poor (significant rework required)
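
These bands map directly to a verdict, as in this small sketch:

// Map a total /100 score to the threshold bands above
function verdict(score) {
  if (score >= 90) return "✅ Excellent (production-ready)";
  if (score >= 80) return "⚠️ Good (needs minor improvements)";
  if (score >= 70) return "❌ Acceptable (needs work before production)";
  return "❌ Poor (significant rework required)";
}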

Gates

All 5 Must Pass:

  1. Rules pass (no critical lint/type/security)
  2. Tests pass + coverage ≥80%
  3. Visual OK (no critical UI issues)
  4. Integration OK (E2E works)
  5. Quality ≥90/100

multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.

For rubrics, see Appendix B. For independence protocol, see Appendix A.
