aeo-core
AEO Core - Confidence Engine
Purpose: Calculate confidence scores (0-1) and decide whether to execute autonomously or involve the human.
Activation
Loads when user types /aeo in Claude Code.
Confidence Calculation
Phase 0: Spec Validation (First Gate)
// Invoke spec-validator to check if task is well-defined
spec_score = aeo_spec_validator.validate(task)
if spec_score < 40:
return REFUSE("Spec too unclear - need more details")
// Continue with confidence calculation
Phase 1: Rule-Based Score
Start with base confidence of 0.50, then adjust:
Add for Clarity (+0.15 each):
- Explicit acceptance criteria defined
- Tech stack specified
- Dependencies listed
- Test requirements defined
Add for Context (+0.10 each):
- Similar successful task exists in memory
- Familiar codebase area
- Recent successful commits in this area
Subtract for Risk (-0.10 each):
- Touching authentication/security
- Modifying core infrastructure
- Large scope (>5 files, >500 LOC)
- Unclear dependencies
base_confidence = clamp(0.50 + clarity_score + context_score - risk_score, 0.0, 1.0)
Phase 2: Spec Score Adjustment
// Adjust based on spec quality
if spec_score >= 80: base_confidence += 0.10 // Excellent spec
elif spec_score < 60: base_confidence -= 0.10 // Poor spec
Phase 3: Security Multipliers
// Critical areas get confidence penalty
if task.touches_payments: base_confidence *= 0.5 // Payments need human oversight
elif task.touches_auth: base_confidence *= 0.7 // Auth needs review
Phase 4: LLM Adjustment (Optional)
If you have uncertainty about the task, adjust ±0.10:
final_confidence = clamp(base_confidence + llm_adjustment, 0.0, 1.0)
Decision Thresholds
Based on final_confidence, decide execution path:
≥ 0.85: AUTONOMOUS
- Execute without asking
- Note: "Confidence: 0.XX - proceeding autonomously"
- Continue to execution loop
≥ 0.70: ADVISORY
- Note risk clearly
- Offer to pause: "Confidence: 0.XX - [CONCERNS]. I can proceed or pause."
- If no response in 5 seconds, continue
- Otherwise wait for human input
≥ 0.50: BLOCKING
- Explain concerns
- Wait for confirmation before proceeding
- Format:
⚠️ CONFIDENCE BELOW THRESHOLD Confidence: 0.XX Threshold: 0.70 Concerns: • [Spec] Missing acceptance criteria • [Risk] Touching authentication • [Context] No similar tasks in memory Options: 1. Proceed with assumptions 2. Clarify spec first 3. Break into smaller tasks Please confirm (1-3):
< 0.50: REFUSE
- Explain why task can't be executed
- Request clarification or spec improvement
- Format:
❌ CANNOT EXECUTE - INSUFFICIENT CONFIDENCE Confidence: 0.XX Why: • Spec score: 35/100 (below 40 threshold) • Touching security without clear requirements • No acceptance criteria defined What's needed: 1. Clear acceptance criteria 2. Security requirements specified 3. Test requirements defined Please improve spec and try again.
Learning from Outcomes
After task completes, write signal to memory:
# Append to signal log
echo '{
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"task_id": "unique-id",
"task_description": "brief description",
"predicted_confidence": 0.85,
"actual_difficulty": "easy|medium|hard",
"success": true,
"adjustment": +0.05
}' >> ~/.claude/MEMORY/aeo-signals.jsonl
Actual Difficulty Rating:
- easy: Task went smoothly, no blockers
- medium: Minor issues or clarifications needed
- hard: Significant problems, multiple iterations
Confidence Adjustment:
- easy + success: +0.05
- medium + success: +0.00
- hard + success: -0.05
- any failure: -0.10
Rolling Window: Keep last 100 signals, calculate adjustment average
Reading Past Signals
On startup, read recent signals to calibrate:
# Get last 100 signals
tail -100 ~/.claude/MEMORY/aeo-signals.jsonl | jq -s '. | map(.adjustment) | add / length'
Apply average adjustment as offset to all confidence calculations.
Integration Flow
- User activates: Types
/aeo - Calculate confidence: Follow phases 0-4
- Make decision: Based on thresholds
- If autonomous: Execute task
- If advisory/blocking: Invoke aeo-escalation skill
- Post-execution: Invoke aeo-qa-agent for review
- Record outcome: Write to signal log
- Update model: Adjust future confidence based on outcome
Memory Files
- Signals:
$PAI_DIR/MEMORY/aeo-signals.jsonl - Escalations:
$PAI_DIR/MEMORY/aeo-escalations.jsonl - Patterns:
$PAI_DIR/MEMORY/aeo-failure-patterns.json
Escalation Triggers
Invoke aeo-escalation skill when:
- Confidence < 0.70 (advisory/blocking)
- Spec score < 40 (refuse)
- QA veto occurs
- Failure pattern can't be resolved
- Cost limit approaching (if cost-governor enabled)
Example Session
User: /aeo
User: Add user authentication with email verification
AEO: [Invoking aeo-spec-validator]
AEO: Spec score: 72/100
AEO: Calculating confidence...
- Base: 0.50
- Clarity: +0.30 (acceptance criteria, tech stack)
- Context: +0.10 (similar task in memory)
- Risk: -0.10 (touching auth)
- Spec adj: -0.10 (spec < 80)
- Security mult: ×0.7
- Final: 0.49
AEO: [Invokes aeo-escalation]
AEO: ❌ CONFIDENCE BELOW THRESHOLD
Confidence: 0.49
Concerns:
• [Spec] Missing security requirements
• [Risk] Touching authentication
• [Context] Need email service details
Options:
1. Add security requirements and proceed
2. Provide email service details
3. Break into smaller tasks
Please clarify (1-3):
User: 2
User: We use Resend for emails, API key in .env
AEO: Recalculating confidence with added context...
Final: 0.71
AEO: ⚡ ADVISORY - Confidence: 0.71
[Acceptance criteria defined]
[Tech stack: Node.js, bcrypt, jwt]
[Email: Resend, API key in .env]
Proceeding with implementation. I'll pause if issues arise.
[Implementation proceeds]
Special Cases
Repeated Tasks
If same task done successfully 3+ times:
- Add +0.10 to confidence
- Flag as "routine - can be autonomous"
High-Risk Areas
Never reach full autonomy for:
- Payment processing (max 0.70)
- Authentication changes (max 0.75)
- Database migrations (max 0.80)
- Production deployments (max 0.85)
Emergency Rollbacks
If task causes test failures or errors:
- Immediately rollback
- Write failure signal
- Reduce confidence by 0.20
- Require human review before retry