edd
EDD (Eval-Driven Development) Framework v2.64
Eval-Driven Development is a quality-first development pattern that enforces a define-before-implement workflow with structured evaluations.
v2.88 Key Changes (MODEL-AGNOSTIC)
- Model-agnostic: Uses the model configured in ~/.claude/settings.json or CLI/env vars
- No flags required: Works with the configured default model
- Flexible: Works with GLM-5, Claude, Minimax, or any configured model
- Settings-driven: Model selection via ANTHROPIC_DEFAULT_*_MODEL env vars
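To see which models the environment currently selects, something like the following sketch may help. It only lists environment variables whose names match the ANTHROPIC_DEFAULT_*_MODEL pattern mentioned above. The helper name `configured_models` is hypothetical, and the sketch does not read ~/.claude/settings.json, whose layout is not described here.

```python
import fnmatch
import os

# Pattern taken from the list above; the matching logic is an illustrative assumption.
MODEL_VAR_PATTERN = "ANTHROPIC_DEFAULT_*_MODEL"


def configured_models(environ=None):
    """Return {variable name: model} for every non-empty variable matching the pattern."""
    env = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(env.items())
        if fnmatch.fnmatchcase(name, MODEL_VAR_PATTERN) and value
    }
```

Calling `configured_models()` with no argument inspects the current process environment. Passing a plain dict makes the function easy to try without touching real settings.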
What is EDD?
EDD provides a systematic approach to software development with three phases:
- DEFINE - Create structured eval specifications using TEMPLATE.md
- IMPLEMENT - Build features according to eval definitions
- VERIFY - Validate implementation against eval criteria
Check Types
| Prefix | Type | Purpose |
|---|---|---|
| CC- | Capability Checks | Feature capabilities and functionality |
| BC- | Behavior Checks | Expected behaviors and responses |
| NFC- | Non-Functional Checks | Performance, security, maintainability |
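The prefix convention can be checked mechanically. The sketch below maps a check ID to its type using the three prefixes from the table. The ID shape (prefix, hyphen, number) is inferred from the examples in this document, and `check_type` is a hypothetical helper, not part of edd.sh.

```python
import re

# Prefixes and type names come from the Check Types table.
CHECK_TYPES = {
    "CC": "Capability Checks",
    "BC": "Behavior Checks",
    "NFC": "Non-Functional Checks",
}

# Assumed ID shape: a known prefix, a hyphen, and a number, e.g. "NFC-2".
CHECK_ID = re.compile(r"^(NFC|CC|BC)-(\d+)$")


def check_type(check_id):
    """Return the check type name for an ID such as "CC-1"; raise ValueError otherwise."""
    match = CHECK_ID.match(check_id.strip())
    if match is None:
        raise ValueError(f"not a valid check ID: {check_id!r}")
    return CHECK_TYPES[match.group(1)]
```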
Usage
# Invoke EDD workflow
/edd "Define memory-search feature"
# CLI script (if available)
ralph edd define memory-search
ralph edd check memory-search
Components
- TEMPLATE.md: Template for creating eval definitions
- edd.sh: CLI script for eval management
- /edd skill: Skill invocation from Claude Code
- ~/.claude/evals/: Directory for eval definitions
Template Structure
Each eval definition includes:
- Capability Checks (CC-) - What the feature can do
- Behavior Checks (BC-) - How the feature behaves
- Non-Functional Checks (NFC-) - Performance, security, etc.
- Implementation Notes - Technical guidance
- Verification Evidence - Test results
Example: memory-search.md
# Memory Search Eval
**Status**: DRAFT
**Created**: 2026-01-30
## Capability Checks
- [ ] CC-1: Search across semantic memory
- [ ] CC-2: Support filtering by type
## Behavior Checks
- [ ] BC-1: Returns ranked results
- [ ] BC-2: Handles empty queries gracefully
## Non-Functional Checks
- [ ] NFC-1: Search completes in <2s
- [ ] NFC-2: Memory usage <100MB
## Implementation Notes
- Use parallel search for performance
- Cache frequent queries
## Verification Evidence
- Test results attached
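An eval file in this shape is easy to summarize programmatically. The following is a minimal sketch that counts completed checks per prefix, assuming a finished check is marked with `[x]` (the example above only shows unchecked boxes). `parse_checks`, `progress`, and the sample text with one ticked box are illustrative, not part of the framework.

```python
import re
from collections import defaultdict

# Matches lines such as "- [ ] CC-1: Search across semantic memory" or "- [x] BC-2: ...".
CHECK_LINE = re.compile(r"^- \[( |x|X)\] (CC|BC|NFC)-(\d+): (.+)$")


def parse_checks(markdown):
    """Return a list of (check_id, done, description) tuples found in an eval file."""
    checks = []
    for line in markdown.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match:
            mark, prefix, number, description = match.groups()
            checks.append((f"{prefix}-{number}", mark.lower() == "x", description))
    return checks


def progress(checks):
    """Return {prefix: (done, total)} for the parsed checks."""
    totals = defaultdict(lambda: [0, 0])
    for check_id, done, _ in checks:
        prefix = check_id.split("-")[0]
        totals[prefix][0] += int(done)
        totals[prefix][1] += 1
    return {prefix: tuple(counts) for prefix, counts in totals.items()}


# Made-up sample: CC-2 is ticked only to show a non-zero count.
SAMPLE = """\
## Capability Checks
- [ ] CC-1: Search across semantic memory
- [x] CC-2: Support filtering by type
## Non-Functional Checks
- [ ] NFC-1: Search completes in <2s
"""

print(progress(parse_checks(SAMPLE)))
```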
Integration with Orchestrator
EDD integrates with the orchestrator workflow to ensure quality-first development:
- Clarify phase - Define evals
- Plan phase - Review eval requirements
- Implement phase - Build to eval specs
- Validate phase - Verify against evals
Swarm Mode Integration (v2.81.1)
The EDD framework now supports swarm mode, which evaluates the three check types in parallel.
Auto-Spawn Configuration
When invoked via /edd, the framework automatically spawns a specialized evaluation team:
Task:
  subagent_type: "general-purpose"
  model: "sonnet"
  team_name: "edd-evaluation-team"
  name: "edd-coordinator"
  mode: "delegate"
  run_in_background: true
  prompt: |
    Execute Eval-Driven Development workflow for: $ARGUMENTS
    EDD Pattern:
    1. DEFINE - Create structured eval specifications
    2. DISTRIBUTE - Assign check types to specialists
    3. VERIFY - Validate against eval criteria
    4. CONSOLIDATE - Merge findings from all evaluators
Team Composition
| Role | Purpose | Specialization |
|---|---|---|
| Coordinator | EDD workflow orchestration | Manages eval lifecycle, consolidates findings |
| Teammate 1 | Capability Checks specialist | CC- prefix: feature capabilities and functionality |
| Teammate 2 | Behavior Checks specialist | BC- prefix: expected behaviors and responses |
| Teammate 3 | Non-Functional Checks specialist | NFC- prefix: performance, security, maintainability |
Swarm Mode Workflow
User invokes: /edd "Define memory-search feature"
1. Team "edd-evaluation-team" created
2. Coordinator (edd-coordinator) receives task
3. 3 Teammates spawned with check-type specializations
4. Eval definition distributed:
- Teammate 1 → Capability Checks (CC-)
- Teammate 2 → Behavior Checks (BC-)
- Teammate 3 → Non-Functional Checks (NFC-)
5. Teammates work in parallel (background execution)
6. Coordinator monitors progress and gathers results
7. Findings consolidated into single eval specification
8. Final eval document returned
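The distribute, work-in-parallel, and consolidate steps above follow a familiar fan-out/fan-in shape. The sketch below imitates it with a thread pool, purely as an analogy: the real framework spawns agent teammates, not threads, and `draft_checks` is a hypothetical stand-in that returns placeholder text instead of real findings.

```python
from concurrent.futures import ThreadPoolExecutor

# Teammate-to-prefix assignment from the workflow above.
ASSIGNMENTS = {
    "teammate-1": "CC",
    "teammate-2": "BC",
    "teammate-3": "NFC",
}


def draft_checks(teammate, prefix, feature):
    """Hypothetical stand-in for a teammate: returns placeholder checks for one prefix."""
    return [f"{prefix}-{n}: TODO define check for {feature} ({teammate})" for n in (1, 2)]


def run_swarm(feature):
    """Distribute check types, run teammates in parallel, and consolidate the results."""
    with ThreadPoolExecutor(max_workers=len(ASSIGNMENTS)) as pool:
        futures = {
            prefix: pool.submit(draft_checks, teammate, prefix, feature)
            for teammate, prefix in ASSIGNMENTS.items()
        }
        # Coordinator step: gather each teammate's findings, keyed by prefix.
        return {prefix: future.result() for prefix, future in futures.items()}


spec = run_swarm("memory-search")
for prefix, checks in spec.items():
    print(prefix, checks)
```

Keying the consolidated result by prefix keeps the three check types separate, which mirrors the sections of the eval template.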
Parallel Evaluation Pattern
Each teammate focuses on their check type:
# Teammate 1: Capability Checks
CC-1: Feature can perform X
CC-2: Feature supports Y configuration
CC-3: Feature integrates with Z system
# Teammate 2: Behavior Checks
BC-1: Feature handles error case A gracefully
BC-2: Feature returns expected response for B
BC-3: Feature maintains state across C
# Teammate 3: Non-Functional Checks
NFC-1: Response time < 100ms
NFC-2: Memory usage < 50MB
NFC-3: Security vulnerability scan passes
Communication Between Teammates
Teammates use the built-in mailbox system:
# Teammate sends finding to coordinator
SendMessage:
  type: "message"
  recipient: "edd-coordinator"
  content: "CC-3 defined: Feature integrates with auth system via OAuth2"
Task List Coordination
All teammates share a unified task list:
# Location: ~/.claude/tasks/edd-evaluation-team/tasks.json
# Example tasks:
[
{"id": "1", "subject": "Define Capability Checks", "owner": "teammate-1"},
{"id": "2", "subject": "Define Behavior Checks", "owner": "teammate-2"},
{"id": "3", "subject": "Define Non-Functional Checks", "owner": "teammate-3"},
{"id": "4", "subject": "Consolidate eval specification", "owner": "edd-coordinator"}
]
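For a quick view of who owns what, the task list can be grouped by owner. This sketch embeds the example entries above so it runs on its own. It assumes only the three fields shown (id, subject, owner); the real tasks.json may carry more, and `tasks_by_owner` is a hypothetical helper.

```python
import json

# Same entries as the example task list, embedded so the sketch is self-contained.
TASKS_JSON = """
[
  {"id": "1", "subject": "Define Capability Checks", "owner": "teammate-1"},
  {"id": "2", "subject": "Define Behavior Checks", "owner": "teammate-2"},
  {"id": "3", "subject": "Define Non-Functional Checks", "owner": "teammate-3"},
  {"id": "4", "subject": "Consolidate eval specification", "owner": "edd-coordinator"}
]
"""


def tasks_by_owner(raw):
    """Group task subjects by owner, preserving the order of the task list."""
    grouped = {}
    for task in json.loads(raw):
        grouped.setdefault(task["owner"], []).append(task["subject"])
    return grouped


for owner, subjects in tasks_by_owner(TASKS_JSON).items():
    print(f"{owner}: {', '.join(subjects)}")
```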
Manual Override
To disable swarm mode:
/edd "Define feature X" --no-swarm
Output Location
# Evals saved to ~/.claude/evals/
ls ~/.claude/evals/
# View last eval
cat ~/.claude/evals/latest.md
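If latest.md is missing or stale, the newest eval can also be found by modification time. The sketch below is one way to do that, assuming eval definitions are .md files stored directly in the evals directory. `newest_eval` is a hypothetical helper and does not depend on how latest.md is maintained.

```python
from pathlib import Path


def newest_eval(evals_dir="~/.claude/evals"):
    """Return the most recently modified .md file in evals_dir, or None if there is none."""
    candidates = [p for p in Path(evals_dir).expanduser().glob("*.md") if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```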
Testing
Test suite: tests/test_v264_edd_framework.bats (33 tests)
Run tests:
bats tests/test_v264_edd_framework.bats
Swarm Mode Tests
Additional tests for swarm mode integration:
# Test swarm team creation
tests/edd/test-swarm-team-creation.sh
# Test parallel evaluation
tests/edd/test-parallel-evaluation.sh
Status
Current: Framework defined with swarm mode integration (v2.81.1)
Note: TEMPLATE.md and evals directory structure ready for use
Version: v2.64 | Status: DRAFT | Tests: 33 passing
Action Reporting (v2.93.0)
This skill automatically generates complete reports for traceability:
Automatic Report
When this skill completes, the following are generated automatically:
- In the Claude conversation: Visible results
- In the repository: docs/actions/edd/{timestamp}.md
- JSON metadata: .claude/metadata/actions/edd/{timestamp}.json
Report Contents
Each report includes:
- ✅ Summary: Description of the executed task
- ✅ Execution Details: Duration, iterations, modified files
- ✅ Results: Errors found, recommendations
- ✅ Next Steps: Suggested next actions
View Previous Reports
# List all reports for this skill
ls -lt docs/actions/edd/
# View the most recent report
cat "$(ls -t docs/actions/edd/*.md | head -1)"
# Find failed reports
grep -l "Status: FAILED" docs/actions/edd/*.md
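The same failed-report search can be done from a script. This is a minimal sketch, assuming reports are .md files in docs/actions/edd/ and that a failed run contains the literal text "Status: FAILED", as in the grep command above. `failed_reports` is a hypothetical helper, not part of action-report-lib.sh.

```python
from pathlib import Path

# Marker string taken from the grep command above.
FAILED_MARKER = "Status: FAILED"


def failed_reports(reports_dir="docs/actions/edd"):
    """Return report files containing the failure marker, newest first."""
    reports = sorted(
        (p for p in Path(reports_dir).glob("*.md") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return [
        p for p in reports
        if FAILED_MARKER in p.read_text(encoding="utf-8", errors="replace")
    ]
```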
Manual Generation (Optional)
source .claude/lib/action-report-lib.sh
start_action_report "edd" "Task description"
# ... execution ...
complete_action_report "success" "Summary" "Recommendations"
System References
- Action Reports System - Complete documentation
- action-report-lib.sh - Helper library
- action-report-generator.sh - Generator