Self-Improving Agent Builder
Purpose
Run a closed-loop improvement cycle on any goal-seeking agent implementation:
EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)
Each iteration measures L1-L12 progressive test scores, identifies failures
with error_analyzer.py, runs a research step with hypothesis/evidence/
counter-arguments, applies targeted fixes, and gates promotion through
regression checks.
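The cycle above can be sketched as a small driver that receives each phase as a callable. This is an illustration of the control flow only, not the code in runner.py: `run_loop`, `IterationRecord`, and all six callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]  # level name -> score (scale is up to the caller)


@dataclass
class IterationRecord:
    """Hypothetical per-iteration summary; not a type from amplihack."""
    iteration: int
    baseline: Scores
    candidate: Scores
    committed: bool


def run_loop(
    evaluate: Callable[[], Scores],
    analyze: Callable[[Scores], List[str]],
    research: Callable[[List[str]], List[str]],
    improve: Callable[[List[str]], None],
    decide: Callable[[Scores, Scores], bool],
    revert: Callable[[], None],
    iterations: int,
) -> List[IterationRecord]:
    """Run EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE."""
    history: List[IterationRecord] = []
    for i in range(1, iterations + 1):
        baseline = evaluate()                    # EVAL
        failures = analyze(baseline)             # ANALYZE
        approved = research(failures)            # RESEARCH
        improve(approved)                        # IMPROVE
        candidate = evaluate()                   # RE-EVAL
        committed = decide(baseline, candidate)  # DECIDE
        if not committed:
            revert()                             # undo this iteration's changes
        history.append(IterationRecord(i, baseline, candidate, committed))
    return history
```

Passing the phases in as callables keeps the sketch runnable with stubs, so the control flow can be exercised without a real agent.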
When I Activate
- "improve agent" or "self-improving loop"
- "agent eval loop" or "run improvement cycle"
- "benchmark agents" or "compare SDK implementations"
- "iterate on agent scores" or "fix agent regressions"
Quick Start
User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
Reports per-iteration scores, net improvement, and commits/reverts.
Runner Script
The self-improvement loop is implemented as a Python CLI:
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3
# Full options
python -m amplihack.eval.self_improve.runner \
    --sdk mini \
    --iterations 5 \
    --improvement-threshold 2.0 \
    --regression-tolerance 5.0 \
    --levels L1 L2 L3 L4 L5 L6 \
    --output-dir ./eval_results/self_improve \
    --dry-run  # evaluate only, don't apply changes
Source: src/amplihack/eval/self_improve/runner.py
The Loop (6 Phases per Iteration)
Phase 1: EVAL
Run the progressive test suite on the current agent implementation. The suite spans L1-L12; the example below runs L1-L6, which is also the default set of levels.
Execution:
python -m amplihack.eval.progressive_test_suite \
    --agent-name <agent_name> \
    --output-dir <output_dir>/iteration_N/eval \
    --levels L1 L2 L3 L4 L5 L6
Output: Per-level scores and overall baseline.
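When scripting this phase, the command line can be assembled per iteration. The helper below is a sketch: `build_eval_command` is a hypothetical name, and it uses only the flags shown in the command above.

```python
from pathlib import Path
from typing import List, Sequence


def build_eval_command(
    agent_name: str,
    output_dir: str,
    iteration: int,
    levels: Sequence[str],
) -> List[str]:
    """Build the argv for one EVAL run (hypothetical helper)."""
    # Mirrors the <output_dir>/iteration_N/eval layout used above.
    eval_dir = Path(output_dir) / f"iteration_{iteration}" / "eval"
    return [
        "python", "-m", "amplihack.eval.progressive_test_suite",
        "--agent-name", agent_name,
        "--output-dir", eval_dir.as_posix(),
        "--levels", *levels,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.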
Phase 2: ANALYZE
Classify failures with error_analyzer.py, which maps each failed question to a
failure mode in its taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.)
and to the specific code component responsible.
from amplihack.eval.self_improve import analyze_eval_results
analyses = analyze_eval_results(level_results, score_threshold=0.6)
# Each ErrorAnalysis maps to:
# failure_mode -> affected_component -> prompt_template
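One way to use that mapping is to count failures per component, so the most frequently implicated component is examined first. The sketch below uses `FailureRecord` as a hypothetical stand-in for ErrorAnalysis; the real class's attribute names are not shown in this document and may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FailureRecord:
    """Hypothetical stand-in for ErrorAnalysis; real fields may differ."""
    failure_mode: str
    affected_component: str
    prompt_template: str


def rank_components(records: Iterable[FailureRecord]) -> List[Tuple[str, int]]:
    """Return (component, failure_count) pairs, most failures first."""
    counts = Counter(r.affected_component for r in records)
    return counts.most_common()
```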
Phase 3: RESEARCH (New)
The critical thinking step that prevents blind changes. For each proposed improvement:
- State hypothesis: What specific change will fix the failure?
- Gather evidence: From eval results, failure patterns, baseline scores
- Consider counter-arguments: What could go wrong? Risk of regression?
- Make decision: Apply, skip, or defer with full reasoning
Decisions are logged in research_decisions.json for auditability.
Decision criteria:
- Apply: Clear failure pattern + prompt template available + low score
- Skip: Score above 50% (likely stochastic variation)
- Defer: Ambiguous evidence, needs more data
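The criteria above can be sketched as a small decision function plus a JSON logger. Everything here is an assumption for illustration: `ResearchDecision`, `decide_action`, and `log_decisions` are hypothetical names, scores are assumed to lie in the range 0 to 1, and the actual schema of research_decisions.json may differ.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class ResearchDecision:
    """Hypothetical record of one research step."""
    hypothesis: str
    evidence: List[str]
    counter_arguments: List[str]
    decision: str  # "apply", "skip", or "defer"
    reasoning: str


def decide_action(
    score: float,
    has_clear_pattern: bool,
    prompt_template: Optional[str],
) -> str:
    """Apply the three criteria; score is assumed to be in [0, 1]."""
    if score > 0.5:
        return "skip"   # above 50%: likely stochastic variation
    if has_clear_pattern and prompt_template:
        return "apply"  # clear pattern + template available + low score
    return "defer"      # ambiguous evidence, needs more data


def log_decisions(decisions: List[ResearchDecision], output_dir: str) -> Path:
    """Write all decisions to research_decisions.json for auditing."""
    path = Path(output_dir) / "research_decisions.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps([asdict(d) for d in decisions], indent=2))
    return path
```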
Phase 4: IMPROVE
Apply the improvements approved by the research step. Priority order:
- Prompt template improvements (safest, highest impact)
- Retrieval strategy adjustments
- Code logic fixes (most risky, needs careful review)
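The priority order can be encoded as a sortable enum so approved changes are applied safest-first. This is a sketch: `ChangeKind`, `Improvement`, and `order_improvements` are hypothetical names, not part of the amplihack API.

```python
from enum import IntEnum
from typing import Iterable, List, NamedTuple


class ChangeKind(IntEnum):
    """Lower value = applied earlier (safer change)."""
    PROMPT_TEMPLATE = 1
    RETRIEVAL_STRATEGY = 2
    CODE_LOGIC = 3


class Improvement(NamedTuple):
    kind: ChangeKind
    description: str


def order_improvements(items: Iterable[Improvement]) -> List[Improvement]:
    """Sort by priority; sorted() is stable, so ties keep their input order."""
    return sorted(items, key=lambda item: item.kind)
```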
Phase 5: RE-EVAL
Re-run the same eval suite after applying fixes to measure impact.
Phase 6: DECIDE
Promotion gate:
- Net improvement >= +2% overall score: COMMIT the changes
- Any single level regression > 5%: REVERT all changes
- Otherwise: COMMIT with marginal improvement note
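The gate can be sketched as a pure function over per-level scores. Several points are assumptions rather than documented behavior: `promotion_gate` is a hypothetical name, scores are taken to be percentages, the overall score is taken to be the mean of the per-level scores, and the regression check is evaluated before the improvement check so that a large single-level drop always reverts.

```python
from statistics import fmean
from typing import Dict


def promotion_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
) -> str:
    """Return "commit", "revert", or "commit_marginal"."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("baseline and candidate share no levels")

    # Largest per-level drop; a negative value means every level improved.
    worst_drop = max(baseline[lv] - candidate[lv] for lv in shared)
    if worst_drop > regression_tolerance:
        return "revert"

    # Assumption: overall score is the unweighted mean across levels.
    net_gain = fmean([candidate[lv] for lv in shared]) - fmean(
        [baseline[lv] for lv in shared]
    )
    if net_gain >= improvement_threshold:
        return "commit"
    return "commit_marginal"
```

The default arguments match the improvement_threshold and regression_tolerance values used elsewhere in this document.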
Configuration
| Parameter | Default | Description |
|---|---|---|
| sdk_type | mini | Which SDK: mini/claude/copilot/microsoft |
| max_iterations | 5 | Maximum improvement iterations |
| improvement_threshold | 2.0 | Minimum % improvement to commit |
| regression_tolerance | 5.0 | Maximum % regression on any level |
| levels | L1-L6 | Which levels to evaluate |
| output_dir | ./eval_results/self_improve | Results directory |
| dry_run | false | Evaluate only, don't apply changes |
Programmatic Usage
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig
config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)
result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
4-Way Benchmark Mode
Compare all SDK implementations side by side:
User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs eval suite on mini, claude, copilot, microsoft
Generates comparison table with scores, LOC, and coverage.
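A comparison table of this kind could be rendered from per-SDK results as follows. This is a sketch only: `format_comparison` is a hypothetical helper, and the keys `score`, `loc`, and `coverage` and their units are assumptions, not the skill's actual output schema.

```python
from typing import List, Mapping


def format_comparison(results: Mapping[str, Mapping[str, float]]) -> str:
    """Render one pipe-table row per SDK, in the mapping's iteration order."""
    lines: List[str] = [
        "| SDK | Score (%) | LOC | Coverage (%) |",
        "|---|---|---|---|",
    ]
    for sdk, row in results.items():
        lines.append(
            f"| {sdk} | {row['score']:.1f} | {int(row['loc'])} "
            f"| {row['coverage']:.1f} |"
        )
    return "\n".join(lines)
```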
Integration Points
- src/amplihack/eval/self_improve/runner.py: Self-improvement loop runner
- src/amplihack/eval/self_improve/error_analyzer.py: Failure classification
- src/amplihack/eval/progressive_test_suite.py: L1-L12 eval runner
- src/amplihack/agents/goal_seeking/sdk_adapters/: All 4 SDK implementations
- src/amplihack/eval/metacognition_grader.py: Advanced eval dimensions
- src/amplihack/eval/teaching_session.py: L7 teaching quality eval