a-evolve
A-Evolve: Agentic Evolution Skill
Apply the Solve → Observe → Evolve → Gate → Reload methodology from A-Evolve to iteratively improve agent performance. This skill is prompt-based — no external dependencies, no harness changes. You analyze failures, propose workspace mutations, and generate durable artifacts (skills, prompt patches, knowledge entries) that the agent can load in future runs.
Core Loop
When asked to evolve or improve agent performance, follow this 5-step loop:
1. Solve (Collect Evidence)
Gather the agent's execution artifacts. Ask the user for or locate:
- Run logs, error traces, or experiment outputs
- Pass/fail results per task
- Metric values (accuracy, reward, success rate)
- Any existing session files from previous runs
If inside AutoResearchClaw, look at:
artifacts/rc-*/— experiment outputs, charts, reviewsevolve.logor stage-specific logsreviews.md— peer review feedback- Sentinel watchdog reports
2. Observe (Diagnose)
Analyze the collected evidence to produce structured observations:
For each failed or underperforming task, identify:
- Error category: code bug, timeout, wrong approach, missing knowledge, API misuse, hallucinated reference, prompt ambiguity, etc.
- Root cause: What specifically went wrong and why
- Frequency: Is this a one-off or a recurring pattern across tasks?
- Severity: blocking (pipeline crash) / degrading (wrong result) / cosmetic (formatting issue)
Write observations as a structured list:
## Observations (Batch N)
### OBS-1: [Category] Short description
- Tasks affected: task_001, task_005, task_012
- Root cause: ...
- Frequency: 3/50 tasks (6%)
- Severity: degrading
### OBS-2: ...
3. Evolve (Propose Mutations)
Based on observations, propose one or more of these mutation types:
A. Generate a Skill (for recurring patterns, frequency ≥ 3)
Write a new SKILL.md file that teaches the agent how to handle this
pattern. A good evolved skill:
- Targets a specific failure category, not generic advice
- Contains concrete steps the agent should follow
- Includes a "when to apply" trigger condition
- Is short (under 100 lines) and self-contained
Example — if the agent keeps failing at API pagination:
---
name: api-pagination-handler
description: >
Handle paginated API responses correctly. Use when making API calls
that may return partial results, or when results seem truncated.
---
When calling any API that supports pagination:
1. Check response for pagination indicators: `next_page`, `offset`,
`has_more`, `cursor`, or truncated result counts.
2. If paginated, loop until all pages are collected.
3. Concatenate results before processing.
4. Set a max-page safety limit (default: 20) to prevent infinite loops.
5. Log total items collected vs expected count if available.
B. Patch the System Prompt (for prompt ambiguity or missing guidance)
Write a short addendum to the system prompt that addresses the gap. Keep patches minimal — one paragraph per issue. Format:
## Prompt Patch: [Issue]
Append to system prompt:
> When [specific situation], always [specific action] because [reason].
C. Add a Knowledge Entry (for factual gaps or learned heuristics)
Record a reusable insight as a knowledge entry:
{
"id": "know-001",
"category": "experiment_design",
"insight": "Synthetic benchmarks with <100 samples produce high-variance results. Always use ≥500 samples or report confidence intervals.",
"source": "observation OBS-3 from batch 2",
"confidence": 0.85
}
D. Do Nothing (if observation is a one-off, severity is cosmetic, or the fix would be too broad / risky)
4. Gate (Validate)
Before accepting any mutation, check:
- Specificity: Does it target the observed failure without being so broad it could cause regressions elsewhere?
- Testability: Could you verify this mutation helps by re-running the failed tasks?
- Blast radius: How much of the agent's behavior does this change? Prefer small, targeted mutations over large rewrites.
- Consistency: Does it contradict existing skills or prompt guidance?
If a mutation fails the gate, either refine it or discard it. Explain your reasoning to the user.
5. Reload (Apply and Record)
Present the accepted mutations to the user. For each:
- State what changed and why
- Show the artifact (skill file, prompt patch, knowledge entry)
- Suggest where to place it in the project
For AutoResearchClaw projects, recommended locations:
| Artifact | Location |
|---|---|
| Evolved skill | .claude/skills/evolved/<skill-name>/SKILL.md |
| Prompt patch | Append to prompts.default.yaml or custom prompts file |
| Knowledge entry | docs/kb/evolved_knowledge/<id>.json |
| Observation log | evolution/observations/<batch>.md |
Keep a running version log so the user can track what evolved and when:
## Evolution Log
- evo-1 (2026-03-30): Generated `api-pagination-handler` skill from OBS-1
- evo-2 (2026-03-30): Prompt patch for citation format from OBS-4
Usage with AutoResearchClaw
This skill maps to ARC's pipeline stages:
| ARC Stage | Evolution Role |
|---|---|
| 12 EXPERIMENT_RUN | Source of Solve artifacts |
| 13 ITERATIVE_REFINE | Main Observe + Evolve trigger point |
| 15 RESEARCH_DECISION | Natural Gate — PROCEED = accept, REFINE = retry |
| 18 PEER_REVIEW | Additional Observe signal for writing quality |
When the user says "evolve my research pipeline" or similar:
- Ask which run to analyze (or find the latest
artifacts/rc-*/) - Run the Observe step on experiment outputs + review feedback
- Propose mutations targeting the weakest pipeline stages
- Generate skill files that ARC can load via
.claude/skills/
Anti-Patterns
Do NOT:
- Generate vague, generic skills ("always be careful", "check your work")
- Propose mutations for one-off errors that won't recur
- Rewrite the entire system prompt — patch it surgically
- Generate more than 3 skills per evolution cycle (quality over quantity)
- Mutate tool code unless the user explicitly asks for it
Relationship to MetaClaw
If AutoResearchClaw has MetaClaw enabled (metaclaw_bridge.enabled: true),
evolved skills from this process can be placed in ~/.metaclaw/skills/arc-*/
so MetaClaw injects them into future runs automatically. The two systems
are complementary:
- A-Evolve skill: Deep, targeted mutation from structured observation
- MetaClaw lesson: Broad pattern captured from pipeline warnings/errors
Both can coexist. Skills generated here are higher-precision; MetaClaw lessons are higher-recall.
More from aiming-lab/autoresearchclaw
scientific-writing
Academic manuscript writing with IMRAD structure, citation formatting, and reporting guidelines. Use when drafting or revising research papers.
10hypothesis-formulation
Structured scientific hypothesis generation from observations. Use when formulating testable hypotheses, competing explanations, or experimental predictions.
9scientific-visualization
Publication-ready scientific figure design with matplotlib and seaborn. Use when creating journal submission figures with proper formatting, accessibility, and statistical annotations.
9literature-search
Systematic literature review methodology including search strategy, screening, and synthesis. Use when conducting literature reviews or writing background sections.
9statistical-reporting
Statistical test selection, assumption checking, and APA-formatted reporting. Use when analyzing experimental results or writing results sections.
9chemistry-rdkit
Computational chemistry with RDKit for molecular analysis, descriptors, fingerprints, and substructure search. Use when working with SMILES, drug discovery, or cheminformatics tasks.
8