# os-improvement-report

## Dependencies

This skill requires Python 3.8+, pandas, and matplotlib.

To install this skill's dependencies:

```bash
pip-compile ./requirements.in
pip install -r ./requirements.txt
```

See `./requirements.txt` for the dependency lockfile.
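A quick way to verify the environment before running the report (a generic sanity probe, not part of the skill itself):

```python
# Sanity check: confirm the interpreter and the libraries this skill needs.
import importlib
import sys

assert sys.version_info >= (3, 8), "this skill requires Python 3.8+"
for module in ("pandas", "matplotlib"):
    importlib.import_module(module)  # raises ImportError if a dependency is missing
print("dependencies OK")
```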
## Loop Progress Report

Visual and text reporting on the agentic loop improvement cycle, across any plugin that maintains an `improvement-ledger.md` and a per-skill `results.tsv`.

The reference output is the autoresearch progress chart: green KEEP dots on a timeline, gray DISCARD dots, a running-best step line, and annotations showing what each improvement was. This skill produces the same chart for agentic-os and exploration-cycle-plugin improvement cycles.
## What It Reads

| Source | Content |
|---|---|
| `context/memory/improvement-ledger.md` | Eval score progression (Section 1), survey-to-action trace (Section 2), north star metric (Section 3) |
| `.agents/skills/*/evals/results.tsv` | Per-skill detailed eval score history (a supplement to the ledger) |

The improvement ledger is the primary source. It is written at every loop close (Stage 4.7 of os-improvement-loop). See `references/improvement-ledger-spec.md` for the format.
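For orientation, here is a minimal sketch of how Section 1 rows could be read with pandas. It assumes the ledger stores them as a markdown table under a `## Section 1` heading, and the column names (`cycle`, `skill`, `score`, `verdict`, `change`) are illustrative; the authoritative format is `references/improvement-ledger-spec.md`.

```python
import re
import pandas as pd

def read_section1(ledger_path: str) -> pd.DataFrame:
    """Parse the assumed Section 1 markdown table into a DataFrame."""
    text = open(ledger_path, encoding="utf-8").read()
    # Assumed section headings; check the ledger spec for the real ones.
    section = text.split("## Section 1")[1].split("## Section 2")[0]
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in section.splitlines()
        if line.strip().startswith("|")
        and not re.fullmatch(r"\|[\s\-:|]+\|", line.strip())  # skip |---|---| separators
    ]
    if not rows:
        return pd.DataFrame()
    header, *data = rows
    return pd.DataFrame(data, columns=header)
```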
## What It Produces

| Output | Description |
|---|---|
| `context/memory/reports/progress_YYYYMMDD_HHMM.png` | Progress chart: KEEP/DISCARD timeline, running-best step line, change annotations |
| `context/memory/reports/summary_YYYYMMDD_HHMM.md` | Text summary: baseline vs. best, top hits by delta, survey effectiveness, north star trend |
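The timestamped naming can be reproduced in a few lines; this is only a sketch of the scheme, and the actual naming is owned by `generate_report.py`.

```python
from datetime import datetime
from pathlib import Path

# Mirror the progress_/summary_ naming scheme from the table above.
stamp = datetime.now().strftime("%Y%m%d_%H%M")
reports_dir = Path("context/memory/reports")
chart_path = reports_dir / f"progress_{stamp}.png"
summary_path = reports_dir / f"summary_{stamp}.md"
```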
## Execution Flow

### Phase 1: Check data availability

```bash
LEDGER="${CLAUDE_PROJECT_DIR}/context/memory/improvement-ledger.md"
if [ ! -f "$LEDGER" ]; then
  echo "No improvement ledger found. Run at least one full loop cycle first."
  echo "The ledger is created at Stage 4.7 of os-improvement-loop."
  exit 0
fi
wc -l "$LEDGER"
```
If the ledger exists but the Section 1 table is empty (no rows beyond the header), inform the user that no cycles have been completed yet and that the first loop run will establish the baseline. Do not run the report script on an empty ledger; it will produce an empty chart.
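That emptiness check can be expressed in Python, reusing the assumed table layout from the parsing sketch above (a header plus separator and nothing else means no completed cycles):

```python
from pathlib import Path

ledger = Path("context/memory/improvement-ledger.md")
section1 = ledger.read_text(encoding="utf-8").split("## Section 1")[1].split("## Section 2")[0]
table_lines = [l for l in section1.splitlines() if l.strip().startswith("|")]
if len(table_lines) <= 2:  # header row + separator row only
    print("Ledger exists but Section 1 has no completed cycles yet; skipping report.")
```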
### Phase 2: Run the report

```bash
PLUGIN_DIR="${CLAUDE_PLUGIN_ROOT:-$(pwd)/.agents/skills/agent-agentic-os}"
PROJECT_DIR="${CLAUDE_PROJECT_DIR:-$(pwd)}"
python3 "${PLUGIN_DIR}/skills/os-improvement-report/scripts/generate_report.py" \
  --project-dir "$PROJECT_DIR" \
  --plugin-dir "$PLUGIN_DIR"
# Optional: append --skill SESSION-MEMORY-MANAGER to filter to one skill
```
The script exits 0 on success and prints the chart path and text summary to stdout.
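The script can also be driven from Python. A sketch with the same directory fallbacks as the bash snippet above; it captures stdout, which carries the chart path and text summary:

```python
import os
import subprocess
import sys

plugin_dir = os.environ.get(
    "CLAUDE_PLUGIN_ROOT",
    os.path.join(os.getcwd(), ".agents/skills/agent-agentic-os"),
)
project_dir = os.environ.get("CLAUDE_PROJECT_DIR", os.getcwd())

result = subprocess.run(
    [
        sys.executable,
        os.path.join(plugin_dir, "skills/os-improvement-report/scripts/generate_report.py"),
        "--project-dir", project_dir,
        "--plugin-dir", plugin_dir,
    ],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print(result.stdout)  # chart path followed by the text summary
else:
    sys.stderr.write(result.stderr)
```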
### Phase 3: Surface the output

After the script completes:

- Report the chart path to the user: `context/memory/reports/progress_[TIMESTAMP].png`
- Print the text summary inline (it is concise: top hits table, north star trend).
- Ask: "Would you like me to open the chart image or show the per-skill detail?"
### Phase 4: Cross-plugin reporting (optional)

If the user wants improvement tracking across both agent-agentic-os AND exploration-cycle-plugin, run the report twice (once per plugin), passing each plugin's project and plugin directories:
```bash
# SCRIPT points at generate_report.py; the *_PROJECT and *_PLUGIN
# variables hold each plugin's project and plugin directories.

# agentic-os cycles
python3 "$SCRIPT" --project-dir "$AGENTIC_OS_PROJECT" --plugin-dir "$AGENTIC_OS_PLUGIN"

# exploration-cycle cycles
python3 "$SCRIPT" --project-dir "$EXPLORATION_PROJECT" --plugin-dir "$EXPLORATION_PLUGIN"
```
Both plugins write to `context/memory/improvement-ledger.md` in their respective project dirs. Each produces its own chart. The text summaries can be concatenated for a combined view, as in the sketch below.
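A sketch of that combined view; the two project paths are hypothetical placeholders:

```python
from pathlib import Path

# Hypothetical project directories; substitute the real paths.
projects = [Path("/path/to/agentic-os-project"), Path("/path/to/exploration-project")]

parts = []
for project in projects:
    summaries = sorted((project / "context/memory/reports").glob("summary_*.md"))
    if summaries:
        parts.append(summaries[-1].read_text(encoding="utf-8"))  # latest summary
print("\n\n---\n\n".join(parts))
```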
## Reading the Chart

The chart mirrors the autoresearch `progress.png`:

- X-axis: cycle number (chronological order)
- Y-axis: eval score for the target skill (higher is better)
- Gray dots: DISCARD cycles, attempts that did not improve the skill
- Green dots: KEEP cycles, improvements that stuck
- Green step line: the running best, i.e. the frontier of improvement over time
- Annotations: what change was made on each KEEP cycle
A flat or declining step line means the loop is not improving the skill. Frequent DISCARD clusters suggest hypothesis quality needs work (check the test scenarios seed). Steep step-line rises indicate the survey-to-action trace is working.
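For the mechanics, a minimal matplotlib sketch of this chart. It assumes a DataFrame like the one from the parsing sketch earlier (numeric `score`, a `verdict` of KEEP/DISCARD, and `change` text); the real rendering lives in `generate_report.py`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_progress(df: pd.DataFrame, out_path: str) -> None:
    df = df.copy()
    df["score"] = df["score"].astype(float)
    df["cycle"] = range(1, len(df) + 1)
    keep = df["verdict"] == "KEEP"

    fig, ax = plt.subplots(figsize=(9, 4))
    ax.scatter(df.loc[~keep, "cycle"], df.loc[~keep, "score"], color="gray", label="DISCARD")
    ax.scatter(df.loc[keep, "cycle"], df.loc[keep, "score"], color="green", label="KEEP")
    # Running best: the frontier of improvement over time.
    ax.step(df["cycle"], df["score"].cummax(), where="post", color="green", alpha=0.5)
    for _, row in df[keep].iterrows():
        # Annotate each KEEP cycle with what changed.
        ax.annotate(row["change"], (row["cycle"], row["score"]),
                    xytext=(4, 4), textcoords="offset points", fontsize=7)
    ax.set_xlabel("Cycle number")
    ax.set_ylabel("Eval score")
    ax.legend()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
```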
## Adding a New Plugin

Any plugin that runs eval cycles can plug into this report by:

- Initializing `context/memory/improvement-ledger.md` with the three-section format (see `references/improvement-ledger-spec.md`, which includes a bash init snippet).
- Writing to Section 1 after every KEEP or DISCARD cycle.
- Writing to Section 2 when a survey friction item results in a change attempt.
- Writing to Section 3 once per session with the completion rate.
The `generate_report.py` script works on any ledger with this format; it is not tied to agent-agentic-os specifically.
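For illustration, a Python equivalent of the initialization step; the section headings and columns below are assumptions, and the spec's bash snippet is authoritative.

```python
from pathlib import Path

# Assumed three-section skeleton; see references/improvement-ledger-spec.md.
ledger = Path("context/memory/improvement-ledger.md")
if not ledger.exists():
    ledger.parent.mkdir(parents=True, exist_ok=True)
    ledger.write_text(
        "# Improvement Ledger\n\n"
        "## Section 1: Eval Score Progression\n\n"
        "| cycle | skill | score | verdict | change |\n"
        "|---|---|---|---|---|\n\n"
        "## Section 2: Survey-to-Action Trace\n\n"
        "## Section 3: North Star Metric\n",
        encoding="utf-8",
    )
```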
## References

- `improvement-ledger-spec.md`: ledger format, writing protocol, initialization
- `chart-reading-guide.md`: how to interpret KEEP/DISCARD dots, the step line, and text summary fields
- os-improvement-loop SKILL: Stage 4.7 writes to the ledger
- `test-scenarios-seed.md`: 50 pre-designed test hypotheses
- `post_run_survey.md`: survey template (Section 2 trace sources)