# /sg-improve — Self-Improving Feedback Loop
After an audit or visual-test session, this skill extracts what went well and what didn't, then feeds those insights back into ShipGuard so the next run is better. Think of it as a retrospective that writes its own action items.
Recommended model: Sonnet 4.6. Log analysis + config updates + GitHub issue writing is mechanical retrospective work where Opus 4.7 provides no measurable quality gain. Use `/model sonnet` before invoking to save Opus weekly quota.
Two outputs:

- Local learnings (`.shipguard/learnings.yaml`) — project-specific knowledge that the next `/sg-code-audit` reads automatically (zone size limits, codebase-specific patterns, infra timing)
- GitHub issue (`bacoco/ShipGuard`) — generic improvements that benefit all ShipGuard users (better retry logic, missing validation steps, UX friction)
```
Session (audit, visual-run, debug)
            │
            ▼
       /sg-improve
            │
            ├─► Phase 1: Collect structured data (audit-results.json, git log, regressions)
            ├─► Phase 2: Extract friction signals from data + conversation
            ├─► Phase 3: Classify each signal (project-specific vs generic)
            ├─► Phase 4: Write .shipguard/learnings.yaml (project memory)
            ├─► Phase 5: File GitHub issue (generic improvements)
            └─► Phase 6: Summary — what the next run will do differently
```
## Invocations

| Command | Behavior |
|---|---|
| `/sg-improve` | Full loop — local learnings + GitHub issue |
| `/sg-improve --local-only` | Save learnings locally, skip GitHub |
| `/sg-improve --github-only` | File GitHub issue only, skip local |
| `/sg-improve --dry-run` | Show what would be saved/filed, write nothing |
| `/sg-improve --rollback` | Revert to the previous snapshot (undo the last sg-improve run) |
| `/sg-improve --history` | List all snapshots with dates and summaries |
## Phase 0 — Snapshot (Safety Net)

Before touching ANY file, take a snapshot. This is the rollback point.

### What gets snapshotted

Every file that sg-improve might modify:

```
.shipguard/
  learnings.yaml   ← zone hints, audit hints, noise filters, session history
  mistakes.md      ← coding error journal
```
### How snapshots work

```bash
# Create snapshot directory
mkdir -p .shipguard/history/{timestamp}

# Copy current state
cp .shipguard/learnings.yaml .shipguard/history/{timestamp}/learnings.yaml 2>/dev/null
cp .shipguard/mistakes.md .shipguard/history/{timestamp}/mistakes.md 2>/dev/null

# Write metadata
cat > .shipguard/history/{timestamp}/meta.yaml << EOF
timestamp: "{ISO 8601}"
trigger: "sg-improve"
mode: "{flags used}"
audit_bugs: {count from audit-results.json or "unknown"}
files_snapshotted:
  - learnings.yaml
  - mistakes.md
EOF
```

Use `{timestamp}` = `YYYYMMDD-HHMMSS` (e.g., `20260414-073000`).
### Retention

Keep the last 5 snapshots. When creating the 6th, delete the oldest. The user can override with `--keep-all` to never prune. A pruning sketch follows below.
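A minimal pruning sketch in Python, assuming the `YYYYMMDD-HHMMSS` directory naming above; `prune_snapshots` and `MAX_SNAPSHOTS` are illustrative names, not part of the skill spec:

```python
# Retention rule sketch: keep the 5 newest snapshot dirs under
# .shipguard/history/, unless --keep-all was passed.
from pathlib import Path
import shutil

MAX_SNAPSHOTS = 5

def prune_snapshots(history_dir: Path, keep_all: bool = False) -> None:
    if keep_all:  # --keep-all: never prune
        return
    # Timestamp-named dirs sort lexicographically == chronologically
    snapshots = sorted(d for d in history_dir.iterdir() if d.is_dir())
    for old in snapshots[:-MAX_SNAPSHOTS]:
        shutil.rmtree(old)
```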
### `--rollback`

When the user runs `/sg-improve --rollback`:

1. List snapshots: `ls -1t .shipguard/history/` (newest first)
2. Show the user the most recent snapshot with its metadata:

   ```
   Last sg-improve run: 2026-04-14 07:30:00
   Audit bugs at time: 79
   Files modified: learnings.yaml, mistakes.md
   Rollback? (yes/no)
   ```

3. If confirmed:

   ```bash
   cp .shipguard/history/{latest}/learnings.yaml .shipguard/learnings.yaml
   cp .shipguard/history/{latest}/mistakes.md .shipguard/mistakes.md
   rm -rf .shipguard/history/{latest}
   ```

4. Print: "Rolled back to state before {timestamp}. The changes from that sg-improve run are undone."
### `--history`

Show all snapshots:

```
ShipGuard improve history:
  #1  2026-04-14 07:30   79 bugs   learnings.yaml + mistakes.md   (latest)
  #2  2026-04-13 22:30   47 bugs   learnings.yaml
  #3  2026-04-12 15:00   23 bugs   learnings.yaml

Rollback:              /sg-improve --rollback
Rollback to specific:  /sg-improve --rollback=#2
```
### `--rollback=#N`

Rollback to a specific snapshot (not just the latest), as sketched below:

1. Find snapshot `#N` in the history
2. Restore its files
3. Delete all snapshots newer than `#N` (they're now invalid)
4. Print: "Rolled back to state #N ({date}). {M} newer snapshots removed."
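A restore sketch under the same assumptions as the pruning sketch above; `restore_snapshot` is an illustrative name, and `#N` is 1-based with #1 as the newest:

```python
# --rollback=#N sketch: restore files from snapshot #N, then remove
# every snapshot newer than it (they describe a state that no longer exists).
from pathlib import Path
import shutil

def restore_snapshot(history_dir: Path, shipguard_dir: Path, n: int) -> None:
    snapshots = sorted((d for d in history_dir.iterdir() if d.is_dir()),
                       reverse=True)          # newest first, so #1 == index 0
    target = snapshots[n - 1]
    for name in ("learnings.yaml", "mistakes.md"):
        src = target / name
        if src.exists():                      # a snapshot may hold only some files
            shutil.copy(src, shipguard_dir / name)
    for newer in snapshots[: n - 1]:          # invalidate everything newer than #N
        shutil.rmtree(newer)
```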
## Phase 1 — Collect Structured Data
Before scanning the conversation, gather the hard data. These files contain objective metrics that don't depend on parsing chat messages.
### Step 1: Find audit-results.json

Check these paths in order (first found wins):

1. `visual-tests/_results/audit-results.json`
2. `.code-audit-results/audit-results.json`
3. `audit-results.json` (repo root)
If found, read it and extract:

- `summary.total_bugs`, `summary.by_severity`, `summary.by_category`
- `summary.duration_ms`, `agents` count
- `scope_info.mode`, `scope_info.total_in_scope` (if diff mode)
- Count of `bugs` where `fix_applied: false` (deferred/unfixable)
- Count of `bugs` where `confidence: "low"` (uncertain findings)
- `verification.checked`, `verification.confirmed`, `verification.rejected` (if present — Phase 5.7 stats)
- Count of `unverified_bugs` (findings rejected by confidence verification)
Prefer TOON: If `audit-results.toon` exists alongside the JSON, use it for LLM analysis (Phase 2+) — it's ~40% fewer tokens. Use the JSON only for structured field extraction in this step.
If not found, log: "No audit-results.json — extracting from conversation only."
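A minimal Python sketch of this step, assuming the field names listed above; the path list mirrors the lookup order, and the returned dict shape is illustrative:

```python
# Locate audit-results.json (first candidate wins) and extract the
# structured metrics that don't depend on parsing chat messages.
import json
from pathlib import Path

CANDIDATES = [
    "visual-tests/_results/audit-results.json",
    ".code-audit-results/audit-results.json",
    "audit-results.json",
]

def collect_audit_metrics(repo_root: Path) -> dict | None:
    path = next((repo_root / c for c in CANDIDATES if (repo_root / c).exists()), None)
    if path is None:
        return None  # "No audit-results.json — extracting from conversation only."
    data = json.loads(path.read_text())
    bugs = data.get("bugs", [])
    summary = data.get("summary", {})
    return {
        "total_bugs": summary.get("total_bugs"),
        "by_severity": summary.get("by_severity", {}),
        "duration_ms": summary.get("duration_ms"),
        "deferred": sum(1 for b in bugs if not b.get("fix_applied", True)),
        "low_confidence": sum(1 for b in bugs if b.get("confidence") == "low"),
    }
```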
### Step 2: Read zone JSON files

Glob `visual-tests/_results/zone-*-r*.json` or `.code-audit-results/zone-*-r*.json`.

For each zone file, extract:

- `zone`, `files_audited`, `bugs` count
- Whether the zone ID contains an `a`/`b` suffix (indicates a re-split happened) — see the sketch after this list
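A sketch of the re-split detection and the overflow rate it feeds (Phase 2, Performance Signals). The zone-ID suffix convention (e.g., `z03a`) is assumed from the descriptions in this document:

```python
# Count zones whose IDs carry an a/b suffix (a re-split happened) and
# compute overflow rate = split original zones / total original zones.
import json
import re
from pathlib import Path

def overflow_rate(results_dir: Path) -> float:
    zone_ids = [
        json.loads(p.read_text())["zone"]
        for p in results_dir.glob("zone-*-r*.json")
    ]
    originals = {re.sub(r"[ab]$", "", z) for z in zone_ids}
    split = {re.sub(r"[ab]$", "", z) for z in zone_ids if re.search(r"[ab]$", z)}
    return len(split) / len(originals) if originals else 0.0
```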
### Step 3: Read regressions

If `visual-tests/_regressions.yaml` exists, count entries and note any with `consecutive_passes >= 2` (about to be auto-removed — a success signal).
### Step 4: Read git log

```bash
git log --oneline --since="12 hours ago" | grep -c "audit-r[0-9]"
```

Count audit-related commits. Check for any revert commits (a signal of a bad fix).
### Step 5: Read existing learnings

If `.shipguard/learnings.yaml` exists, read it. The update in Phase 4 must merge with the existing data, not overwrite it.
## Phase 2 — Extract Friction Signals
Now combine the structured data with conversation context. For each signal type below, check both the data (Phase 1) and the conversation history.
### Error Signals

| Signal | How to detect | From |
|---|---|---|
| Context overflow | Zone IDs with `a`/`b` suffix in zone JSONs, or "Prompt is too long" in conversation | Data + conversation |
| API overload | "529", "overloaded_error" in conversation | Conversation |
| Post-merge syntax error | "IndentationError", "SyntaxError", "NameError" after a worktree merge | Conversation |
| Browser collision | agent-browser returning `/` when a different URL was requested | Conversation |
| Session expiry | Re-login needed mid-session | Conversation |
| Docker failure | "unhealthy", "dependency failed" after rebuild | Conversation |
| Merge conflict | "git merge --abort" in conversation | Conversation |
### Performance Signals

| Signal | How to compute |
|---|---|
| Overflow rate | (zones with `a`/`b` suffix) / (total original zones) |
| Retry count | Count "retry", "retrying", or "attempt" mentions in conversation |
| Wall clock | `summary.duration_ms` from audit-results.json, or estimate from first/last audit commit timestamps (sketch below) |
| Agent waste | Count of zone JSONs with duplicate zone paths (two agents did the same work) |
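A possible sketch of the wall-clock fallback, assuming audit commits are identifiable by an `audit-r{N}` marker in the commit message (as Step 4 suggests); the function name is illustrative:

```python
# Estimate wall clock from the first and last audit commit timestamps
# when summary.duration_ms is missing.
import subprocess

def wall_clock_minutes_from_git() -> float | None:
    out = subprocess.run(
        ["git", "log", "--since=12 hours ago", "--grep=audit-r[0-9]",
         "--format=%ct"],                       # %ct = committer unix timestamp
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if len(out) < 2:
        return None                             # not enough commits to estimate
    stamps = [int(s) for s in out]
    return (max(stamps) - min(stamps)) / 60
```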
### Quality Signals

| Signal | How to compute |
|---|---|
| Noise ratio | `summary.by_severity.low` / `summary.total_bugs` |
| Top noise pattern | Most frequent category among low-severity bugs |
| Deferred count | Count of `fix_applied: false` in the `bugs` array |
| Post-audit regression | Any Docker/build failure AFTER audit commits (check conversation) |
### User Friction Signals

Scan the user's messages (not assistant messages) in the current conversation for correction and frustration patterns. These indicate that ShipGuard or Claude's behavior was wrong — even if no error was thrown.

| Signal | Regex patterns (case-insensitive) | Priority |
|---|---|---|
| COMMAND_FAILURE | Tool use exit code != 0; stderr contains "error", "failed", "not found" | 100 |
| USER_CORRECTION | "I said", "you didn't", "that's wrong", "no not", "pas ça", "non c'est", "j'ai dit" | 80 |
| REPETITION | Same instruction given 2+ times (Jaccard similarity > 0.5 across last 10 user messages) | 60 |
| TONE_ESCALATION | 3+ uppercase words in a row, 2+ exclamation marks, "for the last time", "encore une fois", "STOP" | 40 |
| SKILL_OVERRIDE | User explicitly overrides a skill decision: "skip that", "don't do", "ignore", "laisse tomber" | 75 |
| REDO_REQUEST | "refais", "recommence", "redo", "try again", "re-run", "relance" | 70 |
Detection rules:

- Scan only user messages, never assistant messages
- A message can trigger multiple signals (e.g., correction + escalation)
- For REPETITION: compare each user message against the previous 10 using Jaccard word similarity. If 3+ pairs score > 0.5, emit one REPETITION signal with the count (see the sketch after this list)
- For COMMAND_FAILURE: check tool results in the conversation, not user messages
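A sketch of the REPETITION check under one reading of the rule; whitespace tokenization and the returned signal shape are illustrative choices:

```python
# Jaccard word similarity over the last 10 user messages; emit one
# REPETITION signal if 3+ pairs exceed the 0.5 threshold.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def repetition_signal(user_messages: list[str]) -> dict | None:
    recent = user_messages[-10:]
    pairs = sum(
        1
        for i in range(len(recent))
        for j in range(i + 1, len(recent))
        if jaccard(recent[i], recent[j]) > 0.5
    )
    if pairs < 3:
        return None
    return {"signal": "repetition", "count": pairs, "priority": 60, "type": "friction"}
```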
Output per signal:

```yaml
- signal: "user_correction"
  count: 2
  details: "User said 'pas ça' after audit hint was applied incorrectly"
  priority: 80
  type: "friction"
  quote: "non c'est pas le bon fichier"  # exact user quote (truncated to 100 chars)
```
These signals feed into Phase 3 classification. A friction signal with priority >= 70 should generate either a local learning (if project-specific) or a GitHub issue (if generic to ShipGuard behavior).
### Success Signals

These matter just as much — they tell us what NOT to change.

| Signal | How to detect |
|---|---|
| Clean merge rate | (total zones - merge conflicts) / total zones |
| Critical bug value | Count of critical + high severity bugs with `fix_applied: true` |
| Visual verification | Count of PASS results in the visual-run report |
| Zero-retry zones | Zones that completed on the first attempt |

For each signal, record:

```yaml
- signal: "context_overflow"
  count: 3
  details: "z01 (172 files), z02 (178 files), z04 (214 files)"
  impact: "Added ~10 min latency from re-splits and retries"
  type: "error"  # error | performance | quality | success
```
## Phase 3 — Classify

For each signal, decide where it belongs:

| Classification | Rule of thumb | Destination |
|---|---|---|
| project-specific | Mentions a file path, service name, port, or timing specific to this repo | `.shipguard/learnings.yaml` only |
| generic | Would help ANY repo using ShipGuard — a missing step, bad default, or design flaw in the skill | GitHub issue |
| both | A generic pattern that was observed through a project-specific symptom | Both |

When in doubt, classify as both. It's better to have a slightly noisy GitHub issue than to lose a generic insight.
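One way to mechanize this rule of thumb. The marker regexes and the `helps_any_repo` judgment flag are illustrative assumptions; in practice the model makes that judgment while reading the signal:

```python
# Heuristic classifier: repo-specific markers (paths, ports, service
# names) push toward learnings.yaml; a generic insight pushes toward
# a GitHub issue; uncertainty defaults to "both".
import re

PROJECT_MARKERS = [
    r"[\w./-]+\.(py|ts|tsx|js|yaml|json)",  # file paths
    r":\d{2,5}\b",                          # ports
    r"docker|compose|service",              # infra vocabulary
]

def classify(details: str, helps_any_repo: bool | None) -> str:
    """helps_any_repo is an LLM judgment; None means uncertain."""
    project = any(re.search(p, details, re.I) for p in PROJECT_MARKERS)
    if helps_any_repo is None:
        return "both"                       # when in doubt, classify as both
    if project and helps_any_repo:
        return "both"
    return "project-specific" if project else "generic"
```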
## Phase 4 — Local Learnings

Write to `{repo_root}/.shipguard/learnings.yaml`. Create the `.shipguard/` directory if it doesn't exist.
### Schema (v2)

```yaml
# .shipguard/learnings.yaml
# Auto-maintained by /sg-improve. Read by /sg-code-audit at startup.
# Manual edits are preserved — the skill only appends/updates, never deletes.

schema_version: 2
last_updated: "2026-04-14T07:00:00Z"

zone_hints:
  # Directories where the default zone sizing caused overflow.
  # sg-code-audit reads these to cap files-per-zone during zone discovery.
  - path: "apps/uranus/src/hooks/"
    max_files: 80
    reason: "172 files overflowed Sonnet context (2026-04-13)"
    last_seen: "2026-04-14"
    occurrences: 1

infra_hints:
  # Service-specific knowledge that helps with rebuild timing,
  # post-audit verification, and Docker dependency ordering.
  - service: "api-synthesia"
    startup_time_seconds: 240
    note: "Needs (healthy) before uranus can start"
    last_seen: "2026-04-14"

audit_hints:
  # Codebase-specific bug patterns to prioritize.
  # Injected into agent prompts as additional checklist items.
  - pattern: ".first() without None guard"
    severity: critical
    note: "SQLAlchemy returns None silently. 5 crash sites in rag_tasks.py."
    first_seen: "2026-04-14"
    occurrences: 5

noise_filters:
  # Patterns that generate high-volume, low-value findings.
  # sg-code-audit batches these into a single summary entry.
  - pattern: "f-string in logger"
    action: "batch"
    reason: "13% of findings, all low severity"

success_patterns:
  # Things that worked well — do NOT change these in the skill.
  - pattern: "worktree isolation for agents"
    note: "Clean merges on 10/13 zones. Isolation prevents cross-agent conflicts."
  - pattern: "severity calibration examples in prompt"
    note: "Agents consistently rated severity correctly. Keep the examples table."

session_history:
  # Last 10 sessions. Older entries auto-pruned on update.
  - date: "2026-04-14"
    mode: "standard"
    files: 2574
    bugs_found: 79
    bugs_fixed: 77
    critical: 9
    overflow_rate: 0.23
    wall_clock_minutes: 90
```
### Update Rules

- Read first. If the file exists, load it entirely before making changes.
- Merge, don't overwrite. Match entries by `path` (zone_hints), `service` (infra_hints), or `pattern` (audit/noise/success). If a match exists, update `last_seen`, increment `occurrences`, and update `note` if the new observation adds information.
- Append new entries for signals not already present.
- Prune `session_history` to the last 10 entries.
- Never delete zone_hints, audit_hints, or success_patterns — the user prunes manually. If an old hint seems stale (`last_seen` more than 90 days ago), add a `possibly_stale: true` flag instead of removing it.
- Preserve comments and manual edits. Read the file as text, parse the YAML, modify in memory, write back. If the file has comments that don't parse, preserve them as-is at the top.

A merge sketch follows below.
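A merge sketch assuming `ruamel.yaml`, which round-trips YAML comments; the `MATCH_KEYS` map mirrors the match rule above, and `merge_learnings` is an illustrative name:

```python
# Merge-don't-overwrite: update matched entries in place, append the
# rest, prune session_history to 10. ruamel.yaml preserves comments.
from pathlib import Path
from ruamel.yaml import YAML

MATCH_KEYS = {"zone_hints": "path", "infra_hints": "service",
              "audit_hints": "pattern", "noise_filters": "pattern",
              "success_patterns": "pattern"}

def merge_learnings(path: Path, new_entries: dict, today: str) -> None:
    yaml = YAML()
    data = (yaml.load(path.read_text()) or {}) if path.exists() else {}
    for section, entries in new_entries.items():
        key = MATCH_KEYS.get(section)
        existing = data.setdefault(section, [])
        for entry in entries:
            match = next((e for e in existing
                          if key and e.get(key) == entry.get(key)), None)
            if match:  # update in place: last_seen + occurrences
                match["last_seen"] = today
                match["occurrences"] = match.get("occurrences", 1) + 1
            else:
                existing.append(entry)
    if "session_history" in data:                   # keep last 10 sessions
        data["session_history"] = data["session_history"][-10:]
    data["last_updated"] = today
    with path.open("w") as f:
        yaml.dump(data, f)
```

Matching by key rather than replacing whole sections is what lets manual edits and accumulated `occurrences` counts survive across runs.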
## Phase 5 — GitHub Issue

### Pre-flight

- Check `gh auth status` — if not authenticated, skip this phase with a message: "GitHub CLI not authenticated. Run `gh auth login` to enable issue filing. Local learnings saved."
- Detect the repo: read the `origin` remote URL from the ShipGuard plugin installation directory. Fallback: `bacoco/ShipGuard`.
### Deduplication

Before creating a new issue, check for existing ones:

```bash
gh issue list --repo bacoco/ShipGuard --state open --label improvement --limit 30 --json number,title,body
```

For each insight you want to file:

1. Extract keywords from the insight title (e.g., "context overflow", "retry backoff", "syntax validation")
2. Search existing issue titles and bodies for those keywords
3. If a matching open issue exists: add a comment with the new data point instead of creating a duplicate. Format: `### New data point ({repo_name}, {date})\n{details}`
4. If no match: create a new issue
This is important because multiple users running /sg-improve on different projects will generate similar insights. Commenting on existing issues builds evidence ("3 users hit this") rather than fragmenting it across duplicates.
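A sketch of this flow with the gh CLI; the keyword heuristic and the function name are illustrative:

```python
# Comment on a matching open issue if one exists, otherwise create one.
import json
import subprocess

REPO = "bacoco/ShipGuard"

def file_or_comment(title: str, body: str) -> None:
    issues = json.loads(subprocess.run(
        ["gh", "issue", "list", "--repo", REPO, "--state", "open",
         "--label", "improvement", "--limit", "30",
         "--json", "number,title,body"],
        capture_output=True, text=True, check=True).stdout)
    keywords = [w for w in title.lower().split() if len(w) > 3]
    match = next((i for i in issues
                  if any(k in (i["title"] + i["body"]).lower() for k in keywords)),
                 None)
    if match:  # add evidence to the existing issue instead of duplicating it
        subprocess.run(["gh", "issue", "comment", str(match["number"]),
                        "--repo", REPO, "--body", body], check=True)
    else:
        subprocess.run(["gh", "issue", "create", "--repo", REPO,
                        "--title", title, "--body", body,
                        "--label", "improvement"], check=True)
```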
### Issue Format

```markdown
## Session Insights — {repo_name} ({date})

**Audit:** {mode} mode | {files} files | {zones} zones | {bugs} bugs ({critical} critical)
**Timing:** {minutes} min wall clock | {overflow_count} overflows | {retry_count} retries

### Improvements

#### 1. {Title}

**What happened:** {concrete description of the friction}
**Impact:** {time lost, bugs missed, or user confusion caused}
**Proposed fix:** {specific change to make in the skill prompt or code}
**Skill:** `sg-code-audit` | `sg-visual-run` | `sg-visual-review`

### What Worked Well

{Bullet list — these are signals to KEEP, not change}

### Summary

| # | Issue | Impact | Effort | Skill |
|---|-------|--------|--------|-------|

---
*Filed by `/sg-improve` from {repo_name}*
```
### Labels

Always add `improvement`. Then add skill-specific labels based on content:

- `sg-code-audit` — zone sizing, agent prompts, merge logic
- `sg-visual-run` — browser execution, auth, screenshots
- `sg-visual-review` — dashboard, report generation
- `dx` — developer experience, UX friction, confusing output
- `bug` — a skill instruction that produced incorrect behavior (not just suboptimal)
## Phase 6 — Summary

Display a concise report:

```
/sg-improve complete

Local (.shipguard/learnings.yaml):
  + {N} zone hints (max_files constraints for next audit)
  + {M} audit hints (codebase-specific patterns to check)
  + {K} noise filters (low-value patterns to batch)
  + {S} success patterns (what to keep)
  Session #{H} recorded

GitHub:
  {IF new issue}  Created bacoco/ShipGuard#{number} — "{title}"
  {IF commented}  Updated bacoco/ShipGuard#{number} with new data point
  {IF skipped}    Skipped (--local-only or gh not authenticated)

Next /sg-code-audit on this repo will:
  - Cap {path} zones at {max_files} files
  - Add {N} project-specific patterns to agent checklists
  - Batch {K} noise patterns into summary entries
```
## Edge Cases

### No audit-results.json found

The skill can still extract signals from conversation history (retries, errors, timing). Local learnings will be thinner but still useful. Log: "No audit artifacts found — using conversation context only."

### First run (no existing learnings.yaml)

Create the file from scratch. All signals become new entries. `session_history` starts with one entry.

### User ran /sg-improve twice in the same session

The second run should detect that `session_history` already has an entry for today. Update it rather than appending a duplicate.

### gh CLI not installed or not authenticated

Skip Phase 5 entirely. Print the issue body to the terminal so the user can file it manually if they want.

### The conversation is very long (>100K tokens)

Don't try to re-read the entire conversation. Focus on:

- The structured data from Phase 1 (audit-results.json, zone JSONs)
- The last 20 messages in the conversation (most recent friction)
- Any messages containing error keywords (grep-style scan)

### No ShipGuard skills were used this session

If the conversation doesn't contain evidence of /sg-code-audit, /sg-visual-run, or /sg-visual-review usage, ask the user what they want to capture: "I don't see a ShipGuard session in this conversation. What would you like me to analyze?"
## Phase 4b — Mistakes File (Coding Memory)

In addition to learnings.yaml (machine-readable, audit-focused), maintain a human-readable mistakes file that Claude reads at every coding session — not just during audits.

Write to `{repo_root}/.shipguard/mistakes.md`. This file is referenced from the project's CLAUDE.md so that Claude Code loads it at session start.
### Format

````markdown
# Erreurs à ne pas répéter

## {Language}

### {Rule title}

```{language}
# ❌ Bad pattern
bad_code_here()

# ✅ Good pattern
good_code_here()
```

*{Where found, when, how many instances}*
````
### What goes in mistakes.md

Only real bugs found in THIS project — not generic best practices. Each entry must have:

- A bad pattern (code that was actually written and caused a bug)
- The fix (what it should have been)
- Context (which files, how many instances, what broke)
### Update Rules

- Read the existing file first — don't overwrite
- For each `critical` or `high` severity bug from the audit:
  - Check if a similar pattern already exists in mistakes.md
  - If yes: update the instance count and add the new file to the context
  - If no: add a new entry under the appropriate language section
- Don't add `low` severity bugs (too noisy for a coding reference)
- Keep entries concise — one screen of code max per entry
- Add a "last updated" date at the bottom
### How it's consumed

The project's CLAUDE.md should reference this file:

```markdown
## Erreurs Récurrentes

**Cahier d'erreurs complet : `.shipguard/mistakes.md`** — LIRE ce fichier au début de chaque session.
```

This means every Claude Code session — not just audits — benefits from the accumulated knowledge. When a developer asks Claude to write a new hook, Claude already knows "don't use `|| []` in Zustand selectors" because it read mistakes.md at session start.
## How sg-code-audit Consumes Learnings

For the feedback loop to work, /sg-code-audit must read `.shipguard/learnings.yaml` at startup. Here's the integration point (to be added to sg-code-audit Phase 3 — Discover Zones):

```
If {repo_root}/.shipguard/learnings.yaml exists:
1. Read the file
2. zone_hints: during zone splitting, if a directory matches a hint's path,
   cap that zone at hint.max_files (override the default 30/80 thresholds)
3. audit_hints: append each pattern to the language-specific checklist
   in the agent prompt, with its severity and note
4. noise_filters: for patterns with action "batch", add to the agent prompt:
   "For {pattern}: report ONE summary entry with total count, not individual bugs"
5. success_patterns: no action needed — these are for /sg-improve's reference only
6. Print: "Loaded {N} learnings from .shipguard/learnings.yaml"
```
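A sketch of that startup read, assuming PyYAML; `load_learnings` and `max_files_for` are illustrative names, not the actual sg-code-audit interface:

```python
# Read learnings.yaml at audit startup and expose zone_hints as
# per-directory caps during zone splitting.
from pathlib import Path
import yaml

def load_learnings(repo_root: Path) -> dict:
    path = repo_root / ".shipguard" / "learnings.yaml"
    if not path.exists():
        return {}
    data = yaml.safe_load(path.read_text()) or {}
    n = sum(len(data.get(k, []))
            for k in ("zone_hints", "audit_hints", "noise_filters"))
    print(f"Loaded {n} learnings from .shipguard/learnings.yaml")
    return data

def max_files_for(directory: str, learnings: dict, default: int = 80) -> int:
    # zone_hints override the default files-per-zone cap for matching paths
    for hint in learnings.get("zone_hints", []):
        if directory.startswith(hint["path"]):
            return hint["max_files"]
    return default
```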
This creates the reinforcement loop:

```
Audit 1: hooks overflow at 172 files → /sg-improve saves max_files: 80
Audit 2: sg-code-audit reads hint → splits hooks into 2 zones → no overflow ✓

Audit 1: .first() crashes found 5 times → /sg-improve saves audit_hint
Audit 2: agents see the pattern in their checklist → catch it on first scan ✓

Audit 1: f-string loggers = 13% of findings → /sg-improve saves noise_filter
Audit 2: agents batch into "42 f-string calls in 12 files" → cleaner report ✓
```
## Final Checklist

- Snapshot taken BEFORE any modification (Phase 0)
- audit-results.json read (if it exists)
- Zone JSONs scanned for overflow indicators
- Git log checked for audit commits and reverts
- Conversation scanned for error/performance/quality/success signals
- Each signal classified (project-specific / generic / both)
- `.shipguard/learnings.yaml` created or merged (unless `--github-only`)
- `.shipguard/mistakes.md` updated with critical/high bug patterns (unless `--github-only`)
- Existing GitHub issues checked for duplicates (unless `--local-only`)
- GitHub issue created or existing issue commented on (unless `--local-only`)
- Summary displayed with concrete "next run will..." predictions