# learn-from-session
User request: $ARGUMENTS
Analyze a Claude Code session to identify what went well and what could be improved, then suggest high-confidence fixes to skills in this repository.
Input formats:
- Session ID (UUID): `184078b7-2609-46e0-a1f2-bb42367a8d34`
- Session file path: `~/.claude/projects/.../session-id.jsonl`
- Inline commentary: text description of what happened

Output: High-confidence issues only, with evidence-based suggestions for skill improvements.
Signal quality bar: Only recommend changes that would have prevented specific rework in the session. A fix is high-signal when ALL of:
- You can point to exact message numbers where rework occurred
- The skill change would have triggered BEFORE that rework
- Following the change would have produced correct output initially
Definition - high-signal fix: A skill change that passes the 3/3 counterfactual test (see Phase 5.2).
## Phase 1: Parse Input & Setup
### 1.1 Identify input type
| Input Pattern | Type | Action |
|---|---|---|
| UUID format (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) | Session ID | Find and read session file |
| Path ending in `.jsonl` | Session file | Read directly |
| Other text | Commentary | Analyze inline, may reference sessions |
### 1.2 Locate session file (if session ID)
Session files are stored at:
`~/.claude/projects/{project-path-encoded}/{session-id}.jsonl`
Note: {project-path-encoded} replaces path separators with URL-safe encoding (e.g., /home/user/myproject becomes -home-user-myproject). Don't rely on exact path structure—use find instead.
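For intuition, a minimal sketch of that encoding, assuming it is a plain separator swap (which may not hold across Claude Code versions):

```bash
# Illustrative only: replicate the "/home/user/myproject" -> "-home-user-myproject"
# example above. The real encoding may differ, which is why `find` is preferred.
project_path="/home/user/myproject"
encoded=$(printf '%s' "$project_path" | tr '/' '-')
echo "$encoded"                                # -home-user-myproject
ls ~/.claude/projects/"$encoded"/ 2>/dev/null  # list session files, if the guess holds
```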
Use Bash to find:

```bash
find ~/.claude/projects -name "*{session-id}*" -type f 2>/dev/null
```
If the file is not found: ask the user to provide the session file path directly, or to check that the session ID is correct.
### 1.3 Create analysis log
Path: `/tmp/session-analysis-{session-id-short}-{timestamp}.md`
Purpose: External memory that persists findings beyond LLM working memory. Write to this file IMMEDIATELY after each discovery—never batch multiple findings into one write.
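A minimal sketch of building that path, assuming `{session-id-short}` means the first 8 characters of the UUID (an assumption, not a documented rule):

```bash
# Hypothetical example - the short-id convention is assumed, not specified above
session_id="184078b7-2609-46e0-a1f2-bb42367a8d34"
log_file="/tmp/session-analysis-${session_id:0:8}-$(date +%Y%m%d-%H%M%S).md"
echo "$log_file"   # e.g. /tmp/session-analysis-184078b7-20250101-120000.md
```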
```markdown
# Session Analysis Log
Session: {id or "inline commentary"}
Started: {timestamp}
Status: IN_PROGRESS

---

## Session Overview
<!-- Write immediately after Phase 2 parsing -->
**Initial request**:
**Skills invoked**:
**Outcome**:
**Session length**:

---

## Pattern Detection

### Iterations Found
<!-- Write immediately after detecting iterations - before moving to corrections -->

### User Corrections Found
<!-- Write immediately after detecting corrections - before moving to deviations -->

### Workflow Deviations Found
<!-- Write immediately after detecting deviations - before moving to missing questions -->

### Missing Questions Found
<!-- Write immediately after detecting missing questions - before moving to post-impl -->

### Post-Implementation Fixes Found
<!-- Write immediately after detecting post-impl fixes - before skill comparison -->

---

## Skill Comparison

### Skills Discovered
<!-- Write immediately after discovering which skills were used -->

### Skill: {name}
<!-- Write immediately after analyzing EACH skill - don't batch -->

---

## Potential Issues
<!-- Write each issue as identified during comparison -->

---

## Counterfactual Analysis
<!-- Write results of 3/3 test for each issue -->

---

## Final Recommendations
<!-- Populated after refresh step -->
```
### 1.4 Create todo list
CRITICAL: Write to log IMMEDIATELY after each finding—never batch writes.
- [ ] Setup: Create log file, parse session, write overview
- [ ] Pattern detection: iterations, corrections, deviations, missing questions, post-impl fixes (write each to log)
- [ ] Skill discovery: extract skills, locate files, write to log
- [ ] (expand: "Analyze {skill}" for each skill found)
- [ ] Refresh context: read FULL analysis log
- [ ] Counterfactual analysis: test each issue, write recommendations
- [ ] Output final report
Expansion: When skills are discovered, add one todo per skill:
- [ ] Analyze {skill-name} skill + write findings to log
Why write-after-each-step matters: by the time synthesis begins, early findings have suffered context rot. Writing externalizes them to a file that persists. The refresh step then moves ALL findings to the end of context (the highest-attention zone).
## Phase 2: Parse Session
### 2.1 Session file structure
Claude Code sessions are JSONL files with these record types:
| Type | Contains |
|---|---|
| `user` | User messages, `message.content` field |
| `assistant` | Claude responses, tool calls, thinking |
| `system` | System events, commands, hooks |
| `file-history-snapshot` | File state tracking |
### 2.2 Extract key events
Use `jq` to parse:

```bash
# User messages
cat {session-file} | jq -r 'select(.type == "user") | .message.content' 2>/dev/null

# Tool calls
cat {session-file} | jq -r 'select(.type == "assistant") | .message.content | if type == "array" then .[] | select(.type == "tool_use") | .name else empty end' 2>/dev/null

# Skill invocations
grep -o '"skill":"[^"]*"' {session-file} | sort | uniq -c
### 2.3 Build session overview
Extract and log:
- Initial request: First user message (the goal)
- Workflow used: Which skills were invoked (`/spec`, `/plan`, `/implement`, etc.)
- Workflow skipped: Skills that would typically apply but weren't invoked (see table below)
- Outcome: Success, partial, or required rework
- Session length: Message count, duration if available
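One possible sketch for the session-length figure, assuming `jq` is available:

```bash
# Approximate session length as record counts per type; jq processes each
# JSONL line as a separate value.
jq -r '.type' {session-file} | sort | uniq -c
```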
Expected skills by task type (use to detect skipped workflows):
| Task Indicators in Request | Expected Skills |
|---|---|
| "build", "implement", "create feature", "add" | spec → plan → implement |
| "fix bug", "debug", "not working" | bugfix |
| "review", "check", "audit" | review (or specific review-*) |
| "refactor", "improve", "optimize" | plan → implement |
| Multi-file changes (3+ files likely) | plan before implement |
Only flag as "skipped" if evidence suggests the skill would have prevented issues that occurred.
## Phase 3: Pattern Detection
Analyze the session for these patterns. Each pattern has evidence requirements.
### 3.1 Iteration patterns (things that didn't work first time)
Evidence required: Same file edited multiple times, OR error → fix → retry sequence
Look for:
- TypeScript errors followed by fixes
- Test failures followed by code changes
- Lint errors followed by formatting changes
- Same function/file edited more than once with different intent (not additive changes, but corrections)
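A detection sketch, assuming Edit/Write `tool_use` records carry an `input.file_path` field (verify against the actual session schema):

```bash
# Surface files touched by more than one Edit/Write tool call - candidates
# for "didn't work the first time" iterations.
jq -r 'select(.type == "assistant") | .message.content
       | if type == "array" then .[] else empty end
       | select(.type == "tool_use" and (.name == "Edit" or .name == "Write"))
       | .input.file_path // empty' {session-file} | sort | uniq -c | awk '$1 > 1'
```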
Log format:
```markdown
### Iteration: {description}
- Files affected: {list}
- Attempts: {count}
- Root cause: {why it didn't work first time}
- Potential skill gap: {what could have prevented this}
```
Write ALL iteration findings to log file NOW, before proceeding to 3.2.
### 3.2 User corrections ("no, I meant...")
Evidence required: User message containing correction language
Correction indicators:
- "no", "not what I meant", "actually", "instead", "I meant"
- "let's go back", "undo", "revert"
- "that's wrong", "incorrect"
Log format:
```markdown
### User Correction: {what was corrected}
- Original action: {what Claude did}
- User feedback: {correction text}
- Missing context: {what Claude should have asked/known}
```
Write ALL correction findings to log file NOW, before proceeding to 3.3.
### 3.3 Workflow deviations
Evidence required: Expected workflow step skipped or out-of-order
Check for:
- Multi-phase skill invoked but phases skipped (read skill to know expected phases)
- Skill with prerequisites invoked without those prerequisites (e.g., implementation without planning)
- Ordered steps executed out of order
- Verification/validation steps skipped before proceeding
How to detect: Compare skill's documented phases against actual session sequence.
Log format:
```markdown
### Workflow Deviation: {what was skipped/reordered}
- Skill: {which skill's workflow}
- Expected flow: {phases from skill definition}
- Actual flow: {what happened in session}
- Impact: {did this cause issues later?}
```
Write ALL deviation findings to log file NOW, before proceeding to 3.4.
### 3.4 Missing questions
Evidence required: Information discovered during implementation that should have been asked upfront
Look for:
- Design decisions made mid-implementation
- Assumptions that were later corrected
- "The user confirmed..." appearing late in session
- Post-implementation "actually, let's change..." patterns
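A rough sketch for surfacing late confirmation/assumption language (the phrase list is illustrative, not exhaustive):

```bash
# High line numbers on these hits suggest decisions that were made late and
# might have been questions asked upfront.
grep -inE 'user confirmed|I.ll assume|assuming|actually, let.s' {session-file}
```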
Log format:
```markdown
### Missing Question: {what should have been asked}
- Discovered at: {phase where it came up}
- Impact: {rework required}
- Skill gap: {which skill should have asked this}
```
Write ALL missing question findings to log file NOW, before proceeding to 3.5.
### 3.5 Post-implementation fixes
Evidence required: Changes made AFTER "implementation complete" or PR creation
Look for:
- Commits/changes after PR URL appears
- Refactoring after "done" or "complete" messages
- Review findings that required code changes
- User requesting changes after seeing "finished"
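A detection sketch, assuming PR URLs follow the GitHub shape (adjust for other hosts):

```bash
# Find the first line where a PR URL appears, then count Edit tool calls
# after that point - nonzero counts indicate post-PR changes.
pr_line=$(grep -nm1 'github\.com/[^"]*/pull/' {session-file} | cut -d: -f1)
[ -n "$pr_line" ] && tail -n +"$pr_line" {session-file} | grep -c '"name":"Edit"'
```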
Log format:
```markdown
### Post-Implementation Fix: {what was fixed}
- Original implementation: {what was done}
- Fix required: {what changed}
- Should have been caught by: {which phase/skill}
```
Write ALL post-implementation findings to log file NOW, before proceeding to Phase 4.
## Phase 4: Skill Comparison
### 4.1 Discover skills used in session
Step 1: Extract skill invocations from session:
```bash
# Find all Skill tool invocations (handles both "skill" and skill names in tool calls)
grep -oE '"skill"\s*:\s*"[^"]*"' {session-file} | sort | uniq -c

# Find slash command patterns in user messages (alphanumeric with hyphens)
grep -oE '/[a-zA-Z][a-zA-Z0-9-]*' {session-file} | sort | uniq -c

# Find Skill tool calls with plugin:skill format
grep -oE 'Skill\s*\(\s*"[^"]+:[^"]+"' {session-file} | sort | uniq -c
```
Step 2: Find ALL relevant files for each skill.
Option A - If codebase-explorer skill is available:
Invoke the vibe-workflow:explore-codebase skill with: "medium - find all files related to the {skill-name} skill: SKILL.md definition, related agents it spawns, hooks that affect it, and any shared utilities"
Option B - Manual search fallback:
```bash
# Find skill definition
find . -path "*/skills/{skill-name}/SKILL.md" -type f

# Find related agents (look in same plugin's agents/ folder)
PLUGIN_DIR=$(dirname $(dirname {skill-path}))
ls "$PLUGIN_DIR/agents/" 2>/dev/null

# Find hooks that might affect this skill
grep -r "{skill-name}" --include="*.py" */hooks/ 2>/dev/null
```
This discovers:
- The SKILL.md definition itself
- Agents the skill spawns (e.g., plan-verifier for /plan)
- Hooks that intercept skill behavior (e.g., Stop hook)
- Shared utilities the skill depends on
- Test files that document expected behavior
Step 3: Log discovered skills with full context:
```markdown
## Skills Used in Session

### {skill-name}
- **SKILL.md**: {path}
- **Related agents**: {list from explorer}
- **Related hooks**: {list from explorer}
- **Invoked**: {count} times
```
Write discovered skills to log file NOW, before proceeding to 4.2.
### 4.2 Extract actionable rules from each skill
For each skill file, extract:
Rule indicators (look for these patterns):
- `must`, `should`, `never`, `always` → mandatory behaviors
- `## Phase N:` or `### Step N:` → workflow phases
- `questions:` or `AskUserQuestion` → required user prompts
- `| Condition | Action |` tables → decision rules
- `**CRITICAL**`, `**IMPORTANT**` → high-priority rules
- `- [ ]` todo templates → expected workflow steps
- `Acceptance:` or `Validation:` → verification requirements
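A possible one-pass extraction sketch over a skill file, combining the indicators above:

```bash
# Pull candidate rule lines with line numbers for citation in the log.
grep -nE '\b(must|should|never|always)\b|\*\*(CRITICAL|IMPORTANT)\*\*|^#{2,3} +(Phase|Step)' {skill-path}
```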
Extract and log:
```markdown
### Skill: {name}
**File**: {path}

**Mandatory behaviors**:
- {rule with line number}

**Workflow phases**:
1. {phase name} - expected outputs: {list}

**Required questions** (when applicable):
- {question topic}

**Verification steps**:
- {what should be checked}
```
### 4.3 Compare documented vs actual
For each skill used in the session:
| Aspect | Documented | Actual | Gap? | Impact |
|---|---|---|---|---|
| Questions asked | {from skill} | {from session} | {Y/N} | {would have prevented X} |
| Phases followed | {from skill} | {from session} | {Y/N} | {would have caught Y} |
| Validations run | {from skill} | {from session} | {Y/N} | {would have avoided Z} |
| Outputs produced | {from skill} | {from session} | {Y/N} | {required later but missing} |
Key comparison questions:
- Did the skill ask all documented questions? If not, did missing answers cause issues?
- Were all phases executed in order? If skipped, did it matter?
- Were validations run? If skipped, did bugs slip through?
- Did outputs match documented format? If not, did downstream steps suffer?
### 4.4 Log skill gaps with impact
```markdown
### Skill Gap: {skill name}
- **Rule violated**: {what skill says to do, with line reference}
- **What happened**: {actual behavior in session}
- **Evidence**: {specific session content showing gap}
- **Impact**: {did this cause iteration/correction/rework?}
- **Counterfactual**: {if rule was followed, would outcome differ?}
- **Confidence**: HIGH | MEDIUM | LOW
```
Only log gaps where:
- Evidence is clear (specific session content)
- Impact is documented (caused measurable problem)
- Counterfactual is plausible (fix would have helped)
Write skill gap findings to log file after analyzing EACH skill—never batch multiple skills into one write.
## Phase 5: Synthesize Recommendations
### 5.1 Refresh context (non-negotiable)
CRITICAL: Use the Read tool to read the FULL analysis log file NOW before any synthesis.
Why this step exists: By this point, findings from Phase 3 (pattern detection) have suffered context rot—they're in the "lost middle" of conversation. The log file contains ALL findings written throughout this workflow. Reading it:
- Moves ALL findings to context END (highest attention zone)
- Converts holistic synthesis (LLMs are bad at this) into dense recent context (LLMs are good at this)
- Restores details that would otherwise be missed
Action: Use `Read("/tmp/session-analysis-{id}-{timestamp}.md")` to read the entire log file. Do NOT proceed to 5.2 until this is complete.
### 5.2 Counterfactual analysis (the high-signal filter)
For each potential issue, answer these three questions with specific evidence:
| Test | Question | Evidence Required |
|---|---|---|
| T1: Rework identified | Can you cite specific message numbers where rework occurred? | Message #X shows error, #Y shows fix |
| T2: Intervention point | At which earlier message would the skill change have triggered? | Message #Z is where skill phase X runs |
| T3: Trigger conditions met | Did the session content at that point contain info that would activate the change? | Quote the text that would trigger it |
Scoring:
| Score | Criteria | Action |
|---|---|---|
| 3/3 | All three answered with specific evidence | HIGH confidence → Include |
| 2/3 | One test lacks specific evidence | MEDIUM confidence → Deferred section |
| 1/3 or 0/3 | Multiple tests lack evidence | LOW confidence → Discard |
Example counterfactual:
```
Issue: Plan skill should ask about time filtering
T1 (Rework): Message #47 adds 90-day filter, #48 user says "yes that's what I meant"
T2 (Intervention): Message #12 is plan phase, "Files to modify" section
T3 (Trigger): Message #5 user wrote "recent refunds" - this would trigger temporal scope question
Score: 3/3 → HIGH confidence
```
### 5.3 Additional disqualifiers
Even with 3/3 counterfactual score, discard if:
- Compliance failure: Skill already documents this; it just wasn't followed
- One-off context: Unusual situation unlikely to recur (e.g., user typo)
- Scope creep: Fix would make skill too complex for marginal benefit
- Side effects: Fix would break other documented behavior
### 5.4 Format recommendations
For each HIGH confidence issue:
````markdown
## Issue: {short title}
**Confidence**: HIGH
**Counterfactual score**: 3/3

### Evidence
{Quote or describe specific session content}

### What Happened
{Brief description of the problem}

### Root Cause
{Why the current skill didn't prevent this - be specific about what's missing}

### Counterfactual
- **Iteration avoided**: {which rework/correction would NOT have happened}
- **Intervention point**: {exact moment the fix would have triggered}
- **Trigger condition**: {why the fix would have applied to this session}

### Suggested Fix
**File**: {skill file path}
**Section**: {phase/section name}
**Line**: ~{approximate line number}

**Current behavior**:
{What the skill does now}

**Proposed behavior**:
{What the skill should do instead}

```
{code block showing the diff if applicable}
```

### Risk Assessment
- **Side effects**: {could this break other flows? NO/LOW/MEDIUM/HIGH}
- **Complexity added**: {minimal/moderate/significant}
- **Test approach**: {how to verify the fix works}
````
## Phase 6: Output
### 6.1 Final report structure
```markdown
# Session Analysis: {session id or description}

## Summary
- **Session outcome**: {success/partial/rework needed}
- **Workflow used**: {skills invoked}
- **Iterations observed**: {count of retry/fix cycles}
- **High-signal fixes**: {count} (3/3 counterfactual score)

## What Went Well
{List of things that worked correctly - be specific about which skill behaviors succeeded}

## Iteration Timeline
{Chronological list of corrections/retries observed, with timestamps if available}
1. {time}: {what was attempted}
2. {time}: {what went wrong}
3. {time}: {how it was fixed}

## High-Confidence Improvements

### Fix 1: {title}
{Full format from 5.4}

### Fix 2: {title}
...

## Deferred (Partial Counterfactual Match)
{Issues that scored 2/3 - might help but not definitively causal}

| Issue | Missing Criterion | Why Deferred |
|-------|-------------------|--------------|
| {title} | {which of the 3 failed} | {brief explanation} |

## Not Actionable
{Issues that scored 1/3 or 0/3, or hit disqualifiers - briefly note for transparency}

---
Analysis log: {log file path}
```
### 6.2 Return report
Output the final report. The user can then decide which fixes to implement.
## Key Principles
| Principle | Rule |
|---|---|
| Counterfactual-driven | Every fix must pass the 3/3 counterfactual test |
| Evidence-based | Quote or cite specific session content |
| Iteration-focused | Primary signal = things that required retry/correction |
| Skill-focused | Goal is improving skills, not critiquing user or session |
| Log-driven | Write findings to log as you go, refresh before synthesis |
## Confidence Criteria (Counterfactual-Based)
| Score | Test Results | Action |
|---|---|---|
| 3/3 | Would avoid iteration + Name exact moment + Conditions would trigger | HIGH → Include |
| 2/3 | Two of three | MEDIUM → Deferred section |
| 1/3 | One of three | LOW → Not Actionable section |
| 0/3 | None | Discard entirely |
## Disqualifiers (Even at 3/3)
| Disqualifier | Why It Filters Out |
|---|---|
| Compliance failure | Skill already covers this - LLM didn't follow |
| One-off context | Unusual situation unlikely to recur |
| Scope creep | Fix adds complexity disproportionate to benefit |
| Side effects | Fix would break other documented behaviors |
## Error Handling
| Error | Response |
|---|---|
| Session file not found | Ask user to provide path directly |
| Malformed JSONL (parse error) | Skip malformed lines, note in report |
| Skill file not found | Note as "skill definition unavailable" in comparison |
| jq not available | Use grep/sed fallback patterns to extract JSON fields |
| Empty session | Report "Session contains no analyzable content" |
## Never Do
- Suggest changes without specific session evidence
- Include fixes that score <3/3 in main recommendations
- Skip the counterfactual test ("it seems like it would help")
- Skip reading the full analysis log before synthesis
- Critique user behavior (focus on skill gaps, not user mistakes)
- Suggest fixes that would break other documented behavior
- Recommend changes to skills that weren't used in the session