loop-diagnosis
Loop Diagnosis
Diagnose and fix stalled agent coding loops. This skill covers the diagnostic CLI, common failure modes, and the observability patterns that prevent silent stalls.
Quick Commands
# Diagnose all active loops at once
joelclaw loop diagnose all -c
# Diagnose a specific loop
joelclaw loop diagnose <loop-id> -c
# Diagnose AND auto-fix
joelclaw loop diagnose all -c --fix
# Full JSON output (for detailed inspection)
joelclaw loop diagnose <loop-id>
What Diagnosis Checks
The diagnose command runs six steps in order. The first five gather state and the sixth interprets it:
- Redis state — PRD stories (pass/skip/pending), progress entries, active claims
- Worktree — exists? commits? uncommitted changes? .out files?
- Inngest runs — running/failed agent-loop-* functions, recent plan runs
- Agent processes — any claude/codex processes still alive?
- Worker health — function_count from localhost:3111/api/inngest
- Diagnosis — pattern-matches the above into a root cause
Failure Modes & Fixes
| Diagnosis | Root Cause | Auto-Fix |
|---|---|---|
| CHAIN_BROKEN | Judge sent story.passed but plan never received it. Event lost in transit. | Re-fires agent/loop.story.passed → plan picks next story |
| ORPHANED_CLAIM | Story claimed by an event, but the agent died and no Inngest run is active. | Clears claim + re-fires plan event |
| STUCK_RUN | Inngest run marked RUNNING but the agent process is dead. Run won't complete. | Clears claims + re-fires (manual run cancellation may be needed in the Inngest dashboard) |
| WORKER_UNHEALTHY | Worker registering fewer functions than expected. Missing imports or crash loop. | Restarts system-bus-worker deployment in k8s |
| NO_PRD | Loop has no PRD in Redis. It was nuked or never created. | None; start a new loop |
| COMPLETE | All stories passed or skipped. Nothing to do. | None; run joelclaw loop nuke dead to clean up |
When to Use (vs Other Skills)
- loop-diagnosis → Loop is stuck/stalled, need to figure out why and fix it
- loop-nanny → Loop is running, need to monitor progress and clean up after
- agent-loop → Need to START a new loop
The Event Chain
Understanding the chain helps diagnose WHERE it broke:
agent/loop.started
→ plan (picks story, dispatches test-writer)
→ agent/loop.story.dispatched
→ test-writer (writes acceptance tests)
→ agent/loop.tests.written
→ implement (codex/claude writes code)
→ agent/loop.story.implemented
→ review (runs tests, typecheck, claude review)
→ agent/loop.story.reviewed
→ judge (pass/fail/retry decision)
→ agent/loop.story.passed ←── feeds back to plan
→ agent/loop.story.failed ←── feeds back to plan
→ agent/loop.story.retry ←── feeds back to implement
Most common break point: judge → plan. The agent/loop.story.passed event fires but plan never picks it up. This happens when:
- Inngest restarts while the event is in flight
- The worker restarts between judge and plan
- A k8s pod restart drops the event
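The chain above can be written down as a lookup from each event to the function that should consume it. Given the last event recorded for a loop, the lookup names the function to inspect first. This is a sketch: the event names come from the chain, but the `CONSUMER` table and `expected_consumer` helper are hypothetical and not part of joelclaw.

```python
# Event -> function expected to pick it up, transcribed from the chain above.
CONSUMER = {
    "agent/loop.started": "plan",
    "agent/loop.story.dispatched": "test-writer",
    "agent/loop.tests.written": "implement",
    "agent/loop.story.implemented": "review",
    "agent/loop.story.reviewed": "judge",
    "agent/loop.story.passed": "plan",      # feeds back to plan
    "agent/loop.story.failed": "plan",      # feeds back to plan
    "agent/loop.story.retry": "implement",  # feeds back to implement
}


def expected_consumer(last_event: str) -> str:
    """Return the function that should have run after `last_event`.

    Raises ValueError for events that are not part of the chain.
    """
    try:
        return CONSUMER[last_event]
    except KeyError:
        raise ValueError(f"not a loop chain event: {last_event!r}") from None


# If the last event seen is story.passed and nothing followed,
# plan is where the chain most likely broke.
print(expected_consumer("agent/loop.story.passed"))
```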
Observability Patterns
Passive: Failure Events
Every loop function should emit a failure event through an onFailure handler (the handlers are being added by the harden loop). Each handler fires agent/loop.function.failed, which is logged to slog.
Active: Watchdog (Future)
A periodic Inngest function (system/loop-watchdog) that:
- Scans all loops in Redis with pending stories
- Checks if any events were emitted in the last 10 minutes
- If not → auto-runs diagnose + fix
- Logs to slog + daily log
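Since the watchdog does not exist yet, the following is only a sketch of its stall-detection step: pick loops that still have pending stories and have emitted no event in the last 10 minutes. The dictionary shape (`id`, `pending`, `last_event_at`) is an assumption; the real Redis layout is not described here.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the watchdog description above.
STALL_THRESHOLD = timedelta(minutes=10)


def find_stalled(loops, now):
    """Return ids of loops with pending stories and no recent events.

    `loops` is an iterable of dicts with hypothetical keys:
    'id' (str), 'pending' (int), 'last_event_at' (timezone-aware datetime).
    """
    return [
        loop["id"]
        for loop in loops
        if loop["pending"] > 0 and now - loop["last_event_at"] > STALL_THRESHOLD
    ]


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
loops = [
    {"id": "loop-a", "pending": 3, "last_event_at": now - timedelta(minutes=25)},
    {"id": "loop-b", "pending": 2, "last_event_at": now - timedelta(minutes=2)},
    {"id": "loop-c", "pending": 0, "last_event_at": now - timedelta(hours=5)},
]
stalled = find_stalled(loops, now)
```

Each id in `stalled` would then be a candidate for `joelclaw loop diagnose <loop-id> -c --fix`. A finished loop such as `loop-c` is skipped because it has no pending stories, however old its last event is.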
Manual: The Diagnostic Session
When an agent needs to debug loops manually, follow this sequence:
# 1. Quick overview
joelclaw loop diagnose all -c
# 2. If fix needed
joelclaw loop diagnose all -c --fix
# 3. Verify fix worked (wait ~30s for plan to fire)
joelclaw loop status <loop-id> -c
# 4. If still stuck, check worker
curl -s localhost:3111/api/inngest | python3 -c "import json,sys; print(json.load(sys.stdin)['function_count'])"
# 5. Nuclear option: full restart
joelclaw loop restart <loop-id>
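Step 4 prints the raw function count, which still has to be compared against what the worker should register. A small sketch of that comparison follows. The endpoint and the `function_count` field come from the command above; the helper names and the `expected` baseline are assumptions, since the expected count is not stated here.

```python
import json
from urllib.request import urlopen

WORKER_URL = "http://localhost:3111/api/inngest"  # endpoint used in step 4


def parse_function_count(payload: str) -> int:
    """Extract function_count from the worker's JSON response body."""
    return int(json.loads(payload)["function_count"])


def worker_is_healthy(payload: str, expected: int) -> bool:
    """True when the worker registers at least `expected` functions.

    `expected` is a caller-supplied baseline (an assumption in this sketch).
    """
    return parse_function_count(payload) >= expected


def fetch_payload(url: str = WORKER_URL, timeout: float = 5.0) -> str:
    """Fetch the raw response body; only call this where the worker is reachable."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a reachable worker, `worker_is_healthy(fetch_payload(), expected=N)` gives a yes/no answer, where `N` is whatever count a known-good worker reported.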
Making Loops More Resilient
The root cause of most stalls is lost events in the judge→plan chain. Solutions being implemented:
- onFailure handlers — every function gets one, logs failure + emits diagnostic event
- Loop watchdog — periodic check for silent stalls
- Debounce on content-sync — prevents event storms that can crowd out loop events
- Singleton on backfill — prevents resource contention during loops
Cross-References
- agent-loop skill — starting loops
- loop-nanny skill — monitoring + cleanup
- joelclaw skill — full CLI reference
- ADR-0028 — reliability patterns