# Research-Driven Debugging (research-debug)

An integrated workflow for solving complex ML training problems and architecture-level bugs. It runs web research, code analysis, and iterative fixes in parallel to find and resolve root causes.
## 🎯 When to Use

### ✅ Ideal Use Cases

- Training collapse/divergence: training goes well at first, then suddenly collapses
- Gibberish generation: the model produces meaningless output
- Known research problem: the issue has likely been discussed in papers or blog posts
- Architecture-level bugs: a design flaw is suspected rather than a simple implementation bug
- Performance anomalies: unexpected performance patterns (loss spikes, reward collapse, etc.)

### ❌ Not Suitable For

- Simple bugs: fix syntax errors, typos, etc. directly
- Well-defined requirements: implement clear requests like "Add feature X" directly
- No literature: for a genuinely novel problem, the research phase is inefficient
## 📋 Workflow Phases

### Phase 1: Evidence Gathering (run in parallel)

```
# Start three tracks at the same time:
# 1. Web research for similar cases
# 2. Deep code analysis with task-planner-analyzer
# 3. Safe fixes that can begin immediately (config changes, monitoring)
```
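The three tracks above can run concurrently rather than one after another. A minimal Python sketch of the idea, where `web_research`, `code_analysis`, and `safe_fixes` are hypothetical stand-ins for the real tool calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three evidence-gathering tracks
def web_research():
    return "web: 5 similar failure reports found"

def code_analysis():
    return "code: suspect loss masking in the trainer"

def safe_fixes():
    return "fixes: rerun launched with safer config"

# Launch all three at once instead of serially, then integrate results
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f) for f in (web_research, code_analysis, safe_fixes)]
    evidence = [f.result() for f in futures]
```

In practice the "tasks" are agent and search calls, not local functions; the point is that none of them needs to wait for the others.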
#### Step 1.1: Web Research

```python
WebSearch(
    query="[problem description] [domain] [year]"
)
# Example: "GRPO policy gradient collapse vocabulary 2024 2025"
```

What to look for:

- Documented failure modes
- Fundamental design flaws
- Known workarounds
- Recent papers addressing the issue
#### Step 1.2: Code Analysis (task-planner-analyzer)

```python
Task(
    subagent_type="task-planner-analyzer",
    prompt=f"""
Analyze {problem_description} in this codebase.

**Context from web research:**
{web_search_findings}

**Files to examine:**
{list_of_relevant_files}

**Your tasks:**
1. Examine the codebase structure
2. Identify design flaws matching the literature
3. Check for known anti-patterns
4. Create a prioritized TODO list with:
   - File paths and line numbers
   - Root causes vs symptoms
   - Risk assessment
   - Dependencies and constraints
""",
)
```
#### Step 1.3: Start Safe Fixes (Optional)

```bash
# If some fixes are already known to be safe (e.g., config changes),
# start them while the analysis is running
python scripts/train.py --config fixed_config.yaml > logs/new_run.log 2>&1 &
```
### Phase 2: Root Cause Triangulation

#### Cross-Reference Literature ↔ Code

Create a mapping table:
| Literature Finding | Code Location | Match? | Impact | Priority |
|---|---|---|---|---|
| Token-level issue | line 316 | ✅ YES | HIGH | 1 |
| Entropy collapse | line 888 | ✅ YES | CRITICAL | 1 |
| Conflicting grads | multiple | ✅ YES | MEDIUM | 2 |
#### Prioritize by Impact

- CRITICAL: collapse trigger (direct cause of the observed failure)
- HIGH: fundamental flaw (will cause problems at scale)
- MEDIUM: optimization (improves stability but not essential)
- LOW: cosmetic (code quality, not behavior)
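Turning triaged findings into a fix queue is a one-liner; a tiny sketch with hypothetical example findings:

```python
# Map the impact ladder to sort keys so CRITICAL items are fixed first
PRIORITY = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

# Hypothetical findings from the triangulation table
findings = [
    ("Conflicting grads", "MEDIUM"),
    ("Entropy collapse trigger", "CRITICAL"),
    ("Token-level mismatch", "HIGH"),
]

# The fix loop in Phase 3 consumes this queue front to back
fix_queue = sorted(findings, key=lambda f: PRIORITY[f[1]])
```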
### Phase 3: Iterative Fix-and-Verify

#### Fix Loop

```
FOR each priority (CRITICAL → HIGH → MEDIUM):
    1. modular-code-architect: apply the fix
    2. code-reviewer: verify no side effects
    3. Run tests (if applicable)
    4. Monitor metrics for early warnings
    5. If the problem recurs → back to task-planner-analyzer
```
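The loop above can be sketched as ordinary Python, with hypothetical callbacks standing in for the agents (`apply_fix` for modular-code-architect, `review` for code-reviewer):

```python
def run_fix_loop(todos, apply_fix, review, tests_pass, max_iters=3):
    """Work through fixes in priority order; return the first item that
    cannot be verified (signal to go back to the planner), else None."""
    for todo in sorted(todos, key=lambda t: t["priority"]):
        for _ in range(max_iters):
            apply_fix(todo)                    # architect applies the fix
            if review(todo) and tests_pass():  # reviewer + tests verify it
                break                          # verified -> next item
        else:
            return todo                        # problem recurred
    return None

# Usage with trivial stand-ins: everything verifies on the first attempt
applied = []
unresolved = run_fix_loop(
    [{"name": "kl_gate", "priority": 0}, {"name": "length_norm", "priority": 1}],
    apply_fix=lambda t: applied.append(t["name"]),
    review=lambda t: True,
    tests_pass=lambda: True,
)
```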
#### Apply Fixes with Modular-Code-Architect

```python
Task(
    subagent_type="modular-code-architect",
    prompt=f"""
Implement a fix for {root_cause} based on this analysis:
{analysis_from_phase2}

**Constraints:**
{list_of_constraints}

**Verification criteria:**
{how_to_verify_fix_worked}

Follow modular design: minimal changes, plug-and-play.
""",
)
```
#### Verify with Code-Reviewer

```python
Task(
    subagent_type="code-reviewer",
    prompt="""
Review recent changes for:
1. Critical issues (logic errors, side effects)
2. Consistency with architecture constraints
3. Whether the fix actually addresses the root cause

Use ultrathink level.
""",
)
```
#### Verification Criteria

- No collapse for 100+ steps (or 10x the previous collapse point)
- Metrics stay within healthy ranges
- No new issues introduced (regression tests pass)
- Edge cases handled
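The first two criteria can be checked mechanically. A minimal sketch, where the thresholds and the argument shapes are assumptions rather than a fixed API:

```python
from math import log

def run_verified(steps_survived, prev_collapse_step, entropies, vocab_size):
    """True if the run survived long enough and entropy stayed healthy."""
    # No collapse for 100+ steps, or 10x the previous collapse point
    long_enough = steps_survived >= max(100, 10 * prev_collapse_step)
    # Entropy stayed well below the uniform bound log(V) (gibberish territory)
    healthy = all(h < 0.9 * log(vocab_size) for h in entropies)
    return long_enough and healthy

# Previous run collapsed at step 20 -> this run needs 200 clean steps
ok = run_verified(210, prev_collapse_step=20,
                  entropies=[2.1, 2.3, 1.9], vocab_size=32000)
```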
### Phase 4: Documentation

#### Create Analysis Document

```markdown
# File: logs/{problem}_root_cause_analysis.md

## Problem Summary
[What happened]

## Root Causes (Ranked)
1. **RC1**: [Description]
   - Evidence: [Literature + Code]
   - Fix: [What was applied]
   - Verification: [How we know it worked]

## Fixes Applied
[Detailed changelog]

## Verification Results
[Metrics before/after]

## Lessons Learned
[What to watch for next time]
```
#### Update Project Memory

Add to MEMORY.md or a similar file:
- New failure modes discovered
- Effective fixes
- Ineffective approaches (to avoid repeating)
- Monitoring metrics to add
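A minimal sketch of appending such an entry (the file name MEMORY.md follows the text above; the entry content is a hypothetical example):

```python
from datetime import date
from pathlib import Path

# Hypothetical session entry capturing the four points above
entry = f"""
## {date.today()} research-debug session
- New failure mode: KL-only steps can trigger entropy collapse
- Effective fix: advantage-aware KL gating
- Ineffective approach: raising temperature (treats the symptom, not the cause)
- Monitoring to add: entropy vs log(vocab_size) early-warning check
"""

# Append rather than overwrite, so earlier learnings survive
with Path("MEMORY.md").open("a", encoding="utf-8") as f:
    f.write(entry)
```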
## 🔍 Monitoring & Debugging

### Key Metrics to Watch

```python
# For ML training issues:
metrics_to_monitor = [
    "loss",           # should decrease steadily
    "reward",         # should be stable or improve
    "entropy",        # high = gibberish, low = mode collapse
    "gradient_norm",  # should be bounded
    "v_norm",         # for LoRA: should not hit the clamp boundary
    "kl_loss",        # for RL: should be non-zero when active
]
```
### Early Warning Signs

```python
# Add to the training loop (entropy, vocab_size, logger, etc. come from the trainer):
from math import log

if entropy > 0.9 * log(vocab_size):
    logger.warning("Entropy near the uniform bound log(V) - gibberish likely")
if all_advantages_zero and mean_reward > 0.5:
    logger.warning("Perfect accuracy but zero variance - KL-only training")
if v_norm >= max_norm * 0.95:
    logger.warning("V-vector at clamp boundary - may be fighting the constraint")
```
## 📚 Example: GRPO Collapse

### Phase 1: Evidence Gathering

- WebSearch: found 5 papers documenting GRPO instability
  - Token-level importance weights fail (DAPO)
  - Catastrophic model collapse (GSPO)
  - Entropy collapse & gibberish (OpenReview)
- Task-Planner: identified 5 root causes in the code
  - KL-only collapse trigger (CRITICAL)
  - Token-level mismatch (HIGH)
  - Task overfitting (HIGH)
### Phase 2: Triangulation

| Finding | Both Sources? | Priority |
|---|---|---|
| KL-only steps | ✅ Yes | CRITICAL |
| Token-level | ✅ Yes | HIGH |
| Task overfit | Code only | HIGH |
### Phase 3: Fix & Verify

- Fix 1: advantage-aware KL gating
  - Applied and verified; no collapse at step 20
- Fix 2: length-normalized log-probs
  - Importance ratios stable in [0.5, 2.0]
- Fix 3: reduce steps_per_task from 20 to 3
  - New tasks every 3 steps; stable for 200+ steps
### Phase 4: Documentation

- ✅ Root cause analysis written
- ✅ MEMORY.md updated
- ✅ Workflow recipe created
## ⚠️ Common Pitfalls

### 1. Fixing Symptoms Instead of Root Causes

- ❌ Bad: "Gibberish appeared, let's increase temperature"
- ✅ Good: "Gibberish = entropy collapse. What causes that? KL-only signal."

### 2. Serial Execution (Wasting Time)

- ❌ Bad: WebSearch → wait → Analyze → wait → Fix → wait
- ✅ Good: WebSearch || Analyze || Start safe fixes → integrate

### 3. Ignoring Literature

- ❌ Bad: "This is unique, no point searching"
- ✅ Good: "Let me check if anyone has seen this before"

### 4. Not Documenting Failures

- ❌ Bad: "That didn't work, let's try something else"
- ✅ Good: "Failed BECAUSE X, documented for future reference"
## 🚀 Quick Start Template

```python
# 1. Start parallel evidence gathering
WebSearch("problem_description 2024 2025")
Task(subagent_type="task-planner-analyzer", prompt="Analyze {problem}...")

# 2. Start monitoring while waiting (in a shell): tail -f logs/training.log &

# 3. Apply fixes iteratively
Task(subagent_type="modular-code-architect", prompt="Fix {root_cause}...")
Task(subagent_type="code-reviewer", prompt="Review changes...")

# 4. Document everything
Write("logs/root_cause_analysis.md", content="...")
Edit("MEMORY.md", add="New learnings...")
```
## 🎓 Success Criteria

### Process Metrics

- Time to root cause: < 2 hours (with parallel execution)
- Fix iterations: < 3 (if the root cause is correct)
- Regression rate: < 10% (good code review)

### Outcome Metrics

- Problem resolved: yes/no
- Stability duration: steps until the next issue
- Knowledge captured: documentation complete
## 📖 Related Skills

- iterative-code-review: use after fixes are applied for quality verification
- code-reviewer: standalone code quality checks
- debugger: when tests fail during the verification phase
- task-planner-analyzer: can also be used standalone for planning

**Remember**: This is a flexible workflow, not a rigid process. Adapt it to your specific problem while keeping the core principles: evidence-based, executed in parallel, root-cause focused, and well documented.