research-debug
Research-Driven Debugging (research-debug)
An integrated workflow for solving complex ML training problems and architecture-level bugs. It runs web research, code analysis, and iterative fixes in parallel to find and resolve root causes.
🎯 When to Use
✅ Ideal Use Cases
- Training collapse/divergence: training starts well, then suddenly collapses
- Gibberish generation: the model produces meaningless output
- Known research problem: the issue has likely been discussed in papers or blog posts
- Architecture-level bugs: a design flaw is suspected, not just an implementation bug
- Performance anomalies: unexpected performance patterns (loss spikes, reward collapse, etc.)
❌ Not Suitable For
- Simple bugs: fix syntax errors, typos, etc. directly
- Well-defined requirements: implement clear requests like "Add feature X" directly
- No literature: for genuinely novel problems, the research phase is inefficient
📋 Workflow Phases
Phase 1: Evidence Gathering (run in parallel)
# Start three tasks concurrently:
# 1. Web research for similar cases
# 2. Deep code analysis with task-planner-analyzer
# 3. Start safe fixes (config changes, monitoring)
Step 1.1: Web Research
WebSearch(
query="[problem description] [domain] [year]"
)
# Example: "GRPO policy gradient collapse vocabulary 2024 2025"
What to look for:
- Documented failure modes
- Fundamental design flaws
- Known workarounds
- Recent papers addressing the issue
Step 1.2: Code Analysis (Task-Planner-Analyzer)
Task(
subagent_type="task-planner-analyzer",
prompt=f"""
Analyze {problem_description} in this codebase.
**Context from web research:**
{web_search_findings}
**Files to examine:**
{list_of_relevant_files}
**Your tasks:**
1. Examine codebase structure
2. Identify design flaws matching literature
3. Check for known anti-patterns
4. Create prioritized TODO list with:
- File paths and line numbers
- Root causes vs symptoms
- Risk assessment
- Dependencies and constraints
"""
)
Step 1.3: Start Safe Fixes (Optional)
# If you already know some safe fixes (e.g., config changes), start them
# while analysis is running
python scripts/train.py --config fixed_config.yaml > logs/new_run.log 2>&1 &
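While the background run accumulates logs, a lightweight scanner can flag early failure signatures without blocking on the analysis. A minimal sketch; the patterns below are illustrative examples, not an exhaustive or project-specific list:

```python
import re

# Illustrative early-failure signatures; extend with patterns specific
# to your training stack.
WARNING_PATTERNS = [r"\bnan\b", r"loss\s*=\s*inf", r"gradient overflow"]

def scan_log(lines):
    """Return (line_number, line) pairs that match a warning pattern."""
    hits = []
    for n, line in enumerate(lines, start=1):
        for pat in WARNING_PATTERNS:
            if re.search(pat, line, flags=re.IGNORECASE):
                hits.append((n, line.strip()))
                break  # one hit per line is enough
    return hits
```

This can be run periodically against `logs/new_run.log` so a problem surfaces while Phase 1 research is still in flight.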
Phase 2: Root Cause Triangulation
Cross-Reference Literature ↔ Code
Create a mapping table:
| Literature Finding | Code Location | Match? | Impact | Priority |
|---|---|---|---|---|
| Token-level issue | line 316 | ✅ YES | HIGH | 1 |
| Entropy collapse | line 888 | ✅ YES | CRITICAL | 1 |
| Conflicting grads | multiple | ✅ YES | MEDIUM | 2 |
Prioritize by Impact
- CRITICAL: Collapse trigger (direct cause of observed failure)
- HIGH: Fundamental flaw (will cause problems at scale)
- MEDIUM: Optimization (improves stability but not essential)
- LOW: Cosmetic (code quality, not behavior)
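The mapping table and priority scheme above can be kept as structured data so the fix loop in Phase 3 consumes confirmed findings in severity order. A minimal sketch with hypothetical field values (the file:line entries are placeholders, not from a real analysis):

```python
from dataclasses import dataclass

# Severity order from the priority scheme above.
PRIORITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

@dataclass
class Finding:
    literature: str   # failure mode described in a paper/blog
    location: str     # file:line where the code matches it
    matched: bool     # does the code actually exhibit the pattern?
    priority: str     # CRITICAL / HIGH / MEDIUM / LOW

findings = [
    Finding("Entropy collapse", "trainer.py:888", True, "CRITICAL"),
    Finding("Token-level importance weights", "loss.py:316", True, "HIGH"),
    Finding("Conflicting gradients", "multiple", True, "MEDIUM"),
]

# Work only on confirmed matches, most severe first.
worklist = sorted(
    (f for f in findings if f.matched),
    key=lambda f: PRIORITY_ORDER[f.priority],
)
```

Filtering on `matched` enforces the triangulation rule: a finding enters the worklist only when literature and code agree.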
Phase 3: Iterative Fix-and-Verify
Fix Loop
FOR each priority (CRITICAL → HIGH → MEDIUM):
1. Modular-Code-Architect: Apply fix
2. Code-Reviewer: Verify no side effects
3. Run tests (if applicable)
4. Monitor metrics for early warnings
5. If problem recurs → back to Task-Planner
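The loop above can be sketched as plain control flow. `apply_fix`, `verify_fix`, and `run_tests` are hypothetical stand-ins for the agent calls and test runner, not real APIs:

```python
# Sketch of the fix-and-verify loop; items that fail verification after
# a few retries are returned for re-planning (back to task-planner).
def fix_and_verify(worklist, apply_fix, verify_fix, run_tests, max_retries=2):
    unresolved = []
    for item in worklist:
        for attempt in range(max_retries + 1):
            apply_fix(item)                       # modular-code-architect
            if verify_fix(item) and run_tests():  # code-reviewer + tests
                break
        else:
            unresolved.append(item)               # recurred -> task-planner
    return unresolved
```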
Apply Fixes with Modular-Code-Architect
Task(
subagent_type="modular-code-architect",
prompt=f"""
Implement fix for {root_cause} based on this analysis:
{analysis_from_phase2}
**Constraints:**
{list_of_constraints}
**Verification criteria:**
{how_to_verify_fix_worked}
Follow modular design: minimal changes, plug-and-play.
"""
)
Verify with Code-Reviewer
Task(
subagent_type="code-reviewer",
prompt="""
Review recent changes for:
1. Critical issues (logic errors, side effects)
2. Consistency with architecture constraints
3. Whether the fix actually addresses the root cause
Use ultrathink level.
"""
)
Verification Criteria
- No collapse for 100+ steps (or 10x previous collapse point)
- Metrics stay within healthy ranges
- No new issues introduced (regression tests pass)
- Edge cases handled
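The first two criteria can be checked mechanically from metric history. A minimal sketch, assuming entropy and loss are logged per step; the healthy range and jump threshold are illustrative, not fixed by the workflow:

```python
# Checks "no collapse for `window` steps" and "metrics in healthy ranges".
def fix_verified(entropy_history, loss_history, window=100,
                 entropy_range=(0.5, 6.0), max_loss_jump=2.0):
    if len(entropy_history) < window:
        return False  # not enough evidence yet
    recent_e = entropy_history[-window:]
    recent_l = loss_history[-window:]
    lo, hi = entropy_range
    if any(e < lo or e > hi for e in recent_e):
        return False  # entropy left the healthy band
    # reject any single-step loss spike larger than max_loss_jump
    jumps = [abs(b - a) for a, b in zip(recent_l, recent_l[1:])]
    return max(jumps, default=0.0) <= max_loss_jump
```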
Phase 4: Documentation
Create Analysis Document
# File: logs/{problem}_root_cause_analysis.md
## Problem Summary
[What happened]
## Root Causes (Ranked)
1. **RC1**: [Description]
- Evidence: [Literature + Code]
- Fix: [What was applied]
- Verification: [How we know it worked]
## Fixes Applied
[Detailed changelog]
## Verification Results
[Metrics before/after]
## Lessons Learned
[What to watch for next time]
Update Project Memory
Add to MEMORY.md or similar:
- New failure modes discovered
- Effective fixes
- Ineffective approaches (to avoid repeating)
- Monitoring metrics to add
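Appending these entries can be scripted so nothing is lost between sessions. A hypothetical helper mirroring the list above (the section names and entry format are assumptions, not a fixed MEMORY.md schema):

```python
from datetime import date
from pathlib import Path

def record_learnings(path, problem, failure_modes, fixes, dead_ends, new_metrics):
    """Append a dated, structured entry to the project memory file."""
    entry = [f"\n## {date.today().isoformat()} - {problem}"]
    for title, items in [("Failure modes", failure_modes),
                         ("Effective fixes", fixes),
                         ("Dead ends", dead_ends),
                         ("Metrics to add", new_metrics)]:
        entry.append(f"### {title}")
        entry.extend(f"- {item}" for item in items)
    # Mode "a" creates the file on first use and appends thereafter.
    with Path(path).open("a") as f:
        f.write("\n".join(entry) + "\n")
```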
🔍 Monitoring & Debugging
Key Metrics to Watch
# For ML training issues:
metrics_to_monitor = [
"loss", # Should decrease steadily
"reward", # Should be stable or improve
"entropy", # High = gibberish, Low = mode collapse
"gradient_norm", # Should be bounded
"v_norm", # For LoRA: should not hit clamp boundary
"kl_loss", # For RL: should be non-zero when active
]
Early Warning Signs
# Add to training code:
import math

if entropy > 0.9 * math.log(vocab_size):
    logger.warning("Entropy near log(vocab_size) - near-uniform distribution, gibberish likely")
if all_advantages_zero and mean_reward > 0.5:
    logger.warning("Perfect accuracy but zero advantage variance - gradient is KL-only")
if v_norm >= max_norm * 0.95:
    logger.warning("V-vector at clamp boundary - may be fighting the constraint")
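The checks above can be collected into one self-contained function that returns the triggered warnings instead of logging them. Same illustrative thresholds; the metric dictionary keys are assumptions about what the training loop exposes:

```python
import math

def early_warnings(metrics, vocab_size, max_norm):
    """Return a list of warning strings for any triggered early-warning sign."""
    warnings = []
    # Entropy near log(vocab_size) means a near-uniform next-token
    # distribution, i.e. gibberish output is likely.
    if metrics["entropy"] > 0.9 * math.log(vocab_size):
        warnings.append("near-uniform entropy - gibberish likely")
    if metrics["advantage_std"] == 0.0 and metrics["mean_reward"] > 0.5:
        warnings.append("zero advantage variance - gradient is KL-only")
    if metrics["v_norm"] >= 0.95 * max_norm:
        warnings.append("v-vector at clamp boundary")
    return warnings
```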
📚 Example: GRPO Collapse
Phase 1: Evidence Gathering
WebSearch: Found 5 papers documenting GRPO instability
- Token-level importance weight fails (DAPO)
- Catastrophic model collapse (GSPO)
- Entropy collapse & gibberish (OpenReview)
Task-Planner: Identified 5 root causes in code
- KL-only collapse trigger (CRITICAL)
- Token-level mismatch (HIGH)
- Task overfitting (HIGH)
Phase 2: Triangulation
| Finding | Both Sources? | Priority |
|---|---|---|
| KL-only steps | ✅ Yes | CRITICAL |
| Token-level | ✅ Yes | HIGH |
| Task overfit | Code only | HIGH |
Phase 3: Fix & Verify
- Fix 1: Advantage-aware KL gating
- Applied, verified, no collapse at step 20
- Fix 2: Length-normalized log-probs
- Importance ratios stable [0.5, 2.0]
- Fix 3: Reduce steps_per_task 20→3
- New tasks every 3 steps, stable 200+ steps
Phase 4: Documentation
- ✅ Root cause analysis written
- ✅ MEMORY.md updated
- ✅ Workflow recipe created
⚠️ Common Pitfalls
1. Fixing Symptoms Instead of Root Causes
❌ Bad: "Gibberish appeared, let's increase temperature"
✅ Good: "Gibberish = entropy near uniform. What causes that? KL-only signal."
2. Serial Execution (Wasting Time)
❌ Bad: WebSearch → wait → Analyze → wait → Fix → wait
✅ Good: WebSearch || Analyze || Start Safe Fixes → Integrate
3. Ignoring Literature
❌ Bad: "This is unique, no point searching"
✅ Good: "Let me check if anyone has seen this before"
4. Not Documenting Failures
❌ Bad: "That didn't work, let's try something else"
✅ Good: "Failed BECAUSE X, documented for future reference"
🚀 Quick Start Template
# 1. Start parallel evidence gathering
WebSearch("problem_description 2024 2025")
Task(subagent_type="task-planner-analyzer", prompt="Analyze {problem}...")
# 2. Start monitoring while waiting
tail -f logs/training.log &
# 3. Apply fixes iteratively
Task(subagent_type="modular-code-architect", prompt="Fix {root_cause}...")
Task(subagent_type="code-reviewer", prompt="Review changes...")
# 4. Document everything
Write("logs/root_cause_analysis.md", content="...")
Edit("MEMORY.md", add="New learnings...")
🎓 Success Criteria
Process Metrics
- Time to root cause: < 2 hours (with parallel execution)
- Fix iterations: < 3 (if root cause correct)
- Regression rate: < 10% (good code review)
Outcome Metrics
- Problem resolved: Yes/No
- Stability duration: Steps until next issue
- Knowledge captured: Documentation complete
📖 Related Skills
- iterative-code-review: Use after fixes applied for quality verification
- code-reviewer: Standalone code quality checks
- debugger: When tests fail during verification phase
- task-planner-analyzer: Can be used standalone for planning
Remember: This is a flexible workflow, not a rigid process. Adapt to your specific problem while maintaining the core principles: evidence-based, parallel execution, root-cause focused, and well-documented.