experiment-bridge
Workflow 1.5: Experiment Bridge
Implement and deploy experiments from plan: $ARGUMENTS
Overview
This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.
Workflow 1 output:                   This skill:                        Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md       → implement code                   initial results ready
refine-logs/EXPERIMENT_TRACKER.md    → GPT-5.4 review (cross-model)     for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md        → deploy (/run-experiment)
                                     → collect
Constants
- `CODE_REVIEW = true` — GPT-5.4 xhigh reviews experiment code before deployment. Catches logic bugs before wasting GPU hours. Set `false` to skip.
- `AUTO_DEPLOY = true` — Automatically deploy experiments after implementation + review. Set `false` to manually inspect code before deploying.
- `SANITY_FIRST = true` — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
- `MAX_PARALLEL_RUNS = 4` — Maximum number of experiments to deploy in parallel (limited by available GPUs).
- `BASE_REPO = false` — GitHub repo URL to use as the base codebase. When set, clone the repo first and implement experiments on top of it. When `false` (default), write code from scratch or reuse existing project files.
- `COMPACT = false` — When `true`, (1) read `idea-stage/IDEA_CANDIDATES.md` instead of the full `idea-stage/IDEA_REPORT.md` if available, (2) append experiment results to `EXPERIMENT_LOG.md` after collection.
Override:
/experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project
Inputs
This skill expects one or more of:
- `refine-logs/EXPERIMENT_PLAN.md` (best) — claim-driven experiment roadmap from `/experiment-plan`
- `refine-logs/EXPERIMENT_TRACKER.md` — run-by-run execution table
- `refine-logs/FINAL_PROPOSAL.md` — method description for implementation context
- `idea-stage/IDEA_CANDIDATES.md` — compact idea summary (preferred when `COMPACT: true`; fall back to `./IDEA_CANDIDATES.md` if not found)
- `idea-stage/IDEA_REPORT.md` — full brainstorm output (fall back to `./IDEA_REPORT.md` if not found)
If none exist, ask the user what experiments to implement.
Workflow
Phase 1: Parse the Experiment Plan
Read EXPERIMENT_PLAN.md and extract:
- Run order and milestones — which experiments run first (sanity → baseline → main → ablation → polish)
- For each experiment block:
- Dataset / split / task
- Compared systems and variants
- Metrics to compute
- Setup details (backbone, hyperparameters, seeds)
- Success criterion
- Priority (MUST-RUN vs NICE-TO-HAVE)
- Compute budget — total estimated GPU-hours
- Method details from `FINAL_PROPOSAL.md` — what exactly to implement
Present a brief summary:
📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]
Proceeding to implementation.
Phase 2: Implement Experiment Code
If BASE_REPO is set — clone the repo first:
git clone <BASE_REPO> base_repo/
# Read the repo's README, understand its structure, find entry points
# Implement experiments by modifying/extending this codebase
For each milestone (in order), write the experiment scripts:
- Check existing code — scan the project (or cloned `base_repo/`) for existing experiment scripts, model code, and data loaders. Reuse as much as possible.
- Implement missing pieces (a minimal script sketch follows this list):
- Training scripts with proper argparse (all hyperparameters configurable)
- Evaluation scripts computing the specified metrics
- Data loading / preprocessing if needed
- Baseline implementations if not already present
- Fixed random seeds for reproducibility
- Results saved to JSON/CSV for later analysis
- Proper logging (wandb if configured in CLAUDE.md)
- Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
- Self-review before deploying:
- Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
- Is the random seed fixed and controllable?
- Are results saved in a parseable format (JSON/CSV)?
- Does the code match FINAL_PROPOSAL.md's method description?
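As a concrete reference for the checklist above, here is a minimal sketch of the expected script shape. Flag names, the `key_metric` placeholder, and the output path are illustrative assumptions, not taken from the plan:

```python
import argparse
import json
import pathlib
import random

import numpy as np


def main():
    parser = argparse.ArgumentParser()
    # Every hyperparameter named in EXPERIMENT_PLAN.md should appear as a flag
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--seed", type=int, default=42)  # fixed and controllable
    parser.add_argument("--out", default="results/run.json")
    args = parser.parse_args()

    # Seed everything before data loading or model init
    random.seed(args.seed)
    np.random.seed(args.seed)
    # torch.manual_seed(args.seed)  # if using PyTorch

    # ... training and evaluation would run here ...
    metrics = {
        "key_metric": 0.0,  # placeholder: replace with the plan's metric
        "seed": args.seed,
        "config": {"lr": args.lr, "batch_size": args.batch_size},
    }

    # Save results in a parseable format for later analysis
    out = pathlib.Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```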
Phase 2.5: Cross-Model Code Review (when CODE_REVIEW = true)
Skip this step if CODE_REVIEW is false.
Before deploying, send the experiment code to GPT-5.4 xhigh for review:
mcp__codex__codex:
config: {"model_reasoning_effort": "xhigh"}
prompt: |
Review the following experiment implementation for correctness.
## Experiment Plan:
[paste key sections from EXPERIMENT_PLAN.md]
## Method Description:
[paste from FINAL_PROPOSAL.md]
## Implementation:
[paste the experiment scripts]
Check for:
1. Does the code correctly implement the method described in the proposal?
2. Are all hyperparameters from the plan reflected in the code?
3. Are there any logic bugs (wrong loss function, incorrect data split, missing eval)?
4. Is the evaluation metric computed correctly?
5. **CRITICAL: Does evaluation use the dataset's actual ground truth labels — NOT another model's output as ground truth?** This is a common and severe bug.
6. Any potential issues (OOM risk, numerical instability, missing seeds)?
For each issue found, specify: CRITICAL / MAJOR / MINOR and the exact fix.
On review results:
- No CRITICAL issues → proceed to Phase 3
- CRITICAL issues found → fix them, then re-submit for review (max 2 rounds)
- Codex MCP unavailable → skip silently, proceed to Phase 3 (graceful degradation)
Phase 3: Sanity Check (if SANITY_FIRST = true)
Before deploying the full experiment suite, run the sanity-stage experiment:
/run-experiment [sanity experiment command]
Wait for completion. Verify:
- Training loop runs without errors
- Metrics are computed and saved correctly
- GPU memory usage is within bounds
- Output format matches expectations
If sanity fails → auto-debug before giving up (max 3 attempts):
- Read the error — parse traceback, stderr, and log files
- Diagnose — classify the failure (a pattern-matching sketch follows this phase):
- OOM → reduce batch size or enable gradient checkpointing
- ImportError → install missing package
- FileNotFoundError → fix path or download data
- CUDA error → check GPU availability, reduce model size
- NaN/divergence → reduce learning rate, check data preprocessing
- Fix and re-run — apply the fix, re-run sanity
- Attempt 2+ still failing? → Call in Codex rescue (if the Codex plugin is installed): before the next retry, invoke `/codex:rescue` to get a second opinion on the root cause. Codex independently reads the code and error logs — it may spot issues Claude missed (wrong tensor shapes, subtle import shadowing, config mismatches, etc.). Apply its suggested fix, then re-run.
  - If `/codex:rescue` is not available (plugin not installed), continue with Claude's own diagnosis
- Still failing after 3 attempts? → stop, report the failure with all attempted fixes and error logs. Do not proceed with broken code.
Never give up on the first failure. Most experiment crashes are fixable without human intervention.
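A hypothetical sketch of the diagnose step as pattern matching over stderr; the patterns and fixes mirror the classification list above, and a real run would inspect full tracebacks:

```python
import re

# Pattern -> suggested fix, mirroring the failure classes listed above
FIXES = [
    (r"CUDA out of memory|OOM", "reduce batch size or enable gradient checkpointing"),
    (r"ModuleNotFoundError|ImportError", "install the missing package"),
    (r"FileNotFoundError", "fix the path or download the data"),
    (r"CUDA error", "check GPU availability, reduce model size"),
    (r"\bnan\b|diverge", "reduce learning rate, check data preprocessing"),
]


def diagnose(stderr: str) -> str:
    for pattern, fix in FIXES:
        if re.search(pattern, stderr, re.IGNORECASE):
            return fix
    return "unclassified: read the full traceback"
```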
Phase 4: Deploy Full Experiments
Deploy experiments following the plan's milestone order. Route by job count:
Small batch (≤5 jobs per milestone) → use /run-experiment directly:
/run-experiment [experiment commands]
Large batch (≥10 jobs, multi-seed sweeps, or phase dependencies) → use /experiment-queue for proper orchestration:
/experiment-queue [grid spec or manifest]
Auto-routing rule: if any milestone in EXPERIMENT_PLAN.md declares ≥10 jobs (e.g., seeds: [42, 200, 201, ...] × N: [64, 128, 256] × n: [50K, 150K, 500K, 652K] = 36 jobs) or declares teacher→student phase dependencies, route that milestone to /experiment-queue. Otherwise use /run-experiment.
/experiment-queue adds: OOM-aware retry with backoff, stale-screen cleanup, wave-transition race prevention, phase dependency enforcement, crash-safe state persistence in queue_state.json. See skills/experiment-queue/SKILL.md for the manifest YAML format.
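For concreteness, the routing rule reduces to a job count over the milestone's grid. A sketch using the example grid above (the seeds elided by "..." in the plan example are omitted here as well):

```python
from math import prod

# Example grid taken from the auto-routing rule above
grid = {
    "seed": [42, 200, 201],
    "N": [64, 128, 256],
    "n": ["50K", "150K", "500K", "652K"],
}
n_jobs = prod(len(v) for v in grid.values())  # 3 * 3 * 4 = 36

route = "/experiment-queue" if n_jobs >= 10 else "/run-experiment"
print(route, n_jobs)
```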
For each milestone:
- Deploy experiments in parallel (up to `MAX_PARALLEL_RUNS` for `/run-experiment`, or `max_parallel` from the manifest for `/experiment-queue`)
- Use `/monitor-experiment` to track progress (reads from `queue_state.json` if `/experiment-queue` is active)
- Collect results as experiments complete
🚦 Checkpoint (if AUTO_DEPLOY = false):
🔧 Code implementation complete. Ready to deploy:
Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]
Total estimated: ~X GPU-hours on [N] GPUs
Deploy now? Or review the code first?
Phase 5: Collect Initial Results
As experiments complete:
- Parse output files (JSON/CSV/logs) for key metrics (a parsing sketch follows the summary template below)
- Training quality check — if W&B data is available (CLAUDE.md has `wandb: true` and `wandb_project`), invoke `/training-check` to detect NaN, loss divergence, plateaus, or overfitting. If W&B is not configured, skip silently.
- Update `refine-logs/EXPERIMENT_TRACKER.md` — fill in the Status and Notes columns
- Check success criteria from EXPERIMENT_PLAN.md — did each experiment meet its bar?
- Write initial results summary:
# Initial Experiment Results
**Date**: [today]
**Plan**: refine-logs/EXPERIMENT_PLAN.md
## Results by Milestone
### M0: Sanity — PASSED
- [result]
### M1: Baselines
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R001 | baseline_1 | X.XX | DONE |
### M2: Main Method
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R003 | our_method | X.XX | DONE |
### M3: Ablations
...
## Summary
- [X/Y] must-run experiments completed
- Main result: [positive/negative/inconclusive]
- Ready for /auto-review-loop: [YES/NO]
## Next Step
→ /auto-review-loop "[topic]"
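A minimal sketch of the parsing step, assuming each run wrote a JSON file like the Phase 2 skeleton's output; the `results/` path and field names are hypothetical:

```python
import json
import pathlib

# Collect per-run metrics from JSON outputs into markdown table rows
for f in sorted(pathlib.Path("results").glob("*.json")):
    r = json.loads(f.read_text())
    run_id = f.stem
    system = r.get("config", {}).get("system", "?")   # hypothetical field
    metric = r.get("key_metric", float("nan"))
    print(f"| {run_id} | {system} | {metric:.2f} | DONE |")
```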
Phase 5.5: Write Compact Log (when COMPACT = true)
Skip entirely if COMPACT is false.
Append each completed experiment to EXPERIMENT_LOG.md:
## [Run ID] — [timestamp]
- **System**: [method name]
- **Config**: [key hyperparameters]
- **Result**: [primary metric = X.XX]
- **Verdict**: [positive / negative / inconclusive]
- **Reproduce**: `python train.py --config configs/run_id.yaml --seed 42`
This structured log survives session recovery — downstream skills read it instead of parsing screen output.
Phase 5.6: Auto Ablation Planning
After main experiments (M2) complete with positive results, invoke /ablation-planner to design ablation studies:
- Read the main results and method description
- Generate a claim-driven ablation plan: which components to remove, what to compare, expected outcomes
- Append ablation blocks to `refine-logs/EXPERIMENT_PLAN.md` and `refine-logs/EXPERIMENT_TRACKER.md`
- If main results are negative or inconclusive, skip ablation planning and note it in the summary
If /ablation-planner is not available, skip silently — the existing EXPERIMENT_PLAN.md ablation blocks (if any) remain unchanged.
Phase 6: Handoff
Present final status:
🔬 Experiment bridge complete:
- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]
Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md
Ready for Workflow 2:
→ /auto-review-loop "[topic]"
Output Protocols
Follow these shared protocols for all output files:
- Output Versioning Protocol — write the timestamped file first, then copy it to the fixed name (sketched after this list)
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
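A minimal sketch of the versioning step as described above, assuming the results file from Phase 5; the timestamp format and paths are illustrative:

```python
import pathlib
import shutil
import time

report_text = "# Initial Experiment Results\n..."  # produced in Phase 5

# Timestamped copy first, so reruns never overwrite earlier results
ts = time.strftime("%Y%m%d-%H%M%S")
versioned = pathlib.Path(f"refine-logs/EXPERIMENT_RESULTS_{ts}.md")
versioned.parent.mkdir(parents=True, exist_ok=True)
versioned.write_text(report_text)

# Fixed name second, for downstream skills to read
shutil.copy(versioned, "refine-logs/EXPERIMENT_RESULTS.md")
```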
Key Rules
- CRITICAL — Evaluation must use dataset ground truth. When writing evaluation scripts, ALWAYS compare model predictions against the dataset's actual ground truth labels/targets — NEVER use another model's output as ground truth. Double-check: (1) ground truth comes from the dataset split, not from a baseline/backbone model, (2) evaluation metrics are computed against the same ground truth for all methods, (3) if the task has official eval scripts, use those. (A minimal illustration follows this list.)
- Follow the plan. Do not invent experiments not in EXPERIMENT_PLAN.md. If you think something is missing, note it but don't add it.
- Sanity first. Never deploy a full suite without verifying the sanity stage passes.
- Reuse existing code. Scan the project before writing new scripts. Extend, don't duplicate.
- Save everything as JSON/CSV. The auto-review-loop needs parseable results, not just terminal output.
- Update the tracker. `EXPERIMENT_TRACKER.md` should reflect real status after each run completes.
- Don't wait forever. If an experiment exceeds 2x its estimated time, flag it and move on to the next milestone.
- Budget awareness. Track GPU-hours against the plan's budget. Warn if approaching the limit.
- Vast.ai lifecycle. If using vast.ai instances, destroy them after all experiments complete and results are downloaded. Running instances cost money every second — don't leave them idle. Use `/vast-gpu destroy` or `/vast-gpu destroy-all` when done.
- Modal lifecycle. If using `gpu: modal`, no cleanup is needed — Modal auto-scales to zero after each run. But always show cost estimates before running and verify the spending limit is set at https://modal.com/settings (NEVER through the CLI).
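Illustration of the critical ground-truth rule, with a hypothetical dataset layout and `evaluate` helper:

```python
import json


def evaluate(pred_file: str, test_set: list[dict]) -> float:
    preds = json.load(open(pred_file))  # {"example_id": predicted_label}
    correct = 0
    for ex in test_set:
        gold = ex["label"]  # RIGHT: label comes from the dataset split itself
        # WRONG: gold = baseline_model(ex["input"])  # another model's output as "truth"
        correct += int(preds[ex["id"]] == gold)
    return correct / len(test_set)
```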
Composing with Other Skills
/idea-discovery "direction" ← Workflow 1: find + refine + plan
/experiment-bridge ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic" ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md" ← Workflow 3: write the paper
Or use /research-pipeline for the full end-to-end flow (includes this bridge).