result-diagnosis
Result Diagnosis
Diagnose what an experiment result means for the project. This skill is for decision-making after results exist, especially when they are negative, surprising, unstable, or hard to interpret.
Use this skill when:
- a method does not improve over baseline
- results vary strongly across seeds
- a metric improves but another metric worsens
- a baseline unexpectedly wins
- a plot or table looks suspicious
- a result may be caused by an implementation bug, metric bug, data issue, or unfair comparison
- early experiments suggest revising the algorithm or paper claim
- the user asks "what does this result mean?" or "what should we do next?"
Do not use this skill to write a polished report. Pair it with experiment-report-writer after the diagnosis is clear.
Pair this skill with:
- research-project-memory: when the diagnosis should update claims, evidence, risks, actions, or worktree status
- experiment-report-writer: when results need a shareable report
- algorithm-design-planner: when the diagnosis points to method revision
- experiment-design-planner: when the diagnosis requires a new controlled experiment
- run-experiment: when the next step is a rerun, sanity check, or ablation
- conference-writing-adapter: when the right action is to narrow or reframe paper claims
Skill Directory Layout
<installed-skill-dir>/
├── SKILL.md
└── references/
├── diagnosis-taxonomy.md
├── evidence-audit.md
├── next-decision-rules.md
├── report-template.md
└── triage-protocol.md
Progressive Loading
- Always read references/diagnosis-taxonomy.md, references/triage-protocol.md, and references/next-decision-rules.md.
- Read references/evidence-audit.md when inspecting logs, configs, metrics, plots, runs, or code state.
- Use references/report-template.md for full diagnosis reports.
- If a result depends on current SOTA, benchmark conventions, or recent baseline performance, verify current sources with web search or user-provided papers.
Core Principles
- Diagnose before optimizing.
- Separate observed result from interpretation.
- Prefer simple sanity checks before expensive reruns.
- Treat negative results as information: they may kill a claim, not the whole project.
- Do not blame the algorithm before checking implementation, data, metric, baseline, and selection rules.
- Do not keep blaming the implementation once repeated, controlled evidence falsifies the claim.
- Every diagnosis should end with a decision: debug, rerun, ablate, revise method, narrow claim, write, park, or kill.
- Record uncertainty explicitly.
Step 1 - Define the Result and Expected Behavior
Extract:
- experiment question and linked claim
- method and baseline
- dataset/split
- metrics and expected direction
- observed result
- number of seeds/repeats
- configs, commit, logs, tables, and figures
- what result was expected and why
- whether this result affects paper claims or only internal debugging
Rewrite vague input into:
Expected [method] to improve [metric/diagnostic] over [baseline] on [setting], but observed [result] under [controls].
If expected behavior was never defined, route back to experiment-design-planner.
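A structured snapshot keeps the observed result separate from its interpretation. The sketch below is a minimal, hypothetical Python representation of the fields listed above; the class and field names are illustrative, not part of any existing tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSnapshot:
    """Minimal record of one experiment result, kept free of interpretation."""
    claim: str                # paper or project claim the experiment targets
    method: str               # method under test
    baseline: str             # comparison baseline
    dataset_split: str        # dataset and split, e.g. "cifar10/test"
    metric: str               # metric name
    expected_direction: str   # "higher" or "lower" is better
    expected_behavior: str    # what was predicted and why
    observed: dict = field(default_factory=dict)   # e.g. {"method": 0.71, "baseline": 0.70}
    seeds: list = field(default_factory=list)      # seeds actually run
    commit: str = ""                               # git commit of the run
    artifacts: list = field(default_factory=list)  # configs, logs, tables, figures


# Hypothetical usage: fill only what is known; missing fields are visible gaps.
snapshot = ResultSnapshot(
    claim="C1: the new regularizer improves generalization",
    method="method+reg",
    baseline="method",
    dataset_split="cifar10/test",
    metric="accuracy",
    expected_direction="higher",
    expected_behavior="regularizer should reduce overfitting at small data scale",
)
```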
Step 2 - Classify the Symptom
Read references/diagnosis-taxonomy.md.
Classify the primary symptom:
- no improvement
- regression
- instability or high variance
- metric conflict
- suspiciously large gain
- baseline unexpectedly strong
- diagnostic/performance mismatch
- training failure or divergence
- reproducibility failure
- plot/table inconsistency
- result contradicts paper story
Then classify likely diagnosis categories:
- implementation bug
- metric/evaluation bug
- data/split/preprocessing issue
- unfair baseline or tuning issue
- seed variance or insufficient repeats
- optimization/hyperparameter issue
- method mechanism failure
- scale/regime mismatch
- claim/evidence mismatch
- expected negative result
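One way to keep the classification actionable is to map each symptom to the diagnosis categories it most often hides, so triage starts with the likeliest causes. The mapping below is a rough illustrative Python sketch with assumed priors, not an exhaustive or authoritative table.

```python
# Illustrative priors: which diagnosis categories to check first for a given symptom.
SYMPTOM_TO_CANDIDATES = {
    "no_improvement": ["implementation_bug", "optimization_issue", "mechanism_failure"],
    "regression": ["implementation_bug", "data_issue", "unfair_baseline"],
    "instability": ["seed_variance", "optimization_issue"],
    "metric_conflict": ["metric_bug", "claim_evidence_mismatch"],
    "suspicious_gain": ["metric_bug", "data_issue", "unfair_baseline"],
    "baseline_strong": ["unfair_baseline", "scale_regime_mismatch", "expected_negative"],
    "training_failure": ["implementation_bug", "optimization_issue"],
    "reproducibility_failure": ["implementation_bug", "data_issue", "seed_variance"],
}

candidates = SYMPTOM_TO_CANDIDATES.get("suspicious_gain", [])
```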
Step 3 - Gather Evidence
Read references/evidence-audit.md.
Prefer primary artifacts:
- config diffs
- run commands
- git commit
- logs and stderr
- metric files
- checkpoints
- seeds
- dataset versions and split hashes
- plots and tables
- previous baseline runs
- implementation changes
Mark missing evidence rather than guessing.
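When auditing primary artifacts, a small script can collect provenance in one place so that missing evidence is recorded rather than guessed. This is a minimal sketch assuming runs live in plain directories containing config.yaml, metrics.json, and split.json; all paths and field names are assumptions for illustration.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash an artifact (e.g. a split file) so two runs can be compared exactly."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def provenance(run_dir: str) -> dict:
    """Gather commit, config, metrics, and split hash; list anything missing."""
    run = Path(run_dir)
    record = {
        "commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "config": None,
        "metrics": None,
        "split_hash": None,
        "missing": [],
    }
    config = run / "config.yaml"
    if config.exists():
        record["config"] = config.read_text()
    else:
        record["missing"].append("config.yaml")  # mark missing evidence explicitly
    metrics = run / "metrics.json"
    if metrics.exists():
        record["metrics"] = json.loads(metrics.read_text())
    else:
        record["missing"].append("metrics.json")
    split = run / "split.json"
    if split.exists():
        record["split_hash"] = file_sha256(split)
    else:
        record["missing"].append("split.json")
    return record
```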
Step 4 - Run Triage
Read references/triage-protocol.md.
Use this order:
- Reproducibility and provenance: correct commit, config, data, seed, output path.
- Metric and evaluation: metric direction, aggregation, split, leakage, postprocessing.
- Baseline fairness: same budget, tuning, checkpoint rule, data, sampler, and code path.
- Implementation sanity: feature flag, tensor shapes, gradient flow, loss scale, train/eval mode.
- Statistical stability: seeds, variance, confidence intervals, outliers.
- Mechanism diagnostic: whether the intended mechanism changed.
- Claim alignment: whether the result supports, weakens, or falsifies the paper claim.
Stop early only when a blocking bug or invalid comparison is found.
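For the statistical-stability check, a quick look at per-seed scores often settles whether a gap is real before any expensive rerun. The sketch below is a minimal bootstrap comparison, assuming paired per-seed scores are available; the numbers in the usage example are made up, and this is illustrative rather than a substitute for the project's own statistics.

```python
import random
import statistics


def bootstrap_gap(method_scores, baseline_scores, n_boot=10_000, seed=0):
    """Mean gap between paired per-seed scores with a 95% bootstrap CI."""
    assert len(method_scores) == len(baseline_scores)
    gaps = [m - b for m, b in zip(method_scores, baseline_scores)]
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(gaps, k=len(gaps))) for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot)]
    return statistics.mean(gaps), (lo, hi)


# Example with 4 seeds each: a CI that straddles 0 points to "rerun", not "write".
mean_gap, (lo, hi) = bootstrap_gap(
    [0.712, 0.698, 0.731, 0.705], [0.704, 0.701, 0.699, 0.710]
)
print(f"mean gap {mean_gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```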
Step 5 - Build Competing Explanations
For each plausible explanation, state:
- evidence for it
- evidence against it
- cheapest test that would distinguish it
- decision if true
At minimum consider:
- bug
- bad metric
- weak experiment design
- baseline too strong or under-tuned
- hyperparameter issue
- mechanism false
- claim too broad
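Competing explanations are easier to compare when each one carries the same four fields. A minimal, hypothetical Python structure:

```python
from dataclasses import dataclass


@dataclass
class Explanation:
    hypothesis: str        # e.g. "metric aggregated over the wrong axis"
    evidence_for: list
    evidence_against: list
    cheapest_test: str     # single cheapest check that would distinguish it
    decision_if_true: str  # debug / rerun / ablate / revise-method / narrow-claim / ...


explanations = [
    Explanation(
        hypothesis="metric aggregated over the wrong axis",
        evidence_for=["per-class numbers do not average to the reported mean"],
        evidence_against=[],
        cheapest_test="recompute the metric from raw predictions",
        decision_if_true="debug",
    ),
]
```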
Step 6 - Choose Next Decision
Read references/next-decision-rules.md.
Choose one primary decision:
- debug: result is not trustworthy until a bug or provenance issue is resolved
- rerun: result is plausible but underpowered or missing controls
- ablate: result needs mechanism isolation
- revise-method: mechanism likely needs design change
- narrow-claim: evidence supports a smaller or different claim
- write: evidence is trustworthy enough to report
- park: result is inconclusive and not worth immediate compute
- kill: claim or direction is falsified under fair controls
Do not pick write if basic provenance or fairness is unresolved.
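The gating rule can be stated as a tiny function: trust gates come first, then statistical power, then mechanism. The sketch below is an illustrative ordering with assumed finding names, not a replacement for references/next-decision-rules.md.

```python
def primary_decision(findings: dict) -> str:
    """Map boolean triage findings to one primary decision; trust gates come first."""
    if (
        findings.get("provenance_issue")
        or findings.get("suspected_bug")
        or findings.get("unfair_baseline")
    ):
        return "debug"  # never 'write' past an unresolved trust or fairness gate
    if findings.get("underpowered") or findings.get("missing_controls"):
        return "rerun"
    if findings.get("mechanism_unclear"):
        return "ablate"
    if findings.get("mechanism_needs_redesign"):
        return "revise-method"
    if findings.get("falsified_under_fair_controls"):
        return "kill"
    if findings.get("supports_smaller_claim"):
        return "narrow-claim"
    if findings.get("evidence_trustworthy"):
        return "write"
    return "park"
```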
Step 7 - Write the Diagnosis
Use references/report-template.md for full reports.
If saving to a project and no path is given, use:
docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md
Required output:
# Result Diagnosis: [Short Name]
## Result Snapshot
## Expected vs Observed
## Symptom Classification
## Evidence Checked
## Competing Explanations
## Most Likely Diagnosis
## Decision
## Next Checks or Actions
## Claim Impact
## Project Memory Writeback
Step 8 - Write Back to Project Memory
If the project uses research-project-memory, update:
- memory/evidence-board.md: observed result, limitations, and source paths
- memory/provenance-board.md: mark result provenance verified, stale, contradictory, or missing when the diagnosis depends on source validity
- memory/claim-board.md: claims supported, weakened, revised, evidence-needed, provisional, parked, or cut
- memory/risk-board.md: bugs, metric risks, baseline risks, mechanism risks, or claim risks
- memory/action-board.md: debug, rerun, ablation, method revision, writing, park, or kill actions
- memory/handoff-board.md: create handoffs to method design, experiment design, paper evidence, or writing when the diagnosis changes downstream work
- memory/phase-dashboard.md: update the active gate when the diagnosis advances evidence production or regresses the project to debugging, method revision, or claim narrowing
- memory/decision-log.md: durable decisions such as killing a claim, changing method, or narrowing scope
- .agent/worktree-status.md (worktree): latest result and exit condition if a branch/worktree is involved
Use observed for verified results and inferred for explanations. Mark stale claims explicitly.
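A minimal sketch of the writeback, assuming the research-project-memory layout above and markdown boards that accept appended entries; the helper name, labels, and example text are illustrative assumptions.

```python
from datetime import date
from pathlib import Path


def append_entry(board: str, text: str, kind: str = "observed") -> None:
    """Append a dated entry labeled 'observed' or 'inferred' to a memory board."""
    entry = f"- [{kind}] {date.today().isoformat()}: {text}\n"
    path = Path(board)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(entry)


append_entry(
    "memory/evidence-board.md",
    "no gain over baseline under current seeds; limitations and source paths in the diagnosis report",
)
```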
Final Sanity Check
Before finalizing:
- observed result and interpretation are separated
- provenance and config are checked or listed as missing
- metric direction and aggregation are clear
- baseline fairness is addressed
- implementation sanity checks are considered
- seed variance and repeats are considered
- mechanism diagnostic is checked when relevant
- result is mapped to a concrete decision
- paper claim impact is explicit
- project memory is updated when present