results-analysis
Results Analysis
Run strict, evidence-first experimental analysis for ML/AI research.
Use this skill to produce a strict analysis bundle:
- analysis-report.md
- stats-appendix.md
- figure-catalog.md
- figures/
Do not use this skill to draft a paper Results section or a full experiment wrap-up report. Those belong to ml-paper-writing or results-report.
Core contract
This skill is responsible for
- validating experiment artifacts and comparison units,
- running rigorous descriptive and inferential statistics,
- generating real scientific figures when data/logs are available,
- writing figure purposes, caption requirements, and interpretation checklists,
- surfacing limits, blockers, and missing evidence explicitly.
This skill is not responsible for
- paper-ready Results prose,
- manuscript narrative polishing,
- paper-ready figure/table packaging with pubfig/pubtab,
- project-level experiment retrospectives.
If the user wants the complete post-experiment summary report, hand off to results-report after this bundle is ready. If the user wants publication-grade figures/tables, export parameters, publication QA, or figure/table redesign, hand off to publication-chart-skill.
Non-negotiable quality bar
- Prefer real figures over figure specs. If the data can be read, generate real figures. Do not stop at “recommended visualization”.
- Never fabricate statistics. If sample size, seeds, or raw metrics are missing, state the blocker clearly.
- Report complete statistics. Do not report only best scores or only p-values.
- Interpret every main figure. Every major figure must have purpose, caption requirements, and post-figure interpretation notes.
- Separate evidence from prose. This skill produces analysis artifacts; it does not write manuscript sections.
Standard workflow
1. Inventory and validate artifacts
Start by identifying:
- metric tables (csv, json, tsv, logs),
- training curves and checkpoints,
- seeds / repeated runs,
- baselines, ablations, and comparison families,
- evaluation protocol metadata.
Validate:
- metric direction (higher/lower is better),
- unit of analysis (run, subject, fold, dataset, seed),
- number of runs / seeds,
- missing values or silent failures,
- comparability across methods.
If the comparison is not statistically valid, say so before continuing.
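When the artifacts are tabular, much of this inventory and validation pass can be scripted. Below is a minimal sketch, assuming per-run metrics sit in one CSV with method, seed, and metric columns; the file path and column names are hypothetical, not part of this skill's contract.

```python
import pandas as pd

# Hypothetical layout: one row per completed run.
runs = pd.read_csv("results/metrics.csv")  # columns assumed: method, seed, accuracy

HIGHER_IS_BETTER = True  # confirm the metric direction before ranking anything

# Unit of analysis: one (method, seed) pair; duplicates usually mean a logging bug.
dupes = runs.duplicated(subset=["method", "seed"], keep=False)
if dupes.any():
    print("Duplicate (method, seed) rows:\n", runs[dupes])

# Seed counts per method: uneven or tiny counts block inferential claims later.
print(runs.groupby("method")["seed"].nunique())

# Missing values are silent failures until someone looks for them.
print(runs.isna().sum())
```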
2. Lock the comparison questions
Before running statistics, define the exact comparison questions:
- Which method is compared to which baseline?
- What is the primary metric?
- What is the repeated-measure unit?
- Which ablation or robustness questions matter?
- Which findings are decision-changing?
Do not mix unrelated comparisons into one undifferentiated table.
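It can help to record the locked questions as data before computing anything, so the later statistics and figures map one-to-one onto them. The structure below is only an illustrative sketch; the field names are not prescribed by this skill.

```python
# Illustrative only: one explicit record per comparison question.
comparisons = [
    {
        "question": "Does ours beat the strongest baseline on the primary metric?",
        "treatment": "ours",
        "baseline": "strongest-baseline",
        "metric": "accuracy",       # primary metric; direction recorded in step 1
        "unit": "seed",             # repeated-measure unit
        "decision_changing": True,  # would a negative result change the next step?
    },
    # ...one entry per ablation or robustness question
]
```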
3. Run strict statistics
Always produce:
- descriptive statistics: mean ± std when appropriate, a 95% CI or another clearly justified interval, and run/seed counts,
- significance tests with assumptions stated,
- effect sizes,
- multiple-comparison handling when several contrasts are reported.
Default expectation:
- check parametric assumptions first,
- use non-parametric fallback when assumptions fail,
- state exactly what was tested and on what samples.
See:
- references/statistical-methods.md
- references/statistical-reporting.md
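As a concrete illustration of that default expectation, here is a minimal sketch for a single paired contrast using SciPy. The per-seed arrays are placeholders, and the exact test and effect-size convention should follow references/statistical-methods.md rather than this snippet.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed scores for two methods, aligned by seed.
ours = np.array([0.81, 0.83, 0.80, 0.84, 0.82])
base = np.array([0.78, 0.80, 0.79, 0.81, 0.79])
diff = ours - base

# Descriptives: mean ± std plus a 95% CI on the paired difference.
ci = stats.t.interval(0.95, df=len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))

# Check normality of the differences before trusting a paired t-test.
if stats.shapiro(diff).pvalue > 0.05:
    result = stats.ttest_rel(ours, base)   # parametric
else:
    result = stats.wilcoxon(ours, base)    # non-parametric fallback

# Effect size: Cohen's d on the paired differences (one common convention).
d = diff.mean() / diff.std(ddof=1)

print(f"mean diff={diff.mean():.3f}, 95% CI={ci}, p={result.pvalue:.4f}, d={d:.2f}")

# With several contrasts, correct the whole family of p-values, e.g. Holm:
# from statsmodels.stats.multitest import multipletests
# reject, p_adj, _, _ = multipletests(p_values, method="holm")
```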
4. Generate real scientific figures
Produce actual figures whenever artifacts are available.
Minimum expectation for a non-trivial analysis bundle:
- one main comparison figure,
- one supporting figure (training dynamics / ablation / breakdown / error analysis),
- one exact numeric summary table in markdown.
Every main figure must define:
- figure purpose,
- plotted variables,
- error bar meaning,
- caption requirements,
- interpretation checklist.
See:
- references/visualization-best-practices.md
- references/figure-interpretation.md
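A minimal sketch of the main comparison figure, assuming the per-seed table from the step 1 sketch; the styling here is deliberately plain and is not the publication standard (that lives with publication-chart-skill).

```python
import pandas as pd
import matplotlib.pyplot as plt

runs = pd.read_csv("results/metrics.csv")  # hypothetical per-run table from step 1
summary = runs.groupby("method")["accuracy"].agg(["mean", "std", "count"])

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(summary.index, summary["mean"], yerr=summary["std"], capsize=4)
ax.set_xlabel("Method")
ax.set_ylabel("Accuracy (mean ± std over seeds)")  # state what the error bars mean
fig.tight_layout()
fig.savefig("analysis-output/figures/figure-01-main-comparison.pdf")
```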
5. Write analysis artifacts
analysis-report.md
Summarize:
- the analysis question,
- key findings,
- strongest supported comparisons,
- main caveats,
- what changed in the experimental understanding.
stats-appendix.md
Record:
- descriptive statistics,
- test choices,
- assumptions checked,
- effect sizes,
- confidence intervals,
- multiple comparison corrections,
- explicit blockers and limitations.
figure-catalog.md
For each figure, record:
- filename,
- purpose,
- data source,
- caption draft requirements,
- key observation,
- interpretation checklist,
- known caveats.
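A hypothetical entry, for orientation only (every value below is a placeholder):
- filename: figures/figure-01-main-comparison.pdf
- purpose: show the primary-metric gap between ours and the strongest baseline
- data source: results/metrics.csv, per-seed runs
- caption draft requirements: name the metric, the error bar meaning, and the seed count
- key observation: the gap is larger than seed-to-seed variation
- interpretation checklist: does the gap survive the correction recorded in stats-appendix.md?
- known caveats: single dataset; few seeds, so the effect size may be unstable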
6. Final QA gate
Do not finish until all are true:
- the primary comparison question is explicit,
- sample size / seed count is stated,
- inferential tests are justified,
- effect sizes are reported for major contrasts,
- real figures exist when data exists,
- each figure has an interpretation note,
- limitations and blockers are explicit,
- no manuscript-style Results draft is included.
Output structure
analysis-output/
├── analysis-report.md
├── stats-appendix.md
├── figure-catalog.md
└── figures/
├── figure-01-main-comparison.pdf
├── figure-02-ablation.pdf
└── ...
Figure interpretation rule
For every major figure, answer all three questions:
- Why does this figure exist?
- What exactly should the reader notice?
- What does that observation change in our belief or next decision?
If a figure cannot answer question 3, it is probably decorative rather than scientific.
Failure mode policy
When inputs are incomplete, say so explicitly.
Examples:
- no seed-level data -> descriptive summary only; inferential claims blocked,
- no comparable baseline outputs -> no significance claim,
- no readable logs -> cannot generate dynamics figure,
- too few runs -> effect size may be unstable; report this limitation.
Never replace missing evidence with confident prose.
Reference files
Load only what is needed:
- references/statistical-methods.md - test selection and assumptions
- references/statistical-reporting.md - minimum reporting standard
- references/visualization-best-practices.md - publication-quality figure rules
- references/figure-interpretation.md - how to explain figures with evidence
- references/analysis-depth.md - move from observation to mechanism and decision
- references/common-pitfalls.md - common analysis and reporting failures
Example files
- examples/example-analysis-report.md
- examples/example-stats-appendix.md
- examples/example-figure-catalog.md