auto-paper-improvement-loop
Auto Paper Improvement Loop: Review → Fix → Recompile
Autonomously improve the paper at: $ARGUMENTS
Context
This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.
Unlike /auto-review-loop (which iterates on research — running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.
Constants
- MAX_ROUNDS = 2 — Two rounds of review→fix→recompile. Empirically, Round 1 catches structural issues (4→6/10), Round 2 catches remaining presentation issues (6→7/10). Diminishing returns beyond 2 rounds for writing-only improvements.
- REVIEWER_MODEL =
gpt-5.4— Model used via Codex MCP for paper review. - REVIEWER_BIAS_GUARD = true — When
true, every review round uses a freshmcp__codex__codexthread with no prior review context. Never usemcp__codex__codex-replyfor review rounds. Set tofalseonly for deliberate debugging of the legacy behavior. Empirical evidence: running the same paper withcodex-reply+ "since last round we did X" prompts inflated scores from real 3/10 → fake 8/10 across multiple rounds; switching to fresh threads recovered the true 3/10 assessment. - REVIEW_LOG =
PAPER_IMPROVEMENT_LOG.md— Cumulative log of all rounds, stored in paper directory. - HUMAN_CHECKPOINT = false — When
true, pause after each round's review and present score + weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. Whenfalse(default), runs fully autonomously. - EDIT_WHITELIST =
null— Optional path to a YAML/JSON whitelist file constraining which paths and operations the fix-implementation step may touch. Whennull(default), all edits proceed unconstrained. When set via— edit-whitelist <path>(also accepts— edit_whitelist <path>), the loop loads the file at startup and consults it before each edit; rejected edits are logged toPAPER_IMPROVEMENT_LOG.mdrather than silently dropped. See "Optional: Edit Whitelist" below.
💡 Override:
/auto-paper-improvement-loop "paper/" — human checkpoint: true
Optional: Style reference (— style-ref: <source>, opt-in)
Lets the user steer structural fixes only during improvement (section reordering hints, paragraph length nudges, figure density adjustments) toward a reference paper. Default OFF — when the user does not pass — style-ref, do nothing differently from before.
Only when — style-ref: <source> appears in $ARGUMENTS, run the helper FIRST, before the loop starts:
if [ ! -f tools/extract_paper_style.py ]; then
echo "error: tools/extract_paper_style.py not found — re-run 'bash tools/install_aris.sh' to refresh the '.aris/tools' symlink (added in #174), or copy the helper manually from the ARIS repo" >&2
exit 1
fi
CACHE=$(python3 tools/extract_paper_style.py --source "<source>")
case $? in
0) ;; # use $CACHE/style_profile.md as structural guidance for the FIX phase only
2) echo "warning: style-ref skipped (missing optional dep)" >&2 ;;
3) echo "error: --style-ref source failed; aborting loop" >&2 ; exit 1 ;;
*) echo "error: helper failed unexpectedly; aborting loop" >&2 ; exit 1 ;;
esac
Sources accepted: local TeX dir / file, local PDF, arXiv id, http(s) URL. Overleaf URLs/IDs are rejected — clone via /overleaf-sync setup <id> first and pass the local clone path.
Strict rules (full contract in tools/extract_paper_style.py docstring):
- Use
style_profile.mdonly during the fix-implementation phase, to nudge structural choices when applying reviewer feedback. Reviewer feedback always takes precedence; style ref is tie-breaker for how to apply a fix, not whether to apply it. - Never copy prose, claims, examples, or terminology from anything reachable through the cache when implementing fixes.
- Never pass
— style-ref(or the cache contents) to the GPT-5.4 reviewer sub-agent. The Reviewer Independence Protocol below requires reviewers see only the artifact and the user's prompt — leaking the style ref would contaminate the review with author-side context. This is the most critical invariant in this skill.
Optional: Edit Whitelist (— edit-whitelist <path>, opt-in)
Lets the caller hard-constrain which files and operations the fix-implementation step (Step 3 / Step 6) is allowed to touch. Default OFF — when the user does not pass — edit-whitelist (or the alias — edit_whitelist), the loop applies all reviewer-driven edits without restriction, exactly as before.
This is the parameter that upstream pipelines (e.g. /resubmit-pipeline Phase 2) use to enforce text-only resubmit microedits: no .bib mutations, no .sty / .bst mutations, no edits to prior-submission directories, no new \cite{...}, no new theorem environments, no new numerical claims.
Schema
The whitelist file is YAML or JSON. All four sections are optional:
allowed_paths:
- sec/*.tex
- main.tex
- figures/*.tex
forbidden_paths:
- "**/*.bib"
- "**/*.sty"
- "**/*.bst"
- "../OldSubmission/**"
forbidden_operations:
- new_cite # blocks \cite{...}, \citep{...}, \citet{...}, \citeauthor{...} additions
- new_bibitem # blocks \bibitem{...} additions
- new_theorem_env # blocks \begin{theorem|lemma|proposition|corollary} additions
- numerical_claim # blocks adding new numbers / percentages / metrics
forbidden_deletions: # operations that block REMOVALS, not additions
- delete_existing_cite # blocks removal of \cite{...} from the body (use citation-audit --soft-only instead)
- delete_theorem_env # blocks removal of an existing \begin{theorem|...} block
requires_user_approval_for: # operations that don't auto-reject but pause for explicit user OK
- rewrite_abstract # paraphrasing the entire abstract triggers a checkpoint
- rewrite_intro_first_para
- delete_section
max_edits_per_round: 30 # hard cap on number of accepted edits per round (rejections are not counted; if cap is hit, remaining proposed edits are deferred to the next round with a warning)
rationale: "Resubmit mode: text-only microedits, paper structure frozen by user constraint."
Resolution rules
allowed_pathsempty ANDforbidden_pathsempty → whitelist is a no-op (advisory: the file is loaded andrationaleechoed to the log, but no path filtering is applied).allowed_pathsempty,forbidden_pathsnon-empty → all paths NOT matched byforbidden_pathsare mutable.allowed_pathsnon-empty,forbidden_pathsempty → only paths matchingallowed_pathsare mutable.- Both non-empty → an edit is allowed iff the target matches
allowed_pathsAND does NOT matchforbidden_paths.forbidden_pathsalways wins on overlap. forbidden_operationsmissing or empty → no operation-level guard; only path-level filtering applies.
Glob semantics
Use bash extglob / Python fnmatch.fnmatch semantics. ** matches any depth (zero or more directory segments). Patterns are matched against the path relative to the paper directory (e.g. paper/sec/intro.tex matches sec/*.tex when paper-directory is paper/).
Forbidden-operation detectors
For each candidate edit's diff (the new lines being added — deletions are exempt), the loop runs these regex checks and rejects if any forbidden operation matches:
| Operation | Detector (added lines only) |
|---|---|
new_cite |
\\cite[a-zA-Z]*\{[^}]+\} (catches \cite, \citep, \citet, \citeauthor, \citeyear, \citealp, etc.) |
new_bibitem |
\\bibitem\{[^}]+\} |
new_theorem_env |
`\begin{(theorem |
numerical_claim |
New token matching \b\d+(\.\d+)?%?\b that did NOT appear in the deleted/replaced lines (i.e. genuinely new numbers, not edits to existing ones) |
Behavior at loop start (before Round 1 fix-implementation)
- If
— edit-whitelist <path>is present in$ARGUMENTS, setEDIT_WHITELIST = <path>. - Load the file (
yaml.safe_load; if it fails, fall back tojson.loads). On load failure, abort the loop with a clear error — do NOT silently proceed unconstrained. - Echo
rationale(if present) intoPAPER_IMPROVEMENT_LOG.mdunder a new "Edit Whitelist" preamble section so the audit trail records why edits were constrained.
Behavior during fix-implementation (Steps 3 and 6)
Before applying each proposed edit:
- Resolve target file path relative to the paper directory.
- Path check: if
allowed_pathsis non-empty, target must match at least one pattern. Then ifforbidden_pathsis non-empty, target must NOT match any pattern. If either fails → reject aspathviolation. - Operation check: build the unified diff (or just the set of newly-added lines) for the proposed edit. For each entry in
forbidden_operations, run its detector on the added lines. If any detector matches → reject asoperationviolation. - If all checks pass, apply the edit normally.
- If rejected, append an entry to
PAPER_IMPROVEMENT_LOG.mdunder a## Rejected by edit_whitelist (Round N)heading with this schema:- file: <relative path> reason: path | operation pattern: <the offending forbidden_path glob, OR the offending forbidden_operation name + the matched substring> reviewer_concern: <the original Round-N weakness that motivated this edit> - Continue with the remaining edits in the round. Do NOT abort the whole round on a single rejection.
End-of-round surfacing
At the end of each round (after the recompile, before moving to the next round), if any edits were rejected during that round's fix step:
- Print a one-line summary to the round's checkpoint output:
Edit whitelist rejected N edits this round (M path, K operation). See PAPER_IMPROVEMENT_LOG.md "Rejected by edit_whitelist (Round N)". - If
HUMAN_CHECKPOINT = true, include the rejection list in the checkpoint shown to the user before they approve next-round fixes.
Example invocations
# Resubmit-pipeline Phase 2 caller (text-only mode):
/auto-paper-improvement-loop "paper/" — edit-whitelist .resubmit/edit_whitelist.yaml
# Aliased form is accepted:
/auto-paper-improvement-loop "paper/" — edit_whitelist .resubmit/edit_whitelist.yaml
# Combined with other flags:
/auto-paper-improvement-loop "paper/" — human checkpoint: true — edit-whitelist constraints.yaml
Rationale
Without a whitelist, the loop's reviewer-driven fix step is free to add citations, introduce new theorem environments, or tweak numerical claims — all of which are reasonable for first-submission polish but forbidden in resubmit / camera-ready / rebuttal-only modes where the paper structure is frozen by external constraint. Routing those constraints through a first-class parameter (rather than relying on the LLM to "remember" not to do them) makes the constraint enforceable, auditable via PAPER_IMPROVEMENT_LOG.md, and visible to the user at each round's checkpoint.
Inputs
- Compiled paper —
paper/main.pdf+ LaTeX source files - All section
.texfiles — concatenated for review prompt
State Persistence (Compact Recovery)
If the context window fills up mid-loop, Claude Code auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:
{
"current_round": 1,
"threadId": "019ce736-...",
"last_score": 6,
"status": "in_progress",
"timestamp": "2026-03-13T21:00:00"
}
On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.
After each round: overwrite the state file. On completion: set "status": "completed".
Reviewer Independence Protocol
The reviewer must be context-naive on every round. Prior-round summaries, fix lists, and executor explanations are not evidence; they are a source of confirmation bias. If the reviewer is told what changed, scores tend to drift upward even when the manuscript itself has not materially improved.
Rules:
- Every round starts with
mcp__codex__codex, notmcp__codex__codex-reply. - Never pass a prior threadId into the next review prompt.
- Never include "since last round", "we fixed", "after applying", or any fix summary in the reviewer prompt.
- The only acceptable evidence of improvement is the current
.texsource and compiled PDF. - If a fix cannot be observed in the files, the reviewer should not be told it happened.
- If recovery metadata is needed, store the returned threadId for crash recovery only; do not use it to preserve review context.
Set REVIEWER_BIAS_GUARD = false only if you explicitly want the legacy, context-carrying behavior for debugging.
Workflow
Step 0: Preserve Original
cp paper/main.pdf paper/main_round0_original.pdf
Step 1: Collect Paper Text
Concatenate all section files into a single text block for the review prompt:
# Collect all sections in order
for f in paper/sections/*.tex; do
echo "% === $(basename $f) ==="
cat "$f"
done > /tmp/paper_full_text.txt
Step 2: Round 1 Review
Send the full paper text AND compiled PDF to GPT-5.4 xhigh:
mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are reviewing a [VENUE] paper. Please provide a detailed, structured review.
## Paper Files:
- LaTeX source: [list all section .tex files]
- Compiled PDF: paper/main.pdf
- Figures: [list figure files]
Read BOTH the LaTeX source (for content/logic) AND the compiled PDF (for visual presentation).
## Review Instructions
Please act as a senior ML reviewer ([VENUE] level). Provide:
1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
2. **Summary** (2-3 sentences)
3. **Strengths** (bullet list, ranked)
4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
6. **Missing References** (if any)
7. **Visual Review** (from the PDF):
- Figure quality: readable? labels legible? colors distinguishable in grayscale?
- Figure-caption alignment: does each caption match its figure?
- Layout: orphaned headers, awkward page breaks, figures far from references?
- Table formatting: aligned columns, consistent decimals, bold for best results?
- Visual consistency: same color scheme across all figures?
8. **Verdict**: Ready for submission? Yes / Almost / No
Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
self-containedness, notation consistency, AND visual presentation quality.
Save the threadId for Round 2.
Step 2b: Human Checkpoint (if enabled)
Skip if HUMAN_CHECKPOINT = false.
Present the review results and wait for user input:
📋 Round 1 review complete.
Score: X/10 — [verdict]
Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...
Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.
Parse user response same as /auto-review-loop: approve / custom instructions / skip / stop.
Step 3: Implement Round 1 Fixes
Parse the review and implement fixes by severity:
Priority order:
- CRITICAL fixes (assumption mismatches, internal contradictions)
- MAJOR fixes (overclaims, missing content, notation issues)
- MINOR fixes (if time permits)
Edit-whitelist gate (if set): If EDIT_WHITELIST is set, before applying each proposed edit, check the target path against allowed_paths / forbidden_paths and the new-lines diff against forbidden_operations per the "Optional: Edit Whitelist" section. Rejections are logged to PAPER_IMPROVEMENT_LOG.md under ## Rejected by edit_whitelist (Round 1) with file, reason (path or operation), the offending pattern, and the original reviewer concern. The loop continues with remaining edits — a rejection never aborts the round. Surface a rejection summary at the end of the round.
Common fix patterns:
| Issue | Fix Pattern |
|---|---|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to references.bib, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |
| Proof gap (theory papers) | Run /proof-checker if PROOF_AUDIT.md doesn't exist yet; fix FATAL/CRITICAL issues |
| Writing clutter / passive voice | Apply sciwrite 5-pass audit: clutter extraction → active voice → sentence architecture → keyword consistency → numerical integrity. See paper-write Step 5 |
| Number mismatch (paper vs results) | Run /paper-claim-audit if PAPER_CLAIM_AUDIT.md doesn't exist; fix any number_mismatch or aggregation_mismatch claims |
| Keyword inconsistency | The "Banana Rule": if Methods says "obese group", Results must not say "heavier group". Extract key terms, verify consistency across all sections |
Step 4: Recompile Round 1
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf
Verify: 0 undefined references, 0 undefined citations.
Step 4.5: Restatement Regression Test
After every recompilation, rerun a theorem-statement consistency check so fix rounds cannot reintroduce appendix drift. Run this after Step 4 and again after Step 7 before the final format check.
Scope
- Compare only theorem/lemma/proposition/corollary statements, not proof bodies.
- Classify files by
main.texinput order: files before\appendixare main body; files after\appendixare appendix.
Normalized comparison logic
- Strip comments,
\label{...},\ref{...},\eqref{...},\cite...{...}, and whitespace-only differences. - Collapse formatting-only macros such as
\emph{},\textbf{},\textit{},\mathrm{},\mathbf{},\mathcal{}, and\operatorname{}to their contents. - Preserve quantifiers, case splits, assumptions, and the literal names of defined objects.
- Compare by theorem label when available; otherwise compare by theorem type and order.
- Flag any change in hypotheses, case splits, quantifier order, or terminology (
stationaryvsterminal) as regression drift.
python3 - <<'PY'
import re
def normalize(s):
s = re.sub(r'%.*', '', s)
s = re.sub(r'\\label\{[^}]*\}', '', s)
s = re.sub(r'\\(?:ref|eqref|cref|Cref|cite[a-zA-Z]*)\{[^}]*\}', '', s)
s = re.sub(r'\\(?:emph|textbf|textit|mathrm|mathbf|mathsf|mathcal|operatorname)\{([^{}]*)\}', r'\1', s)
s = re.sub(r'\\begin\{[^}]+\}|\\end\{[^}]+\}', '', s)
s = re.sub(r'\s+', ' ', s)
return s.strip().lower()
# Compare normalized theorem blocks from the current main-body files
# against their appendix restatements. Any mismatch blocks completion.
PY
Empirical motivation: in a real submission run, a key theorem had a multi-case split in the main text but a single-case statement in the appendix; a key variable was named one way in main and another in appendix. These drifted multiple times across fix rounds because no automated check caught regression.
Optional deeper check: /proof-checker --restatement-check
The inline Python check above is the default and is sufficient for routine main-vs-appendix consistency. For broader coverage, you may additionally invoke /proof-checker --restatement-check (added in PR #189), which extends the comparison to:
- Restatements in summary tables, abstract, "Key Contributions" lists, and discussion sections (not just main vs appendix).
- Six named drift signatures classified by reviewer:
conditional_loss,scope_change,quantifier_loss,regime_envelope_change,constant_change,variable_rename. - Structured findings emitted as
details.restatement_drift[]inPROOF_AUDIT.json, suitable for downstream tooling or audit trails.
This is advisory only — the inline Step 4.5 check remains the default and continues to run on every loop round. Consider invoking /proof-checker --restatement-check when (a) you suspect cross-location drift outside the main↔appendix axis (e.g., abstract overclaim relative to theorem statement), or (b) you want reviewer-graded drift signatures rather than raw string mismatches. Running both is supported and they are independent: the inline check fails fast on string drift, the proof-checker pass surfaces semantic-class drift.
Step 5: Round 2 Review
If REVIEWER_BIAS_GUARD = true (default), use a fresh mcp__codex__codex thread for Round 2. Do not reuse the Round 1 threadId for prompting. Save the returned threadId only for recovery bookkeeping.
mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are reviewing a [VENUE] paper. This is a fresh, zero-context review.
Ignore any prior review rounds, prior fix lists, or executor explanations.
Judge the paper only from the current LaTeX source and compiled PDF.
## Paper Files:
- LaTeX source: [list all section .tex files]
- Compiled PDF: paper/main.pdf
- Figures: [list figure files]
Read BOTH the LaTeX source (for content/logic) AND the compiled PDF (for visual presentation).
## Review Instructions
Please act as a senior ML reviewer ([VENUE] level). Provide:
1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
2. **Summary** (2-3 sentences)
3. **Strengths** (bullet list, ranked)
4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
6. **Missing References** (if any)
7. **Visual Review** (from the PDF):
- Figure quality: readable? labels legible? colors distinguishable in grayscale?
- Figure-caption alignment: does each caption match its figure?
- Layout: orphaned headers, awkward page breaks, figures far from references?
- Table formatting: aligned columns, consistent decimals, bold for best results?
- Visual consistency: same color scheme across all figures?
8. **Verdict**: Ready for submission? Yes / Almost / No
Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
self-containedness, notation consistency, and visual presentation quality.
If REVIEWER_BIAS_GUARD = false (legacy debugging only), use mcp__codex__codex-reply with the saved threadId; this is not the recommended path.
Step 5.5: Kill Argument Exercise (theory / scope-heavy papers only)
Run this only if the paper is theory-heavy (≥5 \begin{theorem}|\begin{lemma}|\begin{proposition}|\begin{corollary} environments in the source) or has explicit scope/generality claims in title/abstract, and only on the final scheduled round (current_round == MAX_ROUNDS).
Delegate to /kill-argument — the canonical implementation lives at skills/kill-argument/SKILL.md (extracted in May 2026). This step does NOT re-implement the Attack-and-Adjudication prompt template; instead, invoke the skill and read its output:
# Invoke the canonical adversarial-review primitive on the current paper.
# /kill-argument runs two fresh-thread codex 5.5 xhigh calls and writes
# KILL_ARGUMENT.{md,json} into the paper directory. It is detect-only —
# it never edits the paper itself.
/kill-argument "$PAPER_DIR"
# Read the structured verdict.
KILL_VERDICT=$(jq -r '.verdict' "$PAPER_DIR/KILL_ARGUMENT.json")
KILL_REASON=$(jq -r '.reason_code' "$PAPER_DIR/KILL_ARGUMENT.json")
Merge rule (auto-loop's responsibility — /kill-argument itself is detect-only):
- Read
details.decomposed_pointsfromKILL_ARGUMENT.json. - For each point with
verdict == "still_unresolved"orverdict == "partially_answered"atseverity_if_unresolved == "critical":- Dedupe against the Round 2 weakness list by semantic overlap (~85% similarity threshold).
- If novel, append it to the Step 6 fix list with the recommended_fix as the action description.
- If the adjudication shows the issue is
answered_by_current_text, only downgrade an existing weakness item after verifying the cited file:line evidence yourself. - Record both
KILL_ARGUMENT.mdand the merge decision inPAPER_IMPROVEMENT_LOG.md. - If
HUMAN_CHECKPOINT = true, include the merged findings in the checkpoint summary before asking the user to proceed.
This phase feeds directly into Step 6. The merged findings must land before the final recompile.
If /kill-argument returns verdict: NOT_APPLICABLE (paper isn't theory- or scope-heavy enough to need this check), skip Step 5.5 entirely and proceed to Step 6. If it returns BLOCKED or ERROR, log the reason in PAPER_IMPROVEMENT_LOG.md and proceed without merging — the loop should not stall on an adversarial check that cannot run.
Empirical motivation: in a real submission run, after several rounds of standard improvement (score 7-8/10), the kill-argument exercise surfaced framing weaknesses that no prior review caught (e.g., a setting being mostly conditional rather than truly general, or a baseline being irrelevant to real systems). Author rebuttal forced explicit scope qualifications in abstract and discussion.
Step 5b: Human Checkpoint (if enabled)
Skip if HUMAN_CHECKPOINT = false. Same as Step 2b — present Round 2 review, wait for user input.
Step 6: Implement Round 2 Fixes
Same process as Step 3. Typical Round 2 fixes:
- Add controlled synthetic experiments validating theory
- Further soften any remaining overclaims
- Formalize informal arguments (e.g., truncation → formal proposition)
- Strengthen limitations section
Edit-whitelist gate (if set): Same as Step 3 — if EDIT_WHITELIST is set, run the path + forbidden-operation checks before applying each proposed edit. Rejections are logged to PAPER_IMPROVEMENT_LOG.md under ## Rejected by edit_whitelist (Round 2) and the loop continues. Surface a rejection summary at the end of the round.
Step 7: Recompile Round 2
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf
Step 8: Format Check
After the final recompilation, run a location-aware format compliance check.
# If the log lacks file/line data, rerun the final compile once with -file-line-error.
cd paper && latexmk -pdf -file-line-error -interaction=nonstopmode -halt-on-error main.tex
# 1. Page count vs venue limit
PAGES=$(pdfinfo paper/main.pdf | grep Pages | awk '{print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"
# 2. Duplicate labels: HARD BLOCK
DUP_LABELS=$(grep -Rho "\\\\label{[^}]*}" paper/main.tex paper/sections 2>/dev/null | sort | uniq -d || true)
if [ -n "$DUP_LABELS" ]; then
echo "Duplicate labels found (BLOCKING):"
echo "$DUP_LABELS"
fi
# 3. Overfull warnings with location classification
OVERFULLS=$(grep -n "Overfull \\\\hbox" paper/main.log 2>/dev/null || true)
# Main body = source files before \appendix in main.tex.
# Appendix = source files after \appendix, or files whose path contains "appendix".
# Bibliography = paper.bbl, references.bib, or bibliography-generated output.
MAIN_BODY_OVERFULL=$(echo "$OVERFULLS" | grep -v -E 'appendix|paper\.bbl|references\.bib' || true)
APPENDIX_OVERFULL=$(echo "$OVERFULLS" | grep -E 'appendix' || true)
BIB_OVERFULL=$(echo "$OVERFULLS" | grep -E 'paper\.bbl|references\.bib' || true)
echo "Main-body overfulls (any size BLOCKS):"
echo "$MAIN_BODY_OVERFULL"
echo "Appendix overfulls (>10pt blocks):"
echo "$APPENDIX_OVERFULL"
echo "Bibliography overfulls (>20pt blocks):"
echo "$BIB_OVERFULL"
Stop criteria:
- Any duplicate label blocks completion.
- Any overfull in the main body blocks completion, regardless of size.
- Appendix overfulls block completion only if they exceed 10pt or are visibly clipping.
- Bibliography overfulls block completion only if they exceed 20pt or are visibly clipping.
- Underfull hboxes remain warnings unless they create obvious layout damage.
Auto-fix patterns (location-aware):
| Issue | Fix |
|---|---|
| Main-body overfull in equation | Split with aligned / split / multline, or shorten notation |
| Main-body overfull in table | Reduce font, resize table, or break table across rows |
| Main-body overfull in text | Rephrase; do not hide it with global \sloppy |
| Appendix overfull ≤ 10pt | Warn only unless visibly clipping |
| Appendix overfull > 10pt | Apply the same fix if the spill is visible |
| Bibliography overfull ≤ 20pt | Warn only unless caused by malformed entry or clipping |
| Bibliography overfull > 20pt | Fix malformed entry, URL, or DOI formatting |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
Location-aware interpretation:
- Classify by the source file reported in the
-file-line-errorlog. - If a warning cannot be classified, treat it as main body and fix it.
Empirical motivation: in a real submission run, dozens of overfull hbox warnings (the largest well over 100pt in an appendix proof) survived multiple improvement rounds because the previous blanket "overfull > 10pt blocks" rule was too lax and treated all locations equally.
Step 9: Document Results
Create PAPER_IMPROVEMENT_LOG.md in the paper directory:
# Paper Improvement Log
## Score Progression
| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |
## Round 1 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 1)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## Round 2 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 2)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes
Step 9: Summary
Report to user:
- Score progression table
- Number of CRITICAL/MAJOR/MINOR issues fixed per round
- Final page count
- Remaining issues (if any)
Feishu Notification (if configured)
After each round's review AND at final completion, check ~/.claude/feishu.json:
- After each round: Send
review_scored— "Round N: X/10 — [key changes]" - After final round: Send
pipeline_done— score progression table + final page count - If config absent or mode
"off": skip entirely (no-op)
Output
paper/
├── main_round0_original.pdf # Original
├── main_round1.pdf # After Round 1
├── main_round2.pdf # After Round 2 (final)
├── main.pdf # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md # Full review log with scores
Key Rules
-
Large file handling: If the Write tool fails due to file size, immediately retry using Bash (
cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently. -
Preserve all PDF versions — user needs to compare progression
-
Save FULL raw review text — do not summarize or truncate GPT-5.4 responses
-
Reviewer independence (Round 2+): when
REVIEWER_BIAS_GUARD = true(default), use a freshmcp__codex__codexthread for every review round; never usemcp__codex__codex-replyand never include "since last round" / fix summaries in the prompt. See the Reviewer Independence Protocol section above. -
Always recompile after fixes — verify 0 errors before proceeding
-
Do not fabricate experimental results — synthetic validation must describe methodology, not invent numbers
-
Respect the paper's claims — soften overclaims rather than adding unsupported new claims
-
Global consistency — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)
-
Edit-whitelist rejections are LOGGED, not silently dropped — when
EDIT_WHITELISTis set and an edit is rejected for a path or forbidden-operation violation, the rejection MUST be appended toPAPER_IMPROVEMENT_LOG.mdwith file, reason, offending pattern, and the original reviewer concern. The loop reports a rejection summary at the end of every round (and in the checkpoint, ifHUMAN_CHECKPOINT = true). Never silently swallow a whitelist rejection — the audit trail is the whole point of the parameter.
Typical Score Progression
Based on end-to-end testing on a real theory-paper run:
| Round | Score | Key Improvements |
|---|---|---|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |
+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final state at submission: clean overfull-hbox count and venue-format-compliant length.
Review Tracing
After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md. Use tools/save_trace.sh or write files directly to .aris/traces/<skill>/<date>_run<NN>/. Respect the --- trace: parameter (default: full).
More from wanshuiyin/auto-claude-code-research-in-sleep
idea-creator
Generate and rank research ideas given a broad direction. Use when user says "找idea", "brainstorm ideas", "generate research ideas", "what can we work on", or wants to explore a research area for publishable directions.
126idea-discovery
Workflow 1: Full idea discovery pipeline. Orchestrates research-lit → idea-creator → novelty-check → research-review to go from a broad research direction to validated, pilot-tested ideas. Use when user says \"找idea全流程\", \"idea discovery pipeline\", \"从零开始找方向\", or wants the complete idea exploration workflow.
123auto-review-loop
Autonomous multi-round research review loop. Repeatedly reviews via Codex MCP, implements fixes, and re-reviews until positive assessment or max rounds reached. Use when user says "auto review loop", "review until it passes", or wants autonomous iterative improvement.
116research-lit
Search and analyze research papers, find related work, summarize key ideas. Use when user says "find papers", "related work", "literature review", "what does this paper say", or needs to understand academic papers.
115research-pipeline
Full research pipeline: Workflow 1 (idea discovery) → implementation → Workflow 2 (auto review loop) → Workflow 3 (paper writing, optional). Goes from a broad research direction all the way to a polished PDF. Use when user says \"全流程\", \"full pipeline\", \"从找idea到投稿\", \"end-to-end research\", or wants the complete autonomous research lifecycle.
114pixel-art
Generate pixel art SVG illustrations for READMEs, docs, or slides. Use when user says "画像素图", "pixel art", "make an SVG illustration", "README hero image", or wants a cute visual.
114