skills/terrylica/cc-skills/academic-pdf-to-gfm

academic-pdf-to-gfm

Installation
SKILL.md

Academic PDF → GitHub GFM Conversion

A battle-tested workflow for converting academic/research PDF papers into GitHub-renderable GFM markdown with inline figures, mathematically correct LaTeX, and validated output.

Battle-tested on: López de Prado (2026) "How to Use the Sharpe Ratio" — 51 pages, 82 equations, 8 figures.

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

Quick Start (3 Steps)

# Step 1: Extract prose (best structure preservation)
uv run --python 3.13 --with pymupdf4llm python3 -c "
import pymupdf4llm
md = pymupdf4llm.to_markdown('paper.pdf')
open('paper-raw.md', 'w').write(md)
"

# Step 2: Extract images
uv run --python 3.13 --with pymupdf python3 references/extract-images.py paper.pdf

# Step 3: Validate math before pushing
node references/validate-math.mjs paper.md

CRITICAL: Detect PDF Type First

This determines the entire workflow. Getting it wrong wastes hours.

Type A — Word-Generated PDF (Most Modern Academic Papers)

Signs: Embedded fonts, copyable text, Unicode math chars when you copy-paste (∑, π, α, β, γ, →)

Math encoding: Math is Unicode text in PDF stream — NOT images, NOT glyph maps

Consequence: OCR tools like marker-pdf cannot extract LaTeX — they see text like "γ₄" not \gamma_4. They may return empty output or crash silently.

Required approach:

  1. Use pymupdf4llm for prose extraction
  2. Manually transcribe all equations from PDF screenshots — there is no shortcut
  3. Read each formula visually, write LaTeX by hand

How to confirm: Run marker-pdf — if output is empty or has zero math content, it's Type A.

Type B — LaTeX-Generated PDF

Signs: Computer Modern fonts, precise mathematical spacing, arxiv.org source available

Math encoding: Glyph-mapped — structure is partially extractable

Approach: pymupdf4llm or pdftotext for text. If arxiv source exists, extract directly from .tex (vastly preferred over PDF conversion).

Type C — Scanned/Image PDF

Signs: All pages are raster images, zero copyable text

Approach: OCR pipeline — marker-pdf is best option, or tesseract


Tool Comparison

Tool Best For Install Key Limitation
pymupdf4llm Type A/B prose (best structure) uv run --with pymupdf4llm Math as Unicode, not LaTeX
pdftotext Quick plain text brew install poppler Loses table structure
markitdown Alternative prose uv run --with 'markitdown[pdf]' Slight over-spacing; same math limit
marker-pdf Type C scanned only pip install marker-pdf Fails silently on Type A (Unicode text bug)

Never trust marker-pdf output on Type A/B PDFs — the apparent "success" with empty math sections is the failure mode.


Image Extraction

Save references/extract-images.py:

import fitz, os, sys

doc = fitz.open(sys.argv[1])
os.makedirs("references/media", exist_ok=True)
saved = []
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        xref = img[0]
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]
        if len(img_bytes) < 2048:   # skip icons/logos/watermarks/rules
            continue
        ext = base_image["ext"]
        fname = f"fig-p{page_num+1:02d}-{img_idx+1:02d}.{ext}"
        with open(f"references/media/{fname}", "wb") as f:
            f.write(img_bytes)
        saved.append((page_num+1, fname, base_image.get("width"), base_image.get("height")))
        print(f"Saved: {fname} ({len(img_bytes)//1024}KB, {base_image.get('width')}×{base_image.get('height')})")
doc.close()
print(f"\n{len(saved)} images saved to references/media/")

Naming: fig-p{page:02d}-{idx:02d}.{ext} — page number in name for easy location matching.

Size filter: Skip < 2 KB (captures icons, watermarks, horizontal rules). Review everything ≥ 2 KB — some are decorative but most are figures.

Insert in markdown:

![Figure 1: Variance of Sharpe ratio estimates](./media/fig-p12-01.png)

Place immediately after the nearest section heading or the paragraph that references the figure.


GitHub GFM Math Rendering Rules

The $$ vs ```math ``` Decision — Root Cause

GitHub's Markdown pre-processor runs BEFORE the math renderer. It treats \\ as an escaped backslash and collapses it to \. This breaks LaTeX line breaks in display math.

The rule is simple:

Equation type Use Reason
Single-line display $$...$$ No \\ → pre-processor safe
Multi-line (contains \\, \begin{aligned}, matrices) ```math ``` Pre-processor does NOT process code fences
Inline $...$ Standard
# BROKEN on GitHub — \\ stripped by pre-processor:

$$
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
$$

# CORRECT on GitHub:

```math
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
```

### Display Block Formatting Rules

- `$$` must be on its **own line** — not `$$formula$$` on one line
- **Blank line required** before AND after every `$$` block
- **Blank line required between consecutive** `$$` blocks
- These rules do NOT apply to `` ```math ``` `` blocks

### Supported/Unsupported LaTeX

See [references/github-math-support-table.md](./references/github-math-support-table.md) for the full table.

**Key things to avoid**:

| Command | Problem | Fix |
|---------|---------|-----|
| `\begin{align}` | ❌ Not supported by GitHub | Use `\begin{aligned}` |
| `\boxed{}` | ⚠️ Can cause raw LaTeX passthrough | Remove or use bold text |
| `\operatorname{}` | ⚠️ Active GitHub bug, inconsistent | Use `\text{}` or `\mathrm{}` |
| `\newcommand` | ❌ Was briefly available, then pulled | Expand all macros inline |
| `x^_y` | Superscript immediately before subscript | Write `x^{*}_{i}` with braces |

### Common Gotchas

- `\\[8pt]` vertical spacing inside `$$` → eaten by pre-processor → move to `` ```math ``` ``
- `\frac{1}{T}:\left(` → spurious colon after fraction → remove colon
- Pearson vs excess kurtosis: most finance formulas need Pearson (γ₄ = 3 for Gaussian), not excess. **Always document the kurtosis convention in the formula comment.**
- `\begin{pmatrix}` with `\\` → must use `` ```math ``` ``
- `\begin{cases}` with multiple rows → must use `` ```math ``` ``

---

## GitLab: No Workarounds Needed

**Empirically verified 2026-03-15** on GitLab CE 18.9.2. Confirmed by Comrak source code analysis.

GitLab uses the **Comrak** Rust parser with `math_dollars: true`. When Comrak encounters `$$`, it calls `handle_dollars` which slices the raw input buffer directly and stores it as a `NodeMath` AST node — CommonMark's backslash handler is never invoked on math content. The raw LaTeX is passed to KaTeX via `<span data-math-style="display/inline">` unchanged.

**Every GitHub workaround is unnecessary on GitLab:**

| GitHub problem | GitHub fix required | GitLab |
|----------------|---------------------|--------|
| `\\` in `$$` stripped → broken multiline | Use ` ```math ``` ` | `$$` works with `\\` |
| `\left\{` → `\left{` (delimiter error) | Use `\left\lbrace` | `\left\{` works |
| `\{...\}` set notation → invisible braces | Use `\lbrace...\rbrace` | `\{...\}` works |
| `\,` in `$$` → literal comma | Remove `\,` | `\,` works |
| `\,` in inline `$` → literal comma | Remove `\,` | `\,` works |

On GitLab you can write standard LaTeX without any platform-specific workarounds. If you're targeting GitLab (or hosting your own GitLab CE), skip all the `\lbrace`/`\rbrace` substitutions and ` ```math ``` ` conversions — plain `$$` with standard LaTeX is correct.

### GitLab.com Has a Hard 50-Span Per-Page Limit

**GitLab.com (SaaS) enforces a limit of 50 total math spans per page** (display + inline combined). After the 50th span, all subsequent equations silently fall back to raw LaTeX text. This limit exists to prevent DoS attacks and cannot be overridden on GitLab.com.

| Document math density | gitlab.com | Self-hosted CE |
|---|---|---|
| ≤ 50 total spans | ✅ Renders fully | ✅ |
| 51–100 spans | ⚠️ Partial render | ✅ |
| 100+ spans (academic papers) | ❌ Most equations raw text | ✅ Disable with `math_rendering_limits_enabled: false` |

**Validated on**: Sharpe ratio paper (341 spans) — breaks at span 51 on gitlab.com, renders fully on local CE.

**The W6 check in `validate-math.mjs`** warns when a file exceeds the limit.

**Summary: which platform to use**:
- **GitHub.com**: No math span limit. Use `\lbrace`/`\rbrace` workarounds (handled by `--fix`).
- **Self-hosted GitLab CE**: No limit (disable math_rendering_limits_enabled). No workarounds needed.
- **GitLab.com**: Only suitable for documents with ≤ 50 math spans.

### Self-hosting GitLab CE for Math-Heavy Documents

GitLab CE is free and runs on a single machine. On a 61 GB workstation with slim config:
- Memory footprint: ~3 GB (`puma['worker_processes'] = 2`, `sidekiq['concurrency'] = 5`, monitoring disabled)
- Push mirroring to GitHub: free on CE (syncs within 5 min)
- `glab` CLI: first-party, comparable to `gh`

```yaml
# docker-compose.yml — slim GitLab CE
services:
  gitlab:
    image: gitlab/gitlab-ce:latest
    restart: unless-stopped
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        external_url 'http://YOUR_IP:8929'
        puma['worker_processes'] = 2
        sidekiq['concurrency'] = 5
        prometheus_monitoring['enable'] = false
        alertmanager['enable'] = false
        node_exporter['enable'] = false
        redis_exporter['enable'] = false
        postgres_exporter['enable'] = false
        gitlab_exporter['enable'] = false
    ports: ["8929:8929", "8922:22"]
    volumes:
      - /srv/gitlab/config:/etc/gitlab
      - /srv/gitlab/logs:/var/log/gitlab
      - /srv/gitlab/data:/var/opt/gitlab
```

---

## Validation Pipeline

### Step 1: Install KaTeX Validator

```bash
bun add -g katex   # Bun-first per project policy
# or: npm install -g katex

Step 2: Run Before Every Push

# Validate only (exit 1 on errors)
node references/validate-math.mjs your-file.md

# Validate + auto-fix correctable issues
node references/validate-math.mjs your-file.md --fix

The script is at references/validate-math.mjs. It runs two layers:

Layer 1 — KaTeX syntax: parse errors in $, $$, ```math ``` blocks Layer 2 — GFM structural (issues KaTeX passes but GitHub breaks):

Code Severity Issue Auto-fix
E0 Error \! \, \; \{ \} in $$ block — pre-processor strips backslash → parse error cascade ✅ spacing removed; \{\lbrace
E0b Warning \{ \} \, in inline $...$ — invisible braces or literal commas in prose ✅ → \lbrace/\rbrace; \, removed
E1 Error $$ block with \\ — GitHub pre-processor strips backslashes ✅ → ```math ```
E2 Error Consecutive $$ blocks without blank line — orphaned delimiter cascade ✅ add blank line
W1 Warning Bare ^* in $$ or $ block — markdown italic pairing eats the * ✅ → ^{\ast}
W2 Warning \begin{align} — not supported on GitHub ✗ manual
W3 Warning \boxed{} — can cause raw LaTeX passthrough ✗ manual
W4 Warning \operatorname{} — inconsistent GitHub support ✗ manual

E0 is the most dangerous: a single failing $$ block exposes its $$ delimiters as literal text, creating an orphaned $ that shifts ALL subsequent inline $...$ pairings. One broken equation takes down the entire document.

\{/\} trap: In $$ blocks, \left\{ becomes \left{ (invalid KaTeX delimiter → "Missing or unrecognized delimiter") and \{...\} set notation becomes invisible grouping. Fix: use \lbrace/\rbrace (letter-based, CommonMark-immune). This affects every equation using set notation like \{\hat{SR}_k\} or \min_T\left\{...\right\}.

Exits code 1 on errors (CI-friendly). Warnings do not block CI but should be reviewed.

Local Preview Tools

# GitHub-accurate hot-reload preview
bun add -g @hyrious/gfm
gfm your-file.md --serve

# Offline binary (gh extension)
gh extension install thiagokokada/gh-gfm-preview
gh gfm-preview your-file.md

VS Code extensions:

  • shd101wyy.markdown-preview-enhanced — closest to GitHub rendering
  • bierner.markdown-preview-github-styles — GitHub CSS styling

Multi-Agent Adversarial Equation Validation

For papers with 10+ equations, use this multi-agent pattern:

Phase 1 — Parallel Extraction

  • Agent A: Extract prose with pymupdf4llm, transcribe math from PDF screenshots
  • Agent B: Extract and categorize all images

Phase 2 — Parallel Validation

  • Agent C: Validate equations against reference implementation (if code/repo exists)
  • Agent D: Numerical spot-checks — compute paper's exhibit values, compare

Phase 3 — Discrepancy Handling

  • For each discrepancy: write /tmp/paper-discrepancy/eq-{N}.md
  • Spawn resolver agents to search online for authoritative third-party sources
  • Authority rule: Paper is tentatively more authoritative than code implementation; a third independent source breaks ties

Phase 4 — Guarded Application

  • Apply only HIGH-confidence fixes to the markdown
  • For MEDIUM-confidence: spawn an independent audit agent before touching the file
  • Document all discrepancies even if not fixed — future readers need to know

Anti-Patterns

Anti-pattern Why it fails Fix
\!\left( or \, in $$ blocks GH pre-processor strips \!! before KaTeX — !\left( crashes KaTeX, cascades all Remove \! \, \; (spacing only) — or use ```math ```
\left\{ or \{...\} in $$/$ blocks \{{ (CommonMark escape), so \left\{\left{ = "Missing delimiter" error, and \{x\} renders without visible braces Replace with \left\lbrace, \right\rbrace, \lbrace, \rbrace
$$\begin{aligned}...\\...\end{aligned}$$ \\ stripped by GH pre-processor Use ```math ```
Trusting marker-pdf on Word PDFs Returns no output or zero math (Unicode bug) Read as screenshots, transcribe manually
\begin{align} in display math Not supported by GitHub Replace with \begin{aligned}
\operatorname{Cov} Active GH bug — sometimes renders raw Use \text{Cov} or \mathrm{Cov}
KaTeX validation only, no ```math ``` conversion KaTeX passes but GH pre-processor still breaks \\ Also convert ALL multi-line blocks
\boxed{} for highlighting Can cause raw LaTeX passthrough on GitHub Use bold text or a blockquote callout
Excess kurtosis in formulas expecting Pearson Silent ~50% underestimate in variance formulas Always document convention; use scipy.stats.kurtosis(fisher=False)
Consecutive $$ blocks without blank lines GitHub collapses them into one broken block Add blank line between each block
Running validation AFTER pushing Bugs visible in public repo Validate locally before every push (--fix auto-corrects E0/E1/E2)

References

File Purpose
validate-math.mjs KaTeX batch validator for GFM files
pdf-type-detection.md Detailed guide to detecting PDF type
github-math-support-table.md Full supported/unsupported LaTeX table

Related Skills

Skill Relationship
pandoc-pdf-generation Opposite direction: markdown → PDF
documentation-standards GFM formatting standards
quant-research:opendeviation-eval-metrics Worked example: references/how-to-use-the-sharpe-ratio-2026.md

Post-Execution Reflection

After this skill completes, reflect before closing the task:

  1. Locate yourself. — Find this SKILL.md's canonical path before editing.
  2. What failed? — Fix the instruction that caused it.
  3. What worked better than expected? — Promote to recommended practice.
  4. What drifted? — Fix any script, reference, or dependency that no longer matches reality.
  5. Log it. — Evolution-log entry with trigger, fix, and evidence.

Do NOT defer. The next invocation inherits whatever you leave behind.

Weekly Installs
29
GitHub Stars
38
First Seen
3 days ago