MinerU PDF Converter

Convert PDFs and other documents to high-quality Markdown using the MinerU cloud API. Large PDFs (>600 pages) are handled automatically by splitting, converting, and merging.

Capabilities

Convert PDF, images, and other documents to Markdown/LaTeX
Preserve formulas, tables, and complex layouts
Auto-upload local files via MinerU batch API
Auto-split large PDFs (>600 pages) and merge results
Support additional output formats: LaTeX, DOCX, HTML

Requirements

Python packages:

pip install requests pymupdf

API token: stored at ~/.claude/skills/mineru-pdf-converter/references/mineru-token.md

Quick Start

Basic Conversion

python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --input "/path/to/document.pdf" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md"

Convert from URL

python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --url "https://example.com/paper.pdf" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md"

Verbose Mode with Progress

python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --input "/path/to/document.pdf" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md" \
  --verbose

Output shows progress percentage during conversion:

Uploading file: /path/to/document.pdf
File uploaded, batch_id: abc123
Waiting for conversion...
Status: running (25/100 pages, 25.0%)
Status: running (50/100 pages, 50.0%)
Status: done
Downloading result...
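
The progress lines above suggest a poll-until-done loop bounded by --timeout. The following is a minimal sketch of such a loop, not the script's actual implementation: the get_status callable and the state, done_pages, and total_pages keys are hypothetical names chosen for illustration, and the real MinerU response fields may differ.

```python
import time
from typing import Callable, Dict


def wait_for_result(get_status: Callable[[], Dict], timeout: float = 600,
                    interval: float = 5.0, sleep=time.sleep,
                    clock=time.monotonic) -> Dict:
    """Poll get_status() until it reports 'done' or the timeout elapses.

    The status keys used here are assumptions, not the MinerU schema.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        state = status.get("state")
        if state == "done":
            print("Status: done")
            return status
        if state == "failed":
            raise RuntimeError(f"Conversion failed: {status.get('error', 'unknown')}")
        done = status.get("done_pages", 0)
        total = status.get("total_pages", 0)
        if total:
            # Same shape as the sample output: "Status: running (25/100 pages, 25.0%)"
            print(f"Status: {state} ({done}/{total} pages, {done / total * 100:.1f}%)")
        else:
            print(f"Status: {state}")
        if clock() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        sleep(interval)
```

Passing sleep and clock as parameters keeps the loop easy to exercise without real waiting.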

Additional Formats

# Include LaTeX output
python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --input "/path/to/document.pdf" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md" \
  --extra-formats "latex"

Workflow

When the user requests a PDF conversion:

Identify input type
- Local file path: Use batch upload API to get temporary URL
- URL: Submit directly to conversion API
Check page count (for PDFs)
- If >600 pages: Split into 500-page chunks using PyMuPDF
- Process each chunk separately
- Merge final Markdown output

Execute conversion

python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --input "[path]" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md"

Report result
- Confirm output path (subfolder named after input file by default)
- Note any warnings or partial failures
- Provide path to main .md file
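
The first workflow step (choosing between --input and --url) could be sketched as follows. This is an illustration under stated assumptions: classify_input and build_command are hypothetical helper names, and only the documented --input, --url, and --token-file flags are used.

```python
from pathlib import Path
from urllib.parse import urlparse

# Script location as documented in Quick Start.
SCRIPT = Path.home() / ".claude/skills/mineru-pdf-converter/scripts/mineru_convert.py"


def classify_input(source: str) -> str:
    """Return 'url' for http(s) links and 'local' for everything else."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "local"


def build_command(source: str, token_file: str, script: Path = SCRIPT) -> list:
    """Build an argument list suitable for subprocess.run()."""
    flag = "--url" if classify_input(source) == "url" else "--input"
    return ["python", str(script), flag, source, "--token-file", str(token_file)]
```

Building an argument list rather than a shell string avoids quoting problems with paths that contain spaces.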

Parameters Reference

Parameter	Default	Description
`--input`	-	Local file path (mutually exclusive with --url)
`--url`	-	Remote file URL (mutually exclusive with --input)
`--token-file`	-	Path to token file (required)
`--model`	vlm	Model: pipeline, vlm, MinerU-HTML
`--language`	ch	Document language
`--extra-formats`	[]	Additional formats: latex, docx, html
`--output-dir`	(source dir/filename)	Override output directory (skips subfolder creation)
`--enable-formula`	true	Enable formula recognition
`--enable-table`	true	Enable table recognition
`--page-ranges`	-	Page ranges to convert (e.g., "1-100,150-200") - see note below
`--timeout`	600	Max wait time in seconds

Page Ranges

The --page-ranges parameter allows you to convert only specific pages:

# Convert pages 1-50 and 100-150
python ~/.claude/skills/mineru-pdf-converter/scripts/mineru_convert.py \
  --input "/path/to/document.pdf" \
  --page-ranges "1-50,100-150" \
  --token-file "$HOME/.claude/skills/mineru-pdf-converter/references/mineru-token.md"

How it works:

Local files (--input): Pages are extracted client-side using PyMuPDF before upload
URL input (--url): Page ranges sent to MinerU API server-side

Page ranges therefore work for both local files and URLs.
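
For the client-side case, a range string such as "1-50,100-150" has to be expanded into page indices before extraction. Below is a minimal sketch of that parsing step. It assumes ranges are 1-based and inclusive, as the examples suggest; parse_page_ranges is a hypothetical name and not necessarily how the script does it.

```python
from typing import List


def parse_page_ranges(spec: str, page_count: int) -> List[int]:
    """Expand '1-50,100-150' into sorted, de-duplicated zero-based indices.

    Assumes 1-based inclusive ranges; pages past the end are dropped.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start  # "7" means the single page 7
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        # Convert to zero-based indices and clamp to the document length.
        pages.update(range(start - 1, min(end, page_count)))
    return sorted(pages)
```

The resulting list is in the form that PyMuPDF's Document.select() accepts, which is one way the extraction could then be done.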

Large PDF Handling

PDFs over 600 pages are processed automatically in four steps:

The PDF is split into chunks of at most 500 pages using PyMuPDF
Each chunk is converted separately via the API
The output Markdown files are merged in order, with page markers
Temporary chunk files are cleaned up
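
The split and merge steps above can be sketched in plain Python. This is an illustration only: chunk_ranges and merge_markdown are hypothetical names, and the HTML-comment page marker is an assumed format, since the marker the script actually writes is not documented here.

```python
from typing import List, Tuple


def chunk_ranges(page_count: int, threshold: int = 600,
                 chunk_size: int = 500) -> List[Tuple[int, int]]:
    """Return inclusive zero-based (first, last) page pairs, one per chunk."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if page_count <= threshold:
        return [(0, page_count - 1)]  # small enough: convert in one piece
    return [(start, min(start + chunk_size, page_count) - 1)
            for start in range(0, page_count, chunk_size)]


def merge_markdown(chunks: List[str], ranges: List[Tuple[int, int]]) -> str:
    """Join per-chunk Markdown in order, prefixing each with a page marker."""
    parts = []
    for text, (first, last) in zip(chunks, ranges):
        # Marker format is an assumption; pages are shown 1-based for readers.
        parts.append(f"<!-- pages {first + 1}-{last + 1} -->\n\n{text.strip()}\n")
    return "\n".join(parts)
```

With PyMuPDF, each (first, last) pair maps directly onto insert_pdf(src, from_page=first, to_page=last) on a new empty document, which is then saved as the temporary chunk file.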

To handle large PDFs, ensure PyMuPDF is installed:

pip install pymupdf

Model Selection

Model	Best For	Notes
vlm (default)	Complex layouts, formulas, tables	Higher accuracy, handles scanned documents
pipeline	Simple text documents	Faster processing, lower resource usage
MinerU-HTML	HTML output needed	Specialized for HTML output format

Error Handling

Error	Cause	Solution
Auth failed (401)	Invalid or expired token	Update the token in references/mineru-token.md
Task timeout	Large file or slow server	Increase --timeout; retry later
Conversion failed	Unsupported format or corrupted file	Try pipeline model as fallback
Upload failed (413)	File >200MB	Split file manually first
Rate limit (429)	Exceeded 2000 pages/day quota	Wait until next day

Output Structure

The conversion produces a ZIP file that is extracted to a subfolder named after the input file. This prevents naming conflicts when converting multiple PDFs in the same directory.

Default behavior (no --output-dir specified):

For input /path/to/paper.pdf, output is saved to /path/to/paper/:

/path/to/paper/
├── full.md               # Main Markdown file
├── images/               # Extracted images
│   ├── image_1.png
│   └── image_2.png
└── paper.json            # Structured content (optional)

With --output-dir specified:

When --output-dir /custom/path is provided, files are extracted directly to that directory (no subfolder created):

/custom/path/
├── full.md
├── images/
│   └── ...
└── paper.json
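
The two output-location rules above amount to a small path computation. A minimal sketch, assuming the subfolder name is the input file name without its extension (resolve_output_dir is a hypothetical helper name):

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(input_path: str, output_dir: Optional[str] = None) -> Path:
    """Return the directory the result ZIP would be extracted into."""
    if output_dir:
        # --output-dir given: extract directly there, no subfolder.
        return Path(output_dir)
    source = Path(input_path)
    # Default: sibling folder named after the input file, minus its extension.
    return source.parent / source.stem
```

For /path/to/paper.pdf this yields /path/to/paper, with full.md expected directly inside it.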

API Quota

High priority: 2000 pages/day
Low priority: Additional capacity (slower processing)
Check remaining quota in the API response

Supporting Files

scripts/mineru_convert.py - Main conversion orchestrator
scripts/pdf_splitter.py - PDF splitting utility (PyMuPDF)
scripts/merge_markdown.py - Output merger for chunked conversions
references/api-reference.md - Full MinerU API documentation

Troubleshooting

Token Expired

The JWT token has an expiration date. If authentication fails:

Log in to mineru.net
Get a new API token
Update ~/.claude/skills/mineru-pdf-converter/references/mineru-token.md

Conversion Hangs

For very large or complex documents:

Increase timeout: --timeout 1200
Use page ranges to convert in sections: --page-ranges "1-100"
Try the pipeline model: --model pipeline

Missing Formulas or Tables

Ensure recognition is enabled (default):

--enable-formula true
--enable-table true

For detailed API reference, see references/api-reference.md.