MinerU Document Parser
Convert PDF, Word, PPT, and images to clean Markdown using MinerU's VLM engine, with LaTeX formulas, tables, and images all preserved.
Setup
- Get free API token at https://mineru.net/user-center/api-token
export MINERU_TOKEN="your-token-here"
Limits: 2000 pages/day · 200 MB per file · 600 pages per file
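Before starting a long batch it can help to confirm the token is set and each file is under the size limit. The following is a minimal sketch, not part of the MinerU scripts: `preflight` and `MAX_FILE_BYTES` are hypothetical names, and reading "200 MB" as 200 MiB is an assumption. The page limits are not checked here, since counting pages would need a PDF library.

```python
import os
from pathlib import Path

# Assumption: the "200 MB per file" limit is interpreted as 200 MiB.
MAX_FILE_BYTES = 200 * 1024 * 1024


def preflight(path, env=None):
    """Return a list of problems found before submitting `path` (hypothetical helper)."""
    env = os.environ if env is None else env
    problems = []
    # The token is read from the MINERU_TOKEN environment variable.
    if not env.get("MINERU_TOKEN"):
        problems.append("MINERU_TOKEN is not set")
    p = Path(path)
    if not p.is_file():
        problems.append(f"{p} is not a file")
    elif p.stat().st_size > MAX_FILE_BYTES:
        problems.append(f"{p} exceeds the 200 MB per-file limit")
    return problems
```

An empty list means the file passed both checks.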
Supported File Types
| Type | Formats |
|---|---|
| PDF | .pdf (papers, textbooks, scanned docs) |
| Word | .docx (reports, manuscripts) |
| PPT | .pptx (slides, presentations) |
| Image | .jpg, .jpeg, .png (OCR extraction) |
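To preview which files in a folder match the table above, a small filter along these lines may be useful. This is an illustrative sketch: `find_supported` is a hypothetical helper, and both the case-insensitive matching and the non-recursive scan are assumptions rather than documented behavior of the scripts.

```python
from pathlib import Path

# Extensions taken from the Supported File Types table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".jpg", ".jpeg", ".png"}


def find_supported(directory):
    """Return sorted paths directly inside `directory` with a supported extension."""
    return sorted(
        p for p in Path(directory).iterdir()
        # Assumption: extensions are compared case-insensitively.
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```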
Commands
Single File
python3 scripts/mineru_v2.py --file ./document.pdf --output ./output/
Batch Directory with Resume
python3 scripts/mineru_v2.py \
--dir ./docs/ \
--output ./output/ \
--workers 10 \
--resume
Direct to Obsidian
python3 scripts/mineru_v2.py \
--dir ./pdfs/ \
--output "$HOME/Library/Mobile Documents/com~apple~CloudDocs/Obsidian/VaultName/" \
--resume
Chinese Documents
python3 scripts/mineru_v2.py --dir ./papers/ --output ./output/ --language ch
Complex Layouts (Slow but Most Accurate)
python3 scripts/mineru_v2.py --file ./paper.pdf --output ./output/ --model vlm
CLI Options
--dir PATH        Input directory (PDF/Word/PPT/images)
--file PATH       Single file
--output PATH     Output directory (default: ./output/)
--workers N       Concurrent workers (default: 5, max: 15)
--resume          Skip already processed files
--model MODEL     Model version: pipeline | vlm | MinerU-HTML (default: vlm)
--language LANG   Document language: auto | en | ch (default: auto)
--no-formula      Disable formula recognition
--no-table        Disable table extraction
--token TOKEN     API token (overrides MINERU_TOKEN env var)
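When driving the script from another Python program, the options above can be assembled into an argument list. The sketch below uses only the flags listed in this section. `build_command` is a hypothetical wrapper, and treating `--dir` and `--file` as mutually exclusive is an assumption.

```python
def build_command(output, directory=None, file=None, workers=None,
                  resume=False, model=None, language=None):
    """Build an argv list for scripts/mineru_v2.py (hypothetical helper)."""
    # Assumption: exactly one of --dir / --file is expected.
    if (directory is None) == (file is None):
        raise ValueError("pass exactly one of directory or file")
    cmd = ["python3", "scripts/mineru_v2.py"]
    cmd += ["--dir", directory] if directory is not None else ["--file", file]
    cmd += ["--output", output]
    if workers is not None:
        # The documented maximum is 15 workers.
        if not 1 <= workers <= 15:
            raise ValueError("workers must be between 1 and 15")
        cmd += ["--workers", str(workers)]
    if resume:
        cmd.append("--resume")
    if model is not None:
        cmd += ["--model", model]
    if language is not None:
        cmd += ["--language", language]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.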
Model Version Guide
| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| pipeline | ⚡ Fast | High | Standard docs, most use cases |
| vlm | 🐢 Slow | Highest | Complex layouts, multi-column, mixed text+figures |
| MinerU-HTML | ⚡ Fast | High | Web-style output, HTML-ready content |
Script Selection
| Script | Use When |
|---|---|
| mineru_v2.py | Default: async parallel (up to 15 workers) |
| mineru_async.py | Fast network, need maximum throughput |
| mineru_stable.py | Unstable network: sequential, maximum retries |
Output Structure
output/
├── document-name/
│   ├── document-name.md    # Main Markdown
│   ├── images/             # Extracted images
│   └── content.json        # Metadata
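Downstream tools can read a converted document back from this layout. The following is a rough sketch that assumes only the directory structure shown above. `load_result` is a hypothetical helper, and nothing is assumed about the schema of `content.json` beyond it being valid JSON.

```python
import json
from pathlib import Path


def load_result(output_dir, name):
    """Load the Markdown, metadata, and image names for one converted document."""
    doc_dir = Path(output_dir) / name
    markdown = (doc_dir / f"{name}.md").read_text(encoding="utf-8")
    # content.json is parsed as generic JSON; its fields are not documented here.
    metadata = json.loads((doc_dir / "content.json").read_text(encoding="utf-8"))
    images_dir = doc_dir / "images"
    images = sorted(p.name for p in images_dir.iterdir()) if images_dir.is_dir() else []
    return markdown, metadata, images
```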
Performance
| Workers | Speed |
|---|---|
| 1 (sequential) | 1.2 files/min |
| 5 | 3.1 files/min |
| 15 | 5.6 files/min |
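The figures above can be turned into a rough batch-time estimate. This sketch simply divides by the measured rates in the table. The helper names are hypothetical, and real throughput will vary with file size and network conditions.

```python
# Throughput figures copied from the Performance table (files per minute).
MEASURED_FILES_PER_MIN = {1: 1.2, 5: 3.1, 15: 5.6}


def estimate_minutes(num_files, workers):
    """Estimate wall-clock minutes for a batch at a measured worker count."""
    return num_files / MEASURED_FILES_PER_MIN[workers]


def speedup(workers):
    """Throughput relative to the sequential (1 worker) rate."""
    return MEASURED_FILES_PER_MIN[workers] / MEASURED_FILES_PER_MIN[1]
```

By these numbers, 100 files at 5 workers take roughly 32 minutes, and 15 workers give about a 4.7x speedup over sequential processing, so scaling is sublinear.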
Error Handling
- 5x auto-retry with exponential backoff
- Use --resume to continue interrupted batches
- Failed files listed at end of run
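For readers unfamiliar with the retry pattern, here is a generic sketch of exponential backoff. It is not the scripts' actual implementation. `retry_with_backoff` is a hypothetical helper, and the 1-second base delay and the reading of "5x" as five retries after the first attempt are assumptions.

```python
import time


def retry_with_backoff(func, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, retry up to `retries` times, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            # Out of retries: let the last error propagate to the caller.
            if attempt == retries:
                raise
            # Waits of base_delay * 1, 2, 4, 8, 16 for the default settings.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delays can be inspected or skipped in tests.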
API Reference
For detailed API documentation, see references/api_reference.md.
Repository: nebutra/mineru-skill
GitHub Stars: 1
First Seen: Feb 13, 2026