mineru-parser
MinerU Parser
Overview
Parse PDF, DOC, DOCX, PPT, PPTX, and image files into markdown format using the MinerU API. Extract text, tables, formulas, and structured content with support for OCR and VLM models.
Token Setup
REQUIRED: MinerU API requires authentication via API token.
Getting Your Token
- Visit the MinerU website and register/login
- Navigate to your account settings or API section
- Generate or copy your API token
Recommended: Environment Variable
Set the token as an environment variable before starting work:
export MINERU_API_KEY="your_api_token_here"
In Python, retrieve it with:
import os
token = os.environ.get("MINERU_API_KEY")
if not token:
raise ValueError("MINERU_API_KEY environment variable not set")
Alternative: Direct Input
If the environment variable is not set, prompt the user for their token:
import os
token = os.environ.get("MINERU_API_KEY")
if not token:
# Ask user for token
token = input("Please enter your MinerU API token: ")
Security Note: Never hardcode tokens in scripts or commit them to version control.
Workflow
Parsing documents with MinerU involves three steps:
- Submit parsing task - POST request with file URL and parameters
- Check task status - Poll the status endpoint until complete
- Retrieve results - Download markdown/JSON and other formats
Step 1: Submit Parsing Task
Create a parsing task by sending a POST request to the MinerU API:
import requests
import os
# Get API token from environment variable
token = os.environ.get("MINERU_API_KEY")
if not token:
raise ValueError("MINERU_API_KEY environment variable not set. Please set it or provide your token.")
url = "https://mineru.net/api/v4/extract/task"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {token}"
}
data = {
"url": "FILE_URL", # URL to the file to parse
"model_version": "vlm" # Options: "pipeline" (default) or "vlm"
}
response = requests.post(url, headers=headers, json=data)
task_id = response.json()["data"]["task_id"]
Required Parameters
url: File URL (supports .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg)- Authorization header with Bearer token
Optional Parameters
Configure parsing behavior with these parameters:
model_version: "pipeline" (default) or "vlm"is_ocr: Enable OCR (default: false, pipeline only)enable_formula: Formula recognition (default: true, pipeline only)enable_table: Table recognition (default: true, pipeline only)language: Document language (default: "ch")page_ranges: Specify pages e.g., "2,4-6" or "2--2" (from page 2 to second-to-last)extra_formats: Additional output formats from ["docx", "html", "latex"]data_id: Custom identifier for the parsing jobcallback: URL for result notificationsseed: Random string for callback signature verification (required when using callback)
Response
{
"code": 0,
"data": {
"task_id": "a90e6ab6-44f3-4554-b459-b62fe4c6b436"
},
"msg": "ok",
"trace_id": "c876cd60b202f2396de1f9e39a1b0172"
}
Step 2: Check Task Status
Poll the status endpoint to check parsing progress:
import time
status_url = f"https://mineru.net/api/v4/extract/task/{task_id}"
while True:
response = requests.get(status_url, headers=headers)
data = response.json()["data"]
state = data["state"]
if state == "done":
result_url = data["full_zip_url"]
break
elif state == "failed":
error_msg = data["err_msg"]
raise Exception(f"Parsing failed: {error_msg}")
elif state == "running":
progress = data["extract_progress"]
print(f"Progress: {progress['extracted_pages']}/{progress['total_pages']}")
time.sleep(3) # Wait 3 seconds before next check
Task States
pending: Task is queuedrunning: Currently parsingconverting: Converting to output formatsdone: Parsing completefailed: Parsing failed (checkerr_msg)
Step 3: Retrieve Results
Download the results ZIP file:
import requests
import zipfile
from io import BytesIO
# Download ZIP file
zip_response = requests.get(result_url)
zip_file = zipfile.ZipFile(BytesIO(zip_response.content))
# Extract contents
zip_file.extractall("output_directory")
# The ZIP contains:
# - auto/ directory with markdown and JSON files
# - images/ directory with extracted images
# - Additional formats if requested (docx, html, latex)
Output Format
The results ZIP contains:
- auto/[filename].md: Extracted markdown content
- auto/[filename]_content_list.json: Structured content metadata
- images/: Extracted images referenced in markdown
- [filename].docx/html/latex: Additional formats if requested
Limitations
- Max file size: 200MB
- Max pages: 600 pages
- Daily quota: 2000 pages at high priority per account
- GitHub/AWS URLs may timeout due to network restrictions
- Direct file upload not supported (must use URL)
Best Practices
Token Management
- Recommended: Use
MINERU_API_KEYenvironment variable - Always check if token is set before making API calls
- Never hardcode tokens in scripts
- Prompt user for token if environment variable is not set
Polling Strategy
- Use 3-5 second intervals to check task status
- Implement timeout logic for long-running tasks
- Display progress updates to user when available
Error Handling
- Validate API responses before accessing data
- Check response status codes
- Handle failed parsing states gracefully
- Provide clear error messages to users
Example Usage
Complete end-to-end example:
import os
import requests
import time
import zipfile
from io import BytesIO
# Step 0: Get API token
token = os.environ.get("MINERU_API_KEY")
if not token:
print("MINERU_API_KEY not found in environment variables")
token = input("Please enter your MinerU API token: ")
# Step 1: Submit parsing task
url = "https://mineru.net/api/v4/extract/task"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {token}"
}
data = {
"url": "https://example.com/document.pdf",
"model_version": "vlm"
}
response = requests.post(url, headers=headers, json=data)
task_id = response.json()["data"]["task_id"]
print(f"Task submitted: {task_id}")
# Step 2: Poll for completion
status_url = f"https://mineru.net/api/v4/extract/task/{task_id}"
while True:
response = requests.get(status_url, headers=headers)
data = response.json()["data"]
state = data["state"]
if state == "done":
result_url = data["full_zip_url"]
print("Parsing complete!")
break
elif state == "failed":
print(f"Parsing failed: {data['err_msg']}")
exit(1)
elif state == "running":
progress = data.get("extract_progress", {})
if progress:
print(f"Progress: {progress['extracted_pages']}/{progress['total_pages']}")
time.sleep(3)
# Step 3: Download and extract results
zip_response = requests.get(result_url)
zip_file = zipfile.ZipFile(BytesIO(zip_response.content))
zip_file.extractall("output_directory")
print("Results saved to output_directory/")
Advanced configuration:
data = {
"url": "https://example.com/document.pdf",
"model_version": "pipeline",
"is_ocr": True,
"enable_formula": True,
"enable_table": True,
"language": "en",
"page_ranges": "1-10,15,20-25",
"extra_formats": ["docx", "html"]
}