skills/aidenwu0209/paddleocr-skills/paddleocr-doc-parsing

paddleocr-doc-parsing

SKILL.md

PaddleOCR Document Parsing Skill

When to Use This Skill

Use Document Parsing for:

  • Documents with tables (invoices, financial reports, spreadsheets)
  • Documents with mathematical formulas (academic papers, scientific documents)
  • Documents with charts and diagrams
  • Multi-column layouts (newspapers, magazines, brochures)
  • Complex document structures requiring layout analysis
  • Any document requiring structured understanding

Use Text Recognition instead for:

  • Simple text-only extraction
  • Quick OCR tasks where speed is critical
  • Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

  1. ONLY use PaddleOCR Document Parsing API - Execute the script python scripts/vl_caller.py
  2. NEVER use Claude's built-in vision - Do NOT parse documents yourself
  3. NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
  4. IF API fails - Display the error message and STOP immediately
  5. NO fallback methods - Do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):

  • Show the error message to the user
  • Do NOT offer to help using your vision capabilities
  • Do NOT ask "Would you like me to try parsing it?"
  • Simply stop and wait for user to fix the configuration

Basic Workflow

  1. Execute document parsing:

    python scripts/vl_caller.py --file-url "URL provided by user"
    

    Or for local files:

    python scripts/vl_caller.py --file-path "file path"
    

    Optional: explicitly set file type:

    python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0
    
    • --file-type 0: PDF
    • --file-type 1: image
    • If omitted, the service can infer file type from input.

    Save result to file (recommended):

    python scripts/vl_caller.py --file-url "URL" --output result.json --pretty
    
    • The script will display: Result saved to: /absolute/path/to/result.json
    • This message appears on stderr, the JSON is saved to the file
    • Tell the user the file path shown in the message
  2. The script returns COMPLETE JSON with all document content:

    • Headers, footers, page numbers
    • Main text content
    • Tables with structure
    • Formulas (with LaTeX)
    • Figures and charts
    • Footnotes and references
    • Seals and stamps
    • Layout and reading order

    Note: The actual content types that can be parsed depend on the model configured at your API endpoint (PADDLEOCR_DOC_PARSING_API_URL). The list above represents the maximum set of supported types.

  3. Extract what the user needs from the complete data based on their request.

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

  • The script returns ALL document content in a structured format
  • Display the full content requested by the user, do NOT truncate or summarize
  • If user asks for "all text", show the entire text field
  • If user asks for "tables", show ALL tables in the document
  • If user asks for "main content", filter out headers/footers but show ALL body text

What this means:

  • DO: Display complete text, all tables, all formulas as requested
  • DO: Present content in the order provided by the API
  • DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
  • DON'T: Summarize or provide excerpts when user asks for full content
  • DON'T: Say "Here's a preview" when user expects complete output

Example - Correct:

User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:

[Display the entire text field]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)

Example - Incorrect ❌:

User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The script returns a JSON envelope wrapping the raw API result:

{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": [
    {
      "prunedResult": { ... },  // layout element positions, content, confidence
      "markdown": {
        "text": "Full page content in markdown/HTML format",
        "images": { ... }
      }
    }
  ],
  "error": null
}

Key fields:

  • text — extracted markdown text from all pages (use this for quick text display)
  • result — raw API result array (one object per page)
  • result[n].prunedResult — layout element positions, content, and confidence scores
  • result[n].markdown — full page content in markdown/HTML format

Content Extraction Guidelines

User Says What to Extract How
"Extract all text" Everything Use text field directly
"Get all tables" Tables only Look for <table> in the markdown text
"Show main content" Main body text Use text field, filter as needed
"Complete document" Everything Use text field

Usage Examples

Example 1: Extract Main Content (default behavior)

python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty

Then use the text field for main content display.

Example 2: Extract Tables Only

python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty

Then look for <table> content in the result to extract tables.

Example 3: Complete Document with Everything

python scripts/vl_caller.py \
  --file-url "URL" \
  --pretty

Then use the text field or iterate the full result.

First-Time Configuration

When API is not configured:

The error will show:

Configuration error: API not configured. Get your API at: https://paddleocr.com

Configuration workflow:

  1. Show the exact error message to user (including the URL)

  2. Tell user to provide credentials:

    Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN.
    Once you have them, send them to me and I'll configure it automatically.
    
  3. When user provides credentials (accept any format):

    • PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
    • Here's my API: https://xxx and token: abc123
    • Copy-pasted code format
    • Any other reasonable format
  4. Parse credentials from user's message:

    • Extract PADDLEOCR_DOC_PARSING_API_URL value (look for URLs with paddleocr.com or similar)
    • Extract PADDLEOCR_ACCESS_TOKEN value (long alphanumeric string, usually 40+ chars)
  5. Configure automatically:

    python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
    
  6. If configuration succeeds:

    • Inform user: "Configuration complete! Parsing document now..."
    • Retry the original parsing task
  7. If configuration fails:

    • Show the error
    • Ask user to verify the credentials

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.

Handling Large Files

There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.

Tips for large files:

Use URL for Large Local Files (Recommended)

For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:

python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"

Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

# Using pypdfium2 (requires: pip install pypdfium2)
python -c "
import pypdfium2 as pdfium
doc = pdfium.PdfDocument('large.pdf')
# Extract pages 0-4 (first 5 pages)
new_doc = pdfium.PdfDocument.new()
for i in range(min(5, len(doc))):
    new_doc.import_pages(doc, [i])
new_doc.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"

Error Handling

Authentication failed (403):

error: Authentication failed

→ Token is invalid, reconfigure with correct credentials

API quota exceeded (429):

error: API quota exceeded

→ Daily API quota exhausted, inform user to wait or upgrade

Unsupported format:

error: Unsupported file format

→ File format not supported, convert to PDF/PNG/JPG

Important Notes

  • The script NEVER filters content - It always returns complete data
  • Claude decides what to present - Based on user's specific request
  • All data is always available - Can be re-interpreted for different needs
  • No information is lost - Complete document structure preserved

Reference Documentation

For in-depth understanding of the PaddleOCR Document Parsing system, refer to:

  • references/output_schema.md - Output format specification
  • references/provider_api.md - Provider API contract

Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).

Load these reference documents into context when:

  • Debugging complex parsing issues
  • Need to understand output format
  • Working with provider API details

Testing the Skill

To verify the skill is working properly:

python scripts/smoke_test.py

This tests configuration and optionally API connectivity.

Weekly Installs
51
GitHub Stars
5
First Seen
Feb 9, 2026
Installed on
gemini-cli49
opencode49
codex48
github-copilot47
kimi-cli46
amp46