pdf-to-txt

PDF to Text Skill

Convert PDF files to plain text using PyMuPDF4LLM. The tool extracts the text content of a PDF document while preserving reading order and basic formatting.

Features

  • Extract text from any PDF that has a text layer (scanned, image-only PDFs need OCR first)
  • Preserve reading order and structure
  • Support for multi-page documents
  • Optional Markdown output with formatting hints

Usage

Basic Conversion

Convert a PDF to a text file:

python {baseDir}/scripts/convert.py "<pdf_path>"

The output is saved in the same directory as the PDF, under the same name with a .txt extension (for example, paper.pdf becomes paper.txt).
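The default naming rule can be sketched in a few lines. This is a minimal illustration, not the actual code in convert.py; the helper name `default_output_path` is hypothetical, and expanding a leading `~` inside the script is an assumption.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Return the sibling .txt path for a PDF (hypothetical helper).

    expanduser() resolves a leading "~" even when the shell did not,
    which happens when the path is passed inside quotes.
    """
    return Path(pdf_path).expanduser().with_suffix(".txt")


print(default_output_path("/data/reports/paper.pdf"))
```

Using `with_suffix` replaces only the final extension, so a file named `report.v2.pdf` would map to `report.v2.txt`.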

Specify Output Path

python {baseDir}/scripts/convert.py "<pdf_path>" --output ~/Documents/output.txt

Convert with Markdown Formatting

Use the --markdown flag to get Markdown-formatted output with headers, lists, and other formatting hints:

python {baseDir}/scripts/convert.py "<pdf_path>" --markdown

Page Range Selection

Convert only specific pages:

# Convert pages 1-10 only
python {baseDir}/scripts/convert.py "<pdf_path>" --pages 1-10

# Convert a single page
python {baseDir}/scripts/convert.py "<pdf_path>" --pages 5
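One way a `--pages` value could be turned into page indices is sketched below. It assumes the flag is 1-based and inclusive, as the "pages 1-10" comment suggests; the function name `parse_pages` is hypothetical and the real script may parse differently. The conversion to 0-based indices matters because the `pages` parameter of `pymupdf4llm.to_markdown` expects 0-based page numbers.

```python
def parse_pages(spec: str) -> list[int]:
    """Turn "5" or "1-10" (assumed 1-based, inclusive) into 0-based indices."""
    first, sep, last = spec.strip().partition("-")
    start = int(first)
    # A bare number such as "5" selects exactly one page.
    end = int(last) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(start - 1, end))


print(parse_pages("1-10"))
print(parse_pages("5"))
```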

Examples

# Basic conversion
python {baseDir}/scripts/convert.py ~/Documents/paper.pdf
# Output: ~/Documents/paper.txt

# With custom output path
python {baseDir}/scripts/convert.py ~/Documents/paper.pdf --output ~/Notes/paper_content.txt

# Markdown output
python {baseDir}/scripts/convert.py ~/Documents/paper.pdf --markdown --output ~/Notes/paper.md

# Convert first 5 pages only
python {baseDir}/scripts/convert.py ~/Documents/book.pdf --pages 1-5 --output ~/Notes/chapter1.txt
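For reference, the command-line interface used in these examples could be declared with `argparse` roughly as follows. This is a hypothetical reconstruction from the flags documented here (`--output`, `--markdown`, `--pages`); the source of convert.py is not shown, so help strings and defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (assumed layout)."""
    parser = argparse.ArgumentParser(
        prog="convert.py",
        description="Convert a PDF to plain text or Markdown.",
    )
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument("--output", help="output file; defaults to a .txt next to the PDF")
    parser.add_argument("--markdown", action="store_true", help="emit Markdown instead of plain text")
    parser.add_argument("--pages", help='page selection such as "5" or "1-5"')
    return parser


args = build_parser().parse_args(["book.pdf", "--pages", "1-5", "--output", "chapter1.txt"])
print(args.pdf_path, args.pages, args.output, args.markdown)
```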

Output Format

Plain Text (default)

  • Clean text extraction
  • Preserves paragraph breaks
  • Removes decorative formatting

Markdown (--markdown)

  • Headers marked with #
  • Lists preserved with - or *
  • Bold/italic formatting hints where detectable
  • Better for documents with complex structure
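Markdown output of this kind is what `pymupdf4llm.to_markdown` produces. The sketch below shows one way to call it directly; the wrapper name `extract_markdown` and its injectable `to_markdown` argument are hypothetical and exist only so the function can be exercised without the library installed. Whether convert.py calls the library this way is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence


def extract_markdown(
    pdf_path: str,
    pages: Optional[Sequence[int]] = None,
    to_markdown: Optional[Callable[..., str]] = None,
) -> str:
    """Return Markdown for a PDF; `pages` holds 0-based page numbers or None for all."""
    if to_markdown is None:
        # Imported lazily so the module loads without the third-party package.
        import pymupdf4llm

        to_markdown = pymupdf4llm.to_markdown
    selected = list(pages) if pages is not None else None
    return to_markdown(str(Path(pdf_path).expanduser()), pages=selected)
```

With the library installed, `extract_markdown("paper.pdf", pages=[0, 1])` would return the Markdown for the first two pages as a single string.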

Troubleshooting

"No text found in PDF"

  • The PDF may be scanned images without OCR
  • Try using OCR tools first to add a text layer
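Before reaching for OCR, it can help to confirm which pages actually lack a text layer. The following is a minimal sketch using PyMuPDF directly; the function names are hypothetical and this check is not part of convert.py as documented.

```python
def pages_without_text(page_texts: list[str]) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]


def load_page_texts(pdf_path: str) -> list[str]:
    """Extract plain text per page with PyMuPDF (third-party dependency)."""
    import pymupdf  # recent PyMuPDF releases; older ones use `import fitz`

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


print(pages_without_text(["Introduction", "  \n", ""]))
```

If every page is reported, the PDF is most likely a scan, and an OCR pass is needed before conversion.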

Garbled text

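  • The PDF may use custom fonts that lack a proper character encoding
  • Some PDFs store text in unusual ways (for example, without a Unicode mapping for their glyphs), which extraction cannot always decode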
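A rough heuristic for spotting garbled output is to measure how much of the text consists of U+FFFD, the Unicode replacement character that extractors commonly emit for glyphs they cannot map. The sketch below is an illustration only; the function name and the 10% threshold are assumptions, not behavior of this skill.

```python
def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return visible.count("\ufffd") / len(visible)


sample = "ab\ufffd\ufffd"
# Hypothetical threshold: flag text where more than 10% is unmapped.
print(replacement_ratio(sample) > 0.10)
```

A high ratio suggests a font encoding problem rather than a missing text layer, so OCR on the rendered pages may give better results than direct extraction.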

Missing content

  • Complex layouts (multi-column, sidebars) may lose some positioning
  • Forms and interactive elements may not extract cleanly