robust-pdf-read
Robust PDF Text Extraction
Problem
Standard file reading tools (e.g., read_file) often fail to extract text from PDF documents. Instead of returning parsed text, they may return:
- Raw binary data
- Base64-encoded images
- Garbled characters or null bytes
This occurs because PDF is a complex binary format, not plain text. Parsing PDFs with Python libraries such as PyMuPDF may also fail in sandboxed environments because of missing dependencies or environment restrictions.
Solution
Use the pdftotext command-line utility (part of poppler-utils) via run_shell. This tool is commonly pre-installed in Linux environments and reliably extracts text content from PDFs.
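As a rough illustration, the call might look like the Python sketch below. It assumes `subprocess` stands in for the agent's run_shell tool; the function names `build_pdftotext_command` and `extract_pdf_text`, the `-layout` flag, and the 60-second timeout are choices made for this example, not part of the original skill. Passing `-` as the output file asks pdftotext to write to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext invocation (hypothetical helper for this sketch).

    -layout keeps the physical layout of the page as far as possible;
    the trailing "-" sends the extracted text to stdout instead of a file.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_pdf_text(pdf_path: str, timeout: float = 60.0) -> str:
    """Run pdftotext and return the extracted text.

    Raises RuntimeError if pdftotext is not on PATH, and
    subprocess.CalledProcessError if the tool exits with an error.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")

    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
        timeout=timeout,  # assumed limit, adjust for large documents
    )
    # Decode defensively so an odd byte does not abort the whole read.
    return result.stdout.decode("utf-8", errors="replace")
```

From a shell, the equivalent command would be `pdftotext -layout input.pdf -`, where `input.pdf` is a placeholder path.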
Procedure
1. Detect Extraction Failure
When attempting to read a PDF:
- Check the content returned by read_file.
- If the content contains null bytes (\x00), appears as base64, or is clearly binary/garbled, assume standard reading has failed.
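The checks above could be sketched as a small heuristic in Python. This is an illustration under stated assumptions: the function name `looks_like_failed_pdf_read`, the `%PDF-` header check, the 64-character minimum for base64 detection, and the 10% threshold for unprintable characters are all choices made for this example rather than values from the original skill.

```python
import re

# Assumed thresholds for this sketch; tune them for your own data.
MIN_BASE64_LENGTH = 64
MAX_UNPRINTABLE_RATIO = 0.10

_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_failed_pdf_read(content: str) -> bool:
    """Return True if content looks like binary, base64, or garbled output."""
    if not content:
        return False  # nothing to judge; the caller decides what empty means

    # Signal 1: null bytes never appear in properly extracted text.
    if "\x00" in content:
        return True

    # Signal 2 (assumption): raw PDF data begins with the "%PDF-" header.
    if content.lstrip().startswith("%PDF-"):
        return True

    # Signal 3: a long run made up only of base64 alphabet characters.
    compact = "".join(content.split())
    if len(compact) >= MIN_BASE64_LENGTH and _BASE64_RE.fullmatch(compact):
        return True

    # Signal 4: many control or replacement characters suggest garbling.
    unprintable = sum(
        1
        for ch in content
        if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\r\t")
    )
    return unprintable / len(content) > MAX_UNPRINTABLE_RATIO
```

A caller would typically pass the read_file result through this check and fall back to pdftotext only when it returns True.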