rag-parse Skill
Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally. This skill uses LiteParse (the `lit` CLI) under the hood: it is fast and lightweight, and needs no cloud services or LLM.
Initial Setup
When this skill is invoked, respond with:
I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:
- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal
If both are set, please provide:
1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.
I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.
Then wait for the user's input.
Step 0 — Install LiteParse (if needed)
If LiteParse is not yet installed, install it globally:
npm i -g @llamaindex/liteparse
Verify installation:
lit --version
For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
apt-get install libreoffice
For image parsing, ImageMagick is required:
# macOS
brew install imagemagick
# Ubuntu/Debian
apt-get install imagemagick
Step 1 — Produce the CLI Command or Script
Parse a Single File
# Basic text extraction
lit parse document.pdf
# JSON output saved to a file
lit parse document.pdf --format json -o output.json
# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"
# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr
# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr
# Higher DPI for better quality
lit parse document.pdf --dpi 300
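When these commands are generated by a script rather than typed by hand, it can help to assemble the arguments in one place. The TypeScript sketch below builds the argument list for `lit parse` using only the flags shown above. The `ParseOptions` interface and the `buildParseArgs` function are hypothetical names introduced for illustration; they are not part of LiteParse.

```typescript
// Hypothetical option bag; the field names are illustrative, not a LiteParse API.
interface ParseOptions {
  format?: "json" | "text";
  outputFile?: string;
  targetPages?: string; // e.g. "1-5,10,15-20"
  ocr?: boolean; // false adds --no-ocr
  ocrServerUrl?: string;
  dpi?: number;
}

// Build the argument vector for `lit parse` from the flags documented above.
function buildParseArgs(file: string, opts: ParseOptions = {}): string[] {
  const args: string[] = ["parse", file];
  if (opts.format) args.push("--format", opts.format);
  if (opts.outputFile) args.push("-o", opts.outputFile);
  if (opts.targetPages) args.push("--target-pages", opts.targetPages);
  if (opts.ocr === false) args.push("--no-ocr");
  if (opts.ocrServerUrl) args.push("--ocr-server-url", opts.ocrServerUrl);
  if (opts.dpi !== undefined) args.push("--dpi", String(opts.dpi));
  return args;
}

const exampleArgs = buildParseArgs("document.pdf", {
  format: "json",
  outputFile: "output.json",
});
console.log(["lit", ...exampleArgs].join(" "));
// prints: lit parse document.pdf --format json -o output.json
```

The returned array could be handed to Node's `child_process.execFile("lit", args)`, which sidesteps shell quoting problems with unusual file names.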
Batch Parse a Directory
lit batch-parse ./input-directory ./output-directory
# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
Generate Page Screenshots
Screenshots are useful for LLM agents that need to see visual layout.
# All pages
lit screenshot document.pdf -o ./screenshots
# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots
# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
Step 2 — Key Options Reference
OCR Options
| Option | Description |
|---|---|
| (default) | Tesseract.js: zero setup, built-in |
| `--ocr-language fra` | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |
Output Options
| Option | Description |
|---|---|
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |
Performance / Quality Options
| Option | Description |
|---|---|
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `--max-pages <n>` | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. "1-5,10") |
| `--no-precise-bbox` | Disable precise bounding boxes (faster) |
| `--skip-diagonal-text` | Ignore rotated/diagonal text |
| `--preserve-small-text` | Keep very small text that would otherwise be dropped |
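Page specifications such as "1-5,10" are easy to mistype, so a script may want to check them before calling `lit`. The sketch below expands such a string into explicit page numbers. It assumes 1-based pages and inclusive ranges, which is how the examples in this document read; LiteParse's own parser may accept or reject inputs differently. `expandPageSpec` is a hypothetical helper, not a LiteParse function.

```typescript
// Expand a page spec such as "1-5,10,15-20" into a sorted list of page numbers.
// Assumption: pages are 1-based and ranges are inclusive.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    // A token is either a single page ("10") or a range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page token: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  // The Set removes duplicates from overlapping ranges; sort numerically.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-5,10").join(","));
// prints: 1,2,3,4,5,10
```

Failing early on a malformed specification gives a clearer error than whatever the CLI would report after it has started rendering pages.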
Step 4 — Using a Config File
For repeated use with consistent options, generate a liteparse.config.json:
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}
For an HTTP OCR server:
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
Use with:
lit parse document.pdf --config liteparse.config.json
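If the config file is generated from a TypeScript script, a typed shape catches misspelled keys at compile time. The sketch below models only the keys shown in the two examples above. The `LiteParseConfig` name, the `renderConfig` helper, and the sanity checks on `dpi` and `maxPages` are assumptions made for illustration; they are not taken from LiteParse itself.

```typescript
// Keys limited to those shown in this document; all are treated as optional.
// The interface name is hypothetical.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  skipDiagonalText?: boolean;
  preserveVerySmallText?: boolean;
}

// Serialize a config after basic sanity checks (the checks are assumptions,
// not rules documented by LiteParse).
function renderConfig(config: LiteParseConfig): string {
  if (config.dpi !== undefined && !(config.dpi > 0)) {
    throw new Error("dpi must be a positive number");
  }
  if (config.maxPages !== undefined && !(config.maxPages > 0)) {
    throw new Error("maxPages must be a positive number");
  }
  return JSON.stringify(config, null, 2);
}

const configJson = renderConfig({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
  outputFormat: "json",
});
console.log(configJson);
```

The resulting string can be written to `liteparse.config.json` (for example with Node's `fs.writeFileSync`) and then passed to `lit parse` via `--config` as shown above.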
Step 5 — HTTP OCR Server API (Advanced)
If the user wants to plug in a custom OCR backend, the server must implement:
- Endpoint: `POST /ocr`
- Accepts: `file` (multipart) and `language` (string) parameters
- Returns:
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}
Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
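When writing or testing a custom backend, it may be useful to check that a response body matches the shape above before pointing `lit` at the server. The sketch below is a TypeScript type guard for that shape. The type names and the guard are hypothetical, and the check assumes `bbox` is an array of exactly four numbers and `confidence` is a number, as the example response suggests.

```typescript
// Response shape as described above; type names are illustrative.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  const bbox = r["bbox"];
  return (
    typeof r["text"] === "string" &&
    Array.isArray(bbox) &&
    bbox.length === 4 &&
    bbox.every((n) => typeof n === "number") &&
    typeof r["confidence"] === "number"
  );
}

// True only if the body has a `results` array whose items all match OcrResult.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as Record<string, unknown>)["results"];
  return Array.isArray(results) && results.every(isOcrResult);
}

const sample: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
console.log(isOcrResponse(sample)); // prints: true
```

The coordinates in `sample` are made-up values; the document does not state the units or origin of `bbox`.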
Supported Input Formats
| Category | Formats |
|---|---|
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |
Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
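Before parsing, a script can tell the user which external tool a given file will need. The sketch below maps a file name to its dependency using the table above. It assumes that every format listed under Word, PowerPoint, and Spreadsheets goes through LibreOffice and every listed image format goes through ImageMagick; `requiredDependency` is a hypothetical helper, not part of LiteParse.

```typescript
type Dependency = "none" | "LibreOffice" | "ImageMagick";

// Extensions copied from the Supported Input Formats table.
const OFFICE_EXTENSIONS = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];
const IMAGE_EXTENSIONS = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Return the external tool needed for a file, or undefined if the
// extension is not in the table.
function requiredDependency(fileName: string): Dependency | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = fileName.slice(dot).toLowerCase();
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "LibreOffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "ImageMagick";
  return undefined;
}

console.log(requiredDependency("report.DOCX")); // prints: LibreOffice
console.log(requiredDependency("scan.png")); // prints: ImageMagick
```

Returning `undefined` for unlisted extensions keeps the helper from guessing about formats this document does not mention.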