docling-convert

Docling Convert

Use this skill to run document conversion through a local Docling service instead of ad-hoc parsing.

Quick Start

  • Assume the Docling service is already deployed locally and reachable at http://localhost:5001.
  • Prefer scripts/docling_gradio_convert.py for repeatable work. It wraps the documented Gradio API and handles submission, waiting, and archive extraction.
  • Install the required client before running the script:
pip install gradio_client
  • If URL jobs need placeholder image repair and beautifulsoup4 is missing, install it together with lxml:
pip install beautifulsoup4 lxml
  • Read references/gradio-api-workflow.md only when changing endpoints, tuning advanced options, or debugging output layouts.
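
The dependency checks above can be done before running the wrapper. The following is a minimal sketch, not part of the skill itself: the helper name `missing_packages` and the two dictionaries are hypothetical, and the only assumption about the packages is their usual import names (`gradio_client`, `bs4`, `lxml`).

```python
from __future__ import annotations

from importlib.util import find_spec

# pip package name -> top-level module name it installs.
REQUIRED = {"gradio_client": "gradio_client"}
# Only needed when URL jobs require placeholder image repair.
OPTIONAL_URL_REPAIR = {"beautifulsoup4": "bs4", "lxml": "lxml"}


def missing_packages(packages: dict[str, str]) -> list[str]:
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None
    ]


# Print a ready-to-run install hint for each group that has gaps.
for label, group in (("required", REQUIRED), ("URL image repair", OPTIONAL_URL_REPAIR)):
    missing = missing_packages(group)
    if missing:
        print(f"{label}: pip install {' '.join(missing)}")
```

Checking with `find_spec` avoids importing the packages, so the check stays fast and has no side effects.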

Workflow

  1. Classify the inputs. Use the file flow for local paths and the URL flow for web pages. Do not mix files and URLs in one API request; if the user gives both, run two jobs.

  2. Choose the outputs. Default to md. Add json when the user also needs structured output. Add html, text, or doctags only when the task explicitly needs them.

  3. Choose the processing options. Keep pipeline=standard, ocr=true, force_ocr=false, pdf_backend=dlparse_v4, and table_mode=accurate unless the task calls for a change. Keep image_export_mode=embedded when the goal is to preserve extracted images. The wrapper post-processes embedded Markdown images into real files under images/. Turn on enrichment flags only when the user explicitly wants code, formulas, picture classification, or picture descriptions. For URL jobs, the wrapper also normalizes Markdown output by injecting stable front matter, preserving unknown existing front matter keys, and prepending # title only when the document body does not already start with that title.

  4. Run the wrapper script.

# Single file
python scripts/docling_gradio_convert.py report.pdf

# Batch files with Markdown + JSON
python scripts/docling_gradio_convert.py "*.pdf" --to-format md --to-format json

# Single URL
python scripts/docling_gradio_convert.py https://example.com/article --output-dir ./article

# Single URL with optional sidecar files
python scripts/docling_gradio_convert.py https://example.com/article --save-source-html --save-manifest

# Explicit service URL (replace with your own host if it differs from the default)
python scripts/docling_gradio_convert.py slides.pptx --service-url http://localhost:5001

  5. Verify the extracted results. The script always requests return_as_file=true, downloads the returned artifact, extracts it into the chosen output directory, and rewrites embedded Markdown images into local files when needed. For URL conversions it can also backfill Docling image placeholders from the source page. URL Markdown outputs are post-processed after extraction so the final .md contains normalized front matter plus a title heading when needed. Inspect the produced Markdown and any extracted image assets before presenting the result to the user.
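
The classification rule from the first workflow step can be sketched in a few lines. This is an illustration of the grouping rule only, not the wrapper's actual code: the function name `classify_inputs` is hypothetical, and it assumes that anything with an `http` or `https` scheme is a URL job and everything else (paths and glob patterns) is a file job.

```python
from urllib.parse import urlparse


def classify_inputs(inputs):
    """Split raw inputs into (files, urls) so each group becomes its own job."""
    files, urls = [], []
    for item in inputs:
        # Assumption: only http(s) inputs count as URL jobs.
        scheme = urlparse(item).scheme.lower()
        if scheme in ("http", "https"):
            urls.append(item)
        else:
            files.append(item)
    return files, urls


files, urls = classify_inputs(["report.pdf", "https://example.com/article", "*.pdf"])
print("file job:", files)
print("url job:", urls)
```

When both lists are non-empty, the two groups are submitted as separate jobs rather than combined into one request.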

Output Conventions

  • Prefer the script defaults unless the user asks for a different layout.
  • For a single local file, extract into a sibling directory named after the input stem.
  • For a single URL, extract into docling-<slug> under the current working directory.
  • For multiple inputs, extract into docling-files-batch or docling-urls-batch under the current working directory, unless --output-dir is supplied.
  • If the user supplies --output-dir and both file and URL jobs are needed, the script creates files/ and urls/ subdirectories to keep the results separate.
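
The conventions above amount to a small decision table, sketched below. This is a hedged illustration rather than the script's implementation: `plan_output_dirs` and `slugify` are hypothetical names, the slug rule (host plus path with non-alphanumerics collapsed to `-`) is an assumption because the source does not define how `<slug>` is derived, and using `--output-dir` directly when only one kind of job is present is also assumed.

```python
from __future__ import annotations

import re
from pathlib import Path
from urllib.parse import urlparse


def slugify(url: str) -> str:
    """Hypothetical slug rule: host + path, non-alphanumerics collapsed to '-'."""
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def plan_output_dirs(files, urls, output_dir=None, cwd=Path(".")):
    """Return a dict mapping 'files' and/or 'urls' to a destination directory."""
    plan = {}
    if output_dir is not None:
        base = Path(output_dir)
        if files and urls:
            # Both job kinds: keep results separate under the chosen directory.
            plan["files"] = base / "files"
            plan["urls"] = base / "urls"
        elif files:
            plan["files"] = base  # assumption: used as-is for a single job kind
        elif urls:
            plan["urls"] = base  # assumption: used as-is for a single job kind
        return plan
    if files:
        if len(files) == 1:
            # Sibling directory named after the input stem.
            src = Path(files[0])
            plan["files"] = src.parent / src.stem
        else:
            plan["files"] = cwd / "docling-files-batch"
    if urls:
        if len(urls) == 1:
            plan["urls"] = cwd / f"docling-{slugify(urls[0])}"
        else:
            plan["urls"] = cwd / "docling-urls-batch"
    return plan


print(plan_output_dirs(["docs/report.pdf"], []))
print(plan_output_dirs(["a.pdf"], ["https://example.com/article"], output_dir="out"))
```

Running the wrapper with --dry-run remains the authoritative way to see the destination paths it will actually use.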

Script Notes

  • Use scripts/docling_gradio_convert.py --dry-run ... to verify grouping, endpoint selection, and destination paths without contacting the service.
  • Let the script infer the Gradio UI URL from the service root. http://localhost:5001 becomes http://localhost:5001/ui/.
  • Let the script ask /change_ocr_lang for the default OCR language set when --ocr-lang is not provided. Fall back to en,fr,de,es if the endpoint is unavailable.
  • Treat a missing gradio_client installation as an environment issue and fix it with pip install gradio_client instead of rewriting the workflow.
  • If a URL conversion returns <!-- 🖼️❌ Image not available ... -->, let the wrapper fetch the source page, collect article images, download them into images/, and replace placeholders in order.
  • URL post-processing fetches the source page once and reuses that HTML for metadata extraction, title normalization, and optional sidecar output instead of maintaining a separate capture flow.
  • Existing front matter keys outside the managed set are preserved; managed keys are url, title, description, author, published, cover_image, language, captured_at, converter, pipeline, ocr, and ocr_lang.
  • Use --save-source-html to write source.html for single URL jobs, and --save-manifest to write manifest.json with the conversion settings and output summary.
  • Sidecar files are skipped for multi-URL batch jobs even if the flags are set.
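
The front matter rule in the notes above (managed keys are rewritten, unknown keys are preserved) can be sketched as a plain dictionary merge. This is an assumption-laden illustration, not the wrapper's code: `merge_front_matter` is a hypothetical name, and the choices to emit managed keys first in the listed order and to skip managed keys that have no new value are assumptions.

```python
# Managed keys as listed in the notes above.
MANAGED_KEYS = (
    "url", "title", "description", "author", "published", "cover_image",
    "language", "captured_at", "converter", "pipeline", "ocr", "ocr_lang",
)


def merge_front_matter(existing, managed):
    """Combine freshly extracted managed values with preserved unknown keys."""
    merged = {}
    # Managed keys come first, in a fixed order, so the output is stable.
    for key in MANAGED_KEYS:
        if managed.get(key) is not None:
            merged[key] = managed[key]
    # Keys outside the managed set are carried over untouched.
    for key, value in existing.items():
        if key not in MANAGED_KEYS:
            merged[key] = value
    return merged


existing = {"title": "Old title", "tags": ["docs"]}
managed = {"url": "https://example.com/article", "title": "New title"}
print(merge_front_matter(existing, managed))
```

In this sketch an existing managed key such as title is replaced by the new value, while tags survives because it is outside the managed set.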

Resources

scripts/docling_gradio_convert.py

Use this wrapper for deterministic Docling conversions. It supports:

  • local files, URLs, and wildcard expansion
  • batch conversion
  • OCR and enrichment flags
  • archive download and extraction
  • output directory planning
  • dry-run validation

references/gradio-api-workflow.md

Read this reference when you need:

  • the endpoint mapping for file versus URL jobs
  • the argument names expected by the Gradio client
  • the wait_task_finish tuple layout
  • the defaults adopted by this skill
First Seen
Mar 14, 2026