docling-convert
Docling Convert
Use this skill to run document conversion through a local Docling service instead of ad-hoc parsing.
Quick Start
- Assume the Docling service is already deployed locally and reachable at http://localhost:5001.
- Prefer scripts/docling_gradio_convert.py for repeatable work. It wraps the documented Gradio API and handles submission, waiting, and archive extraction.
- Install the required client before running the script:
pip install gradio_client
- If URL jobs need placeholder image repair and beautifulsoup4 is missing, install it:
pip install beautifulsoup4 lxml
- Read references/gradio-api-workflow.md only when changing endpoints, tuning advanced options, or debugging output layouts.
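As a rough pre-flight check for the two install steps above, the following sketch reports which packages are missing from the current environment. The helper name `missing_packages` is hypothetical and is not part of the wrapper script; the import name `bs4` for the beautifulsoup4 package is the one that package normally exposes.

```python
import importlib.util

# Maps import module name -> pip package name.
# gradio_client is required; bs4 (beautifulsoup4) and lxml are only needed
# for placeholder image repair on URL jobs.
REQUIRED = {"gradio_client": "gradio_client"}
OPTIONAL = {"bs4": "beautifulsoup4", "lxml": "lxml"}


def missing_packages(modules):
    """Return pip package names whose top-level import module cannot be found."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    for label, modules in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_packages(modules)
        if missing:
            print(f"Missing {label} packages: pip install {' '.join(missing)}")
```

Running it before the wrapper makes a missing client show up as an environment issue rather than a conversion failure.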
Workflow
- Classify the inputs. Use the file flow for local paths and the URL flow for web pages. Do not mix files and URLs in one API request; if the user gives both, run two jobs.
- Choose the outputs. Default to md. Add json when the user also needs structured output. Add html, text, or doctags only when the task explicitly needs them.
- Choose the processing options. Keep pipeline=standard, ocr=true, force_ocr=false, pdf_backend=dlparse_v4, and table_mode=accurate unless the task calls for a change. Keep image_export_mode=embedded when the goal is to preserve extracted images. The wrapper post-processes embedded Markdown images into real files under images/. Turn on enrichment flags only when the user explicitly wants code, formulas, picture classification, or picture descriptions. For URL jobs, the wrapper also normalizes Markdown output by injecting stable front matter, preserving unknown existing front matter keys, and prepending # title only when the document body does not already start with that title.
- Run the wrapper script.
# Single file
python scripts/docling_gradio_convert.py report.pdf
# Batch files with Markdown + JSON
python scripts/docling_gradio_convert.py "*.pdf" --to-format md --to-format json
# Single URL
python scripts/docling_gradio_convert.py https://example.com/article --output-dir ./article
# Single URL with optional sidecar files
python scripts/docling_gradio_convert.py https://example.com/article --save-source-html --save-manifest
# Alternate service URL
python scripts/docling_gradio_convert.py slides.pptx --service-url http://localhost:5001
- Verify the extracted results.
The script always requests return_as_file=true, downloads the returned artifact, extracts it into the chosen output directory, and rewrites embedded Markdown images into local files when needed. For URL conversions it can also backfill Docling image placeholders from the source page. URL Markdown outputs are post-processed after extraction so the final .md contains normalized front matter plus a title heading when needed. Inspect the produced Markdown plus any extracted image assets before presenting the result to the user.
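The first workflow step, splitting inputs so that files and URLs never share one API request, could look roughly like this. It is a minimal sketch: `classify_inputs` is a hypothetical helper, and treating only http and https schemes as URLs is an assumption, not the wrapper's confirmed rule.

```python
from urllib.parse import urlparse


def classify_inputs(inputs):
    """Split raw arguments into a file group and a URL group (one job each)."""
    groups = {"files": [], "urls": []}
    for item in inputs:
        scheme = urlparse(item).scheme.lower()
        # Assumption: anything that is not http(s) is a local path or glob.
        key = "urls" if scheme in ("http", "https") else "files"
        groups[key].append(item)
    return groups


if __name__ == "__main__":
    mixed = ["report.pdf", "https://example.com/article", "*.pdf"]
    for kind, items in classify_inputs(mixed).items():
        if items:
            print(f"{kind} job: {items}")
```

When both groups are non-empty, each group becomes its own job, which matches the "run two jobs" rule above.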
Output Conventions
- Prefer the script defaults unless the user asks for a different layout.
- For a single local file, extract into a sibling directory named after the input stem.
- For a single URL, extract into docling-<slug> under the current working directory.
- For multiple inputs, extract into docling-files-batch or docling-urls-batch under the current working directory, unless --output-dir is supplied.
- If the user supplies --output-dir and both file and URL jobs are needed, the script creates files/ and urls/ subdirectories to keep the results separate.
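The directory conventions above can be summarized in a small planning function. This is a sketch under stated assumptions: `plan_output_dir` and `slugify_url` are hypothetical names, and the slug rule (host plus path, lowercased, non-alphanumerics collapsed to hyphens) is a guess, since the text does not say how the script derives `<slug>`.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slugify_url(url):
    """Hypothetical slug rule: host + path, lowercased, other chars -> '-'."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-") or "url"


def plan_output_dir(kind, inputs, output_dir=None, mixed=False, cwd=Path(".")):
    """Return the extraction directory for one job; kind is 'files' or 'urls'."""
    if output_dir is not None:
        base = Path(output_dir)
        # With both job kinds present, keep results apart in files/ and urls/.
        return base / kind if mixed else base
    if len(inputs) > 1:
        return cwd / f"docling-{kind}-batch"
    if kind == "files":
        source = Path(inputs[0])
        # Sibling directory named after the input stem.
        return source.parent / source.stem
    return cwd / f"docling-{slugify_url(inputs[0])}"


if __name__ == "__main__":
    print(plan_output_dir("files", ["docs/report.pdf"]))
    print(plan_output_dir("urls", ["https://example.com/article"]))
```

The wrapper's --dry-run mode remains the authoritative way to see the paths it will actually use.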
Script Notes
- Use scripts/docling_gradio_convert.py --dry-run ... to verify grouping, endpoint selection, and destination paths without contacting the service.
- Let the script infer the Gradio UI URL from the service root. http://localhost:5001 becomes http://localhost:5001/ui/.
- Let the script ask /change_ocr_lang for the default OCR language set when --ocr-lang is not provided. Fall back to en,fr,de,es if the endpoint is unavailable.
- Treat a missing gradio_client installation as an environment issue and fix it with pip install gradio_client instead of rewriting the workflow.
- If a URL conversion returns <!-- 🖼️❌ Image not available ... -->, let the wrapper fetch the source page, collect article images, download them into images/, and replace placeholders in order.
- URL post-processing fetches the source page once and reuses that HTML for metadata extraction, title normalization, and optional sidecar output instead of maintaining a separate capture flow.
- Existing front matter keys outside the managed set are preserved; managed keys are url, title, description, author, published, cover_image, language, captured_at, converter, pipeline, ocr, and ocr_lang.
- Use --save-source-html to write source.html for single URL jobs, and --save-manifest to write manifest.json with the conversion settings and output summary.
- Sidecar files are skipped for multi-URL batch jobs even if the flags are set.
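The front matter and title rules in these notes can be illustrated with plain dictionaries and strings. This is only a sketch: `merge_front_matter` and `ensure_title_heading` are hypothetical helpers, and two behaviors are assumptions rather than documented facts, namely that freshly extracted managed values overwrite existing ones and that managed keys are emitted first in the listed order.

```python
MANAGED_KEYS = (
    "url", "title", "description", "author", "published", "cover_image",
    "language", "captured_at", "converter", "pipeline", "ocr", "ocr_lang",
)


def merge_front_matter(existing, managed):
    """Managed keys first, in a stable order; unknown existing keys are kept."""
    merged = {}
    for key in MANAGED_KEYS:
        if key in managed:
            merged[key] = managed[key]  # assumption: new value wins
        elif key in existing:
            merged[key] = existing[key]
    for key, value in existing.items():
        if key not in MANAGED_KEYS:
            merged[key] = value  # preserved untouched
    return merged


def ensure_title_heading(body, title):
    """Prepend '# title' unless the body's first line already is that title."""
    stripped = body.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if first_line.lstrip("#").strip() == title.strip():
        return body
    return f"# {title}\n\n{stripped}"


if __name__ == "__main__":
    old = {"tags": ["docs"], "title": "Old title"}
    new = {"url": "https://example.com/article", "title": "Example"}
    print(merge_front_matter(old, new))
    print(ensure_title_heading("Body text.", "Example"))
```

Keeping the managed set in one tuple makes it easy to check a produced .md against the key list above.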
Resources
scripts/docling_gradio_convert.py
Use this wrapper for deterministic Docling conversions. It supports:
- local files, URLs, and wildcard expansion
- batch conversion
- OCR and enrichment flags
- archive download and extraction
- output directory planning
- dry-run validation
references/gradio-api-workflow.md
Read this reference when you need:
- the endpoint mapping for file versus URL jobs
- the argument names expected by the Gradio client
- the wait_task_finish tuple layout
- the defaults adopted by this skill