# Qianfan OCR Document Intelligence
This skill orchestrates visual understanding for images and PDFs. It does not implement a vision model itself. It selects the right analysis mode, prepares inputs, invokes the bundled CLI, and returns a structured result for the upstream agent.
## Required Execution Order

Always follow this order:
- Check whether `QIANFAN_TOKEN` is already available.
- If the token is missing, stop immediately and ask the user for the API Key.
- If the user provides the API Key, write it to `<skill-root>/.env` as `QIANFAN_TOKEN=...`.
- Only after the token is available, continue to mode selection, reference loading, and CLI calls.
This token preflight takes precedence over all later rules in this skill. Do not read `references/*.md`, do not select a mode, and do not call any bundled script until the token check has passed.
## API Key Setup

Before first use, make sure `QIANFAN_TOKEN` is available either in the process environment or in `<skill-root>/.env`.
If the token is missing, ask the user in Chinese:

    QIANFAN_TOKEN 环境变量未设置。请提供百度千帆 API Key。
    如果您暂时没有 API Key,请到 https://cloud.baidu.com/product-s/qianfan_home 注册获取。

(English gloss: "The QIANFAN_TOKEN environment variable is not set. Please provide a Baidu Qianfan API Key. If you do not have one yet, register at https://cloud.baidu.com/product-s/qianfan_home.")
If the user provides the key, persist it to `<skill-root>/.env` before continuing. Do not rely on a temporary `export QIANFAN_TOKEN=...` as the only storage mechanism.
Do not assume a bundled default token exists.
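The preflight and persistence steps above can be sketched in shell. This is a minimal sketch, not the skill's actual implementation: the `SKILL_ROOT` default and the `sk-example` key value below are placeholders.

```shell
# Sketch of token preflight + persistence; SKILL_ROOT and the key are placeholders.
SKILL_ROOT="${SKILL_ROOT:-/tmp/qianfan-skill}"
mkdir -p "$SKILL_ROOT"

# Persist a user-provided key to the skill's .env:
printf 'QIANFAN_TOKEN=%s\n' 'sk-example' > "$SKILL_ROOT/.env"

# On later runs, load it back when the environment lacks the token:
if [ -z "${QIANFAN_TOKEN:-}" ] && [ -f "$SKILL_ROOT/.env" ]; then
  set -a; . "$SKILL_ROOT/.env"; set +a
fi

if [ -n "${QIANFAN_TOKEN:-}" ]; then
  echo "token available"
else
  echo "token missing: ask the user for an API Key" >&2
fi
```

Using `set -a` before sourcing the `.env` file exports the variable so child processes (the bundled Python scripts) inherit it.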
## Bundled Tools

- `scripts/qianfan_ocr_cli.py`: send one or more images to the backend VLM.
- `scripts/pdf_to_images.py`: convert one or more PDFs into per-page images before calling the VLM.
- `scripts/render_doc_markdown.py`: replace document-parsing image placeholders with cropped image files.
- `scripts/run_document_parsing.py`: run document parsing end-to-end and always render image placeholders.
- `scripts/run_pdf_document_parsing.py`: run PDF document parsing end-to-end and export combined markdown, shared assets, and per-page markdown files.
- `scripts/run_document_parsing_with_layout.py`: run document parsing with layout and export markdown, layout JSON, and a layout overlay image.
- `scripts/run_layout_analysis.py`: run layout analysis and export `_layout.json` plus a layout overlay image.
- `scripts/run_element_recognition.py`: run element recognition and save the result as a sibling markdown file.
Always call scripts by absolute path. In Codex, use the installed absolute skill path instead of a bare relative path.
Examples:

    python3 "<skill-root>/scripts/qianfan_ocr_cli.py" "<prompt>" --image <path_or_url>
    python3 "<skill-root>/scripts/pdf_to_images.py" <pdf_or_url> --output-dir <dir>
    python3 "<skill-root>/scripts/render_doc_markdown.py" parsed.md --image <page_image> --output-dir <assets_dir> --output-markdown <rendered_md>
    python3 "<skill-root>/scripts/run_document_parsing.py" <image_or_pdf>
    python3 "<skill-root>/scripts/run_pdf_document_parsing.py" <pdf> --pages all
    python3 "<skill-root>/scripts/run_document_parsing_with_layout.py" <image_or_pdf>
    python3 "<skill-root>/scripts/run_layout_analysis.py" <image_or_pdf>
    python3 "<skill-root>/scripts/run_element_recognition.py" <cropped_image_or_pdf> --element-type <text|formula|table>
## Trigger Rules

Trigger this skill only when all of the following are true:
- The task involves one or more image files or URLs, or one or more PDF files or URLs.
- The agent must recognize, understand, extract, answer questions about, or locate content inside those images or PDFs.
- The relevant information cannot be obtained from existing plain-text sources already available to the agent.
Do not trigger when:
- The file is plain text, structured data, or source code that can be read directly.
- The user is asking for image-processing or PDF-processing code rather than visual understanding.
- The image/PDF path or URL is mentioned incidentally and no visual understanding is requested.
- A previous invocation already answered the same question and repeating the call would be redundant.
## Input Preparation

### Image inputs

- Pass local image paths or image URLs directly to `qianfan_ocr_cli.py`.
- Use one call per unrelated image.
- Use repeated `--image` flags only when cross-image reasoning is required.
### PDF inputs

- If the input is a PDF, convert it to page images first with `scripts/pdf_to_images.py`.
- For single-page questions, analyze only the relevant page when known.
- For multi-page PDFs, keep page order and label outputs with page numbers.
- If the PDF already has reliable selectable text and the task is pure text retrieval, do not use this skill; read the text directly.
Recommended PDF flow:

    python3 "<skill-root>/scripts/pdf_to_images.py" report.pdf --output-dir /tmp/report-pages
    python3 "<skill-root>/scripts/qianfan_ocr_cli.py" "<prompt>" --image /tmp/report-pages/report-p001.png
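For multi-page analysis, the same flow extends to a loop over the converted pages. The sketch below is a dry run (it echoes the commands instead of executing them, so it runs without a token); the `CLI` path is a placeholder for the installed absolute path.

```shell
# Dry-run sketch: one CLI call per converted page, in page order.
# Drop the leading "echo" to execute the calls for real.
CLI='<skill-root>/scripts/qianfan_ocr_cli.py'
for page in /tmp/report-pages/report-p*.png; do
  echo python3 "$CLI" "Extract all visible text on this page." --image "$page"
done
```

Because the converter names pages `*-p001.png`, `*-p002.png`, and so on, the glob expands in page order automatically.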
## Analysis Modes
Select exactly one primary mode per call. If needed, make a second, more specific call after an initial pass.
| Mode | Use When | Goal |
|---|---|---|
| document parsing | Need document structure, text, formulas, tables, and image placeholders from an image/PDF | Output Markdown parsing result |
| layout analysis | Need all layout elements with positions and categories | Output layout elements with bbox and category |
| element recognition | Need precise recognition on cropped elements such as text blocks, formulas, or tables | Output exact recognition for the cropped element |
| document parsing with layout | Need both structural parsing and layout detection in one workflow | Output Markdown parsing plus layout analysis |
| general ocr | Need all visible text without document structure | Extract all visible text lines |
| key information extraction | Need key fields from cards, forms, receipts, invoices, contracts, or similar documents | Extract key information in structured form |
| chart understanding | Need chart captions, structured chart content, or chart QA | Understand and structure chart content |
| doc vqa | Need answers to specific questions about a document image/PDF | Answer questions grounded in the document |
## Mode Selection Heuristics

- Use `document parsing` for full-page document understanding where the output should be Markdown and preserve hierarchy.
- Use `layout analysis` when bounding boxes and categories are the main output.
- Use `element recognition` only after cropping the target region or when the user provides a single focused element image.
- Prefer `scripts/run_element_recognition.py` for `element recognition` so the result is written next to the source file as a single markdown file without any assets directory.
- Use `document parsing with layout` when both Markdown reconstruction and layout boxes are needed.
- Use `general ocr` for screenshots, signs, posters, and simple document text extraction where layout is not important.
- Use `key information extraction` for forms, certificates, IDs, invoices, receipts, contracts, and other field-centric documents.
- For `key information extraction`, if the user asks for all key-value information or all fields without naming a concrete field list, use the schema-free prompt path instead of inventing an explicit schema.
- Use `chart understanding` for plots, dashboards, and chart-heavy report pages.
- Use `doc vqa` for targeted questions such as totals, dates, clauses, page content, or whether a document contains a specific item.
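A keyword router is one illustrative way to operationalize these heuristics. The keywords below are examples only, not a complete rule set; the prose rules above remain authoritative.

```shell
# Illustrative keyword-to-mode routing; keywords are examples, not exhaustive.
task="extract all fields from this invoice"
case "$task" in
  *invoice*|*receipt*|*form*|*certificate*) mode="key information extraction" ;;
  *chart*|*plot*|*dashboard*)               mode="chart understanding" ;;
  *layout*|*bbox*)                          mode="layout analysis" ;;
  *)                                        mode="general ocr" ;;
esac
echo "$mode"   # prints: key information extraction
```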
## Reference Loading Rule

Only after token preflight has passed and after selecting the mode, always read the corresponding file in `references/` before composing the prompt or calling any script:

- document parsing -> `references/document-parsing.md`
- layout analysis -> `references/layout-analysis.md`
- element recognition -> `references/element-recognition.md`
- document parsing with layout -> `references/document-parsing-with-layout.md`
- general ocr -> `references/general-ocr.md`
- key information extraction -> `references/key-information-extraction.md`
- chart understanding -> `references/chart-understanding.md`
- doc vqa -> `references/doc-vqa.md`
Do not skip this step when a matching reference exists.
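Every entry in the mapping above follows the same naming pattern (spaces in the mode name become hyphens), so the filename can be derived mechanically:

```shell
# Derive the reference filename from the mode name, matching the mapping above.
mode="document parsing with layout"
ref="references/$(printf '%s' "$mode" | tr ' ' '-').md"
echo "$ref"   # prints: references/document-parsing-with-layout.md
```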
## Prompt Sourcing Rule
When the selected reference contains a prompt template, prompt rule, fixed prompt, output format, or parameter recommendation, use that reference as the primary source of truth.
- Prefer the prompt in the corresponding `references/*.md` file over ad-hoc prompt writing.
- Reuse the reference prompt verbatim when it is marked as a fixed prompt or standard prompt.
- Reuse the reference output format requirements instead of inventing a new format.
- Only add task-specific details, such as the user question, selected keys, page number, or input scope, on top of the reference prompt.
- If you intentionally deviate from the reference prompt, state why in the intermediate reasoning and keep the deviation minimal.
## Parameter Mapping Rule
If the selected reference defines execution parameters, convert them into actual CLI flags or request fields. Do not leave them as documentation-only notes.
Examples:

- `min_dynamic_patch = 8` -> pass `--min-dynamic-patch 8`
- `max_dynamic_patch = 24` -> pass `--max-dynamic-patch 24`
- thinking mode -> pass `--thinking`
Before running `qianfan_ocr_cli.py`, verify that the final command includes the parameter settings required by the selected mode.
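As a concrete check, the mapped flags should be visible in the final command line. This dry run (echo) shows the assembled call; the `CLI` path is a placeholder for the installed absolute path.

```shell
# Dry run: reference parameters rendered as real CLI flags.
CLI='<skill-root>/scripts/qianfan_ocr_cli.py'
echo python3 "$CLI" "<prompt>" --image page.png \
  --min-dynamic-patch 8 --max-dynamic-patch 24 --thinking
```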
## Prompt Rules
- Write the VLM prompt in Chinese when the user is communicating in Chinese; otherwise use English.
- State the mode and output format explicitly in the prompt.
- Tell the model to mark anything uncertain as unclear / unreadable instead of guessing.
- For PDFs, mention page numbers in the prompt whenever multiple pages are analyzed.
- For cropped inputs used in
element recognition, specify the element type: text, formula, table, figure caption, seal, signature block, and so on. - For
document parsingoutputs that containplaceholders, runscripts/render_doc_markdown.pybefore presenting the Markdown to users who need renderable local images. - Prefer
scripts/run_document_parsing.pyover manually chainingqianfan_ocr_cli.pyandrender_doc_markdown.pywhen the task is standard document parsing. - Prefer
scripts/run_pdf_document_parsing.pyfor PDF document parsing when the user wants one markdown for the whole PDF plus per-page markdown files and a shared assets directory. Use--request-mode jointwhen selected PDF pages are semantically related and should be sent as one multi-image request. Use--request-mode batch --concurrency <N>when pages can be parsed independently and should run concurrently. - Prefer
scripts/run_document_parsing_with_layout.pyfordocument parsing with layoutso the final output includes markdown,_layout.json, and a rendered layout overlay image.
## CLI Strategy

- Default to one call per image or page.
- Use repeated `--image` flags only for cross-page or cross-image reasoning that truly depends on joint context.
- If multiple images should be processed independently rather than jointly, use `scripts/qianfan_ocr_cli.py --batch --concurrency <N>` or a dedicated runner instead of sending all images as one joint request.
- Use `--thinking` only for difficult document understanding tasks with ambiguous reading order or dense field relationships.
- Retry at most once per image/page, and only with a more specific prompt.
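A batch call for independent images might look like the dry run (echo) below, using the `--batch` and `--concurrency` flags described above; the `CLI` path is a placeholder.

```shell
# Dry run: independent images via --batch with bounded concurrency,
# instead of one joint multi-image request.
CLI='<skill-root>/scripts/qianfan_ocr_cli.py'
echo python3 "$CLI" "Extract all visible text." \
  --batch --concurrency 4 --image a.png --image b.png
```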
Read only the reference file relevant to the selected mode, but always read it before constructing the prompt:

- `references/document-parsing.md`
- `references/layout-analysis.md`
- `references/element-recognition.md`
- `references/document-parsing-with-layout.md`
- `references/general-ocr.md`
- `references/key-information-extraction.md`
- `references/chart-understanding.md`
- `references/doc-vqa.md`
## Output Contract

Return a structured result instead of raw model prose:
    === VISUAL ANALYSIS RESULT ===
    mode: <mode>
    confidence: <high|medium|low>
    input_type: <image|pdf>
    image_count: <N>
    page_count: <N or n/a>
    answer:
    <direct answer or summary>
    evidence:
    - <directly observed fact>
    warnings:
    - <uncertainty or limitation>
    markdown:
    <for document parsing modes>
    layout:
    - page: <n>
      category: <label>
      bbox: [x1, y1, x2, y2]
    structured_data:
    <for key information extraction / chart understanding>
    recognized_elements:
    <for element recognition>
    === END VISUAL ANALYSIS ===
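Because the header lines of this contract are fixed `key: value` pairs, a downstream consumer can extract them with standard tools. A minimal sketch:

```shell
# Pull a header field out of a result block; relies only on the fixed header lines.
result='=== VISUAL ANALYSIS RESULT ===
mode: doc vqa
confidence: high
input_type: pdf
=== END VISUAL ANALYSIS ==='
printf '%s\n' "$result" | sed -n 's/^mode: //p'   # prints: doc vqa
```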
## Conservatism Rules
- Separate observation from inference.
- Never invent text, values, fields, or boxes.
- Mark unreadable regions explicitly.
- For charts, distinguish exact values from estimated values.
- For PDFs, keep page attribution explicit: `page_1`, `page_2`, and so on.
- If page conversion or image quality is poor, mention that in `warnings`.