addon-pdf-preprocess-page-artifacts
SKILL.md
Add-on: PDF Preprocessing (Page Artifacts)
Use this skill to implement the preprocessing stage that turns an uploaded PDF into:
- page-level text/markdown artifacts (raw)
- page-level cleaned artifacts (after header/footer cleanup; separate stage)
- stable provenance metadata for audit + reprocessing
Inputs
Collect:
- PDF_PARSER: docling (if available) or pypdf/pdfplumber fallback.
- PAGE_MARKER_STYLE: structured-pages (preferred) or markdown-markers.
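As a minimal sketch of how these two inputs might be collected and validated: the `PreprocessConfig` class and the `pick_parser` helper below are hypothetical names, not part of the skill. The parser check only tests whether a package is importable, which is one plausible reading of "if available".

```python
import importlib.util
from dataclasses import dataclass

ALLOWED_PARSERS = ("docling", "pypdf", "pdfplumber")
ALLOWED_MARKER_STYLES = ("structured-pages", "markdown-markers")


def pick_parser(candidates=ALLOWED_PARSERS):
    """Return the first candidate package that is importable (hypothetical helper)."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("no supported PDF parser is installed")


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical container for PDF_PARSER and PAGE_MARKER_STYLE."""
    pdf_parser: str = "docling"
    page_marker_style: str = "structured-pages"

    def __post_init__(self):
        # Reject unknown values early so a typo never reaches the worker.
        if self.pdf_parser not in ALLOWED_PARSERS:
            raise ValueError(f"unsupported PDF_PARSER: {self.pdf_parser!r}")
        if self.page_marker_style not in ALLOWED_MARKER_STYLES:
            raise ValueError(
                f"unsupported PAGE_MARKER_STYLE: {self.page_marker_style!r}"
            )
```

Validating at construction time keeps the chosen values available for the provenance metadata recorded later.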
Output Contracts
For each page (1-based), persist:
- raw_page_markdown (or raw extracted text)
- metadata_jsonb including parser name/version and extraction params
Also persist canonical whole-document artifacts to object storage:
- raw PDF (already stored on upload)
- optional markdown export (include page markers if used)
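The per-page contract above could be modelled roughly as follows. This is a sketch only: `PageArtifact`, `make_page_artifact`, and the key names inside the metadata dictionary are assumptions chosen for illustration, and the skill does not prescribe a metadata layout.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageArtifact:
    page_number: int                 # 1-based, per the contract
    raw_page_markdown: str           # or raw extracted text
    metadata: dict[str, Any] = field(default_factory=dict)

    def metadata_json(self) -> str:
        # Sorted keys make the serialized form stable across runs.
        return json.dumps(self.metadata, sort_keys=True)


def make_page_artifact(page_number, text, parser_name, parser_version, params):
    """Build one page record with provenance metadata (hypothetical helper)."""
    if page_number < 1:
        raise ValueError("page numbers are 1-based")
    metadata = {
        "parser": {"name": parser_name, "version": parser_version},
        "extraction_params": dict(params),
    }
    return PageArtifact(page_number, text, metadata)
```

For example, `make_page_artifact(1, "# Title", "pypdf", "<installed version>", {})` yields a record whose `metadata_json()` can be stored in a JSON or JSONB column.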
Implementation Notes
Preferred: Structured Pages
If the parser yields pages directly:
- iterate page objects
- insert one document_pages row per page
- avoid regex page splitting
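The structured-pages path might look like the sketch below. It assumes each page object exposes an `extract_text()` method, as pypdf and pdfplumber page objects do; the `rows_from_pages` helper and the row dictionary keys are hypothetical, and the actual insert into document_pages is left to the caller.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Any page object with an extract_text() method."""
    def extract_text(self) -> str: ...


def rows_from_pages(document_id: str, pages: Iterable[PageLike]) -> list[dict]:
    """Build one row per page object, numbered from 1, with no regex splitting."""
    rows = []
    for page_number, page in enumerate(pages, start=1):
        rows.append(
            {
                "document_id": document_id,
                "page_number": page_number,
                # Some parsers return None for an empty page; store "" instead.
                "raw_markdown": page.extract_text() or "",
            }
        )
    return rows
```

With pypdf installed, `rows_from_pages(doc_id, PdfReader(path).pages)` would be one way to feed it.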
Fallback: Markdown Markers
If the parser only yields markdown:
- emit deterministic markers: <!-- PAGE:1 -->, <!-- PAGE:2 -->
- split on markers; never infer pages by heuristics alone
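The marker fallback can be sketched as a pair of functions, one that emits markers and one that splits on them. The function names are hypothetical; the marker format is the one shown above. The splitter refuses input whose markers are duplicated or not contiguous from 1, rather than guessing.

```python
import re

PAGE_MARKER = re.compile(r"<!-- PAGE:(\d+) -->")


def emit_with_markers(pages: list[str]) -> str:
    """Join page texts, prefixing each with its 1-based marker."""
    return "\n".join(
        f"<!-- PAGE:{n} -->\n{text}" for n, text in enumerate(pages, start=1)
    )


def split_on_markers(markdown: str) -> dict[int, str]:
    """Split marked-up markdown back into {page_number: text}."""
    # With one capture group, split() returns
    # [preamble, "1", text1, "2", text2, ...].
    parts = PAGE_MARKER.split(markdown)
    if parts[0].strip():
        raise ValueError("content found before the first page marker")
    pages: dict[int, str] = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        n = int(number)
        if n in pages:
            raise ValueError(f"duplicate marker for page {n}")
        # Drops the newlines around the marker; this also trims
        # leading/trailing blank lines of the page itself.
        pages[n] = text.strip("\n")
    if sorted(pages) != list(range(1, len(pages) + 1)):
        raise ValueError("page markers are not contiguous from 1")
    return pages
```

One residual risk: if a page's own text contains a string that looks like a marker, the split will be wrong, which is part of why structured pages are preferred.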
Worker Responsibilities
The preprocess worker should:
- Download raw PDF from object storage
- Extract per-page raw content deterministically
- Insert/update document_pages.raw_markdown
- Update documents.page_count and documents.status
- Enqueue the next stage (cleanup / chunking / extraction) or mark stage completion
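The responsibilities above could be wired together along these lines. Everything here is an illustrative assumption: the `run_preprocess` function, its injected callables, and the `"preprocessed"` status value are hypothetical, since the skill names no storage client, queue, or status vocabulary.

```python
def run_preprocess(
    document_id,
    download_pdf,      # (document_id) -> bytes, from object storage
    extract_pages,     # (pdf_bytes) -> iterable of page texts, in order
    save_page,         # (document_id, page_number, raw_text) -> None
    update_document,   # (document_id, page_count=..., status=...) -> None
    enqueue_next,      # (document_id) -> None
):
    """Run the raw extraction stage once for a document; returns the page count."""
    pdf_bytes = download_pdf(document_id)
    pages = list(extract_pages(pdf_bytes))
    for page_number, raw_text in enumerate(pages, start=1):
        # Raw text is stored untouched; cleanup belongs to a later stage.
        save_page(document_id, page_number, raw_text)
    update_document(document_id, page_count=len(pages), status="preprocessed")
    enqueue_next(document_id)
    return len(pages)
```

Passing the side effects in as callables keeps the step order testable without a real bucket, database, or queue.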
Guardrails
- Do not run cleanup in the raw extraction step; raw must remain raw.
- Ensure preprocessing is idempotent (safe retries) for the same document_id.
- Store extraction parameters and parser version so results are reproducible.
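One way to make retries safe is an upsert keyed on (document_id, page_number). The sketch below uses the standard-library sqlite3 module as a stand-in so it runs anywhere (SQLite 3.24 or newer is needed for `ON CONFLICT ... DO UPDATE`). The column layout and the `upsert_pages` helper are assumptions; a PostgreSQL deployment would use a JSONB column and the equivalent `ON CONFLICT` clause.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS document_pages (
    document_id  TEXT    NOT NULL,
    page_number  INTEGER NOT NULL,
    raw_markdown TEXT    NOT NULL,
    metadata     TEXT    NOT NULL,
    PRIMARY KEY (document_id, page_number)
)
"""

UPSERT = """
INSERT INTO document_pages (document_id, page_number, raw_markdown, metadata)
VALUES (?, ?, ?, ?)
ON CONFLICT (document_id, page_number) DO UPDATE SET
    raw_markdown = excluded.raw_markdown,
    metadata     = excluded.metadata
"""


def upsert_pages(conn, document_id, pages, metadata):
    """Write all pages for a document; re-running overwrites instead of duplicating."""
    meta = json.dumps(metadata, sort_keys=True)
    rows = [(document_id, n, text, meta) for n, text in enumerate(pages, start=1)]
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(UPSERT, rows)
        # Remove pages left over from an earlier run that produced more pages.
        conn.execute(
            "DELETE FROM document_pages WHERE document_id = ? AND page_number > ?",
            (document_id, len(pages)),
        )
```

Running `upsert_pages` twice with the same input leaves the table unchanged after the first call, which is the property a retried job needs.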
Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.
Repository
ajrlewis/ai-skills