to-markdown
To Markdown
Convert any file or URL to clean Markdown using MarkItDown as the conversion engine, with a lightweight fetch layer for URLs.
Reference Files
| File | Purpose |
|---|---|
| references/formats.md | Per-format handling notes, internal engines, known gaps |
| references/fetch.md | URL fetch layer: trafilatura + Playwright strategies |
| references/install.md | Dependency install guide for all variants |
Decision Tree
Determine the input type before touching any tool:
Input type?
  Local file path -> markitdown directly
  URL
    YouTube URL -> markitdown directly (transcript extraction built-in)
    Static page -> trafilatura fetch -> markitdown on HTML result
    JS-rendered / auth -> Playwright fetch -> markitdown on result
  Pasted HTML string -> markitdown directly on string
Do not use web_fetch or WebFetch for URLs — route through the fetch layer described in references/fetch.md to preserve the conversion pipeline.
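The first branch of the decision tree can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself: `classify_input`, `YOUTUBE_HOSTS`, and the returned labels are hypothetical names, and the HTML check is a simple heuristic. Whether a page is static or JS-rendered cannot be decided from the URL string alone, so both cases come back as `"url"` here.

```python
import re
from urllib.parse import urlparse

# Hypothetical host list; extend as needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(source: str) -> str:
    """Return 'youtube', 'url', 'html', or 'file' for a raw input string."""
    text = source.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        host = parsed.netloc.lower()
        return "youtube" if host in YOUTUBE_HOSTS else "url"
    # Heuristic: anything starting with a tag or doctype is pasted HTML.
    if re.match(r"<(!doctype|[a-zA-Z])", text, flags=re.IGNORECASE):
        return "html"
    # Everything else is treated as a local file path.
    return "file"
```

Only inputs classified as `"url"` need the fetch layer; the other three labels go to markitdown directly.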
Core Conversion Workflow
Step 1: Ensure dependencies
uv pip show markitdown || uv pip install 'markitdown[all]' trafilatura
See references/install.md for selective installs and full dependency table.
Step 2: Convert
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
result = md.convert("path/to/file.pdf")
print(result.text_content)
Step 3: Workflow
- Detect input type (file path, URL, raw HTML).
- If URL, run fetch layer first (see references/fetch.md).
- Run markitdown conversion on the local file or fetched content.
- Post-process if needed (strip boilerplate, trim to main content).
- Write output or return inline per output conventions below.
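The URL path through these steps can be sketched with the fetch and conversion steps passed in as callables. This is an illustrative outline under stated assumptions: `fetch_then_convert` and its parameter names are hypothetical, and the real fetch strategies live in references/fetch.md. It shows the order of operations and the escalation from the static fetch to the rendered fetch when the first returns nothing.

```python
from typing import Callable, Optional


def fetch_then_convert(
    url: str,
    fetch_static: Callable[[str], Optional[str]],
    fetch_rendered: Callable[[str], Optional[str]],
    to_markdown: Callable[[str], str],
) -> str:
    """Fetch HTML for a URL, escalate if the static fetch is empty, then convert."""
    html = fetch_static(url)
    if not html or not html.strip():
        # Static fetch came back empty: escalate to the browser-based strategy.
        html = fetch_rendered(url)
    if not html or not html.strip():
        raise ValueError(f"Fetch returned no content for {url}. See references/fetch.md.")
    markdown = to_markdown(html)
    if not markdown.strip():
        # Never return empty Markdown silently.
        raise ValueError(f"No text extracted from {url}. See references/formats.md.")
    return markdown
```

In practice `fetch_static` might wrap a trafilatura download, `fetch_rendered` a Playwright page load, and `to_markdown` a MarkItDown conversion. Those wrappers are left out here so the control flow can be read and tested on its own.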
Output Conventions
| Context | Output behaviour |
|---|---|
| Single file, user wants file | Write <input_stem>.md to same directory |
| Single file, inline request | Return Markdown in conversation |
| Batch (multiple files) | Write each to <stem>.md, summarise what was produced |
| URL | Write <slug>.md to current directory or return inline |
| Piped into another workflow | Return result.text_content string only |
Default: "convert this file" -> write a file. "Read this" or "what does this say" -> return inline.
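The file-naming rows of the table can be sketched as follows. The `<input_stem>.md` rule comes from the table above. The slug rule for URLs is an assumption, since this document does not define how `<slug>` is derived. All function names are hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def output_path_for_file(source: str) -> Path:
    """<input_stem>.md in the same directory as the input file."""
    return Path(source).with_suffix(".md")


def slug_for_url(url: str) -> str:
    """Assumed slug rule: host plus path, lowercased, non-alphanumerics -> '-'."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return slug or "page"


def output_path_for_url(url: str, directory: str = ".") -> Path:
    """<slug>.md in the given directory (current directory by default)."""
    return Path(directory) / f"{slug_for_url(url)}.md"
```

For example, `output_path_for_file("docs/report.pdf")` gives `docs/report.md`.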
Output Example
Source (two-column PDF with a table):
Annual Report 2024 Financial Highlights
Revenue grew 12% year-over-year... | Metric | 2023 | 2024 |
| Revenue | $4.2B | $4.7B |
| EBITDA | $1.1B | $1.3B |
Converted Markdown:
# Annual Report 2024
Revenue grew 12% year-over-year...
## Financial Highlights
| Metric | 2023 | 2024 |
| ------- | ----- | ----- |
| Revenue | $4.2B | $4.7B |
| EBITDA | $1.1B | $1.3B |
Multi-column layouts merge into linear flow. Tables are preserved as Markdown tables. Headings are inferred from font size/weight.
LLM Image Description (opt-in)
Markitdown supports an llm_client for image description in PPTX and image files. Never enable by default — it incurs cost, latency, and unexpected API calls. Prompt the user first: "This file contains images. Do you want me to use Claude to describe them? This will make additional API calls."
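import anthropic
from markitdown import MarkItDown

# Opt-in only: run this after the user has agreed to the extra API calls.
# Note: markitdown's documented llm_client example uses an OpenAI-style client;
# verify that the client passed here is accepted by the installed version.
client = anthropic.Anthropic()
md = MarkItDown(llm_client=client, llm_model="claude-sonnet-4-20250514")
result = md.convert("presentation.pptx")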
Error Handling
| Severity | Condition | Action |
|---|---|---|
| Terminal | Unsupported format (no converter exists) | Report to user immediately; do not retry |
| Terminal | Password-protected Office file | Report to user; no programmatic workaround |
| Terminal | File not found / path invalid | Report exact path; ask user to verify |
| Recover | Empty output from PDF | Likely scanned — escalate to OCR path in references/formats.md |
| Recover | Missing optional dependency (e.g. playwright) | Install the dependency, then retry the conversion |
| Recover | URL fetch returns paywall page | Report fetch limitation; do not retry or attempt bypass |
| Recover | trafilatura returns empty | Escalate to Playwright fetch strategy per references/fetch.md |
result = md.convert(path)
if not result.text_content.strip():
    raise ValueError(f"No text extracted from {path}. See references/formats.md for OCR options.")
Never silently return empty Markdown. Surface the failure with the severity and a pointer to the relevant reference file.
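One way to carry the severity and the reference pointer along with the failure is a small exception type. This is a sketch only: `ConversionFailure` and `require_text` are hypothetical names, not part of markitdown or of this skill's reference files.

```python
from typing import Optional


class ConversionFailure(Exception):
    """Failure carrying a severity label and an optional reference file."""

    def __init__(self, severity: str, condition: str, reference: Optional[str] = None):
        self.severity = severity      # "Terminal" or "Recover", as in the table above
        self.condition = condition
        self.reference = reference
        message = f"[{severity}] {condition}"
        if reference:
            message += f" (see {reference})"
        super().__init__(message)


def require_text(text: str, source: str) -> str:
    """Return text unchanged, or raise if the conversion produced nothing."""
    if not text.strip():
        raise ConversionFailure(
            "Recover", f"Empty output from {source}", "references/formats.md"
        )
    return text
```

A caller can then branch on `severity` to decide between reporting to the user and escalating to another path.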
Known Gaps and Escalation
- HTML fidelity: markitdown uses markdownify internally, so complex layouts lose structure. For high-fidelity HTML conversion where DOM structure matters, suggest Turndown via Node subprocess.
- Hard paywalls: The fetch layer returns the regwall page, not the content. This is a fetch limitation, not a conversion problem.
- Scanned PDFs (image-only, no text layer): markitdown returns near-empty output. Escalate to OCR workflow (Azure Document Intelligence or Tesseract). See references/formats.md.
- Protected Office files: Password-protected DOCX/XLSX will fail. Inform the user.
Calibration Rules
- Converted output must contain at least 10 words per page of source document. Below this threshold, treat as empty extraction and escalate per the error handling table.
- Tables in the source must appear as Markdown tables in the output — if a table is present in the original but missing in the conversion, flag it to the user.
- Heading hierarchy from the source document must be preserved (H1 > H2 > H3). Flat output with no headings from a structured document indicates a conversion quality issue.
- For URL conversions, output must not contain navigation elements, cookie banners, or footer boilerplate. If present, re-run through trafilatura to strip boilerplate, with include_tables=True so that tables are kept.
- Multi-sheet XLSX must produce one clearly labeled section per sheet. Missing sheets indicate a partial conversion; report which sheets were extracted.
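The first three rules lend themselves to an automated check. The sketch below is one possible reading, with hypothetical names (`calibration_issues` and the issue labels). Because the converted Markdown cannot tell you whether the source had headings or tables, the caller passes that in as flags.

```python
import re
from typing import List

# A Markdown table separator row, e.g. "|---|---|" or "| ----- | :---: |".
TABLE_SEPARATOR = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$", flags=re.MULTILINE)
HEADING = re.compile(r"^#{1,6}\s+\S", flags=re.MULTILINE)


def calibration_issues(
    markdown: str,
    page_count: int,
    expect_headings: bool = False,
    expect_tables: bool = False,
    min_words_per_page: int = 10,
) -> List[str]:
    """Return a list of issue labels; an empty list means the checks passed."""
    issues = []
    # Rule 1: at least 10 words per source page, else treat as empty extraction.
    if len(markdown.split()) < min_words_per_page * max(page_count, 1):
        issues.append("empty-extraction")
    # Rule 3: a structured source should not produce flat, heading-free output.
    if expect_headings and not HEADING.search(markdown):
        issues.append("no-headings")
    # Rule 2: a table in the source should survive as a Markdown table.
    if expect_tables and not TABLE_SEPARATOR.search(markdown):
        issues.append("missing-table")
    return issues
```

An `"empty-extraction"` result maps onto the "Empty output from PDF" row of the error handling table; the other two labels are quality flags to report to the user.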
Limitations
- No paywall bypass. Document it, don't attempt it.
- No Turndown integration built-in. Different runtime (Node.js).
- No scheduled/batch crawling. One conversion per invocation.
- No output format other than Markdown.
- Auto-generated YouTube captions may contain errors for technical terms.
- Scanned PDFs require external OCR — markitdown alone returns empty output.