markitdown
Convert any file or document to markdown using Microsoft's markitdown CLI. It produces LLM-optimized markdown output — clean, structured, and ready to reason over.
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction; scanned PDFs may be limited |
| Word | .docx, .doc | Full text + headings |
| Excel | .xlsx, .xls | Tables per sheet |
| PowerPoint | .pptx, .ppt | Slide text + notes |
| HTML | .html, .htm | Rendered text |
| CSV | .csv | Table format |
| JSON / XML | .json, .xml | Structured data |
| Images | .jpg, .png, .gif, .webp | EXIF metadata + OCR description |
| Audio | .wav, .mp3, .m4a | Transcription |
| EPub | .epub | E-book content |
| Outlook MSG | .msg | Email content |
| ZIP | .zip | Extracts and converts contents |
| YouTube URL | URL | Transcript extraction |
Step 0: Ensure dependencies are installed
markitdown (Python package):
which markitdown || pip install 'markitdown[all]'
The [all] extras cover PDF, DOCX, XLSX, PPTX, audio, and image Python dependencies in one
shot — worth the one-time cost to avoid "unsupported format" errors.
System binaries — pip cannot install these; they must come from the OS package manager.
Only install what the file type requires:
| File type | Needs | Check | Install (Debian/Ubuntu) |
|---|---|---|---|
| Audio (.wav, .mp3, .m4a) | ffmpeg | which ffmpeg | apt-get install -y ffmpeg |
| Images (OCR text) | tesseract | which tesseract | apt-get install -y tesseract-ocr |
Check before attempting audio or image conversion — missing system binaries produce empty
output with no error message, which is confusing. If on macOS, use brew install ffmpeg /
brew install tesseract instead.
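The check described above could be sketched as a small shell helper. The function names `required_binary` and `check_system_deps` are hypothetical, and the extension lists simply mirror the tables in this document; they are not something markitdown reports itself.

```shell
# Hypothetical helper: map a file's extension to the system binary this
# guide says it needs. Prints nothing for types with no such requirement.
required_binary() {
  # Text after the last dot (the whole name if there is no dot), lowercased.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    wav|mp3|m4a)      echo ffmpeg ;;
    jpg|png|gif|webp) echo tesseract ;;
  esac
}

# Return 0 when the needed binary (if any) is on PATH; otherwise warn on
# stderr and return 1.
check_system_deps() {
  bin=$(required_binary "$1")
  if [ -n "$bin" ] && ! command -v "$bin" >/dev/null 2>&1; then
    echo "missing system binary: $bin" >&2
    return 1
  fi
  return 0
}
```

A call such as `check_system_deps "talk.m4a" && markitdown "talk.m4a"` would then skip the conversion when ffmpeg is absent instead of producing silent empty output.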
Step 1: Identify the file
Confirm the file path from the user's message. If the path is ambiguous or relative, resolve
it (e.g., ~/Downloads/report.pdf expands to the full path). Verify it exists before running.
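One way to resolve and verify a path, as a minimal sketch: the helper name `resolve_path` is hypothetical, and it only handles a leading `~` and relative paths, not `~user` forms or symlink normalization.

```shell
# Hypothetical helper: expand a leading ~, make the path absolute, and
# confirm the file exists before handing it to markitdown.
resolve_path() {
  p=$1
  case "$p" in
    "~")   p=$HOME ;;
    "~/"*) p=$HOME/${p#"~/"} ;;
  esac
  case "$p" in
    /*) ;;              # already absolute
    *)  p=$PWD/$p ;;    # relative: anchor at the current directory
  esac
  if [ -e "$p" ]; then
    printf '%s\n' "$p"
  else
    echo "File not found: $p" >&2
    return 1
  fi
}
```

For example, `path=$(resolve_path '~/Downloads/report.pdf') && markitdown "$path"` runs the conversion only when the file is actually there.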
Step 2: Convert
Read inline — output markdown to stdout, use directly in the conversation:
markitdown "path/to/file.pdf"
Save to file — useful for large documents or when the user wants a persistent .md file:
markitdown "path/to/file.pdf" -o "path/to/output.md"
Batch convert — convert all files of a type in a directory:
for f in /path/to/dir/*.docx; do
  [ -e "$f" ] || continue  # skip the unexpanded pattern when nothing matches
  markitdown "$f" -o "${f%.docx}.md"
done
Always quote file paths to handle spaces and special characters correctly.
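The batch pattern above can be generalized to several extensions at once. This is a sketch under assumptions: `batch_convert` is a hypothetical name, and it relies only on the `markitdown "input" -o "output"` form shown in this step.

```shell
# Hypothetical helper: convert every file with one of the given extensions
# in a directory, writing each result next to its source as .md.
batch_convert() {
  dir=$1
  shift
  for ext in "$@"; do
    for f in "$dir"/*."$ext"; do
      [ -e "$f" ] || continue     # the glob matched nothing
      if markitdown "$f" -o "${f%.*}.md"; then
        echo "converted: $f"
      else
        echo "failed: $f" >&2     # keep going with the remaining files
      fi
    done
  done
}
```

Usage would look like `batch_convert /path/to/dir docx pptx`. Every expansion is quoted, so file names with spaces are passed through intact.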
Step 3: Present results
After conversion:
- Answer the user's question from the content — don't dump raw markdown unless they explicitly ask for it. If they asked "what's in this spreadsheet?", synthesize the answer.
- For saved files, confirm the output path and approximate size.
- For large output (200+ pages, very large spreadsheets): summarize the document structure first ("This PDF has 5 sections: Executive Summary, Financials, Appendices..."), then ask which sections to focus on. Dumping a 200-page PDF into context degrades response quality.
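To confirm a saved file's size and get a quick view of its structure before reading it in full, something like the following could help. Both function names are hypothetical, and the outline assumes the converted output contains `#`-style markdown headings, which may not hold for every input format.

```shell
# Print the saved file's path with its size in bytes and lines.
describe_output() {
  bytes=$(wc -c < "$1" | tr -d ' ')
  lines=$(wc -l < "$1" | tr -d ' ')
  printf '%s: %s bytes, %s lines\n' "$1" "$bytes" "$lines"
}

# List heading lines, prefixed with their line numbers, as a rough outline.
# Prints nothing (and still succeeds) when the file has no headings.
md_outline() {
  grep -n -E '^#{1,6} ' "$1" || true
}
```

Running `md_outline "path/to/output.md"` first gives a section list to summarize from, so only the sections the user asks about need to be read into the conversation.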
Error Handling
| Error | Fix |
|---|---|
| command not found: markitdown | Run Step 0 to install |
| File not found | Check path; ask user to confirm location |
| Unsupported file format | Check extension; try markitdown[all] if not installed with extras |
| Password-protected file | markitdown cannot decrypt; ask user to provide an unlocked copy |
| Empty output from audio file | ffmpeg not installed — run apt-get install -y ffmpeg |
| Empty output from image file | tesseract not installed — run apt-get install -y tesseract-ocr |
| Empty output from PDF | File is likely a scanned image PDF with no text layer; inform user |
| YouTube URL fails | Network issue or transcript disabled; try downloading audio first |
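Several rows of the table involve empty output with no error, so a wrapper that turns that case into an explicit message may be useful. This is a sketch only: `convert_or_explain` is a hypothetical name, the hints are taken from the table above, and the wrapper guesses the likely cause from the file extension rather than inspecting the system.

```shell
# Hypothetical wrapper: run markitdown and, when the output is empty,
# print the likely cause from the error table on stderr.
convert_or_explain() {
  src=$1
  if [ ! -e "$src" ]; then
    echo "File not found: $src" >&2
    return 1
  fi
  if ! out=$(markitdown "$src"); then
    echo "markitdown failed on $src" >&2
    return 1
  fi
  if [ -z "$out" ]; then
    ext=$(printf '%s' "${src##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
      wav|mp3|m4a)      echo "Empty output: check that ffmpeg is installed" >&2 ;;
      jpg|png|gif|webp) echo "Empty output: check that tesseract is installed" >&2 ;;
      pdf)              echo "Empty output: PDF may be scanned with no text layer" >&2 ;;
      *)                echo "Empty output from $src" >&2 ;;
    esac
    return 2
  fi
  printf '%s\n' "$out"
}
```

The distinct return codes (1 for a missing file or failed run, 2 for empty output) let a calling script decide whether to retry, install a dependency, or ask the user for a different file.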
Important Notes
- markitdown is designed for LLM consumption — the output prioritizes structure and readability over pixel-perfect fidelity.
- For local files only: URLs to remote documents must be downloaded first (e.g., curl -o /tmp/file.pdf "URL"), then converted.
- The -o flag writes directly to disk without printing to stdout, which is useful when you don't want to flood the context window.