markitdown
MarkItDown Document Conversion
Convert files to Markdown using Microsoft's MarkItDown utility.
Installation
Full Installation
pip install 'markitdown[all]'
Selective Installation
pip install 'markitdown[pdf]' # PDF only
pip install 'markitdown[docx]' # Word documents
pip install 'markitdown[pptx]' # PowerPoint
pip install 'markitdown[xlsx]' # Excel
pip install 'markitdown[audio-transcription]' # Audio transcription
pip install 'markitdown[image]' # Image OCR
pip install 'markitdown[az-doc-intel]' # Azure Document Intelligence
pip install 'markitdown[llm]' # LLM image descriptions
Command-Line Usage
# Basic conversion
markitdown file.pdf
# Save to file
markitdown file.pdf > output.md
markitdown file.pdf -o output.md
# Batch conversion
for file in *.pdf; do markitdown "$file" > "${file%.pdf}.md"; done
Python API
Basic Usage
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
Stream Processing
with open("file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
With Azure Document Intelligence
from azure.core.credentials import AzureKeyCredential
md = MarkItDown(
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com",
    docintel_credential=AzureKeyCredential("your-key"),
)
With LLM Image Descriptions
from openai import OpenAI
client = OpenAI()  # any OpenAI-compatible client; reads OPENAI_API_KEY by default
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)
Supported Formats
| Format | Extensions | Features |
|---|---|---|
| PDF | .pdf | Text, tables, links, structure |
| Word | .docx | Headings, lists, tables, images, links |
| PowerPoint | .pptx | Slides, titles, content, images |
| Excel | .xlsx, .xls | Sheets, tables, headers |
| Images | .png, .jpg, .gif | EXIF, OCR, LLM descriptions |
| Audio | .wav, .mp3 | Transcription, timestamps |
| HTML | .html | Content, links, tables |
| CSV | .csv | Data tables |
| JSON | .json | Structure preservation |
| XML | .xml | Data extraction |
| ZIP | .zip | Archive processing |
| EPub | .epub | E-book content |
| YouTube | URLs | Metadata, transcripts |
Common Patterns
Batch Processing
import os
from markitdown import MarkItDown
md = MarkItDown()
os.makedirs("output", exist_ok=True)  # make sure the output directory exists
for filename in os.listdir("input/"):
    # Only convert the document types we care about
    if filename.endswith(('.pdf', '.docx', '.pptx')):
        result = md.convert(f"input/{filename}")
        base = os.path.splitext(filename)[0]
        with open(f"output/{base}.md", "w", encoding="utf-8") as f:
            f.write(result.text_content)
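The loop above only scans the top level of the input directory. A minimal sketch of a recursive variant is shown below, using pathlib to find files and to mirror the directory layout in the output. The helper names `find_convertible` and `output_path_for` are hypothetical and not part of MarkItDown; the extension set is copied from the loop above.

```python
from pathlib import Path

# Extensions taken from the batch loop above; adjust to the extras you installed.
CONVERTIBLE = {".pdf", ".docx", ".pptx"}


def output_path_for(src: Path, input_root: Path, output_root: Path) -> Path:
    """Map input_root/a/b.pdf to output_root/a/b.md (hypothetical helper)."""
    relative = src.relative_to(input_root)
    return (output_root / relative).with_suffix(".md")


def find_convertible(input_root: Path) -> list[Path]:
    """Return all files under input_root whose suffix is in CONVERTIBLE, sorted."""
    return sorted(
        p for p in input_root.rglob("*")
        if p.is_file() and p.suffix.lower() in CONVERTIBLE
    )
```

Each path returned by `find_convertible(Path("input"))` can then be passed to `md.convert(str(path))`. Create the target directory with `output_path_for(...).parent.mkdir(parents=True, exist_ok=True)` before writing the result.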
Error Handling
try:
    result = md.convert("file.pdf")
    markdown = result.text_content
except Exception as e:
    print(f"Conversion failed: {e}")
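When many files are converted in one run, it can help to record failures and keep going rather than stop at the first exception. The sketch below assumes a `convert` callable that behaves like `MarkItDown().convert`: it returns an object with a `text_content` attribute or raises. The function name `convert_all` is hypothetical.

```python
from typing import Any, Callable, Iterable


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str], Any],
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each path and return (markdown_by_path, error_by_path).

    `convert` is assumed to return an object exposing `text_content`
    and to raise an exception when a file cannot be converted.
    """
    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for path in paths:
        try:
            converted[path] = convert(path).text_content
        except Exception as exc:  # broad on purpose: converters raise varied errors
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

A call such as `ok, bad = convert_all(paths, md.convert)` leaves the successful conversions in `ok` and a short error message per failed file in `bad`.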
Memory-Efficient Processing
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
Docker Usage
# Build
docker build -t markitdown:latest .
# Run
docker run --rm -i markitdown:latest < input.pdf > output.md
# With volume
docker run --rm -v "$(pwd)":/data markitdown:latest /data/file.pdf
Output Format
MarkItDown produces clean, structured Markdown:
# Document Title
## Section Heading
Content with **bold** and *italic* formatting.
- Bullet lists
- Preserved from source
| Table | Headers |
|-------|---------|
| Data | Values |
[Links](https://example.com) maintained.
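Because the output is plain Markdown like the sample above, its structure can be inspected with ordinary string handling. The sketch below lists the `#`-style headings in a converted document, skipping anything inside fenced code blocks. The name `heading_outline` is hypothetical, and the parsing is deliberately simple rather than a full Markdown parser.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def heading_outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each ATX heading outside fenced code blocks."""
    outline: list[tuple[int, str]] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline
```

Running `heading_outline(result.text_content)` on a converted document gives a quick check that the expected titles and sections survived conversion.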
Best Practices
Performance
- Use streams for files >10MB
- Batch process multiple files
- Cache converted results
- Use selective dependencies
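One way to cache converted results is to key them by a hash of the source file's bytes, so an unchanged file is never converted twice. The sketch below is an assumption about how such a cache could look, not a MarkItDown feature. `cached_convert` and the cache directory layout are hypothetical, and `convert` is any callable that takes a path string and returns Markdown text.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_convert(
    src: Path,
    cache_dir: Path,
    convert: Callable[[str], str],
) -> str:
    """Return Markdown for src, reusing a cached copy keyed by file content."""
    # Reads the whole file to hash it; for very large files, hash in chunks instead.
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(str(src))
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

With MarkItDown, the callable could be `lambda p: md.convert(p).text_content`, for example `cached_convert(Path("report.pdf"), Path(".md_cache"), lambda p: md.convert(p).text_content)`.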
Quality
- High-resolution images for OCR
- Well-formatted source documents
- Azure Document Intelligence for complex PDFs
- LLM descriptions for important images
Integration
- Check token counts for LLM limits
- Chunk long documents
- Preserve metadata in context
- Validate output structure
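Checking token counts and chunking long documents can be sketched together. The example below splits Markdown before heading lines and packs sections into chunks under a token budget. The four-characters-per-token estimate is a rough assumption; use a real tokenizer for the target model when accuracy matters. Both function names are hypothetical. A section that is larger than the budget on its own is kept whole, and any line starting with `#` is treated as a heading, including one inside a code block.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 1000) -> list[str]:
    """Split Markdown before headings, packing sections into chunks under max_tokens."""
    # Step 1: cut the document into sections that each start at a heading line.
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Step 2: greedily pack consecutive sections into chunks.
    chunks: list[str] = []
    buffer = ""
    for section in sections:
        if buffer and estimate_tokens(buffer + section) > max_tokens:
            chunks.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        chunks.append(buffer)
    return chunks
```

Joining the chunks reproduces the original text, so no content is lost by the split.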
Troubleshooting
| Issue | Solution |
|---|---|
| Import errors | pip install --upgrade 'markitdown[all]' |
| Memory errors | Use convert_stream() instead of convert() |
| Poor OCR | Increase image resolution, use Azure |
| Missing content | Check source file quality |
Requirements
- Python 3.10+
- Virtual environment recommended
- Optional: Azure subscription for enhanced features
- Optional: OpenAI API for image descriptions
When to Use This Skill
- Converting documents for AI analysis
- Extracting content from PDFs
- Processing Word/PowerPoint files
- Preparing data for language models
- Batch document conversion
- Building document pipelines