markitdown

SKILL.md

Document to Markdown Conversion

Overview

Convert documents in many formats to clean Markdown with Microsoft's MarkItDown tool. The output is intended for LLM processing, content extraction, and document analysis workflows.

Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPUB

Quick Start

Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Command Line

# Convert single file
markitdown document.pdf > output.md
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown

🔒 Security Considerations

Before using in production:

  • ✅ Validate file types (MIME, not extension)
  • ✅ Limit file sizes (prevent DoS)
  • ✅ Sanitize file paths (prevent traversal)
  • ✅ Protect API keys (never hardcode)
  • ✅ Consider data privacy (external services)

See patterns.md for implementation details.
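
As one illustration of the "validate file types" item, the sketch below classifies a file by its leading bytes instead of trusting its extension. The `SIGNATURES` table and the `sniff_kind` and `sniff_file` helpers are hypothetical names, not part of MarkItDown, and the table covers only a few formats. A production service would likely use a fuller detection library.

```python
from pathlib import Path
from typing import Optional, Union

# Leading bytes ("magic numbers") for a few formats; extend as needed.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip-container",  # .docx, .pptx, .xlsx, .epub and plain .zip
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_kind(header: bytes) -> Optional[str]:
    """Return a coarse file kind based on leading bytes, or None if unknown."""
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return None


def sniff_file(path: Union[str, Path]) -> Optional[str]:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_kind(fh.read(16))
```

A caller could reject an upload when `sniff_file(path)` returns `None` or a kind that does not match what the endpoint expects. Office formats share the ZIP signature, so telling them apart needs a look inside the archive.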

API Key Security

❌ NEVER:

  • Hardcode keys in code
  • Commit .env files to git
  • Log environment variables

✅ ALWAYS:

  • Use environment variables: export OPENAI_API_KEY="sk-..." # pragma: allowlist secret
  • Use secret management (AWS Secrets Manager, Azure Key Vault)
  • Rotate keys regularly
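
A minimal sketch of the two rules above that are easiest to get wrong in code: failing fast when a key is missing, and never writing the full value to a log. The helper names `require_env` and `mask` are hypothetical and are not part of MarkItDown or the OpenAI client.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return a required environment variable or fail with a clear error."""
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"{name} not set")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safer to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

If a log line has to identify which key is in use, logging `mask(require_env("OPENAI_API_KEY"))` exposes only the last four characters.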

Common Patterns

PDF Documents

# Basic PDF conversion
md = MarkItDown()
result = md.convert("report.pdf")

# With Azure Document Intelligence (better quality)
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("report.pdf")

Office Documents

# Word documents - preserves structure
result = md.convert("document.docx")

# Excel - converts tables to markdown tables
result = md.convert("spreadsheet.xlsx")

# PowerPoint - extracts slide content
result = md.convert("presentation.pptx")

Images with Descriptions

# ✅ SECURE: Using environment variables for API keys
import os
from markitdown import MarkItDown
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.jpg")  # Gets AI-generated description

Batch Processing

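from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
documents = Path(".").glob("*.pdf")

for doc in documents:
    result = md.convert(str(doc))
    # Write next to the source file, e.g. report.pdf -> report.md
    output_path = doc.with_suffix(".md")
    # Explicit encoding avoids failures on platforms whose default is not UTF-8
    output_path.write_text(result.text_content, encoding="utf-8")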
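
In a larger batch, one unreadable file would stop the loop above. The sketch below is one way to keep going and collect failures. `convert_all` is a hypothetical helper, and it takes the converter as a plain callable so it makes no assumptions about MarkItDown beyond what the loop above already uses.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(convert: Callable[[str], str], sources: Iterable[Path]):
    """Convert each source to a sibling .md file; return (written, failed)."""
    written = []
    failed = {}
    for src in sources:
        try:
            text = convert(str(src))
        except Exception as exc:  # one bad file should not stop the batch
            failed[src] = f"{type(exc).__name__}: {exc}"
            continue
        out = src.with_suffix(".md")
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written, failed
```

With the objects from the example above, this could be called as `convert_all(lambda p: md.convert(p).text_content, Path(".").glob("*.pdf"))`, after which `failed` maps each problem file to a short error string.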

Installation

# Full installation (all features)
pip install 'markitdown[all]'

# Selective features
pip install 'markitdown[pdf, docx, pptx]'

Requirements: Python 3.10 or higher

Key Features

  • Structure Preservation: Maintains headings, lists, tables, links
  • Plugin System: Extend with custom converters
  • Docker Support: Containerized deployments
  • MCP Integration: Model Context Protocol server for LLM apps

When to Read Supporting Files

  • reference.md - Read when you need:

    • Complete API reference and all configuration options
    • Azure Document Intelligence integration details
    • Plugin development guide
    • Docker and MCP server setup
    • Troubleshooting and error handling
  • examples.md - Read when you need:

    • Working examples for specific file types
    • Batch processing workflows
    • Error handling patterns
    • Integration with existing pipelines
  • patterns.md - Read when you need:

    • Production deployment patterns
    • Performance optimization strategies
    • Security considerations
    • Anti-patterns to avoid

Quick Reference

| File Type | Use Case | Command |
| --- | --- | --- |
| PDF | Reports, papers | md.convert("file.pdf") |
| Word | Documents | md.convert("file.docx") |
| Excel | Data tables | md.convert("file.xlsx") |
| PowerPoint | Presentations | md.convert("file.pptx") |
| Images | Diagrams (LLM description) | md = MarkItDown(llm_client=client, llm_model="gpt-4o"); md.convert("img.jpg") |
| HTML | Web pages | md.convert("page.html") |
| ZIP | Archives (contents are processed) | md.convert("archive.zip") |

⚠️ Common Mistakes to Avoid

Anti-Pattern 1: Hardcoded API Keys

# ❌ NEVER DO THIS
md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))

# ✅ ALWAYS DO THIS
import os

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")
md = MarkItDown(llm_client=OpenAI(api_key=api_key))

Anti-Pattern 2: Unvalidated File Paths

# ❌ Vulnerable to path traversal
user_input = "../../../etc/passwd"
md.convert(user_input)

# ✅ Validate and sanitize
from pathlib import Path
allowed_dir = Path("uploads").resolve()  # example base directory; use your own
safe_path = (allowed_dir / user_input).resolve()
if not safe_path.is_relative_to(allowed_dir):
    raise ValueError("Invalid path")
md.convert(str(safe_path))

Anti-Pattern 3: Ignoring File Size Limits

# ❌ Can cause DoS
md.convert("huge_file.pdf")  # No size check

# ✅ Check size first
from pathlib import Path
max_size = 50 * 1024 * 1024  # 50 MB
if Path("huge_file.pdf").stat().st_size > max_size:
    raise ValueError("File too large")
md.convert("huge_file.pdf")
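
The path check from Anti-Pattern 2 and the size check from Anti-Pattern 3 can be combined into a single pre-flight step. The sketch below assumes uploads live under one base directory. `checked_path` and `MAX_BYTES` are hypothetical names, not MarkItDown APIs.

```python
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB, the same limit as in the example above


def checked_path(user_input: str, allowed_dir: Path, max_bytes: int = MAX_BYTES) -> Path:
    """Resolve user input inside allowed_dir and enforce a size limit."""
    base = allowed_dir.resolve()
    candidate = (base / user_input).resolve()
    # resolve() collapses ".." and symlinks, so this catches traversal attempts
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError("Invalid path")
    if not candidate.is_file():
        raise ValueError("Not a regular file")
    if candidate.stat().st_size > max_bytes:
        raise ValueError("File too large")
    return candidate
```

The returned path can then be passed on, for example `md.convert(str(checked_path(name, Path("uploads"))))`. An absolute `user_input` replaces the base when joined, which is why the containment check runs on the resolved result and not on the raw string.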

Common Issues

  • Import Error: Ensure Python >= 3.10 and that markitdown is installed
  • Missing Dependencies: Install with pip install 'markitdown[all]'
  • Image Descriptions Not Working: Requires an LLM client (OpenAI or compatible) and a model name (llm_model)

For detailed troubleshooting, see reference.md.
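
The first two issues can be checked from a script before anything is converted. This is a rough sketch using only the standard library. `diagnose` is a hypothetical helper, and it only tests that the top-level package is importable, not that every optional extra is present.

```python
import importlib.util
import sys


def diagnose(min_version=(3, 10), package="markitdown"):
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(map(str, sys.version_info[:3]))
        wanted = ".".join(map(str, min_version))
        problems.append(f"Python {found} is older than {wanted}")
    # find_spec returns None when a top-level package cannot be located
    if importlib.util.find_spec(package) is None:
        problems.append(f"package '{package}' is not installed")
    return problems


if __name__ == "__main__":
    for line in diagnose() or ["environment looks fine"]:
        print(line)
```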
