office-to-md
SKILL.md
Office to Markdown Skill
Overview
This skill enables conversion from various Office formats to Markdown using markitdown - Microsoft's open-source tool for converting documents to Markdown. Perfect for making Office content searchable, version-controllable, and AI-friendly.
How to Use
- Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
- Optionally specify conversion options
- I'll convert it to clean Markdown
Example prompts:
- "Convert this Word document to Markdown"
- "Turn this PowerPoint into Markdown notes"
- "Extract content from this PDF as Markdown"
- "Convert this Excel file to Markdown tables"
Domain Knowledge
markitdown Fundamentals
from markitdown import MarkItDown
# Initialize converter
md = MarkItDown()
# Convert file
result = md.convert("document.docx")
print(result.text_content)
# Save to file
with open("output.md", "w") as f:
f.write(result.text_content)
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| Text extraction | ||
| HTML | .html | Clean markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |
Basic Usage
Python API
from markitdown import MarkItDown
# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")
# Access content
markdown_text = result.text_content
# With options
md = MarkItDown(
llm_client=None, # Optional LLM for enhanced processing
llm_model=None # Model name if using LLM
)
Command Line
# Install
pip install markitdown
# Convert file
markitdown document.docx > output.md
# Or with output file
markitdown document.docx -o output.md
Word Document Conversion
from markitdown import MarkItDown
md = MarkItDown()
# Convert Word document
result = md.convert("report.docx")
# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks
print(result.text_content)
Example Output:
# Annual Report 2024
## Executive Summary
This report summarizes the key achievements and challenges...
### Key Metrics
| Metric | 2023 | 2024 | Change |
|--------|------|------|--------|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |
## Detailed Analysis
The following sections provide...
Excel Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
# Each sheet becomes a section
# Data becomes markdown tables
print(result.text_content)
Example Output:
## Sheet1
| Name | Department | Salary |
|------|------------|--------|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |
## Sheet2
| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | 100 | 120 | 150 | 180 |
PowerPoint Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
# Each slide becomes a section
# Speaker notes included if present
print(result.text_content)
Example Output:
# Slide 1: Company Overview
Our mission is to...
## Key Points
- Innovation first
- Customer focused
- Global reach
---
# Slide 2: Market Analysis
The market opportunity is significant...
**Notes:** Mention the competitor analysis here
PDF Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
# Extracts text content
# Tables converted where detected
print(result.text_content)
Image Conversion (with Vision Model)
from markitdown import MarkItDown
import anthropic
# Use Claude for image description
client = anthropic.Anthropic()
md = MarkItDown(
llm_client=client,
llm_model="claude-sonnet-4-20250514"
)
result = md.convert("diagram.png")
print(result.text_content)
# Output: Description of the image content
Batch Conversion
from markitdown import MarkItDown
from pathlib import Path
def batch_convert(input_dir, output_dir):
"""Convert all Office files to Markdown."""
md = MarkItDown()
input_path = Path(input_dir)
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
extensions = ['.docx', '.xlsx', '.pptx', '.pdf']
for ext in extensions:
for file in input_path.glob(f'*{ext}'):
try:
result = md.convert(str(file))
output_file = output_path / f"{file.stem}.md"
with open(output_file, 'w') as f:
f.write(result.text_content)
print(f"Converted: {file.name}")
except Exception as e:
print(f"Error converting {file.name}: {e}")
batch_convert('./documents', './markdown')
Best Practices
- Check Output Quality: Review converted Markdown for accuracy
- Handle Tables: Complex tables may need manual adjustment
- Preserve Structure: Use consistent heading levels in source docs
- Image Handling: Consider using vision models for important images
- Version Control: Store converted Markdown in Git for tracking
Common Patterns
Document Archive
import os
from datetime import datetime
from markitdown import MarkItDown
def archive_document(doc_path, archive_dir):
"""Convert and archive Office document to Markdown."""
md = MarkItDown()
result = md.convert(doc_path)
# Create archive structure
date_str = datetime.now().strftime('%Y-%m-%d')
filename = os.path.basename(doc_path)
base_name = os.path.splitext(filename)[0]
# Save with metadata
output_content = f"""---
source: {filename}
converted: {date_str}
---
{result.text_content}
"""
output_path = os.path.join(archive_dir, f"{base_name}.md")
with open(output_path, 'w') as f:
f.write(output_content)
return output_path
AI-Ready Corpus
from markitdown import MarkItDown
from pathlib import Path
import json
def create_ai_corpus(doc_folder, output_file):
"""Convert documents to JSON corpus for AI training/RAG."""
md = MarkItDown()
corpus = []
for doc in Path(doc_folder).glob('**/*'):
if doc.suffix in ['.docx', '.pdf', '.pptx', '.xlsx']:
try:
result = md.convert(str(doc))
corpus.append({
'source': str(doc),
'filename': doc.name,
'content': result.text_content,
'type': doc.suffix[1:]
})
except Exception as e:
print(f"Skipped {doc.name}: {e}")
with open(output_file, 'w') as f:
json.dump(corpus, f, indent=2)
print(f"Created corpus with {len(corpus)} documents")
return corpus
Examples
Example 1: Convert Documentation Suite
from markitdown import MarkItDown
from pathlib import Path
def convert_docs_to_wiki(docs_folder, wiki_folder):
"""Convert all Office docs to markdown wiki structure."""
md = MarkItDown()
docs_path = Path(docs_folder)
wiki_path = Path(wiki_folder)
# Create wiki structure
wiki_path.mkdir(exist_ok=True)
# Create index
index_content = "# Documentation Index\n\n"
for doc in sorted(docs_path.glob('**/*.docx')):
try:
result = md.convert(str(doc))
# Create relative path in wiki
rel_path = doc.relative_to(docs_path)
output_file = wiki_path / rel_path.with_suffix('.md')
output_file.parent.mkdir(parents=True, exist_ok=True)
# Write markdown
with open(output_file, 'w') as f:
f.write(result.text_content)
# Add to index
link = str(rel_path.with_suffix('.md')).replace('\\', '/')
index_content += f"- [{doc.stem}]({link})\n"
print(f"Converted: {doc.name}")
except Exception as e:
print(f"Error: {doc.name} - {e}")
# Write index
with open(wiki_path / 'index.md', 'w') as f:
f.write(index_content)
convert_docs_to_wiki('./company_docs', './wiki')
Example 2: Meeting Notes Processor
from markitdown import MarkItDown
import re
from datetime import datetime
def process_meeting_notes(pptx_path):
"""Extract and structure meeting notes from PowerPoint."""
md = MarkItDown()
result = md.convert(pptx_path)
# Parse the markdown
content = result.text_content
# Extract sections
sections = {
'attendees': [],
'agenda': [],
'decisions': [],
'action_items': []
}
current_section = None
for line in content.split('\n'):
line_lower = line.lower()
if 'attendee' in line_lower or 'participant' in line_lower:
current_section = 'attendees'
elif 'agenda' in line_lower:
current_section = 'agenda'
elif 'decision' in line_lower:
current_section = 'decisions'
elif 'action' in line_lower:
current_section = 'action_items'
elif line.strip().startswith(('-', '*', '•')) and current_section:
sections[current_section].append(line.strip()[1:].strip())
# Generate structured output
output = f"""# Meeting Notes
**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Source:** {pptx_path}
## Attendees
{chr(10).join('- ' + a for a in sections['attendees'])}
## Agenda
{chr(10).join('- ' + a for a in sections['agenda'])}
## Decisions Made
{chr(10).join('- ' + d for d in sections['decisions'])}
## Action Items
{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""
return output
notes = process_meeting_notes('team_meeting.pptx')
print(notes)
Example 3: Excel to Documentation
from markitdown import MarkItDown
def excel_to_data_dictionary(xlsx_path):
"""Convert Excel data model to data dictionary documentation."""
md = MarkItDown()
result = md.convert(xlsx_path)
# Add documentation structure
doc = f"""# Data Dictionary
Generated from: `{xlsx_path}`
{result.text_content}
## Usage Notes
- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications
## Change Log
| Date | Change | Author |
|------|--------|--------|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""
return doc
documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w') as f:
f.write(documentation)
Limitations
- Complex formatting may be simplified
- Images are not embedded (use vision model for descriptions)
- Some table structures may not convert perfectly
- Track changes in Word are not preserved
- Comments may not be extracted
Installation
pip install markitdown
# For image/audio processing
pip install markitdown[all]
# For specific features
pip install markitdown[images] # Image OCR
pip install markitdown[audio] # Audio transcription