Web to Markdown
Overview
Captures web pages with headless Playwright browser automation (so JavaScript-rendered content is handled), converts the HTML to clean markdown via the Turndown library, and saves all URLs from a single request into one timestamped file.
Key features:
- Headless browser (no visible window)
- Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
- Batch processing: multiple URLs → single file
- Self-contained (no MCP dependency)
Prerequisites
- Node.js/pnpm for running TypeScript scripts
- Dependencies installed:
cd skills/web-to-markdown && pnpm install
- Playwright browsers will auto-install on first run
Usage
Use this skill when the user provides URLs and asks to:
- "capture these pages as markdown"
- "save web content for documentation"
- "fetch and convert these webpages"
- "get markdown from these sites"
- "download and convert to markdown"
Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.
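The output path above could be derived from the current time roughly as follows. This is a minimal sketch, not the script's actual code: `formatTimestamp` and `capturePath` are hypothetical names, and the use of local time (rather than UTC) is an assumption.

```typescript
// Hypothetical helpers; the names are illustrative, not taken from the script.
const pad2 = (n: number): string => (n < 10 ? "0" + n : String(n));

// Formats a Date as YYYYMMDD_HHMMSS using local time (assumed, not confirmed).
const formatTimestamp = (d: Date): string =>
  `${d.getFullYear()}${pad2(d.getMonth() + 1)}${pad2(d.getDate())}` +
  `_${pad2(d.getHours())}${pad2(d.getMinutes())}${pad2(d.getSeconds())}`;

// Joins the output directory and the timestamp into the capture file path.
const capturePath = (outputDir: string, d: Date): string =>
  `${outputDir}/${formatTimestamp(d)}.md`;
```

Because the timestamp has one-second resolution, two requests started within the same second would map to the same file name under this scheme.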
Workflow
Single Command
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...
That's it! The script:
- Creates timestamped output file with header
- Launches headless Playwright browser
- Scrapes each URL sequentially
- Converts HTML → markdown (Turndown)
- Appends to output file with formatted headers
- Closes browser and reports summary
Examples
Single URL:
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs
# Output: docs/web-captures/20251103_143052.md
Multiple URLs (Batch):
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts \
https://example.com/guide \
https://example.com/api \
https://example.com/faq
# Output: docs/web-captures/20251103_143052.md (all 3 pages)
From project root:
pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>
Output Format
# Web Captures - YYYY-MM-DD HH:MM:SS
Generated: YYYYMMDD_HHMMSS
URLs: N
---
## https://example.com/page1
[Converted markdown content...]
---
## https://example.com/page2
[Converted markdown content...]
---
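One way to assemble a file in this layout is with small pure string builders. The sketch below is illustrative only: `CapturedPage`, `renderHeader`, `renderSection`, and `renderDocument` are hypothetical names, and the exact blank-line spacing and heading decoration used by the real script are assumptions.

```typescript
// Hypothetical shape for one converted page.
interface CapturedPage {
  url: string;
  markdown: string;
}

// File header: title line, run id, URL count, then a horizontal rule.
const renderHeader = (title: string, runId: string, urlCount: number): string =>
  [`# Web Captures - ${title}`, `Generated: ${runId}`, `URLs: ${urlCount}`, "", "---", ""].join("\n");

// One section per page: the URL as a level-2 heading, the content, then a rule.
const renderSection = (page: CapturedPage): string =>
  [`## ${page.url}`, "", page.markdown.trim(), "", "---", ""].join("\n");

// Header followed by every section, separated by blank lines.
const renderDocument = (title: string, runId: string, pages: CapturedPage[]): string =>
  renderHeader(title, runId, pages.length) + "\n" + pages.map(renderSection).join("\n");
```

Keeping the rendering pure means the only side effect left is the file write, which fits the "side effects isolated at edges" approach described in the implementation notes.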
Implementation Notes
TypeScript + FP Patterns:
- Pure functions (no classes except custom errors)
- Explicit error handling with typed errors: BrowserError, FileError, HtmlConversionError
- Small, focused functions
- Side effects isolated at edges
- CLI-style logging with chalk/ora
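The typed-error pattern in the list above might look something like the sketch below. Only the three class names come from this document; the `kind` tag, the constructor fields, the `Result` type, and the `tryConvert` and `describeError` helpers are assumptions added for illustration.

```typescript
// Class names are from the notes above; fields and the kind tag are assumed.
class BrowserError extends Error {
  readonly kind = "BrowserError" as const;
  constructor(readonly url: string, message: string) {
    super(message);
    this.name = "BrowserError";
  }
}

class FileError extends Error {
  readonly kind = "FileError" as const;
  constructor(readonly path: string, message: string) {
    super(message);
    this.name = "FileError";
  }
}

class HtmlConversionError extends Error {
  readonly kind = "HtmlConversionError" as const;
  constructor(readonly url: string, message: string) {
    super(message);
    this.name = "HtmlConversionError";
  }
}

type ScrapeError = BrowserError | FileError | HtmlConversionError;

// Hypothetical result type so callers handle failures explicitly.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Wraps a conversion function so a thrown exception becomes a typed error value.
const tryConvert = (
  url: string,
  html: string,
  convert: (html: string) => string,
): Result<string, HtmlConversionError> => {
  try {
    return { ok: true, value: convert(html) };
  } catch (cause) {
    return { ok: false, error: new HtmlConversionError(url, String(cause)) };
  }
};

// Exhaustive switch on the kind tag; the compiler flags a missing case.
const describeError = (e: ScrapeError): string => {
  switch (e.kind) {
    case "BrowserError":
      return `browser: ${e.url}: ${e.message}`;
    case "FileError":
      return `file: ${e.path}: ${e.message}`;
    case "HtmlConversionError":
      return `conversion: ${e.url}: ${e.message}`;
  }
};
```

Switching on a literal `kind` field rather than `instanceof` keeps the discrimination reliable even when the code is compiled to older targets, where subclassing `Error` can lose the prototype chain.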
File Structure:
skills/web-to-markdown/
├── SKILL.md                        # This file (workflow instructions)
├── package.json                    # pnpm workspace config
├── tsconfig.json                   # TypeScript config
├── scripts/
│   ├── scrape-and-convert.ts       # Main CLI (Playwright + Turndown)
│   ├── html-to-markdown.ts         # Pure conversion function (Turndown wrapper)
│   └── convert-and-append.ts       # Legacy CLI (deprecated, kept for reference)
└── tests/
    └── html-to-markdown.test.ts    # Unit tests
Error Handling
- Browser launch failure: Check the Playwright installation, then run
pnpm exec playwright install chromium
- Page not found (404): Logs error, continues with other URLs
- Timeout (>30s): Reports slow page, continues to next URL
- Navigation error: Logs error, continues to next URL
- Conversion failure: Reports malformed HTML, skips page
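The "log and continue" behavior listed above can be sketched as a sequential loop that records each failure instead of aborting the batch. This is an assumption about the structure, not the script's actual code: `captureAll`, `summarize`, and `PageOutcome` are hypothetical names, and `loadPage` stands in for the real Playwright navigation plus Turndown conversion.

```typescript
// Hypothetical outcome record for one URL.
type PageOutcome =
  | { url: string; ok: true; markdown: string }
  | { url: string; ok: false; reason: string };

// Processes URLs one at a time; a failure is recorded and the loop moves on.
const captureAll = async (
  urls: string[],
  loadPage: (url: string) => Promise<string>,
): Promise<PageOutcome[]> => {
  const outcomes: PageOutcome[] = [];
  for (const url of urls) {
    try {
      // Awaiting inside the loop keeps processing strictly sequential.
      outcomes.push({ url, ok: true, markdown: await loadPage(url) });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      outcomes.push({ url, ok: false, reason });
    }
  }
  return outcomes;
};

// Builds the end-of-run summary line from the collected outcomes.
const summarize = (outcomes: PageOutcome[]): string => {
  const failed = outcomes.filter((o) => !o.ok).length;
  return `${outcomes.length - failed} succeeded, ${failed} failed`;
};
```

Injecting `loadPage` keeps the loop testable without a browser: a test can pass a stub that throws for one URL and confirm the remaining URLs are still processed.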
Configuration
Default timeout: 30 seconds per page
To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts:
const DEFAULT_CONFIG = {
outputDir: 'docs/web-captures',
timeout: 30000, // milliseconds
};
Performance
- Headless mode (no GUI overhead)
- Sequential processing (one URL at a time for stability)
- Browser reuse across URLs (faster than launching per-page)
- ~2-5 seconds per page (depends on site complexity)
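Browser reuse, as described in the list above, amounts to launching once, opening a fresh page per URL, and closing the browser in a `finally` block. The sketch below uses small structural interfaces instead of importing Playwright so it stays self-contained. The method names (`newPage`, `goto`, `content`, `close`) mirror Playwright's `Browser` and `Page`, but `BrowserLike`, `PageLike`, and `scrapeWithSharedBrowser` are hypothetical, and per-URL error handling is left out to keep the focus on reuse.

```typescript
// Structural stand-ins for the parts of Playwright's Page and Browser used here (assumed shapes).
interface PageLike {
  goto(url: string, options?: { timeout?: number }): Promise<unknown>;
  content(): Promise<string>;
  close(): Promise<void>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Launches one browser, visits every URL with it, and always closes it afterwards.
const scrapeWithSharedBrowser = async (
  launch: () => Promise<BrowserLike>,
  urls: string[],
  timeout: number,
): Promise<string[]> => {
  const browser = await launch();
  try {
    const html: string[] = [];
    for (const url of urls) {
      // A new page is much cheaper than a new browser process.
      const page = await browser.newPage();
      try {
        await page.goto(url, { timeout });
        html.push(await page.content());
      } finally {
        await page.close();
      }
    }
    return html;
  } finally {
    await browser.close();
  }
};
```

With Playwright installed, `launch` could plausibly be `() => chromium.launch({ headless: true })`, since its `Browser` type should satisfy `BrowserLike` structurally, though that wiring is untested here.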
Comparison with Alternatives
| Feature | web-to-markdown | scratchpad-fetch | Jina AI Reader |
|---|---|---|---|
| Transport | Playwright (headless) | curl (HTTP) | Cloud API |
| JavaScript | ✅ Full rendering | ❌ No | ✅ Server-side |
| Conversion | ✅ Turndown | ❌ Raw HTML | ✅ LLM-powered |
| Self-hosted | ✅ Yes | ✅ Yes | ❌ Cloud only |
| Setup | pnpm install | None | API key |
| Speed | Medium (2-5s/page) | Fast (<1s) | Fast (~2s) |
| Visible browser | ❌ No (headless) | N/A | N/A |
Troubleshooting
"Executable doesn't exist" error:
cd skills/web-to-markdown
pnpm exec playwright install chromium
Pages timing out:
- Increase timeout in DEFAULT_CONFIG
- Check network connectivity
- Some sites may block automated browsers (consider Jina AI Reader as an alternative)
Empty markdown output:
- Site may use heavy client-side rendering
- Try waiting longer (increase timeout)
- Check if site blocks headless browsers (User-Agent detection)
Notes
- One request = one file (all URLs aggregated)
- Handles JavaScript-rendered content (React, Vue, Angular, etc.)
- Headless by default (no visible browser window)
- Browser auto-installs on first run
- Ideal for documentation scraping and archival
- No external API dependencies