ā–„NYC
skills/smithery/ai/web-to-markdown

web-to-markdown

SKILL.md

Web to Markdown

Overview

Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.

Key features:

  • Headless browser (no visible window)
  • Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
  • Batch processing multiple URLs → single file
  • Self-contained (no MCP dependency)

Prerequisites

  • Node.js/pnpm for running TypeScript scripts
  • Dependencies installed: cd skills/web-to-markdown && pnpm install
  • Playwright browsers will auto-install on first run

Usage

When user provides URLs and asks to:

  • "capture these pages as markdown"
  • "save web content for documentation"
  • "fetch and convert these webpages"
  • "get markdown from these sites"
  • "download and convert to markdown"

Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.

Workflow

Single Command

cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...

That's it! Script handles:

  1. Creates timestamped output file with header
  2. Launches headless Playwright browser
  3. Scrapes each URL sequentially
  4. Converts HTML → markdown (Turndown)
  5. Appends to output file with formatted headers
  6. Closes browser and reports summary

Examples

Single URL:

cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs

# Output: docs/web-captures/20251103_143052.md

Multiple URLs (Batch):

cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts \
  https://example.com/guide \
  https://example.com/api \
  https://example.com/faq

# Output: docs/web-captures/20251103_143052.md (all 3 pages)

From project root:

pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>

Output Format

# Web Captures - YYYY-MM-DD HH:MM:SS

Generated: YYYYMMDD_HHMMSS
URLs: N

---

## šŸ“„ https://example.com/page1

[Converted markdown content...]

---

## šŸ“„ https://example.com/page2

[Converted markdown content...]

---

Implementation Notes

TypeScript + FP Patterns:

  • Pure functions (no classes except custom errors)
  • Explicit error handling with typed errors: BrowserError, FileError, HtmlConversionError
  • Small, focused functions
  • Side effects isolated at edges
  • CLI-style logging with chalk/ora

File Structure:

skills/web-to-markdown/
ā”œā”€ā”€ SKILL.md                    # This file (workflow instructions)
ā”œā”€ā”€ package.json                # pnpm workspace config
ā”œā”€ā”€ tsconfig.json               # TypeScript config
ā”œā”€ā”€ scripts/
│   ā”œā”€ā”€ scrape-and-convert.ts   # Main CLI (Playwright + Turndown)
│   ā”œā”€ā”€ html-to-markdown.ts     # Pure conversion function (Turndown wrapper)
│   └── convert-and-append.ts   # Legacy CLI (deprecated, kept for reference)
└── tests/
    └── html-to-markdown.test.ts # Unit tests

Error Handling

  • Browser launch failure: Check Playwright installation, run pnpm exec playwright install chromium
  • Page not found (404): Logs error, continues with other URLs
  • Timeout (>30s): Reports slow page, continues to next URL
  • Navigation error: Logs error, continues to next URL
  • Conversion failure: Reports malformed HTML, skips page

Configuration

Default timeout: 30 seconds per page

To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts:

const DEFAULT_CONFIG = {
  outputDir: 'docs/web-captures',
  timeout: 30000, // milliseconds
};

Performance

  • Headless mode (no GUI overhead)
  • Sequential processing (one URL at a time for stability)
  • Browser reuse across URLs (faster than launching per-page)
  • ~2-5 seconds per page (depends on site complexity)

Comparison with Alternatives

Feature web-to-markdown scratchpad-fetch Jina AI Reader
Transport Playwright (headless) curl (HTTP) Cloud API
JavaScript āœ… Full rendering āŒ No āœ… Server-side
Conversion āœ… Turndown āŒ Raw HTML āœ… LLM-powered
Self-hosted āœ… Yes āœ… Yes āŒ Cloud only
Setup pnpm install None API key
Speed Medium (2-5s/page) Fast (<1s) Fast (~2s)
Visible browser āŒ No (headless) N/A N/A

Troubleshooting

"Executable doesn't exist" error:

cd skills/web-to-markdown
pnpm exec playwright install chromium

Pages timing out:

  • Increase timeout in DEFAULT_CONFIG
  • Check network connectivity
  • Some sites may block automated browsers (use Jina AI Reader alternative)

Empty markdown output:

  • Site may use heavy client-side rendering
  • Try waiting longer (increase timeout)
  • Check if site blocks headless browsers (User-Agent detection)

Notes

  • One request = one file (all URLs aggregated)
  • Handles JavaScript-rendered content (React, Vue, Angular, etc.)
  • Headless by default (no visible browser window)
  • Browser auto-installs on first run
  • Ideal for documentation scraping and archival
  • No external API dependencies
Weekly Installs
1
Repository
smithery/ai
First Seen
10 days ago
Installed on
amp1
opencode1
cursor1
kimi-cli1
droid1
codex1