# jb-docs-scraper

Documentation Scraper: scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.
## Quick Start

```bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>

# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
```
Output goes to `./docs/<auto-detected-name>/` by default.
## Prerequisites (First Time Only)

Install the Playwright browser binaries that crawl4ai needs:

```bash
uv run --with crawl4ai playwright install
```
## Usage

```bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
```
### Options

| Option | Description | Default |
|---|---|---|
| `-o, --output PATH` | Output directory | `./docs/<auto-detected-name>` |
| `--max-depth N` | Maximum link depth | 6 |
| `--max-pages N` | Maximum pages to scrape | 500 |
| `--url-pattern PATTERN` | URL filter (glob) | Auto-detected |
| `-q, --quiet` | Suppress verbose output | False |
### Examples

```bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://docs.rombo.co/tailwind \
  --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://tanstack.com/start/latest/docs/framework/react/overview \
  --max-pages 50 \
  --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://example.com/docs/api/v2/ \
  --url-pattern "*api/v2/*"
```
## How It Works

- Auto-detects the domain and URL pattern from the input URL
- Crawls using a BFS (breadth-first search) strategy
- Filters links to stay within the documentation section
- Converts each page to clean markdown
- Saves files in a directory structure mirroring the URL paths (see the sketch below)
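That flow can be sketched roughly as follows, assuming crawl4ai's `AsyncWebCrawler` API (`arun()` returning a result with `markdown` and `links`). This is a simplified illustration, not the code in `scrape_docs.py`; the `crawl_docs` helper and the example URLs are made up.

```python
import asyncio
from collections import deque
from fnmatch import fnmatch

from crawl4ai import AsyncWebCrawler


async def crawl_docs(start_url: str, url_pattern: str,
                     max_depth: int = 6, max_pages: int = 500) -> dict[str, str]:
    """BFS over in-scope documentation pages, returning {url: markdown}."""
    pages: dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link depth)

    async with AsyncWebCrawler() as crawler:
        while queue and len(pages) < max_pages:
            url, depth = queue.popleft()
            result = await crawler.arun(url=url)
            if not result.success:
                continue
            pages[url] = str(result.markdown)  # page converted to clean markdown

            if depth >= max_depth:
                continue
            # Queue internal links that stay inside the documentation section.
            for link in result.links.get("internal", []):
                href = link.get("href", "")
                if href and href not in seen and fnmatch(href, url_pattern):
                    seen.add(href)
                    queue.append((href, depth + 1))
    return pages


if __name__ == "__main__":
    docs = asyncio.run(crawl_docs("https://example.com/docs/", "*example.com/docs/*"))
    print(f"Fetched {len(docs)} pages")
```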
## Output Structure

```
docs/<name>/
  index.md              # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
```
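The path mirroring can be pictured with a small sketch; the `url_to_file` helper below is hypothetical, and the script's real naming rules may differ:

```python
from pathlib import Path
from urllib.parse import urlparse


def url_to_file(url: str, base_url: str, out_dir: Path) -> Path:
    """Mirror the URL path under out_dir; the root page becomes index.md."""
    base = urlparse(base_url).path.rstrip("/")
    rel = urlparse(url).path.removeprefix(base).strip("/")
    return out_dir / (f"{rel}.md" if rel else "index.md")


out = Path("docs/documentation_v3")
print(url_to_file("https://mediasoup.org/documentation/v3/",
                  "https://mediasoup.org/documentation/v3/", out))
# docs/documentation_v3/index.md
print(url_to_file("https://mediasoup.org/documentation/v3/api/overview",
                  "https://mediasoup.org/documentation/v3/", out))
# docs/documentation_v3/api/overview.md
```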
## Troubleshooting

| Issue | Solution |
|---|---|
| Playwright browser binaries are missing | Run `uv run --with crawl4ai playwright install` |
| Empty output | Check if the URL pattern matches actual doc URLs. Try `--url-pattern` |
| Missing pages | Increase `--max-depth` or `--max-pages` |
| Wrong pages scraped | Use a stricter `--url-pattern` |
## Tips

- Test first - Use `--max-pages 10` to verify the config before a full crawl
- Check output name - The script auto-detects it from URL path segments
- Rerun safe - Files are overwritten, duplicates skipped