# jb-docs-scraper

Documentation Scraper: scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.
## Quick Start

```bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>

# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
```
Output goes to `./docs/<auto-detected-name>/` by default.
## Prerequisites (First Time Only)

Install the Playwright browser binaries that crawl4ai needs:

```bash
uv run --with crawl4ai playwright install
```
## Usage

```bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
```
### Options

| Option | Description | Default |
|---|---|---|
| `-o, --output PATH` | Output directory | `./docs/<auto-detected-name>` |
| `--max-depth N` | Maximum link depth | 6 |
| `--max-pages N` | Maximum pages to scrape | 500 |
| `--url-pattern PATTERN` | URL filter (glob) | Auto-detected |
| `-q, --quiet` | Suppress verbose output | False |
### Examples

```bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://docs.rombo.co/tailwind \
  --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://tanstack.com/start/latest/docs/framework/react/overview \
  --max-pages 50 \
  --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://example.com/docs/api/v2/ \
  --url-pattern "*api/v2/*"
```
## How It Works

- Auto-detects the domain and URL pattern from the input URL
- Crawls using a BFS (breadth-first search) strategy
- Filters links to stay within the documentation section
- Converts each page to clean markdown
- Saves files in a directory structure mirroring the URL paths (see the sketch below)
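That flow can be sketched roughly as follows, assuming crawl4ai's `AsyncWebCrawler` API (`arun()` returning a result with `markdown` and `links`). This is a simplified illustration, not the code in `scrape_docs.py`; the `crawl_docs` helper and the example URLs are made up.

```python
import asyncio
from collections import deque
from fnmatch import fnmatch

from crawl4ai import AsyncWebCrawler


async def crawl_docs(start_url: str, url_pattern: str,
                     max_depth: int = 6, max_pages: int = 500) -> dict[str, str]:
    """BFS over in-scope documentation pages, returning {url: markdown}."""
    pages: dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link depth)

    async with AsyncWebCrawler() as crawler:
        while queue and len(pages) < max_pages:
            url, depth = queue.popleft()
            result = await crawler.arun(url=url)
            if not result.success:
                continue
            pages[url] = str(result.markdown)  # page converted to clean markdown

            if depth >= max_depth:
                continue
            # Queue internal links that stay inside the documentation section.
            for link in result.links.get("internal", []):
                href = link.get("href", "")
                if href and href not in seen and fnmatch(href, url_pattern):
                    seen.add(href)
                    queue.append((href, depth + 1))
    return pages


if __name__ == "__main__":
    docs = asyncio.run(crawl_docs("https://example.com/docs/", "*example.com/docs/*"))
    print(f"Fetched {len(docs)} pages")
```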
## Output Structure

```
docs/<name>/
  index.md              # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
```
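The path mirroring can be pictured with a small sketch; the `url_to_file` helper below is hypothetical, and the script's real naming rules may differ:

```python
from pathlib import Path
from urllib.parse import urlparse


def url_to_file(url: str, base_url: str, out_dir: Path) -> Path:
    """Mirror the URL path under out_dir; the root page becomes index.md."""
    base = urlparse(base_url).path.rstrip("/")
    rel = urlparse(url).path.removeprefix(base).strip("/")
    return out_dir / (f"{rel}.md" if rel else "index.md")


out = Path("docs/documentation_v3")
print(url_to_file("https://mediasoup.org/documentation/v3/",
                  "https://mediasoup.org/documentation/v3/", out))
# docs/documentation_v3/index.md
print(url_to_file("https://mediasoup.org/documentation/v3/api/overview",
                  "https://mediasoup.org/documentation/v3/", out))
# docs/documentation_v3/api/overview.md
```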
## Troubleshooting

| Issue | Solution |
|---|---|
| Playwright browser binaries are missing | Run `uv run --with crawl4ai playwright install` |
| Empty output | Check if the URL pattern matches actual doc URLs. Try `--url-pattern` |
| Missing pages | Increase `--max-depth` or `--max-pages` |
| Wrong pages scraped | Use a stricter `--url-pattern` |
## Tips

- Test first - Use `--max-pages 10` to verify the config before a full crawl
- Check output name - The script auto-detects it from URL path segments
- Rerun safe - Files are overwritten, duplicates skipped