crawler
Crawler Skill
Converts any URL into clean markdown using a robust 3-tier fallback chain.
Quick start
uv run scripts/crawl.py --url https://example.com --output reports/example.md
Markdown is saved to the file specified by --output. Progress/errors go to stderr. Exit code 0 on
success, 1 if all scrapers fail.
How it works
The script tries each tier in order and returns the first success:
| Tier | Module | Requires |
|---|---|---|
| 1 | Firecrawl (firecrawl_scraper.py) |
FIRECRAWL_API_KEY env var (optional; falls back if missing) |
| 2 | Jina Reader (jina_reader.py) |
Nothing — free, no key needed |
| 3 | Scrapling (scrapling_scraper.py) |
Local headless browser (auto-installs via pip) |
File layout
crawler-skill/
├── SKILL.md ← this file
├── scripts/
│ ├── crawl.py ← main CLI entry point (PEP 723 inline deps)
│ └── src/
│ ├── domain_router.py ← URL-to-tier routing rules
│ ├── firecrawl_scraper.py ← Tier 1: Firecrawl API
│ ├── jina_reader.py ← Tier 2: Jina r.jina.ai proxy
│ └── scrapling_scraper.py ← Tier 3: local headless scraper
└── tests/
└── test_crawl.py ← 70 pytest tests (all passing)
Usage examples
# Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling
# Always prefer using --output to avoid terminal encoding issues
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md
# If no --output is provided, markdown goes to stdout (not recommended on Windows)
uv run scripts/crawl.py --url https://example.com
# With a Firecrawl API key for best results
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
URL requirements
Only http:// and https:// URLs are accepted. Passing any other scheme
(ftp://, file://, javascript:, a bare path, etc.) exits with code 1
and prints a clear error — no scraping is attempted.
Saving Reports
When the user asks to save the crawled content or a summary to a file, ALWAYS use the --output argument and save the file into the reports/ directory at the project root (for example, {project_root}/reports). If the directory does not exist, the script will create it.
Example:
If asked to "save to result.md", you should run:
uv run scripts/crawl.py --url <URL> --output reports/result.md
Point at a self-hosted Firecrawl instance
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
Content validation
Each scraper validates its output before returning success:
- Minimum 100 characters of content (rejects empty/error pages)
- Detection of CAPTCHA / bot-verification pages (Firecrawl)
- Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)
- Detection of Jina error page indicators (
Error:,Access Denied, etc.)
Domain routing
Certain hostnames bypass one or more scraper tiers to avoid known compatibility
issues. The logic lives in scripts/src/domain_router.py.
| Domain | Skipped tiers | Active chain |
|---|---|---|
medium.com (and subdomains) |
firecrawl | jina → scrapling |
mp.weixin.qq.com |
firecrawl + jina | scrapling only |
| everything else | — | firecrawl → jina → scrapling |
Sub-domain matching follows a suffix rule: blog.medium.com matches the
medium.com rule because its hostname ends with .medium.com. An exact
sub-domain like other.weixin.qq.com does not match mp.weixin.qq.com.
Running tests
uv run pytest tests/ -v
All 70 tests use mocking — no network calls, no API keys required.
Dependencies (auto-installed by uv run)
firecrawl-py>=2.0— Firecrawl Python SDKhttpx>=0.27— HTTP client for Jina Readerscrapling>=0.2— Headless scraping with stealth supporthtml2text>=2024.2.26— HTML-to-markdown conversion
When to invoke this skill
Invoke crawl.py whenever you need the text content of a web page:
result = subprocess.run(
["uv", "run", "scripts/crawl.py", "--url", url],
capture_output=True, text=True
)
if result.returncode == 0:
markdown = result.stdout
Or simply run it directly from the terminal as shown in Quick start above.