cli-web-scrape
Scrapling CLI
Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.
Prerequisites
# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'
# Install fetcher browser engines (one-time)
scrapling install
Verify: scrapling --help
Fetcher Selection
| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|---|---|---|---|---|---|---|
| HTTP | extract get/post/put/delete | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | extract fetch | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | extract stealthy-fetch | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |
Default to HTTP tier — only escalate when the page requires JS rendering or blocks HTTP requests.
Output Format
Determined by output file extension:
| Extension | Output | Best For |
|---|---|---|
| .html | Raw HTML | Parsing, further processing |
| .md | HTML converted to Markdown | Reading, LLM context |
| .txt | Text content only | Clean text extraction |
Always use /tmp/scrapling-*.{md,txt,html} for output files. Read the file after extraction.
Core Commands
HTTP Tier: GET
scrapling extract get URL OUTPUT_FILE [OPTIONS]
| Flag | Purpose | Example |
|---|---|---|
| -s, --css-selector | Extract matching elements only | -s ".article-body" |
| --impersonate | Force specific browser | --impersonate firefox |
| -H, --headers | Custom headers (repeatable) | -H "Authorization: Bearer tok" |
| --cookies | Cookie string | --cookies "session=abc123" |
| --proxy | Proxy URL | --proxy "http://user:pass@host:port" |
| -p, --params | Query params (repeatable) | -p "page=2" -p "limit=50" |
| --timeout | Seconds (default: 30) | --timeout 60 |
| --no-verify | Skip SSL verification | For self-signed certs |
| --no-follow-redirects | Don't follow redirects | For redirect inspection |
| --no-stealthy-headers | Disable stealth headers | For debugging |
Examples:
# Basic page fetch as markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md
# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"
# Scoped extraction with a child-combinator selector
scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"
# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"
# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox
# Random browser impersonation from list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"
# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
HTTP Tier: POST
scrapling extract post URL OUTPUT_FILE [OPTIONS]
Additional options over GET:
| Flag | Purpose | Example |
|---|---|---|
| -d, --data | Form data | -d "param1=value1&param2=value2" |
| -j, --json | JSON body | -j '{"key": "value"}' |
# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"
# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'
PUT and DELETE share the same interface as POST and GET respectively.
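Since the verbs share one interface, the PUT/DELETE shape can be shown with a tiny helper. A minimal sketch; scrapling_http_cmd is a hypothetical function that only prints the command it would run, and the URLs are placeholders:

```shell
# Hypothetical helper (not part of scrapling): print the HTTP-tier command
# for any verb, showing put/delete take the same shape as post/get.
scrapling_http_cmd() {
  verb="$1"; url="$2"; out="$3"; shift 3
  echo scrapling extract "$verb" "$url" "$out" "$@"
}

# PUT takes a body like POST; DELETE takes none, like GET
scrapling_http_cmd put "https://api.example.com/item/1" /tmp/scrapling-out.txt -j '{"name": "updated"}'
scrapling_http_cmd delete "https://api.example.com/item/1" /tmp/scrapling-out.txt
```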
Dynamic Tier: fetch
For JS-rendered pages. Launches headless Playwright browser.
scrapling extract fetch URL OUTPUT_FILE [OPTIONS]
| Flag | Purpose | Default |
|---|---|---|
| --headless/--no-headless | Headless mode | True |
| --disable-resources | Drop images/CSS/fonts for speed | False |
| --network-idle | Wait for network idle | False |
| --timeout | Milliseconds | 30000 |
| --wait | Extra wait after load (ms) | 0 |
| -s, --css-selector | CSS selector extraction | — |
| --wait-selector | Wait for element before proceeding | — |
| --real-chrome | Use installed Chrome instead of bundled | False |
| --proxy | Proxy URL | — |
| -H, --extra-headers | Extra headers (repeatable) | — |
# Fetch JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md
# Wait for specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"
# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle
# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000
Stealthy Tier: stealthy-fetch
Maximum anti-detection. Uses Camoufox (patched Firefox).
scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]
Additional options over fetch:
| Flag | Purpose | Default |
|---|---|---|
| --solve-cloudflare | Solve Cloudflare challenges | False |
| --block-webrtc | Block WebRTC (prevents IP leak) | False |
| --hide-canvas | Add noise to canvas fingerprinting | False |
| --block-webgl | Block WebGL fingerprinting | False (allowed) |
# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare
# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
--solve-cloudflare --block-webrtc --hide-canvas --block-webgl
# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
--solve-cloudflare -s ".content"
Auto-Escalation Protocol
ALL scrapling usage must follow this protocol. Never use extract get alone — always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare extract get.
Step 1: HTTP Tier
scrapling extract get "URL" /tmp/scrapling-out.md
Read /tmp/scrapling-out.md and validate content before proceeding.
Step 2: Validate Content
Check the scraped output for thin content indicators — signs that the site requires JS rendering:
| Indicator | Pattern | Example |
|---|---|---|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" (Polish for "JS disabled") | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Less than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton UI text | Dashboard apps |
If ANY indicator is present → escalate to Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
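The indicator checks above can be scripted. A minimal sketch, assuming output in /tmp/scrapling-out.md; looks_thin, its grep patterns, and the ~20-line threshold are illustrative assumptions, not part of scrapling:

```shell
# Illustrative thin-content check (not part of scrapling).
# Returns 0 (thin -> escalate) when any indicator matches.
looks_thin() {
  f="$1"
  # JS-disabled warnings and loading/skeleton walls
  grep -qiE 'enable javascript|javascript is (disabled|required)|loading\.\.\.|please wait' "$f" && return 0
  # Very short content: fewer than ~20 non-empty lines
  [ "$(grep -cv '^[[:space:]]*$' "$f")" -lt 20 ] && return 0
  return 1
}

if [ -f /tmp/scrapling-out.md ] && looks_thin /tmp/scrapling-out.md; then
  echo "thin content: escalate to the Dynamic tier"
fi
```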
Step 3: Dynamic Tier (if content validation fails)
scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources
Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.
Step 4: Stealthy Tier (if Dynamic tier fails)
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare
If still blocked, add maximum stealth flags:
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
--solve-cloudflare --block-webrtc --hide-canvas --block-webgl
Consumer Skill Integration
When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:
1. extract get → Read → Validate content
2. Content thin? → extract fetch --network-idle --disable-resources → Read → Validate
3. Still blocked? → extract stealthy-fetch --solve-cloudflare → Read
4. All tiers fail? → Skip and label "scrapling blocked"
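That loop can be wrapped in a single function. A minimal sketch, assuming the tier commands above; scrape_escalate and the overridable SCRAPLING variable are illustrative, and the content-validation command is supplied by the caller:

```shell
# Illustrative wrapper (not part of scrapling) for the escalation protocol.
# $3 is a command that validates the output file (defaults to always-pass).
scrape_escalate() {
  url="$1"; out="$2"; check="${3:-true}"
  for tier in \
    "extract get" \
    "extract fetch --network-idle --disable-resources" \
    "extract stealthy-fetch --solve-cloudflare"
  do
    # word-splitting of $tier into subcommand + flags is intentional
    if ${SCRAPLING:-scrapling} $tier "$url" "$out" && $check "$out"; then
      echo "ok: $tier"
      return 0
    fi
  done
  echo "scrapling blocked: $url" >&2
  return 1
}
```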
Known JS-rendered sites (always start at Dynamic tier):
- iSpot.pl — React SPA, HTTP tier returns only nav shell
- Single-page apps with client-side routing (hash or history API URLs)
Interactive Shell
# Launch REPL
scrapling shell
# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'
Troubleshooting
| Issue | Fix |
|---|---|
| ModuleNotFoundError: click | Reinstall: uv tool install --force 'scrapling[all]' |
| fetch/stealthy-fetch fails | Run scrapling install to install browser engines |
| Cloudflare still blocks | Add --block-webrtc --hide-canvas to stealthy-fetch |
| Timeout | Increase --timeout (seconds for HTTP, milliseconds for fetch/stealthy) |
| SSL error | Add --no-verify (HTTP tier only) |
| Empty output with selector | Try without -s first to verify page loads, then refine selector |
Constraints
- Output file path is required — scrapling writes to file, not stdout
- CSS selectors return ALL matches concatenated
- HTTP tier timeout is in seconds, fetch/stealthy-fetch timeout is in milliseconds
- --impersonate only available on HTTP tier (fetch/stealthy handle it internally)
- --solve-cloudflare only on stealthy-fetch tier
- Stealth headers enabled by default on HTTP tier; disable with --no-stealthy-headers for debugging
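To avoid mixing up the timeout units across tiers, a helper can derive the right flag from a single seconds value. A minimal sketch; timeout_flag is an illustrative assumption, not a scrapling feature:

```shell
# Illustrative helper (not part of scrapling): emit the correct --timeout
# value from seconds, since HTTP-tier commands take seconds while
# fetch/stealthy-fetch take milliseconds.
timeout_flag() {
  case "$1" in
    get|post|put|delete)  echo "--timeout $2" ;;
    fetch|stealthy-fetch) echo "--timeout $(($2 * 1000))" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

timeout_flag get 60   # --timeout 60
timeout_flag fetch 60 # --timeout 60000
```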