# cli-web-scrape

Scrapling CLI: web scraping with browser impersonation, anti-bot bypass, and CSS extraction.
## Prerequisites

```bash
# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'

# Install fetcher browser engines (one-time)
scrapling install
```

Verify: `scrapling --help`
## Fetcher Selection

| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|---|---|---|---|---|---|---|
| HTTP | `extract get/post/put/delete` | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | `extract fetch` | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | `extract stealthy-fetch` | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to the HTTP tier; escalate only when the page requires JS rendering or blocks HTTP requests.
## Output Format

Determined by the output file extension:

| Extension | Output | Best For |
|---|---|---|
| `.html` | Raw HTML | Parsing, further processing |
| `.md` | HTML converted to Markdown | Reading, LLM context |
| `.txt` | Text content only | Clean text extraction |

Always use `/tmp/scrapling-*.{md,txt,html}` for output files. Read the file after extraction.
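To see how the extension alone selects the converter, fetch the same page three times; a quick sketch (example.com stands in for any static page):

```bash
# Same URL, three output formats; only the extension changes
for ext in html md txt; do
  scrapling extract get "https://example.com" "/tmp/scrapling-demo.$ext"
done
wc -c /tmp/scrapling-demo.*   # raw .html is typically largest, .txt leanest
```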
## Core Commands

### HTTP Tier: GET

```bash
scrapling extract get URL OUTPUT_FILE [OPTIONS]
```
| Flag | Purpose | Example |
|---|---|---|
| `-s, --css-selector` | Extract matching elements only | `-s ".article-body"` |
| `--impersonate` | Force specific browser | `--impersonate firefox` |
| `-H, --headers` | Custom headers (repeatable) | `-H "Authorization: Bearer tok"` |
| `--cookies` | Cookie string | `--cookies "session=abc123"` |
| `--proxy` | Proxy URL | `--proxy "http://user:pass@host:port"` |
| `-p, --params` | Query params (repeatable) | `-p "page=2" -p "limit=50"` |
| `--timeout` | Seconds (default: 30) | `--timeout 60` |
| `--no-verify` | Skip SSL verification | For self-signed certs |
| `--no-follow-redirects` | Don't follow redirects | For redirect inspection |
| `--no-stealthy-headers` | Disable stealth headers | For debugging |
Examples:

```bash
# Basic page fetch as markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md

# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

# Selector with a child combinator
scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"

# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

# Random browser impersonation from a list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
```
### HTTP Tier: POST

```bash
scrapling extract post URL OUTPUT_FILE [OPTIONS]
```

Additional options over GET:

| Flag | Purpose | Example |
|---|---|---|
| `-d, --data` | Form data | `-d "param1=value1&param2=value2"` |
| `-j, --json` | JSON body | `-j '{"key": "value"}'` |

```bash
# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'
```
PUT and DELETE share the same interface as POST and GET respectively.
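A brief sketch of what that implies, assuming `put` takes post's body flags and `delete` takes get's flags as stated above (the endpoint URLs are hypothetical):

```bash
# PUT with a JSON body (same -j flag as post)
scrapling extract put "https://api.example.com/items/42" /tmp/scrapling-out.txt -j '{"status": "done"}'

# DELETE with an auth header (same -H flag as get)
scrapling extract delete "https://api.example.com/items/42" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"
```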
### Dynamic Tier: fetch

For JS-rendered pages. Launches a headless Playwright browser.

```bash
scrapling extract fetch URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Default |
|---|---|---|
| `--headless/--no-headless` | Headless mode | True |
| `--disable-resources` | Drop images/CSS/fonts for speed | False |
| `--network-idle` | Wait for network idle | False |
| `--timeout` | Milliseconds | 30000 |
| `--wait` | Extra wait after load (ms) | 0 |
| `-s, --css-selector` | CSS selector extraction | — |
| `--wait-selector` | Wait for element before proceeding | — |
| `--real-chrome` | Use installed Chrome instead of bundled | False |
| `--proxy` | Proxy URL | — |
| `-H, --extra-headers` | Extra headers (repeatable) | — |
```bash
# Fetch a JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

# Wait for a specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000
```
### Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (patched Firefox).

```bash
scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]
```

Additional options over fetch:

| Flag | Purpose | Default |
|---|---|---|
| `--solve-cloudflare` | Solve Cloudflare challenges | False |
| `--block-webrtc` | Block WebRTC (prevents IP leak) | False |
| `--hide-canvas` | Add noise to canvas fingerprinting | False |
| `--block-webgl` | Block WebGL fingerprinting | False (allowed) |
```bash
# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"
```
## Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use `extract get` alone; always validate the content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare `extract get`.
### Step 1: HTTP Tier

```bash
scrapling extract get "URL" /tmp/scrapling-out.md
```

Read `/tmp/scrapling-out.md` and validate the content before proceeding.
### Step 2: Validate Content

Check the scraped output for thin-content indicators, i.e. signs that the site requires JS rendering:

| Indicator | Pattern | Example |
|---|---|---|
| JS-disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" (Polish for "JS disabled") | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Fewer than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton-UI text | Dashboard apps |

If ANY indicator is present, escalate to the Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
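A minimal grep-based sketch of this check; the indicator strings come from the table above, and the 20-line threshold is an illustrative assumption to tune per site:

```bash
#!/usr/bin/env bash
# check_content.sh - exit 0 if /tmp/scrapling-out.md looks rich, 1 if thin
out=/tmp/scrapling-out.md

# Indicator strings (case-insensitive) from the validation table
if grep -qiE 'enable JavaScript|JS wyłączony|Loading\.\.\.|Please wait' "$out"; then
  echo "thin: JS/loading indicator found" >&2
  exit 1
fi

# Assumed threshold: fewer than ~20 non-empty lines is suspicious
lines=$(grep -cve '^[[:space:]]*$' "$out")
if [ "$lines" -lt 20 ]; then
  echo "thin: only $lines non-empty lines" >&2
  exit 1
fi
echo "content looks rich ($lines non-empty lines)"
```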
### Step 3: Dynamic Tier (if content validation fails)

```bash
scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources
```

Read and validate again. If the content is now rich, done. If still blocked (403, Cloudflare challenge, empty output), escalate.
### Step 4: Stealthy Tier (if Dynamic tier fails)

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare
```

If still blocked, add maximum stealth flags:

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl
```
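The four steps chain naturally into one driver. A sketch, assuming the `check_content.sh` heuristic from Step 2 sits alongside it (the script name is hypothetical):

```bash
#!/usr/bin/env bash
# Escalate tier by tier, stopping at the first result that validates
url="$1"
out=/tmp/scrapling-out.md

scrapling extract get "$url" "$out" \
  && ./check_content.sh && exit 0

scrapling extract fetch "$url" "$out" --network-idle --disable-resources \
  && ./check_content.sh && exit 0

scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare \
  && ./check_content.sh && exit 0

echo "scrapling blocked: all tiers failed for $url" >&2
exit 1
```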
## Consumer Skill Integration

When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

1. `extract get` → Read → Validate content
2. Content thin? → `extract fetch --network-idle --disable-resources` → Read → Validate
3. Still blocked? → `extract stealthy-fetch --solve-cloudflare` → Read
4. All tiers fail? → Skip and label "scrapling blocked"
Known JS-rendered sites (always start at Dynamic tier):
- iSpot.pl — React SPA, HTTP tier returns only nav shell
- Single-page apps with client-side routing (hash or history API URLs)
## Interactive Shell

```bash
# Launch REPL
scrapling shell

# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'
```
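The shell exposes Scrapling's Python API directly. Assuming the selector engine supports parsel-style `::attr()` the same way it supports `::text` above, link extraction would look like:

```bash
# Collect every href on the page (::attr(href) is an assumed pseudo-element)
scrapling shell -c 'Fetcher().get("https://example.com").css("a::attr(href)")'
```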
## Troubleshooting

| Issue | Fix |
|---|---|
| `ModuleNotFoundError: click` | Reinstall: `uv tool install --force 'scrapling[all]'` |
| fetch/stealthy-fetch fails | Run `scrapling install` to install browser engines |
| Cloudflare still blocks | Add `--block-webrtc --hide-canvas` to stealthy-fetch |
| Timeout | Increase `--timeout` (seconds for HTTP, milliseconds for fetch/stealthy) |
| SSL error | Add `--no-verify` (HTTP tier only) |
| Empty output with selector | Try without `-s` first to verify the page loads, then refine the selector |
## Constraints

- Output file path is required; scrapling writes to the file, not stdout
- CSS selectors return ALL matches, concatenated
- HTTP-tier `--timeout` is in seconds; fetch/stealthy-fetch `--timeout` is in milliseconds
- `--impersonate` is only available on the HTTP tier (fetch/stealthy handle it internally)
- `--solve-cloudflare` is only available on the stealthy-fetch tier
- Stealth headers are enabled by default on the HTTP tier; disable with `--no-stealthy-headers` for debugging
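Because the timeout unit differs by tier, the same 60-second budget is spelled two ways:

```bash
# HTTP tier: --timeout is in seconds
scrapling extract get "https://example.com" /tmp/scrapling-out.md --timeout 60

# Dynamic/stealthy tiers: --timeout is in milliseconds
scrapling extract fetch "https://example.com" /tmp/scrapling-out.md --timeout 60000
```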