cli-web-scrape

Scrapling CLI

Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.

Prerequisites

# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'

# Install fetcher browser engines (one-time)
scrapling install

Verify: scrapling --help
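
In scripts, it can help to fail early when the CLI is missing. A minimal sketch, assuming a POSIX-style shell; the function name `require_scrapling` is hypothetical, and it only checks that the command resolves, not that the browser engines are installed.

```shell
# Hypothetical helper: print the install hint from above and return 1
# when the scrapling command cannot be found.
require_scrapling() {
  if ! command -v scrapling >/dev/null 2>&1; then
    echo "scrapling not found; run: uv tool install 'scrapling[all]' && scrapling install" >&2
    return 1
  fi
}
```

Call `require_scrapling || exit 1` at the top of a script before any extract command.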

Fetcher Selection

Tier	Command	Engine	Speed	Stealth	JS	Use When
HTTP	`extract get/post/put/delete`	curl_cffi + TLS impersonation	Fast	Medium	No	Static pages, APIs, most sites
Dynamic	`extract fetch`	Playwright (headless browser)	Medium	Low	Yes	JS-rendered SPAs, wait-for-element
Stealthy	`extract stealthy-fetch`	Camoufox (patched Firefox)	Slow	High	Yes	Cloudflare, aggressive anti-bot

Default to HTTP tier — only escalate when the page requires JS rendering or blocks HTTP requests.

Output Format

Determined by output file extension:

Extension	Output	Best For
`.html`	Raw HTML	Parsing, further processing
`.md`	HTML converted to Markdown	Reading, LLM context
`.txt`	Text content only	Clean text extraction

Always use /tmp/scrapling-*.{md,txt,html} for output files. Read the file after extraction.
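
Because the extension selects the format, a small path builder keeps output files consistent. This is a sketch only: `scrapling_out` is a hypothetical name, and it does nothing beyond formatting a path under /tmp/scrapling-*.

```shell
# Hypothetical helper: build an output path whose extension selects the
# output format (md, txt, or html). Defaults to md.
scrapling_out() {
  local name=$1 fmt=${2:-md}
  case "$fmt" in
    md|txt|html) printf '/tmp/scrapling-%s.%s\n' "$name" "$fmt" ;;
    *) echo "unsupported format: $fmt (use md, txt, or html)" >&2; return 1 ;;
  esac
}

# Usage (not run here):
#   scrapling extract get "https://example.com" "$(scrapling_out example md)"
```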

Core Commands

HTTP Tier: GET

scrapling extract get URL OUTPUT_FILE [OPTIONS]

Flag	Purpose	Example
`-s, --css-selector`	Extract matching elements only	`-s ".article-body"`
`--impersonate`	Force specific browser	`--impersonate firefox`
`-H, --headers`	Custom headers (repeatable)	`-H "Authorization: Bearer tok"`
`--cookies`	Cookie string	`--cookies "session=abc123"`
`--proxy`	Proxy URL	`--proxy "http://user:pass@host:port"`
`-p, --params`	Query params (repeatable)	`-p "page=2" -p "limit=50"`
`--timeout`	Seconds (default: 30)	`--timeout 60`
`--no-verify`	Skip SSL verification	For self-signed certs
`--no-follow-redirects`	Don't follow redirects	For redirect inspection
`--no-stealthy-headers`	Disable stealth headers	For debugging

Examples:

# Basic page fetch as markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md

# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

# Child-combinator selector (every matching link is extracted)
scrapling extract get "https://news.ycombinator.com" /tmp/scrapling-out.txt -s ".titleline > a"

# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

# Random browser impersonation from list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
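
The flags above combine naturally in a loop. The sketch below fetches a paginated listing with the repeatable `-p` flag, one output file per page. The function name `fetch_pages` and the `page` query parameter are assumptions; the real parameter name depends on the target site.

```shell
# Hypothetical helper: fetch pages 1..N of a listing with the HTTP tier.
# Assumes the site paginates via a "page" query parameter.
fetch_pages() {
  local url=$1 last=$2 page=1
  while [ "$page" -le "$last" ]; do
    scrapling extract get "$url" "/tmp/scrapling-page-$page.md" \
      -p "page=$page" --timeout 60 || return 1
    page=$((page + 1))
  done
}

# Usage (not run here):
#   fetch_pages "https://example.com/list" 3
```

Stopping at the first failed page keeps a partial run from being mistaken for a complete one.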

HTTP Tier: POST

scrapling extract post URL OUTPUT_FILE [OPTIONS]

Additional options over GET:

Flag	Purpose	Example
`-d, --data`	Form data	`-d "param1=value1&param2=value2"`
`-j, --json`	JSON body	`-j '{"key": "value"}'`

# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'

PUT takes the same options as POST (including -d and -j), and DELETE takes the same options as GET.

Dynamic Tier: fetch

For JS-rendered pages. Launches a headless Playwright browser.

scrapling extract fetch URL OUTPUT_FILE [OPTIONS]

Flag	Purpose	Default
`--headless/--no-headless`	Headless mode	True
`--disable-resources`	Drop images/CSS/fonts for speed	False
`--network-idle`	Wait for network idle	False
`--timeout`	Milliseconds	30000
`--wait`	Extra wait after load (ms)	0
`-s, --css-selector`	CSS selector extraction	—
`--wait-selector`	Wait for element before proceeding	—
`--real-chrome`	Use installed Chrome instead of bundled	False
`--proxy`	Proxy URL	—
`-H, --extra-headers`	Extra headers (repeatable)	—

# Fetch JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

# Wait for specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000

Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (patched Firefox).

scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]

Additional options over fetch:

Flag	Purpose	Default
`--solve-cloudflare`	Solve Cloudflare challenges	False
`--block-webrtc`	Block WebRTC (prevents IP leak)	False
`--hide-canvas`	Add noise to canvas fingerprinting	False
`--block-webgl`	Block WebGL fingerprinting	False (WebGL allowed)

# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"

Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use extract get alone — always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare extract get.

Step 1: HTTP Tier

scrapling extract get "URL" /tmp/scrapling-out.md

Read /tmp/scrapling-out.md and validate content before proceeding.

Step 2: Validate Content

Check the scraped output for thin content indicators — signs that the site requires JS rendering:

Indicator	Pattern	Example
JS disabled warning	"JavaScript", "enable JavaScript", "JS wyłączony" (Polish for "JS disabled")	iSpot.pl, many SPAs
No product/price data	Output has navigation and footer but no prices, specs, or product names	E-commerce SPAs
Mostly nav links	80%+ of content is menu items, category links, cookie banners	React/Angular/Vue apps
Very short content	Fewer than ~20 meaningful lines after stripping nav/footer	Hydration-dependent pages
Login/loading wall	"Loading...", "Please wait", skeleton UI text	Dashboard apps

If ANY indicator is present → escalate to Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
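
Some of the indicators above can be checked mechanically. The sketch below covers only three of them (JS warnings, loading walls, very short output); the nav-heavy and missing-price cases still need a read of the file. The name `is_thin_content` is hypothetical, the bare "JavaScript" pattern is narrowed to warning phrases to limit false positives, and "meaningful line" is approximated as any line containing a letter or digit. All three are assumptions, not part of scrapling.

```shell
# Hypothetical helper: return 0 (thin) when the scraped file shows a
# mechanical thin-content indicator, 1 when none is found.
is_thin_content() {
  local file=$1 min_lines=${2:-20} lines

  # Missing or empty output counts as thin.
  [ -s "$file" ] || return 0

  # JS-disabled warnings and loading walls (case-insensitive).
  if grep -qiE 'enable javascript|javascript is (disabled|required)|JS wyłączony|loading\.\.\.|please wait' "$file"; then
    return 0
  fi

  # Very short content: count lines with at least one letter or digit.
  lines=$(grep -c '[[:alnum:]]' "$file" || true)
  [ "$lines" -lt "$min_lines" ]
}

# Usage (not run here):
#   is_thin_content /tmp/scrapling-out.md && echo "escalate to Dynamic tier"
```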

Step 3: Dynamic Tier (if content validation fails)

scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources

Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.

Step 4: Stealthy Tier (if Dynamic tier fails)

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare

If still blocked, add maximum stealth flags:

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

Consumer Skill Integration

When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

extract get → Read → Validate content
Content thin? → extract fetch --network-idle --disable-resources → Read → Validate
Still blocked? → extract stealthy-fetch --solve-cloudflare → Read
All tiers fail? → Skip and label "scrapling blocked"
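
The pattern above can be sketched as one shell function. Both function names are hypothetical, and `content_ok` is a deliberately crude stand-in for the Step 2 validation (non-empty, no JS or loading warning, at least 20 non-blank lines); treat the thresholds as assumptions.

```shell
# Hypothetical validator: crude stand-in for the Step 2 content checks.
content_ok() {
  [ -s "$1" ] || return 1
  if grep -qiE 'enable javascript|loading\.\.\.|please wait' "$1"; then
    return 1
  fi
  [ "$(grep -c '[[:alnum:]]' "$1" || true)" -ge 20 ]
}

# Try each tier in order; print the tier whose output validated.
scrape_with_escalation() {
  local url=$1 out=$2 tier
  for tier in get fetch stealthy-fetch; do
    rm -f "$out"   # never validate a stale file from the previous tier
    case "$tier" in
      get)            scrapling extract get "$url" "$out" ;;
      fetch)          scrapling extract fetch "$url" "$out" --network-idle --disable-resources ;;
      stealthy-fetch) scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare ;;
    esac || true     # a non-zero exit is caught by the validation below
    if content_ok "$out"; then
      echo "$tier"
      return 0
    fi
  done
  echo "scrapling blocked: $url" >&2
  return 1
}

# Usage (not run here):
#   scrape_with_escalation "https://example.com" /tmp/scrapling-out.md
```

The maximum-stealth retry from Step 4 is left out for brevity; it could be added as a fourth entry in the loop.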

Known JS-rendered sites (always start at Dynamic tier):

iSpot.pl — React SPA, HTTP tier returns only nav shell
Single-page apps with client-side routing (hash or history API URLs)
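
A starting tier can be picked from the URL before the first request. This is a sketch under stated assumptions: `start_tier` is a hypothetical name, the host list holds only the one site named above, and only hash-based routing is detectable from the URL (history API routing is not).

```shell
# Hypothetical helper: print the extract subcommand to start with.
# Extend the first pattern list with other sites known to need JS.
start_tier() {
  case "$1" in
    *://ispot.pl|*://ispot.pl/*|*://*.ispot.pl|*://*.ispot.pl/*) echo fetch ;;
    *'#/'*|*'#!'*) echo fetch ;;   # hash-based client-side routing
    *) echo get ;;
  esac
}

# Usage (not run here):
#   scrapling extract "$(start_tier "$url")" "$url" /tmp/scrapling-out.md
```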

Interactive Shell

# Launch REPL
scrapling shell

# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'

Troubleshooting

Issue	Fix
`ModuleNotFoundError: click`	Reinstall: `uv tool install --force 'scrapling[all]'`
fetch/stealthy-fetch fails	Run `scrapling install` to install browser engines
Cloudflare still blocks	Add `--block-webrtc --hide-canvas --block-webgl` to stealthy-fetch alongside `--solve-cloudflare`
Timeout	Increase `--timeout` (seconds for HTTP, milliseconds for fetch/stealthy)
SSL error	Add `--no-verify` (HTTP tier only)
Empty output with selector	Try without `-s` first to verify page loads, then refine selector

Constraints

Output file path is required — scrapling writes to file, not stdout
CSS selectors (-s) return ALL matching elements, concatenated in the output file
HTTP tier timeout is in seconds, fetch/stealthy-fetch timeout is in milliseconds
--impersonate only available on HTTP tier (fetch/stealthy handle it internally)
--solve-cloudflare only on stealthy-fetch tier
Stealth headers enabled by default on HTTP tier — disable with --no-stealthy-headers for debugging
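
The seconds-versus-milliseconds split is easy to get wrong when switching tiers. A minimal sketch that takes seconds and emits the flag in the unit each subcommand expects; `timeout_flag` is a hypothetical name.

```shell
# Hypothetical helper: print a --timeout flag for the given extract
# subcommand, converting seconds to milliseconds for browser tiers.
timeout_flag() {
  local tier=$1 seconds=$2
  case "$tier" in
    get|post|put|delete)  echo "--timeout $seconds" ;;
    fetch|stealthy-fetch) echo "--timeout $((seconds * 1000))" ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
}

# Usage (not run here; left unquoted so it splits into flag and value):
#   scrapling extract fetch "$url" /tmp/scrapling-out.md $(timeout_flag fetch 60)
```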

Repository: molechowski/cla…e-skills
First Seen: Mar 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Fail)