scrapling - Adaptive Web Scraping Framework

Keywords: scrapling, adaptive scraping, stealthy fetch, scrapling spider

Respect each target site's terms of service, robots.txt, rate limits, and authorization boundaries.
Scrapling is a Python scraping framework for parser-first HTML extraction, browser-backed fetching, stealth anti-bot handling, CLI prototyping, and optional larger crawl workflows. Its distinctive feature is adaptive scraping: you can save element fingerprints and later relocate equivalent elements after a site redesign.
When to use this skill
- Install Scrapling with the right extras for parser-only, fetchers, shell, AI, or full usage
- Parse known HTML with `Selector` before escalating to browser-backed fetchers
- Choose between `Fetcher`, `DynamicFetcher`, and `StealthyFetcher`
- Reuse `FetcherSession`, `DynamicSession`, or `StealthySession` for multiple requests
- Parse HTML with CSS, XPath, `::text`, `::attr(...)`, text matching, regex, and similar-element lookup
- Enable adaptive scraping with `adaptive=True`, `auto_save=True`, `retrieve()`, and `relocate()`
- Use the `scrapling` CLI for terminal-first extraction or shell work
- Understand MCP and spiders as second-tier workflows once core scraping is working
- Decide when Docker-only CLI usage is enough versus when Python code is required
Instructions
Step 1: Install and verify the environment
Use a virtual environment unless the user explicitly wants a system install.
```bash
bash scripts/install.sh
```
Supported install profiles:
- `parser`: `pip install scrapling`
- `fetchers`: `pip install "scrapling[fetchers]"`
- `shell`: `pip install "scrapling[shell]"`
- `ai`: `pip install "scrapling[ai]"`
- `all`: `pip install "scrapling[all]"`
Examples:
```bash
bash scripts/install.sh --profile parser
bash scripts/install.sh --profile fetchers
bash scripts/install.sh --profile all --force
```
Browser-backed flows require running `scrapling install` first; parser-only workflows do not.
If the user only wants terminal extraction and prefers containers, Docker images are available. That path is CLI-oriented and does not replace Python coding workflows.
Step 2: Start parser-first, then choose the right fetcher
If the user already has HTML or only needs DOM parsing, start with `Selector`:
```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```
Important parser notes:
- Scrapling currently targets HTML, not XML feeds
- `Selector` is the current user-facing parser API
- Legacy `Adaptor` compatibility exists in code, but teach `Selector`
If the user needs live fetching, pick the narrowest fetcher that solves the job:
- `Fetcher`: static HTML or plain HTTP targets
- `DynamicFetcher`: JavaScript-rendered pages
- `StealthyFetcher`: harder, protected targets, including documented Cloudflare handling
Escalation rule:
- Start with `Selector` if you already have HTML
- Otherwise start with `Fetcher`
- If content is rendered client-side, switch to `DynamicFetcher`
- If protection blocks or empties the result, switch to `StealthyFetcher` (see the sketch below)
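The ladder translates into a few lines of Python. This is a minimal sketch, where the `probe_css` check for "did we get useful content" is an illustrative assumption, not a Scrapling feature; adapt it to whatever signals success on the target page:

```python
# Minimal escalation sketch. probe_css is an illustrative assumption: a
# selector that should match when the fetch actually returned useful content.
from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

def fetch_with_escalation(url: str, probe_css: str):
    page = Fetcher.get(url)              # 1. plain HTTP first
    if page.css(probe_css):
        return page

    page = DynamicFetcher.fetch(url)     # 2. render client-side JavaScript
    if page.css(probe_css):
        return page

    return StealthyFetcher.fetch(url)    # 3. last resort: stealth fetching

page = fetch_with_escalation("https://example.com", ".product")
```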
Do not present anti-bot bypass as guaranteed. Phrase it as a documented capability whose success depends on the target and environment.
Step 3: Parse the response and reuse sessions
All fetchers return a `Response` object that extends Scrapling's `Selector` engine.
Core parsing options:
- CSS selectors: `page.css(".product")`
- XPath selectors: `page.xpath("//article")`
- Text and attributes: `::text`, `::attr(href)`
- Text search: `find_by_text(...)`
- Regex search: `find_by_regex(...)`
- Similar elements: `element.find_similar(...)` (see the sketch below)
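A short sketch of the non-CSS lookups, reusing the `html_doc` variable from the `Selector` example above; the keyword arguments shown (such as `first_match`) are assumptions to verify against the installed Scrapling version:

```python
# Illustrative lookups; keyword arguments such as first_match are assumptions
# to double-check against the Scrapling version in use.
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")

match = page.find_by_text("Add to cart", first_match=True)  # text search
prices = page.find_by_regex(r"\$\d+\.\d{2}")                # regex search

product = page.css_first(".product")  # first match instead of a list
if product is not None:
    similar = product.find_similar()  # structurally similar elements
```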
Use session classes for repeated requests, cookies, or state reuse:
```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```
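For browser-backed targets, `DynamicSession` and `StealthySession` play the same role. A hedged sketch, assuming they are context managers whose fetch method mirrors `DynamicFetcher.fetch`; verify the exact method names in the fetcher docs:

```python
# Hedged sketch: assumes DynamicSession is a context manager exposing a
# fetch() method like DynamicFetcher.fetch; verify against the docs.
from scrapling.fetchers import DynamicSession

with DynamicSession() as session:
    page1 = session.fetch("https://example.com")          # one browser...
    page2 = session.fetch("https://example.com/account")  # ...reused here
```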
Use adaptive scraping when the target DOM is brittle:
```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True)
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```
Important adaptive notes:
- `adaptive=True` is off by default
- `auto_save=True` stores element fingerprints keyed by selector or identifier (see the two-run sketch below)
- `adaptive_domain` helps when the same site moved domains or archived copies are involved
- Manual flows are available with `save()`, `retrieve()`, and `relocate()`
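The automatic flow spans two runs of the same script. A minimal sketch built only from the calls shown above; where fingerprints are stored and how matching behaves are version-dependent details:

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True)

# Run 1 (before the redesign): the selector still works, so save fingerprints.
page = Fetcher.get("https://example.com")
products = page.css(".product", auto_save=True)

# Run 2 (after the redesign): if ".product" no longer matches, adaptive=True
# lets Scrapling relocate equivalent elements from the saved fingerprints.
page = Fetcher.get("https://example.com")
products = page.css(".product", adaptive=True)
```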
Deeper parser and adaptive-scraping details live in references/parser-and-adaptive.md.
Step 4: Use the CLI for quick extraction or shell work
CLI overview:
- `scrapling install`
- `scrapling shell`
- `scrapling extract get|post|put|delete|fetch|stealthy-fetch` (direct examples below)
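The wrapper scripts below are thin layers over these commands. If you call the CLI directly, the flags here are assumptions mirroring the wrapper examples in this skill; confirm them with `scrapling extract --help`:

```bash
# Direct CLI usage; the exact flags are assumptions mirroring this skill's
# wrapper scripts -- confirm with `scrapling extract --help`.
scrapling extract get "https://example.com" article.md --css-selector "article"
scrapling extract fetch "https://app.example.com" content.md --network-idle
scrapling extract stealthy-fetch "https://protected.example.com" content.md --solve-cloudflare
```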
Wrapper scripts in this skill:
```bash
bash scripts/run-extract.sh get "https://example.com" article.md
bash scripts/run-extract.sh fetch "https://app.example.com" content.md --network-idle
bash scripts/run-extract.sh stealth "https://protected.example.com" content.md --solve-cloudflare
```
Use the CLI when:
- The user needs quick output files in `.md`, `.html`, or `.txt`
- CSS selectors are enough to trim output
- A shell should be started without writing Python first
CLI and optional MCP details live in references/cli-and-mcp.md.
Step 5: Treat MCP and spiders as second-tier workflows
Use MCP when the user explicitly wants Scrapling exposed to an agent client:
```bash
bash scripts/run-mcp.sh
bash scripts/run-mcp.sh --http --host 127.0.0.1 --port 8000
```
Use spiders when the task is no longer a few page fetches and becomes a crawl with link following, concurrency, or checkpoint resume.
These are important capabilities, but they should not replace the core parser-plus-fetcher workflow in normal end-user guidance.
Examples
Example 1: Install Scrapling with all extras
```bash
bash scripts/install.sh
```
Example 2: Parser-only install
```bash
bash scripts/install.sh --profile parser
```
Example 3: Parse local HTML with Selector
```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```
Example 4: Fast static scrape from the terminal
```bash
bash scripts/run-extract.sh get "https://example.com" content.md --css-selector "article"
```
Example 5: Python Fetcher
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com", impersonate="chrome")
title = page.css("title::text").get()
```
Example 6: Python DynamicFetcher
```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://example.com",
    network_idle=True,
    wait_selector=".content",
)
```
Example 7: Python StealthyFetcher
```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    headless=True,
    solve_cloudflare=True,
)
```
Example 8: Async HTTP
```python
from scrapling.fetchers import AsyncFetcher

# await requires an async context (e.g., inside an async def or asyncio.run)
page = await AsyncFetcher.get("https://example.com")
```
Example 9: Session reuse
```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```
Example 10: Adaptive selector recovery
```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True, adaptive_domain="example.com")
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```
Example 11: Minimal spider reference
```python
from scrapling.spiders import Spider, Response


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css(".quote"):
            yield {"text": quote.css(".text::text").get()}


result = QuotesSpider().start()
print(result.items.to_json())
```
Example 12: Start the MCP server over stdio
```bash
bash scripts/run-mcp.sh
```
Best practices
- Start with `Selector` or the lightest fetcher that works and escalate only when the site actually needs rendering or stealth.
- Reuse session classes for repeated requests so browser startup and connection overhead stay low.
- Prefer `.md` or `.txt` CLI output and CSS selectors over dumping full HTML into the model context.
- Enable adaptive scraping only where selector brittleness is a real maintenance problem.
- Use `page_action`, `wait_selector`, and `network_idle` deliberately instead of adding blind sleeps (see the sketch after this list).
- Treat Cloudflare solving, proxies, and browser impersonation as opt-in tools for authorized, policy-compliant work, not guaranteed bypasses.
- Remember that XML feeds are not the current target surface; Scrapling is documented around HTML parsing.
- Move from CLI to Python or spiders when retry logic, structured output, or crawl control becomes important.
- For MCP usage, make the client/server transport explicit: stdio for local agent integration, `--http` for streamable HTTP deployments.
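As an illustration of the wait-control advice above, a minimal sketch assuming `page_action` receives the underlying browser page object and returns it:

```python
# Sketch of deliberate waiting: scroll to trigger lazy loading, then rely on
# wait_selector and network_idle instead of fixed sleeps. The page_action
# contract (receive the browser page, return it) is an assumption to verify.
from scrapling.fetchers import DynamicFetcher

def scroll_down(page):
    page.mouse.wheel(0, 10_000)  # nudge lazy-loaded content into the DOM
    return page

page = DynamicFetcher.fetch(
    "https://example.com",
    page_action=scroll_down,
    wait_selector=".content",  # wait for a concrete element...
    network_idle=True,         # ...and for the network to settle
)
```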