Scrapfly Scraper
Use the Scrapfly Scraper API to collect web page data with proxy rotation, anti-bot bypass, JavaScript rendering, and JavaScript scenarios for browser control.
When to use
- Scraping web pages (HTML, JSON, text, markdown)
- Bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, Kasada, and more)
- Collecting data through rotating proxies with geo-targeting
- Rendering JavaScript-heavy pages with headless browsers
- Controlling the browser with JavaScript scenarios for common actions (waiting for selectors, clicking, filling elements, etc.)
- Reusing sessions for stateful, session-persisted scraping
- Capturing browser XHR call data
Setup
pip install scrapfly-sdk
The API key must be provided via environment variable SCRAPFLY_API_KEY or passed directly to the client.
API Reference
Endpoint: https://api.scrapfly.io/scrape
The HTTP method is forwarded to the upstream URL.
To retrieve data from a web page or an API, use the GET method, which is the default. If the target resource requires another method, such as POST, set it via the method parameter of ScrapeConfig.
ScrapflyClient
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse
import os
client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])
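The key can also be passed directly instead of being read from the environment (placeholder value shown):
client = ScrapflyClient(key="YOUR-API-KEY")  # placeholder key; prefer the environment variable in real code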
ScrapeConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | Target URL to scrape |
| method | str | "GET" | HTTP method: GET, POST, PUT, PATCH, HEAD, OPTIONS |
| headers | dict | None | Custom HTTP headers |
| cookies | dict | None | Custom cookies (merged into headers["cookie"]) |
| body | str | None | Raw request body for POST/PUT/PATCH; used when you pre-encode the payload |
| data | dict | None | Form/JSON data for POST/PUT/PATCH; encoded according to Content-Type when body is not provided |
| timeout | int | None | Request timeout in ms (default ~150,000 ms) |
| retry | bool | True | Auto-retry on network failures |
| country | str | None | Proxy country (ISO 3166-1 alpha-2, e.g. "us", "de") |
| proxy_pool | str | "public_datacenter_pool" | Proxy pool: "public_datacenter_pool" or "public_residential_pool" |
| session | str | None | Session ID to persist cookies/fingerprint across requests |
| session_sticky_proxy | bool | False | Keep the same proxy IP for a given session |
| asp | bool | False | Enable Anti Scraping Protection bypass |
| render_js | bool | False | Enable headless browser JavaScript rendering (+5 credits) |
| rendering_wait | int | None | Wait time in ms after page load (requires render_js=True) |
| rendering_stage | str | "complete" | Browser readiness stage: "complete" or "domcontentloaded" (requires render_js=True) |
| wait_for_selector | str | None | CSS/XPath selector to wait for (requires render_js=True) |
| js | str | None | JavaScript code to execute in the browser (auto-encoded) |
| js_scenario | list | None | List of browser actions (click, fill, scroll, wait, etc.) |
| auto_scroll | bool | None | Automatically scroll page during rendering for lazy-loaded content (requires render_js=True) |
| screenshots | dict | None | Capture screenshots: {"fullpage": "png"} or {"selector": ".element"} (requires render_js=True) |
| screenshot_flags | list[str] | None | Screenshot options: "load_images", "dark_mode", "block_banners", "high_quality", "print_media_format" |
| format | str | "raw" | Output format: "raw", "clean_html", "json", "markdown", "text" |
| format_options | list[str] | None | Format modifiers (markdown only): "no_images", "no_links", "only_content" |
| extract | dict | None | Raw extraction spec to apply on the response (encoded and sent as extract) |
| extraction_template | str | None | Name of a saved server-side extraction template |
| extraction_ephemeral_template | dict | None | Inline JSON extraction template used once (ephemeral:) |
| extraction_prompt | str | None | Natural language instructions for the Extraction API |
| extraction_model | str | None | LLM model name to use for the Extraction API |
| cache | bool | False | Enable response caching |
| cache_ttl | int | None | Cache time-to-live in seconds |
| cache_clear | bool | False | Clear any existing cached response when cache=True |
| dns | bool | False | Collect DNS records and timings in result.dns (slower) |
| ssl | bool | False | Collect SSL certificate details in result.ssl (slower) |
| debug | bool | False | Enable debug recording and extra metadata |
| raise_on_upstream_error | bool | True | Raise exceptions for upstream 4xx/5xx HTTP responses |
| correlation_id | str | None | Custom ID for request tracking across systems |
| tags | list[str] | None | Custom tags for request organization and analytics |
| lang | list[str] | None | Accept-Language values, e.g. ["en-US", "en"] |
| os | str | None | Override browser OS fingerprint, e.g. "windows", "macos" |
| webhook | str | None | Named webhook for async scrape completion callbacks |
| cost_budget | int | None | Max credits to spend on ASP retries and extra features |
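As a quick illustration of the rendering-related parameters above, the sketch below captures a full-page screenshot; the screenshots dict format follows the table, and the target URL is just an example.
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    render_js=True,  # screenshots require JavaScript rendering
    screenshots={"fullpage": "png"},  # dict format as shown in the parameter table
    screenshot_flags=["block_banners", "load_images"],
))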
ScrapeApiResponse
response = client.scrape(ScrapeConfig(url="https://httpbin.dev"))
response.content # Page content (HTML/JSON/text)
response.scrape_result # Full result dict
response.status_code # Scrapfly API HTTP status code
response.scrape_result["status_code"] # Upstream HTTP status code
response.headers # Response headers
response.context # Metadata about features used
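For example, to act on the upstream status code before saving the content (a minimal sketch):
response = client.scrape(ScrapeConfig(url="https://httpbin.dev"))
if response.scrape_result["status_code"] == 200:
    # persist the page body only when the upstream server answered successfully
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(response.content)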
Examples
Basic scrape
from scrapfly import ScrapflyClient, ScrapeConfig
import os
client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])
result = client.scrape(ScrapeConfig(
url="https://httpbin.dev",
))
print(result.content)
Scrape with anti-bot bypass and geo-targeting
result = client.scrape(ScrapeConfig(
url="https://httpbin.dev",
asp=True, # enable Anti Scraping Protection (ASP) to bypass anti-bot systems
country="us", # match the proxy with the domain country
proxy_pool="public_residential_pool", # use the residential proxy pool to match real ISP IPs
))
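ASP retries and extra features consume API credits; the cost_budget parameter from the table above can cap spending per request. A minimal sketch (the budget value is illustrative):
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev",
    asp=True,
    cost_budget=30,  # illustrative cap: spend at most 30 credits on ASP retries and extras
))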
Scrape with JavaScript rendering
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/products",
render_js=True, # enable JavaScript rendering using a cloud browser
rendering_wait=5000, # wait 5,000 ms after page load before returning
wait_for_selector="div.product", # wait for this selector to be present on the page
))
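For pages that lazy-load content on scroll, auto_scroll (see the parameter table) can be combined with rendering; a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    render_js=True,
    auto_scroll=True,  # scroll the page during rendering so lazy-loaded content is included
))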
Scrape as markdown
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/products",
format="markdown", # other supported formats are: json, text, clean_html
))
print(result.content)
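Markdown output can be trimmed further with format_options (markdown only, per the parameter table); a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    format="markdown",
    format_options=["no_images", "no_links"],  # strip images and links from the markdown output
))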
POST request with body
result = client.scrape(ScrapeConfig(
url="https://httpbin.dev/anything",
method="POST",
headers={"Content-Type": "application/json"},
body='{"query": "search term"}',
))
print(result.content)
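If the payload does not need to be pre-encoded, the data parameter (see the parameter table) can be used instead of body and is encoded by the SDK according to the Content-Type; a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/anything",
    method="POST",
    data={"query": "search term"},  # encoded by the SDK; no manual JSON/form encoding needed
))
print(result.content)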
JavaScript scenario (browser actions)
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/login",
render_js=True,
# browser control
js_scenario=[
{"fill": {"selector": "input[name='username']", "clear": True, "value": "user123"}},
{"fill": {"selector": "input[name='password']", "clear": True, "value": "password"}},
{"click": {"selector": "form > button[type='submit']"}},
{"wait_for_navigation": {"timeout": 5000}},
],
# request headers
headers={
"cookie":"cookiesAccepted=true"
}
))
# access the element under login
print(result.selector.css("div#secret-message::text").get())
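Arbitrary JavaScript can also be executed in the browser through the js parameter (see the parameter table), for instance to scroll to the bottom of the page before content is captured; a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    render_js=True,
    js="window.scrollTo(0, document.body.scrollHeight);",  # run in the headless browser
    rendering_wait=2000,  # give lazy-loaded content time to appear after scrolling
))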
Built-in Parsel selector
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/products",
render_js=True
))
# access the built-in Parsel selector
selector = result.selector
for product in selector.css("div.product"):
print("product name:", product.xpath(".//a/text()").get()) # using xpath
print("product price:", product.css("div.price::text").get()) # using css
Concurrent scraping
import asyncio
import os
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])
urls = [
"https://web-scraping.dev/products",
"https://web-scraping.dev/products?page=2",
"https://web-scraping.dev/products?page=3",
]
configs = [ScrapeConfig(url=url) for url in urls]
async def concurrent_scraping():
results = []
async for result in client.concurrent_scrape(configs):
results.append(result)
for result in results:
print(result.content[:100])
print("===========")
asyncio.run(concurrent_scraping())
Using sessions for stateful scraping
# First request: login
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/login",
render_js=True,
# browser control
js_scenario=[
{"fill": {"selector": "input[name='username']", "clear": True, "value": "user123"}},
{"fill": {"selector": "input[name='password']", "clear": True, "value": "password"}},
{"click": {"selector": "form > button[type='submit']"}},
{"wait_for_navigation": {"timeout": 5000}},
],
# request headers
headers={
"cookie":"cookiesAccepted=true"
},
session="logged-in-session"
))
# Second request: access protected page with same session
result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/login",
session="logged-in-session"
))
# access the built-in selector
print(result.selector.css("div#secret-message::text").get())
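If the target ties state to the client IP, session_sticky_proxy (see the parameter table) keeps the same proxy IP for the whole session; a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/login",
    session="logged-in-session",
    session_sticky_proxy=True,  # reuse the same proxy IP for every request in this session
))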
Caching responses
result = client.scrape(ScrapeConfig(
url="https://example.com",
cache=True,
cache_ttl=3600, # Cache for 1 hour
))
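To force a fresh fetch and replace a stale cached copy, cache_clear (see the parameter table) can be set alongside cache; a minimal sketch:
result = client.scrape(ScrapeConfig(
    url="https://example.com",
    cache=True,
    cache_clear=True,  # discard the existing cached response and store a fresh one
))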
Error Handling
from scrapfly import ScrapflyClient, ScrapeConfig
from scrapfly.errors import (
ScrapflyError,
UpstreamHttpClientError,
UpstreamHttpServerError,
ScrapflyProxyError,
ScrapflyThrottleError,
)
try:
result = client.scrape(ScrapeConfig(url="https://httpbin.dev", asp=True))
except ScrapflyThrottleError as e:
print(f"Rate limited, retry after {e.retry_delay}s")
except UpstreamHttpClientError as e:
print(f"Target returned 4xx: {e.message}")
except UpstreamHttpServerError as e:
print(f"Target returned 5xx: {e.message}")
except ScrapflyProxyError as e:
print(f"Proxy error: {e.message}")
except ScrapflyError as e:
print(f"Scrapfly error: {e.message}")
Important Notes
- render_js=True is required for rendering_wait, wait_for_selector, js, js_scenario, and screenshots
- proxy_pool="public_residential_pool" is recommended when using asp=True
- Use format="markdown" for clean content accessible to LLMs
- Use the session parameter to maintain state across multiple requests
- The concurrent_scrape method handles rate limiting automatically
More from scrapfly/skills
- scrapfly-browser: Automate cloud browsers using the Scrapfly Cloud Browser API with Python Playwright
- scrapfly-extraction: Extract structured data from web content using the Scrapfly Extraction API with the Python SDK
- scrapfly-crawler: Crawl entire websites using the Scrapfly Crawler API with the Python SDK
- scrapfly-screenshot: Capture web page screenshots using the Scrapfly Screenshot API with the Python SDK