# Scrapfly Extraction
Use the Scrapfly Extraction API to extract structured data from HTML, markdown, or text using LLM prompts, pre-trained AI models, or custom extraction templates.
## When to use
- Extracting structured data from web page content
- Using natural language prompts to pull specific information from documents
- Extracting product, article, review, or real estate data with auto AI models
- Parsing HTML/markdown into structured formats with custom templates
- Asking questions about document content and getting AI-powered answers
## Setup

```shell
pip install scrapfly-sdk
```

The API key must be provided via the `SCRAPFLY_API_KEY` environment variable or passed directly to the client.
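If you want a single code path that accepts either configuration style, a small helper can resolve the key; `resolve_api_key` is an illustrative name, not part of the SDK:

```python
import os

def resolve_api_key(explicit_key=None):
    """Return an explicitly passed key, falling back to the environment."""
    key = explicit_key or os.environ.get("SCRAPFLY_API_KEY")
    if not key:
        raise RuntimeError("Set SCRAPFLY_API_KEY or pass the key explicitly")
    return key
```

You can then construct the client with `ScrapflyClient(key=resolve_api_key())`.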
## API Reference

Endpoint: `POST https://api.scrapfly.io/extraction`

### ScrapflyClient

```python
from scrapfly import ScrapflyClient, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])
```
### ExtractionConfig Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `body` | str | required | Document content to extract from |
| `content_type` | str | required | Input format: `"text/html"`, `"text/markdown"`, `"text/plain"`, `"text/xml"` |
| `url` | str | None | Base URL for resolving relative links in HTML |
| `charset` | str | None | Document encoding (auto-detected if omitted) |
| `extraction_prompt` | str | None | Natural language instruction for LLM extraction |
| `extraction_model` | str | None | Pre-trained model name, e.g. `"product"`, `"article"`, `"review_list"` (full list below) |
| `extraction_template` | str | None | Custom template name or inline template definition |
| `timeout` | int | None | Processing timeout in seconds (60-155) |
| `webhook_name` | str | None | Webhook name for async processing |

You must provide exactly one of: `extraction_prompt`, `extraction_model`, or `extraction_template`.
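The exactly-one-of constraint can be checked client-side before building a request. A minimal sketch; the helper name is ours, and the API enforces the same rule server-side regardless:

```python
def check_extraction_method(extraction_prompt=None, extraction_model=None,
                            extraction_template=None):
    """Raise ValueError unless exactly one extraction method is set."""
    chosen = [name for name, value in [
        ("extraction_prompt", extraction_prompt),
        ("extraction_model", extraction_model),
        ("extraction_template", extraction_template),
    ] if value is not None]
    if len(chosen) != 1:
        raise ValueError(f"Provide exactly one extraction method, got: {chosen or 'none'}")
    return chosen[0]
```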
## Three Extraction Methods

### 1. LLM Prompt Extraction

Use natural language to describe what data to extract. The AI interprets the content and returns structured results.

### 2. Auto AI Models

Pre-trained models for common data types. Each model returns a standardized schema with quality scores.

- `"article"` - News/blog articles (title, author, date, content, etc.)
- `"event"` - Events (name, date, location, description, etc.)
- `"food_recipe"` - Recipes (ingredients, steps, servings, etc.)
- `"hotel"` - Single hotel/property (name, amenities, rating, etc.)
- `"hotel_listing"` - Hotel search/list results
- `"job_listing"` - Job search/list results
- `"job_posting"` - Single job (title, company, salary, description, etc.)
- `"organization"` - Company/organization (name, contact, description, etc.)
- `"product"` - E-commerce product (name, price, description, images, etc.)
- `"product_listing"` - Product search/category listing
- `"real_estate_property"` - Single property (price, address, features, etc.)
- `"real_estate_property_listing"` - Property search/list results
- `"review_list"` - Lists of reviews (reviewer, rating, text, date, etc.)
- `"search_engine_results"` - SERP data (results, snippets, etc.)
- `"social_media_post"` - Social post (author, content, engagement, etc.)
- `"software"` - Software/app (name, description, pricing, etc.)
- `"stock"` - Stock/market data
- `"vehicle_ad"` - Single vehicle listing
- `"vehicle_ad_listing"` - Vehicle search/list results
### 3. Custom Templates

Structured extraction rules for consistent parsing across similar pages. Templates can be defined inline or stored on Scrapfly for reuse.
```python
extraction_template = {
    "source": "html",
    "selectors": [
        {
            "name": "title",
            "query": "h3.product-title::text",
            "type": "css",
            "formatters": [
                {"name": "uppercase"}
            ],
        },
        {
            "name": "description",
            "query": "p.product-description::text",
            "type": "css"
        },
        {
            "extractor": {"name": "price"},
            "name": "price",
            "query": ".product-price::text",
            "type": "css"
        },
        {
            "name": "variants",
            "query": "div.variants",
            "type": "css",
            "nested": [
                {
                    "name": "name",
                    "query": "//a[@data-variant-id]/@data-variant-id",
                    "type": "xpath",
                    "multiple": True,
                },
                {
                    "name": "link",
                    "query": "//a[@data-variant-id]/@href",
                    "type": "xpath",
                    "multiple": True,
                },
            ]
        },
        {
            "name": "reviews",
            "query": "div.review>p::text",
            "type": "css",
            "multiple": True,
        }
    ]
}
```
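Conceptually, each selector's `formatters` list transforms the extracted value in declaration order. A pure-Python sketch of that chaining, using the `uppercase` and `lowercase` formatter names as examples; this is illustrative only, not the API's implementation:

```python
# Illustrative: how a chain of named formatters transforms an extracted value.
FORMATTERS = {
    "uppercase": str.upper,
    "lowercase": str.lower,
}

def apply_formatters(value, formatters):
    """Apply each named formatter to the value in declaration order."""
    for spec in formatters or []:
        value = FORMATTERS[spec["name"]](value)
    return value
```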
## Examples

### LLM prompt extraction from HTML

```python
from scrapfly import ScrapflyClient, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])

html_content = "<html><body><h1>iPhone 15</h1><span class='price'>$999</span><p>Latest Apple smartphone</p></body></html>"

result = client.extract(ExtractionConfig(
    body=html_content,
    content_type="text/html",
    extraction_prompt="Extract the product name, price, and description as JSON",
))

# extracted data
print(result.extraction_result['data'])  # or: result.data
# extracted content type
print(result.extraction_result['content_type'])  # or: result.content_type
```
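Since both accessor shapes return the same payload, a small helper can normalize them; `extraction_data` is an illustrative name, not part of the SDK:

```python
def extraction_data(result):
    """Return extracted data from either accessor shape, preferring the dict form."""
    payload = getattr(result, "extraction_result", None)
    if isinstance(payload, dict) and "data" in payload:
        return payload["data"]
    return result.data
```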
### LLM prompt extraction from markdown

```python
markdown_content = """
# Best Restaurants in NYC
1. **Le Bernardin** - French, $$$, 4.8 stars
2. **Peter Luger** - Steakhouse, $$$, 4.5 stars
3. **Di Fara Pizza** - Italian, $, 4.7 stars
"""

result = client.extract(ExtractionConfig(
    body=markdown_content,
    content_type="text/markdown",
    extraction_prompt="Extract each restaurant as a JSON array with name, cuisine, price_range, and rating fields",
))
print(result.data)
```
### Auto AI model: product extraction

```python
from scrapfly import ScrapflyClient, ExtractionConfig, ScrapeConfig

# First scrape the page, then extract
scrape_result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1"))

result = client.extract(ExtractionConfig(
    body=scrape_result.content,
    content_type="text/html",
    url="https://web-scraping.dev/product/1",
    extraction_model="product",
))
print(result.data)
# Returns: {"name": "...", "price": "...", "currency": "...", "description": "...", ...}
```
### Ask a question about content

```python
result = client.extract(ExtractionConfig(
    body=html_content,
    content_type="text/html",
    extraction_prompt="What is the most expensive item on this page and how much does it cost?",
))
print(result.data)
```
### Scrape + Extract in one flow

```python
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])

# Step 1: Scrape the page
scrape_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    format="markdown",
))

# Step 2: Extract structured data from the content
extraction_result = client.extract(ExtractionConfig(
    body=scrape_result.content,
    content_type="text/markdown",
    extraction_prompt="Extract all products as a JSON array with fields: name, price, availability",
))
print(extraction_result.data)
```
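When pairing the two steps, the scrape `format` and the extraction `content_type` must agree. A small mapping helper keeps them in sync; this is an illustrative sketch covering only the formats mentioned in this document:

```python
def content_type_for(scrape_format=None):
    """Pick the Extraction API content_type matching a ScrapeConfig format."""
    return {
        "markdown": "text/markdown",
        "text": "text/plain",
    }.get(scrape_format, "text/html")  # default scrape output is HTML
```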
### Inline extraction with the Scrape API

You can also extract data directly within a scrape request:

```python
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    extraction_prompt="Extract the product name, price, and description as JSON",
))

# Extraction result is included in the scrape response
print(result.scrape_result["extracted_data"])
```
### Custom extraction template (inline)

```python
# First, scrape the web page to retrieve its HTML
api_response = client.scrape(scrape_config=ScrapeConfig(
    url='https://web-scraping.dev/product/1',
    render_js=True
))
html = api_response.content

# Extraction template with HTML parsing instructions. It accepts:
#   selectors: CSS, XPath, JMESPath, Regex, Nested (nesting multiple selector types)
#   extractors: extract commonly accessed data types: price, image, links, emails
#   formatters: transform the extracted data with common methods: lowercase, uppercase, datetime, etc.
# Refer to the docs for more details: https://scrapfly.io/docs/extraction-api/rules-and-template#rules
extraction_template = {
    "source": "html",
    "selectors": [
        {
            "name": "title",
            "query": "h3.product-title::text",
            "type": "css",
            "formatters": [
                {"name": "uppercase"}
            ],
        },
        {
            "name": "description",
            "query": "p.product-description::text",
            "type": "css"
        },
        {
            "extractor": {"name": "price"},
            "name": "price",
            "query": ".product-price::text",
            "type": "css"
        },
        {
            "name": "variants",
            "query": "div.variants",
            "type": "css",
            "nested": [
                {
                    "name": "name",
                    "query": "//a[@data-variant-id]/@data-variant-id",
                    "type": "xpath",
                    "multiple": True,
                },
                {
                    "name": "link",
                    "query": "//a[@data-variant-id]/@href",
                    "type": "xpath",
                    "multiple": True,
                },
            ]
        },
        {
            "name": "reviews",
            "query": "div.review>p::text",
            "type": "css",
            "multiple": True,
        }
    ]
}

extraction_api_response = client.extract(
    extraction_config=ExtractionConfig(
        body=html,  # pass the HTML content
        content_type='text/html',  # content data type
        charset='utf-8',  # charset of the passed content; use `auto` if you aren't sure
        extraction_ephemeral_template=extraction_template  # inline template definition (use extraction_template for a template name saved on the dashboard)
    )
)

# extracted data
print(extraction_api_response.data)
# extracted data content_type
print(extraction_api_response.content_type)
```
## Error Handling

```python
from scrapfly.errors import ScrapflyError

try:
    result = client.extract(ExtractionConfig(
        body="<html><body><h1>iPhone 15</h1><span class='price'>$999</span><p>Latest Apple smartphone</p></body></html>",
        content_type="text/html",
        extraction_prompt="Extract the product price",
    ))
    print(result.data)
except ScrapflyError as e:
    print(f"Extraction failed: {e.message}")
```
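For transient failures, a retry wrapper with exponential backoff can sit around the extract call. A minimal sketch under the assumption that the call is passed as a zero-argument callable; in real use, catch `ScrapflyError` rather than bare `Exception`:

```python
import time

def extract_with_retries(do_extract, attempts=3, base_delay=1.0):
    """Retry a callable, doubling the delay after each failure.

    `do_extract` would be e.g. `lambda: client.extract(config)`.
    """
    for attempt in range(attempts):
        try:
            return do_extract()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```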
## Important Notes

- Provide exactly one extraction method per request: `extraction_prompt`, `extraction_model`, or `extraction_template`
- For LLM prompts, be specific about the desired output format (e.g., "as JSON")
- The `url` parameter helps resolve relative links in HTML but is not required
- For large documents, consider using `format="markdown"` or `format="text"` in the scrape step first to reduce token usage
- Auto AI models return quality metrics alongside extracted data
- Extraction can also be done inline during a scrape by adding `extraction_prompt` or `extraction_model` to `ScrapeConfig`