Scrapfly Extraction

Use the Scrapfly Extraction API to extract structured data from HTML, markdown, or text using LLM prompts, pre-trained AI models, or custom extraction templates.

When to use

  • Extracting structured data from web page content
  • Using natural language prompts to pull specific information from documents
  • Extracting product, article, review, or real estate data with auto AI models
  • Parsing HTML/markdown into structured formats with custom templates
  • Asking questions about document content and getting AI-powered answers

Setup

pip install scrapfly-sdk

The API key must be provided via environment variable SCRAPFLY_API_KEY or passed directly to the client.
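For example, in a shell session (the key value is a placeholder; substitute your own):

```shell
# make the key available to ScrapflyClient via the environment
export SCRAPFLY_API_KEY="your-api-key-here"
```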

API Reference

Endpoint: POST https://api.scrapfly.io/extraction

ScrapflyClient

from scrapfly import ScrapflyClient, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])

ExtractionConfig Parameters

Parameter                      Type  Default   Description
body                           str   required  Document content to extract from
content_type                   str   required  Input format: "text/html", "text/markdown", "text/plain", "text/xml"
url                            str   None      Base URL for resolving relative links in HTML
charset                        str   None      Document encoding (auto-detected if omitted)
extraction_prompt              str   None      Natural language instruction for LLM extraction
extraction_model               str   None      Pre-trained model name, e.g. "product", "article", "review_list" (full list under "Auto AI Models" below)
extraction_template            str   None      Name of a custom template saved on the Scrapfly dashboard
extraction_ephemeral_template  dict  None      Inline custom template definition (see "Custom Templates" below)
timeout                        int   None      Processing timeout in seconds (60-155)
webhook_name                   str   None      Webhook name for async processing

You must provide exactly one of: extraction_prompt, extraction_model, extraction_template, or extraction_ephemeral_template.
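This constraint can be checked before sending a request. A minimal sketch (the pick_extraction_method helper is hypothetical, not part of the SDK):

```python
# Hypothetical helper (not part of the scrapfly SDK): verify that exactly
# one extraction method is supplied before building an ExtractionConfig.
def pick_extraction_method(prompt=None, model=None, template=None):
    methods = {
        "extraction_prompt": prompt,
        "extraction_model": model,
        "extraction_template": template,
    }
    provided = {k: v for k, v in methods.items() if v is not None}
    if len(provided) != 1:
        raise ValueError(
            f"Provide exactly one extraction method, got: {sorted(provided) or 'none'}"
        )
    return next(iter(provided.items()))

print(pick_extraction_method(prompt="Extract the product name as JSON"))
# -> ('extraction_prompt', 'Extract the product name as JSON')
```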

Three Extraction Methods

1. LLM Prompt Extraction

Use natural language to describe what data to extract. The AI interprets the content and returns structured results.

2. Auto AI Models

Pre-trained models for common data types. Returns standardized schemas with quality scores.

  • "article" - News/blog articles (title, author, date, content, etc.)
  • "event" - Events (name, date, location, description, etc.)
  • "food_recipe" - Recipes (ingredients, steps, servings, etc.)
  • "hotel" - Single hotel/property (name, amenities, rating, etc.)
  • "hotel_listing" - Hotel search/list results
  • "job_listing" - Job search/list results
  • "job_posting" - Single job (title, company, salary, description, etc.)
  • "organization" - Company/organization (name, contact, description, etc.)
  • "product" - E-commerce product (name, price, description, images, etc.)
  • "product_listing" - Product search/category listing
  • "real_estate_property" - Single property (price, address, features, etc.)
  • "real_estate_property_listing" - Property search/list results
  • "review_list" - Lists of reviews (reviewer, rating, text, date, etc.)
  • "search_engine_results" - SERP data (results, snippets, etc.)
  • "social_media_post" - Social post (author, content, engagement, etc.)
  • "software" - Software/app (name, description, pricing, etc.)
  • "stock" - Stock/market data
  • "vehicle_ad" - Single vehicle listing
  • "vehicle_ad_listing" - Vehicle search/list results

3. Custom Templates

Structured extraction rules for consistent parsing across similar pages. Can be defined inline or stored on Scrapfly for reuse.

A full inline template definition, showing CSS and XPath selectors, nested selectors, extractors, and formatters, appears under "Custom extraction template (inline)" in the Examples section below.

Examples

LLM prompt extraction from HTML

from scrapfly import ScrapflyClient, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])

html_content = "<html><body><h1>iPhone 15</h1><span class='price'>$999</span><p>Latest Apple smartphone</p></body></html>"

result = client.extract(ExtractionConfig(
    body=html_content,
    content_type="text/html",
    extraction_prompt="Extract the product name, price, and description as JSON",
))

# extracted data
result.extraction_result['data']
# or via the shortcut property
print(result.data)

# content type of the extracted result
result.extraction_result['content_type']
# or
print(result.content_type)

LLM prompt extraction from markdown

markdown_content = """
# Best Restaurants in NYC

1. **Le Bernardin** - French, $$$, 4.8 stars
2. **Peter Luger** - Steakhouse, $$$, 4.5 stars
3. **Di Fara Pizza** - Italian, $, 4.7 stars
"""

result = client.extract(ExtractionConfig(
    body=markdown_content,
    content_type="text/markdown",
    extraction_prompt="Extract each restaurant as a JSON array with name, cuisine, price_range, and rating fields",
))
print(result.data)

Auto AI model: product extraction

from scrapfly import ScrapflyClient, ExtractionConfig, ScrapeConfig
# First scrape the page, then extract
scrape_result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1"))

result = client.extract(ExtractionConfig(
    body=scrape_result.content,
    content_type="text/html",
    url="https://web-scraping.dev/product/1",
    extraction_model="product",
))

print(result.data)
# Returns: {"name": "...", "price": "...", "currency": "...", "description": "...", ...}

Ask a question about content

result = client.extract(ExtractionConfig(
    body=html_content,
    content_type="text/html",
    extraction_prompt="What is the most expensive item on this page and how much does it cost?",
))
print(result.data)

Scrape + Extract in one flow

from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig
import os

client = ScrapflyClient(key=os.environ["SCRAPFLY_API_KEY"])

# Step 1: Scrape the page
scrape_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    format="markdown",
))

# Step 2: Extract structured data from the content
extraction_result = client.extract(ExtractionConfig(
    body=scrape_result.content,
    content_type="text/markdown",
    extraction_prompt="Extract all products as a JSON array with fields: name, price, availability",
))

print(extraction_result.data)

Inline extraction with the Scrape API

You can also extract data directly within a scrape request:

result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    extraction_prompt="Extract the product name, price, and description as JSON",
))
# Extraction result is included in the scrape response
print(result.scrape_result["extracted_data"])

Custom extraction template (inline)

# First, scrape the web page to retrieve its HTML
api_response = client.scrape(scrape_config=ScrapeConfig(
    url='https://web-scraping.dev/product/1',
    render_js=True
))

html = api_response.content

# An extraction template declares HTML parsing instructions. It accepts:
# selectors: CSS, XPath, JMESPath, Regex, Nested (mixing multiple selector types)
# extractors: pull commonly needed data types: price, image, links, emails
# formatters: transform the extracted data: lowercase, uppercase, datetime, etc.
# refer to the docs for more details: https://scrapfly.io/docs/extraction-api/rules-and-template#rules
extraction_template = {
    "source": "html",
    "selectors": [    
        {
            "name": "title",
            "query": "h3.product-title::text",
            "type": "css",
            "formatters": [
                {
                    "name": "uppercase"
                }
            ],
        },
        {
            "name": "description",
            "query": "p.product-description::text",
            "type": "css"
        },
        {
            "extractor": {
                "name": "price"
            },
            "name": "price",
            "query": ".product-price::text",
            "type": "css"
        },
        {
            "name": "variants",
            "query": "div.variants",
            "type": "css",
            "nested": [
                {
                    "name": "name",
                    "query": "//a[@data-variant-id]/@data-variant-id",
                    "type": "xpath",
                    "multiple": True,
                },
                {
                    "name": "link",
                    "query": "//a[@data-variant-id]/@href",
                    "type": "xpath",
                    "multiple": True,
                },
            ]
        },
        {
            "name": "reviews",
            "query": "div.review>p::text",
            "type": "css",
            "multiple": True,            
        }
    ]
}

extraction_api_response = client.extract(
    extraction_config=ExtractionConfig(
        body=html, # pass the HTML content
        content_type='text/html', # content data type
        charset='utf-8', # passed content charset, use `auto` if you aren't sure
        extraction_ephemeral_template=extraction_template # inline template definition; use extraction_template for a template name saved on the dashboard
    )
)

# extracted data
print(extraction_api_response.data)

# extracted data content_type
print(extraction_api_response.content_type)
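To illustrate what the "uppercase" formatter in the template above does to a selector's value, here is a pure-Python sketch (illustration only; the real transformation runs server-side, and Scrapfly supports more formatters than shown):

```python
# Illustration only: emulates a formatter chain like the one attached to the
# "title" selector. The actual formatting happens on Scrapfly's servers.
FORMATTERS = {
    "uppercase": str.upper,
    "lowercase": str.lower,
}

def apply_formatters(value, formatters):
    # Apply each formatter in order, as listed in the template.
    for f in formatters:
        value = FORMATTERS[f["name"]](value)
    return value

print(apply_formatters("Box of Chocolate Candy", [{"name": "uppercase"}]))
# -> BOX OF CHOCOLATE CANDY
```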

Error Handling

from scrapfly.errors import ScrapflyError

try:
    result = client.extract(ExtractionConfig(
        body="<html><body><h1>iPhone 15</h1><span class='price'>$999</span><p>Latest Apple smartphone</p></body></html>",
        content_type="text/html",
        extraction_prompt="Extract the product price",
    ))
    print(result.data)
except ScrapflyError as e:
    print(f"Extraction failed: {e.message}")

Important Notes

  • Provide exactly one extraction method per request: extraction_prompt, extraction_model, or extraction_template
  • For LLM prompts, be specific about the desired output format (e.g., "as JSON")
  • The url parameter helps resolve relative links in HTML but is not required
  • For large documents, consider using format="markdown" or format="text" in the scrape step first to reduce token usage
  • Auto AI models return quality metrics alongside extracted data
  • Extraction can also be done inline during a scrape by adding extraction_prompt or extraction_model to ScrapeConfig
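To see why the markdown/text tip above helps, compare payload sizes after stripping markup. This crude tag-stripper is only an illustration; Scrapfly's format="markdown" conversion in the Scrape API does this properly server-side:

```python
import re

def crude_strip_html(html: str) -> str:
    # Remove tags and collapse whitespace -- a rough stand-in for the
    # HTML-to-text conversion the Scrape API performs with format="text".
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

html = "<html><body><h1>iPhone 15</h1><span class='price'>$999</span></body></html>"
print(crude_strip_html(html))  # iPhone 15 $999
print(len(html), "->", len(crude_strip_html(html)))
```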
Repository: scrapfly/skills
First seen: Mar 15, 2026