crawl4ai

Installation

SKILL.md

Crawl4AI Skill

Complete guide for Crawl4AI - the #1 trending GitHub repository for AI-powered web crawling and scraping. Generate clean, AI-ready markdown from any website with intelligent extraction strategies.

Quick Reference

Core Capabilities

Feature	Description
Clean Markdown	AI-optimized markdown output from any webpage
LLM Extraction	Use GPT-4, Claude, or local models for structured extraction
CSS/XPath Extraction	Fast, token-efficient data extraction with selectors
Browser Automation	Full Chromium control with JavaScript execution
Adaptive Crawling	AI-guided crawling that follows relevant links
Session Management	Persistent browser state across requests
Parallel Crawling	Efficient multi-URL processing
RAG Integration	Built for retrieval-augmented generation pipelines

When to Use Crawl4AI vs Other Tools

Tool	Best For
Crawl4AI	AI/LLM workflows, RAG, clean markdown, structured extraction
Playwright	Complex browser automation, testing, visual verification
Puppeteer	Node.js environments, Chrome DevTools Protocol
Scrapy	Large-scale traditional scraping, spider frameworks
BeautifulSoup	Simple HTML parsing without browser rendering
Selenium	Legacy browser automation, cross-browser testing

1. Installation

Basic Installation

# Install Crawl4AI
pip install crawl4ai

# Install browser dependencies (Chromium)
crawl4ai-setup

# Verify installation
crawl4ai-doctor

Installation Options

# With PyTorch support (for local embeddings)
pip install "crawl4ai[torch]"

# With Transformers (for Hugging Face models)
pip install "crawl4ai[transformer]"

# With all optional features
pip install "crawl4ai[all]"

# For development
pip install "crawl4ai[dev]"

Docker Installation

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget gnupg ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install Crawl4AI
RUN pip install crawl4ai
RUN crawl4ai-setup

WORKDIR /app
COPY . .

CMD ["python", "crawler.py"]

Troubleshooting Installation

# If browser not found
crawl4ai-setup --force

# Check browser path
python -c "from crawl4ai import AsyncWebCrawler; print('OK')"

# Verbose diagnostics
crawl4ai-doctor --verbose

2. Basic Usage

Minimal Example

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")

        # Access different output formats
        print(result.markdown)           # Clean markdown
        print(result.cleaned_html)       # Sanitized HTML
        print(result.extracted_content)  # Raw extracted content
        print(result.links)              # All links found

asyncio.run(main())

Synchronous Wrapper

from crawl4ai import WebCrawler

# For simpler use cases
with WebCrawler() as crawler:
    result = crawler.run("https://example.com")
    print(result.markdown)

Result Object Properties

result = await crawler.arun(url)

# Content Properties
result.markdown              # Clean markdown output
result.markdown_v2           # Enhanced markdown with metadata
result.cleaned_html          # Sanitized HTML
result.extracted_content     # Raw extracted content
result.fit_markdown          # Markdown optimized for token limits

# Metadata
result.url                   # Final URL (after redirects)
result.success               # Boolean success status
result.status_code           # HTTP status code
result.response_headers      # Response headers dict
result.error_message         # Error details if failed

# Links and Media
result.links                 # Dict with internal/external links
result.media                 # Dict with images, videos, audio
result.metadata              # Page metadata (title, description, etc.)

# Extraction Results
result.extracted_content     # LLM/CSS extraction results

3. Configuration Classes

BrowserConfig

Controls browser behavior and appearance.

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    # Browser Selection
    browser_type="chromium",        # "chromium", "firefox", "webkit"
    headless=True,                  # Run without GUI

    # Viewport
    viewport_width=1920,
    viewport_height=1080,

    # Identity
    user_agent="Custom User Agent",
    user_agent_mode="random",       # "fixed", "random"

    # Network
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    },

    # Performance
    use_cached_html=True,

    # Stealth Mode (avoid detection)
    use_managed_browser=True,

    # Browser Arguments
    extra_args=["--disable-gpu", "--no-sandbox"],

    # Cookies and Headers
    cookies=[
        {"name": "session", "value": "abc123", "domain": ".example.com"}
    ],
    headers={
        "Accept-Language": "en-US,en;q=0.9"
    },

    # JavaScript
    java_script_enabled=True,

    # Downloads
    accept_downloads=True,
    downloads_path="./downloads",

    # Debugging
    verbose=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")

CrawlerRunConfig

Controls crawling behavior for each request.

from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    # Caching
    cache_mode=CacheMode.ENABLED,   # ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, BYPASS

    # Content Extraction
    word_count_threshold=10,        # Min words per block
    excluded_tags=["nav", "footer", "aside"],
    exclude_external_links=False,
    exclude_social_media_links=True,

    # CSS Selectors
    css_selector=".main-content",   # Only extract from this selector
    excluded_selector=".ads, .sidebar",

    # Page Interaction
    wait_for="css:.content-loaded", # Wait for element
    delay_before_return_html=2.0,   # Seconds to wait

    # JavaScript Execution
    js_code="""
        document.querySelector('.load-more').click();
        await new Promise(r => setTimeout(r, 2000));
    """,
    js_only=False,                  # Skip initial page load

    # Scrolling for Lazy Loading
    scan_full_page=True,            # Auto-scroll entire page
    scroll_delay=0.5,               # Delay between scrolls

    # Screenshots and PDF
    screenshot=True,
    screenshot_wait_for="css:.loaded",
    pdf=True,

    # Content Processing
    remove_overlay_elements=True,   # Remove popups/modals
    process_iframes=True,           # Include iframe content

    # Media Handling
    exclude_external_images=False,

    # Session
    session_id="my-session",        # Reuse browser state

    # Extraction Strategy
    extraction_strategy=None,       # Set extraction strategy

    # Magic Mode (AI-powered content detection)
    magic=False,

    # Verbose
    verbose=True
)

result = await crawler.arun("https://example.com", config=run_config)

Cache Modes

from crawl4ai import CacheMode

# Always use cache if available
CacheMode.ENABLED

# Never use cache
CacheMode.DISABLED

# Only read from cache, don't write
CacheMode.READ_ONLY

# Only write to cache, don't read
CacheMode.WRITE_ONLY

# Bypass cache completely
CacheMode.BYPASS

4. Extraction Strategies

CSS/XPath Extraction (Token-Efficient)

Best for structured data when you know the page layout.

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Define extraction schema
schema = {
    "name": "Products",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {
            "name": "title",
            "selector": "h2.product-title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        },
        {
            "name": "url",
            "selector": "a.product-link",
            "type": "attribute",
            "attribute": "href"
        },
        {
            "name": "rating",
            "selector": ".stars",
            "type": "attribute",
            "attribute": "data-rating"
        },
        {
            "name": "description",
            "selector": ".description",
            "type": "html"  # Get inner HTML
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema, verbose=True)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun("https://shop.example.com/products", config=config)

# Parse extracted data
import json
products = json.loads(result.extracted_content)
for product in products:
    print(f"{product['title']}: {product['price']}")

Nested Extraction

schema = {
    "name": "Articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {"name": "author", "selector": ".author-name", "type": "text"},
        {
            "name": "comments",
            "selector": ".comment",
            "type": "nested",
            "fields": [
                {"name": "user", "selector": ".comment-user", "type": "text"},
                {"name": "text", "selector": ".comment-body", "type": "text"},
                {"name": "date", "selector": ".comment-date", "type": "text"}
            ]
        }
    ]
}

XPath Extraction

schema = {
    "name": "Data",
    "baseSelector": "//div[@class='item']",  # XPath for base
    "fields": [
        {
            "name": "title",
            "selector": ".//h2/text()",  # XPath selector
            "type": "xpath"
        }
    ]
}

LLM Extraction Strategy

Use AI for complex, unstructured content.

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMConfig
from pydantic import BaseModel, Field
from typing import List, Optional
import os

# Define output schema with Pydantic
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    currency: str = Field(default="USD")
    description: str = Field(description="Product description")
    features: List[str] = Field(default_factory=list, description="Key features")
    rating: Optional[float] = Field(default=None, description="Rating out of 5")
    in_stock: bool = Field(default=True)

class ProductList(BaseModel):
    products: List[Product]

# Configure LLM
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # or "anthropic/claude-3-haiku-20240307"
    api_token=os.getenv("OPENAI_API_KEY"),
    temperature=0.0,
    max_tokens=4000
)

# Create extraction strategy
strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=ProductList.model_json_schema(),
    extraction_type="schema",  # "schema" or "block"
    instruction="""
    Extract all products from this e-commerce page.
    Include the name, price, description, and any listed features.
    If rating is shown as stars, convert to a number out of 5.
    """,
    verbose=True
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun("https://shop.example.com", config=config)

# Parse results
data = ProductList.model_validate_json(result.extracted_content)
for product in data.products:
    print(f"{product.name}: ${product.price}")

LLM Providers

# OpenAI
llm_config = LLMConfig(
    provider="openai/gpt-4o",
    api_token=os.getenv("OPENAI_API_KEY")
)

# Anthropic Claude
llm_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20241022",
    api_token=os.getenv("ANTHROPIC_API_KEY")
)

# Azure OpenAI
llm_config = LLMConfig(
    provider="azure/gpt-4",
    api_token=os.getenv("AZURE_API_KEY"),
    base_url="https://your-resource.openai.azure.com/"
)

# Local Ollama
llm_config = LLMConfig(
    provider="ollama/llama3.2",
    base_url="http://localhost:11434"
)

# OpenRouter (access many models)
llm_config = LLMConfig(
    provider="openrouter/anthropic/claude-3-opus",
    api_token=os.getenv("OPENROUTER_API_KEY")
)

Block Extraction (No Schema)

Extract content as semantic blocks without predefined structure.

strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    extraction_type="block",
    instruction="Extract the main article content, author bio, and related articles."
)

Cosine Similarity Strategy

Extract content based on semantic similarity to a query.

from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="machine learning tutorials",
    word_count_threshold=50,
    sim_threshold=0.3,
    max_dist=0.2,
    top_k=5
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url, config=config)

5. Adaptive Crawling

AI-guided crawling that intelligently follows relevant links.

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, BrowserConfig, CrawlerRunConfig

async def adaptive_crawl_docs():
    browser_config = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Create adaptive crawler
        adaptive = AdaptiveCrawler(
            crawler=crawler,
            llm_config=LLMConfig(
                provider="openai/gpt-4o-mini",
                api_token=os.getenv("OPENAI_API_KEY")
            )
        )

        # Digest a documentation site
        result = await adaptive.digest(
            start_url="https://docs.example.com/getting-started",
            query="How to authenticate with the API",

            # Crawl Control
            max_pages=20,              # Maximum pages to crawl
            max_depth=3,               # Maximum link depth

            # Relevance
            confidence_threshold=0.7,   # Minimum relevance score

            # Link Selection
            include_patterns=["docs.example.com/*"],
            exclude_patterns=["*/changelog/*", "*/releases/*"],

            # Output
            verbose=True
        )

        # Results
        print(f"Crawled {len(result.pages)} pages")
        print(f"Answer: {result.answer}")

        for page in result.pages:
            print(f"- {page.url} (relevance: {page.relevance_score:.2f})")

Adaptive Crawler for Knowledge Base

async def build_knowledge_base():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler=crawler, llm_config=llm_config)

        # Crawl multiple documentation sections
        topics = [
            ("https://docs.example.com/auth", "authentication methods"),
            ("https://docs.example.com/api", "API endpoints"),
            ("https://docs.example.com/sdk", "SDK installation"),
        ]

        all_content = []

        for url, query in topics:
            result = await adaptive.digest(
                start_url=url,
                query=query,
                max_pages=10,
                confidence_threshold=0.6
            )

            for page in result.pages:
                all_content.append({
                    "url": page.url,
                    "topic": query,
                    "content": page.markdown,
                    "relevance": page.relevance_score
                })

        return all_content

6. Multi-URL Crawling

Efficiently crawl multiple URLs with parallel processing.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_multiple():
    urls = [
        "https://site1.com/page1",
        "https://site1.com/page2",
        "https://site2.com/article",
        "https://site3.com/docs",
    ]

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=50
    )

    async with AsyncWebCrawler() as crawler:
        # Crawl all URLs in parallel
        results = await crawler.arun_many(
            urls,
            config=config,
            max_concurrent=5  # Limit concurrent requests
        )

        for result in results:
            if result.success:
                print(f"OK: {result.url}")
                print(f"   Words: {len(result.markdown.split())}")
            else:
                print(f"FAIL: {result.url} - {result.error_message}")

Crawling with Rate Limiting

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import asyncio

async def rate_limited_crawl(urls: list, requests_per_second: float = 1.0):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    delay = 1.0 / requests_per_second

    results = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url, config=config)
            results.append(result)
            await asyncio.sleep(delay)

    return results

Sitemap Crawling

import xml.etree.ElementTree as ET
import httpx

async def crawl_sitemap(sitemap_url: str, max_urls: int = 100):
    # Fetch sitemap
    async with httpx.AsyncClient() as client:
        response = await client.get(sitemap_url)
        root = ET.fromstring(response.text)

    # Extract URLs
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    urls = [
        loc.text for loc in root.findall('.//ns:loc', namespace)
    ][:max_urls]

    # Crawl all URLs
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, max_concurrent=10)

    return results

7. Session Management

Maintain browser state across multiple requests.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def session_example():
    async with AsyncWebCrawler() as crawler:
        # Login with session
        login_config = CrawlerRunConfig(
            session_id="user-session",
            js_code="""
                document.querySelector('#username').value = 'user@example.com';
                document.querySelector('#password').value = 'password123';
                document.querySelector('form').submit();
            """,
            wait_for="css:.dashboard"
        )

        await crawler.arun("https://example.com/login", config=login_config)

        # Subsequent requests use same session (cookies, localStorage)
        data_config = CrawlerRunConfig(
            session_id="user-session"  # Same session ID
        )

        # Access authenticated pages
        result1 = await crawler.arun("https://example.com/dashboard", config=data_config)
        result2 = await crawler.arun("https://example.com/profile", config=data_config)
        result3 = await crawler.arun("https://example.com/settings", config=data_config)

        return [result1, result2, result3]

Session with Cookies

from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    cookies=[
        {
            "name": "auth_token",
            "value": "abc123xyz",
            "domain": ".example.com",
            "path": "/",
            "httpOnly": True,
            "secure": True
        },
        {
            "name": "preferences",
            "value": "dark_mode=true",
            "domain": ".example.com"
        }
    ]
)

8. Page Interaction

JavaScript Execution

from crawl4ai import CrawlerRunConfig

# Click load more button
config = CrawlerRunConfig(
    js_code="""
        // Click all "Load More" buttons
        while (document.querySelector('.load-more:not(.disabled)')) {
            document.querySelector('.load-more').click();
            await new Promise(r => setTimeout(r, 1000));
        }
    """,
    wait_for="css:.all-loaded"
)

# Form interaction
config = CrawlerRunConfig(
    js_code="""
        // Fill search form
        document.querySelector('#search-input').value = 'crawl4ai';
        document.querySelector('#search-form').submit();
    """,
    wait_for="css:.search-results"
)

# Scroll and interact
config = CrawlerRunConfig(
    js_code="""
        // Scroll to load lazy content
        for (let i = 0; i < 5; i++) {
            window.scrollTo(0, document.body.scrollHeight);
            await new Promise(r => setTimeout(r, 500));
        }

        // Expand all accordions
        document.querySelectorAll('.accordion-toggle').forEach(el => el.click());
    """
)

Waiting Strategies

# Wait for CSS selector
config = CrawlerRunConfig(wait_for="css:.content-loaded")

# Wait for JavaScript condition
config = CrawlerRunConfig(wait_for="js:window.dataLoaded === true")

# Wait for XPath
config = CrawlerRunConfig(wait_for="xpath://div[@class='loaded']")

# Wait with timeout
config = CrawlerRunConfig(
    wait_for="css:.content",
    page_timeout=30000  # 30 seconds
)

# Fixed delay
config = CrawlerRunConfig(delay_before_return_html=3.0)  # 3 seconds

Handling Infinite Scroll

config = CrawlerRunConfig(
    scan_full_page=True,     # Auto-scroll entire page
    scroll_delay=0.3,        # Delay between scroll steps

    # Or custom scroll logic
    js_code="""
        const scrollToBottom = async () => {
            let lastHeight = 0;
            while (true) {
                window.scrollTo(0, document.body.scrollHeight);
                await new Promise(r => setTimeout(r, 1000));

                const newHeight = document.body.scrollHeight;
                if (newHeight === lastHeight) break;
                lastHeight = newHeight;
            }
        };
        await scrollToBottom();
    """
)

Removing Overlays and Popups

config = CrawlerRunConfig(
    remove_overlay_elements=True,  # Auto-remove common overlays

    # Or explicit removal
    js_code="""
        // Remove cookie banners
        document.querySelectorAll('[class*="cookie"], [class*="consent"]')
            .forEach(el => el.remove());

        // Remove modals
        document.querySelectorAll('.modal, .popup, .overlay')
            .forEach(el => el.remove());

        // Remove fixed elements blocking content
        document.querySelectorAll('[style*="position: fixed"]')
            .forEach(el => el.remove());
    """
)

9. Proxy and Stealth Mode

Proxy Configuration

from crawl4ai import BrowserConfig

# Simple proxy
browser_config = BrowserConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080"
    }
)

# Authenticated proxy
browser_config = BrowserConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxyuser",
        "password": "proxypass"
    }
)

# SOCKS proxy
browser_config = BrowserConfig(
    proxy_config={
        "server": "socks5://proxy.example.com:1080"
    }
)

Rotating Proxies

import random

class ProxyRotator:
    def __init__(self, proxies: list):
        self.proxies = proxies
        self.index = 0

    def get_next(self) -> dict:
        proxy = self.proxies[self.index]
        self.index = (self.index + 1) % len(self.proxies)
        return {"server": proxy}

proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

rotator = ProxyRotator(proxies)

async def crawl_with_rotation(urls: list):
    results = []
    for url in urls:
        browser_config = BrowserConfig(
            proxy_config=rotator.get_next(),
            headless=True
        )
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url)
            results.append(result)
    return results

Stealth Mode

browser_config = BrowserConfig(
    # Enable stealth mode
    use_managed_browser=True,

    # Random user agent
    user_agent_mode="random",

    # Or specific user agent
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",

    # Additional stealth options
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-infobars",
        "--window-size=1920,1080"
    ],

    # Realistic viewport
    viewport_width=1920,
    viewport_height=1080
)

# Additional anti-detection JavaScript
config = CrawlerRunConfig(
    js_code="""
        // Override navigator properties
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

        // Add realistic mouse movements
        const event = new MouseEvent('mousemove', {
            clientX: Math.random() * window.innerWidth,
            clientY: Math.random() * window.innerHeight
        });
        document.dispatchEvent(event);
    """
)

10. Screenshots and PDF Generation

from crawl4ai import CrawlerRunConfig
import base64

# Screenshot
config = CrawlerRunConfig(
    screenshot=True,
    screenshot_wait_for="css:.content-loaded"
)

result = await crawler.arun(url, config=config)

if result.screenshot:
    # Save screenshot (base64 encoded)
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

# PDF Generation
config = CrawlerRunConfig(pdf=True)

result = await crawler.arun(url, config=config)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(base64.b64decode(result.pdf))

Full Page Screenshot

config = CrawlerRunConfig(
    screenshot=True,
    scan_full_page=True,  # Capture full page, not just viewport
    screenshot_wait_for="js:document.readyState === 'complete'"
)

11. MCP Server Integration

Crawl4AI provides an MCP (Model Context Protocol) server for integration with Claude Code and other AI assistants.

Installation

# Install with MCP server support
pip install "crawl4ai[mcp]"

# Or install standalone
pip install crawl4ai-mcp

Configuration in Claude Code

Add to your claude_desktop_config.json or MCP settings:

{
  "mcpServers": {
    "crawl4ai": {
      "command": "python",
      "args": ["-m", "crawl4ai.mcp_server"],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "CRAWL4AI_LLM_PROVIDER": "openai/gpt-4o-mini"
      }
    }
  }
}

Alternative: Using npx

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["-y", "crawl4ai-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "your-key"
      }
    }
  }
}

Available MCP Tools

Tool	Description
`crawl_single_page`	Crawl a single URL and return markdown
`smart_crawl_url`	Adaptive crawling with AI-guided link following
`perform_rag_query`	Query crawled content with semantic search
`get_available_sources`	List all crawled sources
`extract_structured_data`	Extract data using CSS/LLM strategies

Example MCP Usage

Human: Crawl the Crawl4AI documentation and tell me about extraction strategies

Claude: I'll crawl the documentation using the Crawl4AI MCP server.

[Uses smart_crawl_url tool with url="https://docs.crawl4ai.com" and query="extraction strategies"]

Based on the crawled documentation, Crawl4AI supports several extraction strategies:
1. CSS/XPath extraction for structured data...
2. LLM extraction for complex content...

12. RAG Pipeline Integration

Build knowledge bases from web content for retrieval-augmented generation.

Complete RAG Pipeline

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Step 1: Crawl Documentation
async def crawl_documentation(base_url: str, pages: list) -> list:
    """Crawl documentation pages and return clean markdown."""
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=50,
        excluded_tags=["nav", "footer", "aside"],
        remove_overlay_elements=True
    )

    documents = []
    async with AsyncWebCrawler() as crawler:
        urls = [f"{base_url}/{page}" for page in pages]
        results = await crawler.arun_many(urls, config=config, max_concurrent=5)

        for result in results:
            if result.success and result.markdown:
                documents.append({
                    "url": result.url,
                    "content": result.markdown,
                    "title": result.metadata.get("title", "Untitled")
                })

    return documents

# Step 2: Chunk Content
def chunk_documents(documents: list, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split documents into smaller chunks for embedding."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""]
    )

    chunks = []
    for doc in documents:
        doc_chunks = splitter.split_text(doc["content"])
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "content": chunk,
                "metadata": {
                    "source": doc["url"],
                    "title": doc["title"],
                    "chunk_index": i
                }
            })

    return chunks

# Step 3: Create Vector Store
def create_vector_store(chunks: list, persist_dir: str = "./chroma_db"):
    """Create embeddings and store in vector database."""
    embeddings = OpenAIEmbeddings()

    texts = [chunk["content"] for chunk in chunks]
    metadatas = [chunk["metadata"] for chunk in chunks]

    vectorstore = Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=persist_dir
    )

    return vectorstore

# Step 4: Build RAG Chain
def create_rag_chain(vectorstore):
    """Create RAG chain for question answering."""
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )

    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    template = """Answer the question based on the following context from the documentation.
    If the answer is not in the context, say "I don't have information about that in the documentation."

    Context:
    {context}

    Question: {question}

    Answer:"""

    prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n---\n\n".join([
            f"Source: {doc.metadata.get('source', 'Unknown')}\n{doc.page_content}"
            for doc in docs
        ])

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain

# Full Pipeline
async def build_docs_rag(base_url: str, pages: list):
    """Complete pipeline to build RAG from documentation."""
    print("Crawling documentation...")
    documents = await crawl_documentation(base_url, pages)
    print(f"Crawled {len(documents)} pages")

    print("Chunking content...")
    chunks = chunk_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("Creating vector store...")
    vectorstore = create_vector_store(chunks)

    print("Building RAG chain...")
    chain = create_rag_chain(vectorstore)

    return chain

# Usage
async def main():
    chain = await build_docs_rag(
        base_url="https://docs.crawl4ai.com",
        pages=["getting-started", "extraction", "configuration", "advanced"]
    )

    # Query the documentation
    answer = chain.invoke("How do I extract structured data from a webpage?")
    print(answer)

asyncio.run(main())

Incremental Updates

async def update_knowledge_base(vectorstore, new_urls: list):
    """Add new pages to existing knowledge base."""
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(new_urls, config=config)

    documents = [
        {"url": r.url, "content": r.markdown, "title": r.metadata.get("title", "")}
        for r in results if r.success
    ]

    chunks = chunk_documents(documents)

    # Add to existing vectorstore
    texts = [c["content"] for c in chunks]
    metadatas = [c["metadata"] for c in chunks]
    vectorstore.add_texts(texts=texts, metadatas=metadatas)

    return len(chunks)

13. Best Practices

Performance Optimization

# Use caching during development
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

# Limit concurrent requests
results = await crawler.arun_many(urls, max_concurrent=5)

# Use CSS extraction when possible (faster, cheaper than LLM)
css_strategy = JsonCssExtractionStrategy(schema)

# Only extract needed content
config = CrawlerRunConfig(
    css_selector=".main-content",  # Target specific content
    excluded_tags=["nav", "footer", "script", "style"]
)

Respecting Websites

# Check robots.txt
import httpx
from urllib.parse import urljoin

async def check_robots(base_url: str, path: str) -> bool:
    robots_url = urljoin(base_url, "/robots.txt")
    async with httpx.AsyncClient() as client:
        response = await client.get(robots_url)
        # Parse robots.txt and check if path is allowed
        # ... implementation
        return True

# Rate limiting
async def respectful_crawl(urls: list, delay: float = 1.0):
    config = CrawlerRunConfig(mean_delay=delay)
    async with AsyncWebCrawler() as crawler:
        results = []
        for url in urls:
            result = await crawler.arun(url, config=config)
            results.append(result)
            await asyncio.sleep(delay)
        return results

Error Handling

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import asyncio

async def robust_crawl(url: str, max_retries: int = 3):
    """Crawl with retry logic."""
    config = CrawlerRunConfig(
        page_timeout=30000,
        wait_for="css:body"
    )

    for attempt in range(max_retries):
        try:
            async with AsyncWebCrawler() as crawler:
                result = await crawler.arun(url, config=config)

                if result.success:
                    return result

                print(f"Attempt {attempt + 1} failed: {result.error_message}")

        except Exception as e:
            print(f"Attempt {attempt + 1} error: {e}")

        # Exponential backoff
        await asyncio.sleep(2 ** attempt)

    return None

Memory Management

# Process large URL lists in batches
async def batch_crawl(urls: list, batch_size: int = 50):
    all_results = []

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]

        async with AsyncWebCrawler() as crawler:
            results = await crawler.arun_many(batch, max_concurrent=10)
            all_results.extend(results)

        # Let garbage collection run
        await asyncio.sleep(0.1)

    return all_results

Token Efficiency

# Use CSS extraction first, LLM only when needed
async def efficient_extraction(url: str, css_schema: dict, llm_instruction: str):
    """Try CSS extraction first, fall back to LLM."""

    # Try CSS extraction
    css_strategy = JsonCssExtractionStrategy(css_schema)
    config = CrawlerRunConfig(extraction_strategy=css_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)

    if result.extracted_content and result.extracted_content != "[]":
        return {"method": "css", "data": result.extracted_content}

    # Fall back to LLM
    llm_strategy = LLMExtractionStrategy(
        llm_config=llm_config,
        instruction=llm_instruction
    )
    config = CrawlerRunConfig(extraction_strategy=llm_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)

    return {"method": "llm", "data": result.extracted_content}

14. Common Patterns

Documentation Crawler

async def crawl_documentation_site(base_url: str, max_pages: int = 100):
    """Crawl a documentation site following internal links."""
    visited = set()
    to_visit = [base_url]
    documents = []

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=100,
        excluded_tags=["nav", "footer", "aside", "header"]
    )

    async with AsyncWebCrawler() as crawler:
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)

            if url in visited:
                continue

            visited.add(url)
            result = await crawler.arun(url, config=config)

            if result.success:
                documents.append({
                    "url": url,
                    "markdown": result.markdown,
                    "title": result.metadata.get("title", "")
                })

                # Add internal links to queue
                for link in result.links.get("internal", []):
                    if link["href"] not in visited:
                        to_visit.append(link["href"])

    return documents

E-Commerce Product Scraper

schema = {
    "name": "Products",
    "baseSelector": ".product-item",
    "fields": [
        {"name": "name", "selector": ".product-name", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "original_price", "selector": ".original-price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
        {"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"},
        {"name": "reviews", "selector": ".review-count", "type": "text"},
        {"name": "url", "selector": "a.product-link", "type": "attribute", "attribute": "href"}
    ]
}

async def scrape_products(category_url: str):
    strategy = JsonCssExtractionStrategy(schema)
    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        scan_full_page=True,  # Load all products
        js_code="""
            // Click "Load More" until all products visible
            while (document.querySelector('.load-more:not([disabled])')) {
                document.querySelector('.load-more').click();
                await new Promise(r => setTimeout(r, 1000));
            }
        """
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(category_url, config=config)

    import json
    return json.loads(result.extracted_content)

News Aggregator

async def aggregate_news(sources: list):
    """Aggregate news from multiple sources."""

    # Different extraction schemas per source type
    article_schema = {
        "name": "Articles",
        "baseSelector": "article, .article, .post",
        "fields": [
            {"name": "headline", "selector": "h1, h2, h3", "type": "text"},
            {"name": "summary", "selector": "p, .excerpt, .summary", "type": "text"},
            {"name": "author", "selector": ".author, .byline", "type": "text"},
            {"name": "date", "selector": "time, .date", "type": "text"},
            {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    strategy = JsonCssExtractionStrategy(article_schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    all_articles = []
    async with AsyncWebCrawler() as crawler:
        for source in sources:
            result = await crawler.arun(source["url"], config=config)
            if result.success:
                import json
                articles = json.loads(result.extracted_content)
                for article in articles:
                    article["source"] = source["name"]
                all_articles.extend(articles)

    # Sort by date
    all_articles.sort(key=lambda x: x.get("date", ""), reverse=True)
    return all_articles

Research Assistant

async def research_topic(query: str, num_sources: int = 10):
    """Search and summarize information on a topic."""

    # This would integrate with a search API to get URLs
    # For demo, assume we have URLs
    search_results = await search_web(query, num_sources)
    urls = [r["url"] for r in search_results]

    config = CrawlerRunConfig(
        word_count_threshold=100,
        excluded_tags=["nav", "footer", "aside", "script"]
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config, max_concurrent=5)

    # Collect successful content
    sources = []
    for result in results:
        if result.success and len(result.markdown) > 200:
            sources.append({
                "url": result.url,
                "title": result.metadata.get("title", ""),
                "content": result.markdown[:5000]  # Limit content length
            })

    # Use LLM to synthesize
    from openai import OpenAI
    client = OpenAI()

    context = "\n\n---\n\n".join([
        f"Source: {s['title']}\nURL: {s['url']}\n\n{s['content']}"
        for s in sources
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Synthesize the following sources into a comprehensive summary. Cite sources."},
            {"role": "user", "content": f"Topic: {query}\n\nSources:\n{context}"}
        ]
    )

    return {
        "summary": response.choices[0].message.content,
        "sources": sources
    }

15. Troubleshooting

Browser Not Found

# Reinstall browser dependencies
crawl4ai-setup --force

# Check browser installation
crawl4ai-doctor --verbose

# Specify browser path manually
from crawl4ai import BrowserConfig
config = BrowserConfig(
    browser_type="chromium",
    extra_args=["--no-sandbox"]
)

JavaScript Not Executing

# Increase wait time
config = CrawlerRunConfig(
    delay_before_return_html=5.0,
    page_timeout=60000
)

# Wait for specific condition
config = CrawlerRunConfig(
    wait_for="js:window.dataLoaded === true"
)

# Check if JavaScript is enabled
browser_config = BrowserConfig(java_script_enabled=True)

Rate Limiting / 429 Errors

# Add delays between requests
config = CrawlerRunConfig(mean_delay=2.0)

# Use rotating proxies
proxies = ["proxy1:8080", "proxy2:8080"]
# Rotate through proxies for each request

# Add realistic headers
browser_config = BrowserConfig(
    user_agent_mode="random",
    headers={"Accept-Language": "en-US,en;q=0.9"}
)

Memory Issues with Large Crawls

# Process in smaller batches
for batch in chunked(urls, 50):
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(batch)
    process_results(results)

# Disable screenshots/PDFs
config = CrawlerRunConfig(
    screenshot=False,
    pdf=False
)

# Limit content size
config = CrawlerRunConfig(
    word_count_threshold=50  # Skip small content blocks
)

Content Not Loading (SPA/React Sites)

# Enable full page scan
config = CrawlerRunConfig(
    scan_full_page=True,
    scroll_delay=1.0
)

# Wait for React hydration
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('[data-reactroot]')"
)

# Use Magic Mode
config = CrawlerRunConfig(magic=True)

SSL/Certificate Errors

browser_config = BrowserConfig(
    extra_args=[
        "--ignore-certificate-errors",
        "--ignore-ssl-errors"
    ]
)

Blocked by Cloudflare/Bot Protection

# Use managed browser with stealth
browser_config = BrowserConfig(
    use_managed_browser=True,
    user_agent_mode="random"
)

# Add human-like delays
config = CrawlerRunConfig(
    delay_before_return_html=3.0,
    js_code="""
        // Simulate mouse movement
        document.dispatchEvent(new MouseEvent('mousemove', {
            clientX: 100 + Math.random() * 500,
            clientY: 100 + Math.random() * 300
        }));
    """
)

16. API Reference

Core Classes

Class	Description
`AsyncWebCrawler`	Main async crawler class
`WebCrawler`	Synchronous wrapper
`BrowserConfig`	Browser configuration
`CrawlerRunConfig`	Request configuration
`AdaptiveCrawler`	AI-guided adaptive crawling
`LLMConfig`	LLM provider configuration

Extraction Strategies

Strategy	Description
`JsonCssExtractionStrategy`	CSS/XPath-based extraction
`LLMExtractionStrategy`	LLM-powered extraction
`CosineStrategy`	Semantic similarity extraction

Cache Modes

Mode	Description
`CacheMode.ENABLED`	Read and write cache
`CacheMode.DISABLED`	No caching
`CacheMode.READ_ONLY`	Only read from cache
`CacheMode.WRITE_ONLY`	Only write to cache
`CacheMode.BYPASS`	Skip cache completely

Resources

Related skills

More from housegarofalo/claude-code-base

Installs

Repository

housegarofalo/c…ode-base

GitHub Stars

First Seen

Mar 15, 2026

crawl4ai

Crawl4AI Skill

Quick Reference

Core Capabilities

When to Use Crawl4AI vs Other Tools

1. Installation

Basic Installation

Installation Options

Docker Installation

Troubleshooting Installation

2. Basic Usage

Minimal Example

Synchronous Wrapper

Result Object Properties

3. Configuration Classes

BrowserConfig

CrawlerRunConfig

Cache Modes

4. Extraction Strategies

CSS/XPath Extraction (Token-Efficient)

Nested Extraction

XPath Extraction

LLM Extraction Strategy

LLM Providers

Block Extraction (No Schema)

Cosine Similarity Strategy

5. Adaptive Crawling

Adaptive Crawler for Knowledge Base

6. Multi-URL Crawling

Crawling with Rate Limiting

Sitemap Crawling

7. Session Management

Session with Cookies

8. Page Interaction

JavaScript Execution

Waiting Strategies

Handling Infinite Scroll

Removing Overlays and Popups

9. Proxy and Stealth Mode

Proxy Configuration

Rotating Proxies

Stealth Mode

10. Screenshots and PDF Generation

Full Page Screenshot

11. MCP Server Integration

Installation

Configuration in Claude Code

Alternative: Using npx

Available MCP Tools

Example MCP Usage

12. RAG Pipeline Integration

Complete RAG Pipeline

Incremental Updates

13. Best Practices

Performance Optimization

Respecting Websites

Error Handling

Memory Management

Token Efficiency

14. Common Patterns

Documentation Crawler

E-Commerce Product Scraper

News Aggregator

Research Assistant

15. Troubleshooting

Browser Not Found

JavaScript Not Executing

Rate Limiting / 429 Errors

Memory Issues with Large Crawls

Content Not Loading (SPA/React Sites)

SSL/Certificate Errors

Blocked by Cloudflare/Bot Protection

16. API Reference

Core Classes

Extraction Strategies

Cache Modes

Resources

More from housegarofalo/claude-code-base

mqtt-iot

devops-engineer-agent