web-crawl4ai
Crawl4AI Toolkit
Overview
This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including a complete SDK reference, ready-to-use scripts, error-handling patterns, and optimized workflows for efficient extraction.
Context
User needs programmatic web crawling capabilities. This skill is appropriate when:
- Scraping websites with JavaScript rendering requirements
- Extracting structured data using CSS selectors or LLM extraction
- Building automated web data pipelines
- Handling login-protected or dynamic content
Process
- Verify crawl4ai installation with crawl4ai-doctor
- Configure browser and crawler settings based on the target site
- Execute crawl using appropriate method (basic, batch, or advanced)
- Process results (markdown, extracted data, links)
- Handle errors with retry logic and exponential backoff
- Verification: Confirm extracted content meets quality requirements
Quick Start
Installation Check
# Verify installation
crawl4ai-doctor
# If issues, run setup
crawl4ai-setup
Basic First Crawl
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:500]) # First 500 chars
asyncio.run(main())
Using Provided Scripts
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com
# Batch processing
python scripts/batch_crawler.py urls.txt
# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
Core Crawling Fundamentals
1. Basic Crawling
Understand the core components used in any crawl:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
headless=True, # Run without GUI
viewport_width=1920,
viewport_height=1080,
user_agent="custom-agent" # Optional custom user agent
)
# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
page_timeout=30000, # 30 seconds timeout
screenshot=True, # Take screenshot
remove_overlay_elements=True # Remove popups/overlays
)
# Execute crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=crawler_config
)
# CrawlResult contains everything
print(f"Success: {result.success}")
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")
2. Configuration Deep Dive
BrowserConfig - Controls browser instance:
- headless: Run with or without a GUI
- viewport_width / viewport_height: Browser dimensions
- user_agent: Custom user agent string
- cookies: Pre-set cookies
- headers: Custom HTTP headers
CrawlerRunConfig - Controls each crawl:
- page_timeout: Maximum page load / JS execution time (ms)
- wait_for: CSS selector or JS condition to wait for (optional)
- cache_mode: Control caching behavior
- js_code: Execute custom JavaScript
- screenshot: Capture a page screenshot
- session_id: Persist the session across crawls

Both configs are combined in the sketch below.
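A minimal sketch combining these options; the cookie and header values are placeholders, and the cookie dict format (name/value/domain/path, Playwright-style) is an assumption here:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(
    headless=True,
    viewport_width=1366,
    viewport_height=768,
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    cookies=[{"name": "session", "value": "abc123", "domain": "example.com", "path": "/"}],  # assumed format
    headers={"Accept-Language": "en-US,en;q=0.9"}
)
crawler_config = CrawlerRunConfig(
    page_timeout=45000,              # 45 s budget for page load + JS
    wait_for="css:#content",         # wait until this selector appears
    cache_mode=CacheMode.ENABLED,    # reuse cached responses during development
    session_id="docs_session"        # keep the same browser session across arun() calls
)

async def configured_crawl(url):
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url, config=crawler_config)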
3. Content Processing
Basic content operations available in every crawl:
result = await crawler.arun(url)
# Access extracted content
markdown = result.markdown # Clean markdown
html = result.html # Raw HTML
text = result.cleaned_html # Cleaned HTML
# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]
# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
JavaScript-Heavy Content Handling
For sites that rely on JavaScript for content rendering:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
# Configure for JavaScript-heavy sites
browser_config = BrowserConfig(
headless=True,
viewport_width=1920,
viewport_height=1080,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
crawler_config = CrawlerRunConfig(
page_timeout=60000, # 60 seconds for JS loading
wait_for="css:.main-content", # Wait for specific element
js_code="""
// Scroll to trigger lazy loading
window.scrollTo(0, document.body.scrollHeight);
// Wait for any async content
await new Promise(resolve => setTimeout(resolve, 2000));
""",
remove_overlay_elements=True
)
async def crawl_dynamic_content(url):
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url, config=crawler_config)
return result.markdown
Content Filtering
Focus on relevant content while ignoring navigation and footers:
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Option 1: Relevance-based filtering
bm25_filter = BM25ContentFilter(
user_query="product features documentation",
bm25_threshold=1.0
)
# Option 2: Quality-based filtering
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
async def extract_focused_content(url, query):
# Update filter with user query
bm25_filter.user_query = query
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url, config=config)
return {
"raw_content": str(result.markdown.raw_markdown),
"focused_content": str(result.markdown.fit_markdown),
"metadata": result.metadata
}
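A quick usage sketch for the helper above, with a placeholder URL and query:
import asyncio

data = asyncio.run(extract_focused_content("https://example.com/docs", "pricing and plans"))
print(data["focused_content"][:500])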
Markdown Generation
Basic Markdown Extraction
Crawl4AI excels at generating clean, well-formatted markdown:
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.example.com")
# High-quality markdown ready for LLMs
with open("documentation.md", "w") as f:
f.write(result.markdown)
Fit Markdown (Content Filtering)
Use content filters to get only relevant content:
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")
# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown) # Filtered markdown
print(result.markdown.raw_markdown) # Original markdown
Markdown Customization
Control markdown generation with options:
config = CrawlerRunConfig(
# Exclude elements from markdown
excluded_tags=["nav", "footer", "aside"],
# Focus on specific CSS selector
css_selector=".main-content",
# Clean up formatting
remove_forms=True,
remove_overlay_elements=True,
# Control link handling
exclude_external_links=True,
exclude_internal_links=False
)
# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
generator = DefaultMarkdownGenerator(
options={
"ignore_links": False,
"ignore_images": False,
"image_alt_text": True
}
)
Data Extraction
Schema-Based Extraction (Most Efficient)
For repetitive patterns, generate schema once and reuse:
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
Manual CSS/JSON Extraction
When you know the structure:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
"name": "articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"},
{"name": "content", "selector": ".content", "type": "text"}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
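Extracted records come back as a JSON string on result.extracted_content; a minimal usage sketch with a placeholder URL:
import json

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://blog.example.com", config=config)
    if result.success and result.extracted_content:
        articles = json.loads(result.extracted_content)
        for article in articles:
            print(article["title"], "-", article["date"])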
LLM-Based Extraction
For complex or irregular content:
from crawl4ai.extraction_strategy import LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
instruction="Extract key financial metrics and quarterly trends"
)
Advanced Patterns
Deep Crawling
Discover and crawl links from a page:
# Basic link discovery
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url)
# Extract and process discovered links
internal_links = result.links.get("internal", [])
external_links = result.links.get("external", [])
# Crawl discovered internal links
for link in internal_links:
    href = link["href"] if isinstance(link, dict) else link  # link entries are typically dicts with href/text
    if "/blog/" in href and "/tag/" not in href:  # Filter links
        sub_result = await crawler.arun(href)
        # Process sub-page
# For advanced deep crawling, consider using URL seeding patterns
# or custom crawl strategies (see complete-sdk-reference.md)
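For example, recent releases ship breadth-first deep crawling; a hedged sketch, assuming the deep_crawling module is available in your installed version (verify the exact parameters in references/complete-sdk-reference.md):
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy  # assumed import path

deep_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,             # start page plus two levels of links
        include_external=False   # stay on the same domain
    )
)

async def deep_crawl(start_url):
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(start_url, config=deep_config)  # one result per crawled page
        return {r.url: str(r.markdown) for r in results if r.success}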
Batch & Multi-URL Processing
Efficiently crawl multiple URLs:
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
async with AsyncWebCrawler() as crawler:
# Concurrent crawling with arun_many()
results = await crawler.arun_many(
urls=urls,
config=crawler_config,
max_concurrent=5 # Control concurrency
)
for result in results:
if result.success:
print(f"✅ {result.url}: {len(result.markdown)} chars")
Session & Authentication
Handle login-required content:
# First crawl - establish session and login
login_config = CrawlerRunConfig(
session_id="user_session",
js_code="""
document.querySelector('#username').value = 'myuser';
document.querySelector('#password').value = 'mypass';
document.querySelector('#submit').click();
""",
wait_for="css:.dashboard" # Wait for post-login element
)
await crawler.arun("https://site.com/login", config=login_config)
# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
Dynamic Content Handling
For JavaScript-heavy sites:
config = CrawlerRunConfig(
# Wait for dynamic content
wait_for="css:.ajax-content",
# Execute JavaScript
js_code="""
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
// Click load more button
document.querySelector('.load-more')?.click();
""",
# Note: For virtual scrolling (Twitter/Instagram-style),
# use virtual_scroll_config parameter (see docs)
# Extended timeout for slow loading
page_timeout=60000
)
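A hedged sketch of the virtual-scroll configuration; VirtualScrollConfig and its parameter names are assumptions here, so check references/complete-sdk-reference.md for the exact API in your installed version:
from crawl4ai import CrawlerRunConfig, VirtualScrollConfig  # availability depends on version

virtual_config = CrawlerRunConfig(
    virtual_scroll_config=VirtualScrollConfig(
        container_selector="#timeline",   # element whose children are virtualized
        scroll_count=20,                  # number of scroll steps
        scroll_by="container_height",     # scroll one container height per step
        wait_after_scroll=0.5             # seconds to wait for new items to render
    ),
    page_timeout=60000
)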
Anti-Detection & Proxies
Avoid bot detection:
# Proxy configuration
browser_config = BrowserConfig(
headless=True,
proxy_config={
"server": "http://proxy.server:8080",
"username": "user",
"password": "pass"
}
)
# For stealth/undetected browsing, consider:
# - Rotating user agents via user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests
# Rate limiting
import asyncio
for url in urls:
result = await crawler.arun(url)
await asyncio.sleep(2) # Delay between requests
Robust Crawling Template
import asyncio
import random
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def robust_crawl(url, query=None, max_retries=3):
"""Robust crawling with retries and error handling"""
for attempt in range(max_retries):
try:
# Randomize configuration for each attempt
browser_config = BrowserConfig(
headless=True,
viewport_width=random.choice([1920, 1366, 1440]),
viewport_height=random.choice([1080, 768, 900]),
user_agent=random.choice([
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
])
)
# Add content filtering if query provided
config = CrawlerRunConfig(
page_timeout=45000,
remove_overlay_elements=True
)
if query:
bm25_filter = BM25ContentFilter(user_query=query, bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config.markdown_generator = md_generator
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url, config=config)
if result.success and len(str(result.markdown)) > 100:
return {
"content": str(result.markdown),
"metadata": result.metadata,
"links": result.links,
"attempt": attempt + 1
}
else:
print(f"Attempt {attempt + 1}: Insufficient content extracted")
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
# Exponential backoff
if attempt < max_retries - 1:
await asyncio.sleep(2 ** attempt)
raise Exception(f"Failed to extract content after {max_retries} attempts")
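Example invocation of the template, with a placeholder URL and query:
if __name__ == "__main__":
    data = asyncio.run(robust_crawl("https://example.com/blog", query="release notes"))
    print(f"Extracted {len(data['content'])} chars on attempt {data['attempt']}")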
Common Use Cases
Documentation to Markdown
# Convert entire documentation site to clean markdown
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.example.com")
# Save as markdown for LLM consumption
with open("docs.md", "w") as f:
f.write(result.markdown)
E-commerce Product Monitoring
# Generate schema once for product pages
# Then monitor prices/availability without LLM costs
import json

schema = json.load(open("product_schema.json"))
products = await crawler.arun_many(
    product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
)
News Aggregation
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)
# Extract articles with Fit Markdown
for result in results:
if result.success:
# Get only relevant content
article = result.markdown.fit_markdown
Research & Data Collection
# Academic paper collection with focused extraction
# Reuse the BM25 content-filter pattern shown earlier to focus on a topic
bm25_filter = BM25ContentFilter(
    user_query="machine learning transformers",
    bm25_threshold=1.0
)
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(content_filter=bm25_filter)
)
# Focused output is then available as result.markdown.fit_markdown
Performance Optimization
Concurrent Crawling
For multiple URLs:
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
async def crawl_multiple(urls, max_concurrent=3):
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
config=CrawlerRunConfig(page_timeout=30000),
max_concurrent=max_concurrent
)
return [
{"url": r.url, "content": str(r.markdown), "success": r.success}
for r in results if r.success
]
results = asyncio.run(crawl_multiple(urls))
Caching Strategy
Enable caching during development:
from crawl4ai import CacheMode
config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED, # Cache successful requests
excluded_tags=["script", "style"] # Exclude unnecessary elements
)
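For production runs where fresh content matters more than speed, switch the cache mode; CacheMode.BYPASS skips the local cache for that crawl:
# Fetch fresh content, skipping the local cache
prod_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)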
Error Handling & Troubleshooting
Common Issues and Solutions
1. Timeout Errors
# Increase timeout for slow sites
config = CrawlerRunConfig(
page_timeout=90000, # 90 seconds
wait_for="js:document.readyState === 'complete'"
)
2. Bot Detection
# Rotate user agents and add delays
import random
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
]
browser_config = BrowserConfig(
headless=True,
user_agent=random.choice(user_agents),
viewport_width=random.choice([1920, 1366, 1440]),  # must be an int, not a string
viewport_height=random.choice([1080, 768, 900])
)
# Add delay between requests
await asyncio.sleep(random.uniform(1, 3))
3. Content Not Loading
# Wait for specific content
config = CrawlerRunConfig(
# wait_for accepts a single condition: a css: selector or a js: expression
wait_for="css:.article-content",  # Wait for main content
js_code="""
// Trigger any lazy loading
window.dispatchEvent(new Event('load'));
"""
)
JavaScript not loading:
config = CrawlerRunConfig(
wait_for="css:.dynamic-content", # Wait for specific element
page_timeout=60000 # Increase timeout
)
Bot detection issues:
browser_config = BrowserConfig(
headless=False, # Sometimes visible browsing helps
viewport_width=1920,
viewport_height=1080,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
# Add delays between requests
await asyncio.sleep(random.uniform(2, 5))
Content extraction problems:
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")
# Try different wait strategies
config = CrawlerRunConfig(
wait_for="js:document.querySelector('.content') !== null"
)
Session/auth issues:
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")
Guidelines
- Always check result.success before processing content (see the sketch after this list)
- Start with basic crawling - understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
- Use markdown generation for documentation and content - Crawl4AI excels at clean markdown extraction
- Try schema generation first for structured data - 10-100x more efficient than LLM extraction
- Use appropriate timeouts - 30s for normal sites, 60s+ for JavaScript-heavy sites
- Enable caching during development with cache_mode=CacheMode.ENABLED to avoid repeated requests
- Implement retry logic with exponential backoff
- Add delays between requests to respect rate limits
- Filter content aggressively to focus on relevant information
- Reuse sessions for authenticated content instead of logging in again
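A minimal success-check sketch; error_message is populated when a crawl fails, and process() is a placeholder for your own handling:
result = await crawler.arun(url, config=config)
if not result.success:
    print(f"Crawl failed for {result.url}: {result.error_message}")
else:
    process(result.markdown)  # placeholder handler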
Resources
scripts/
- extraction_pipeline.py - Three extraction approaches with schema generation
- basic_crawler.py - Simple markdown extraction with screenshots
- batch_crawler.py - Multi-URL concurrent processing
examples/
- basic-scraping.py - Core scraping patterns
- structured-data-extraction.py - JSON extraction with CSS selectors
references/
- complete-sdk-reference.md - Complete SDK documentation (23K words) with all parameters, methods, and advanced features
tests/
- test_basic_crawling.py - Basic crawling patterns
- test_advanced_patterns.py - Advanced crawling scenarios
- test_data_extraction.py - Data extraction strategies
- test_markdown_generation.py - Markdown generation tests
For more details on any topic, refer to references/complete-sdk-reference.md which contains comprehensive documentation of all features, parameters, and advanced usage patterns.