fetch4ai Skill

Fetch web content using crawl4ai with customizable filtering strategies. Produces clean, LLM-ready markdown with noise removed.

Can be used as:

Standalone CLI tool - Simple command-line web fetching with clean output
web-research backend - Fetching layer for research workflows

Prerequisites

Ensure crawl4ai is installed:

pip install -U crawl4ai
crawl4ai-setup  # First-time setup for Playwright

Standalone Quick Use

For simple fetching when you just want clean markdown:

# Simplest: fetch URL, get markdown output
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --format markdown

# With timeout control (default: 30s)
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://slow-site.com/page" \
  --format md \
  --timeout 60

# Save directly to file
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" \
  --format markdown \
  -o content.md

Quiet Mode

Suppress crawl4ai status messages for clean piping:

# Clean output for piping to other tools
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" \
  --format md \
  --quiet

# Short form
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" -q --format md

Shell Alias (Optional)

Add to your ~/.zshrc or ~/.bashrc:

alias fetch4ai='python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py'

# Then use simply:
# fetch4ai --url "https://example.com" --format md -q

Quick Start

Basic Fetch (Pruning Filter - Default)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy pruning

Query-Focused Fetch (BM25)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy bm25 \
  --query "machine learning applications"

Clean Article Extraction (Tag Exclusion)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy tags \
  --excluded-tags "nav,footer,aside,header"

Filtering Strategies

Strategy 1: Pruning (Default)

Automatically removes low-quality content by scoring text density, link density, and tag importance.

When to use:

General content extraction from any webpage
Articles, blog posts, documentation
Cases without a specific search query

Parameters:

--threshold (0.0-1.0, default 0.48): Higher = stricter filtering
--min-words (default 5): Minimum words per content block

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://en.wikipedia.org/wiki/Artificial_intelligence" \
  --strategy pruning \
  --threshold 0.5
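
To make the scoring idea concrete, here is a toy sketch of threshold-based pruning. It is not crawl4ai's implementation: the weights, the `TAG_WEIGHTS` table, and the block dictionaries are all assumptions made for illustration. Only the two parameters, `threshold` and `min_words`, mirror the flags above.

```python
# Illustrative only: the weights and tag scores are hypothetical,
# not the values crawl4ai uses internally.
TAG_WEIGHTS = {"article": 1.0, "main": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1, "footer": 0.1}


def block_score(text: str, link_text: str, tag: str, html_length: int) -> float:
    """Combine text density, link density, and tag importance into one score in [0, 1]."""
    text_len = len(text)
    # Share of the block's HTML that is visible text.
    text_density = min(text_len / html_length, 1.0) if html_length else 0.0
    # Share of the visible text that sits inside links.
    link_density = min(len(link_text) / text_len, 1.0) if text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(tag, 0.5)
    return 0.4 * text_density + 0.3 * (1.0 - link_density) + 0.3 * tag_weight


def prune(blocks: list, threshold: float = 0.48, min_words: int = 5) -> list:
    """Keep the text of blocks that are long enough and score at or above the threshold."""
    kept = []
    for block in blocks:
        if len(block["text"].split()) < min_words:
            continue
        score = block_score(block["text"], block["link_text"], block["tag"], block["html_length"])
        if score >= threshold:
            kept.append(block["text"])
    return kept
```

In this sketch, raising `threshold` drops more blocks, which matches the "higher = stricter" behavior described for `--threshold`.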

Strategy 2: BM25 (Query-Relevant)

Uses the BM25 ranking algorithm to keep only the content relevant to your search query.

When to use:

Focused research on specific topics
Extracting relevant sections from long pages
Targeted extraction with known search terms

Parameters:

--query (required): Search terms for relevance scoring
--bm25-threshold (default 1.2): Minimum relevance score

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://docs.python.org/3/tutorial/" \
  --strategy bm25 \
  --query "list comprehension syntax"
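
For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of Okapi BM25 scoring over text chunks, followed by a threshold filter. It assumes whitespace tokenization and the conventional `k1=1.5`, `b=0.75` constants. How the script and crawl4ai tokenize, chunk, and tune these values may differ.

```python
import math
from collections import Counter


def bm25_scores(chunks: list, query: str, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each chunk against the query with the standard Okapi BM25 formula."""
    docs = [chunk.lower().split() for chunk in chunks]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    terms = query.lower().split()
    # Document frequency: in how many chunks does each query term appear?
    doc_freq = {term: sum(1 for doc in docs if term in doc) for term in terms}

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def filter_chunks(chunks: list, query: str, threshold: float = 1.2) -> list:
    """Keep chunks whose BM25 score reaches the threshold."""
    scores = bm25_scores(chunks, query)
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]
```

Chunks that contain none of the query terms score zero, so any positive threshold removes them.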

Strategy 3: Tag Exclusion

Removes specific HTML elements and filters by word count.

When to use:

Clean article extraction
Removing navigation, footers, sidebars
Pages with predictable noise elements

Parameters:

--excluded-tags (comma-separated): Tags to remove
--word-count-threshold (default 10): Minimum words per block

Common tag presets:

Article: nav,footer,header,aside
Minimal: nav,footer
Aggressive: nav,footer,header,aside,advertisement,script,style

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/blog/post" \
  --strategy tags \
  --excluded-tags "nav,footer,aside,header,advertisement" \
  --word-count-threshold 15
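
The two steps, dropping excluded elements and then filtering blocks by word count, can be sketched with the standard library alone. This is an illustration of the idea, not the script's code. The `TagExcluder` class and `extract_blocks` function are hypothetical names, and treating the threshold as an inclusive minimum is an assumption.

```python
from html.parser import HTMLParser


class TagExcluder(HTMLParser):
    """Collect text blocks, skipping everything nested inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside at least one excluded element
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.blocks.append(text)


def extract_blocks(html: str, excluded_tags: str, word_count_threshold: int = 10) -> list:
    """Parse a comma-separated tag list, drop those elements, keep long-enough blocks."""
    tags = [tag.strip().lower() for tag in excluded_tags.split(",") if tag.strip()]
    parser = TagExcluder(tags)
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if len(block.split()) >= word_count_threshold]
```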

Strategy 4: Composite (Multi-Pass)

Combines two strategies for high-precision extraction: pruning runs first to remove noise, then BM25 keeps only the blocks relevant to the query.

When to use:

Research requiring both noise removal and relevance filtering
Long pages with scattered relevant content
Maximum precision extraction

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/research-paper" \
  --strategy composite \
  --threshold 0.4 \
  --query "experimental results methodology"

Output Format

With --format json (the default), the script returns a JSON object like this:

{
  "success": true,
  "url": "https://example.com/article",
  "title": "Page Title",
  "content": "# Clean markdown content...",
  "links": [
    {"text": "Link Text", "href": "https://..."}
  ],
  "stats": {
    "raw_length": 45000,
    "fit_length": 12000,
    "reduction_percent": 73.3
  },
  "strategy": "pruning",
  "metadata": {
    "fetch_time": "2025-01-04T10:30:00",
    "word_count": 2500
  }
}
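
A small sketch of consuming this output from Python. The field names come from the sample above. The `summarize` helper is hypothetical, and the way `reduction_percent` is recomputed here is an assumption, chosen because it reproduces the sample value (45000 to 12000 characters gives 73.3).

```python
import json


def summarize(raw_json: str) -> str:
    """Build a one-line summary from the script's JSON output."""
    result = json.loads(raw_json)
    stats = result["stats"]
    raw_length = stats["raw_length"]
    fit_length = stats["fit_length"]
    # Assumed derivation: share of the raw markdown removed by filtering.
    reduction = round((1 - fit_length / raw_length) * 100, 1) if raw_length else 0.0
    return (
        f'{result["title"]}: {raw_length} -> {fit_length} chars '
        f'({reduction}% removed), links: {len(result["links"])}'
    )
```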

Advanced Options

Output Format

# JSON with full metadata (default)
--format json

# Plain markdown content only (great for piping)
--format markdown
--format md

Timeout Control

# Default is 30 seconds
--timeout 60  # 60 seconds for slow pages

Include/Exclude Links and Images

# Include links (default: true)
--include-links

# Include image references
--include-images

# Exclude external links (keep only same-domain)
--exclude-external-links

Session Management (Multi-Page)

For crawling multiple pages with shared browser state:

# First page
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/page1" \
  --session-id "my_session"

# Subsequent pages (share cookies and browser state)
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/page2" \
  --session-id "my_session"

Output to File

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" \
  --output result.json

Integration with web-research Skill

fetch4ai serves as the fetching layer for the web-research skill:

web-research spawns research subagents
Subagents use fetch4ai to get clean content
Content is saved to findings files
web-research synthesizes all findings

Usage in research workflow:

# In research subagent prompt:
Use fetch4ai to get content from [URL] with BM25 filtering for "[query]".
Save the filtered markdown (the "content" field of the output) to findings_[topic].md.
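
One way a subagent might persist a result is sketched below. It assumes the filtered markdown is the `content` field of the JSON output and follows the `findings_[topic].md` naming from the prompt. The `save_findings` helper, the topic slug rule, and the `Source:` header line are all hypothetical.

```python
import json
from pathlib import Path


def save_findings(raw_json: str, topic: str, directory: str = ".") -> Path:
    """Append the fetched markdown to findings_<topic>.md and return the file path."""
    result = json.loads(raw_json)
    if not result.get("success"):
        raise ValueError(result.get("error", "fetch failed"))
    # Hypothetical slug rule: lowercase words joined by underscores.
    slug = "_".join(topic.lower().split())
    path = Path(directory) / f"findings_{slug}.md"
    # Append rather than overwrite so several URLs can feed one topic file.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"Source: {result['url']}\n\n")
        handle.write(result["content"].rstrip() + "\n\n")
    return path
```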

Error Handling

The script handles common errors:

Network timeouts (30s default)
Invalid URLs
JavaScript-heavy pages (rendered with Playwright before extraction)
Empty content after filtering

Errors return:

{
  "success": false,
  "url": "https://...",
  "error": "Error description",
  "error_type": "timeout|network|parsing|empty_content"
}
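
Callers can branch on `success` and `error_type`. The sketch below reads the `error_type` value as one of the four listed names. The retry policy itself (doubling the timeout, suggesting looser filtering) is a hypothetical choice, not something the script does for you.

```python
def next_action(result: dict, timeout: int = 30) -> dict:
    """Decide what to do with a parsed fetch4ai result (hypothetical policy)."""
    if result.get("success"):
        return {"action": "use", "content": result["content"]}

    error_type = result.get("error_type")
    if error_type == "timeout":
        # Try again with a longer --timeout value.
        return {"action": "retry", "timeout": timeout * 2}
    if error_type == "empty_content":
        # Filtering removed everything; a looser setting may help.
        return {"action": "retry", "hint": "loosen filtering, e.g. a lower --threshold"}
    # network, parsing, or anything unrecognized: give up and report.
    return {"action": "abort", "reason": result.get("error", "unknown error")}
```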

Strategy Selection Guide

Scenario	Strategy	Key Parameters
General article	`pruning`	`--threshold 0.48`
Specific topic search	`bm25`	`--query "your terms"`
Blog/news extraction	`tags`	`--excluded-tags "nav,footer,aside"`
Research paper sections	`composite`	`--threshold 0.4 --query "..."`
Documentation pages	`pruning`	`--threshold 0.3` (lower for docs)
Product listings	`tags`	`--word-count-threshold 20`
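
The table can also be captured as presets when driving the script from Python. This sketch only builds the argument list and does not run anything. The scenario keys and the `build_command` helper are hypothetical names, while the flags and values are copied from the table.

```python
import os
import sys

SCRIPT = os.path.expanduser("~/.claude/skills/fetch4ai/scripts/fetch4ai.py")

# Flags per scenario, taken from the selection table; keys are hypothetical.
PRESETS = {
    "general_article": ["--strategy", "pruning", "--threshold", "0.48"],
    "topic_search": ["--strategy", "bm25"],
    "blog_news": ["--strategy", "tags", "--excluded-tags", "nav,footer,aside"],
    "research_paper": ["--strategy", "composite", "--threshold", "0.4"],
    "documentation": ["--strategy", "pruning", "--threshold", "0.3"],
    "product_listings": ["--strategy", "tags", "--word-count-threshold", "20"],
}
# Scenarios whose strategy takes a --query argument.
NEEDS_QUERY = {"topic_search", "research_paper"}


def build_command(url: str, scenario: str, query: str = "") -> list:
    """Return an argv list suitable for subprocess.run()."""
    if scenario not in PRESETS:
        raise KeyError(f"unknown scenario: {scenario}")
    if scenario in NEEDS_QUERY and not query:
        raise ValueError(f"scenario {scenario!r} needs a query")
    command = [sys.executable, SCRIPT, "--url", url, *PRESETS[scenario]]
    if scenario in NEEDS_QUERY:
        command += ["--query", query]
    return command
```

Passing the result to `subprocess.run(command, capture_output=True, text=True)` would execute the fetch, with the JSON output available on `stdout`.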

Reference Documentation

For detailed strategy comparisons and advanced patterns:

See references/filtering-strategies.md