Firecrawl & Jina Web Scraping

Firecrawl vs WebFetch

Prefer firecrawl scrape URL --only-main-content over the WebFetch tool—it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.

# Preferred approach:
firecrawl scrape https://docs.example.com/api --only-main-content

Token-Efficient Scraping

Inspired by Anthropic's dynamic filtering: always filter before reasoning. In Anthropic's benchmarks, this approach reduced input tokens by ~24% and improved accuracy by ~11%.

The Principle: Search → Filter → Scrape → Filter → Reason

DO:

Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

DON'T:

Search → Scrape everything → Reason over all of it

Step-by-Step Efficient Workflow

# Step 1: Search — get titles/URLs only (cheap)
firecrawl search "query" --limit 20

# Step 2: Evaluate results, pick 3-5 best URLs

# Step 3: Scrape only those, filter to relevant sections
firecrawl scrape URL1 --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
  --sections "API,Authentication" --max-chars 5000

Post-Processing with filter_web_results.py

Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:

# Extract only matching sections from scraped page
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats

Full path: python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py
Flags: --sections, --keywords, --max-chars, --max-lines, --fields (JSON), --strip-links, --strip-images, --compact, --stats

Other Token-Saving Patterns

  • Use --only-main-content to strip navigation and footer boilerplate, reducing token consumption. Omit only when nav/footer content is specifically needed.
  • Use --only-clean-content (Python API script) for aggressive cleaning—strips nav, ads, and cookie banners. Stronger than --only-main-content; use when the page is still noisy after main-content filtering.
  • Use firecrawl map URL --search "topic" first to find relevant subpages before scraping
  • Use --format links first to get URL list, evaluate, then scrape selectively
  • Use --max-chars with exa_contents.py to cap extraction length
  • Use --formats summary (Python API script) over full text when you need the gist, not raw content
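
For example, a map-then-selective-scrape pass that combines these patterns might look like the sketch below (the URL and section name are illustrative):

# Discover candidate subpages first (cheap), then scrape only the best matches
firecrawl map https://docs.example.com --search "authentication"

# Scrape the 2-3 most relevant URLs and trim them to the sections you need
firecrawl scrape https://docs.example.com/auth --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
  --sections "Authentication" --max-chars 4000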

Claude API Native Tools (for API Agent Builders)

Anthropic's API now offers built-in dynamic filtering tools:

web_search_20260209 / web_fetch_20260209
Header: anthropic-beta: code-execution-web-tools-2026-02-09

These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need autonomous agents, batch scraping, structured extraction, or domain-specific crawling, or when you're not building on the Claude API.
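
As a rough sketch of wiring these in, the request below assumes the new tools use the same server-tool block shape as earlier web_search versions; the tool name and beta header come from the note above, and the model name is illustrative. Verify the exact schema against Anthropic's web tools documentation.

# Hedged sketch: tool block shape assumed from earlier web_search server tools
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: code-execution-web-tools-2026-02-09" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [{"type": "web_search_20260209", "name": "web_search"}],
    "messages": [{"role": "user", "content": "Summarize the latest Firecrawl release notes"}]
  }'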


Available Tools

1. Official Firecrawl CLI (firecrawl) — Primary

Setup: npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY

| Command | Purpose | Quick Example |
| --- | --- | --- |
| scrape | Single page → markdown | firecrawl scrape URL --only-main-content |
| crawl | Entire site with progress | firecrawl crawl URL --wait --progress --limit 50 |
| map | Discover all URLs on a site | firecrawl map URL --search "API" |
| search | Web search (+ optional scrape) | firecrawl search "query" --limit 10 |

Full CLI reference: references/cli-reference.md

2. Auto-Save Alias (fc-save) — Shell Alias

Requires shell alias setup (not bundled with this skill).

fc-save URL
# → Saves to ~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md

3. Python API Script (firecrawl_api.py) — Advanced Features

Command: python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>
Requires: FIRECRAWL_API_KEY env var, pip install firecrawl-py requests

| Command | Purpose | Quick Example |
| --- | --- | --- |
| search | Web search with scraping | firecrawl_api.py search "query" -n 10 |
| scrape | Single URL with page actions | firecrawl_api.py scrape URL --formats markdown summary |
| batch-scrape | Multiple URLs concurrently | firecrawl_api.py batch-scrape URL1 URL2 URL3 |
| crawl | Website crawling | firecrawl_api.py crawl URL --limit 20 |
| map | URL discovery | firecrawl_api.py map URL --search "query" |
| parse | Parse local documents (PDF, DOCX, XLSX) | firecrawl_api.py parse report.pdf |
| extract | LLM-powered structured extraction | firecrawl_api.py extract URL --prompt "Find pricing" |
| agent | Autonomous extraction (no URLs needed) | firecrawl_api.py agent "Find YC W24 AI startups" |
| parallel-agent | Bulk agent queries (v2.8.0+) | firecrawl_api.py parallel-agent "Q1" "Q2" "Q3" |
| interact | Post-scrape browser interaction | firecrawl_api.py interact SCRAPE_ID --prompt "Click pricing" |
| interact-stop | Stop an interact session | firecrawl_api.py interact-stop SCRAPE_ID |

Agent models: spark-1-fast (10 credits, simple), spark-1-mini (default), spark-1-pro (thorough)
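
To pick a model explicitly, something like the sketch below should work if the script exposes a model selector; the --model flag name is an assumption here, so confirm the real parameter in references/firecrawl-agent-api.md.

# Hypothetical --model flag; verify the actual parameter name before use
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "List open-source vector databases with license and pricing" \
  --model spark-1-pro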

Full Python API reference: references/python-api-reference.md

4. DeepWiki — GitHub Repo Documentation

~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]

AI-generated wiki for any public GitHub repo. No API key required.

# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save

5. Jina Reader (jina) — Fallback

Use when Firecrawl fails or for Twitter/X URLs (Firecrawl is blocked on Twitter/X; Jina gets through).

jina https://x.com/username/status/123456

Firecrawl vs Exa vs Native Claude Tools

| Need | Best Tool | Why |
| --- | --- | --- |
| Single page → markdown | firecrawl scrape --only-main-content | Cleanest output |
| Search + scrape in one shot | firecrawl search --scrape | Combined operation |
| Crawl entire site | firecrawl crawl --wait --progress | Link following + progress |
| Local file → markdown | firecrawl_api.py parse FILE | Direct upload, no URL needed |
| Autonomous data finding | firecrawl_api.py agent | No URLs needed |
| Semantic/neural search | Exa exa_search.py | AI-powered relevance |
| Find research papers | Exa --category "research paper" | Academic index |
| Quick research answer | Exa exa_research.py | Citations + synthesis |
| Find similar pages | Exa exa_similar.py | Competitive analysis |
| Claude API agent building | Native web_search_20260209 | Built-in dynamic filtering |
| Twitter/X content | jina URL | Only tool that works |
| GitHub repo docs | deepwiki.sh owner/repo | AI-generated wiki |
| Anti-bot / Cloudflare bypass | scrapling stealth fetch | Local Turnstile solver |
| Element-level extraction | scrapling + CSS selectors | Precision targeting, adaptive tracking |
| No API key scraping | scrapling HTTP fetch | 100% local, no credentials |
| Site redesign resilience | scrapling adaptive mode | SQLite similarity matching |
| Budget JS-rendered scrape | cf_browser.py markdown URL | CF free tier: 10 min/day, $0.09/hr paid |
| Free static page fetch | cf_browser.py markdown URL --no-render | FREE during beta (no JS) |
| Budget multi-page crawl | cf_browser.py crawl URL | 5 free crawls/day, 100 pages each |
| Incremental re-crawl | cf_browser.py crawl --modified-since | Built-in, Firecrawl lacks this |
| Page screenshot/PDF | cf_browser.py screenshot/pdf URL | Built-in CF endpoints, cheaper |
| AI structured extraction | cf_browser.py json URL --prompt "..." | Workers AI included free |

Common Workflows

Single Page Scraping

firecrawl scrape https://example.com/page --only-main-content
# Or auto-save: fc-save URL
# Or to file: firecrawl scrape URL --only-main-content -o page.md

Documentation Crawling

# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress

Research Workflow

firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown

Document Parsing (Local Files)

Parse local documents into clean Markdown. Use parse for local or non-public files; use scrape for public URLs pointing to documents—both use the same Rust-based parser.

# PDF to markdown
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse report.pdf

# Excel spreadsheet with main content only
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse data.xlsx --only-main-content

# Word doc with zero data retention, save to file
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse contract.docx --zero-data-retention -o contract.md

# Raw JSON output for programmatic use
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse invoice.pdf --json

Supported formats: PDF, DOCX, DOC, XLSX, XLS, HTML, HTM, ODT, RTF (up to 50 MB).

PDF Parsing (Fire-PDF v2.9)

Fire-PDF is now the default parsing pipeline for all PDF scrapes. Three modes:

# Auto mode (default) — detects text layer vs scanned, chooses best method
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://example.com/report.pdf"

# Fast mode — text layer only, skip OCR (use for PDFs with selectable text)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape URL --pdf-mode fast

# OCR mode — force full OCR (use for scanned docs or image-only PDFs)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape URL --pdf-mode ocr

# Limit pages parsed (large documents)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse report.pdf --pdf-mode auto --pdf-max-pages 20

| Flag | Values | Notes |
| --- | --- | --- |
| --pdf-mode | fast, auto, ocr | Default: auto. fast = text layer only; ocr = force OCR |
| --pdf-max-pages | integer | Caps pages parsed; useful for budget control on large PDFs |

Both flags work on scrape (for PDF URLs) and parse (for local files).

Agent-Powered Research (No URLs Needed)

python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"

Interact Workflows (Post-Scrape Browser Interaction)

Scrape a page, then take actions on it—click buttons, fill forms, extract dynamic content. Two modes: AI prompts (natural language) and code execution (Node.js/Python/Bash).

When to Use Interact vs. Actions

| Need | Use | Why |
| --- | --- | --- |
| Click/wait before a single scrape | --actions on scrape | Fire-and-forget, no session overhead |
| Multiple interactions with same page | interact | Persistent session, back-and-forth |
| Fill forms, log in, navigate | interact | Stateful, multi-step |
| Simple "wait for JS to load" | --actions with wait | Cheaper, no session |

Basic Interact (AI Prompt Mode)

# Step 1: Scrape and note the Scrape ID from output
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://example.com/pricing"

# Step 2: Interact using natural language
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --prompt "Click the Enterprise pricing tab"

# Step 3: More interactions on same session
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --prompt "What is the monthly price for the Enterprise plan?"

# Step 4: Stop when done
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact-stop SCRAPE_ID

Code Execution Mode (Cheaper)

# Execute Playwright code directly (2 credits/min vs 7 for prompts)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "const text = await page.locator('.pricing-table').textContent(); console.log(text);"

# Python mode
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "text = await page.locator('.content').text_content(); print(text)" \
  --language python

Persistent Profile (Login Sessions)

# Scrape with a named profile — browser state persists across sessions
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://app.example.com/login" \
  --profile my-app --json

# Interact to log in
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "await page.fill('#email', 'user@example.com'); await page.fill('#password', 'pass'); await page.click('button[type=submit]');"

# Later: scrape another page with same profile — cookies restored
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://app.example.com/dashboard" \
  --profile my-app

Important: Interact does NOT return page markdown. To get updated content after interaction, use code mode to extract specific elements, or issue a follow-up scrape.

Full interact reference: references/interact-reference.md


Billing Notes

  • Credit/token unification (v2.9): Credits and tokens are now unified—15 tokens = 1 credit. All pricing is expressed in credits.
  • Default cache TTL: Results are cached for 2 days. Use --max-age 0 (or maxAge: 0 in API) to force a fresh scrape regardless of cache.
  • query format: Pass formats=["query"] (Python API) to get a direct answer (data.answer) instead of full markdown. Use for factual lookups where you don't need the full page content.
  • audio format: formats=["audio"] returns an MP3 of the page read aloud. Useful for accessibility pipelines or voice interfaces.
  • wikimedia engine: Pass engine="wikimedia" in search options to route queries through Wikimedia. Useful for encyclopedic lookups.
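
For instance, the two calls below force a fresh scrape and request a direct answer instead of full markdown. The flag spellings follow the notes above, but whether the Python script accepts "query" as a --formats value is an assumption; check references/python-api-reference.md.

# Bypass the 2-day cache and fetch fresh content
firecrawl scrape https://example.com/changelog --only-main-content --max-age 0

# Direct answer instead of full page markdown ("query" format assumed supported by the script)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://example.com/pricing" \
  --formats query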

Troubleshooting

# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY

  • Scrape fails: Try jina URL, or add --wait-for 3000 for JS-heavy sites
  • Async job stuck: Check with crawl-status/batch-status, cancel with crawl-cancel/batch-cancel
  • Disable telemetry: export FIRECRAWL_NO_TELEMETRY=1

Reference Documentation

| File | Contents |
| --- | --- |
| references/cli-reference.md | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| references/python-api-reference.md | Full Python API script reference (all commands, SDK examples) |
| references/firecrawl-api.md | Firecrawl Search API reference |
| references/firecrawl-agent-api.md | Agent API (spark models, parallel agents, webhooks) |
| references/actions-reference.md | Page actions for dynamic content (click, write, wait, scroll) |
| references/interact-reference.md | Interact API: post-scrape browser interaction (prompt, code, profiles) |
| references/branding-format.md | Brand identity extraction (colors, fonts, UI) |

Test Suite

python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick    # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py            # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape  # Specific test