# Firecrawl & Jina Web Scraping

## Firecrawl vs WebFetch
Prefer `firecrawl scrape URL --only-main-content` over the WebFetch tool: it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.
```bash
# Preferred approach:
firecrawl scrape https://docs.example.com/api --only-main-content
```
## Token-Efficient Scraping
Inspired by Anthropic's dynamic filtering—always filter before reasoning. This reduced input tokens by ~24% and improved accuracy by ~11% in their benchmarks.
The Principle: Search → Filter → Scrape → Filter → Reason

DO: Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

DON'T: Search → Scrape everything → Reason over all of it
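The DO pipeline can be sketched in plain Python. This is an illustrative sketch only: the `search_results` list stands in for `firecrawl search` output, and the keyword-overlap scoring is a naive placeholder for real relevance judgment.

```python
def score(title: str, query: str) -> int:
    """Naive relevance score: count query words appearing in the title."""
    words = query.lower().split()
    return sum(w in title.lower() for w in words)

def pick_top(results: list[dict], query: str, k: int = 3) -> list[str]:
    """Evaluate search hits by title only, return the k best URLs to scrape."""
    ranked = sorted(results, key=lambda r: score(r["title"], query), reverse=True)
    return [r["url"] for r in ranked[:k]]

# Stand-in for `firecrawl search "rate limits" --limit 20` output (titles/URLs only)
search_results = [
    {"title": "API rate limits and quotas", "url": "https://docs.example.com/rate-limits"},
    {"title": "Company blog: our new office", "url": "https://example.com/blog/office"},
    {"title": "Handling 429 rate limit errors", "url": "https://docs.example.com/429"},
]
print(pick_top(search_results, "rate limits", k=2))
```

Only the two relevant URLs survive to the scrape step; everything else never enters context.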
### Step-by-Step Efficient Workflow

```bash
# Step 1: Search — get titles/URLs only (cheap)
firecrawl search "query" --limit 20

# Step 2: Evaluate results, pick 3-5 best URLs

# Step 3: Scrape only those, filter to relevant sections
firecrawl scrape URL1 --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
  --sections "API,Authentication" --max-chars 5000
```
### Post-Processing with `filter_web_results.py`

Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:

```bash
# Extract only matching sections from scraped page
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats
```
Full path: `python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py`

Flags: `--sections`, `--keywords`, `--max-chars`, `--max-lines`, `--fields` (JSON), `--strip-links`, `--strip-images`, `--compact`, `--stats`
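For intuition, section filtering amounts to splitting the markdown on headings, keeping sections whose heading matches, and capping total length. A standalone sketch of that idea (not the script's actual implementation):

```python
def filter_sections(markdown: str, wanted: list[str], max_chars: int = 5000) -> str:
    """Keep only markdown sections whose heading mentions a wanted term."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#"):          # a heading starts a new section
            sections.append(current)
            current = []
        current.append(line)
    sections.append(current)
    kept = [
        "\n".join(sec) for sec in sections
        if sec and sec[0].startswith("#")
        and any(w.lower() in sec[0].lower() for w in wanted)
    ]
    return "\n".join(kept)[:max_chars]

page = "# Intro\nhello\n# Pricing\n$9/mo\n# FAQ\nstuff"
print(filter_sections(page, ["Pricing"]))  # → # Pricing\n$9/mo
```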
### Other Token-Saving Patterns

- Use `--only-main-content` to strip navigation and footer boilerplate, reducing token consumption. Omit only when nav/footer content is specifically needed.
- Use `--only-clean-content` (Python API script) for aggressive cleaning: strips nav, ads, and cookie banners. Stronger than `--only-main-content`; use when the page is still noisy after main-content filtering.
- Use `firecrawl map URL --search "topic"` first to find relevant subpages before scraping.
- Use `--format links` first to get a URL list, evaluate, then scrape selectively.
- Use `--max-chars` with `exa_contents.py` to cap extraction length.
- Use `--formats summary` (Python API script) over full text when you need the gist, not raw content.
## Claude API Native Tools (for API Agent Builders)

Anthropic's API now offers built-in dynamic filtering tools: `web_search_20260209` / `web_fetch_20260209`

Header: `anthropic-beta: code-execution-web-tools-2026-02-09`
These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need: autonomous agents, batch scraping, structured extraction, domain-specific crawling, or when not on the Claude API.
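For orientation, a request enabling these tools might be shaped as below. This only builds and inspects a payload and makes no API call; the model name is a placeholder, and the exact tool-block schema is an assumption worth checking against Anthropic's docs.

```python
import json

# Hypothetical Messages API payload with the beta web tools enabled
payload = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "tools": [
        {"type": "web_search_20260209", "name": "web_search"},
        {"type": "web_fetch_20260209", "name": "web_fetch"},
    ],
    "messages": [{"role": "user", "content": "Summarize https://docs.example.com/api"}],
}
headers = {
    "anthropic-beta": "code-execution-web-tools-2026-02-09",  # header from the notes above
    "content-type": "application/json",
}
print(json.dumps(payload["tools"], indent=2))
```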
## Available Tools

### 1. Official Firecrawl CLI (`firecrawl`) — Primary

Setup: `npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY`

| Command | Purpose | Quick Example |
|---|---|---|
| `scrape` | Single page → markdown | `firecrawl scrape URL --only-main-content` |
| `crawl` | Entire site with progress | `firecrawl crawl URL --wait --progress --limit 50` |
| `map` | Discover all URLs on a site | `firecrawl map URL --search "API"` |
| `search` | Web search (+ optional scrape) | `firecrawl search "query" --limit 10` |
Full CLI reference: `references/cli-reference.md`
### 2. Auto-Save Alias (`fc-save`) — Shell Alias

Requires shell alias setup (not bundled with this skill).

```bash
fc-save URL
# → Saves to ~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md
```
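The saved filename is a slug of the URL. A plausible slug rule, inferred from the single example above (the alias's real logic may differ):

```python
from urllib.parse import urlparse

def slugify_url(url: str) -> str:
    """Turn a URL into a filesystem-friendly markdown filename."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path          # e.g. docs.example.com/api
    slug = raw.replace(".", "-").replace("/", "-").strip("-")
    return slug + ".md"

print(slugify_url("https://docs.example.com/api"))  # → docs-example-com-api.md
```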
### 3. Python API Script (`firecrawl_api.py`) — Advanced Features

Command: `python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>`

Requires: `FIRECRAWL_API_KEY` env var, `pip install firecrawl-py requests`

| Command | Purpose | Quick Example |
|---|---|---|
| `search` | Web search with scraping | `firecrawl_api.py search "query" -n 10` |
| `scrape` | Single URL with page actions | `firecrawl_api.py scrape URL --formats markdown summary` |
| `batch-scrape` | Multiple URLs concurrently | `firecrawl_api.py batch-scrape URL1 URL2 URL3` |
| `crawl` | Website crawling | `firecrawl_api.py crawl URL --limit 20` |
| `map` | URL discovery | `firecrawl_api.py map URL --search "query"` |
| `parse` | Parse local documents (PDF, DOCX, XLSX) | `firecrawl_api.py parse report.pdf` |
| `extract` | LLM-powered structured extraction | `firecrawl_api.py extract URL --prompt "Find pricing"` |
| `agent` | Autonomous extraction (no URLs needed) | `firecrawl_api.py agent "Find YC W24 AI startups"` |
| `parallel-agent` | Bulk agent queries (v2.8.0+) | `firecrawl_api.py parallel-agent "Q1" "Q2" "Q3"` |
| `interact` | Post-scrape browser interaction | `firecrawl_api.py interact SCRAPE_ID --prompt "Click pricing"` |
| `interact-stop` | Stop an interact session | `firecrawl_api.py interact-stop SCRAPE_ID` |

Agent models: `spark-1-fast` (10 credits, simple), `spark-1-mini` (default), `spark-1-pro` (thorough)

Full Python API reference: `references/python-api-reference.md`
### 4. DeepWiki — GitHub Repo Documentation

`~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]`

AI-generated wiki for any public GitHub repo. No API key required.

```bash
# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save
```
### 5. Jina Reader (`jina`) — Fallback

Use when Firecrawl fails or for Twitter/X URLs (Firecrawl blocks Twitter; Jina works).

```bash
jina https://x.com/username/status/123456
```
## Firecrawl vs Exa vs Native Claude Tools

| Need | Best Tool | Why |
|---|---|---|
| Single page → markdown | `firecrawl scrape --only-main-content` | Cleanest output |
| Search + scrape in one shot | `firecrawl search --scrape` | Combined operation |
| Crawl entire site | `firecrawl crawl --wait --progress` | Link following + progress |
| Local file → markdown | `firecrawl_api.py parse FILE` | Direct upload, no URL needed |
| Autonomous data finding | `firecrawl_api.py agent` | No URLs needed |
| Semantic/neural search | Exa `exa_search.py` | AI-powered relevance |
| Find research papers | Exa `--category "research paper"` | Academic index |
| Quick research answer | Exa `exa_research.py` | Citations + synthesis |
| Find similar pages | Exa `exa_similar.py` | Competitive analysis |
| Claude API agent building | Native `web_search_20260209` | Built-in dynamic filtering |
| Twitter/X content | `jina URL` | Only tool that works |
| GitHub repo docs | `deepwiki.sh owner/repo` | AI-generated wiki |
| Anti-bot / Cloudflare bypass | `scrapling stealth fetch` | Local Turnstile solver |
| Element-level extraction | `scrapling` + CSS selectors | Precision targeting, adaptive tracking |
| No API key scraping | `scrapling` HTTP fetch | 100% local, no credentials |
| Site redesign resilience | `scrapling` adaptive mode | SQLite similarity matching |
| Budget JS-rendered scrape | `cf_browser.py markdown URL` | CF free tier: 10 min/day, $0.09/hr paid |
| Free static page fetch | `cf_browser.py markdown URL --no-render` | FREE during beta (no JS) |
| Budget multi-page crawl | `cf_browser.py crawl URL` | 5 free crawls/day, 100 pages each |
| Incremental re-crawl | `cf_browser.py crawl --modified-since` | Built-in; Firecrawl lacks this |
| Page screenshot/PDF | `cf_browser.py screenshot/pdf URL` | Built-in CF endpoints, cheaper |
| AI structured extraction | `cf_browser.py json URL --prompt "..."` | Workers AI included free |
## Common Workflows

### Single Page Scraping

```bash
firecrawl scrape https://example.com/page --only-main-content

# Or auto-save: fc-save URL
# Or to file: firecrawl scrape URL --only-main-content -o page.md
```
### Documentation Crawling

```bash
# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress
```
### Research Workflow

```bash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown
```
### Document Parsing (Local Files)

Parse local documents into clean Markdown. Use `parse` for local or non-public files; use `scrape` for public URLs pointing to documents. Both use the same Rust-based parser.

```bash
# PDF to markdown
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse report.pdf

# Excel spreadsheet with main content only
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse data.xlsx --only-main-content

# Word doc with zero data retention, save to file
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse contract.docx --zero-data-retention -o contract.md

# Raw JSON output for programmatic use
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse invoice.pdf --json
```
Supported formats: PDF, DOCX, DOC, XLSX, XLS, HTML, HTM, ODT, RTF (up to 50 MB).
### PDF Parsing (Fire-PDF v2.9)

Fire-PDF is now the default parsing pipeline for all PDF scrapes. Three modes:

```bash
# Auto mode (default) — detects text layer vs scanned, chooses best method
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://example.com/report.pdf"

# Fast mode — text layer only, skip OCR (use for PDFs with selectable text)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape URL --pdf-mode fast

# OCR mode — force full OCR (use for scanned docs or image-only PDFs)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape URL --pdf-mode ocr

# Limit pages parsed (large documents)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py parse report.pdf --pdf-mode auto --pdf-max-pages 20
```
| Flag | Values | Notes |
|---|---|---|
| `--pdf-mode` | `fast`, `auto`, `ocr` | Default: `auto`. `fast` = text layer only; `ocr` = force OCR |
| `--pdf-max-pages` | integer | Caps pages parsed; useful for budget control on large PDFs |
Both flags work on `scrape` (for PDF URLs) and `parse` (for local files).
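Choosing a mode by hand follows directly from the table. A trivial hypothetical helper (note that `auto` already makes this decision server-side):

```python
def choose_pdf_mode(has_text_layer: bool, force_ocr: bool = False) -> str:
    """Pick a --pdf-mode value: fast for selectable text, ocr for scans."""
    if force_ocr:
        return "ocr"
    return "fast" if has_text_layer else "ocr"

print(choose_pdf_mode(has_text_layer=True))    # selectable text layer
print(choose_pdf_mode(has_text_layer=False))   # scanned or image-only PDF
```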
### Agent-Powered Research (No URLs Needed)

```bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"
```
### Interact Workflows (Post-Scrape Browser Interaction)

Scrape a page, then take actions on it: click buttons, fill forms, extract dynamic content. Two modes: AI prompts (natural language) and code execution (Node.js/Python/Bash).

#### When to Use Interact vs. Actions

| Need | Use | Why |
|---|---|---|
| Click/wait before a single scrape | `--actions` on `scrape` | Fire-and-forget, no session overhead |
| Multiple interactions with same page | `interact` | Persistent session, back-and-forth |
| Fill forms, log in, navigate | `interact` | Stateful, multi-step |
| Simple "wait for JS to load" | `--actions` with `wait` | Cheaper, no session |
#### Basic Interact (AI Prompt Mode)

```bash
# Step 1: Scrape and note the Scrape ID from output
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://example.com/pricing"

# Step 2: Interact using natural language
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --prompt "Click the Enterprise pricing tab"

# Step 3: More interactions on same session
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --prompt "What is the monthly price for the Enterprise plan?"

# Step 4: Stop when done
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact-stop SCRAPE_ID
```
#### Code Execution Mode (Cheaper)

```bash
# Execute Playwright code directly (2 credits/min vs 7 for prompts)
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "const text = await page.locator('.pricing-table').textContent(); console.log(text);"

# Python mode
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "text = await page.locator('.content').text_content(); print(text)" \
  --language python
```
#### Persistent Profile (Login Sessions)

```bash
# Scrape with a named profile — browser state persists across sessions
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://app.example.com/login" \
  --profile my-app --json

# Interact to log in
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py interact SCRAPE_ID \
  --code "await page.fill('#email', 'user@example.com'); await page.fill('#password', 'pass'); await page.click('button[type=submit]');"

# Later: scrape another page with same profile — cookies restored
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py scrape "https://app.example.com/dashboard" \
  --profile my-app
```
Important: Interact does NOT return page markdown. To get updated content after interaction, use code mode to extract specific elements, or issue a follow-up scrape.
Full interact reference: `references/interact-reference.md`
## Billing Notes

- Credit/token unification (v2.9): Credits and tokens are now unified: 15 tokens = 1 credit. All pricing is expressed in credits.
- Default cache TTL: Results are cached for 2 days. Use `--max-age 0` (or `maxAge: 0` in the API) to force a fresh scrape regardless of cache.
- `query` format: Pass `formats=["query"]` (Python API) to get a direct answer (`data.answer`) instead of full markdown. Use for factual lookups where you don't need the full page content.
- `audio` format: `formats=["audio"]` returns an MP3 of the page read aloud. Useful for accessibility pipelines or voice interfaces.
- `wikimedia` engine: Pass `engine="wikimedia"` in search options to route queries through Wikimedia. Useful for encyclopedic lookups.
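With the 15-tokens-per-credit rule, cost estimation is simple arithmetic. A hypothetical helper that rounds up to whole credits (the rounding behavior is an assumption; billing granularity isn't specified here):

```python
import math

TOKENS_PER_CREDIT = 15  # v2.9 unification: 15 tokens = 1 credit

def tokens_to_credits(tokens: int) -> int:
    """Convert a token count to credits, rounding up to a whole credit."""
    return math.ceil(tokens / TOKENS_PER_CREDIT)

print(tokens_to_credits(15))    # → 1
print(tokens_to_credits(150))   # → 10
print(tokens_to_credits(151))   # → 11 (partial credit rounds up)
```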
## Troubleshooting

```bash
# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY
```

- Scrape fails: Try `jina URL`, or add `--wait-for 3000` for JS-heavy sites
- Async job stuck: Check with `crawl-status` / `batch-status`; cancel with `crawl-cancel` / `batch-cancel`
- Disable telemetry: `export FIRECRAWL_NO_TELEMETRY=1`
## Reference Documentation

| File | Contents |
|---|---|
| `references/cli-reference.md` | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| `references/python-api-reference.md` | Full Python API script reference (all commands, SDK examples) |
| `references/firecrawl-api.md` | Firecrawl Search API reference |
| `references/firecrawl-agent-api.md` | Agent API (spark models, parallel agents, webhooks) |
| `references/actions-reference.md` | Page actions for dynamic content (click, write, wait, scroll) |
| `references/interact-reference.md` | Interact API: post-scrape browser interaction (prompt, code, profiles) |
| `references/branding-format.md` | Brand identity extraction (colors, fonts, UI) |
## Test Suite

```bash
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick         # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py                 # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape   # Specific test
```