Firecrawl
Overview
Firecrawl is an API that scrapes websites and returns clean, LLM-ready content. Point it at any URL and get back markdown, HTML, or structured data — no selectors to write, no anti-bot handling, no browser management. It handles JavaScript rendering, proxy rotation, and content extraction automatically. Built for feeding web content into LLMs, RAG pipelines, and data workflows.
When to Use
- Extracting website content for RAG (Retrieval-Augmented Generation)
- Converting web pages to clean markdown for LLM consumption
- Crawling entire sites and getting structured content
- Scraping without managing browsers, proxies, or anti-bot countermeasures
- Extracting structured data (products, articles) with LLM-powered extraction
Instructions
Setup
npm install @mendable/firecrawl-js
# Or Python: pip install firecrawl-py
# Self-hosted: docker run -p 3002:3002 mendableai/firecrawl
Single Page Scrape
// scrape.ts — Convert any URL to clean markdown
import FirecrawlApp from "@mendable/firecrawl-js";
const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
  // apiUrl: "http://localhost:3002" // For self-hosted
});
// Scrape a single page
const result = await firecrawl.scrapeUrl("https://docs.example.com/getting-started", {
  formats: ["markdown", "html"], // Get both formats
});
// Assumes the v1 SDK response shape, which carries a success flag
if (!result.success) {
  throw new Error(`Scrape failed: ${result.error}`);
}
console.log(result.markdown); // Clean markdown content
console.log(result.metadata); // Title, description, language, etc.
Full Site Crawl
// crawl.ts — Crawl an entire site
const crawlResult = await firecrawl.crawlUrl("https://docs.example.com", {
limit: 100, // Max pages to crawl
scrapeOptions: {
formats: ["markdown"],
},
});
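// Process all pages (assumes the v1 SDK response, which carries a success flag)
if (!crawlResult.success) throw new Error(`Crawl failed: ${crawlResult.error}`);
for (const page of crawlResult.data) {
  // markdown and title are optional fields, so fall back to safe defaults
  const title = page.metadata?.title ?? "(untitled)";
  console.log(`${title}: ${page.markdown?.length ?? 0} chars`);
  // Feed into your RAG pipeline, vector DB, etc.
}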
Structured Data Extraction
// extract.ts — Extract structured data using LLM
import { z } from "zod";
const ProductSchema = z.object({
name: z.string(),
price: z.number(),
currency: z.string(),
rating: z.number().optional(),
inStock: z.boolean(),
features: z.array(z.string()),
});
const result = await firecrawl.scrapeUrl("https://shop.example.com/product/123", {
formats: ["extract"],
extract: {
schema: ProductSchema,
},
});
console.log(result.extract);
// { name: "Widget Pro", price: 49.99, currency: "USD", rating: 4.5, inStock: true, features: [...] }
Build a RAG Knowledge Base
// rag-ingest.ts — Crawl docs site and ingest into vector DB
import FirecrawlApp from "@mendable/firecrawl-js";
import { ChromaClient } from "chromadb";
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const chroma = new ChromaClient();
const collection = await chroma.getOrCreateCollection({ name: "docs" });
// Crawl documentation site
const crawl = await firecrawl.crawlUrl("https://docs.myproduct.com", {
limit: 500,
scrapeOptions: { formats: ["markdown"] },
});
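// Chunk and store in vector DB
if (!crawl.success) throw new Error(`Crawl failed: ${crawl.error}`);
for (const page of crawl.data) {
  const source = page.metadata?.sourceURL;
  // Skip pages with no markdown or no source URL so every chunk gets a stable ID
  if (!page.markdown || !source) continue;
  const chunks = splitIntoChunks(page.markdown, 1000); // 1000 char chunks
  await collection.add({
    ids: chunks.map((_, i) => `${source}-chunk-${i}`),
    documents: chunks,
    metadatas: chunks.map(() => ({
      source,
      title: page.metadata?.title ?? "",
    })),
  });
}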
function splitIntoChunks(text: string, size: number): string[] {
const chunks: string[] = [];
for (let i = 0; i < text.length; i += size) {
chunks.push(text.slice(i, i + size));
}
return chunks;
}
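The splitIntoChunks helper above cuts at fixed offsets, so a sentence that straddles a boundary is split between two chunks with no shared context. A minimal sketch of an overlapping variant follows. The name splitWithOverlap and its overlap parameter are hypothetical, not part of Firecrawl or Chroma.

```typescript
// Hypothetical helper: fixed-size character chunks where each chunk
// starts `overlap` characters before the previous chunk ended.
function splitWithOverlap(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) {
    throw new Error("overlap must be at least 0 and smaller than size");
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once a chunk has reached the end of the text, so the tail
    // is not emitted again as a shorter duplicate.
    if (start + size >= text.length) break;
  }
  return chunks;
}
```

It could stand in for splitIntoChunks in the ingest loop, for example splitWithOverlap(page.markdown, 1000, 200). The 200-character overlap is an illustrative value, not a recommendation from the source.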
Examples
Example 1: Build a docs chatbot
User prompt: "I want a chatbot that answers questions about my product documentation."
The agent will use Firecrawl to crawl the docs site, convert each page to markdown, chunk the content, store the chunks in a vector database, and build a RAG query pipeline on top.
Example 2: Monitor competitor content changes
User prompt: "Track when our competitor updates their pricing page."
The agent will schedule periodic Firecrawl scrapes, compare markdown diffs between runs, and alert on significant changes.
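The comparison step in Example 2 can be sketched as a line-level diff between two markdown snapshots. This is only an illustration under stated assumptions: compareSnapshots, normalizeMarkdown, and ChangeReport are hypothetical names, and the snapshots are assumed to be the markdown strings returned by two separate scrapes of the same URL.

```typescript
interface ChangeReport {
  changed: boolean;
  added: string[];   // lines present only in the current snapshot
  removed: string[]; // lines present only in the previous snapshot
}

// Strip trailing whitespace and surrounding blank lines so purely
// cosmetic differences do not count as changes.
function normalizeMarkdown(markdown: string): string {
  return markdown
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .trim();
}

// Compare two snapshots by set membership of their lines.
// Reordered lines are flagged as changed but yield empty added/removed lists.
function compareSnapshots(previous: string, current: string): ChangeReport {
  const prev = normalizeMarkdown(previous);
  const curr = normalizeMarkdown(current);
  if (prev === curr) {
    return { changed: false, added: [], removed: [] };
  }
  const prevLines = new Set(prev.split("\n"));
  const currLines = new Set(curr.split("\n"));
  const added = Array.from(currLines).filter((line) => !prevLines.has(line));
  const removed = Array.from(prevLines).filter((line) => !currLines.has(line));
  return { changed: true, added, removed };
}
```

What counts as a "significant" change is left to the caller. One option is to alert only when an added or removed line matches a price pattern.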
Guidelines
- scrapeUrl for single pages: fast, returns markdown + metadata
- crawlUrl for entire sites: follows links, respects limits
- Markdown is the best LLM format: cleaner than HTML, preserves structure
- Structured extraction for data: use Zod/JSON schema to extract typed data
- Self-host for privacy: docker run mendableai/firecrawl for sensitive data
- Rate limits on cloud API: 500 pages/min on free tier
- Chunk markdown for RAG: 500-1500 char chunks with overlap work best
- Cache results: don't re-scrape unchanged pages
- formats array: request only what you need (markdown, html, extract)
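One way to follow the caching guideline is a small time-based cache keyed by URL. This is a minimal in-memory sketch, not a Firecrawl feature: ScrapeCache and scrapeWithCache are hypothetical names, and the scrape callback stands in for whatever function actually fetches the page. A time-to-live is only a proxy for "unchanged", since the cache cannot tell whether the page itself was edited.

```typescript
// Hypothetical in-memory cache with a time-to-live per entry.
class ScrapeCache<T> {
  private entries = new Map<string, { value: T; storedAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(url: string): T | undefined {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(url); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(url: string, value: T): void {
    this.entries.set(url, { value, storedAt: this.now() });
  }
}

// Return the cached value when fresh, otherwise call `scrape` and store the result.
async function scrapeWithCache<T>(
  url: string,
  cache: ScrapeCache<T>,
  scrape: (url: string) => Promise<T>,
): Promise<T> {
  const cached = cache.get(url);
  if (cached !== undefined) return cached;
  const fresh = await scrape(url);
  cache.set(url, fresh);
  return fresh;
}
```

For scheduled jobs that run in separate processes, the same interface could be backed by a file or a key-value store instead of a Map.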