firecrawl

SKILL.md

Firecrawl

Overview

Firecrawl is an API that scrapes websites and returns clean, LLM-ready content. Point it at any URL and get back markdown, HTML, or structured data — no selectors to write, no anti-bot handling, no browser management. It handles JavaScript rendering, proxy rotation, and content extraction automatically. Built for feeding web content into LLMs, RAG pipelines, and data workflows.

When to Use

  • Extracting website content for RAG (Retrieval-Augmented Generation)
  • Converting web pages to clean markdown for LLM consumption
  • Crawling entire sites and getting structured content
  • Scraping without managing browsers, proxies, or anti-bot
  • Extracting structured data (products, articles) with LLM-powered extraction

Instructions

Setup

npm install @mendable/firecrawl-js
# Or Python: pip install firecrawl-py

# Self-hosted: docker run -p 3002:3002 mendableai/firecrawl

Single Page Scrape

// scrape.ts — Convert any URL to clean markdown
import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
  // apiUrl: "http://localhost:3002" // For self-hosted
});

// Scrape a single page
const result = await firecrawl.scrapeUrl("https://docs.example.com/getting-started", {
  formats: ["markdown", "html"],  // Get both formats
});

console.log(result.markdown);     // Clean markdown content
console.log(result.metadata);     // Title, description, language, etc.

Full Site Crawl

// crawl.ts — Crawl an entire site
const crawlResult = await firecrawl.crawlUrl("https://docs.example.com", {
  limit: 100,                     // Max pages to crawl
  scrapeOptions: {
    formats: ["markdown"],
  },
});

// Process all pages
for (const page of crawlResult.data) {
  console.log(`${page.metadata.title}: ${page.markdown.length} chars`);
  // Feed into your RAG pipeline, vector DB, etc.
}

Structured Data Extraction

// extract.ts — Extract structured data using LLM
import { z } from "zod";

const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  currency: z.string(),
  rating: z.number().optional(),
  inStock: z.boolean(),
  features: z.array(z.string()),
});

const result = await firecrawl.scrapeUrl("https://shop.example.com/product/123", {
  formats: ["extract"],
  extract: {
    schema: ProductSchema,
  },
});

console.log(result.extract);
// { name: "Widget Pro", price: 49.99, currency: "USD", rating: 4.5, inStock: true, features: [...] }

Build a RAG Knowledge Base

// rag-ingest.ts — Crawl docs site and ingest into vector DB
import FirecrawlApp from "@mendable/firecrawl-js";
import { ChromaClient } from "chromadb";

const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const chroma = new ChromaClient();
const collection = await chroma.getOrCreateCollection({ name: "docs" });

// Crawl documentation site
const crawl = await firecrawl.crawlUrl("https://docs.myproduct.com", {
  limit: 500,
  scrapeOptions: { formats: ["markdown"] },
});

// Chunk and store in vector DB
for (const page of crawl.data) {
  const chunks = splitIntoChunks(page.markdown, 1000);  // 1000 char chunks

  await collection.add({
    ids: chunks.map((_, i) => `${page.metadata.sourceURL}-chunk-${i}`),
    documents: chunks,
    metadatas: chunks.map(() => ({
      source: page.metadata.sourceURL,
      title: page.metadata.title,
    })),
  });
}

function splitIntoChunks(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

Examples

Example 1: Build a docs chatbot

User prompt: "I want a chatbot that answers questions about my product documentation."

The agent will use Firecrawl to crawl the docs site, convert to markdown, chunk the content, store in a vector database, and build a RAG query pipeline.

Example 2: Monitor competitor content changes

User prompt: "Track when our competitor updates their pricing page."

The agent will schedule periodic Firecrawl scrapes, compare markdown diffs between runs, and alert on significant changes.

Guidelines

  • scrapeUrl for single pages — fast, returns markdown + metadata
  • crawlUrl for entire sites — follows links, respects limits
  • Markdown is the best LLM format — cleaner than HTML, preserves structure
  • Structured extraction for data — use Zod/JSON schema to extract typed data
  • Self-host for privacydocker run mendableai/firecrawl for sensitive data
  • Rate limits on cloud API — 500 pages/min on free tier
  • Chunk markdown for RAG — 500-1500 char chunks with overlap work best
  • Cache results — don't re-scrape unchanged pages
  • formats array — request only what you need (markdown, html, extract)
Weekly Installs
1
GitHub Stars
15
First Seen
1 day ago
Installed on
amp1
cline1
augment1
opencode1
cursor1
kimi-cli1