# crawl4ai
High-performance web crawler with intelligent chunking. Crawls web pages and extracts content as markdown using LLM-based skeleton planning.
## Commands
### crawl_url (alias: `webCrawl`)
Crawl a web page with native workflow execution and LLM-based intelligent chunking.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | - | Target URL to crawl (required) |
| `action` | str | `"smart"` | Action mode: `"smart"`, `"skeleton"`, `"crawl"` |
| `fit_markdown` | bool | `true` | Clean and simplify markdown output |
| `max_depth` | int | `0` | Maximum crawling depth (0 = single page) |
| `return_skeleton` | bool | `false` | Also return document skeleton (TOC) |
| `chunk_indices` | list[int] | - | List of section indices to extract |
Action Modes:
| Mode | Description | Use Case |
|---|---|---|
| `smart` (default) | LLM generates chunk plan, then extracts relevant sections | Large docs where you need specific info |
| `skeleton` | Extract lightweight TOC without full content | Quick overview, decide what to read |
| `crawl` | Return full markdown content | Small pages, complete content needed |
Runtime Transport:
- `max_depth = 0`: Uses HTTP strategy (no browser cold-start) for lower latency.
- `max_depth > 0`: Uses browser deep-crawl strategy (BFS) for multi-page traversal.
- `file://...` with `max_depth = 0`: Uses local fast-path (no crawl4ai runtime bootstrap) for deterministic fixture/local-note benchmarking.
- Persistent worker mode reuses the HTTP crawler instance across requests to reduce repeated initialization cost.
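The transport rules above can be sketched as a small dispatch function. This is an illustration only; the strategy names (`"http"`, `"browser-bfs"`, `"local-fast-path"`) are placeholders, not the runtime's internal identifiers.

```python
def pick_transport(url: str, max_depth: int) -> str:
    """Choose a crawl strategy per the rules above (illustrative sketch)."""
    if url.startswith("file://") and max_depth == 0:
        return "local-fast-path"  # skip crawl4ai runtime bootstrap entirely
    if max_depth == 0:
        return "http"             # single page: avoid browser cold-start
    return "browser-bfs"          # multi-page traversal via browser deep-crawl

print(pick_transport("https://example.com", 0))  # http
print(pick_transport("https://example.com", 2))  # browser-bfs
print(pick_transport("file:///notes/a.md", 0))   # local-fast-path
```

The key point is that the transport is chosen per request, so mixing single-page and deep crawls in one session is fine.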
Examples:
# Smart crawl with LLM chunking (default)
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com"}`
# Skeleton only - get TOC quickly
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "action": "skeleton"}`
# Full content crawl
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "action": "crawl"}`
# Extract specific sections
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "chunk_indices": [0, 1, 2]}`
# Deep crawl (follow links up to depth N)
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "max_depth": 2}`
# Get skeleton with full content
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "return_skeleton": true}`
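The skeleton and chunk-extraction calls above are typically combined into a two-step workflow: fetch the TOC, pick section indices, then extract only those sections. The sketch below stubs the tool call; `call_tool` and the stubbed responses are hypothetical stand-ins for your MCP client and real crawl4ai output.

```python
# Hypothetical two-step workflow: skeleton first, then targeted extraction.
def call_tool(name: str, args: dict) -> dict:
    """Stand-in for an MCP client call; returns canned illustrative data."""
    if args.get("action") == "skeleton":
        return {"skeleton": [
            {"index": 0, "title": "Intro"},
            {"index": 1, "title": "Install"},
            {"index": 2, "title": "API Reference"},
        ]}
    return {"markdown": f"content for sections {args.get('chunk_indices')}"}

toc = call_tool("crawl4ai.CrawlUrl",
                {"url": "https://example.com", "action": "skeleton"})
wanted = [s["index"] for s in toc["skeleton"] if "API" in s["title"]]
result = call_tool("crawl4ai.CrawlUrl",
                   {"url": "https://example.com", "chunk_indices": wanted})
print(result["markdown"])
```

This pattern keeps the LLM's context small: it reasons over the ~500-token skeleton instead of the full page.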
## Core Concepts
| Topic | Description | Reference |
|---|---|---|
| Skeleton Planning | LLM sees TOC (~500 tokens) not full content (~10k+) | smart-chunking.md |
| Chunk Extraction | Token-aware section extraction | chunking.md |
| Deep Crawling | Multi-page crawling with BFS strategy | deep-crawl.md |
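"Token-aware section extraction" can be approximated with a greedy budget loop. This sketch is an assumption about the general technique, not crawl4ai's actual implementation, and uses a rough 4-characters-per-token heuristic.

```python
def select_chunks(sections: list[str], indices: list[int],
                  token_budget: int) -> list[str]:
    """Greedy sketch: keep requested sections until a rough token
    budget (~4 chars/token heuristic) would be exceeded."""
    out, used = [], 0
    for i in indices:
        cost = len(sections[i]) // 4
        if used + cost > token_budget:
            break
        out.append(sections[i])
        used += cost
    return out

docs = ["intro " * 100, "install " * 100, "api " * 500]
print(len(select_chunks(docs, [0, 1, 2], token_budget=400)))  # 2
```

With a 400-token budget, the first two sections fit (150 + 200 estimated tokens) and the large third section is dropped.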
## Best Practices
- Use `skeleton` mode first for large documents to understand their structure
- Use `chunk_indices` to extract specific sections instead of the full content
- Set `max_depth` above 0 with care; keep it low to bound the number of pages crawled and prevent runaway crawls
- Keep `fit_markdown=true` for cleaner output; set it to `false` for raw content
## Advanced
- Batch multiple URLs with separate calls
- Combine with knowledge tools for RAG pipelines
- Use skeleton + LLM to auto-generate chunk plans for custom extraction
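Since the tool takes one URL per call, batching means looping with per-URL error handling. A minimal sketch, assuming a `crawl` callable that wraps your tool invocation (the lambda below is a stub, not real output):

```python
# Batch crawl sketch: one tool call per URL, collecting failures
# instead of aborting the whole batch on the first error.
def crawl_many(urls, crawl):
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = crawl({"url": url, "action": "skeleton"})
        except Exception as exc:
            errors[url] = str(exc)
    return results, errors

ok, bad = crawl_many(
    ["https://a.example", "https://b.example"],
    crawl=lambda args: {"skeleton": [], "url": args["url"]},  # stub
)
print(sorted(ok))  # both URLs succeeded
```

Collecting errors per URL makes the batch resumable: feed `bad`'s keys back into `crawl_many` to retry only the failures.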