crawl4ai-fetch
Use scripts/crawl.py to fetch a URL and return its content as Markdown.
Configuration
Configuration is resolved in the following priority order:
- Environment variables (highest priority)
- .env file in the current working directory (auto-loaded if present)
- Built-in defaults
| Env var | Purpose | Default |
|---|---|---|
| CRAWL4AI_URL | Base URL of crawl4ai instance | https://crawl.981234.xyz |
| CRAWL4AI_TOKEN | Bearer token for auth (optional) | (empty = no auth header sent) |
Example .env:
CRAWL4AI_URL=https://crawl.example.com
CRAWL4AI_TOKEN=your-secret-token
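The priority order above can be sketched in a few lines of Python. This is only an illustration of the documented precedence (environment variables over .env over defaults), not the actual loader in scripts/crawl.py. The names parse_env_file and resolve_config are hypothetical, and the .env parser is deliberately simplified.

```python
import os
from pathlib import Path

# Defaults as listed in the configuration table.
DEFAULTS = {"CRAWL4AI_URL": "https://crawl.981234.xyz", "CRAWL4AI_TOKEN": ""}


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments.

    Hypothetical helper: the real script may parse .env differently.
    """
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(env_path=".env", environ=None):
    """Merge sources so that defaults < .env file < environment variables."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)
    file_values = parse_env_file(env_path)
    for key in DEFAULTS:
        if key in file_values:
            config[key] = file_values[key]
        if key in environ:  # environment variables win over the .env file
            config[key] = environ[key]
    return config
```

Passing environ explicitly keeps the function easy to test without touching the real process environment.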
Usage
# Basic fetch
python3 scripts/crawl.py "https://example.com/"
# Use bm25 filter with a relevance query (returns only the most relevant sections)
python3 scripts/crawl.py "https://docs.example.com/api" --filter bm25 --query "authentication"
# Custom instance with auth
CRAWL4AI_URL=https://crawl.example.com CRAWL4AI_TOKEN=my-token python3 scripts/crawl.py "https://example.com/"
Filter modes
| Mode | Description |
|---|---|
| fit | (default) Smart extraction: removes boilerplate, keeps main content |
| raw | Full page Markdown with no filtering |
| bm25 | BM25-ranked relevance filter; requires --query |
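The rule that bm25 requires --query could be enforced at argument-parsing time. The sketch below uses argparse and is an assumption about how such a check might look, not the script's actual code. The function names are hypothetical. Note that argparse exits with status 2 on usage errors, which may differ from the script's real behavior.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI (illustrative only)."""
    parser = argparse.ArgumentParser(description="Fetch a URL as Markdown (sketch).")
    parser.add_argument("url")
    parser.add_argument("--filter", choices=["fit", "raw", "bm25"], default="fit")
    parser.add_argument("--query", default=None)
    return parser


def parse_args(argv):
    """Parse argv and reject bm25 without a relevance query."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.filter == "bm25" and not args.query:
        # parser.error prints a usage message to stderr and exits with status 2.
        parser.error("--filter bm25 requires --query")
    return args
```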
Output format
Plain Markdown text printed to stdout. Pipe or capture as needed:
python3 scripts/crawl.py "https://example.com/" > page.md
On failure, an error message is printed to stderr and the script exits with code 1.
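When calling the script from Python rather than a shell, the stdout/stderr/exit-code contract described above can be wrapped as follows. This is a minimal sketch: fetch_markdown is a hypothetical helper, and the 90-second wrapper timeout is an assumed value chosen to sit above the script's own 60 s timeout.

```python
import subprocess
import sys


def fetch_markdown(url, extra_args=(), script="scripts/crawl.py", timeout=90):
    """Run the crawl script and return its stdout as Markdown text.

    Raises RuntimeError carrying the script's stderr when the exit code is non-zero.
    """
    cmd = [sys.executable, script, url, *extra_args]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(
            f"crawl failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout
```

For a long page, the same helper could be called with extra_args=["--filter", "bm25", "--query", "topic"].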
Workflow
- Run the script with the target URL, capturing stdout.
- Pass the Markdown content to the LLM for summarization, Q&A, or analysis.
- For long pages, use --filter bm25 --query "topic" to get only the relevant sections.
Notes
- Timeout is 60 s to allow for JavaScript-heavy pages.
- If CRAWL4AI_TOKEN is unset or empty, the Authorization header is omitted (public instances).
- Always fetches fresh content (c=0); server-side cache is not used.
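The token rule in the notes above amounts to a small conditional when building request headers. The following is a sketch under the assumption that the token is sent as a standard Bearer credential, as the configuration table suggests. build_headers and REQUEST_TIMEOUT_SECONDS are hypothetical names, not identifiers from the script.

```python
# Timeout value taken from the notes; the constant name is illustrative.
REQUEST_TIMEOUT_SECONDS = 60


def build_headers(token):
    """Return request headers, omitting Authorization when the token is unset or empty."""
    headers = {}
    if token:  # None and "" both mean "no auth header"
        headers["Authorization"] = f"Bearer {token}"
    return headers
```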