cf-crawl
Cloudflare Website Crawler
You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use.
Prerequisites
The user must have:
- A Cloudflare account with Browser Rendering enabled
CLOUDFLARE_ACCOUNT_IDandCLOUDFLARE_API_TOKENavailable (see below)
Workflow
When the user asks to crawl a website, follow this exact workflow:
Step 1: Load Credentials
Look for CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in this order:
- Current environment variables - Check if already exported in the shell
- Project
.envfile - Read.envin the current working directory and extract the values - Project
.env.localfile - Read.env.localin the current working directory - Home directory
.env- Read~/.envas a last resort
To load from a .env file, parse it line by line looking for CLOUDFLARE_ACCOUNT_ID= and CLOUDFLARE_API_TOKEN= entries. Use this bash approach:
# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
for envfile in .env .env.local "$HOME/.env"; do
if [ -f "$envfile" ]; then
eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
fi
done
fi
If credentials are still missing after checking all sources, tell the user to add them to their project .env file:
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token
The API token needs "Browser Rendering - Edit" permission. Create one at Cloudflare Dashboard > API Tokens.
Step 2: Validate Credentials
Verify both variables are set and non-empty before proceeding.
Step 3: Initiate Crawl
Send a POST request to start the crawl job. Choose parameters based on user needs:
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"url": "<TARGET_URL>",
"limit": <NUMBER_OF_PAGES>,
"formats": ["markdown"],
"options": {
"excludePatterns": ["**/changelog/**", "**/api-reference/**"]
}
}'
For incremental crawls, add the modifiedSince parameter (Unix timestamp in seconds):
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"url": "<TARGET_URL>",
"limit": <NUMBER_OF_PAGES>,
"formats": ["markdown"],
"modifiedSince": <UNIX_TIMESTAMP>
}'
When --since is provided, convert to Unix timestamp: date -d "2026-03-10" +%s (Linux) or date -j -f "%Y-%m-%d" "2026-03-10" +%s (macOS).
The response returns a job ID:
{"success": true, "result": "job-uuid-here"}
Step 4: Poll for Completion
Poll the job status every 5 seconds until it completes:
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"
Possible job statuses:
running- Still in progress, keep pollingcompleted- All pages processedcancelled_due_to_timeout- Exceeded 7-day limitcancelled_due_to_limits- Hit account limitserrored- Something went wrong
Step 5: Retrieve Results
When using modifiedSince, check for skipped pages to see what was unchanged:
# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
Fetch all completed records using pagination (cursor-based):
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
If there are more records, use the cursor value from the response:
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
Step 6: Save Results
Save each page's markdown content to a local directory. Use a script like:
# Create output directory
mkdir -p .crawl-output
# Fetch and save all pages
python3 -c "
import json, os, re, sys, urllib.request
account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)
cursor = None
total_saved = 0
while True:
url = f'{base}?status=completed&limit=50'
if cursor:
url += f'&cursor={cursor}'
req = urllib.request.Request(url, headers={
'Authorization': f'Bearer {api_token}'
})
with urllib.request.urlopen(req) as resp:
data = json.load(resp)
records = data.get('result', {}).get('records', [])
if not records:
break
for rec in records:
page_url = rec.get('url', '')
md = rec.get('markdown', '')
if not md:
continue
# Convert URL to filename
name = re.sub(r'https?://', '', page_url)
name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
filepath = os.path.join(outdir, f'{name}.md')
with open(filepath, 'w') as f:
f.write(f'<!-- Source: {page_url} -->\n\n')
f.write(md)
total_saved += 1
cursor = data.get('result', {}).get('cursor')
if cursor is None:
break
print(f'Saved {total_saved} pages to {outdir}/')
"
Parameter Reference
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
url |
string | (required) | Starting URL to crawl |
limit |
number | 10 | Max pages to crawl (up to 100,000) |
depth |
number | 100,000 | Max link depth from starting URL |
formats |
array | ["html"] | Output formats: html, markdown, json |
render |
boolean | true | true = headless browser, false = fast HTML fetch |
source |
string | "all" | Page discovery: all, sitemaps, links |
maxAge |
number | 86400 | Cache validity in seconds (max 604800) |
modifiedSince |
number | - | Unix timestamp; only crawl pages modified after this time |
Options Object
| Parameter | Type | Default | Description |
|---|---|---|---|
includePatterns |
array | [] | Wildcard patterns to include (* and **) |
excludePatterns |
array | [] | Wildcard patterns to exclude (higher priority) |
includeSubdomains |
boolean | false | Follow links to subdomains |
includeExternalLinks |
boolean | false | Follow external links |
Advanced Parameters
| Parameter | Type | Description |
|---|---|---|
jsonOptions |
object | AI-powered structured extraction (prompt, response_format) |
authenticate |
object | HTTP basic auth (username, password) |
setExtraHTTPHeaders |
object | Custom headers for requests |
rejectResourceTypes |
array | Skip: image, media, font, stylesheet |
userAgent |
string | Custom user agent string |
cookies |
array | Custom cookies for requests |
Usage Examples
Crawl documentation site (most common)
/cf-crawl https://docs.example.com --limit 50
Crawls up to 50 pages, saves as markdown.
Crawl with filters
/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"
Incremental crawl (diff detection)
/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10
Only crawls pages modified since the given date. Skipped pages appear with status=skipped in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.
Fast crawl without JavaScript rendering
/cf-crawl https://docs.example.com --no-render --limit 200
Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content.
Crawl and merge into single file
/cf-crawl https://docs.example.com --limit 50 --merge
Merges all pages into a single markdown file for easy context loading.
Argument Parsing
When invoked as /cf-crawl, parse the arguments as follows:
- First positional argument: the URL to crawl
--limit Nor-l N: max pages (default: 20)--depth Nor-d N: max depth (default: 100000)--include "pattern1,pattern2": include URL patterns--exclude "pattern1,pattern2": exclude URL patterns--no-render: disable JavaScript rendering (faster)--merge: combine all output into a single file--output DIRor-o DIR: output directory (default:.crawl-output)--source sitemaps|links|all: page discovery method (default: all)--since DATE: only crawl pages modified since DATE (ISO date like2026-03-10or Unix timestamp). Converts to Unix timestamp for themodifiedSinceAPI parameter
If no URL is provided, ask the user for the target URL.
Important Notes
- The /crawl endpoint respects robots.txt directives including crawl-delay
- Blocked URLs appear with
"status": "disallowed"in results - Free plan: 10 minutes of browser time per day
- Job results are available for 14 days after completion
- Max job runtime: 7 days
- Response page size limit: 10 MB per page
- Use
render: falsefor static sites to save browser time - Pattern wildcards:
*matches any character except/,**matches including/