crawlio-mcp
Crawlio MCP Server
Crawlio MCP exposes 35 tools (full mode) or 6 tools (code mode) over stdio transport. The server connects to Crawlio.app's ControlServer for live operations and reads local state files for offline access.
Modes
Code Mode (default)
6 tools: search_api, execute_api, trigger_capture, extract_text_from_image, analyze_page, compare_pages. Use search_api to discover endpoints, then execute_api to call them. extract_text_from_image runs Vision OCR locally (no app required). Lower tool count, better for context-constrained clients.
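Under the hood, each tool invocation is an MCP `tools/call` request over stdio. A minimal sketch of the discover-then-execute flow, building the raw JSON-RPC messages (the `tools_call` helper is ours, not part of the server):

```python
import json

def tools_call(name: str, arguments: dict, request_id: int) -> str:
    """Build one MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# 1. Discover endpoints matching a keyword.
discover = tools_call("search_api", {"query": "enrichment", "limit": 10}, request_id=1)

# 2. Call one of the discovered endpoints.
invoke = tools_call("execute_api", {"method": "GET", "path": "/status"}, request_id=2)
```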
Full Mode (--full)
35 individual tools with typed parameters and annotations. Better for clients that can handle many tools.
Full Mode Tools (35)
Status & Monitoring (6)
get_crawl_status — Engine state + progress counters.
since(int, opt): Sequence number for change detection.
get_crawl_logs — Recent log entries with filtering.
category(string, opt): engine | download | parser | localizer | network | ui.
level(string, opt): debug | info | default | error | fault.
limit(int, opt): Max entries (default 100).
get_errors — Error/fault-level logs only. No params.
get_downloads — All download items with status, HTTP code, bytes, timing. No params.
get_failed_urls — Failed items with URL + error. No params.
get_site_tree — File paths as directory tree. No params.
Control (4)
start_crawl — Start a new crawl.
url(string, opt): Single URL.
urls(string[], opt): Multi-seed URLs.
destinationPath(string, opt): Save directory.
stop_crawl — Stop crawl, cancel downloads, clear queue. No params.
pause_crawl — Pause (in-progress downloads complete). No params.
resume_crawl — Resume paused crawl. No params.
Settings & Configuration (3)
get_settings — Current pending settings + policy. No params.
update_settings — Partial merge (idle only).
settings(object, opt): maxConcurrent, crawlDelay, timeout, downloadImages, downloadVideo, downloadFonts, downloadScripts, downloadStyles, userAgent, maxRetries, stripTrackingParams, customCookies, customHeaders, preferHTTP2 (bool), proxyConfiguration ({type: "http"/"https"/"socks5", host, port, username?, password?, noProxyHosts?}).
policy(object, opt): scopeMode, maxDepth, maxPagesPerCrawl, respectRobotsTxt, excludePatterns, includePatterns, includeSupportingFiles, downloadCrossDomainAssets, autoUpgradeHTTP, pinnedPublicKeys ({hostname: [sha256HexStrings]}).
recrawl_urls — Re-crawl specific URLs.
urls(string[], required).
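An illustrative update_settings payload as a plain dict. The key names come from the parameter lists above; the values are assumptions chosen for the example, and the merge is partial (only the keys you send change):

```python
# Illustrative partial-merge payload for update_settings (idle only).
# Key names are from the tool docs; values are example assumptions.
settings_update = {
    "settings": {
        "maxConcurrent": 4,
        "crawlDelay": 1.0,
        "downloadImages": True,
        "preferHTTP2": True,
        "proxyConfiguration": {"type": "http", "host": "proxy.example", "port": 8080},
    },
    "policy": {
        "maxDepth": 2,
        "respectRobotsTxt": True,
        "excludePatterns": ["*/admin/*"],
    },
}
```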
Projects (5)
list_projects — All saved projects. No params.
save_project — Save current project.
name(string, opt).
load_project — Load project by ID.
id(string, required).
delete_project — Delete project by ID.
id(string, required).
get_project — Full project details.
id(string, required).
Export & Extraction (5)
export_site — Export downloaded site.
format(string, required): folder | zip | singleHTML | warc.
destinationPath(string, required).
warcConfiguration(object, opt): compressionEnabled (bool, default true), maxFileSize (int, default 1GB, 0 = no split), cdxEnabled (bool, default true), dedupEnabled (bool, default true).
get_export_status — Export state + progress. No params.
extract_site — Run RSC extraction pipeline.
destinationPath(string, opt).
get_extraction_status — Extraction state + progress. No params.
trigger_capture — WebKit runtime capture (framework detection, network, console, DOM).
url(string, required).
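An illustrative export_site argument set for a WARC export, with the documented defaults written out explicitly (the destination path is an example):

```python
# Example export_site arguments for a WARC export.
export_args = {
    "format": "warc",                 # folder | zip | singleHTML | warc
    "destinationPath": "/tmp/archive.warc.gz",
    "warcConfiguration": {
        "compressionEnabled": True,   # default true
        "maxFileSize": 0,             # default 1 GB; 0 = single file, no splitting
        "cdxEnabled": True,           # default true: write a CDX index sidecar
        "dedupEnabled": True,         # default true: revisit records for duplicates
    },
}
```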
OCR (1)
extract_text_from_image — Extract text from a local image using Vision OCR. No Crawlio.app required.
path(string, required): Absolute file path to image.
languages(string[], opt): Recognition languages (e.g. ["en-US"]).
recognitionLevel(string, opt): accurate (default) or fast.
Enrichment (6)
get_enrichment — Browser enrichment data.
url(string, opt): Filter by URL.
submit_enrichment_bundle — Complete enrichment bundle.
url(string, required), framework(object, opt), networkRequests(array, opt), consoleLogs(array, opt), domSnapshotJSON(string, opt).
submit_enrichment_framework — Framework detection.
url(string, required), framework(object, required).
submit_enrichment_network — Network requests.
url(string, required), networkRequests(array, required).
submit_enrichment_console — Console logs.
url(string, required), consoleLogs(array, required).
submit_enrichment_dom — DOM snapshot.
url(string, required), domSnapshotJSON(string, required).
Observations & Findings (5)
get_observations — Append-only observation timeline.
host(string, opt), op(string, opt), source(string, opt), since(number, opt), limit(int, opt).
get_observation — Look up a single observation or finding by ID.
id(string, required): Observation ID (obs_xxx or fnd_xxx). Use to verify evidence chains.
create_finding — Create curated finding with evidence.
title(string, required), url(string, opt), evidence(string[], opt), synthesis(string, opt), confidence(string, opt: high/medium/low/none), category(string, opt).
get_findings — List curated findings.
host(string, opt), limit(int, opt).
get_crawled_urls — Downloaded URLs with pagination.
status(string, opt), type(string, opt), limit(int, opt), offset(int, opt).
Code Mode Tools (6)
search_api — Search available endpoints by keyword.
search_api(query: "enrichment", limit: 10)
execute_api — Execute HTTP request against ControlServer.
execute_api(method: "GET", path: "/status")
execute_api(method: "POST", path: "/start", body: {"url": "https://example.com"})
execute_api(method: "PATCH", path: "/settings", body: {"policy": {"maxDepth": 2}})
execute_api(method: "GET", path: "/crawled-urls?status=completed&limit=50")
trigger_capture — WebKit runtime capture (same as full mode).
trigger_capture(url: "https://example.com")
extract_text_from_image — Vision OCR on local image (same as full mode).
extract_text_from_image(path: "/path/to/image.png")
extract_text_from_image(path: "/path/to/image.jpg", languages: ["en-US"], recognitionLevel: "fast")
analyze_page — Composite analysis of a single page (capture + enrich + crawl status). Returns evidenceId, evidenceQuality, gaps.
analyze_page(url: "https://example.com")
compare_pages — Compare two pages side-by-side (runs analyze_page on each). Returns comparisonReadiness, symmetric, degradationNotes, timingDelta.
compare_pages(urlA: "https://example.com", urlB: "https://competitor.com")
HTTP-Only Endpoints (3)
Accessible via execute_api but not as MCP tools:
GET /health — Server health, version, uptime, PID.
GET /debug/metrics — Engine metrics: connections, queue depth, memory.
POST /debug/dump-state — Full engine state dump.
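In code mode these can be reached by passing the method and path straight through execute_api; the three argument sets look like:

```python
# execute_api argument sets for the three HTTP-only endpoints.
http_only = [
    {"method": "GET", "path": "/health"},            # health, version, uptime, PID
    {"method": "GET", "path": "/debug/metrics"},     # connections, queue depth, memory
    {"method": "POST", "path": "/debug/dump-state"}, # full engine state dump
]
```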
Resources (4)
| URI | Description |
|---|---|
| crawlio://status | Engine state and progress |
| crawlio://settings | Current crawl settings |
| crawlio://site-tree | Downloaded file tree |
| crawlio://enrichment | All browser enrichment data |
Template (1)
crawlio://enrichment/{url} — Per-URL enrichment data.
Prompts (4)
| Prompt | Arguments | Description |
|---|---|---|
| crawl-and-analyze | url (req), maxDepth (opt) | Crawl + analyze results |
| export-site | url (req), format (req), destination (opt) | Crawl + export |
| compare-sites | url1 (req), url2 (req) | Compare two sites |
| fix-failed-urls | none | Diagnose + retry failures |
Common Workflows
Crawl → Wait → Export
1. update_settings — Configure depth, scope, asset options.
2. start_crawl — Begin crawl.
3. get_crawl_status — Poll until engineState is completed. Use the since param for efficient polling.
4. export_site — Export as zip/folder/singleHTML/warc.
5. get_export_status — Confirm export finished.
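The same workflow as a small sketch, written against a generic `call_tool(name, args)` function, i.e. whatever invocation helper your MCP client provides (the name and response shapes are assumptions):

```python
import time

def crawl_and_export(call_tool, url, dest, poll_interval=5):
    """Crawl -> wait -> export against a hypothetical call_tool(name, args) helper."""
    call_tool("update_settings", {"policy": {"maxDepth": 2}})
    call_tool("start_crawl", {"url": url})
    seq = None
    while True:
        # Pass the last seen seq as `since` for efficient change-detection polling.
        args = {"since": seq} if seq is not None else {}
        status = call_tool("get_crawl_status", args)
        seq = status.get("seq", seq)
        if status.get("engineState") == "completed":
            break
        time.sleep(poll_interval)
    call_tool("export_site", {"format": "zip", "destinationPath": dest})
    return call_tool("get_export_status", {})
```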
WARC Export with Options
1. update_settings — Configure proxy/pinning if needed: {settings: {proxyConfiguration: {type: "http", host: "proxy.corp", port: 8080}}}.
2. start_crawl — Crawl the target site.
3. get_crawl_status — Poll until completed.
4. export_site — Export with WARC options: {format: "warc", destinationPath: "/tmp/archive.warc.gz", warcConfiguration: {compressionEnabled: true, cdxEnabled: true, dedupEnabled: true, maxFileSize: 0}}.
5. Validate: CDX sidecar created, revisit records for dedup, GZIP compression.
Enrichment Pipeline
1. trigger_capture(url) — Run WebKit capture.
2. get_enrichment(url) — Read framework detection, network, console, DOM.
3. create_finding — Record insights with evidence.
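Sketched against the same hypothetical `call_tool(name, args)` helper; the enrichment response shape and the finding field values here are illustrative assumptions:

```python
def enrich_and_record(call_tool, url):
    """Capture -> read enrichment -> record a finding (call_tool is a
    hypothetical client helper; response shapes are assumed)."""
    call_tool("trigger_capture", {"url": url})
    data = call_tool("get_enrichment", {"url": url})
    framework = (data.get("framework") or {}).get("name", "unknown")
    return call_tool("create_finding", {
        "title": f"Framework detected on {url}: {framework}",
        "url": url,
        "confidence": "medium",
    })
```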
Error Recovery
1. get_failed_urls — List failures.
2. recrawl_urls — Retry failed URLs.
3. get_crawl_status — Poll until re-crawl completes.
4. get_failed_urls — Check remaining failures.
Status Polling Pattern
1. status = get_crawl_status()
2. seq = status.seq
3. Loop:
status = get_crawl_status(since: seq)
if status != "no changes": update seq, check engineState
sleep 5s
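The loop above as a small helper. The callable passed in and the exact "no changes" sentinel are assumptions about the client binding:

```python
import time

def poll_until_done(get_crawl_status, done_states=("completed",), interval=5, sleep=time.sleep):
    """Status-polling pattern: pass the last seen seq as `since` so unchanged
    polls can return a cheap "no changes" sentinel."""
    status = get_crawl_status()
    seq = status.get("seq")
    while status.get("engineState") not in done_states:
        sleep(interval)
        nxt = get_crawl_status(since=seq)
        if nxt != "no changes":
            status = nxt
            seq = status.get("seq", seq)
    return status
```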
More from crawlio-app/crawlio-plugin
crawl-site
Use this skill when the user asks to "crawl a site", "download a website", "mirror a site", "scrape a site", or wants to download web pages for offline access or analysis. Configures Crawlio settings based on site type, starts the crawl, monitors progress, and reports results.
audit-site
Use this skill when the user asks to "audit a site", "analyze a website", "review a site", "site health check", or wants a comprehensive analysis including technology stack, issues, and recommendations. Orchestrates a full crawl, enrichment capture, observation analysis, and findings report.
observe
Use this skill when the user asks to "check observations", "what did Crawlio see", "show crawl timeline", "query the observation log", or wants to review what happened during a crawl session. Queries the append-only observation log with filtering by host, source, operation, and time range.
finding
Use this skill when the user asks to "create a finding", "record an insight", "what findings exist", "show findings", or wants to create or review evidence-backed analysis insights from crawl observations. Creates and queries curated findings with evidence chains.
web-research
Use this skill when the user asks to "research a site", "compare sites", "analyze technology", or wants structured evidence-based web research. Teaches the acquire-normalize-analyze protocol using CrawlioMCP's composite analysis tools.
extract-and-export
Use this skill when the user asks to "download and export a site", "crawl and extract content", "archive a website", "export as WARC/ZIP/PDF", or wants a complete crawl-extract-export pipeline. Crawls the site, extracts structured content, and exports in the requested format.