# crawl-site
Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.
## When to Use
Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.
## Workflow
### 1. Determine Site Type
Before configuring settings, identify the site type. Ask the user or infer from context:
| Site Type | Indicators | Recommended Settings |
|---|---|---|
| Static site | HTML/CSS, no JS frameworks | `maxDepth: 5`, `maxConcurrent: 8` |
| SPA (React, Vue, etc.) | JS-heavy, client-side routing | `maxDepth: 3`, `includeSupportingFiles: true`; consider running crawlio-agent for enrichment first |
| CMS (WordPress, etc.) | `/wp-content/`, admin paths | `maxDepth: 5`, `excludePatterns: ["/wp-admin/*", "/wp-json/*"]` |
| Documentation site | `/docs/`, versioned paths | `maxDepth: 10`, `excludePatterns: ["/v[0-9]*/*"]` for old versions |
| Single-page snapshot | User wants just one page | `maxDepth: 0`, `includeSupportingFiles: true` |
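For instance, the single-page snapshot row maps onto the tools like this (a minimal sketch; the URL is a placeholder, and the parameter placement follows the `update_settings` example in step 2):

```javascript
update_settings({
  policy: {
    maxDepth: 0,                  // do not follow any links
    includeSupportingFiles: true  // still fetch the page's CSS, JS, and images
  }
})
start_crawl({ url: "https://example.com" })
```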
### 2. Configure Settings

Use `update_settings` to set an appropriate configuration:

```javascript
update_settings({
  settings: {
    maxConcurrent: 4,         // parallel downloads (increase for large sites)
    crawlDelay: 0.5,          // seconds between requests (be polite)
    timeout: 60,              // request timeout
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,  // fetch CDN-hosted assets
    autoUpgradeHTTP: true             // upgrade to HTTPS where possible
  }
})
```
### 3. Start the Crawl

```javascript
start_crawl({ url: "https://example.com" })
```

For multi-page targeted downloads:

```javascript
start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })
```
### 4. Monitor Progress

Poll `get_crawl_status`, passing the last-seen sequence number so only changes are reported:

```javascript
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150
get_crawl_status({ since: 42 })
// Returns: "No changes" or the updated status
```
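A polling loop built on that pattern might look like the sketch below. The field names (`seq`, `downloaded`, `total`), the `"No changes"` sentinel, and the completion check are all assumptions inferred from the sample output above, not a documented schema:

```javascript
// Hypothetical polling loop; field names and completion condition are assumptions.
let lastSeq = 0;
let done = false;
while (!done) {
  const status = get_crawl_status({ since: lastSeq });
  if (status !== "No changes") {
    lastSeq = status.seq;                       // remember the newest sequence number
    done = status.downloaded === status.total;  // assumed completion condition
  }
  // sleep briefly between polls to avoid hammering the server
}
```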
### 5. Check for Issues

After the crawl completes:

```javascript
get_failed_urls()  // any failures to retry?
get_errors()       // any engine errors?
get_site_tree()    // what was downloaded?
```
### 6. Retry Failures (if any)

```javascript
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
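To retry every failure in one pass, the two tools can be chained. This is a sketch only; it assumes `get_failed_urls` returns a plain array of URL strings, which this document does not specify:

```javascript
// Assumption: get_failed_urls() yields an array of URL strings.
const failed = get_failed_urls();
if (failed.length > 0) {
  recrawl_urls({ urls: failed });
}
```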
### 7. Report Results
Summarize the results: pages downloaded, failures, site structure, and any notable findings.
## Tips

- For large sites (1000+ pages), set `maxPagesPerCrawl` to avoid runaway crawls.
- Use `excludePatterns` to skip known junk paths (admin panels, API routes, search results).
- If a site requires authentication, set `customCookies` or `customHeaders` in settings (see the sketch after this list).
- For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content.
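A minimal authentication sketch, assuming `customHeaders` and `customCookies` sit under `settings` alongside the other options; their exact shape is not documented here:

```javascript
update_settings({
  settings: {
    // Assumed shapes; adjust to the actual Crawlio schema.
    customHeaders: { "Authorization": "Bearer <token>" },
    customCookies: "session=<value>"
  }
})
```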
## More from crawlio-app/crawlio-plugin

audit-site
Use this skill when the user asks to "audit a site", "analyze a website", "review a site", "site health check", or wants a comprehensive analysis including technology stack, issues, and recommendations. Orchestrates a full crawl, enrichment capture, observation analysis, and findings report.

observe
Use this skill when the user asks to "check observations", "what did Crawlio see", "show crawl timeline", "query the observation log", or wants to review what happened during a crawl session. Queries the append-only observation log with filtering by host, source, operation, and time range.

finding
Use this skill when the user asks to "create a finding", "record an insight", "what findings exist", "show findings", or wants to create or review evidence-backed analysis insights from crawl observations. Creates and queries curated findings with evidence chains.

crawlio-mcp
Complete reference for the Crawlio MCP server: 37 tools, 6 code-mode tools, 4 resources, 4 prompts. Use this skill when orchestrating website crawling, export, enrichment, or analysis via Crawlio MCP.

web-research
Use this skill when the user asks to "research a site", "compare sites", "analyze technology", or wants structured evidence-based web research. Teaches the acquire-normalize-analyze protocol using CrawlioMCP's composite analysis tools.

extract-and-export
Use this skill when the user asks to "download and export a site", "crawl and extract content", "archive a website", "export as WARC/ZIP/PDF", or wants a complete crawl-extract-export pipeline. Crawls the site, extracts structured content, and exports in the requested format.