crawl-site
SKILL.md
crawl-site
Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.
When to Use
Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.
Workflow
1. Determine Site Type
Before configuring settings, identify the site type. Ask the user or infer from context:
| Site Type | Indicators | Recommended Settings |
|---|---|---|
| Static site | HTML/CSS, no JS frameworks | maxDepth: 5, maxConcurrent: 8 |
| SPA (React, Vue, etc.) | JS-heavy, client-side routing | maxDepth: 3, includeSupportingFiles: true, consider using crawlio-agent for enrichment first |
| CMS (WordPress, etc.) | /wp-content/, admin paths |
maxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"] |
| Documentation site | /docs/, versioned paths |
maxDepth: 10, excludePatterns: ["/v[0-9]*/*"] for old versions |
| Single page snapshot | User wants just one page | maxDepth: 0, includeSupportingFiles: true |
2. Configure Settings
Use update_settings to set appropriate configuration:
update_settings({
settings: {
maxConcurrent: 4, // Parallel downloads (increase for large sites)
crawlDelay: 0.5, // Be polite — seconds between requests
timeout: 60, // Request timeout
stripTrackingParams: true
},
policy: {
scopeMode: "sameDomain",
maxDepth: 5,
respectRobotsTxt: true,
includeSupportingFiles: true,
downloadCrossDomainAssets: true, // Get CDN assets
autoUpgradeHTTP: true // Use HTTPS
}
})
3. Start the Crawl
start_crawl({ url: "https://example.com" })
For multi-page targeted downloads:
start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })
4. Monitor Progress
Poll get_crawl_status with the sequence number for efficient change detection:
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150
get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
5. Check for Issues
After crawl completes:
get_failed_urls() // Any failures to retry?
get_errors() // Any engine errors?
get_site_tree() // What was downloaded?
6. Retry Failures (if any)
recrawl_urls({ urls: ["https://example.com/failed-page"] })
7. Report Results
Summarize: pages downloaded, failures, site structure, any notable findings.
Tips
- For large sites (1000+ pages), set
maxPagesPerCrawlto avoid runaway crawls - Use
excludePatternsto skip known junk paths (admin panels, API routes, search results) - If a site requires authentication, set
customCookiesorcustomHeadersin settings - For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content
Weekly Installs
15
Repository
crawlio-app/cra…o-pluginFirst Seen
Feb 21, 2026
Security Audits
Installed on
opencode14
github-copilot14
codex14
amp14
cline14
kimi-cli14