crawl-site


Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.

When to Use

Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.

Workflow

1. Determine Site Type

Before configuring settings, identify the site type. Ask the user or infer from context:

| Site Type | Indicators | Recommended Settings |
| --- | --- | --- |
| Static site | HTML/CSS, no JS frameworks | `maxDepth: 5`, `maxConcurrent: 8` |
| SPA (React, Vue, etc.) | JS-heavy, client-side routing | `maxDepth: 3`, `includeSupportingFiles: true`; consider using crawlio-agent for enrichment first |
| CMS (WordPress, etc.) | `/wp-content/`, admin paths | `maxDepth: 5`, `excludePatterns: ["/wp-admin/*", "/wp-json/*"]` |
| Documentation site | `/docs/`, versioned paths | `maxDepth: 10`, `excludePatterns: ["/v[0-9]*/*"]` for old versions |
| Single page snapshot | User wants just one page | `maxDepth: 0`, `includeSupportingFiles: true` |
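For instance, the CMS row above might translate into a settings call like this before starting the crawl (a sketch using the policy fields shown in step 2; adjust patterns to the actual site):

```javascript
update_settings({
  policy: {
    maxDepth: 5,
    excludePatterns: ["/wp-admin/*", "/wp-json/*"],  // skip admin panel and REST API paths
    respectRobotsTxt: true
  }
})
```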

2. Configure Settings

Use update_settings to set appropriate configuration:

```javascript
update_settings({
  settings: {
    maxConcurrent: 4,      // Parallel downloads (increase for large sites)
    crawlDelay: 0.5,       // Be polite: seconds between requests
    timeout: 60,           // Request timeout
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,  // Get CDN assets
    autoUpgradeHTTP: true             // Use HTTPS
  }
})
```

3. Start the Crawl

```javascript
start_crawl({ url: "https://example.com" })
```

For multi-page targeted downloads:

```javascript
start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })
```

4. Monitor Progress

Poll get_crawl_status with the sequence number for efficient change detection:

```javascript
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
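The seq-based polling above can be sketched as a loop. This is a minimal illustration, not the actual Crawlio interface: it assumes a Python-callable `get_crawl_status` and a hypothetical return shape (`seq`, `state`, `changed`); in practice the agent invokes the tool directly.

```python
import time

def wait_for_crawl(get_crawl_status, poll_interval=2.0, max_polls=100):
    """Poll until the crawl finishes, using the sequence number to
    skip snapshots that contain no new information.

    Assumed (hypothetical) return shapes:
      {"seq": 42, "state": "running", "downloaded": 85, "total": 150}
      {"changed": False}   # nothing new since the given seq
    """
    last_seq = None
    status = None
    for _ in range(max_polls):
        snapshot = get_crawl_status(since=last_seq)
        if snapshot.get("changed", True):   # new information arrived
            status = snapshot
            last_seq = snapshot["seq"]
            if snapshot.get("state") in ("done", "failed"):
                return status
        time.sleep(poll_interval)
    return status  # may still be running after max_polls
```

Passing `since` keeps the polling cheap: unchanged snapshots come back as a tiny "no changes" response instead of the full status payload.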

5. Check for Issues

After crawl completes:

```javascript
get_failed_urls()     // Any failures to retry?
get_errors()          // Any engine errors?
get_site_tree()       // What was downloaded?
```

6. Retry Failures (if any)

```javascript
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```

7. Report Results

Summarize: pages downloaded, failures, site structure, any notable findings.
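A final report might look like the following (all numbers and paths are illustrative):

```
Crawl complete: https://example.com
- Pages downloaded: 148/150
- Failures: 2 (timeouts; retried once via recrawl_urls, still failing)
- Scope: same domain, maxDepth 5, supporting files included
- Notable: most content lives under /blog/; see get_site_tree() for the full structure
```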

Tips

- For large sites (1000+ pages), set `maxPagesPerCrawl` to avoid runaway crawls
- Use `excludePatterns` to skip known junk paths (admin panels, API routes, search results)
- If a site requires authentication, set `customCookies` or `customHeaders` in settings
- For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content
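The authentication tip above might look like this in practice. All values are placeholders, and the exact shapes expected for `customCookies` and `customHeaders` are assumptions; check the Crawlio settings reference:

```javascript
update_settings({
  settings: {
    customCookies: "session_id=abc123; csrf_token=xyz",    // placeholder cookie string
    customHeaders: { "Authorization": "Bearer <token>" }   // placeholder, for token-based auth
  }
})
```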