# extract-and-export
Complete crawl-extract-export pipeline. Crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in any of 7 formats.
## When to Use
Use this skill when the user wants to download a site AND get usable output — not just a raw crawl, but extracted content ready for consumption, archival, or deployment.
For crawl-only workflows (no extraction or export), use crawl-site instead.
## Arguments
- `$0` (required): The URL to crawl
- `$1` (optional): Maximum crawl depth (default: 3)
- `$2` (optional): Export format (default: `folder`)
## Export Formats
| Format | Description |
|---|---|
| `folder` | Mirror on disk with original directory structure |
| `zip` | Compressed archive, ready to share |
| `singleHTML` | All assets inlined into a single HTML file |
| `warc` | ISO 28500 web archive standard |
| `pdf` | Rendered pages as a portable document |
| `extracted` | Structured data only: clean HTML, markdown, metadata, no raw assets |
| `deploy` | Production-ready bundle with `crawl-manifest.json` |
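For instance, an archival run would pick the WARC format when calling `export_site` in step 7 of the workflow below:

```
// Choose a format from the table above when exporting (see step 7).
export_site({ format: "warc" })
```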
## Workflow

### 1. Configure Settings

```
update_settings({
  settings: {
    maxConcurrent: 4,
    crawlDelay: 0.5,
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: $1 or 3,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,
    autoUpgradeHTTP: true
  }
})
```
Adjust based on site size (a sketch for the large-site case follows this list):

- Small site (<100 pages): `maxDepth: 10`, `maxConcurrent: 8`
- Medium site (100-1000): `maxDepth: 5`, `maxConcurrent: 4`
- Large site (1000+): `maxDepth: 3`, `maxPagesPerCrawl: 500`
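For example, a run against a large site might tighten the limits like this. This is a minimal sketch; it assumes `maxPagesPerCrawl` is accepted in the `policy` block alongside `maxDepth`.

```
// Sketch only: constrain a 1000+ page crawl.
// Assumes maxPagesPerCrawl belongs in policy; move it to settings if
// your Crawlio version expects it there.
update_settings({
  settings: { maxConcurrent: 4, crawlDelay: 0.5 },
  policy: { maxDepth: 3, maxPagesPerCrawl: 500 }
})
```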
### 2. Start the Crawl

```
start_crawl({ url: "$0" })
```
### 3. Monitor Progress

Poll `get_crawl_status` with the `since` parameter for efficient change detection:

```
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
### 4. Check for Issues

After the crawl completes:

```
get_failed_urls()  // Any failures to retry?
get_errors()       // Any engine errors?
```

Retry transient failures:

```
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
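When several pages failed for transient reasons (timeouts, temporary server errors), the two calls can be chained. The sketch below assumes each entry returned by `get_failed_urls` carries a `url` field; adapt the mapping to the actual response shape.

```
// Sketch: retry everything that failed, assuming entries expose a url field.
const failed = get_failed_urls()
if (failed.length > 0) {
  recrawl_urls({ urls: failed.map(f => f.url) })
}
```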
### 5. Review What Was Downloaded

```
get_site_tree()   // File structure overview
get_downloads()   // Detailed download info with content types
```
### 6. Extract Content

```
extract_site()
```
This runs the extraction pipeline and produces per-page artifacts:
- Clean HTML (tracking scripts removed)
- Markdown conversion
- Metadata (title, description, headings, links)
- Asset manifests
Poll `get_extraction_status` if the extraction takes time; a polling sketch follows.
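The sketch below waits for extraction to finish before moving on to export. The `complete` and `pagesProcessed` fields are assumptions for illustration, not documented response fields.

```
// Sketch: block until extraction reports completion.
// complete and pagesProcessed are assumed field names.
let extraction = get_extraction_status()
while (!extraction.complete) {
  // pause briefly between polls, then re-check
  extraction = get_extraction_status()
}
// extraction.pagesProcessed now reflects how many pages were extracted
```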
### 7. Export

```
export_site({ format: "$2" or "folder" })
```

Poll `get_export_status` for large exports.
### 8. Report Results
Summarize:
- Crawl: Total pages discovered, downloaded, failed
- Extraction: Pages processed, artifacts created
- Export: Format, location, file size
- Issues: Any errors or notable findings
## Tips

- For archival workflows, use `warc`: it's the ISO 28500 standard and preserves full HTTP headers
- For AI consumption, use `extracted`: just the structured data, no raw assets
- For sharing, use `zip`: compressed and portable
- For deployment, use `deploy`: includes `crawl-manifest.json` with full metadata
- For large sites, set `maxPagesPerCrawl` to avoid runaway crawls
- Save the project after export for future reference: `save_project({ name: "example.com export" })`
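Putting the steps together, a minimal happy-path run looks roughly like the sketch below. It omits the monitoring and error handling from steps 3-4; treat it as a starting point, not a definitive sequence.

```
// Sketch of the full pipeline on its happy path; polling and failure
// handling from steps 3-4 are omitted for brevity.
update_settings({ policy: { scopeMode: "sameDomain", maxDepth: 3 } })
start_crawl({ url: "https://example.com" })   // step 2
// ...wait for the crawl to finish (step 3)...
extract_site()                                // step 6
export_site({ format: "zip" })                // step 7
save_project({ name: "example.com export" })  // keep the project around
```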