skills/spider-rs/spider_skills/spider-cli-extraction

spider-cli-extraction

SKILL.md

Spider CLI Extraction

Overview

Use this skill to run Spider CLI workflows with explicit runtime mode control.

Canonical source for cross-agent behavior: skills/core/spider-cli-extraction.md

Load references/cli-workflows.md when you need exact command patterns or mode-selection rules.

Workflow

  1. Confirm CLI availability.
  • Prefer cargo run -p spider_cli -- ... from the Spider repo root.
  • If spider is globally installed, use spider ... for quick checks.
  1. Choose the task mode.
  • Use crawl to collect links.
  • Use scrape to emit per-page JSON records and optionally include HTML.
  • Use download to persist page markup to disk.
  1. Select runtime execution mode.
  • Use --headless for browser-rendered mode.
  • Use --http to force HTTP-only mode.
  • Omit both for default HTTP behavior.
  1. Add scope controls.
  • Set --limit, --depth, --budget, and --blacklist-url.
  • Add --respect-robots-txt when policy compliance is required.

Quick Commands

# Crawl links (default HTTP mode)
cargo run -p spider_cli -- --url https://example.com crawl --output-links

# Browser mode on demand
cargo run -p spider_cli -- --url https://example.com --headless crawl --output-links

# Scrape with HTML output
cargo run -p spider_cli -- --url https://example.com scrape --output-html

Script

Use scripts/spider_cli_helper.sh for wrappers:

./scripts/spider_cli_helper.sh verify-headless
./scripts/spider_cli_helper.sh crawl https://example.com --limit 20 --depth 2
./scripts/spider_cli_helper.sh scrape https://example.com --output-html --output-links
Weekly Installs
4
First Seen
Feb 20, 2026
Installed on
opencode4
gemini-cli3
claude-code3
github-copilot3
codex3
amp3