skills/besoeasy/open-skills/crawl-websites-at-scale

crawl-websites-at-scale

SKILL.md

Scrapy Web Scraping Skill

Scrapy is a fast, high-level Python web crawling and scraping framework. It enables structured data extraction from websites, supports crawling entire sites, and integrates pipelines to process and store scraped data.

When to use

  • Crawl entire websites or follow links across many pages
  • Extract structured data (prices, articles, product listings) into JSON/CSV
  • Run scheduled or large-scale scraping pipelines
  • Need built-in support for request throttling, retries, and middlewares

Required tools / APIs

  • No external API required
  • Python 3.8+ required
  • Scrapy: Web crawling and scraping framework

Install options:

# pip
pip install scrapy

# Ubuntu/Debian
sudo apt-get install -y python3-pip && pip install scrapy

# macOS
brew install python && pip install scrapy

# Verify installation
scrapy version

Skills

basic_usage

Create and run a simple Scrapy spider to scrape a single page.

# Create a new Scrapy project
scrapy startproject myproject
cd myproject

# Generate a spider
scrapy genspider quotes quotes.toscrape.com

# Run the spider and save to JSON
scrapy crawl quotes -o output.json

# Run the spider and save to CSV
scrapy crawl quotes -o output.csv

Python spider (quotes.py):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }

        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

robust_usage

Production-oriented spider with settings, item pipelines, and error handling.

# Run with custom settings (rate limiting, retries)
scrapy crawl quotes \
  -s DOWNLOAD_DELAY=1 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s RETRY_TIMES=3 \
  -o output.json

# Run from a script (no project required)
scrapy runspider spider.py -o output.json

Python with error handling and structured items:

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

class ArticleSpider(scrapy.Spider):
    name = "articles"
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "open-skills-bot/1.0 (+https://github.com/besoeasy/open-skills)",
        "RETRY_TIMES": 3,
        "FEEDS": {"output.json": {"format": "json"}},
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url or "https://quotes.toscrape.com"]

    def parse(self, response):
        for article in response.css("article, div.post, div.entry"):
            yield {
                "url": response.url,
                "title": article.css("h1::text, h2::text").get("").strip(),
                "body": " ".join(article.css("p::text").getall()),
            }

        for link in response.css("a::attr(href)").getall():
            if link.startswith("/") or response.url in link:
                yield response.follow(link, self.parse)

    def errback(self, failure):
        self.logger.error(f"Request failed: {failure.request.url}{failure.value}")


# Run without a Scrapy project
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ArticleSpider, start_url="https://quotes.toscrape.com")
    process.start()

extract_with_xpath

Use XPath selectors for precise extraction from complex HTML structures.

import scrapy

class XPathSpider(scrapy.Spider):
    name = "xpath_example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath(".//span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
                "tags": quote.xpath(".//a[@class='tag']/text()").getall(),
            }

Output format

Scrapy yields Python dicts (or Item objects) per scraped record. When saved to file:

  • output.json — Array of JSON objects, one per item
  • output.csv — CSV with headers matching dict keys
  • output.jsonl — One JSON object per line (memory-efficient for large crawls)

Example item:

{
  "text": "The world as we have created it is a process of our thinking.",
  "author": "Albert Einstein",
  "tags": ["change", "deep-thoughts", "thinking", "world"]
}

Error shape: Scrapy logs errors to stderr; unhandled HTTP errors trigger the errback method if defined.

Rate limits / Best practices

  • Enable ROBOTSTXT_OBEY = True to respect robots.txt automatically
  • Set DOWNLOAD_DELAY (seconds between requests) to avoid overloading servers
  • Enable AUTOTHROTTLE_ENABLED = True for adaptive rate limiting
  • Set a descriptive USER_AGENT identifying your bot
  • Use CONCURRENT_REQUESTS_PER_DOMAIN = 1 for polite single-domain crawling
  • Cache responses during development: HTTPCACHE_ENABLED = True

Agent prompt

You have scrapy web-scraping capability. When a user asks to scrape or crawl a website:

1. Confirm the target URL and data fields to extract (e.g., title, price, link)
2. Create a Scrapy spider using CSS or XPath selectors to target those fields
3. Enable ROBOTSTXT_OBEY=True and set DOWNLOAD_DELAY>=1 to be polite
4. Follow pagination links if the user needs data across multiple pages
5. Save results to output.json or output.csv

Always identify your bot with a descriptive USER_AGENT and never scrape login-protected or paywalled content.

Troubleshooting

Error: "Forbidden by robots.txt"

  • Symptom: Spider skips URLs and logs "Forbidden by robots.txt"
  • Solution: Review the site's robots.txt; only scrape paths that are allowed, or set ROBOTSTXT_OBEY = False if you have explicit permission from the site owner

Error: "Empty or missing data"

  • Symptom: Items are yielded with empty strings or None values
  • Solution: Inspect the page source (scrapy shell <url>) and adjust your CSS/XPath selectors to match the actual HTML structure

Error: "Too many redirects / 429 Too Many Requests"

  • Symptom: Requests fail with HTTP 429 or redirect loops
  • Solution: Increase DOWNLOAD_DELAY, enable AUTOTHROTTLE_ENABLED = True, or add a Retry-After respecting middleware

Error: "JavaScript-rendered content not found"

  • Symptom: Expected data is missing because the site uses client-side rendering
  • Solution: Use scrapy-playwright or scrapy-splash middleware to render JavaScript before parsing

See also

Weekly Installs
1
GitHub Stars
89
First Seen
1 day ago
Installed on
amp1
cline1
opencode1
cursor1
kimi-cli1
codex1