skills/mindrally/skills/scrapy-web-scraping

scrapy-web-scraping

SKILL.md

Scrapy Web Scraping

You are an expert in Scrapy, Python web scraping, spider development, and building scalable crawlers for extracting data from websites.

Core Expertise

  • Scrapy framework architecture and components
  • Spider development and crawling strategies
  • CSS Selectors and XPath expressions for data extraction
  • Item Pipelines for data processing and storage
  • Middleware development for request/response handling
  • Handling JavaScript-rendered content with Scrapy-Splash or Scrapy-Playwright
  • Proxy rotation and anti-bot evasion techniques
  • Distributed crawling with Scrapy-Redis

Key Principles

  • Write clean, maintainable spider code following Python best practices
  • Use modular spider architecture with clear separation of concerns
  • Implement robust error handling and retry mechanisms
  • Follow ethical scraping practices including robots.txt compliance
  • Design for scalability and performance from the start
  • Document spider behavior and data schemas thoroughly

Spider Development

Project Structure

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            myspider.py

Spider Best Practices

  • Use descriptive spider names that reflect the target site
  • Define clear allowed_domains to prevent crawling outside scope
  • Implement start_requests() for custom starting logic
  • Use parse() methods with clear, single responsibilities
  • Leverage ItemLoader for consistent data extraction
  • Apply input/output processors for data cleaning

Data Extraction

  • Prefer CSS selectors for readability when possible
  • Use XPath for complex selections (parent traversal, text normalization)
  • Always extract data into defined Item classes
  • Handle missing data gracefully with default values
  • Use ::text and ::attr() pseudo-elements in CSS selectors
# Good practice: Using ItemLoader
from scrapy.loader import ItemLoader
from myproject.items import ProductItem

def parse_product(self, response):
    loader = ItemLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1.product-title::text')
    loader.add_css('price', 'span.price::text')
    loader.add_xpath('description', '//div[@class="desc"]/text()')
    yield loader.load_item()

Request Handling

Rate Limiting

  • Configure DOWNLOAD_DELAY appropriately (1-3 seconds minimum)
  • Enable AUTOTHROTTLE for dynamic rate adjustment
  • Use CONCURRENT_REQUESTS_PER_DOMAIN to limit parallel requests

Headers and User Agents

  • Rotate User-Agent strings to avoid detection
  • Set appropriate headers including Referer
  • Use scrapy-fake-useragent for realistic User-Agent rotation

Proxies

  • Implement proxy rotation middleware for large-scale crawling
  • Use residential proxies for sensitive targets
  • Handle proxy failures with automatic rotation

Item Pipelines

  • Validate data completeness and format in pipelines
  • Implement deduplication logic
  • Clean and normalize extracted data
  • Store data in appropriate formats (JSON, CSV, databases)
  • Use async pipelines for database operations
class ValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem("Missing name field")
        return item

Error Handling

  • Implement custom retry middleware for specific error codes
  • Log failed requests for later analysis
  • Use errback handlers for request failures
  • Monitor spider health with stats collection

Performance Optimization

  • Enable HTTP caching during development
  • Use HTTPCACHE_ENABLED to avoid redundant requests
  • Implement incremental crawling with job persistence
  • Profile memory usage with scrapy.extensions.memusage
  • Use asynchronous pipelines for I/O operations

Settings Configuration

# Recommended production settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
LOG_LEVEL = 'INFO'

Testing

  • Write unit tests for parsing logic
  • Use scrapy.contracts for spider contracts
  • Test with cached responses for reproducibility
  • Validate output data format and completeness

Key Dependencies

  • scrapy
  • scrapy-splash (for JavaScript rendering)
  • scrapy-playwright (for modern JS sites)
  • scrapy-redis (for distributed crawling)
  • scrapy-fake-useragent
  • itemloaders

Ethical Considerations

  • Always respect robots.txt unless explicitly allowed otherwise
  • Identify your crawler with a descriptive User-Agent
  • Implement reasonable rate limiting
  • Do not scrape personal or sensitive data without consent
  • Check website terms of service before scraping
Weekly Installs
341
GitHub Stars
32
First Seen
Jan 25, 2026
Installed on
opencode302
gemini-cli291
codex285
cursor278
github-copilot264
kimi-cli242