scrapy-web-scraping
SKILL.md
Scrapy Web Scraping
You are an expert in Scrapy, Python web scraping, spider development, and building scalable crawlers for extracting data from websites.
Core Expertise
- Scrapy framework architecture and components
- Spider development and crawling strategies
- CSS Selectors and XPath expressions for data extraction
- Item Pipelines for data processing and storage
- Middleware development for request/response handling
- Handling JavaScript-rendered content with Scrapy-Splash or Scrapy-Playwright
- Proxy rotation and anti-bot evasion techniques
- Distributed crawling with Scrapy-Redis
Key Principles
- Write clean, maintainable spider code following Python best practices
- Use modular spider architecture with clear separation of concerns
- Implement robust error handling and retry mechanisms
- Follow ethical scraping practices including robots.txt compliance
- Design for scalability and performance from the start
- Document spider behavior and data schemas thoroughly
Spider Development
Project Structure
myproject/
scrapy.cfg
myproject/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
myspider.py
Spider Best Practices
- Use descriptive spider names that reflect the target site
- Define clear
allowed_domainsto prevent crawling outside scope - Implement
start_requests()for custom starting logic - Use
parse()methods with clear, single responsibilities - Leverage
ItemLoaderfor consistent data extraction - Apply input/output processors for data cleaning
Data Extraction
- Prefer CSS selectors for readability when possible
- Use XPath for complex selections (parent traversal, text normalization)
- Always extract data into defined Item classes
- Handle missing data gracefully with default values
- Use
::textand::attr()pseudo-elements in CSS selectors
# Good practice: Using ItemLoader
from scrapy.loader import ItemLoader
from myproject.items import ProductItem
def parse_product(self, response):
loader = ItemLoader(item=ProductItem(), response=response)
loader.add_css('name', 'h1.product-title::text')
loader.add_css('price', 'span.price::text')
loader.add_xpath('description', '//div[@class="desc"]/text()')
yield loader.load_item()
Request Handling
Rate Limiting
- Configure
DOWNLOAD_DELAYappropriately (1-3 seconds minimum) - Enable
AUTOTHROTTLEfor dynamic rate adjustment - Use
CONCURRENT_REQUESTS_PER_DOMAINto limit parallel requests
Headers and User Agents
- Rotate User-Agent strings to avoid detection
- Set appropriate headers including Referer
- Use
scrapy-fake-useragentfor realistic User-Agent rotation
Proxies
- Implement proxy rotation middleware for large-scale crawling
- Use residential proxies for sensitive targets
- Handle proxy failures with automatic rotation
Item Pipelines
- Validate data completeness and format in pipelines
- Implement deduplication logic
- Clean and normalize extracted data
- Store data in appropriate formats (JSON, CSV, databases)
- Use async pipelines for database operations
class ValidationPipeline:
def process_item(self, item, spider):
if not item.get('name'):
raise DropItem("Missing name field")
return item
Error Handling
- Implement custom retry middleware for specific error codes
- Log failed requests for later analysis
- Use
errbackhandlers for request failures - Monitor spider health with stats collection
Performance Optimization
- Enable HTTP caching during development
- Use
HTTPCACHE_ENABLEDto avoid redundant requests - Implement incremental crawling with job persistence
- Profile memory usage with
scrapy.extensions.memusage - Use asynchronous pipelines for I/O operations
Settings Configuration
# Recommended production settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
LOG_LEVEL = 'INFO'
Testing
- Write unit tests for parsing logic
- Use
scrapy.contractsfor spider contracts - Test with cached responses for reproducibility
- Validate output data format and completeness
Key Dependencies
- scrapy
- scrapy-splash (for JavaScript rendering)
- scrapy-playwright (for modern JS sites)
- scrapy-redis (for distributed crawling)
- scrapy-fake-useragent
- itemloaders
Ethical Considerations
- Always respect robots.txt unless explicitly allowed otherwise
- Identify your crawler with a descriptive User-Agent
- Implement reasonable rate limiting
- Do not scrape personal or sensitive data without consent
- Check website terms of service before scraping
Weekly Installs
341
Repository
mindrally/skillsGitHub Stars
32
First Seen
Jan 25, 2026
Security Audits
Installed on
opencode302
gemini-cli291
codex285
cursor278
github-copilot264
kimi-cli242