
web-scraping

Summary

Web scraping and data extraction using Python tools for static, dynamic, and large-scale content.

  • Supports static sites via requests and BeautifulSoup, dynamic content via Selenium and Playwright, and large-scale extraction via Scrapy and firecrawl
  • Includes specialized tools for AI-powered extraction (jina), structured queries (agentQL), and complex automation workflows (multion)
  • Built-in guidance on rate limiting, robots.txt compliance, error handling, session management, and pagination
  • Covers data processing tasks: cleaning, validation, encoding handling, deduplication, and efficient storage
SKILL.md

Web Scraping

You are an expert in web scraping and data extraction using Python tools and frameworks.

Core Tools

Static Sites

  • Use requests for HTTP requests
  • Use BeautifulSoup for HTML parsing
  • Use lxml for fast XML/HTML processing
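For static pages, the pattern is a single HTTP fetch followed by parsing. A minimal sketch, assuming `requests` and `beautifulsoup4` are installed; the URL in the `__main__` guard is a placeholder:

```python
# Minimal static-site extraction with requests + BeautifulSoup.
# The parsing helper is pure and works on any HTML string.
import requests
from bs4 import BeautifulSoup

def extract_links(html: str) -> list[str]:
    # "html.parser" is the stdlib parser; pass "lxml" instead
    # for faster parsing if lxml is installed.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a[href]")]

if __name__ == "__main__":
    resp = requests.get("https://example.com", timeout=10)  # placeholder URL
    resp.raise_for_status()
    print(extract_links(resp.text))
```

Keeping the parsing logic in a pure function makes it easy to unit-test against HTML fixtures without touching the network.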

Dynamic Content

  • Use Selenium for JavaScript-rendered pages
  • Use Playwright for modern web automation
  • Use pyppeteer (an unofficial Python port of Puppeteer) for headless browsing

Large-Scale Extraction

  • Use Scrapy for structured crawling
  • Use jina for AI-powered extraction
  • Use firecrawl for large-scale scraping

Complex Workflows

  • Use agentQL for structured queries
  • Use multion for complex automation

Best Practices

  • Implement rate limiting and delays
  • Respect robots.txt
  • Set a descriptive User-Agent header that identifies your scraper
  • Handle errors gracefully
  • Implement retry logic
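The rate-limiting, robots.txt, and User-Agent bullets above can be sketched with the stdlib `urllib.robotparser` plus a simple delay. The bot name, contact address, and delay value below are illustrative placeholders:

```python
# Sketch: robots.txt compliance plus per-request throttling, stdlib only.
import time
from urllib.robotparser import RobotFileParser

class PoliteScraper:
    def __init__(self,
                 user_agent="MyScraperBot/1.0 (contact@example.com)",  # placeholder
                 min_delay=1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._last_request = 0.0

    def allowed(self, robots_lines: list[str], url: str) -> bool:
        # robots_lines is the fetched robots.txt split into lines.
        rp = RobotFileParser()
        rp.parse(robots_lines)
        return rp.can_fetch(self.user_agent, url)

    def throttle(self) -> None:
        # Sleep just long enough to keep min_delay between requests.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()
```

Call `throttle()` before each request and check `allowed()` once per path prefix; retry logic layers on top (see the session example under Error Handling patterns).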

Error Handling

  • Handle network timeouts
  • Detect and back off from blocked requests (e.g. HTTP 403/429)
  • Manage session cookies
  • Handle pagination properly
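Timeouts, retries with backoff, and cookie persistence can all hang off one `requests.Session`. A sketch assuming `requests` is installed; the `next` pagination field is a hypothetical API shape:

```python
# Sketch: a Session with automatic retries on transient failures.
# Cookies set by the server persist across requests on the same Session.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total: int = 3, backoff: float = 0.5) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=total,
        backoff_factor=backoff,  # 0.5s, 1s, 2s, ... between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

def fetch_all_pages(session: requests.Session, url: str, max_pages: int = 100) -> list:
    # Assumes a JSON API exposing "items" and a "next" URL (hypothetical shape).
    results = []
    while url and max_pages > 0:
        resp = session.get(url, timeout=(3.05, 27))  # (connect, read) timeouts
        resp.raise_for_status()
        payload = resp.json()
        results.extend(payload["items"])
        url = payload.get("next")
        max_pages -= 1
    return results
```

The explicit `timeout` tuple matters: without it, `requests` will wait indefinitely on a stalled connection.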

Ethical Considerations

  • Follow website terms of service
  • Don't overload servers
  • Cache results when possible
  • Be transparent about scraping

Data Processing

  • Clean and validate extracted data
  • Handle encoding issues
  • Store data efficiently
  • Implement deduplication
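The cleaning, encoding, and deduplication steps above can be sketched with the stdlib alone; the field names (`url`, `title`) are illustrative:

```python
# Sketch: normalize scraped records and drop duplicates, stdlib only.
import unicodedata

def clean_record(raw: dict) -> dict:
    # Normalize Unicode to NFC, strip whitespace, drop empty fields.
    out = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = unicodedata.normalize("NFC", value).strip()
        if value not in ("", None):
            out[key] = value
    return out

def dedupe(records: list[dict], key: tuple = ("url",)) -> list[dict]:
    # Keep the first record seen for each fingerprint, preserving order.
    seen, unique = set(), []
    for record in records:
        fingerprint = tuple(record.get(k) for k in key)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique
```

NFC normalization matters because the same visible text (e.g. accented characters) can arrive in different byte sequences from different pages, which would otherwise defeat deduplication.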