web-scraping
Summary
Web scraping and data extraction using Python tools for static, dynamic, and large-scale content.
- Supports static sites via requests and BeautifulSoup, dynamic content via Selenium and Playwright, and large-scale extraction via Scrapy and firecrawl
- Includes specialized tools for AI-powered extraction (jina), structured queries (agentQL), and complex automation workflows (multion)
- Built-in guidance on rate limiting, robots.txt compliance, error handling, session management, and pagination
- Covers data processing tasks: cleaning, validation, encoding handling, deduplication, and efficient storage
SKILL.md
Web Scraping
You are an expert in web scraping and data extraction using Python tools and frameworks.
Core Tools
Static Sites
- Use requests for HTTP requests
- Use BeautifulSoup for HTML parsing
- Use lxml for fast XML/HTML processing
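A minimal sketch of the static pipeline, assuming BeautifulSoup is installed: parsing runs against an inline HTML snippet so the example works offline, and the live-fetch lines (with a placeholder URL) are left commented.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a fetched page so this runs offline.
HTML = """
<html><body>
  <article><h2 class="title">First post</h2></article>
  <article><h2 class="title">Second post</h2></article>
</body></html>
"""

def extract_titles(html: str) -> list[str]:
    """Return the text of every h2.title element."""
    # "html.parser" is stdlib; pass "lxml" instead for faster parsing.
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select("h2.title")]

# Live version (URL is a placeholder):
# import requests
# resp = requests.get("https://example.com/blog", timeout=10,
#                     headers={"User-Agent": "my-scraper/1.0"})
# resp.raise_for_status()
# titles = extract_titles(resp.text)

print(extract_titles(HTML))  # ['First post', 'Second post']
```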
Dynamic Content
- Use Selenium for JavaScript-rendered pages
- Use Playwright for modern web automation
- Use Puppeteer (via pyppeteer) for headless browsing; pyppeteer is unmaintained, so prefer Playwright for new projects
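A sketch of scraping a JavaScript-rendered page with Playwright's sync API. The browser session is guarded under `__main__` because it needs installed browser binaries; the page URL, the `.row` selector, and the `name|price` text format are all assumptions, and the extraction logic lives in a pure helper so it can be exercised on its own.

```python
def parse_rows(texts: list[str]) -> list[dict]:
    """Turn 'name|price' strings (hypothetical page format) into records."""
    rows = []
    for t in texts:
        name, _, price = t.partition("|")
        rows.append({"name": name.strip(), "price": float(price or 0)})
    return rows

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/listings")  # placeholder URL
        page.wait_for_selector(".row")             # wait for JS-rendered rows
        texts = page.locator(".row").all_inner_texts()
        browser.close()
    print(parse_rows(texts))
```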
Large-Scale Extraction
- Use Scrapy for structured crawling
- Use jina for AI-powered extraction
- Use firecrawl for large-scale scraping
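Scrapy manages the crawl frontier for you: the URL queue, deduplication, scheduling, and politeness. As a rough illustration of the bookkeeping involved, here is a minimal stdlib breadth-first frontier over an in-memory link graph (the graph and URLs are made up).

```python
from collections import deque

# Toy link graph standing in for the web (all URLs are made up).
LINKS = {
    "https://site.test/": ["https://site.test/a", "https://site.test/b"],
    "https://site.test/a": ["https://site.test/b", "https://site.test/c"],
    "https://site.test/b": ["https://site.test/"],
    "https://site.test/c": [],
}

def crawl(start: str, max_pages: int = 10) -> list[str]:
    """Breadth-first crawl with URL deduplication: the core bookkeeping
    a framework like Scrapy adds scheduling, retries, and throttling to."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)                # "fetch" the page
        for link in LINKS.get(url, []):  # "extract" its links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("https://site.test/"))
```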
Complex Workflows
- Use agentQL for structured queries
- Use multion for complex automation
Best Practices
- Implement rate limiting and delays between requests
- Respect robots.txt before crawling
- Set an accurate, identifiable User-Agent header
- Handle errors gracefully
- Implement retry logic with backoff
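A stdlib sketch of three of these practices together: a per-host rate limiter, retry with exponential backoff around any fetch callable, and a robots.txt check. `RobotFileParser.parse()` accepts raw lines, so the robots check needs no network; the rules and user agent shown are made up.

```python
import time
import urllib.robotparser

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

def fetch_with_retries(fetch, retries=3, base_delay=0.01):
    """Retry a fetch callable with exponential backoff on network errors."""
    for attempt in range(retries):
        try:
            return fetch()
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# robots.txt compliance (rules are made up; parse() takes raw lines).
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("my-scraper/1.0", "https://site.test/private/page"))  # False
print(rp.can_fetch("my-scraper/1.0", "https://site.test/public"))        # True
```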
Error Handling
- Handle network timeouts (set explicit connect and read timeouts)
- Detect blocked requests (403s, CAPTCHAs) and back off
- Persist session cookies across requests
- Follow pagination to the last page without looping forever
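The pagination pattern above can be sketched as a loop that requests page after page until one comes back empty, with a hard cap so it cannot run forever. A fake client stands in for `requests.Session` (which would also persist cookies) so the example runs offline; the `page` parameter and the response shape are assumptions.

```python
# Three fake pages of results; page 3 is empty, signalling the end.
FAKE_PAGES = {1: ["a", "b"], 2: ["c"], 3: []}

class FakeSession:
    """Stands in for requests.Session, which would also carry cookies."""
    def get_items(self, page: int) -> list[str]:
        return FAKE_PAGES.get(page, [])

def scrape_all(session, max_pages: int = 100) -> list[str]:
    """Collect items page by page until an empty page or the cap."""
    items = []
    for page in range(1, max_pages + 1):
        batch = session.get_items(page)
        if not batch:       # empty page signals the last page
            break
        items.extend(batch)
    return items

print(scrape_all(FakeSession()))  # ['a', 'b', 'c']
```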
Ethical Considerations
- Follow website terms of service
- Don't overload servers
- Cache results when possible
- Be transparent about scraping
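Caching is the simplest way to avoid re-hitting a server for pages you already have. A minimal in-memory sketch, keyed by URL (a real scraper might persist to disk with `sqlite3` or `shelve`; the fetch function here is a stand-in, not a real request):

```python
calls = {"count": 0}

def fetch(url: str) -> str:
    """Stands in for a real HTTP request."""
    calls["count"] += 1
    return f"<html>{url}</html>"

_cache: dict[str, str] = {}

def cached_fetch(url: str) -> str:
    """Fetch a URL at most once; later calls are served from the cache."""
    if url not in _cache:
        _cache[url] = fetch(url)
    return _cache[url]

cached_fetch("https://site.test/a")
cached_fetch("https://site.test/a")  # served from cache, no second fetch
print(calls["count"])  # 1
```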
Data Processing
- Clean and validate extracted data
- Handle encoding issues
- Store data efficiently
- Implement deduplication
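A stdlib sketch combining three of these tasks: Unicode normalization and whitespace cleanup, then deduplication by hashing the cleaned, lowercased `title` field (the field name and sample records are made up).

```python
import hashlib
import unicodedata

def clean(text: str) -> str:
    """Normalize Unicode (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct cleaned, lowercased title."""
    seen: set[str] = set()
    out = []
    for rec in records:
        key = hashlib.sha256(clean(rec["title"]).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

raw = [
    {"title": "Hello   World"},
    {"title": "hello world"},   # duplicate once cleaned and lowercased
    {"title": "Another page"},
]
print(dedupe(raw))  # keeps 2 of the 3 records
```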