# Web Scraper

A toolkit for extracting content from web pages using Python.
## When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
## Requirements

This skill requires external packages:

```bash
pip install requests beautifulsoup4
```
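A minimal sketch of how the two packages fit together, parsing an inline HTML snippet in place of a fetched page (in the real scripts, `requests.get(url).text` would supply the HTML):

```python
from bs4 import BeautifulSoup

# In the actual scripts, html would come from requests.get(url, timeout=10).text
html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string  # text of the <title> tag
print(title)  # prints: Example
```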
## Available Scripts

Always run scripts with `--help` first to see all available options.

| Script | Purpose |
|---|---|
| `fetch_page.py` | Download HTML content from a URL |
| `extract_links.py` | Extract all links from a page |
| `extract_text.py` | Extract readable text from HTML |
## Decision Tree

```
Task → What do you need?
│
├─ Raw HTML content?
│  └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│  └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
   └─ Use: extract_text.py <url>
```
## Quick Examples

Fetch page HTML:

```bash
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
```
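The actual implementation lives in `scripts/fetch_page.py`; a plausible core might look like the following sketch (the function name and `User-Agent` string are illustrative assumptions, not the script's real code):

```python
import requests

def fetch_page(url, output=None, timeout=10):
    """Hypothetical sketch: download a page's HTML, optionally saving it to a file."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "my-scraper/1.0"})  # assumed agent string
    resp.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    if output:
        with open(output, "w", encoding=resp.encoding or "utf-8") as f:
            f.write(resp.text)
    return resp.text
```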
Extract all links:

```bash
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
```
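Under the hood, link extraction with BeautifulSoup is a short recipe; this sketch (inline HTML standing in for a fetched page) shows how `--absolute`-style and `--filter`-style behavior can be built from `urljoin` and a regex:

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a href="/report.pdf">Report</a> <a href="/about.html">About</a>'
base_url = "https://example.com"  # the page the links were scraped from
soup = BeautifulSoup(html, "html.parser")

# Resolve relative hrefs against the page URL (the --absolute behavior)
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
# Keep only links matching a pattern (the --filter behavior)
pdf_links = [u for u in links if re.search(r"\.pdf$", u)]
print(pdf_links)  # prints: ['https://example.com/report.pdf']
```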
Extract text content:

```bash
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
```
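Text extraction typically strips `<script>`/`<style>` tags first, then flattens the remaining markup; a sketch on an inline snippet:

```python
from bs4 import BeautifulSoup

html = ("<html><body><h1>Title</h1><p>First paragraph.</p>"
        "<p>Second paragraph.</p><script>var x = 1;</script></body></html>")
soup = BeautifulSoup(html, "html.parser")

for tag in soup(["script", "style"]):  # remove non-visible content
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)  # all readable text
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # --paragraphs style
print(paragraphs)  # prints: ['First paragraph.', 'Second paragraph.']
```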
## Best Practices

- **Respect robots.txt**: check whether scraping is allowed before fetching
- **Add delays**: don't overwhelm servers with rapid requests
- **Use an appropriate User-Agent**: identify your scraper honestly
- **Handle errors gracefully**: sites may block requests or time out
- **Cache responses**: don't re-fetch pages that haven't changed
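The robots.txt check can be done with the standard library's `urllib.robotparser`. This sketch parses inline rules for illustration; in practice you would call `rp.set_url(...)` then `rp.read()` to fetch the site's real file, and the user-agent name here is an assumption:

```python
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Inline rules for illustration; rp.set_url(...) + rp.read() fetches the real file
rp.parse(["User-agent: *", "Disallow: /private/"])

agent = "my-scraper/1.0"  # hypothetical scraper name; identify yourself honestly
allowed = rp.can_fetch(agent, "https://example.com/public/page.html")
blocked = rp.can_fetch(agent, "https://example.com/private/data.html")
print(allowed, blocked)  # prints: True False

time.sleep(1.0)  # be polite: pause between successive requests to the same host
```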
## Common Issues

- **403 Forbidden**: the site may be blocking scrapers. Try the `--user-agent` flag.
- **Timeout**: the site may be slow. Increase the `--timeout` value.
- **Empty content**: the page may require JavaScript. These scripts handle static HTML only.
- **Encoding issues**: use the `--encoding` flag if text appears garbled.
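A defensive fetch wrapper tying these knobs together might look like this sketch (the function name and error messages are illustrative assumptions, not the scripts' actual code):

```python
import requests

def robust_fetch(url, user_agent="my-scraper/1.0", timeout=30, encoding=None):
    """Hypothetical wrapper showing where each troubleshooting flag applies."""
    try:
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.Timeout:
        raise RuntimeError(f"Timed out after {timeout}s; try a larger --timeout")
    except requests.exceptions.HTTPError as err:
        raise RuntimeError(f"{err} (403 often means the site blocks scrapers; "
                           f"try a different --user-agent)")
    if encoding:
        resp.encoding = encoding  # override when decoded text appears garbled
    return resp.text
```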
## Reference Files

See `references/selectors.md` for CSS selector syntax reference.
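BeautifulSoup accepts CSS selectors directly through `soup.select()`; a quick taste (inline HTML with hypothetical class names):

```python
from bs4 import BeautifulSoup

html = ('<div class="post"><a class="ext" href="/x">X</a>'
        '<a href="/y">Y</a></div><div class="nav"><a href="/z">Z</a></div>')
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a class="ext"> elements inside <div class="post">
hrefs = [a["href"] for a in soup.select("div.post a.ext")]
print(hrefs)  # prints: ['/x']
```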
## Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don't scrape personal/private information
- Check website terms of service
- Consider using official APIs when available