# Web Scraper
A toolkit for extracting content from web pages using Python.
## When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
## Requirements
This skill requires external packages:
```bash
pip install requests beautifulsoup4
```
## Available Scripts

Always run scripts with `--help` first to see all available options.
| Script | Purpose |
|---|---|
| `fetch_page.py` | Download HTML content from a URL |
| `extract_links.py` | Extract all links from a page |
| `extract_text.py` | Extract readable text from HTML |
## Decision Tree
```
Task → What do you need?
│
├─ Raw HTML content?
│  └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│  └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
   └─ Use: extract_text.py <url>
```
## Quick Examples
Fetch page HTML:

```bash
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
```
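The internals of `fetch_page.py` aren't shown here; as a rough, stdlib-only sketch of the same idea (the real script uses `requests` per the Requirements section, and the names `build_request`/`fetch_page` are illustrative, not the script's API):

```python
import urllib.request


def build_request(url: str, user_agent: str = "my-scraper/1.0") -> urllib.request.Request:
    """Build a request that identifies the scraper via its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it using the server-declared charset."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Sending an explicit `User-Agent` and a finite timeout mirrors the Best Practices listed below.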
Extract all links:

```bash
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
```
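Conceptually, link extraction walks the HTML for `<a href>` attributes and, in `--absolute` mode, resolves relative links against the page URL. A minimal stdlib sketch of that idea (the real script uses BeautifulSoup; `LinkExtractor` is a hypothetical name, not the script's API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str = ""):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin leaves already-absolute URLs untouched.
                    self.links.append(urljoin(self.base_url, value))


html = '<p><a href="/docs">Docs</a> <a href="https://example.org">Ext</a></p>'
parser = LinkExtractor(base_url="https://example.com")
parser.feed(html)
print(parser.links)  # ['https://example.com/docs', 'https://example.org']
```

A `--filter`-style regex would then be applied to this list with `re.search`.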
Extract text content:

```bash
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
```
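Text extraction boils down to keeping character data while dropping tags, plus skipping the contents of `<script>` and `<style>`. A stdlib-only sketch of that idea (the real script presumably uses BeautifulSoup's text extraction; `TextExtractor` is an illustrative name):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


html = "<html><style>p{color:red}</style><p>Hello <b>world</b>.</p></html>"
p = TextExtractor()
p.feed(html)
print(p.chunks)  # ['Hello', 'world', '.']
```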
## Best Practices
- **Respect robots.txt**: check whether scraping is allowed before fetching
- **Add delays**: don't overwhelm servers with rapid requests
- **Use an appropriate User-Agent**: identify your scraper properly
- **Handle errors gracefully**: websites may block requests or time out
- **Cache responses**: don't re-fetch unchanged pages
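The first practice can be sketched with the standard library's `urllib.robotparser`; the `allowed_by_robots` helper below is illustrative, not part of this skill's scripts:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "my-scraper/1.0") -> bool:
    """Check an already-fetched robots.txt body against a target URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(robots, "https://example.com/public/page"))   # True
print(allowed_by_robots(robots, "https://example.com/private/data"))  # False
```

In a real crawl loop, a `time.sleep(...)` between consecutive requests would cover the "add delays" practice as well.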
## Common Issues
- **403 Forbidden**: the site may be blocking scrapers. Try the `--user-agent` flag.
- **Timeout**: the site may be slow. Increase the `--timeout` value.
- **Empty content**: the page may require JavaScript. These scripts handle static HTML only.
- **Encoding issues**: use the `--encoding` flag if text appears garbled.
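For timeouts and transient failures, a retry-with-backoff wrapper is a common mitigation. `fetch_with_retries` below is a hypothetical helper, not one of the skill's scripts:

```python
import time


def fetch_with_retries(fetch, attempts: int = 3, base_delay: float = 0.5):
    """Call a zero-argument fetch function, retrying with exponential backoff.

    `fetch` is any callable that raises on failure (e.g. a timeout); the
    last error is re-raised once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrapping a call such as `lambda: fetch_page(url)` gives each URL a few chances before giving up.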
## Reference Files

See `references/selectors.md` for CSS selector syntax reference.
## Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don't scrape personal/private information
- Check website terms of service
- Consider using official APIs when available