# Web Scraper
A toolkit for extracting content from web pages using Python.
## When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
## Requirements
This skill requires external packages:
```bash
pip install requests beautifulsoup4
```
## Available Scripts

Always run scripts with `--help` first to see all available options.
| Script | Purpose |
|---|---|
| `fetch_page.py` | Download HTML content from a URL |
| `extract_links.py` | Extract all links from a page |
| `extract_text.py` | Extract readable text from HTML |
## Decision Tree
```
Task → What do you need?
│
├─ Raw HTML content?
│  └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│  └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
   └─ Use: extract_text.py <url>
```
## Quick Examples
Fetch page HTML:

```bash
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
```
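The internals of `fetch_page.py` aren't shown here; as a rough, stdlib-only sketch of the same idea (the real script uses `requests` per the Requirements section, and the names `build_request`/`fetch_page` are illustrative, not the script's API):

```python
import urllib.request


def build_request(url: str, user_agent: str = "my-scraper/1.0") -> urllib.request.Request:
    """Build a request that identifies the scraper via its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it using the server-declared charset."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Sending an explicit `User-Agent` and a finite timeout mirrors the Best Practices listed below.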
Extract all links:

```bash
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
```
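Conceptually, link extraction walks the HTML for `<a href>` attributes and, in `--absolute` mode, resolves relative links against the page URL. A minimal stdlib sketch of that idea (the real script uses BeautifulSoup; `LinkExtractor` is a hypothetical name, not the script's API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str = ""):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin leaves already-absolute URLs untouched.
                    self.links.append(urljoin(self.base_url, value))


html = '<p><a href="/docs">Docs</a> <a href="https://example.org">Ext</a></p>'
parser = LinkExtractor(base_url="https://example.com")
parser.feed(html)
print(parser.links)  # ['https://example.com/docs', 'https://example.org']
```

A `--filter`-style regex would then be applied to this list with `re.search`.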
Extract text content:

```bash
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
```
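Text extraction boils down to keeping character data while dropping tags, plus skipping the contents of `<script>` and `<style>`. A stdlib-only sketch of that idea (the real script presumably uses BeautifulSoup's text extraction; `TextExtractor` is an illustrative name):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


html = "<html><style>p{color:red}</style><p>Hello <b>world</b>.</p></html>"
p = TextExtractor()
p.feed(html)
print(p.chunks)  # ['Hello', 'world', '.']
```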
## Best Practices
- **Respect robots.txt**: check whether scraping is allowed before fetching
- **Add delays**: don't overwhelm servers with rapid requests
- **Use an appropriate User-Agent**: identify your scraper properly
- **Handle errors gracefully**: websites may block requests or time out
- **Cache responses**: don't re-fetch unchanged pages
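The first practice can be sketched with the standard library's `urllib.robotparser`; the `allowed_by_robots` helper below is illustrative, not part of this skill's scripts:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "my-scraper/1.0") -> bool:
    """Check an already-fetched robots.txt body against a target URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(robots, "https://example.com/public/page"))   # True
print(allowed_by_robots(robots, "https://example.com/private/data"))  # False
```

In a real crawl loop, a `time.sleep(...)` between consecutive requests would cover the "add delays" practice as well.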
## Common Issues
- **403 Forbidden**: the site may be blocking scrapers. Try the `--user-agent` flag.
- **Timeout**: the site may be slow. Increase the `--timeout` value.
- **Empty content**: the page may require JavaScript. These scripts handle static HTML only.
- **Encoding issues**: use the `--encoding` flag if text appears garbled.
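For timeouts and transient failures, a retry-with-backoff wrapper is a common mitigation. `fetch_with_retries` below is a hypothetical helper, not one of the skill's scripts:

```python
import time


def fetch_with_retries(fetch, attempts: int = 3, base_delay: float = 0.5):
    """Call a zero-argument fetch function, retrying with exponential backoff.

    `fetch` is any callable that raises on failure (e.g. a timeout); the
    last error is re-raised once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrapping a call such as `lambda: fetch_page(url)` gives each URL a few chances before giving up.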
## Reference Files

See `references/selectors.md` for CSS selector syntax reference.
## Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don't scrape personal/private information
- Check website terms of service
- Consider using official APIs when available