Web Scraping & Data Extraction Engine
Quick Health Check (Run First)
Score your scraping operation (2 points for each signal that matches the Healthy column):
| Signal | Healthy | Unhealthy |
|---|---|---|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |
Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
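
The tally above can be sketched as a small helper. This is a minimal illustration, not part of the original checklist: the signal keys, `health_score`, and `verdict` are hypothetical names, and it assumes all-or-nothing scoring (2 points or 0 per signal), since the table does not say whether partial credit is allowed.

```python
from typing import Mapping

# Hypothetical keys mirroring the eight table rows.
SIGNALS = (
    "legal_compliance",
    "architecture",
    "anti_detection",
    "data_quality",
    "error_handling",
    "monitoring",
    "storage",
    "scheduling",
)
POINTS_PER_SIGNAL = 2  # from "2 points each"


def health_score(healthy: Mapping[str, bool]) -> int:
    """Return the total out of 16; signals missing from the mapping score 0."""
    unknown = set(healthy) - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return sum(POINTS_PER_SIGNAL for name in SIGNALS if healthy.get(name, False))


def verdict(score: int) -> str:
    """Map a score to the bands on the score line: 12+, 8-11, below 8."""
    if score >= 12:
        return "Production-ready"
    if score >= 8:
        return "Needs work"
    return "Stop and redesign"


if __name__ == "__main__":
    # Example assessment: five of the eight signals are healthy.
    assessment = {
        "legal_compliance": True,
        "architecture": True,
        "anti_detection": False,
        "data_quality": True,
        "error_handling": True,
        "monitoring": False,
        "storage": True,
        "scheduling": False,
    }
    score = health_score(assessment)
    print(f"{score}/16: {verdict(score)}")
```

Rejecting unknown keys is a deliberate choice here: a misspelled signal name would otherwise silently score 0 and drag the verdict down.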