Playwright Scraper Skill
A Playwright-based web scraping skill that can work around anti-bot protection. Choose the approach that matches the target website's anti-bot level.
Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---|---|---|---|
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare Protected | High | Playwright Stealth | scripts/playwright-stealth.js |
Installation
cd playwright-scraper-skill
npm install
npx playwright install chromium
Quick Start
1. Simple Sites (No Anti-Bot)
Use the built-in web_fetch tool for static sites.
2. Dynamic Sites (Requires JavaScript)
Use Playwright Simple:
node scripts/playwright-simple.js "https://example.com"
3. Anti-Bot Protected Sites (Cloudflare, etc.)
Use Playwright Stealth:
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
Features:
- Hide automation markers (navigator.webdriver = false)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support
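The "random delays" feature can be illustrated with a small helper. This is a minimal sketch, not the code in scripts/playwright-stealth.js: the names `randomDelayMs` and `humanPause` and the default 500-2000 ms range are assumptions chosen for illustration.

```javascript
// Hypothetical helpers; names and default ranges are illustrative only.

// Return a random integer in [minMs, maxMs], inclusive.
function randomDelayMs(minMs = 500, maxMs = 2000) {
  if (minMs > maxMs) throw new RangeError('minMs must not exceed maxMs');
  return Math.floor(Math.random() * (maxMs - minMs + 1)) + minMs;
}

// Pause for a random interval, e.g. between page load and scrolling.
function humanPause(minMs, maxMs) {
  const ms = randomDelayMs(minMs, maxMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

Inside an async scraping function, `await humanPause(1000, 3000)` would wait between one and three seconds before the next action.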
Script Descriptions
scripts/playwright-simple.js
- Use Case: Regular dynamic websites
- Speed: Fast (3-5 seconds)
- Anti-Bot: None
- Output: JSON (title, content, URL)
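To make the output shape concrete, here is a rough sketch of how a simple Playwright scraper could produce the title, content, and URL fields. It is not the actual contents of scripts/playwright-simple.js: `scrapePage`, `main`, and the 3000 ms default wait are hypothetical.

```javascript
// Sketch only: `scrapePage` and `main` are hypothetical names.
async function scrapePage(page, url, waitMs = 3000) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await page.waitForTimeout(waitMs); // give client-side JavaScript time to render
  return {
    title: await page.title(),
    content: await page.locator('body').innerText(),
    url: page.url(), // final URL after any redirects
  };
}

async function main(url) {
  // Loaded lazily; needs `npm install` and `npx playwright install chromium`.
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    const result = await scrapePage(page, url);
    console.log(JSON.stringify(result, null, 2));
  } finally {
    await browser.close(); // always release the browser process
  }
}

// Example call (not executed here):
// main('https://example.com');
```

Because `scrapePage` receives the page as a parameter, a stealth-configured page can reuse the same extraction logic.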
scripts/playwright-stealth.js
- Use Case: Sites with Cloudflare or anti-bot protection
- Speed: Medium (5-20 seconds)
- Anti-Bot: Medium-High (hides automation, realistic UA)
- Output: JSON + Screenshot + HTML file
- Verified: 100% success on Discuss.com.hk
Customization
All scripts support environment variables:
# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
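The sketch below shows one way a script could turn these variables into a config object. The variable names come from the examples above; the function name `readConfig` and every default value (for instance the 5000 ms wait) are assumptions, not the scripts' documented behavior.

```javascript
// Hypothetical config reader; defaults are assumptions for illustration.
function readConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    screenshotPath: env.SCREENSHOT_PATH || null,
    // Fall back to the assumed default when WAIT_TIME is missing or invalid.
    waitTime: Number.isFinite(waitTime) && waitTime >= 0 ? waitTime : 5000,
    // Headless unless HEADLESS is exactly the string "false".
    headless: env.HEADLESS !== 'false',
    saveHtml: env.SAVE_HTML === 'true',
    userAgent: env.USER_AGENT || null,
  };
}
```

Passing the environment as a parameter keeps the function easy to check in isolation, e.g. `readConfig({ WAIT_TIME: '10000' })`.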
Anti-Bot Techniques Summary
Effective Anti-Bot Measures
- Hide navigator.webdriver: Essential
- Realistic User-Agent: Use real devices (iPhone, Android)
- Mimic Human Behavior: Random delays, scrolling
- Avoid Framework Signatures: Crawlee, Selenium are easily detected
- Use addInitScript (Playwright): Inject before page load
Ineffective Anti-Bot Measures
- Only changing User-Agent: Not enough
- Using high-level frameworks (Crawlee): More easily detected
- Docker isolation: Doesn't help with Cloudflare
Troubleshooting
Issue: 403 Forbidden
Solution: Use playwright-stealth.js
Issue: Cloudflare Challenge Page
Solution:
- Increase wait time (10-15 seconds)
- Try headless: false (headful mode sometimes has a higher success rate)
- Consider using proxy IPs
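The advice to increase the wait time can be automated with a small retry loop. This is a hypothetical helper, not part of the skill: `retryWithLongerWaits` and its default schedule of 5, 10, and 15 seconds are assumptions, and `attempt` stands for whatever function performs one scrape with a given wait.

```javascript
// Hypothetical retry helper: `attempt(waitMs)` should resolve to a result or throw.
async function retryWithLongerWaits(attempt, waitsMs = [5000, 10000, 15000]) {
  let lastError;
  for (const waitMs of waitsMs) {
    try {
      return await attempt(waitMs); // stop at the first success
    } catch (err) {
      lastError = err; // remember the failure and try the next, longer wait
    }
  }
  throw lastError ?? new Error('no attempts were made');
}
```

A caller might write `retryWithLongerWaits((waitMs) => scrape(url, waitMs))`, where `scrape` is a placeholder for a function that throws when it still sees a challenge page.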