Playwright Scraper Skill
An OpenClaw Skill for Playwright-based web scraping, with techniques for getting past anti-bot protection. Choose the approach that matches the target website's anti-bot level.
🎯 Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---|---|---|---|
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare Protected | High | Playwright Stealth ⭐ | scripts/playwright-stealth.js |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |
📦 Installation
cd playwright-scraper-skill
npm install
npx playwright install chromium
🚀 Quick Start
1️⃣ Simple Sites (No Anti-Bot)
Use OpenClaw's built-in web_fetch tool:
# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
2️⃣ Dynamic Sites (Requires JavaScript)
Use Playwright Simple:
node scripts/playwright-simple.js "https://example.com"
Example output:
{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
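For orientation, here is a minimal sketch of how a script of this kind could produce the JSON shown above. It is not the source of the bundled scripts/playwright-simple.js: the helper names `buildResult` and `scrapeSimple` and the 3000 ms default wait are assumptions made for illustration.

```javascript
// Sketch only: NOT the bundled scripts/playwright-simple.js.
// buildResult and scrapeSimple are hypothetical helper names.

// Shape the output like the example above; elapsedSeconds is a string with two decimals.
function buildResult(url, title, content, startMs, endMs) {
  return {
    url,
    title,
    content,
    elapsedSeconds: ((endMs - startMs) / 1000).toFixed(2),
  };
}

async function scrapeSimple(url, waitMs = 3000) {
  // Required lazily so the pure helper above works without Playwright installed.
  const { chromium } = require('playwright');
  const startMs = Date.now();
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForTimeout(waitMs); // give client-side JavaScript time to render
    const title = await page.title();
    const content = await page.evaluate(() => document.body.innerText);
    return buildResult(url, title, content, startMs, Date.now());
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Usage (needs `npm install` and `npx playwright install chromium` first):
// scrapeSimple('https://example.com').then((r) => console.log(JSON.stringify(r, null, 2)));
```

Keeping the result shaping in a separate pure function makes the output format easy to check without launching a browser.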
3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)
Use Playwright Stealth:
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
Features:
- Hide automation markers (navigator.webdriver = false)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support
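The first two features listed above can be sketched with standard Playwright calls (`browser.newContext` and `context.addInitScript`). This is an illustration under stated assumptions, not the bundled scripts/playwright-stealth.js: `MOBILE_USER_AGENTS`, `pickUserAgent`, and `openStealthPage` are hypothetical names, and the User-Agent strings and viewport size are placeholder values.

```javascript
// Sketch only: NOT the bundled scripts/playwright-stealth.js.
// MOBILE_USER_AGENTS, pickUserAgent, and openStealthPage are hypothetical names.

// Placeholder mobile User-Agent strings; swap in ones from current real devices.
const MOBILE_USER_AGENTS = [
  'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1',
  'Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
];

// Pick one User-Agent at random; `random` is injectable to make this testable.
function pickUserAgent(random = Math.random) {
  return MOBILE_USER_AGENTS[Math.floor(random() * MOBILE_USER_AGENTS.length)];
}

async function openStealthPage(url) {
  // Required lazily so pickUserAgent works without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: pickUserAgent(),
    viewport: { width: 390, height: 844 }, // assumed phone-sized viewport
    isMobile: true,
    hasTouch: true,
  });
  // Runs in every new document before the page's own scripts execute.
  await context.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });
  const page = await context.newPage();
  const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
  // The caller is responsible for calling browser.close() when done.
  return { browser, page, status: response ? response.status() : null };
}
```

Applying the init script at the context level means every page opened from that context gets the override, including pages reached through redirects.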
4️⃣ YouTube Video Transcripts
Use deep-scraper (install separately):
# Install deep-scraper skill
npx clawhub install deep-scraper
# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
📖 Script Descriptions
scripts/playwright-simple.js
- Use Case: Regular dynamic websites
- Speed: Fast (3-5 seconds)
- Anti-Bot: None
- Output: JSON (title, content, URL)
scripts/playwright-stealth.js ⭐
- Use Case: Sites with Cloudflare or anti-bot protection
- Speed: Medium (5-20 seconds)
- Anti-Bot: Medium-High (hides automation, realistic UA)
- Output: JSON + Screenshot + HTML file
- Verified: 100% success on Discuss.com.hk (tested 2026-02-07)
🎓 Best Practices
1. Try web_fetch First
If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.
2. Need JavaScript? Use Playwright Simple
If you need to wait for JavaScript rendering, use playwright-simple.js.
3. Getting Blocked? Use Stealth
If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.
4. Special Sites Need Specialized Skills
- YouTube → deep-scraper
- Reddit → reddit-scraper
- Twitter → bird skill
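The escalation order in practices 1 to 3 can be written down as a small decision helper. This is a sketch, not part of the skill: `nextMethod` is a hypothetical name, and the `{ status, content }` result shape is an assumption about what a caller would have on hand after an attempt.

```javascript
// Sketch only: nextMethod is a hypothetical helper, not part of the skill.
// `result` is assumed to look like { status: number, content: string }.
// Returns the method to try next, or null to stay with the current one
// (either it worked, or the skill has nothing stronger to offer).
function nextMethod(current, result) {
  if (current === 'playwright-stealth') return null; // strongest option in this skill
  if (result.status === 403) return 'playwright-stealth'; // blocked: go straight to stealth
  const empty = !result.content || result.content.trim() === '';
  if (empty && current === 'web_fetch') return 'playwright-simple'; // likely needs JavaScript
  return null;
}

// Example: a static fetch that came back empty suggests client-side rendering.
console.log(nextMethod('web_fetch', { status: 200, content: '' }));
```

Sites covered by a specialized skill (YouTube, Reddit, Twitter) are outside this ladder and should go to that skill directly.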
🔧 Customization
All scripts support environment variables:
# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
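A script could read these variables along the following lines. This is a sketch: `readConfig` is a hypothetical helper, and the defaults (5000 ms wait, headless on, HTML saving off) are assumptions, since the actual defaults of the bundled scripts are not documented here.

```javascript
// Sketch only: readConfig is a hypothetical helper; the default values are assumptions.
function readConfig(env = process.env) {
  return {
    screenshotPath: env.SCREENSHOT_PATH || null,
    waitTimeMs: Number.parseInt(env.WAIT_TIME || '5000', 10),
    headless: env.HEADLESS !== 'false', // headless unless explicitly set to "false"
    saveHtml: env.SAVE_HTML === 'true', // off unless explicitly set to "true"
    userAgent: env.USER_AGENT || null,
  };
}

// Example: the equivalent of `WAIT_TIME=10000 HEADLESS=false node script.js URL`.
console.log(readConfig({ WAIT_TIME: '10000', HEADLESS: 'false' }));
```

Passing the environment in as a parameter keeps the parsing logic testable without touching the real `process.env`.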
📊 Performance Comparison
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|---|---|---|---|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |
🛡️ Anti-Bot Techniques Summary
Lessons learned from our testing:
✅ Effective Anti-Bot Measures
- Hide navigator.webdriver: Essential
- Realistic User-Agent: Use real devices (iPhone, Android)
- Mimic Human Behavior: Random delays, scrolling
- Avoid Framework Signatures: Crawlee, Selenium are easily detected
- Use addInitScript (Playwright): Inject before page load
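The "mimic human behavior" item (random delays, scrolling) might look like the sketch below. `randomDelayMs` and `humanScroll` are hypothetical helper names, and the pixel and millisecond ranges are arbitrary illustrative values, not tuned thresholds. `page` is expected to be a Playwright `Page`.

```javascript
// Sketch only: randomDelayMs and humanScroll are hypothetical helper names.

// Return an integer number of milliseconds in [minMs, maxMs] (both inclusive).
function randomDelayMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Scroll a Playwright page in a few uneven steps with pauses in between.
async function humanScroll(page, steps = 3) {
  for (let i = 0; i < steps; i++) {
    await page.mouse.wheel(0, randomDelayMs(200, 600)); // vertical distance in pixels
    await page.waitForTimeout(randomDelayMs(500, 1500)); // pause between scrolls
  }
}
```

The random source is injectable so the delay range can be verified deterministically.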
❌ Ineffective Anti-Bot Measures
- Only changing User-Agent — Not enough
- Using high-level frameworks (Crawlee) — More easily detected
- Docker isolation — Doesn't help with Cloudflare
🔍 Troubleshooting
Issue: 403 Forbidden
Solution: Use playwright-stealth.js
Issue: Cloudflare Challenge Page
Solution:
- Increase wait time (10-15 seconds)
- Try headless: false (headful mode sometimes has a higher success rate)
- Consider using proxy IPs
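To apply the longer wait only when it is needed, a script can check whether it is still looking at a challenge page. The sketch below is an assumption-heavy illustration: `looksLikeChallenge` and `waitForChallengeToClear` are hypothetical helpers, and the text markers ("Just a moment", "Checking your browser") are commonly seen on Cloudflare interstitials but are not guaranteed and may change.

```javascript
// Sketch only: both helpers are hypothetical, and the markers are assumptions
// about what a Cloudflare interstitial typically shows.
function looksLikeChallenge(title, html) {
  const t = (title || '').toLowerCase();
  const h = (html || '').toLowerCase();
  return t.includes('just a moment') || h.includes('checking your browser');
}

// Poll a Playwright page until the challenge markers disappear or time runs out.
// Resolves to true if the page cleared, false if it still looks like a challenge.
async function waitForChallengeToClear(page, maxWaitMs = 15000, stepMs = 1000) {
  for (let waited = 0; waited <= maxWaitMs; waited += stepMs) {
    let challenged = true;
    try {
      challenged = looksLikeChallenge(await page.title(), await page.content());
    } catch {
      // The page may be mid-navigation; treat it as not yet cleared.
    }
    if (!challenged) return true;
    await page.waitForTimeout(stepMs);
  }
  return false;
}
```

The 15000 ms default follows the upper end of the 10-15 second wait suggested above.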
Issue: Blank Page
Solution:
- Increase waitForTimeout
- Use waitUntil: 'networkidle' or 'domcontentloaded'
- Check if login is required
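One way to combine the first two suggestions is to try the stricter `networkidle` condition and fall back to `domcontentloaded` plus a fixed wait if it times out, since some pages never go idle. `gotoWithFallback` is a hypothetical helper and the timeout values are illustrative assumptions. `page` is expected to be a Playwright `Page`.

```javascript
// Sketch only: gotoWithFallback is a hypothetical helper; timeouts are illustrative.
// Resolves to the waitUntil value that succeeded.
async function gotoWithFallback(page, url, { timeoutMs = 30000, extraWaitMs = 5000 } = {}) {
  try {
    await page.goto(url, { waitUntil: 'networkidle', timeout: timeoutMs });
    return 'networkidle';
  } catch (err) {
    // Only fall back on timeouts; other errors (DNS failure, bad URL) are rethrown.
    if (err.name !== 'TimeoutError') throw err;
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForTimeout(extraWaitMs); // extra time for late-rendering content
    return 'domcontentloaded';
  }
}
```

If the page is still blank after both attempts, a login wall is the next thing to rule out.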
📝 Memory & Experience
2026-02-07 Discuss.com.hk Test Conclusions
- ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
- ❌ Crawlee (deep-scraper) failed (403)
- ❌ Chaser (Rust) failed (Cloudflare)
- ❌ Puppeteer standard failed (403)
Best Solution: Pure Playwright + anti-bot techniques (framework-independent)
🚧 Future Improvements
- Add proxy IP rotation
- Implement cookie management (maintain login state)
- Add CAPTCHA handling (2captcha / Anti-Captcha)
- Batch scraping (parallel URLs)
- Integration with OpenClaw's browser tool