anti-scraping
Anti-Scraping & Web Scraping
When to use: Websites with Cloudflare protection, JavaScript rendering requirements, or anti-bot measures.
Overview
Provides battle-tested solutions for bypassing common anti-scraping measures, using a headless Playwright browser with a stealth configuration.
Key Capabilities
- ✅ Cloudflare challenge bypass
- ✅ JavaScript rendering
- ✅ Real browser context simulation
- ✅ Stealth mode (hides automation detection)
- ✅ Screenshot capture for debugging
Quick Start
Prerequisites
# Install Playwright next to playwright-cloudflare.js so that require('playwright') resolves
npm install playwright
npx playwright install chromium
Basic Usage Pattern
// n8n Code node (self-hosted; child_process and fs must be allowed via NODE_FUNCTION_ALLOW_BUILTIN)
const { execFileSync } = require('child_process');
const fs = require('fs');
const url = 'https://example.com';
const outputFile = '/tmp/page.html';
// Run the Playwright stealth script; passing arguments as an array avoids shell quoting problems in URLs
execFileSync('node', ['playwright-cloudflare.js', url, outputFile]);
// Read result
const html = fs.readFileSync(outputFile, 'utf8');
Core Script: playwright-cloudflare.js
Location: n8n-skills/anti-scraping/playwright-cloudflare.js
Key Features:
- Disables automation detection
- Sets real browser headers
- Configures viewport and user agent
- Handles Cloudflare waiting
- Captures screenshots on failure
Configuration:
const config = {
waitForCloudflare: true, // Wait for CF challenge
waitTime: 15000, // Max wait time (ms)
selector: '.product-list', // Element to wait for
screenshotOnError: true, // Debug screenshots
userAgent: 'Mozilla/5.0...' // Real browser UA
};
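The features and configuration above could fit together roughly as follows. This is a minimal sketch, not the contents of playwright-cloudflare.js: the function name `fetchRenderedHtml`, the `screenshotPath` option, the viewport size, and the choice to pass Playwright's `chromium` object in as a parameter are all assumptions made for illustration. Only the Playwright calls themselves (`launch`, `newContext`, `addInitScript`, `goto`, `waitForSelector`, `content`, `screenshot`) are standard API.

```javascript
// Hypothetical sketch of a script like playwright-cloudflare.js (not the real file).
const defaults = {
  waitForCloudflare: true,
  waitTime: 15000,
  selector: null,
  screenshotOnError: true,
  screenshotPath: '/tmp/error-screenshot.png', // assumed option name
  userAgent: null, // null keeps the browser's default UA
};

// `chromium` is expected to be require('playwright').chromium.
async function fetchRenderedHtml(chromium, url, options = {}) {
  const config = { ...defaults, ...options };
  const browser = await chromium.launch({
    headless: true,
    // Stops Chromium from advertising that it is automation-controlled.
    args: ['--disable-blink-features=AutomationControlled'],
  });
  let page;
  try {
    const contextOptions = { viewport: { width: 1366, height: 768 }, locale: 'en-US' };
    if (config.userAgent) contextOptions.userAgent = config.userAgent;
    const context = await browser.newContext(contextOptions);
    // Runs in the page before any site script: hide navigator.webdriver.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    });
    page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Waiting for a real content element is how this sketch detects
    // that the challenge page has been replaced by the target page.
    if (config.waitForCloudflare && config.selector) {
      await page.waitForSelector(config.selector, { timeout: config.waitTime });
    }
    return await page.content();
  } catch (err) {
    if (config.screenshotOnError && page) {
      // Best effort: a failed screenshot must not mask the original error.
      await page.screenshot({ path: config.screenshotPath, fullPage: true }).catch(() => {});
    }
    throw err;
  } finally {
    await browser.close();
  }
}

module.exports = { fetchRenderedHtml };
```

With Playwright installed, this would be called as `fetchRenderedHtml(require('playwright').chromium, url, { selector: '.product-list' })`, and the returned HTML written to the output file with `fs.writeFileSync`. Hiding `navigator.webdriver` is only one stealth measure and does not guarantee that a challenge will pass.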
n8n Workflow Pattern
[Manual Trigger]
↓
[Set Parameters]
target_url: https://site.com
wait_selector: .content
↓
[Execute Command: Playwright]
Command: node
Arguments: playwright-cloudflare.js "{{$json.target_url}}" /tmp/output.html
↓
[Read HTML File]
File: /tmp/output.html
↓
[Parse with Cheerio]
(use html-parsing skill)
Performance
- Speed: 15-25 seconds per page
- Success Rate: ~95% for Cloudflare sites
- Resource Usage: ~200-300MB RAM per browser instance
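These figures translate into a rough budget for a run. The sketch below is illustrative arithmetic only: the function names are made up for this example, and it assumes pages are fetched one at a time with the 3-5 second delay recommended under Best Practices.

```javascript
// Figures taken from this document; replace them with your own measurements.
const SECONDS_PER_PAGE = { min: 15, max: 25 };
const DELAY_SECONDS = { min: 3, max: 5 };
const MB_PER_BROWSER = { min: 200, max: 300 };

// Total run time for sequential scraping: one fetch per page plus a delay between pages.
function estimateRunSeconds(pageCount) {
  const gaps = Math.max(0, pageCount - 1);
  return {
    min: pageCount * SECONDS_PER_PAGE.min + gaps * DELAY_SECONDS.min,
    max: pageCount * SECONDS_PER_PAGE.max + gaps * DELAY_SECONDS.max,
  };
}

// Conservative cap on parallel browsers, using the upper RAM estimate.
function maxConcurrentBrowsers(availableMb) {
  return Math.floor(availableMb / MB_PER_BROWSER.max);
}

console.log(estimateRunSeconds(100)); // { min: 1797, max: 2995 }
console.log(maxConcurrentBrowsers(2048)); // 6
```

So 100 pages take roughly 30 to 50 minutes when run sequentially, and 2 GB of free memory leaves room for at most about six browser instances.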
Troubleshooting
Cloudflare Still Blocking
# Increase wait time
--wait 30000
# Add specific selector to wait for
--selector '.product-list'
# Check screenshot for errors
/tmp/error-screenshot.png
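The source does not show how the script reads these flags. One plausible way to map `--wait` and `--selector` onto the configuration object is sketched below; `parseFlags` and the shape of its return value are assumptions for illustration.

```javascript
// Hypothetical flag parser: positional args are <url> <outputFile>,
// followed by optional --wait <ms> and --selector <css>.
function parseFlags(argv) {
  const config = { waitTime: 15000, selector: null };
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--wait') {
      const ms = Number(argv[++i]);
      if (!Number.isInteger(ms) || ms <= 0) {
        throw new Error('--wait expects a positive integer (milliseconds)');
      }
      config.waitTime = ms;
    } else if (arg === '--selector') {
      const selector = argv[++i];
      if (!selector) throw new Error('--selector expects a CSS selector');
      config.selector = selector;
    } else {
      positional.push(arg);
    }
  }
  const [url, outputFile] = positional;
  return { url, outputFile, ...config };
}

// In a real script the input would be process.argv.slice(2).
console.log(parseFlags(['https://example.com', '/tmp/page.html', '--wait', '30000']));
```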
Timeout Errors
// Increase the timeout in the Playwright script
timeout: 60000 // 60 seconds
Memory Issues
# Close browser properly
await browser.close();
# Limit concurrent instances
# Use n8n Split Into Batches with batch size = 1
Best Practices
- Add Delays: Wait 3-5 seconds between requests
- Rotate User Agents: Change UA periodically
- Use Residential Proxies: For high-volume scraping
- Handle Errors: Implement retry logic with exponential backoff
- Respect robots.txt: Check site policies
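The retry advice above can be sketched as a small helper. The names `withRetry` and `backoffDelay`, the 3-second base delay, and the 60-second cap are assumptions chosen for illustration, not values from the skill's code.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay doubles with each failed attempt: 3 s, 6 s, 12 s, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 3000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `task` up to retries + 1 times, waiting longer after each failure.
// `wait` is injectable so the helper can be tested without real delays.
async function withRetry(task, { retries = 3, baseMs = 3000, wait = sleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await wait(backoffDelay(attempt, baseMs));
    }
  }
  throw lastError;
}
```

A scrape call would then be wrapped as `withRetry(() => scrapePage(url))`, where `scrapePage` stands for whatever function runs the Playwright script.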
Common Patterns
Pattern 1: Single Page Scraping
Trigger → Playwright → Parse → Export
Pattern 2: Multi-Page with Pagination
Trigger → Generate URLs (pagination skill) →
Split Into Batches → Playwright → Wait 5s →
Parse → Deduplicate → Export
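Outside of n8n nodes, the core of Pattern 2 (one page at a time, a pause between pages, then deduplication) might look like the sketch below. `scrapeSequentially`, `dedupeBy`, and the `scrapeOne` callback are hypothetical names; in the workflow itself these steps are handled by Split Into Batches, Wait, and a deduplication node.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit URLs one at a time (the equivalent of batch size = 1),
// pausing between pages but not before the first one.
async function scrapeSequentially(urls, scrapeOne, { delayMs = 5000, wait = sleep } = {}) {
  const pages = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await wait(delayMs);
    pages.push({ url: urls[i], html: await scrapeOne(urls[i]) });
  }
  return pages;
}

// Keep the first record seen for each key, preserving order.
function dedupeBy(records, keyOf) {
  const seen = new Map();
  for (const record of records) {
    const key = keyOf(record);
    if (!seen.has(key)) seen.set(key, record);
  }
  return [...seen.values()];
}
```

Deduplicating on a stable key such as a product ID or canonical URL matters for paginated listings, where the same item can appear on two consecutive pages if the list shifts between requests.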
Pattern 3: With Error Handling
Playwright → [Error Trigger] → Retry Logic → Notification
Integration with Other Skills
- pagination: Generate URLs for multi-page scraping
- html-parsing: Extract data from rendered HTML
- error-handling: Retry on failures
- debugging: Validate extracted data
Full Code and Documentation
Complete implementation with examples:
/mnt/d/work/n8n_agent/n8n-skills/anti-scraping/
Files:
- playwright-cloudflare.js - Main scraping script
- README.md - Detailed documentation
- example-workflow.json - n8n workflow example
- config.template.env - Configuration template