Web Scraping with Browser Automation

Objectives

  • Use Playwright to simulate real user behavior and bypass anti-bot detection
  • Handle dynamic content, infinite scroll, and JavaScript-heavy sites
  • Implement robust error handling, retries, and rate limiting
  • Extract structured data efficiently

Core Strategy

1. Stealth Mode (Anti-Bot)

Always use stealth configuration to avoid detection:

# Launch with Chromium's automation hints disabled
browser = await playwright.chromium.launch(
    args=['--disable-blink-features=AutomationControlled']
)
context = await browser.new_context()

# Remove the navigator.webdriver flag before any page script runs
await context.add_init_script("""
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
""")
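
The snippet above creates a bare context; in practice, pass realistic settings when creating it. The user-agent string, viewport, and locale below are illustrative values, not requirements:

context = await browser.new_context(
    user_agent=(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/124.0.0.0 Safari/537.36'
    ),  # illustrative; keep roughly in sync with a current Chrome release
    viewport={'width': 1366, 'height': 768},
    locale='en-US',
    timezone_id='America/New_York',
)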

2. Human-like Behavior

Add random delays and smooth interactions:

import asyncio
import random

await asyncio.sleep(random.uniform(0.5, 2.0))  # human-scale pause between actions
await page.mouse.move(x, y, steps=random.randint(10, 30))  # glide to (x, y) in small steps
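
A small helper that combines both ideas; human_click is a name introduced here, not a Playwright API:

async def human_click(page, selector: str):
    box = await page.locator(selector).bounding_box()
    if box:
        # Glide to the element's center rather than jumping straight to it
        await page.mouse.move(
            box['x'] + box['width'] / 2,
            box['y'] + box['height'] / 2,
            steps=random.randint(10, 30),
        )
    await asyncio.sleep(random.uniform(0.2, 0.8))  # brief pause before clicking
    await page.locator(selector).click()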

3. Wait for Content

Use appropriate wait strategies:

# Full page load: wait until network activity settles
await page.goto(url, wait_until='networkidle')

# Dynamic content: wait for specific elements or conditions
await page.wait_for_selector('.content')
await page.wait_for_function("document.querySelectorAll('.item').length > 10")

Common Patterns

Pattern 1: Article/Blog Content

title = await page.locator('h1').first.text_content()
paragraphs = await page.locator('article p').all_text_contents()
content = '\n\n'.join(paragraphs)

Pattern 2: Infinite Scroll

items = []
previous_count = 0
while len(items) < max_items:
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(2000)  # give newly loaded items time to render
    items = await page.locator('.item').all()
    if len(items) == previous_count:
        break  # no new items appeared; we've reached the end
    previous_count = len(items)

Pattern 3: Handle Popups

from playwright.async_api import TimeoutError as PWTimeout

# Close cookie banners and modals if present
try:
    await page.click('button:has-text("Accept")', timeout=3000)
except PWTimeout:
    pass  # no banner appeared; carry on
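
Consent banners vary between sites; a sketch that tries a few common selectors in turn (the selector list is illustrative):

for sel in ('button:has-text("Accept")',
            'button:has-text("Agree")',
            '[aria-label="Close"]'):
    try:
        await page.click(sel, timeout=1500)
        break  # dismissed one; stop trying
    except PWTimeout:
        continue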

Pattern 4: Login Required

await page.fill('input[name="username"]', username)
await page.fill('input[name="password"]', password)
await page.click('button[type="submit"]')
await page.wait_for_url('**/dashboard')
cookies = await context.cookies()  # Save for reuse
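
Instead of juggling raw cookies, Playwright can persist the whole session; a sketch, with state.json as an arbitrary path:

# Save the authenticated session after logging in...
await context.storage_state(path='state.json')

# ...then restore it in a later run and skip the login form entirely
context = await browser.new_context(storage_state='state.json')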

Rate Limiting & Retries

Implement rate limiting to avoid bans:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=4, max=10))
async def scrape_with_retry(url: str):
    # Your scraping logic
    pass

Track requests per time window:

import asyncio
import time

class RateLimiter:
    def __init__(self, max_requests: int, time_window: int):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.requests: list[float] = []

    async def wait_if_needed(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        self.requests = [t for t in self.requests if now - t < self.time_window]
        if len(self.requests) >= self.max_requests:
            # Sleep until the oldest request exits the window
            await asyncio.sleep(self.time_window - (now - self.requests[0]))
        self.requests.append(time.monotonic())
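
Typical use is one shared limiter, awaited before every request; 10 requests per 60 seconds is an arbitrary budget and fetch_page a placeholder name:

limiter = RateLimiter(max_requests=10, time_window=60)

async def fetch_page(page, url: str):
    await limiter.wait_if_needed()  # blocks here if the budget is spent
    await page.goto(url)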

Data Extraction

Use BeautifulSoup for parsing after Playwright renders:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# Remove unwanted elements
for element in soup(['script', 'style', 'nav', 'footer']):
    element.decompose()

# Extract structured data, falling back to <body> if no landmark element exists
article = soup.find('article') or soup.find('main') or soup.body
paragraphs = [p.get_text(strip=True) for p in article.find_all('p')]
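
When a site embeds JSON-LD metadata, parsing it is often more reliable than walking the DOM; a sketch reusing the soup object from above:

import json

# JSON-LD lives in <script type="application/ld+json"> tags
for tag in soup.find_all('script', type='application/ld+json'):
    try:
        data = json.loads(tag.string or '')
    except json.JSONDecodeError:
        continue
    if isinstance(data, dict):  # JSON-LD can also be a list of objects
        print(data.get('@type'), data.get('headline'))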

Caching

Cache results to minimize requests:

import hashlib
import json
from pathlib import Path

def get_cache_path(url: str) -> Path:
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return Path(f'.cache/{url_hash}.json')

def load_from_cache(url: str):
    path = get_cache_path(url)
    return json.loads(path.read_text()) if path.exists() else None

# Check cache before scraping
cached = load_from_cache(url)
if cached:
    return cached
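
The matching write side; save_to_cache is a name introduced here to mirror the loader above:

def save_to_cache(url: str, data) -> None:
    path = get_cache_path(url)
    path.parent.mkdir(parents=True, exist_ok=True)  # create .cache/ on first write
    path.write_text(json.dumps(data))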

Installation

pip install playwright beautifulsoup4 lxml tenacity
playwright install chromium

# Or with uv
uv add playwright beautifulsoup4 lxml tenacity
uv run playwright install chromium

Project Structure

scripts/scrapers/
├── base.py              # Base scraper class with stealth mode
├── extractors/          # Site-specific extractors
│   ├── medium.py
│   ├── github.py
│   └── generic.py
├── utils/
│   ├── stealth.py       # Anti-bot utilities
│   ├── cache.py         # Caching logic
│   └── rate_limit.py    # Rate limiting
└── config.py            # User agents, timeouts, etc.

Validation Checklist

Before deploying:

  • Uses stealth mode (removes webdriver flag)
  • Implements rate limiting
  • Has retry logic with exponential backoff
  • Uses caching to avoid redundant requests
  • Handles errors gracefully
  • Closes browser resources properly
  • Respects robots.txt
  • Logs all operations

Common Issues

  • "Executable doesn't exist" → Run playwright install chromium
  • Timeout errors → Increase timeout or use wait_until='domcontentloaded'
  • Element not found → Add explicit waits with wait_for_selector()
  • Detected as bot → Use stealth mode, rotate user agents, add random delays
  • Memory leaks → Always close the browser in a finally block (see the sketch below)
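
A minimal shape for that cleanup, assuming the async API; the context manager tears down Playwright itself, and the finally block guarantees the browser process is released:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto('https://example.com')  # placeholder URL
        finally:
            await browser.close()  # runs even if scraping raises

asyncio.run(main())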

Best Practices

  1. Respect robots.txt - Check before scraping (see the sketch after this list)
  2. Use caching - Avoid redundant requests
  3. Rate limit - Don't overload servers
  4. Rotate user agents - Avoid detection
  5. Log everything - Debug and monitor
  6. Handle errors - Retry with backoff
  7. Clean up resources - Close browsers properly
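
A minimal robots.txt check using only the standard library; 'my-scraper' is a placeholder user-agent token:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = 'my-scraper') -> bool:
    rp = RobotFileParser()
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)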

For detailed code examples: see references/examples.md
For site-specific patterns: see references/patterns.md
For advanced anti-bot techniques: see references/stealth-guide.md
