# Web Scraping — Multi-Page Collection
For any task that needs data from N pages: articles, products, listings, search results, directory entries. Read chrome-bridge, navigation, and data-extraction first — this skill composes them.
## Two phases
- Discovery — find the target URLs
- Collection — visit each, extract, accumulate
## Reusable scripts

| Task | Pre-condition | Script | Notes |
|---|---|---|---|
| Discover item URLs from a listing | On the listing page | `scripts/listing_links.js` | Scoped to `article h2 a`, `article h3 a`. Edit per site. |
| Find the "next page" link | On a paginated listing | `scripts/next_page_link.js` | Returns href or null. Loop until null. |

To use: Read the script file and pass its contents as the `code` argument to `mcp__chrome-bridge__execute_script`.
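For orientation, a minimal sketch of the logic inside `scripts/listing_links.js`. The shipped file is authoritative; this just mirrors the `article h2 a` scoping from the table:

```js
// Sketch of scripts/listing_links.js: collect item links from the content region.
// Keep the selector scoped (article, main, .list-content); adjust per site.
(() => {
  const anchors = document.querySelectorAll('article h2 a, article h3 a');
  const seen = new Set();
  const items = [];
  for (const a of anchors) {
    const href = a.href; // .href resolves relative URLs to absolute
    if (href && !seen.has(href)) {
      seen.add(href);
      items.push({ url: href, title: a.textContent.trim() });
    }
  }
  return items;
})();
```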
## Phase 1: Discovery

### From a listing page

```
navigate(url='https://example.com/articles/')
# wait for ready

# Try schema first — many listings embed an ItemList
page_schema()
# If that returns [{ '@type': 'ItemList', itemListElement: [...] }], you're done.

# Otherwise extract links via the scoped script:
execute_script(code=<contents of scripts/listing_links.js>)
```
If the listing uses a different wrapper than `<article>`, copy `listing_links.js` and adjust the selector — keep it scoped to the content region (`main`, `.list-content`). A bare `a[href]` pulls in nav and footer chaff.
### From a search
See the search skill for Google / DuckDuckGo extractors.
### Pagination
Two strategies:
Detect the "next" link and recurse:

```
nextUrl = execute_script(code=<contents of scripts/next_page_link.js>)
# navigate to the returned URL, re-extract, repeat until null
```
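A minimal sketch of what `scripts/next_page_link.js` does. The `rel="next"` and link-text heuristics here are assumptions; the shipped file is the source of truth:

```js
// Sketch of scripts/next_page_link.js: returns the next page's href, or null.
(() => {
  // Prefer the explicit pagination hint when the site provides one.
  const rel = document.querySelector('a[rel="next"], link[rel="next"]');
  if (rel && rel.href) return rel.href;
  // Fall back to common link text.
  for (const a of document.querySelectorAll('a[href]')) {
    const t = a.textContent.trim().toLowerCase();
    if (t === 'next' || t === 'next page' || t === '›') return a.href;
  }
  return null; // no next page, stop the loop
})();
```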
Infer the URL pattern and precompute:

```
# If the site uses /articles?page=1,2,3...
urls = [f'https://example.com/articles?page={i}' for i in range(1, 11)]
# Then scrape each in parallel via multi-tab.
```
## Phase 2: Collection

### Sequential (simple, safe)

```
items = []
for url in discovered_urls:
    navigate(url=url)
    # wait for ready
    article = page_article()
    if 'error' in article:
        article = { 'markdown': page_markdown() }
    items.append({
        'source_url': execute_script(code="(()=>location.href)()"),  # captures redirects
        'title': article.get('title'),
        'author': article.get('author'),
        'date': article.get('date'),
        'content': article.get('content') or article.get('markdown'),
        'captured_at': now_iso(),
    })
```
### Parallel (3–5 tabs)

```
batch = discovered_urls[:5]
tab_ids = [tabs_create(url=u).id for u in batch]
# wait ~4s for all to settle
results = []
for tid in tab_ids:
    action('tabs.update', { 'tabId': tid, 'active': True })  # background tabs may not hydrate
    # short wait
    results.append({
        'url': tabs_get(tid).get('url'),
        'schema': page_schema(tab_id=tid),
        'article': page_article(tab_id=tid),
    })
for tid in tab_ids:
    tabs_close(tid)
```
Don't exceed 5 concurrent tabs. Past that, Chrome starts swapping and everything slows.
## Extraction priority per item

1. `page_schema` — Article, NewsArticle, BlogPosting, Product, Recipe, Event all tend to be complete
2. `page_article` — editorial content not in schema
3. `execute_script` — specific fields not covered by either (prices, ratings, custom attrs)
4. `page_markdown` — last resort for non-article layouts
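For step 3, a sketch of a field-picking payload. The selectors are hypothetical placeholders; inspect the target site and substitute its own:

```js
// Sketch: pull fields the structured extractors miss (price, rating, etc.).
// Selectors are hypothetical; adjust to the actual site markup.
(() => {
  const pick = (sel) => document.querySelector(sel)?.textContent.trim() ?? null;
  return {
    price: pick('[itemprop="price"], .price'),
    rating: pick('[itemprop="ratingValue"], .rating'),
  };
})();
```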
## Return shape

```json
{
  "source": "https://example.com/articles/",
  "captured_at": "2026-04-21T12:34:56Z",
  "count": 23,
  "items": [
    {
      "url": "https://example.com/articles/foo",
      "title": "Foo",
      "author": "...",
      "date": "...",
      "content": "..."
    }
  ]
}
```
Include `source`, `captured_at`, and `count` at the top level. Each item has its own `url` — never strip it; it's the audit trail.
## Anti-bot / blocked content

Challenge page instead of content — the tells:

- `page_source` returns a Cloudflare / reCAPTCHA challenge shell
- `page_text` contains "Please verify you are human" or similar
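A quick probe via `execute_script`. The marker strings are heuristics; providers vary their copy:

```js
// Sketch: detect a challenge shell instead of real content.
// Marker strings are heuristic; Cloudflare, reCAPTCHA, etc. word them differently.
(() => {
  const text = (document.title + ' ' + document.body.innerText).toLowerCase();
  const markers = ['just a moment', 'verify you are human', 'checking your browser'];
  return markers.some((m) => text.includes(m));
})();
```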
Fallbacks (in order):
- Wait longer, retry — some challenges pass after the human has already cleared them in the real session
- Spoof the user agent via CDP before navigating: `cdp_send('Network.setUserAgentOverride', { userAgent: '...' })`
- If still blocked, stop and tell the user
Paywall that hides content via CSS only:
`page_text` often still contains the full body text even when it's visually hidden. Try that before giving up.
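If `page_text` comes up short, a direct `textContent` read can still recover CSS-hidden copy, since `textContent` ignores visibility. A sketch, with the `article, main` selector as an assumption:

```js
// Sketch: textContent includes display:none / visibility:hidden nodes,
// so it can surface body text a CSS-only paywall merely hides.
(() => {
  const el = document.querySelector('article, main'); // adjust per site
  return el ? el.textContent.trim() : null;
})();
```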
Rate limiting:
Space requests — 2–3s between `navigate` calls to the same domain. If you see 429s or "too many requests" in `page_text`, back off to 10s.
## Critical rules

- Always scope the link selector to the content region — never `a[href]` alone.
- If asked for N items, collect all N before returning. Partial results confuse the caller.
- Close tabs as you go. Don't batch cleanup.
- Return JSON — every time, with `source`, `captured_at`, `items[]`.
- Prefer `navigate` over URL construction for SPA sites — the session cookies matter.
- Don't follow external links unless the task asks for it.
- Verbatim — never summarize. When the user asks for a body / description / full text / content, copy what `page_article` / `page_text` / `page_markdown` returned. Never paraphrase, never write a summary in your own words, never write filler like "extracted body covering X" or "fetched live content from Y". A summary disguised as a body is a failed task.
- Per-item navigation is mandatory. Reusing one item's body for N items (copy-paste one and serve all entries) is a hard failure — the validator will reject it. Each item gets its own `navigate(url=…)` then its own extraction call.
## Anti-patterns (WILL be rejected)

```jsonc
// 1. Same body across every item — extracted once, copy-pasted
[
  {"url": "https://site.com/a", "body": "Designed as mixed development..."},
  {"url": "https://site.com/b", "body": "Designed as mixed development..."},  // SAME
  {"url": "https://site.com/c", "body": "Designed as mixed development..."}   // SAME
]

// 2. Same thumbnail / same URL across every item
[
  {"url": "https://site.com/a", "title": "A", "thumb": "https://cdn/x.jpg"},
  {"url": "https://site.com/a", "title": "B", "thumb": "https://cdn/x.jpg"},  // SAME url!
  {"url": "https://site.com/a", "title": "C", "thumb": "https://cdn/x.jpg"}
]

// 3. Body field stubbed with a summary instead of real content
[
  {"url": "...", "body": "Article covering Q1 earnings with revenue breakdown."},    // ← summary
  {"url": "...", "body": "Coverage of the new product launch including price tier."} // ← summary
]

// 4. Body field stubbed with placeholder text / promises
[
  {"url": "...", "body": "Full article text can be fetched individually upon request."},
  {"url": "...", "body": "Retrieved live article content from the source page."}
]

// 5. Required field omitted entirely
// User asked for "url, title, full text, thumbnail" — but the result has no full_text key.
[
  {"url": "...", "title": "...", "thumbnail": "..."}  // ← full_text missing
]
```
## Worked example: latest N items with bodies

User asks: "get the latest 3 [articles | products | posts] from <site>, each with <fields>".
```
# 1. ONE navigation to the listing page
navigate(url='<listing url>')

# 2. ONE snapshot to discover item URLs
page_snapshot(urls=true)
# read the `urls` map; pick the N item refs (skip nav, ads, related)
# → urls = ["<item-a>", "<item-b>", "<item-c>"]

# 3. PER-ITEM SUB-LOOP — repeat for each of the N URLs
items = []
for url in urls[:N]:
    navigate(url=url)    # go to THIS item
    body = page_text()   # this item's verbatim body
    if len(body) < 300 or body looks like boilerplate:
        body = page_article().get('content')  # readability fallback
    # fetch any other per-item fields the user asked for, e.g. og:image:
    og = dom_query(selector="meta[property='og:image']", attribute='content')
    thumb = og or dom_query(selector="article img, .hero img, main img", attribute='src')
    items.append({
        'url': url,
        'title': page-derived title,  # h1 or page_article().title
        'full_text': body,            # ← the actual returned text, NOT a summary
        'thumbnail_url': thumb,
    })

# 4. ONE submit_result with the list — every item distinct
submit_result(result=json.dumps(items))
```
The shape of the JSON keys (url, full_text, thumbnail_url, etc.) is whatever the user asked for. The pattern — one navigate per item, verbatim body, per-item fields — is universal.
## Common failures

| Signal | Cause | Fix |
|---|---|---|
| Discovery returns 0 items | Selector too narrow, or content is lazy-loaded | Use `page_text` to confirm content is present; scroll with `execute_script` (see the sketch below); broaden the selector |
| Items have title but no content | `page_article` hit a non-article layout | Fall back to a custom `execute_script` with a per-site selector |
| Collection slows to a crawl | Too many open tabs | Cap at 5 concurrent; close aggressively |
| Random items are empty | Background tab didn't hydrate | Activate with `action('tabs.update', ...)` before extraction |
| `/execute` rejects script with `Invalid \escape` | Regex backslashes (`\d`, `\s`, `\w`) in the JS string weren't double-escaped for JSON | Pass scripts as file contents (no JSON encoding) — the `scripts/` files don't need escaping. If you must inline regex, prefer non-regex matching (`indexOf`, `includes`, `startsWith`) |
| Site returns a challenge page | Anti-bot triggered | See "Anti-bot" above |
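For the lazy-load case in the first row, a scroll nudge via `execute_script`. A minimal sketch; repeat until the height stops growing, then re-run the link extraction:

```js
// Sketch: scroll to the bottom to trigger lazy-loaded listing items.
(() => {
  window.scrollTo(0, document.body.scrollHeight);
  return document.body.scrollHeight; // compare across calls: growth means more items loaded
})();
```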