Web Scraping — Multi-Page Collection

For any task that needs data from N pages: articles, products, listings, search results, directory entries. Read chrome-bridge, navigation, and data-extraction first — this skill composes them.

Two phases

  1. Discovery — find the target URLs
  2. Collection — visit each, extract, accumulate

Reusable scripts

  • Discover item URLs from a listing (run on the listing page): scripts/listing_links.js. Scoped to article h2 a, article h3 a; edit per site.
  • Find the "next page" link (run on a paginated listing): scripts/next_page_link.js. Returns an href or null; loop until null.

To use: Read the script file and pass its contents as the code argument to mcp__chrome-bridge__execute_script.
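
For orientation, a minimal sketch of the shape such a script takes (not necessarily the exact contents of scripts/listing_links.js; adjust the selectors per site):

// Collect absolute, deduplicated item URLs from the content region.
(() => {
  const anchors = document.querySelectorAll('article h2 a, article h3 a');
  const urls = [...anchors]
    .map(a => a.href)                         // a.href resolves relative paths to absolute URLs
    .filter(href => href.startsWith('http'));
  return [...new Set(urls)];                  // dedupe, preserving document order
})()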

Phase 1: Discovery

From a listing page

navigate(url='https://example.com/articles/')
# wait for ready

# Try schema first — many listings embed an ItemList
page_schema()
# If that returns [{ '@type': 'ItemList', itemListElement: [...] }], you're done.

# Otherwise extract links via the scoped script:
execute_script(code=<contents of scripts/listing_links.js>)

If the listing uses a different wrapper than <article>, copy listing_links.js and adjust the selector — keep it scoped to the content region (main, .list-content). A bare a[href] pulls in nav and footer chaff.
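
If the schema route above did return an ItemList, the item URLs usually sit on each ListItem directly or on its nested item. A sketch of pulling them out of the object page_schema returned (the exact nesting varies by site):

// itemList: the ItemList object from page_schema's result.
const urls = (itemList.itemListElement || [])
  .map(el => el.url || (el.item && el.item.url))  // a ListItem carries url itself or nests it under item
  .filter(Boolean);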

From a search

See the search skill for Google / DuckDuckGo extractors.

Pagination

Two strategies:

Detect "next" link and recurse:

nextUrl = execute_script(code=<contents of scripts/next_page_link.js>)
# navigate to returned URL, re-extract, repeat until null
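
For reference, a sketch of the shape such a next-page script takes (not necessarily the shipped scripts/next_page_link.js). It deliberately avoids regex backslashes, which trip JSON escaping when inlined:

// Return the next page's absolute URL, or null on the last page.
(() => {
  const rel = document.querySelector('a[rel="next"], link[rel="next"]');
  if (rel && rel.href) return rel.href;
  // Fallback: common "next" link text; tighten per site.
  const byText = [...document.querySelectorAll('a')]
    .find(a => a.textContent.trim().toLowerCase().startsWith('next'));
  return byText ? byText.href : null;
})()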

Infer URL pattern and precompute:

# If the site uses /articles?page=1,2,3...
urls = [f'https://example.com/articles?page={i}' for i in range(1, 11)]
# Then scrape each in parallel via multi-tab.

Phase 2: Collection

Sequential (simple, safe)

items = []
for url in discovered_urls:
    navigate(url=url)
    # wait for ready
    article = page_article()
    if 'error' in article:
        article = { 'markdown': page_markdown() }
    items.append({
        'source_url': execute_script(code="(()=>location.href)()"),  # captures redirects
        'title': article.get('title'),
        'author': article.get('author'),
        'date': article.get('date'),
        'content': article.get('content') or article.get('markdown'),
        'captured_at': now_iso(),
    })

Parallel (3–5 tabs)

batch = discovered_urls[:5]
tab_ids = [tabs_create(url=u).id for u in batch]
# wait ~4s for all to settle

results = []
for tid in tab_ids:
    action('tabs.update', { 'tabId': tid, 'active': True })  # background tabs may not hydrate
    # short wait
    results.append({ 'url': tabs_get(tid).get('url'), 'schema': page_schema(tab_id=tid), 'article': page_article(tab_id=tid) })

for tid in tab_ids:
    tabs_close(tid)

Don't exceed 5 concurrent tabs. Past that Chrome swaps and everything slows.

Extraction priority per item

  1. page_schema — Article, NewsArticle, BlogPosting, Product, Recipe, Event all tend to be complete
  2. page_article — editorial content not in schema
  3. execute_script — specific fields not covered by either (prices, ratings, custom attrs)
  4. page_markdown — last resort for non-article layouts
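
For priority 3, a field-specific script might look like this; the selectors are illustrative guesses and need per-site adjustment:

// Grab fields that schema/article extraction missed.
(() => {
  const pick = (sel) => {
    const el = document.querySelector(sel);
    return el ? el.textContent.trim() : null;
  };
  return {
    price: pick('.price, [itemprop="price"]'),
    rating: pick('.rating, [itemprop="ratingValue"]'),
  };
})()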

Return shape

{
  "source": "https://example.com/articles/",
  "captured_at": "2026-04-21T12:34:56Z",
  "count": 23,
  "items": [
    {
      "url": "https://example.com/articles/foo",
      "title": "Foo",
      "author": "...",
      "date": "...",
      "content": "..."
    }
  ]
}

Include source, captured_at, and count at the top level. Each item has its own url — never strip it; it's the audit trail.

Anti-bot / blocked content

Challenge page instead of content:

  • page_source returns a Cloudflare / reCAPTCHA challenge shell
  • page_text contains "Please verify you are human" or similar
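
A quick predicate via execute_script can confirm the block before trying fallbacks; the phrases here are illustrative, so extend the list as you meet new challenge pages:

// True when the page looks like a challenge shell rather than content.
(() => {
  const text = document.body.innerText.toLowerCase();
  return ['verify you are human', 'checking your browser', 'too many requests']
    .some(phrase => text.includes(phrase));
})()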

Fallbacks (in order):

  1. Wait longer, retry — some challenges pass after the human has already cleared them in the real session
  2. Spoof user agent via CDP before navigating: cdp_send('Network.setUserAgentOverride', { userAgent: '...' })
  3. If still blocked, stop and tell the user

Paywall that hides content via CSS only: page_text often still contains the full body text even when it's visually hidden. Try that before giving up.
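
The mechanism: innerText respects CSS visibility while textContent does not, so a direct textContent read can recover body text that display:none is hiding. A sketch, assuming the body lives in an <article>:

// textContent includes nodes hidden via display:none / visibility:hidden.
(() => {
  const scope = document.querySelector('article') || document.body;
  return scope.textContent.trim();
})()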

Rate limiting: Space requests — 2–3s between navigate calls for the same domain. If you see 429s or "too many requests" in page_text, back off to 10s.

Critical rules

  1. Always scope the link selector to the content region — never a[href] alone.
  2. If asked for N items, collect all N before returning. Partial results confuse the caller.
  3. Close tabs as you go. Don't batch cleanup.
  4. Return JSON — every time, with source, captured_at, items[].
  5. Prefer navigate over URL construction for SPA sites — the session cookies matter.
  6. Don't follow external links unless the task asks for it.
  7. Verbatim — never summarize. When the user asks for a body / description / full text / content, copy what page_article / page_text / page_markdown returned. Never paraphrase, never write a summary in your own words, never write filler like "extracted body covering X" or "fetched live content from Y". A summary disguised as a body is a failed task.
  8. Per-item navigation is mandatory. Reusing one item's body for N items (copy-paste one and serve all entries) is a hard failure — the validator will reject it. Each item gets its own navigate(url=…) then its own extraction call.

Anti-patterns (WILL be rejected)

// 1. Same body across every item — extracted once, copy-pasted
[
  {"url": "https://site.com/a", "body": "Designed as mixed development..."},
  {"url": "https://site.com/b", "body": "Designed as mixed development..."},  // SAME
  {"url": "https://site.com/c", "body": "Designed as mixed development..."}   // SAME
]

// 2. Same thumbnail / same URL across every item
[
  {"url": "https://site.com/a", "title": "A", "thumb": "https://cdn/x.jpg"},
  {"url": "https://site.com/a", "title": "B", "thumb": "https://cdn/x.jpg"},  // SAME url!
  {"url": "https://site.com/a", "title": "C", "thumb": "https://cdn/x.jpg"}
]

// 3. Body field stubbed with a summary instead of real content
[
  {"url": "...", "body": "Article covering Q1 earnings with revenue breakdown."},  // ← summary
  {"url": "...", "body": "Coverage of the new product launch including price tier."} // ← summary
]

// 4. Body field stubbed with placeholder text / promises
[
  {"url": "...", "body": "Full article text can be fetched individually upon request."},
  {"url": "...", "body": "Retrieved live article content from the source page."}
]

// 5. Required field omitted entirely
// User asked for "url, title, full text, thumbnail" — but result has no full_text key.
[
  {"url": "...", "title": "...", "thumbnail": "..."}  // ← full_text missing
]

Worked example: latest N items with bodies

User asks: "get the latest 3 [articles | products | posts] from <site>, each with <fields>".

# 1. ONE navigation to the listing page
navigate(url='<listing url>')

# 2. ONE snapshot to discover item URLs
page_snapshot(urls=true)
# read the `urls` map; pick the N item refs (skip nav, ads, related)
# → urls = ["<item-a>", "<item-b>", "<item-c>"]

# 3. PER-ITEM SUB-LOOP — repeat for each of the N URLs
items = []
for url in urls[:N]:
    navigate(url=url)                                   # go to THIS item
    body = page_text()                                  # this item's verbatim body
    if len(body) < 300 or body looks like boilerplate:
        body = page_article().get('content')            # readability fallback
    # fetch any other per-item fields the user asked for, e.g. og:image:
    og = dom_query(selector="meta[property='og:image']", attribute='content')
    thumb = og or dom_query(selector="article img, .hero img, main img", attribute='src')
    items.append({
        'url': url,
        'title': page-derived title,                    # h1 or page_article().title
        'full_text': body,                              # ← the actual returned text, NOT a summary
        'thumbnail_url': thumb,
    })

# 4. ONE submit_result with the list — every item distinct
submit_result(result=json.dumps(items))

The shape of the JSON keys (url, full_text, thumbnail_url, etc.) is whatever the user asked for. The pattern — one navigate per item, verbatim body, per-item fields — is universal.

Common failures

  • Discovery returns 0 items. Cause: selector too narrow, or content is lazy-loaded. Fix: use page_text to confirm the content is present; scroll with execute_script (see the sketch after this list); broaden the selector.
  • Items have a title but no content. Cause: page_article hit a non-article layout. Fix: fall back to a custom execute_script with a per-site selector.
  • Collection slows to a crawl. Cause: too many open tabs. Fix: cap at 5 concurrent and close aggressively.
  • Random items are empty. Cause: a background tab didn't hydrate. Fix: activate it with action('tabs.update', ...) before extraction.
  • /execute rejects the script with Invalid \escape. Cause: regex backslashes (\d, \s, \w) in the JS string weren't double-escaped for JSON. Fix: pass scripts as file contents (no JSON encoding); the scripts/ files don't need escaping. If you must inline regex, prefer non-regex matching (indexOf, includes, startsWith).
  • Site returns a challenge page. Cause: anti-bot triggered. Fix: see "Anti-bot / blocked content" above.
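
For the lazy-load case above, a scroll nudge via execute_script. A minimal sketch; some sites need several passes with a short wait between them:

// Scroll to the bottom to trigger lazy-loaded items.
(() => {
  window.scrollTo(0, document.body.scrollHeight);
  return document.body.scrollHeight;  // compare across calls to detect "no more content"
})()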