data-extraction
Data Extraction — Structured Data from Pages
Goal: turn a rendered page into clean JSON. Always try the lightest-weight tool first; escalate only when it returns empty or wrong data.
Priority order
page_schema(tab_id?)— JSON-LD + microdata. Products, recipes, events, articles, local-business listings usually embed complete structured data. One call, often zero parsing. Caveat: returns{"schemas": []}on plenty of pages — disambiguation / list pages (Wikipedia), Google search results, Instagram profile pages, most SPAs without server-side rendering. Don't assume schema exists; fall through to the next tool if empty.page_article(tab_id?, max_wait=10000)— Mozilla Readability + SPA wait. Returns{ title, author, date, image, content }for editorial pages. Use for anything you'd call "an article." Increasemax_waitto 15000 for heavy SPAs.page_headings(tab_id?)— h1–h6 outline. Cheap. Use to plan a targeted extraction before dumping the whole page.execute_script— custom DOM extractor. Use when the data you want isn't in schema and isn't article prose (price badges, review counts, custom data attributes, tabular data).page_markdown(tab_id?)— full-page Markdown dump. Includes nav/footer/sidebar. Last resort whenpage_articleerrors (dashboards, search results, listings).page_source(tab_id?)/page_text(tab_id?)— raw HTML / plain text. Fallback when selectors misbehave or paywalls are CSS-only.
Reusable scripts
Each capability has a focused JS file under scripts/. To use one:
Readthe script file to load it as a string.- Pass the contents as the
codeargument tomcp__chrome-bridge__execute_script. - The script returns a structured object — see the header comment at the top of each file for the exact return shape.
| Task | Pre-condition | Script | Notes |
|---|---|---|---|
| Product DOM fallback | page_schema() returned [] or sparse Product |
scripts/product_dom_fields.js | Adapt the right-hand selectors per site. |
| Tables | Page rendered | scripts/tables_extract.js | Returns every table as { headers, rows }. Use instead of page_markdown() — markdown mangles tables. |
| Listing links | On a listing/index page | scripts/listing_links.js | Scoped to article h2 a, article h3 a. Edit the selector for sites that use a different wrapper. |
Pattern: schema-first
navigate(url='https://www.example.com/product/xyz')
# wait for ready
data = page_schema()
# If data contains a @type=Product entry with name/price/offers, you're done.
When schema is missing or sparse, fall through to the DOM fallback:
execute_script(code=<contents of scripts/product_dom_fields.js>)
Pattern: articles
For news, blogs, docs, Substack, Medium — try page_article first:
navigate(url='https://...')
# wait
article = page_article()
# → { title, author, date, image, content } (Markdown body)
If page_article returns { error: ... } (dashboards, listing pages, custom layouts), fall through to page_schema then page_markdown.
Pattern: tables
execute_script(code=<contents of scripts/tables_extract.js>)
# → [{ index, caption, headers, rows }, ...] one entry per <table>
Pattern: links / listings
execute_script(code=<contents of scripts/listing_links.js>)
# → [{ title, url }, ...]
Dynamic content
Content injected after load (infinite scroll, "Read more" expanders):
# trigger the load
execute_script(code="(function(){ window.scrollTo(0, document.body.scrollHeight); return 'scrolled'; })()")
# wait a few seconds, then re-extract
page_schema() # or whichever tool
If content only appears after a click:
dom_click('button.expand-all')
# wait, re-extract
Return shape
Every extractor should return a plain JSON-serializable object. Good shapes:
{ source_url, captured_at, items: [{ title, url, price, ... }] }
Always include the source_url and a timestamp so downstream consumers can audit staleness. Don't return DOM nodes, HTML blobs, or images — extract the field you need and return that.
VERBATIM RULE — never summarize a body field
When the user asks for a body, description, content, full text, article, excerpt, or anything similar, the value MUST be the actual text returned by page_text / page_article().content / page_markdown — copied as-is (you may trim trailing site chrome like "RELATED NEWS", "Top Articles", "Follow Us", "Comments", footer links). You may NOT:
- write a summary in your own words
- paraphrase or condense
- substitute a description like "Article covers Q1 results and analyst commentary"
- substitute placeholder text like "Full article text can be fetched on request" or "Retrieved live content from source"
- return only the title or only the first paragraph when the request was the body
Field names like full_text, body, content, description, excerpt are content slots, not metadata slots. If you don't have the verbatim text yet, navigate to the item and call page_text before answering.
Examples — body field
// ✓ CORRECT: the value is the actual returned text from page_text()
{
"url": "https://news.site/q1-results",
"full_text": "Acme Corp posted Q1 revenue of $4.2B, up 18% YoY, beating analyst consensus of $3.9B. CEO Jane Doe attributed the gain to enterprise contract renewals and a 12% lift in international bookings. Operating margin expanded to 23.4% from 21.1% a year ago, helped by lower customer-acquisition spend. The company guided FY revenue to $17–18B, in line with the Street. Shares rose 6% in after-hours trading. ..."
}
// ✗ WRONG: summary disguised as body
{
"url": "https://news.site/q1-results",
"full_text": "Article on Acme's Q1 earnings covering revenue beat, margin expansion, and full-year guidance."
}
// ✗ WRONG: placeholder
{
"url": "https://news.site/q1-results",
"full_text": "Q1 earnings article — full text available on the source page."
}
// ✗ WRONG: title repeated as body
{
"url": "https://news.site/q1-results",
"full_text": "Acme Corp Q1 Earnings"
}
Common failures
| Signal | Cause | Fix |
|---|---|---|
page_schema returns [] |
Site has no structured data | Skip to page_article or custom extractor |
page_article returns { error } |
Not an article page | Use page_schema / page_markdown / custom extractor instead |
| Selectors match nothing | Page not yet hydrated | Poll document.readyState or a known element; scroll to trigger lazy load |
execute_script returns undefined |
Bare return at top level |
Wrap the code in (function(){ ... })() (the scripts in scripts/ already do this) |
execute_script errors with Invalid \escape |
JS regex backslashes not double-escaped for JSON | Pass scripts as file contents (no JSON encoding) — the scripts/ files don't need escaping. If you must inline regex, prefer indexOf() / includes(). |
execute_script times out (HTTP 504) |
Script does something slow (image fetch, big loop) | Trim the logic; return fewer fields; avoid fetch() inside the script |
| Paywalled content hidden | Some paywalls only hide via CSS | Try page_text — the body text is often still in the DOM |