extract-page-data

Installation

SKILL.md

Extract Page Data

Extract structured data from any web page for design analysis, page reconstruction, or content review. The script handles dependency installation automatically — just point it at a URL.

When to use this

User shares a web page URL and wants to capture screenshots
User asks to extract design tokens, CSS theme, or color palette from a page
User wants page content as Markdown text
User wants to know what links, buttons, and interactive elements are on a page
User asks to export or analyze any web page's structure
User wants to take a screenshot of a specific element on a page (use --selector)

Do NOT use for Axure prototypes — use extract-axure-data instead.

Quick start

Run the extraction script from scripts/extract.mjs relative to this skill directory.

# Full page screenshot
node scripts/extract.mjs https://example.com --screenshot

# Screenshot of a specific element
node scripts/extract.mjs https://example.com --screenshot --selector "#hero"

# Extract design tokens (colors, typography, spacing, etc.)
node scripts/extract.mjs https://example.com --theme

# Extract page content as Markdown
node scripts/extract.mjs https://example.com --markdown

# Collect all links and interactive elements
node scripts/extract.mjs https://example.com --links

# Extract everything
node scripts/extract.mjs https://example.com --all

# Export full data pack (screenshot + theme + markdown + links as zip)
node scripts/extract.mjs https://example.com --pack

First run auto-installs Playwright + Chromium to ~/.cache/page-extractor/. This takes 1-2 minutes and happens once.

Parameters

Flag	What it does	Default
`<url>`	Web page URL (required)	—
`--screenshot`	Capture page screenshot	off
`--theme`	Extract design tokens	off
`--markdown`	Convert page to Markdown	off
`--links`	Collect interactive elements	off
`--pack`	Export full data pack (zip)	off
`--all`	Run all extractions	off
`--selector SEL`	CSS selector to scope extraction	whole page
`-o DIR`	Output directory	`./page-export`
`--viewport WxH`	Viewport size	`1280x720`
`--wait MS`	Extra wait after page load (ms)	`0`
`--scroll`	Scroll page to trigger lazy content	off
`--scroll-step PX`	Pixels per scroll step	`800`
`--scroll-delay MS`	Delay between scroll steps (ms)	`200`
`--no-headless`	Show browser window	headless
`--connect-cdp URL`	Attach to running Chrome (for auth)	—
`--format FMT`	Screenshot format: png or jpeg	`png`
`--verbose`	Detailed logging	off

Output structure

page-export/
├── screenshot.png       # Page or element screenshot    (--screenshot)
├── theme.json           # Design tokens                 (--theme)
├── content.md           # Page text as Markdown          (--markdown)
├── links.json           # Links and interactive elements (--links)
└── page-data.zip        # Full data pack                 (--pack)
    ├── screenshot.png
    ├── theme.json
    ├── content.md
    ├── links.json
    └── meta.json        # Page metadata (title, url, viewport)

Selector support

All extraction modes support --selector to scope to a specific element:

# Screenshot only the header
node scripts/extract.mjs https://example.com --screenshot --selector "header"

# Extract design tokens from a card component
node scripts/extract.mjs https://example.com --theme --selector ".card"

# Get Markdown from the main content area
node scripts/extract.mjs https://example.com --markdown --selector "main"

# Collect links from the navigation
node scripts/extract.mjs https://example.com --links --selector "nav"

Authenticated pages

Some pages require login. Two approaches:

Connect to an existing Chrome session (recommended — reuses cookies):

macOS:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

Windows:

start chrome --remote-debugging-port=9222

Then log in via that Chrome window and run:

node scripts/extract.mjs https://example.com --all --connect-cdp http://localhost:9222

Show the browser window (manual login during extraction):

node scripts/extract.mjs https://example.com --all --no-headless

Dependency installation

The script auto-installs on first run. If it fails:

# Manual install
cd ~/.cache/page-extractor
npm install playwright
npx playwright install chromium

# If Chromium download is slow (e.g. behind GFW):
PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright npx playwright install chromium

Output data formats

theme.json

Design system tokens extracted from computed CSS styles:

{
  "colors": {
    "background": [{"value": "rgb(255,255,255)", "count": 42, "tags": ["div"]}],
    "text": [{"value": "rgb(51,51,51)", "count": 28, "tags": ["p","span"]}],
    "border": [{"value": "rgb(221,221,221)", "count": 12}]
  },
  "typography": {
    "families": [{"value": "\"Inter\", sans-serif", "count": 56}],
    "textStyles": [{"size": "14px", "lineHeight": "22px", "weight": "400", "count": 20}]
  },
  "spacing": [{"value": "16px", "count": 15}],
  "radius": [{"value": "4px", "count": 8}],
  "lineWidth": [{"value": "1px", "count": 10}],
  "shadow": {
    "box": [{"value": "0 2px 8px rgba(0,0,0,0.1)", "count": 5}],
    "text": []
  },
  "transitions": [{"value": "all 0.2s ease", "count": 38, "tags": ["button","a"]}],
  "animations": [{"value": "fadeIn|0.3s|ease", "count": 5, "tags": ["div"]}],
  "cssVariables": {
    "--color-primary": "#6366f1",
    "--radius": "0.5rem",
    "--font-sans": "Inter, sans-serif"
  },
  "assets": {
    "backgroundImages": [{"value": "url(hero.webp)", "count": 1, "tags": ["section"]}],
    "images": [
      {"src": "https://example.com/logo.svg", "alt": "Logo", "position": "static", "zIndex": "auto", "siblingImgCount": 1},
      {"src": "https://example.com/hero.png", "alt": "", "position": "absolute", "zIndex": "2", "siblingImgCount": 2}
    ],
    "svgCount": 12
  }
}

links.json

Interactive elements on the page:

{
  "pageUrl": "https://example.com",
  "pageTitle": "Example",
  "links": [
    {"type": "a", "text": "Home", "href": "https://example.com/", "visible": true},
    {"type": "button", "text": "Sign Up", "visible": true},
    {"type": "form", "href": "https://example.com/login", "visible": true}
  ],
  "totalLinks": 3,
  "visibleLinks": 3
}

Customization

The extraction logic is modular. Each feature lives in its own file under scripts/:

File	Purpose	Customization
`lib/browser.mjs`	Playwright browser management	Change browser launch args, proxy
`lib/screenshot.mjs`	Screenshot capture	Modify scroll behavior, format
`lib/theme.mjs`	Design token extraction	Adjust token categories, limits
`lib/markdown.mjs`	HTML → Markdown	Change conversion rules
`lib/links.mjs`	Element collection	Add new element types
`lib/pack.mjs`	Zip packaging	Change pack contents
`inject/extract-theme.js`	Browser-injected theme logic	Edit CSS properties to extract
`inject/extract-markdown.js`	Browser-injected markdown logic	Customize text extraction
`inject/extract-links.js`	Browser-injected links logic	Add custom selectors

Theme Clone Workflow

When the user wants to clone or replicate a website's design language, run in this order:

Step 1: Extract all data

node extract.mjs <url> --all --scroll --viewport 1440x900

Step 2: Interpret theme.json

Field	Maps to	CSS Variable
`colors.background[0]`	Primary surface	`--color-bg`
`colors.text[0]`	Body text	`--color-text`
`colors.text[1]`	Accent / brand	`--color-primary`
`typography.families[0]`	Main font	Google Fonts import
`spacing` (top 5)	Spacing scale	`--spacing-1` …
`radius[0]`	Default radius	`--radius-base`
`transitions[0]`	Hover timing	apply to `.card, a, button`
`cssVariables`	Design tokens (if :root vars exist)	use names directly

Step 3: Generate CSS variables

:root {
  --color-bg:       <colors.background[0].value>;
  --color-text:     <colors.text[0].value>;
  --color-primary:  <colors.text[1].value>;
  --font-sans:      <typography.families[0].value>;
  --radius-base:    <radius[0].value>;
  --transition:     <transitions[0].value>;
}

If cssVariables is non-empty, prefer those names — they reflect the original design system's intent.

Step 4: Restore animations

transitions[0].value → apply as transition: <value> on interactive elements
animations → reconstruct @keyframes blocks by animation name; apply to entry elements

Step 5: Handle layered assets

Images with position: absolute and zIndex > 0 in assets.images are overlay layers. Reconstruct with CSS position: relative on the parent and position: absolute on overlays.

Playwright 查缺补漏

导出数据是 best-effort 快照，有时需要回到原始页面补充信息（hover 态样式、懒加载内容、动画效果等）。可使用 Playwright CLI 进行交互式补充采集。

# 打开页面并获取快照
playwright-cli open https://example.com
playwright-cli snapshot

# 交互后采集（如 hover 态）
playwright-cli hover e15
playwright-cli screenshot --filename=hover-state.png

# 获取元素属性
playwright-cli eval "el => getComputedStyle(el).backdropFilter" e5

playwright-cli close

完整命令参考: Playwright CLI 官方技能文档

何时需要 Playwright

场景	Playwright 做什么
截图模糊/不完整	对指定元素高分辨率截图
样式数据缺失	获取完整 computedStyle
hover/focus/active 态	触发交互后采集样式
懒加载内容	滚动触发后截图
动画/过渡效果	定时截图或录屏
响应式断点校验	设置 viewport 后截图

Troubleshooting

Problem	Fix
Blank screenshot	Page needs time to render — add `--wait 2000`
Missing lazy content	Use `--scroll` to trigger lazy loading
Need login	Use `--connect-cdp` (see Authenticated pages above)
Playwright install fails	Manual install, or set `PLAYWRIGHT_DOWNLOAD_HOST`
Wrong element selected	Check selector with browser DevTools first

Related skills

More from lintendo/axhub-skills

Installs

Repository

lintendo/axhub-skills

GitHub Stars

First Seen

Apr 14, 2026

Security Audits

Gen Agent Trust HubFail

SocketWarn

SnykWarn