# Scrape Webpage

`scrape-webpage` — Extract content, metadata, and images from a webpage for import/migration.
## When to Use This Skill

Use this skill when:

- You are starting a page import and need to extract content from the source URL
- You need a webpage analysis with local image downloads
- You want metadata extraction (Open Graph, JSON-LD, etc.)

Invoked by: `page-import` skill (Step 1)
## Prerequisites

Before using this skill, ensure:

- ✅ Node.js is available
- ✅ Playwright is installed (`npm install playwright`)
- ✅ The Chromium browser is installed (`npx playwright install chromium`)
- ✅ The Sharp image library is installed (`cd .claude/skills/scrape-webpage/scripts && npm install`)
## Related Skills

- `page-import` - Orchestrator that invokes this skill
- `identify-page-structure` - Uses this skill's output (screenshot, HTML, metadata)
- `generate-import-html` - Uses the image mapping and paths from this skill
## Scraping Workflow

### Step 1: Run the Analysis Script

Command:

```bash
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
```
What the script does:

- Sets up network interception to capture all images
- Loads the page in headless Chromium
- Scrolls through the entire page to trigger lazy-loaded images
- Downloads all images locally (converting WebP/AVIF/SVG to PNG)
- Captures a full-page screenshot for visual reference
- Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
- Fixes images in the DOM (background-image → img, picture elements, srcset → src, relative → absolute, inline SVG → img)
- Extracts cleaned HTML (removing scripts/styles)
- Replaces image URLs in the HTML with local paths (`./images/...`)
- Generates document paths (sanitized, lowercase, no `.html` extension)
- Saves the complete analysis, including the image mapping, to `metadata.json`
For a detailed explanation, see `resources/web-page-analysis.md`.
### Step 2: Verify Output

Output files:

- `./import-work/metadata.json` - Complete analysis with paths and image mapping
- `./import-work/screenshot.png` - Visual reference for layout comparison
- `./import-work/cleaned.html` - Main content HTML with local image paths
- `./import-work/images/` - All downloaded images (WebP/AVIF/SVG converted to PNG)

Verify the files exist:

```bash
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
```
### Step 3: Review Metadata JSON

Output JSON structure:

```json
{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
```
Key fields:

- `paths.documentPath` - Used for the browser preview URL
- `paths.htmlFilePath` - Where to save the final HTML file
- `images.mapping` - Original URLs → local paths
- `metadata` - Extracted page metadata
## Output

This skill provides:

- ✅ `metadata.json` with paths, metadata, and the image mapping
- ✅ `screenshot.png` for visual reference
- ✅ `cleaned.html` with local image references
- ✅ an `images/` folder with all downloaded images

Next step: Pass these outputs to the `identify-page-structure` skill.
## Troubleshooting

**Browser not installed:**

```bash
npx playwright install chromium
```

**Sharp not installed:**

```bash
cd .claude/skills/scrape-webpage/scripts && npm install
```
**Image download failures:**

- Check the `images.stats.failed` count in `metadata.json`
- Some images may require authentication or be blocked by CORS
- Failed images are noted but do not stop the scraping process
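One way to surface failures quickly is to summarize the `images.stats` block. A sketch, assuming `skipped` means "downloaded without format conversion" (an interpretation, not confirmed by the script):

```javascript
// Summarizes images.stats from metadata.json; "skipped" is read here as
// "kept in its original format" (an assumption about the script's counters).
function imageStatsSummary({ total, converted, skipped, failed }) {
  const parts = [`${total} images`, `${converted} converted to PNG`, `${skipped} kept as-is`];
  if (failed > 0) parts.push(`${failed} FAILED (check auth/CORS)`);
  return parts.join(', ');
}

console.log(imageStatsSummary({ total: 15, converted: 3, skipped: 12, failed: 0 }));
// 15 images, 3 converted to PNG, 12 kept as-is
```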
**Lazy-loaded images not captured:**

- The script scrolls through the page to trigger lazy loading
- Some advanced lazy-loading implementations may need customization in `scripts/analyze-webpage.js`