# content-extraction
This skill extracts ALL content from an existing website and outputs it as structured, reusable data files. It crawls every page, downloads every asset, and produces a complete content inventory.
The user provides: the URL of the site to extract from, and optionally the target format (TypeScript, JSON, Markdown).
## What Gets Extracted
For each page on the site:
| Content type | Output |
|---|---|
| Text | Headings, paragraphs, lists, quotes — preserved with hierarchy |
| Metadata | `<title>`, `<meta description>`, OG tags, canonical URL, `lang` |
| Images | Downloaded to public/images/ with original filenames. Alt text cataloged |
| Links | Internal + external, with anchor text and destination URL |
| PDFs & assets | Downloaded to public/assets/. Filenames and original URLs cataloged |
| Forms | Field names, types, labels, validation rules, action URLs |
| Navigation | Menu structure, link hierarchy, active states |
| Structured data | JSON-LD, microdata, schema.org markup |
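To make the table concrete, a single extracted page record might be shaped like this. The field names below are illustrative for this sketch, not a fixed schema:

```typescript
// Illustrative shape of one extracted page record (names are this
// sketch's assumption, not the skill's canonical schema).
interface ExtractedPage {
  url: string;
  title: string;
  description: string;
  headings: { level: number; text: string }[];
  images: { src: string; alt: string }[];
  links: { href: string; text: string; internal: boolean }[];
}

const about: ExtractedPage = {
  url: "https://example.com/about",
  title: "About | Example Co",
  description: "Who we are and what we do.",
  headings: [{ level: 1, text: "About us" }],
  images: [{ src: "/images/team.jpg", alt: "The team" }],
  links: [{ href: "/contact", text: "Get in touch", internal: true }],
};
```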
## Extraction Process

### Step 1: Discover all pages
Use browser automation (agent-browser or Playwright) to:
1. Start at the site root
2. Extract all internal links from navigation, footer, sitemap.xml
3. Follow every internal link recursively
4. Build a complete page list with URLs
5. Detect and note any client-side routing (SPA)
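The discovery loop above can be sketched as a breadth-first crawl. Here `fetchLinks` stands in for the real browser step (navigating with Playwright or agent-browser and collecting hrefs), which keeps the crawl logic itself easy to test:

```typescript
// Breadth-first page discovery sketch. `fetchLinks` is a stand-in for
// the real browser navigation + href extraction step.
type FetchLinks = (url: string) => Promise<string[]>;

async function discoverPages(root: string, fetchLinks: FetchLinks): Promise<string[]> {
  const origin = new URL(root).origin;
  const seen = new Set<string>([root]);
  const queue: string[] = [root];

  while (queue.length > 0) {
    const url = queue.shift()!;
    for (const href of await fetchLinks(url)) {
      const abs = new URL(href, url); // resolve relative links against the current page
      abs.hash = "";                  // ignore #fragment variants of the same page
      const normalized = abs.toString();
      // follow internal links only, and visit each page exactly once
      if (abs.origin === origin && !seen.has(normalized)) {
        seen.add(normalized);
        queue.push(normalized);
      }
    }
  }
  return [...seen];
}
```

In practice the same loop also seeds the queue from `sitemap.xml` and the footer, and records which URLs came from client-side routes.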
### Step 2: Extract content per page
For each discovered page:
1. Navigate to the page
2. Wait for full load (networkidle)
3. Extract the DOM structure:
- All headings (h1-h6) with hierarchy
- All paragraphs and text blocks
- All lists (ordered/unordered)
- All images (src, alt, dimensions)
- All links (href, text, target)
- All forms (fields, labels, actions)
4. Extract `<head>` metadata
5. Take a full-page screenshot for reference
6. Save structured data to output files
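One detail of the DOM extraction worth sketching is the heading hierarchy: the page walk yields a flat list of `(level, text)` pairs, which a small stack-based pass can nest so the output preserves document structure. The function and field names here are illustrative:

```typescript
// Rebuild heading hierarchy from the flat (level, text) pairs the DOM
// walk produces: h2s nest under the preceding h1, h3s under h2, etc.
interface Heading { level: number; text: string; children: Heading[] }

function nestHeadings(flat: { level: number; text: string }[]): Heading[] {
  const roots: Heading[] = [];
  const stack: Heading[] = [];
  for (const { level, text } of flat) {
    const node: Heading = { level, text, children: [] };
    // pop until the top of the stack is a strictly shallower heading
    while (stack.length && stack[stack.length - 1].level >= level) stack.pop();
    if (stack.length === 0) roots.push(node);
    else stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return roots;
}
```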
### Step 3: Download assets
1. Download all images to public/images/{page-slug}/
2. Download all PDFs to public/assets/
3. Download favicons, OG images, other static assets
4. Preserve original filenames where possible
5. Note any broken/404 asset URLs
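Resolving asset references and choosing local paths (points 1 and 4 above) can be sketched as follows. The `slugify` helper and exact layout are assumptions consistent with the `public/images/{page-slug}/` convention:

```typescript
// Turn an asset reference found on a page into (a) the absolute URL to
// download and (b) the local path to save it under, preserving the
// original filename. slugify() is a hypothetical helper for this sketch.
function slugify(pagePath: string): string {
  const slug = pagePath.replace(/^\/+|\/+$/g, "").replace(/\//g, "-");
  return slug === "" ? "home" : slug; // site root maps to "home"
}

function planImageDownload(pageUrl: string, src: string) {
  const absoluteUrl = new URL(src, pageUrl).toString(); // resolve relative src
  const pathname = new URL(absoluteUrl).pathname;
  const filename = pathname.split("/").pop() || "asset"; // keep original filename
  const pageSlug = slugify(new URL(pageUrl).pathname);
  return { absoluteUrl, localPath: `public/images/${pageSlug}/${filename}` };
}
```

A download that returns a 404 would be cataloged with its `absoluteUrl` and skipped rather than retried indefinitely.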
### Step 4: Generate output files
#### content-inventory.md
Human-readable summary of everything extracted:
```markdown
# Content Inventory — {site-name}

## Pages ({count})

### / (Homepage)
- Title: "Site Name — Tagline"
- Description: "Meta description here"
- H1: "Main Heading"
- Sections: Hero, Features (3), CTA, Testimonials (4)
- Images: 8 (hero.jpg, feature-1.png, ...)
- Links: 12 internal, 3 external
- Forms: none

### /about
...
```
#### src/data/*.ts (TypeScript format)
Structured data files ready for import:
```typescript
// src/data/pages.ts
export interface Page {
  slug: string;
  url: string;
  title: string;
  description: string;
  headings: { level: number; text: string }[];
  sections: Section[];
}

// src/data/navigation.ts
export interface NavItem {
  label: string;
  href: string;
  children?: NavItem[];
}

// src/data/images.ts
export interface ImageAsset {
  originalUrl: string;
  localPath: string;
  alt: string;
  width?: number;
  height?: number;
  page: string;
}
```
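The generation step might render extracted records into such a typed module as a string before writing it to disk. This is a minimal sketch; the `as const` export style and field subset are assumptions:

```typescript
// Sketch: emit the body of src/data/pages.ts from extracted records.
// Generating a typed module (rather than plain JSON) keeps the data
// importable with autocompletion in the consuming project.
interface PageRecord { slug: string; url: string; title: string }

function renderPagesModule(pages: PageRecord[]): string {
  return [
    "// Generated by content-extraction. Do not edit by hand.",
    `export const pages = ${JSON.stringify(pages, null, 2)} as const;`,
  ].join("\n");
}
```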
#### screenshots/
Full-page screenshots of every page for visual reference.
## Agent Team Integration
When used as a teammate in website-refactor, this skill runs as the content-extractor agent:
- Owns: `src/data/`, `scripts/extract/`, `public/images/`, `public/assets/`, `content-inventory.md`, `screenshots/`
- Outputs: structured data files that the `designer` and `content-verifier` teammates consume
- Signals completion: by marking its task complete and confirming `content-inventory.md` is written
## Standalone Usage
Can be invoked independently for:
- Content audits: "How much content does this site have?"
- Migration planning: "Extract everything from the old site before we rebuild"
- Competitive analysis: "What content does competitor X have?"
- Archival: "Save a complete copy of this site's content"
## Output Formats
The default output is TypeScript (`.ts` files). Pass a format preference as an argument:

- `typescript` (default) — `src/data/*.ts` with interfaces and typed exports
- `json` — `content/*.json` files
- `markdown` — `content/*.md` files with frontmatter
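For the markdown mode, a minimal per-page renderer might look like this. The frontmatter field names mirror the TypeScript format, and the quoting here is deliberately simplistic:

```typescript
// Sketch of the markdown output mode: one file per page, YAML
// frontmatter followed by the page body. JSON.stringify doubles as a
// good-enough YAML string quoter for flat scalar values.
function toMarkdown(page: { slug: string; title: string; description: string; body: string }): string {
  return [
    "---",
    `title: ${JSON.stringify(page.title)}`,
    `description: ${JSON.stringify(page.description)}`,
    `slug: ${JSON.stringify(page.slug)}`,
    "---",
    "",
    page.body,
  ].join("\n");
}
```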
## Common Pitfalls
- SPAs and client-side routing: Some sites don't render content without JavaScript. Always use a real browser (Playwright/agent-browser), not HTTP fetches.
- Lazy-loaded content: Scroll the full page before extracting to trigger lazy-loaded images and infinite scroll sections.
- Authentication walls: Some content may be behind login. Note these pages as "requires auth" in the inventory.
- Rate limiting: Add delays between page fetches to avoid being blocked. 1-2 seconds between requests is safe.
- Duplicate content: The same content may appear on multiple pages (shared sections, footers). Deduplicate in the data files.
- Relative URLs: Always resolve to absolute URLs before cataloging or downloading.
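The deduplication pitfall can be handled by keying each section on a hash of its normalized content before writing the data files; a minimal sketch:

```typescript
// Deduplicate shared sections (footers, repeated CTAs) across pages by
// hashing whitespace-normalized content, so formatting differences
// between pages don't defeat the comparison.
import { createHash } from "node:crypto";

interface Section { page: string; html: string }

function dedupeSections(sections: Section[]): { unique: Section[]; duplicates: number } {
  const seen = new Set<string>();
  const unique: Section[] = [];
  let duplicates = 0;
  for (const s of sections) {
    // collapse runs of whitespace before hashing
    const key = createHash("sha256").update(s.html.replace(/\s+/g, " ").trim()).digest("hex");
    if (seen.has(key)) duplicates++;
    else { seen.add(key); unique.push(s); }
  }
  return { unique, duplicates };
}
```

A fancier version could keep a reference from each page to the shared section's first occurrence instead of dropping the repeats.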