---
name: web-content-extraction
description: Content extraction
---

## Overview
This skill extracts documentation and website content as markdown files. It uses a tiered approach: try the simplest method first, fall back to heavier tools only when needed.
## Context
User provides a URL to extract content from. This skill is appropriate when:
- Extracting documentation sites (API docs, tutorials, reference guides)
- Crawling entire websites to markdown for offline reference
- Archiving websites or generating sitemaps
- Converting multi-page documentation to organized markdown
## Process
- Discover all pages (llms.txt, sitemap XML, nav link extraction, or progressive crawling)
- Detect platform to choose extraction method
- Batch extract content using the simplest working method
- Organize output and verify results
## Page discovery (ordered by preference)

Before extracting content, discover all pages on the site. Try these methods in order:

### Method 1: llms.txt

Some documentation platforms (notably Mintlify) serve an `llms.txt` file listing all pages:

```bash
# Check for llms.txt
curl -sL "https://example.com/docs/llms.txt" -o /tmp/llms.txt
head -20 /tmp/llms.txt

# Extract all URLs
grep -oE 'https://[^\s]+\.md' /tmp/llms.txt
```

This is the most reliable method when available. It gives you every page URL immediately.
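When you'd rather post-process the listing in a script, the same extraction the grep performs can be sketched in Python (the sample listing below is illustrative):

```python
import re


def extract_md_urls(llms_txt: str) -> list[str]:
    r"""Pull every .md page URL out of an llms.txt listing,
    equivalent to: grep -oE 'https://[^\s]+\.md'"""
    return re.findall(r'https://\S+\.md', llms_txt)


sample = (
    "# Example Docs\n"
    "- [Overview](https://example.com/docs/overview.md)\n"
    "- [CLI](https://example.com/docs/cli.md)\n"
)
```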
### Method 2: sitemap XML

Most documentation sites have a sitemap:

```bash
# Check common sitemap locations
curl -sL "https://example.com/sitemap.xml" | head -20
curl -sL "https://example.com/sitemap-index.xml" | head -20
curl -sL "https://example.com/docs/sitemap-index.xml" | head -20
curl -sL "https://example.com/docs/sitemap-0.xml" | head -20

# Extract all URLs from sitemap
curl -sL "https://example.com/sitemap-0.xml" | grep -oE 'https://[^<]+' | sort
```

Starlight/Astro sites reliably have sitemaps. Look for `<link rel="sitemap">` in the HTML source.
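If you'd rather parse the sitemap properly than grep it, the standard library is enough. A minimal sketch (note that real sites may serve a sitemap index that nests further sitemap files):

```python
import xml.etree.ElementTree as ET


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Collect <loc> entries from a sitemap, ignoring the XML namespace."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]
```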
### Method 3: Jina AI Reader nav extraction

Use Jina to render the page and extract sidebar/nav links:

```bash
curl -sL "https://r.jina.ai/https://example.com/docs/" -o /tmp/nav.md
grep -oE '/docs/[a-z0-9/-]+' /tmp/nav.md | sort -u
```
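The grep-and-dedupe step has a direct Python equivalent when the rendered nav is processed in a script (a sketch; the `/docs/` prefix is the same assumption the grep makes):

```python
import re


def doc_paths(markdown: str) -> list[str]:
    """Collect unique /docs/ paths from rendered nav markdown,
    mirroring: grep -oE '/docs/[a-z0-9/-]+' | sort -u"""
    return sorted(set(re.findall(r'/docs/[a-z0-9/-]+', markdown)))
```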
### Method 4: progressive crawling (Crawl4AI)

For sites where the above methods fail, use Crawl4AI to discover pages by following links. See the "Advanced: Crawl4AI patterns" section below.
## Platform detection and extraction
### Mintlify sites (Anthropic, Stripe, many API docs)

Mintlify serves raw markdown when you append `.md` to any page URL. This is the ideal case: zero dependencies, clean output, no nav/footer noise.

Detection:

- `llms.txt` exists
- Appending `.md` to a URL returns markdown (not HTML)

Extraction:

```bash
# Test: does the .md URL return markdown?
curl -sL "https://example.com/docs/en/overview.md" | head -5

# If it starts with markdown (# heading, etc.), use direct curl for all pages:
OUT="./output"
mkdir -p "$OUT"
pages=(overview quickstart setup cli-reference)  # from llms.txt
for page in "${pages[@]}"; do
  curl -sL "https://example.com/docs/en/${page}.md" -o "${OUT}/${page}.md" &
done
wait

# Verify
ls -1 "$OUT" | wc -l
find "$OUT" -name "*.md" -size 0  # check for empty files
du -sh "$OUT"
```
### Starlight/Astro sites (OpenCode, many OSS projects)

Starlight renders HTML server-side. Use Jina AI Reader to convert to markdown.

Detection:

- HTML contains `Starlight` or `astro` in meta tags / generator
- Has `sitemap-index.xml`

Extraction:

```bash
# Discover pages via sitemap
curl -sL "https://example.com/docs/sitemap-0.xml" | grep -oE 'https://[^<]+' > /tmp/urls.txt

OUT="./output"
mkdir -p "$OUT"

# Batch download via Jina (rate limit: max 5 concurrent)
while read -r url; do
  slug=$(echo "$url" | sed 's|.*/docs/||; s|/$||; s|/|-|g')
  [ -z "$slug" ] && slug="index"
  curl -sL "https://r.jina.ai/${url}" -o "${OUT}/${slug}.md" &
  running=$(jobs -r | wc -l)
  if [ "$running" -ge 5 ]; then
    wait -n
  fi
done < /tmp/urls.txt
wait
```

Note: Jina output includes nav/sidebar noise. For reference docs this is acceptable. For cleaner output, use Crawl4AI with CSS selectors.
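The slug derivation in that loop is easy to get wrong; here is the same logic as a small Python function (assuming the `/docs/` prefix used above):

```python
import re


def url_to_slug(url: str) -> str:
    """Python equivalent of the sed pipeline: strip everything through
    /docs/, drop the trailing slash, flatten remaining slashes to dashes."""
    path = re.sub(r'.*/docs/', '', url).rstrip('/')
    return path.replace('/', '-') or "index"
```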
### Docusaurus, GitBook, ReadTheDocs, Sphinx

These platforms render HTML. Use Jina as the first attempt; fall back to Crawl4AI if Jina output is too noisy.

Platform-specific CSS selectors (for Crawl4AI fallback):

| Platform | Content selector | Exclude |
|---|---|---|
| Docusaurus | article, .markdown, .theme-doc-markdown | nav, footer, .pagination, .table-of-contents |
| GitBook | .gitbook-root, .page-body, .theme-doc | nav, header, .sidebar, .navigation |
| ReadTheDocs | .document, .role-content, .bd-content | nav, .sidebar, .toctree, .related-topics |
| Sphinx | .document, .body, .section | nav, .related, .sphinxsidebar, .toctree-wrapper |
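Picking the right row usually comes down to the page's generator meta tag. A rough detection sketch (the hint strings are assumptions; real generator values vary by version, and some platforms omit the tag):

```python
import re

# Substring-of-generator -> platform name (illustrative, not exhaustive)
GENERATOR_HINTS = {
    "docusaurus": "docusaurus",
    "gitbook": "gitbook",
    "sphinx": "sphinx",
    "astro": "starlight/astro",
    "mintlify": "mintlify",
}


def detect_platform(html: str) -> str:
    """Read <meta name="generator" content="..."> and map it to a platform.
    Assumes the name attribute precedes content, as most generators emit."""
    m = re.search(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)',
        html, re.I,
    )
    generator = (m.group(1) if m else "").lower()
    for hint, platform in GENERATOR_HINTS.items():
        if hint in generator:
            return platform
    return "unknown"
```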
### Generic sites

Try methods in this order:

- Check for `llms.txt` or a `.md` URL suffix
- Check for sitemap XML
- Use Jina AI Reader
- Fall back to Crawl4AI
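This ordered fallback is simple to express as a dispatcher; a minimal sketch where each discovery method is a callable returning a list of page URLs (empty on failure):

```python
def discover_pages(base_url, methods):
    """Try each discovery method in order; the first non-empty
    result wins. `methods` might be, e.g., [llms_txt, sitemap,
    jina_nav, crawl] -- the names here are placeholders."""
    for method in methods:
        urls = method(base_url)
        if urls:
            return urls
    return []
```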
## Complete workflow examples

### Example 1: Mintlify docs (simplest case)

```bash
# 1. Discover pages
curl -sL "https://code.claude.com/docs/llms.txt" -o /tmp/llms.txt
grep -oE 'https://[^\s]+\.md' /tmp/llms.txt > /tmp/urls.txt

# 2. Extract page names
sed 's|.*/en/||; s|\.md||' /tmp/urls.txt > /tmp/pages.txt

# 3. Batch download
OUT="./claude-code-docs"
mkdir -p "$OUT"
while read -r page; do
  curl -sL "https://code.claude.com/docs/en/${page}.md" -o "${OUT}/${page}.md" &
done < /tmp/pages.txt
wait

# 4. Verify
echo "$(ls -1 "$OUT"/*.md | wc -l) files, $(du -sh "$OUT" | cut -f1)"
find "$OUT" -name "*.md" -size 0 -exec echo "EMPTY: {}" \;
```
### Example 2: Starlight/Astro docs (Jina approach)

```bash
# 1. Discover pages via sitemap
curl -sL "https://opencode.ai/docs/sitemap-index.xml"  # find sitemap URL
curl -sL "https://opencode.ai/docs/sitemap-0.xml" | grep -oE 'https://[^<]+' > /tmp/urls.txt

# 2. Batch download via Jina (throttled)
OUT="./opencode-docs"
mkdir -p "$OUT"
while read -r url; do
  slug=$(echo "$url" | sed 's|.*/docs/||; s|/$||; s|/|-|g')
  [ -z "$slug" ] && slug="index"
  curl -sL "https://r.jina.ai/${url}" -o "${OUT}/${slug}.md" &
  running=$(jobs -r | wc -l)
  [ "$running" -ge 5 ] && wait -n
done < /tmp/urls.txt
wait

# 3. Verify
echo "$(ls -1 "$OUT"/*.md | wc -l) files, $(du -sh "$OUT" | cut -f1)"
```
### Example 3: full site crawl to markdown (Crawl4AI)

For sites that need JavaScript rendering or where simpler methods fail:

```bash
# Quick CLI approach
uvx crawl4ai crawl \
  --url "https://example.com" \
  --output-dir "output/example-com-$(date +%Y%m%d-%H%M%S)" \
  --max-depth 3 \
  --format markdown
```
Python implementation for more control:

```python
import asyncio
from pathlib import Path
from datetime import datetime
from urllib.parse import urljoin, urlparse

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def crawl_to_markdown(base_url, output_dir, max_depth=3):
    """Crawl entire website and save as markdown files."""
    base_domain = urlparse(base_url).netloc
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    crawl_dir = Path(output_dir) / f"{base_domain}-{timestamp}"
    crawl_dir.mkdir(parents=True, exist_ok=True)

    visited = set()
    queue = [(base_url, 0)]

    async with AsyncWebCrawler() as crawler:
        while queue and len(visited) < 100:
            url, depth = queue.pop(0)
            if url in visited or depth > max_depth:
                continue
            try:
                result = await crawler.arun(
                    url,
                    config=CrawlerRunConfig(
                        page_timeout=30000,
                        remove_overlay_elements=True
                    )
                )
                if result.success:
                    # Map the URL path to a markdown file path
                    path = urlparse(url).path
                    if path.endswith('/') or path == '':
                        path = path + 'index.md'
                    elif not path.endswith('.md'):
                        path = path + '.md'
                    output_file = crawl_dir / path.lstrip('/')
                    output_file.parent.mkdir(parents=True, exist_ok=True)
                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(f"# {result.metadata.get('title', 'Page')}\n\n")
                        f.write(str(result.markdown))
                    visited.add(url)
                    # Queue same-site links for the next depth level
                    for link_info in result.links.get("internal", []):
                        href = link_info.get("href", "")
                        if href.startswith('#'):
                            continue  # skip in-page anchors
                        absolute_url = urljoin(base_url, href)
                        if absolute_url not in visited and absolute_url.startswith(base_url):
                            queue.append((absolute_url, depth + 1))
            except Exception as e:
                print(f"Failed: {url}: {e}")

    print(f"Done: {len(visited)} pages saved to {crawl_dir}")
    return crawl_dir
```
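The URL-to-file mapping inside the crawler is worth testing on its own; extracted as a standalone helper with the same logic:

```python
from urllib.parse import urlparse


def output_path(url: str) -> str:
    """Map a page URL to a relative markdown path: directory-style
    URLs get index.md, plain pages get a .md suffix appended."""
    path = urlparse(url).path
    if path.endswith('/') or path == '':
        path += 'index.md'
    elif not path.endswith('.md'):
        path += '.md'
    return path.lstrip('/')
```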
## Advanced: Crawl4AI patterns

### Universal documentation extraction

For sites where simple methods fail, use Crawl4AI with platform-aware selectors:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_documentation(url):
    """Universal documentation extractor with platform detection."""
    content_selectors = [
        "article", ".markdown", ".theme-doc-markdown",
        ".gitbook-root", ".document", ".bd-content",
        ".page-body", ".role-content", ".section",
        ".main-content", "[role='main']"
    ]
    config = CrawlerRunConfig(
        css_selector=", ".join(content_selectors),
        wait_for="css:article, .markdown, .document, [role='main']",
        remove_overlay_elements=True,
        excluded_tags=["nav", "header", "footer", "script", "style", "noscript"],
        # Class-based exclusions go through a CSS selector, not tag names
        excluded_selector=(
            ".sidebar, .navigation, .menu, .table-of-contents, "
            ".toc, .pagination, .breadcrumbs"
        ),
        page_timeout=45000
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            return str(result.markdown)
```
### Relevance-based filtering

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def extract_relevant_docs(url, focus_area):
    """Extract documentation focused on specific topics."""
    bm25_filter = BM25ContentFilter(
        user_query=focus_area,
        bm25_threshold=1.2,
        include_tables=True,
        include_code=True
    )
    md_generator = DefaultMarkdownGenerator(
        content_filter=bm25_filter,
        options={
            "ignore_links": False,
            "ignore_images": False,
            "code_block_format": "fenced"
        }
    )
    config = CrawlerRunConfig(
        css_selector="article, .document, .markdown",
        markdown_generator=md_generator
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return str(result.markdown.fit_markdown)
```
### Batch extraction with concurrency

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_complete_docs(urls, max_concurrent=3):
    """Extract multiple pages concurrently."""
    config = CrawlerRunConfig(
        css_selector="article, .document, .markdown",
        remove_overlay_elements=True,
        page_timeout=30000
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=config,
            max_concurrent=max_concurrent
        )
        extracted = {}
        for result in results:
            if result.success:
                extracted[result.url] = str(result.markdown)
        return extracted
```
### API documentation extraction

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "api_endpoints",
    "baseSelector": ".api-endpoint, .method, .endpoint",
    "fields": [
        {"name": "method", "selector": ".http-method, .verb", "type": "text"},
        {"name": "path", "selector": ".path, .route", "type": "text"},
        {"name": "description", "selector": ".description", "type": "text"},
        {"name": "parameters", "selector": ".parameters", "type": "text"},
        {"name": "example", "selector": ".example, .code-example", "type": "text"}
    ]
}


async def extract_api_docs(url):
    extraction_strategy = JsonCssExtractionStrategy(schema=schema)
    config = CrawlerRunConfig(
        css_selector=".api-reference, .endpoints, .methods",
        extraction_strategy=extraction_strategy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return result.extracted_content
```
## Advanced: intelligent crawling

For sites that require link-following discovery (no sitemap, no llms.txt):

```python
import asyncio
from urllib.parse import urljoin, urlparse

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


class IntelligentCrawler:
    def __init__(self, base_url, max_depth=5, max_pages=100):
        self.base_url = base_url
        self.base_domain = urlparse(base_url).netloc
        self.max_depth = max_depth
        self.max_pages = max_pages
        self.visited = set()
        # Queue entries are (url, depth, priority); higher priority first
        self.queue = [(base_url, 0, 10)]

    async def crawl(self):
        """Crawl with intelligent depth control and link prioritization."""
        async with AsyncWebCrawler() as crawler:
            while self.queue and len(self.visited) < self.max_pages:
                self.queue.sort(key=lambda entry: -entry[2])
                url, depth, _ = self.queue.pop(0)
                if url in self.visited or depth > self.max_depth:
                    continue
                config = CrawlerRunConfig(
                    page_timeout=30000,
                    remove_overlay_elements=True,
                    excluded_tags=["script", "style", "nav", "footer"]
                )
                try:
                    result = await crawler.arun(url, config=config)
                    if result.success:
                        self.visited.add(url)
                        yield url, str(result.markdown), result.metadata
                        for link_info in result.links.get("internal", []):
                            href = link_info.get("href", "")
                            if self._is_valid(href):
                                absolute = urljoin(self.base_url, href)
                                if absolute not in self.visited:
                                    priority = self._priority(href, link_info)
                                    self.queue.append((absolute, depth + 1, priority))
                except Exception as e:
                    print(f"Failed: {url}: {e}")
                await asyncio.sleep(1)

    def _is_valid(self, href):
        if href.startswith(('#', 'mailto:', 'tel:', 'javascript:')):
            return False
        skip = ['.pdf', '.jpg', '.png', '.gif', '.zip', '.exe']
        return not any(href.lower().endswith(ext) for ext in skip)

    def _priority(self, href, link_info):
        text = (href + link_info.get("text", "")).lower()
        if any(kw in text for kw in ["docs", "guide", "tutorial", "api"]):
            return 10
        return 5
```
### Sitemap generation

After crawling, generate a sitemap for navigation:

```python
def create_markdown_sitemap(pages, base_url):
    """Generate markdown sitemap from crawled pages."""
    lines = [f"# Sitemap for {base_url}\n"]
    for url, title in sorted(pages.items()):
        depth = url.replace(base_url, '').count('/')
        indent = " " * depth
        lines.append(f"{indent}- [{title or url}]({url})")
    return "\n".join(lines)
```
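A quick usage sketch, with the function restated so the demo runs standalone (page URLs and titles are illustrative):

```python
def create_markdown_sitemap(pages, base_url):
    # Same logic as above, repeated here for a self-contained demo
    lines = [f"# Sitemap for {base_url}\n"]
    for url, title in sorted(pages.items()):
        depth = url.replace(base_url, '').count('/')
        indent = " " * depth
        lines.append(f"{indent}- [{title or url}]({url})")
    return "\n".join(lines)


pages = {
    "https://example.com/docs/intro": "Intro",
    "https://example.com/docs/cli/commands": "Commands",
}
sitemap = create_markdown_sitemap(pages, "https://example.com")
```

Deeper paths get more indentation, so the list mirrors the site hierarchy.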
## Guidelines

- Always try the simplest extraction method first (direct curl > Jina > Crawl4AI)
- Check for `llms.txt` and sitemap XML before crawling
- Use parallel `curl` with `&` + `wait` for batch downloads (no tokens wasted in agent context)
- Throttle Jina requests to max 5 concurrent
- For Crawl4AI, use `arun_many()` with `max_concurrent=3` for multi-page docs
- Add delays between requests when crawling large sites
- Always verify downloads: check file count, empty files, and total size
- Maintain heading hierarchy and code block formatting