Scraping Documentation
Scrape documentation sites and convert them to organized Markdown files.
Usage
/scraping-documentation <url> [options]
Arguments
URL (required):
- Base URL of documentation site (e.g., https://docs.example.com)
Options:
- --output DIR: Output directory (default: ./docs)
- --depth N: How many levels to crawl (default: 2)
- --include PATTERN: Only include URLs matching pattern
- --exclude PATTERN: Exclude URLs matching pattern
- --format FORMAT: Output format (markdown|html|both)
- --index: Generate index file
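The options above map naturally onto a command-line parser. The following is a minimal sketch using Python's `argparse`, assuming the skill is backed by a script. The `build_parser` name is hypothetical, and the `markdown` default for `--format` is an assumption, since only the allowed values are documented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and options."""
    parser = argparse.ArgumentParser(prog="scraping-documentation")
    parser.add_argument("url", help="Base URL of the documentation site")
    parser.add_argument("--output", default="./docs", metavar="DIR")
    parser.add_argument("--depth", type=int, default=2, metavar="N")
    parser.add_argument("--include", metavar="PATTERN")
    parser.add_argument("--exclude", metavar="PATTERN")
    # Assumption: the documented choices are known, the default is not.
    parser.add_argument("--format", choices=["markdown", "html", "both"],
                        default="markdown")
    parser.add_argument("--index", action="store_true")
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(
    ["https://docs.example.com", "--depth", "1", "--include", "/api/*"]
)
```

Here `args.depth` is the integer given on the command line, while `args.output` falls back to the documented `./docs` default.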
Examples
# Scrape library documentation
/scraping-documentation https://docs.example.com --output ai_docs/knowledge/example
# Limited depth crawl
/scraping-documentation https://api.example.com/docs --depth 1
# Include only API reference
/scraping-documentation https://docs.example.com --include "/api/*"
# Exclude changelog pages
/scraping-documentation https://docs.example.com --exclude "/changelog/*"
Workflow
Step 1: Discover Pages
# Start from base URL
base_url = args.url
discovered = set()
to_crawl = [base_url]

while to_crawl and len(discovered) < max_pages:
    url = to_crawl.pop(0)
    if url in discovered:
        continue
    if not matches_include(url) or matches_exclude(url):
        continue
    # Fetch page
    content = fetch_url(url)
    # Extract links
    links = extract_links(content, base_url)
    # Add to queue (respect depth)
    depth = get_depth(url, base_url)
    if depth < max_depth:
        to_crawl.extend(links)
    discovered.add(url)
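The discovery loop can be sketched as a runnable breadth-first crawl. This is one possible reading, not the skill's actual implementation: depth is counted in link hops from the base URL, patterns are treated as shell-style globs on the URL path, and the start page is always expanded even when it fails the filters (otherwise `--include "/api/*"` on a root URL would discover nothing). The `allowed`, `discover`, and `get_links` names are hypothetical, and `get_links` stands in for fetching a page and extracting its links.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def allowed(url, include=None, exclude=None):
    """Apply the include/exclude globs to the URL path."""
    path = urlparse(url).path or "/"
    if include and not fnmatchcase(path, include):
        return False
    if exclude and fnmatchcase(path, exclude):
        return False
    return True


def discover(base_url, get_links, max_depth=2, max_pages=100,
             include=None, exclude=None):
    """Breadth-first discovery; get_links(url) returns absolute URLs."""
    host = urlparse(base_url).netloc
    seen = {base_url}
    queue = deque([(base_url, 0)])  # (url, link hops from base_url)
    discovered = []
    while queue and len(discovered) < max_pages:
        url, depth = queue.popleft()
        if allowed(url, include, exclude):
            discovered.append(url)
        elif url != base_url:
            continue  # filtered out: neither kept nor expanded
        if depth < max_depth:
            for link in get_links(url):
                # Assumption: only same-host links belong to the docs site.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return discovered


# In-memory stand-in for a real site, so the example needs no network.
SITE = {
    "https://docs.example.com": [
        "https://docs.example.com/api/auth",
        "https://docs.example.com/changelog/v1",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/api/auth": [
        "https://docs.example.com/api/endpoints",
        "https://docs.example.com",
    ],
    "https://docs.example.com/api/endpoints": [
        "https://docs.example.com/api/deep",
    ],
}
pages = discover("https://docs.example.com", lambda u: SITE.get(u, []),
                 include="/api/*")
```

Tracking the depth alongside each queued URL avoids recomputing it from the URL, and the `seen` set keeps a page from being queued twice when several pages link to it.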
Step 2: Fetch and Convert
for url in discovered:
    # Fetch content
    html = fetch_url(url)
    # Convert to markdown
    markdown = html_to_markdown(html)
    # Clean up
    markdown = clean_markdown(markdown)
    # Determine output path
    path = url_to_filepath(url, output_dir)
    # Write file
    write_file(path, markdown)
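The `url_to_filepath` helper is not defined in the source. One plausible sketch, assuming directory-style URLs map to `index.md` and a trailing `.html` or `.htm` is dropped:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_filepath(url: str, output_dir: str) -> str:
    """Map a page URL to a .md path under output_dir (illustrative only)."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        path += "index"  # directory-style URL becomes an index page
    rel = PurePosixPath(path.lstrip("/"))
    if rel.suffix in (".html", ".htm"):
        rel = rel.with_suffix("")
    return str(PurePosixPath(output_dir) / f"{rel}.md")


example = url_to_filepath(
    "https://docs.example.com/api/authentication.html",
    "ai_docs/knowledge/example",
)
```

This sketch ignores query strings and does not guard against `..` segments in the URL path; a real implementation would likely need both.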
Step 3: HTML to Markdown Conversion
Handle common documentation patterns:
- Code blocks with syntax highlighting
- Tables
- Admonitions/callouts
- Navigation (strip)
- Headers (preserve hierarchy)
- Links (convert to relative)
- Images (download and reference locally)
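Of the patterns listed above, link rewriting is the least obvious. The sketch below shows one way to turn an absolute in-site link into a path relative to the file being written. The `page_path` mapping and the `relative_link` name are hypothetical, and external links are left untouched.

```python
import posixpath
from urllib.parse import urlparse


def page_path(url: str) -> str:
    """Hypothetical URL-to-file mapping: /api/auth -> api/auth.md."""
    path = urlparse(url).path.strip("/")
    return (path or "index") + ".md"


def relative_link(from_url: str, to_url: str, base_url: str) -> str:
    """Rewrite a link on from_url so it points at the local copy of to_url."""
    if urlparse(to_url).netloc != urlparse(base_url).netloc:
        return to_url  # external link: keep the absolute URL
    start = posixpath.dirname(page_path(from_url)) or "."
    return posixpath.relpath(page_path(to_url), start)


link = relative_link(
    "https://docs.example.com/api/authentication",
    "https://docs.example.com/guides/quickstart",
    "https://docs.example.com",
)
```

Computing the link relative to the directory of the source file keeps the output tree browsable wherever it is moved.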
Step 4: Generate Index
# {Site Name} Documentation
Scraped from: {base_url}
Date: {timestamp}
## Contents
{for section in sections:}
### {section.title}
{for page in section.pages:}
- [{page.title}]({page.path})
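The index template can be rendered with a few lines of string building. A minimal sketch, assuming sections arrive as a mapping from section title to `(page title, path)` pairs; the `render_index` name and that data shape are assumptions for illustration.

```python
from datetime import date


def render_index(site_name, base_url, sections, timestamp=None):
    """Render the index template; sections maps title -> [(page title, path)]."""
    lines = [
        f"# {site_name} Documentation",
        "",
        f"Scraped from: {base_url}",
        f"Date: {timestamp or date.today().isoformat()}",
        "",
        "## Contents",
    ]
    for title, pages in sections.items():
        lines += ["", f"### {title}"]
        lines += [f"- [{page_title}]({path})" for page_title, path in pages]
    return "\n".join(lines) + "\n"


index_md = render_index(
    "Example",
    "https://docs.example.com",
    {"API": [("Authentication", "api/authentication.md"),
             ("Endpoints", "api/endpoints.md")]},
    timestamp="2024-01-01",
)
```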
Step 5: Report
## Scraping Complete
### Summary
- Base URL: {base_url}
- Pages scraped: {count}
- Output directory: {output_dir}
- Total size: {size}
### Files Created
{for file in files:}
- {file.path} ({file.size})
### Structure
{directory tree}
### Next Steps
Add to CLAUDE.md:
```markdown
## Documentation
@{output_dir}/index.md
```
## Output Structure
ai_docs/knowledge/example/
├── index.md              # Table of contents
├── getting-started.md    # Converted pages
├── api/
│   ├── index.md
│   ├── authentication.md
│   └── endpoints.md
├── guides/
│   ├── index.md
│   └── quickstart.md
└── _assets/              # Downloaded images
    └── diagram.png
## Conversion Rules
### Code Blocks
```html
<pre><code class="language-python">print("hello")</code></pre>
```
→
```python
print("hello")
```
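The code block rule above can be approximated with a regular expression. This is a sketch that assumes the `<pre><code class="language-...">` shape shown in the example; real documentation sites vary widely, and a proper HTML parser would be more robust. The `convert_code_blocks` name is hypothetical.

```python
import html
import re

FENCE = "`" * 3  # built indirectly so the fence never appears at a line start
CODE_BLOCK = re.compile(
    r'<pre><code(?: class="language-([\w+-]+)")?>(.*?)</code></pre>',
    re.DOTALL,
)


def convert_code_blocks(page_html: str) -> str:
    """Replace <pre><code> blocks with fenced Markdown code blocks."""
    def to_fence(match):
        lang = match.group(1) or ""  # no class attribute: plain fence
        code = html.unescape(match.group(2)).strip("\n")
        return f"{FENCE}{lang}\n{code}\n{FENCE}"
    return CODE_BLOCK.sub(to_fence, page_html)


converted = convert_code_blocks(
    '<pre><code class="language-python">print("hello")</code></pre>'
)
```

Entities such as `&lt;` are unescaped so that the fenced block contains the literal source code.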
### Tables
HTML tables → Markdown tables
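A sketch of the table rule using the standard library's `html.parser`. It assumes simple tables whose first row is the header, with no `colspan`, `rowspan`, or nested tables; the `TableParser` and `table_to_markdown` names are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of each table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            if not self.rows:  # tolerate a cell outside any <tr>
                self.rows.append([])
            self.rows[-1].append(text.replace("|", "\\|"))
            self._cell = None


def table_to_markdown(table_html: str) -> str:
    parser = TableParser()
    parser.feed(table_html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


table_md = table_to_markdown(
    "<table><tr><th>Issue</th><th>Action</th></tr>"
    "<tr><td>404 page</td><td>Skip and log</td></tr></table>"
)
```

Pipe characters inside cells are escaped so they are not read as column separators.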
### Callouts
<div class="warning">Important note</div>
→
> ⚠️ **Warning**: Important note
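The callout rule can be sketched in the same style. Only the `warning` class appears in the source; the other class names accepted here are assumptions, and they are rendered without an icon. The `convert_callouts` name is hypothetical.

```python
import re

# Only "warning" comes from the source; the other class names are assumptions.
CALLOUT = re.compile(r'<div class="(note|tip|warning)">(.*?)</div>', re.DOTALL)
ICONS = {"warning": "\u26a0\ufe0f"}  # the warning sign used in the rule above


def convert_callouts(page_html: str) -> str:
    """Turn simple callout divs into Markdown blockquotes."""
    def to_quote(match):
        kind = match.group(1)
        body = " ".join(match.group(2).split())  # collapse whitespace
        icon = ICONS.get(kind)
        prefix = f"{icon} " if icon else ""
        return f"> {prefix}**{kind.capitalize()}**: {body}"
    return CALLOUT.sub(to_quote, page_html)


quote = convert_callouts('<div class="warning">Important note</div>')
```

Nested markup inside the callout is not handled; a div containing another div would need a real parser.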
### Navigation
Strip navigation, sidebars, and footers; keep only the main content.
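One way to strip page chrome is to ignore everything nested inside certain tags while walking the HTML. The tag set below is an assumption (sites often mark sidebars with classes rather than `<aside>`), and the sketch extracts plain text only; the `ContentExtractor` and `main_text` names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags hold navigation, sidebars, footers, or non-content.
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def main_text(page_html: str) -> str:
    parser = ContentExtractor()
    parser.feed(page_html)
    parser.close()
    return "\n".join(parser.chunks)


text = main_text(
    "<nav><a href='/'>Home</a></nav>"
    "<main><h1>Install</h1><p>Run the installer.</p></main>"
    "<footer>Example footer</footer>"
)
```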
## Error Handling
| Issue | Action |
|---|---|
| 404 page | Skip and log |
| Rate limited | Back off and retry |
| Login required | Report and skip |
| JavaScript rendered | Warn (content may be incomplete) |
| Large file | Skip with warning |
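Part of the table above can be sketched as a fetch wrapper. The mapping from HTTP status codes to table rows is an assumption (404 for a missing page, 429 for rate limiting, 401 and 403 for login required), as are the retry count and the doubling delay. `fetch` is a hypothetical callable returning `(status_code, body)`.

```python
import logging
import time

log = logging.getLogger("scraper")


def fetch_with_policy(url, fetch, max_retries=3, sleep=time.sleep):
    """Return the page body, or None when the page should be skipped."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 404:
            log.info("skipping missing page: %s", url)
            return None
        if status in (401, 403):
            log.warning("login required, skipping: %s", url)
            return None
        if status == 429 and attempt < max_retries:
            sleep(delay)  # back off, then retry with a longer delay
            delay *= 2
            continue
        log.warning("giving up on %s (status %s)", url, status)
        return None
    return None
```

Passing `sleep` as a parameter lets tests record the waits instead of actually pausing. Detecting JavaScript-rendered pages and oversized files is left out of this sketch.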
## Best Practices
- Respect robots.txt - Check before scraping
- Rate limiting - Don't overload servers
- Attribution - Keep source URL in files
- Updates - Re-run periodically to pick up changes
- Selection - Use include/exclude to get relevant content only
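The first two practices can be combined using the standard library's `urllib.robotparser`. A minimal sketch that parses robots.txt text already in hand, so no network access is needed; the `load_rules` name and the one-second fallback delay are assumptions.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str, user_agent: str = "*"):
    """Return (can_fetch, delay) derived from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def can_fetch(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    # Assumption: wait one second between requests when no Crawl-delay is set.
    delay = parser.crawl_delay(user_agent) or 1.0
    return can_fetch, delay


ROBOTS = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
can_fetch, delay = load_rules(ROBOTS)
```

A crawler would call `can_fetch(url)` before each request and sleep for `delay` seconds between requests.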
Repository: memyselfandm/cconami
First seen: Mar 1, 2026