markdown-fetch
Markdown Fetch
Efficiently fetch web content as clean Markdown using the markdown.new service.
Why Use This
- About 80% fewer tokens than raw HTML
- Roughly 5x more content fits in the context window
- No external dependencies or parsing libraries needed
- Three-tier conversion (Markdown-first, AI fallback, browser rendering)
Triggering
This skill should trigger automatically when:
- User provides a URL (e.g., "Read https://example.com")
- User asks to extract/fetch/analyze web content
- User requests summarization of a webpage
- User needs to process article/blog/documentation URLs
Quick Start
# Fetch any URL
scripts/fetch.sh "https://example.com"
# Use browser rendering for JS-heavy sites
scripts/fetch.sh "https://example.com" --method browser
# Retain images in output
scripts/fetch.sh "https://example.com" --retain-images
Typical Usage Patterns
When a user says:
- "Read this article: https://..." → Use this skill to fetch the content
- "Summarize https://..." → Fetch with this skill first, then summarize
- "What does this page say: https://..." → Fetch the content
- "Extract the text from https://..." → Use this skill
Conversion Methods
- auto (default): Try Markdown-first, fall back to AI or browser as needed
- ai: Use Cloudflare Workers AI for conversion
- browser: Full browser rendering for JS-heavy content
Options
- --method <auto|ai|browser>: Conversion method
- --retain-images: Keep image references in output
- --output <file>: Save to file instead of stdout
Output
Returns clean Markdown with metadata:
---
title: Page Title
url: https://example.com
method: auto
duration_ms: 725
fetched_at: 2026-03-07T12:00:00Z
---
# Content here...
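The front matter shown above can be read back with standard tools. As a minimal sketch, assuming the output always starts with a `---` delimited block of `key: value` lines as in the example, the hypothetical helper `frontmatter_field` (not part of the skill) prints one field:

```shell
# Print the value of one front matter field from fetched Markdown on stdin.
# Assumption: the document starts with a '---' block of 'key: value' lines.
frontmatter_field() {
  awk -v key="$1" '
    NR == 1 && $0 != "---" { exit }   # no front matter: print nothing
    NR > 1 && $0 == "---"  { exit }   # closing delimiter: stop scanning
    index($0, key ": ") == 1 {        # line starts with "key: "
      print substr($0, length(key) + 3)
      exit
    }
  '
}
```

For example, `scripts/fetch.sh "https://example.com" | frontmatter_field method` should print the conversion method recorded in the metadata.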
When to Use
- Extracting articles, documentation, or blog posts
- Building RAG pipelines with web content
- Summarizing web pages
- Fetching content for analysis
- Converting sites to Markdown format
Implementation Notes
The service handles:
- Content negotiation (Accept: text/markdown)
- Cloudflare Workers AI conversion
- Browser rendering for dynamic content
- Automatic fallback between methods
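Content negotiation means the client sends `Accept: text/markdown` and the server answers with a matching `Content-Type` when it can serve Markdown directly. A rough way to check whether a site does this is to inspect the response headers. The helper name `served_markdown` below is hypothetical, and the sketch only looks at headers; it does not reproduce what the service does internally.

```shell
# Succeed (exit 0) if the response headers on stdin declare text/markdown.
# With redirects, the last Content-Type header seen decides the result.
served_markdown() {
  tr -d '\r' | awk -F': *' '
    tolower($1) == "content-type" { found = (tolower($2) ~ /^text\/markdown/) }
    END { exit(found ? 0 : 1) }
  '
}
```

A possible use is `curl -sI -H "Accept: text/markdown" "https://example.com" | served_markdown && echo "Markdown available"`.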
Common Pitfalls
HTML Parsing Failures
Symptom: Empty output, truncated content, or "Error: No content in response"
Root Causes:
- CSS selectors broken by page layout changes
- Dynamic content not rendered by default (use --method browser)
- Page structure uses unusual HTML patterns (iframes, shadow DOM, Web Components)
Debugging & Workarounds:
# 1. Check if the page loads at all
curl -s "https://example.com" | head -c 500
# 2. Try browser rendering (handles JS-heavy sites)
scripts/fetch.sh "https://example.com" --method browser
# 3. Check what the service is getting
scripts/fetch.sh "https://example.com" 2>&1 | head -20
# 4. If still empty, the page may use iframes or require auth
Best Practice: Always use --method browser for single-page apps, social platforms, or content-heavy dashboards.
Rate Limiting & IP Blocks
Symptom: HTTP 429 (Too Many Requests), 403 (Forbidden), or connection timeouts
Root Causes:
- Multiple rapid requests to same domain
- Service IP address blacklisted by the target site
- Bot detection triggering (User-Agent headers, request patterns)
- Cloudflare/WAF blocks (common on high-traffic sites)
Debugging & Workarounds:
# 1. Check HTTP status code
curl -s -w "\n%{http_code}\n" -o /dev/null "https://example.com"
# 2. Test with curl directly (bypasses service)
curl -H "User-Agent: Mozilla/5.0" "https://example.com" | head -c 500
# 3. Add delays between requests (if fetching multiple URLs)
for url in url1 url2 url3; do
  scripts/fetch.sh "$url"
  sleep 5 # Wait 5 seconds between requests
done
# 4. If blocked, try from different IP or use residential proxy
# For markdown.new service: no built-in proxy rotation; consider mirror services
Best Practice: Respect each site's robots.txt and space out requests when processing multiple pages. When a response includes a Retry-After header, wait at least that long before retrying.
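Rate-limited responses often carry a `Retry-After` header giving the wait time. The sketch below, with the hypothetical helper name `retry_delay`, reads response headers and prints a delay in seconds. It handles only the numeric form of the header; the HTTP-date form falls back to a default.

```shell
# Read HTTP response headers on stdin and print how many seconds to wait.
# Uses the numeric Retry-After value when present; otherwise prints the
# fallback passed as $1 (default 5). HTTP-date values are not parsed.
retry_delay() {
  fallback="${1:-5}"
  delay=$(tr -d '\r' | awk -F': *' 'tolower($1) == "retry-after" { print $2; exit }')
  case "$delay" in
    ''|*[!0-9]*) echo "$fallback" ;;  # missing or non-numeric (e.g. a date)
    *)           echo "$delay" ;;
  esac
}
```

In a fetch loop this could replace a fixed sleep, for example `sleep "$(curl -sI "$url" | retry_delay 5)"`.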
Encoding Issues
Symptom: Garbled text, mojibake (weird characters), missing non-ASCII content (emojis, accents, CJK)
Root Causes:
- Page declares wrong charset in headers vs actual content
- Service not respecting Content-Type charset declaration
- UTF-8 vs Latin-1 mismatch
- Emoji or special symbols stripped during conversion
Debugging & Workarounds:
# 1. Check the page's declared charset
curl -sI "https://example.com" | grep -i "charset"
# 2. Save raw HTML and inspect encoding
curl -s "https://example.com" > raw.html
file raw.html
head -c 200 raw.html | od -c # Show raw bytes
# 3. Try the markdown.new service and check output
scripts/fetch.sh "https://example.com" | file - # Check output encoding
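# 4. If the output is not valid UTF-8, re-encode from the actual source charset
#    (ISO-8859-1 is only an example; use the charset found in steps 1-2)
scripts/fetch.sh "https://example.com" | iconv -f ISO-8859-1 -t UTF-8 > fixed.md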
Best Practice: Always verify that the output is valid UTF-8. Garbled content usually points to a charset mismatch on the page; notify the site owner or try --method browser, which may be more robust.
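One way to verify UTF-8 validity is to let `iconv` decode the stream and check its exit status. This is a minimal sketch; the helper name `is_valid_utf8` is hypothetical and `iconv` must be installed.

```shell
# Succeed (exit 0) only if stdin is valid UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

For example, `scripts/fetch.sh "https://example.com" | is_valid_utf8 || echo "output is not valid UTF-8" >&2`.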
JavaScript-Rendered Content
Symptom: Page returns but content is missing, loads as empty, or shows loading spinners instead of data
Root Causes:
- Page uses client-side JavaScript to render content (React, Vue, Angular, etc.)
- Default auto method only fetches the HTML skeleton without executing JS
- Content loaded after page render (lazy loading, infinite scroll)
- JavaScript requires authentication or API keys
Debugging & Workarounds:
# 1. Fetch with browser rendering (executes JS)
scripts/fetch.sh "https://example.com" --method browser
# 2. Check what the auto method returns
scripts/fetch.sh "https://example.com" --method auto
# 3. Inspect page source to confirm JS-rendering
curl -s "https://example.com" | grep -i "react\|vue\|angular\|<div id=\"app\""
# 4. If content still missing, page may require interaction
# (e.g., clicking buttons, scrolling) — no workaround in current service
Best Practice: Use --method browser for single-page apps and other sites that render their content client-side. The auto method is faster but can miss JS-rendered content.
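Grepping for framework names can miss custom setups. A rough alternative is to count how many visible words the raw HTML contains: a near-empty count suggests the content is rendered client-side. The helper `visible_word_count` is hypothetical, and the tag stripping is a heuristic, not an HTML parser (for instance, script bodies containing `<` are not removed cleanly).

```shell
# Print the number of words left in HTML from stdin after removing
# simple <script>/<style> blocks and all remaining tags.
visible_word_count() {
  tr '\n' ' ' |
    sed -e 's/<script[^>]*>[^<]*<\/script>//g' \
        -e 's/<style[^>]*>[^<]*<\/style>//g' \
        -e 's/<[^>]*>/ /g' |
    wc -w | tr -d ' '
}
```

For example, if `curl -s "https://example.com" | visible_word_count` prints a very small number (the cutoff is a judgment call), retrying with `--method browser` is a reasonable next step.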
Authentication & Paywall Content
Symptom: Returns login page instead of content, shows "Subscribe to read more", or HTTP 401/403
Root Causes:
- Page requires login/authentication
- Content behind paywall (subscription, membership)
- IP-based access restrictions (geo-blocking)
- Session-based authentication (cookies) not provided
Debugging & Workarounds:
# 1. Check if curl can access without auth
curl -s "https://example.com/article" | head -c 300
# 2. Identify authentication method
curl -sI "https://example.com" | grep -i "auth\|set-cookie\|www-authenticate"
# 3. Try public/preview version if available
# For paywalled sites, look for preview URLs or RSS feeds
curl -s "https://example.com/rss" | head -c 500
# 4. Check for paywall indicators
curl -s "https://example.com/article" | grep -i "paywall\|subscribe\|login required"
# 5. If the content is member-only, ask the user for access in the conversation
# (don't embed credentials in scripts)
Best Practice: This skill cannot bypass authentication or paywalls by design. For protected content:
- Use public preview links if available
- Ask the user for authenticated access
- Try RSS feeds (often unpaywalled summaries)
- Check archive services (archive.org, 12ft.io for paywalled content) separately
Service-Side Failures
Symptom: HTTP error codes (5xx), timeout, or malformed JSON responses
Root Causes:
- markdown.new service is down or overloaded
- Request body malformed (invalid URL, missing required fields)
- Service response is corrupted or incomplete
- Cloudflare Workers timeout (pages >10MB or slow servers)
Debugging & Workarounds:
# 1. Check service health
curl -s "https://markdown.new/health" 2>&1 | head
# 2. Verify request format
jq -n --arg url "https://example.com" --arg method "auto" \
  '{url: $url, method: $method, retain_images: false}' | jq .
# 3. Retry with exponential backoff
for attempt in {1..3}; do
  if result=$(scripts/fetch.sh "https://example.com" 2>&1); then
    echo "$result"
    break
  fi
  sleep $((2 ** attempt)) # 2s, 4s, 8s
done
# 4. If service is down, no workaround available
# — try again later or use alternative (curl, browser, etc.)
Best Practice: The service is stateless and generally reliable. If failures persist, fall back to curl directly or ask the user for an alternative source.
Troubleshooting Decision Tree
No content returned?
├─ Check if site requires JavaScript
│ ├─ YES → Use --method browser
│ └─ NO → Continue
├─ Check if site requires authentication
│ ├─ YES → Ask the user for authenticated access (see Authentication & Paywall Content)
│ └─ NO → Continue
├─ Check for encoding issues
│ ├─ YES → Garbled text, likely a site charset mismatch
│ └─ NO → Continue
└─ Service may be down → Wait and retry
Getting rate-limited (HTTP 429)?
├─ Add delays between requests (5-10 seconds)
└─ If persistent, try again at a different time or request fewer pages
Truncated or broken output?
└─ Try --method browser (more robust but slower)
Can't access paywalled content?
└─ No workaround — try public preview, RSS, or archive services
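Part of the decision tree can be driven by the HTTP status code alone. The sketch below maps a status code to the next step suggested in this document. The function name `suggest_action` and the message wording are illustrative, not part of the skill.

```shell
# Map an HTTP status code to the next troubleshooting step from the tree.
suggest_action() {
  case "$1" in
    2??)     echo "ok: if content is empty, retry with --method browser" ;;
    401|403) echo "blocked or auth required: look for a public preview or ask the user" ;;
    429)     echo "rate limited: wait 5-10 seconds between requests" ;;
    5??)     echo "server error: wait and retry" ;;
    *)       echo "unhandled status: inspect the response manually" ;;
  esac
}
```

It can be fed from curl, for example `suggest_action "$(curl -s -o /dev/null -w '%{http_code}' "https://example.com")"`.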