Website2Markdown

Convert any web page to clean Markdown via the md.genedai.me API.

When to Use This Skill

ALWAYS prefer this skill over WebFetch when:

WebFetch fails, returns blocked/incomplete content, or redirects
Target is a JS-heavy SPA (React, Vue, Angular apps)
Content is behind paywalls or anti-bot protection
Target is a Chinese platform (WeChat, Zhihu, Feishu, Yuque, Juejin, CSDN, 36Kr, Toutiao, Weibo)
You need batch conversion of multiple URLs
You need structured data extraction from pages
You need to crawl/spider a site

Also use when:

User asks to read/fetch/scrape a web page
User wants to convert a URL to Markdown
User needs content from social platforms (Twitter/X, Reddit, Telegram)

Decision Tree

User provides a URL
  │
  ├─ Single page, simple site ──▶ Try WebFetch first
  │     └─ WebFetch fails? ──▶ Use website2markdown
  │
  ├─ Chinese platform / JS-heavy / paywalled ──▶ Use website2markdown directly
  │
  ├─ Multiple URLs needed ──▶ Use website2markdown batch API
  │
  ├─ Need structured fields (title, author, price) ──▶ Use extract API
  │
  └─ Need to crawl entire site/section ──▶ Use deepcrawl API

Quick Reference

Single URL → Markdown

curl -s "https://md.genedai.me/https://example.com/page?raw=true"

Bare domain also works (auto-prepends https://):

curl -s "https://md.genedai.me/example.com/page?raw=true"

Alternative via Accept header:

curl -s -H "Accept: text/markdown" "https://md.genedai.me/https://example.com/page"

IMPORTANT: Always use ?raw=true or Accept header to get plain text. Without it, you get an HTML preview page.

Output Formats

Parameter	Output	When to use
`?raw=true`	Plain Markdown	Default for reading content
`?format=json&raw=true`	JSON `{url, title, markdown, method, timestamp}`	When you need metadata
`?format=html&raw=true`	Cleaned HTML	When you need HTML structure
`?format=text&raw=true`	Plain text (no formatting)	When you need raw text only

Query Parameters

Parameter	Purpose	Example
`raw=true`	Return plain text instead of HTML preview	Always use this
`selector=<css>`	Extract specific elements (max 256 chars)	`selector=article`, `selector=.content`
`force_browser=true`	Force headless Chrome rendering	JS-heavy or anti-bot sites
`no_cache=true`	Bypass KV cache for fresh content	When content may have changed
`engine=jina`	Route through Jina Reader API	Alternative conversion engine

Usage Patterns

Pattern 1: Read a Single Article

curl -s "https://md.genedai.me/https://example.com/article?raw=true" | head -c 100000

Always pipe through head -c 100000 for potentially large pages to avoid flooding context.

Pattern 2: Extract Specific Content via CSS Selector

curl -s "https://md.genedai.me/https://example.com?raw=true&selector=article"
curl -s "https://md.genedai.me/https://example.com?raw=true&selector=%23main-content"

URL-encode # as %23 in selectors.

Pattern 3: JS-Heavy or Anti-Bot Sites

curl -s "https://md.genedai.me/https://spa-app.com?raw=true&force_browser=true"

Pattern 4: Get Structured JSON (title + content)

curl -s "https://md.genedai.me/https://example.com?format=json&raw=true"

Returns: {"url": "...", "title": "...", "markdown": "...", "method": "...", "timestamp": "..."}

Pattern 5: Chinese & Social Platforms

These work out of the box — the API has built-in adapters:

# WeChat article
curl -s "https://md.genedai.me/https://mp.weixin.qq.com/s/xxxxx?raw=true&force_browser=true"

# Zhihu article
curl -s "https://md.genedai.me/https://zhuanlan.zhihu.com/p/123456?raw=true"

# Twitter/X
curl -s "https://md.genedai.me/https://x.com/user/status/123456?raw=true&force_browser=true"

# Reddit
curl -s "https://md.genedai.me/https://reddit.com/r/programming/comments/xxx?raw=true"

For all 21 supported platforms with URL patterns and troubleshooting, load references/platform-adapters.md.

Advanced APIs

All advanced APIs require Authorization: Bearer <API_TOKEN> header.

API	Endpoint	Purpose
Batch	`POST /api/batch`	Convert up to 10 URLs in one request
Extract	`POST /api/extract`	Structured field extraction (CSS/XPath/Regex)
Deep Crawl	`POST /api/deepcrawl`	Crawl a site with filtering and scoring
Jobs	`POST /api/jobs`	Async task queue for long-running operations
Health	`GET /api/health`	Service health check (no auth needed)

Batch API (quick example)

curl -s -X POST "https://md.genedai.me/api/batch" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/a", "https://example.com/b"]}'

Extract API (quick example)

curl -s -X POST "https://md.genedai.me/api/extract" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"strategy": "css", "url": "https://example.com", "schema": {"fields": [{"name": "title", "selector": "h1", "type": "text"}]}}'

Deep Crawl API (quick example)

curl -s -X POST "https://md.genedai.me/api/deepcrawl" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"seed": "https://docs.example.com", "max_depth": 2, "max_pages": 20, "strategy": "bfs"}'

For complete API documentation with all parameters, response formats, and examples, load references/advanced-apis.md. For copy-paste ready curl commands, load assets/recipes.md.

Authentication Matrix

Route	Auth Required?
`/<url>?raw=true`	No (public)
`/api/batch`	Yes (`API_TOKEN`)
`/api/extract`	Yes (`API_TOKEN`)
`/api/deepcrawl`	Yes (`API_TOKEN`)
`/api/jobs/*`	Yes (`API_TOKEN`)
`/api/health`	No

Token format: Authorization: Bearer <token> header, or ?token=<token> query parameter.

Supported Platforms (21 Adapters)

Category	Platforms
Chinese	WeChat, Zhihu, Feishu/Lark, Yuque, Juejin, CSDN, 36Kr, Toutiao, Weibo, NetEase
Social	Twitter/X, Reddit, Telegram
International	GitHub, Substack, Medium
Academic/Productivity	arxiv, Wikipedia, YouTube, Google Docs, Notion

For URL patterns, special handling details, and troubleshooting per platform, load references/platform-adapters.md.

Response Headers

Header	Meaning
`X-Markdown-Method`	`native`, `readability+turndown`, or `browser+readability+turndown`
`X-Cache-Status`	`HIT` or `MISS`
`X-Browser-Rendered`	`true` if headless Chrome was used
`X-Paywall-Detected`	`true` if paywall was triggered
`X-Markdown-Fallbacks`	Comma-separated fallback strategies applied

Implementation Notes for Agents

Always use curl -s — silent mode suppresses progress output
Always add ?raw=true — without it you get an HTML preview page, not markdown
Pipe through head -c 100000 for large pages to prevent context overflow
URL-encode special characters in the target URL if needed (# → %23, ? → %3F in the path)
No auth needed for single-URL conversion — just use the GET endpoint
Rate limited per IP — avoid rapid successive calls; add 1-2s delay between multiple sequential calls
Auto-fallback — the API tries native → Readability → Browser automatically
For Chinese platforms, prefer using force_browser=true for best results
Jina fallback — automatically used as last resort when Readability extraction produces little content

Error Handling

Symptom	Solution
Empty content returned	Add `force_browser=true`
Timeout error	Use `selector` to narrow scope, or the page is too large
429 Too Many Requests	Wait a few seconds and retry
503 Service Unavailable	For protected endpoints: check API_TOKEN; for rate limit: wait and retry
Garbled/incomplete content	Try `engine=jina` as alternative
Platform-specific failure	Verify URL format matches adapter patterns

Reference Materials

Document	Content	Lines
`references/advanced-apis.md`	Batch, Extract, DeepCrawl, Jobs API — full parameters, request/response formats	~350
`references/platform-adapters.md`	21 platform adapters — URL patterns, special handling, troubleshooting	~300
`assets/recipes.md`	Copy-paste curl recipes — basic, platform, advanced, diagnostic	~180