markdown-new

Convert public web pages into clean Markdown via markdown.new — a free hosted service that strips navigation, ads, and boilerplate, returning only the readable content.

When to Use

  • Extracting article text for summarization or analysis
  • Building RAG pipelines that ingest web content
  • Archiving pages in a readable format
  • Reducing token usage compared to raw HTML or full browser snapshots
  • Research workflows where you need clean text from multiple URLs

API

Prefix Mode (simplest)

Prepend https://markdown.new/ to any URL:

# Basic conversion
curl -s 'https://markdown.new/https://example.com/article'

# With options
curl -s 'https://markdown.new/https://example.com?method=browser&retain_images=true'

POST Mode (recommended for automation)

curl -s -X POST https://markdown.new/ \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "method": "auto",
    "retain_images": false
  }'

Parameters

| Parameter | Values | Default | Description |
|---|---|---|---|
| method | auto, ai, browser | auto | Conversion pipeline |
| retain_images | true, false | false | Keep image links in output |

Method Selection

  • auto — fastest; lets the service pick the best pipeline. Use first.
  • ai — forces Workers AI HTML-to-Markdown conversion. Good for well-structured HTML.
  • browser — headless browser rendering. Use for JavaScript-heavy SPAs and pages where auto misses content.

Strategy: Always try auto first. Fall back to browser only when output is incomplete or empty.
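That fallback can be sketched in a few lines. The `looks_incomplete` heuristic and its 200-character threshold are illustrative assumptions, not part of the service:

```python
"""Try method=auto first; retry with method=browser if output looks thin."""
import requests

MIN_USEFUL_CHARS = 200  # assumed heuristic threshold; tune for your content

def looks_incomplete(markdown: str, min_chars: int = MIN_USEFUL_CHARS) -> bool:
    """Treat empty or very short output as a failed extraction."""
    return len(markdown.strip()) < min_chars

def convert_with_fallback(url: str) -> str:
    """Convert a URL, escalating from auto to browser rendering if needed."""
    text = ""
    for method in ("auto", "browser"):
        resp = requests.post(
            "https://markdown.new/",
            json={"url": url, "method": method},
            timeout=30,
        )
        resp.raise_for_status()
        text = resp.text
        if not looks_incomplete(text):
            break
    return text  # browser output is returned even if still thin
```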

Response Headers

The service returns useful metadata in response headers:

  • x-markdown-tokens — estimated token count of the output
  • x-rate-limit-remaining — requests remaining in current window
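A small helper can surface that metadata alongside the content. This is a sketch; `fetch_with_metadata` and `extract_metadata` are illustrative names, not part of the service:

```python
"""Fetch Markdown and the service's metadata headers together."""
import requests

def extract_metadata(headers) -> dict:
    """Pull the documented metadata fields from a header mapping."""
    return {
        "tokens": headers.get("x-markdown-tokens"),
        "rate_limit_remaining": headers.get("x-rate-limit-remaining"),
    }

def fetch_with_metadata(url: str) -> tuple[str, dict]:
    """Return (markdown, metadata) for a converted page."""
    resp = requests.post(
        "https://markdown.new/",
        json={"url": url, "method": "auto"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text, extract_metadata(resp.headers)
```

Note that `requests` exposes response headers as a case-insensitive mapping, so the lowercase names above match regardless of how the server capitalizes them.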

Usage Patterns

Single Page Extraction

"""fetch_article.py — Extract a single article as Markdown."""
import requests

def fetch_markdown(url: str, method: str = "auto") -> str:
    """Convert a URL to clean Markdown.

    Args:
        url: Public HTTP/HTTPS URL to convert.
        method: Conversion method — "auto", "ai", or "browser".

    Returns:
        Markdown string of the page content.
    """
    resp = requests.post(
        "https://markdown.new/",
        json={"url": url, "method": method, "retain_images": False},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# Extract an article
content = fetch_markdown("https://example.com/blog/post-title")
print(f"Extracted {len(content)} chars")

Batch Extraction with Rate Limiting

"""batch_extract.py — Extract multiple URLs with rate limiting."""
import time
import requests

def batch_extract(urls: list[str], delay: float = 0.5) -> dict[str, str]:
    """Extract Markdown from multiple URLs with rate limiting.

    Args:
        urls: List of public URLs to convert.
        delay: Seconds to wait between requests to respect rate limits.

    Returns:
        Dict mapping URL to extracted Markdown content.
    """
    results = {}
    for url in urls:
        try:
            resp = requests.post(
                "https://markdown.new/",
                json={"url": url, "method": "auto"},
                timeout=30,
            )
            if resp.status_code == 429:  # Rate limited
                print("Rate limited, waiting 60s...")
                time.sleep(60)
                resp = requests.post(
                    "https://markdown.new/",
                    json={"url": url, "method": "auto"},
                    timeout=30,
                )
            resp.raise_for_status()
            results[url] = resp.text
        except Exception as e:
            print(f"Failed {url}: {e}")
            results[url] = ""
        time.sleep(delay)  # Respect rate limits
    return results

Shell One-Liner

# Quick article extraction — pipe to file or another tool
curl -s 'https://markdown.new/https://example.com/article' > article.md

# Estimate output size: wc -w prints the word count; tokens ≈ words / 0.75
curl -s 'https://markdown.new/https://example.com/article' | wc -w

Node.js

// fetch-markdown.js — URL to Markdown in Node.js
async function fetchMarkdown(url, method = 'auto') {
  const resp = await fetch('https://markdown.new/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, method, retain_images: false }),
  });

  if (resp.status === 429) {
    throw new Error('Rate limited — wait and retry');
  }

  if (!resp.ok) {
    throw new Error(`Conversion failed: ${resp.status}`);
  }

  return resp.text();
}

Limits and Best Practices

  • Rate limit: ~500 requests/day per IP. Monitor x-rate-limit-remaining header.
  • 429 responses mean you've hit the limit — back off and retry after a delay.
  • Public URLs only — the service cannot access authenticated or private pages.
  • Respect robots.txt and copyright when extracting content.
  • Verify critical extractions — output is not guaranteed complete for every page.
  • Use auto first, fall back to browser for JS-heavy pages.
  • Disable retain_images when you only need text — reduces output size.
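The back-off advice above can be sketched as an exponential retry wrapper. The delay schedule and retry count here are assumptions, not guidance from the service:

```python
"""Retry on 429 with exponential backoff before giving up."""
import time
import requests

def backoff_delay(attempt: int) -> float:
    """Delay before retry N: 2, 4, 8, 16... seconds (assumed schedule)."""
    return 2.0 ** (attempt + 1)

def fetch_with_backoff(url: str, max_retries: int = 4) -> str:
    """POST to markdown.new, sleeping between attempts on 429 responses."""
    for attempt in range(max_retries + 1):
        resp = requests.post(
            "https://markdown.new/",
            json={"url": url, "method": "auto"},
            timeout=30,
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.text
        if attempt < max_retries:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```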

Combining with Other Tools

  • Pair with whisper for multimedia research (audio transcription + article extraction)
  • Feed output into langchain or langgraph for RAG pipelines
  • Use with elasticsearch to build a searchable content index
  • Combine with sox / yt-dlp for multi-format content ingestion