Cloudflare Markdown Conversion

Use this skill to convert URLs or local files (PDFs, images, HTML, CSV, Office docs) into clean, structured Markdown for text analysis, RAG, and LLM input.

Features & Supported Formats

  • Scraping URLs: Extracts HTML, resolves relative links, handles JSON-LD, extracts title/description.
  • Images: Automatically runs object detection and uses an LLM (gemma-3-12b-it) to generate image descriptions. Converts SVG to raster.
  • PDFs: Parses internal StructTree tagging for high-fidelity semantic Markdown extraction.
  • Office Docs: Supports .docx, .xlsx, .csv, .ods, .odt, and more.

Usage

Setup & Authentication

This skill requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN.

Automatic Setup: For convenience, the script automatically looks for a .env file in the current directory or its parents. If you have a .env file in your project root, it will work out of the box.

Manual Setup: Alternatively, export both variables in your shell or pass them as parameters (--account and --token).

Instruction for the Agent: If the skill fails due to missing environment variables, check if a .env file exists in the workspace root.
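The upward .env lookup described under Automatic Setup could work roughly as sketched below. This is an illustration only, not the code in scripts/render.js: the helper names findEnvFile, parseEnv, and loadEnv are hypothetical, and the parser handles only simple KEY=VALUE lines.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: walk from startDir up to the filesystem root and
// return the path of the first .env file found, or null if there is none.
function findEnvFile(startDir = process.cwd()) {
  let dir = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(dir, ".env");
    if (fs.existsSync(candidate) && fs.statSync(candidate).isFile()) {
      return candidate;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the root without a match
    dir = parent;
  }
}

// Hypothetical helper: parse simple KEY=VALUE lines. Blank lines and
// comments are skipped, and matching surrounding quotes are stripped.
function parseEnv(text) {
  const vars = {};
  for (const line of text.split(/\r?\n/)) {
    const match = line.match(/^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$/);
    if (!match) continue;
    vars[match[1]] = match[2].replace(/^(['"])(.*)\1$/, "$2");
  }
  return vars;
}

// Hypothetical helper: copy values into process.env without overriding
// variables that are already exported in the shell.
function loadEnv(startDir = process.cwd()) {
  const envPath = findEnvFile(startDir);
  if (!envPath) return null;
  const vars = parseEnv(fs.readFileSync(envPath, "utf8"));
  for (const [key, value] of Object.entries(vars)) {
    if (process.env[key] === undefined) process.env[key] = value;
  }
  return envPath;
}
```

Under these assumptions, calling loadEnv() from any subdirectory of a project picks up CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN from the nearest .env above it, while values already exported in the shell take precedence.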

Scraping a URL

# Basic usage (defaults to 'auto' method, trying AI parsing first, then browser rendering)
node scripts/render.js --url "https://example.com"

Scraping with Options (CSS Selectors, etc.)

Cloudflare lets you restrict extraction to elements matching a cssSelector, or supply a hostname.

# Only extract the main content container
node scripts/render.js --url "https://developer.cloudflare.com" \
  --options '{"html": {"cssSelector": "main.content"}}'

Converting a Local File (PDFs, Images, Office Docs)

node scripts/render.js --file "report.pdf"

Converting Images with Language Options

Image descriptions are generated via AI. You can specify a desired output language for the description (en, it, de, es, fr, pt).

node scripts/render.js --file "cat.jpeg" \
  --options '{"image": {"descriptionLanguage": "es"}}'

Advanced Options for JS-Heavy Sites

If a site requires complex JavaScript rendering or redirects, use the browser method with specific wait conditions.

# Wait for network to be idle before extracting content
node scripts/render.js --url "https://complex-site.com" --wait "networkidle2"

# Wait for a specific element to appear (e.g. price or main content)
node scripts/render.js --url "https://shop.com/prod" --selector ".product-price"

# Increase timeout for slow pages (in milliseconds)
node scripts/render.js --url "https://slow-site.com" --timeout 60000

Valid --wait values are load, domcontentloaded (default), networkidle0, and networkidle2.

How Auto Mode Works

With --method auto, the script tries two separate rendering paths in order:

  1. Workers AI tomarkdown (Primary): Ideal for documents, standard web pages, extracting JSON-LD structured data, and resolving standard HTML features. Uses multipart form data.
  2. Browser Rendering API (Fallback): If the page relies on complex JavaScript (e.g. Single Page Apps) and the AI path cannot see the content, the Browser Rendering engine launches a real headless browser for accurate conversion.
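The primary-then-fallback order above can be sketched as follows. This is a minimal illustration, not the script's actual logic: convertAuto, viaToMarkdown, viaBrowser, and minLength are hypothetical names, and treating an empty result as "the AI path cannot see the content" is an assumption.

```javascript
// Hypothetical sketch of "auto" mode. Both converters are injected as async
// functions so the fallback logic stays independent of any real API client.
async function convertAuto(url, { viaToMarkdown, viaBrowser, minLength = 1 }) {
  try {
    const markdown = await viaToMarkdown(url);
    // Assumption: an empty or near-empty result means the AI path could not
    // see the content (for example, a JavaScript-rendered page).
    if (typeof markdown === "string" && markdown.trim().length >= minLength) {
      return { method: "ai", markdown };
    }
  } catch (err) {
    // Any failure on the primary path falls through to the browser path.
  }
  return { method: "browser", markdown: await viaBrowser(url) };
}
```

Injecting the two converters keeps the decision logic easy to exercise with stubs. A stricter heuristic, such as a larger minLength, would send more pages to the browser path at the cost of extra rendering time.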

Calling the REST API Directly (Advanced)

If you'd prefer not to use scripts/render.js, here is the curl equivalent for a local file using the tomarkdown REST API:

curl https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/ai/tomarkdown \
  -X POST \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -F "files=@document.pdf" \
  -F 'conversionOptions={"pdf":{"metadata":false}}'

Note: The tomarkdown REST API does not directly ingest a --data url="https...". For URLs, fetch the source to a local file with curl first, then upload it as files=@<temp.html>.
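For callers working in Node rather than the shell, the same fetch-then-upload flow might look like the sketch below. It assumes Node 18+ for the global fetch, FormData, and Blob. The function name urlToMarkdown, the fetchImpl parameter, and the page.html upload name are hypothetical. The endpoint and the files field come from the curl example above, and the response is returned as parsed JSON without assuming its shape.

```javascript
// Sketch only (assumes Node 18+ for the global fetch, FormData, and Blob).
// fetchImpl is injectable so the flow can be exercised without network access.
async function urlToMarkdown(pageUrl, { accountId, token, fetchImpl = fetch }) {
  // Step 1: download the page source ourselves, since the API takes files.
  const page = await fetchImpl(pageUrl);
  if (!page.ok) {
    throw new Error(`Fetching ${pageUrl} failed with status ${page.status}`);
  }
  const html = await page.text();

  // Step 2: upload the HTML as a file in the "files" multipart field.
  const form = new FormData();
  form.append("files", new Blob([html], { type: "text/html" }), "page.html");

  const endpoint =
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/tomarkdown`;
  const response = await fetchImpl(endpoint, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
    body: form,
  });
  if (!response.ok) {
    throw new Error(`tomarkdown request failed with status ${response.status}`);
  }
  return response.json(); // the response shape is not assumed here
}
```

A call would pass process.env.CLOUDFLARE_ACCOUNT_ID and process.env.CLOUDFLARE_API_TOKEN as accountId and token. The Content-Type header is deliberately left unset so that fetch can add the multipart boundary itself.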
