html-get

html-get returns the HTML for a URL, choosing between a plain HTTP fetch and a headless-browser prerender depending on what the page needs.

Quick Start

Install:

npm install html-get browserless puppeteer

Minimal usage:

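const createBrowserless = require('browserless')
const getHTML = require('html-get')

async function main () {
  // Spawn one browser process; it can be reused across requests.
  const browser = createBrowserless()
  // createContext() resolves to an isolated browser context.
  const context = browser.createContext()

  try {
    const result = await getHTML('https://example.com', {
      getBrowserless: () => context
    })
    console.log(result.html)
  } finally {
    // Always release the context, then the browser process.
    const browserless = await context
    await browserless.destroyContext()
    await browser.close()
  }
}

main().catch(console.error)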

Recommended Workflow

  1. Start with default prerender: 'auto'.
  2. Set prerender: false for static pages when speed is priority.
  3. Enable rewriteUrls: true when downstream parsing needs absolute links.
  4. Enable rewriteHtml: true when source pages have broken meta tags.
  5. Reuse one browser process and create/destroy contexts per request.
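
The first four steps can be folded into a small options builder. This is only a sketch: `buildOptions` and its flags (`isStatic`, `needsAbsoluteLinks`, `hasBrokenMeta`) are hypothetical names invented for illustration and are not part of html-get.

```javascript
// Hypothetical helper (not part of html-get): maps the workflow above
// onto an options object that can be passed to getHTML(url, opts).
function buildOptions ({
  getBrowserless,
  isStatic = false,
  needsAbsoluteLinks = false,
  hasBrokenMeta = false
} = {}) {
  const opts = {
    // Steps 1 and 2: stay on 'auto' unless the page is known to be static.
    prerender: isStatic ? false : 'auto',
    // Step 3: absolute links for downstream parsers.
    rewriteUrls: needsAbsoluteLinks,
    // Step 4: normalize broken meta tags.
    rewriteHtml: hasBrokenMeta
  }
  // A browser context is only needed when prerendering may happen.
  if (!isStatic) opts.getBrowserless = getBrowserless
  return opts
}

// Example: a known static page whose links must be absolute.
console.log(buildOptions({ isStatic: true, needsAbsoluteLinks: true }))
```

Keeping the decision in one place makes it easier to audit which targets skip prerendering.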

CLI

One-off usage:

npx -y html-get https://example.com

Debug output with mode, timing, and headers:

npx -y html-get https://example.com --debug

Core Options

  • getBrowserless (function): required unless prerender: false.
  • prerender ('auto' | true | false): mode selector.
  • rewriteUrls (boolean): rewrite relative HTML/CSS URLs to absolute.
  • rewriteHtml (boolean): normalize common meta-tag mistakes.
  • headers (object): request headers for fetch/prerender.
  • gotOpts (object): extra options for got in fetch mode.
  • puppeteerOpts (object): options passed through to browserless when prerendering.
  • serializeHtml (function): custom serializer that receives the Cheerio instance and returns the output fields.
  • encoding (string): output encoding, default utf-8.
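
Because getBrowserless is required in every mode except prerender: false, it can help to check the options before queuing work. The following is a minimal sketch; `checkOptions` is a hypothetical helper, not an html-get export, and it only covers the two options shown.

```javascript
// Hypothetical pre-flight check (not part of html-get's API).
// Mirrors two rules from the option list: prerender must be
// 'auto', true, or false, and getBrowserless is required unless
// prerender is false.
function checkOptions (opts = {}) {
  const { prerender = 'auto', getBrowserless } = opts

  if (![true, false, 'auto'].includes(prerender)) {
    throw new TypeError(
      `prerender must be 'auto', true, or false; got ${String(prerender)}`
    )
  }
  if (prerender !== false && typeof getBrowserless !== 'function') {
    throw new TypeError('getBrowserless is required unless prerender is false')
  }
  // Return a copy with the default made explicit.
  return { ...opts, prerender }
}

// Fetch-only mode needs no browser.
console.log(checkOptions({ prerender: false, rewriteUrls: true }))
```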

Output Shape

getHTML(url, opts) resolves to:

  • html: serialized HTML (or custom serializer output fields).
  • url: final URL.
  • statusCode: HTTP status.
  • headers: response headers.
  • redirects: redirect chain.
  • stats: { mode, timing }.
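
As an illustration of reading these fields, the sketch below formats a one-line summary of a result. `describeResult` is a hypothetical helper, the sample object is made up for the example, and treating redirects as an array is an assumption based on the "redirect chain" description.

```javascript
// Hypothetical helper (not part of html-get): one-line summary of a result.
function describeResult (result) {
  // Assumption: redirects is an array; default to empty if absent.
  const { url, statusCode, redirects = [], stats = {} } = result
  return `${statusCode} ${url} (mode: ${stats.mode}, redirects: ${redirects.length})`
}

// Illustrative sample only; real values come from getHTML(url, opts).
const sample = {
  html: '<html></html>',
  url: 'https://example.com/',
  statusCode: 200,
  headers: {},
  redirects: [],
  stats: { mode: 'fetch', timing: 0 }
}

console.log(describeResult(sample))
// prints "200 https://example.com/ (mode: fetch, redirects: 0)"
```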

Common Patterns

Force fast fetch mode for known static targets:

const result = await getHTML(url, {
  prerender: false,
  rewriteUrls: true
})

Prepare HTML for metadata extraction:

const page = await getHTML(url, {
  getBrowserless,
  rewriteUrls: true,
  rewriteHtml: true
})

// Assumes metascraper was already initialized with its rule bundles.
const metadata = await metascraper({ url: page.url, html: page.html })

Custom serializer (return extra fields, such as the title, alongside the HTML):

const result = await getHTML(url, {
  getBrowserless,
  serializeHtml: ($) => ({
    html: $.html(),
    title: $('title').first().text()
  })
})

Reliability Notes

  • If getBrowserless is missing and prerender is not false, html-get throws.
  • PDF URLs are fetched and, when mutool is available, converted to HTML.
  • Media URLs are normalized to HTML wrappers (img, video, audio) for consistent downstream parsing.
  • For large batch jobs, control concurrency outside html-get and always clean up browser contexts.
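
The last note can be sketched as a small worker pool with cleanup in a finally block. This is an assumption-laden sketch, not html-get functionality: `mapWithConcurrency` and `fetchAll` are hypothetical helpers, and `getHTML` and `browser` are passed in as arguments so the code does not depend on how they were created. It relies only on the createContext() and destroyContext() calls already used in the Quick Start.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight, preserving input order in the results.
async function mapWithConcurrency (items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  async function run () {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run())
  await Promise.all(runners)
  return results
}

// Hypothetical batch wrapper: one context per URL, always destroyed.
// `getHTML` is html-get; `browser` is the object returned by browserless.
async function fetchAll (urls, { getHTML, browser, limit = 2 }) {
  return mapWithConcurrency(urls, limit, async (url) => {
    const context = browser.createContext()
    try {
      return await getHTML(url, { getBrowserless: () => context })
    } finally {
      // Runs on success and on failure, so contexts never leak.
      const browserless = await context
      await browserless.destroyContext()
    }
  })
}
```

If any single URL fails, the returned promise rejects; wrap the worker body in its own try/catch when the batch should continue past individual errors.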