html-get
html-get returns reliable HTML for a URL, choosing between a plain fetch and headless prerendering depending on what the page needs.
Quick Start
Install:
npm install html-get browserless puppeteer
Minimal usage:
const createBrowserless = require('browserless')
const getHTML = require('html-get')

// spawn Chromium once; create a disposable context per retrieval
const browser = createBrowserless()
const context = browser.createContext()

const result = await getHTML('https://example.com', {
  getBrowserless: () => context
})
console.log(result.html)

// destroy the context, then close the browser when finished
await context.then((browserless) => browserless.destroyContext())
await browser.close()
Recommended Workflow
- Start with the default prerender: 'auto'.
- Set prerender: false for static pages when speed is a priority.
- Enable rewriteUrls: true when downstream parsing needs absolute links.
- Enable rewriteHtml: true when source pages have broken meta tags.
- Reuse one browser process and create/destroy contexts per request (see the sketch below).
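A minimal sketch of the per-request context pattern from the last point, assuming the browserless context API used in Quick Start:
const createBrowserless = require('browserless')
const getHTML = require('html-get')

// one Chromium process for the lifetime of the app
const browser = createBrowserless()

const getContent = async (url) => {
  // a cheap, disposable browser context per request
  const context = browser.createContext()
  try {
    return await getHTML(url, { getBrowserless: () => context })
  } finally {
    // always destroy the context, even when getHTML rejects
    const browserless = await context
    await browserless.destroyContext()
  }
}
Keep browser.close() for process shutdown rather than calling it per request.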
CLI
One-off usage:
npx -y html-get https://example.com
Debug output with mode, timing, and headers:
npx -y html-get https://example.com --debug
Core Options
- getBrowserless (function): required unless prerender: false.
- prerender ('auto' | true | false): mode selector.
- rewriteUrls (boolean): rewrite relative HTML/CSS URLs to absolute.
- rewriteHtml (boolean): normalize common meta-tag mistakes.
- headers (object): request headers for fetch/prerender.
- gotOpts (object): extra options for got in fetch mode.
- puppeteerOpts (object): options passed to the browserless evaluate flow.
- serializeHtml (function): custom output serializer from the Cheerio instance.
- encoding (string): output encoding, default utf-8.
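For illustration, a fetch-mode call combining several of these options; the user-agent value is made up and the gotOpts shape depends on the got version you have installed:
const result = await getHTML('https://example.com', {
  prerender: false,                            // skip headless Chrome entirely
  headers: { 'user-agent': 'my-crawler/1.0' }, // hypothetical UA string
  gotOpts: { retry: { limit: 2 } },            // forwarded to got in fetch mode
  encoding: 'utf-8'
})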
Output Shape
getHTML(url, opts) resolves to:
- html: serialized HTML (or the custom serializer's output fields).
- url: final URL.
- statusCode: HTTP status.
- headers: response headers.
- redirects: redirect chain.
- stats: { mode, timing }.
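A sketch of consuming the result, assuming context is the browser context created as in Quick Start:
const { html, url, statusCode, headers, redirects, stats } = await getHTML(
  'https://example.com',
  { getBrowserless: () => context }
)

console.log(statusCode) // e.g. 200
console.log(stats.mode) // which strategy was used for this URL
console.log(redirects)  // empty array when the URL resolved directly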
Common Patterns
Force fast fetch mode for known static targets:
const result = await getHTML(url, {
prerender: false,
rewriteUrls: true
})
Prepare HTML for metadata extraction:
const page = await getHTML(url, {
getBrowserless,
rewriteUrls: true,
rewriteHtml: true
})
const metadata = await metascraper({ url: page.url, html: page.html })
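Here metascraper is assumed to be an instance composed from rule bundles, for example:
const metascraper = require('metascraper')([
  require('metascraper-title')(),
  require('metascraper-description')(),
  require('metascraper-image')()
])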
Custom serializer (return additional fields, or skip serializing the full HTML entirely):
const result = await getHTML(url, {
getBrowserless,
serializeHtml: ($) => ({
html: $.html(),
title: $('title').first().text()
})
})
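Per the Output Shape section, the serializer's fields (html and title here) come back on the resolved result alongside url, statusCode, headers, redirects, and stats.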
Reliability Notes
- If getBrowserless is missing and prerender is not false, html-get throws.
- PDF URLs are fetched and can be converted via mutool when available.
- Media URLs are normalized to HTML wrappers (img, video, audio) for consistent downstream parsing.
- For large batch jobs, control concurrency outside html-get and always clean up browser contexts (see the sketch below).
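A minimal batch sketch for the last point, assuming a CommonJS-compatible p-map release (any concurrency limiter works) and the getContent helper sketched under Recommended Workflow:
const pMap = require('p-map')

const urls = ['https://example.com', 'https://example.org']

// run at most 8 retrievals at a time; getContent destroys its own context
const results = await pMap(urls, (url) => getContent(url), { concurrency: 8 })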