metascraper
metascraper extracts normalized metadata from the input HTML you provide, using ordered rule bundles.
Extraction quality depends on the accuracy of that input HTML.
Quick Start
Install only what you need:
npm install metascraper \
  metascraper-author \
  metascraper-date \
  metascraper-description \
  metascraper-image \
  metascraper-logo \
  metascraper-publisher \
  metascraper-title \
  metascraper-url
Minimal extraction:
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])
const metadata = await metascraper({
  url: 'https://example.com/article',
  html: '<html>...</html>'
})
Recommended Workflow
- Get accurate HTML for the target URL.
- Compose rule bundles for the fields you need.
- Run metascraper({ url, html }).
- Narrow extraction with pickPropNames when speed matters.
- Add custom rules only when defaults miss required data.
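The workflow above can be sketched as one small helper. This is a minimal sketch under stated assumptions, not part of metascraper: extractMetadata, getHtml, scraper, and fields are hypothetical names. scraper stands in for an instance created as in Quick Start, and getHtml for whatever HTML source you use.

```javascript
// Hypothetical helper: wires an HTML source to a metascraper-like instance.
// `getHtml(url)` must resolve to an HTML string; `scraper` is assumed to be
// a function that accepts { url, html, pickPropNames } and resolves to metadata.
async function extractMetadata ({ url, getHtml, scraper, fields }) {
  // Get accurate HTML for the target URL first.
  const html = await getHtml(url)
  const options = { url, html }
  // Narrow extraction only when the caller lists the fields it needs.
  if (fields && fields.length > 0) {
    options.pickPropNames = new Set(fields)
  }
  return scraper(options)
}
```

Passing the HTML source and the instance as arguments keeps the helper easy to exercise with stubs before wiring in real network access.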
Getting HTML
metascraper needs both url and html.
- Static pages: fetch raw HTML with your HTTP client.
- JS-heavy pages: use a rendered HTML source (for example, html-get).
- At scale: if you do not want to run browser infrastructure, use the Microlink API as a managed alternative.
For accurate rendered markup, use the local html-get skill before running metascraper.
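For the static-page case, a plain HTTP fetch is usually enough. The sketch below assumes Node.js 18 or later (global fetch); getStaticHtml and its fetchImpl parameter are hypothetical names introduced for illustration. It returns the final URL after redirects alongside the HTML, so the url you pass on matches the page that was actually fetched.

```javascript
// Hypothetical helper for static pages. Assumes Node.js 18+ (global fetch).
// `fetchImpl` is injectable so the helper can be exercised without network access.
async function getStaticHtml (url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, { redirect: 'follow' })
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  const html = await response.text()
  // response.url is the final URL after redirects; fall back to the input URL.
  return { url: response.url || url, html }
}
```

This does not execute JavaScript on the page, so it is not a substitute for a rendered HTML source on JS-heavy sites.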
Core API
Create an instance:
const metascraper = require('metascraper')([/* rule bundles */])
Run extraction:
const metadata = await metascraper({
  url,
  html,
  htmlDom, // optional pre-parsed DOM
  rules, // optional extra rules at runtime
  pickPropNames, // optional Set of fields to compute
  omitPropNames, // optional Set of fields to skip
  validateUrl // default true
})
Property Selection Patterns
Compute only selected fields:
const metadata = await metascraper({
  url,
  html,
  pickPropNames: new Set(['title', 'description', 'image'])
})
Skip expensive or unwanted fields:
const metadata = await metascraper({
  url,
  html,
  omitPropNames: new Set(['logo'])
})
When both are present, pickPropNames takes precedence.
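The precedence rule can be modeled in a few lines. This is an illustrative model only, not metascraper's internal code; selectPropNames and its available argument are hypothetical names.

```javascript
// Illustrative model of the selection rule; not metascraper's internal code.
// `available` is the list of field names the loaded bundles can produce.
function selectPropNames (available, { pickPropNames, omitPropNames } = {}) {
  // A non-empty pick set wins: only the listed fields are computed.
  if (pickPropNames && pickPropNames.size > 0) {
    return available.filter(name => pickPropNames.has(name))
  }
  // Otherwise, drop whatever the omit set lists.
  if (omitPropNames && omitPropNames.size > 0) {
    return available.filter(name => !omitPropNames.has(name))
  }
  return available
}
```

In this model, a field listed in both sets is still computed, because the omit set is never consulted once a pick set is present.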
Rule Bundle Notes
- Rules are evaluated in order.
- The first rule that resolves a field wins.
- Put specific rules before generic fallbacks.
- Use package-level options for bundle behavior.
Example with bundle options:
const metascraper = require('metascraper')([
  require('metascraper-logo')({
    filter: imageUrl => imageUrl.endsWith('.png')
  })
])
Common Bundles
Core article fields:
- metascraper-author
- metascraper-date
- metascraper-description
- metascraper-image
- metascraper-logo
- metascraper-publisher
- metascraper-title
- metascraper-url
Media and enrichment:
- metascraper-audio
- metascraper-video
- metascraper-lang
- metascraper-iframe
- metascraper-readability
Provider-specific examples:
- metascraper-youtube
- metascraper-spotify
- metascraper-soundcloud
- metascraper-x
- metascraper-instagram
- metascraper-tiktok
Troubleshooting
- Missing metadata: verify the HTML contains rendered tags (Open Graph, Twitter, JSON-LD, etc.).
- Wrong relative URLs: ensure url is absolute and matches the fetched page.
- Slow extraction: use pickPropNames to compute only needed fields.
- URL validation failures: pass validateUrl: false only when input is intentionally non-standard.
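The first two checks can be roughed out before running an extraction. The helper below is a hypothetical pre-flight check (inspectInput is not a metascraper API), and its regular expressions are a coarse heuristic rather than an HTML parser, so treat a false result as a hint to look closer, not as proof.

```javascript
// Hypothetical pre-flight check; the regexes are a rough heuristic, not an HTML parser.
function inspectInput ({ url, html }) {
  let absoluteUrl = false
  try {
    // new URL() throws for relative or malformed input.
    const { protocol } = new URL(url)
    absoluteUrl = protocol === 'http:' || protocol === 'https:'
  } catch (error) {
    absoluteUrl = false
  }
  return {
    absoluteUrl,
    openGraph: /<meta[^>]+property=["']og:/i.test(html),
    twitter: /<meta[^>]+(?:name|property)=["']twitter:/i.test(html),
    jsonLd: /<script[^>]+type=["']application\/ld\+json["']/i.test(html)
  }
}
```

If all three tag checks come back false on a page that shows rich previews elsewhere, the HTML was probably captured before client-side rendering finished.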
Output Fields (Typical)
Common fields include:
- title
- description
- author
- publisher
- date
- image
- logo
- url
- lang
- audio
- video
The returned shape depends on loaded rule bundles.