# metascraper
metascraper extracts normalized metadata from the input HTML you provide, using ordered rule bundles.
Extraction quality depends on the accuracy of that input HTML.
## Quick Start

Install only what you need:

```sh
npm install metascraper \
  metascraper-author \
  metascraper-date \
  metascraper-description \
  metascraper-image \
  metascraper-logo \
  metascraper-publisher \
  metascraper-title \
  metascraper-url
```
Minimal extraction:

```js
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const metadata = await metascraper({
  url: 'https://example.com/article',
  html: '<html>...</html>'
})
```
## Recommended Workflow

- Get accurate HTML for the target URL.
- Compose rule bundles for the fields you need.
- Run `metascraper({ url, html })`.
- Narrow extraction with `pickPropNames` when speed matters.
- Add custom rules only when defaults miss required data.
## Getting HTML

metascraper needs both `url` and `html`.

- Static pages: fetch raw HTML with your HTTP client.
- JS-heavy pages: use a rendered HTML source (for example, `html-get`).
- At scale: if managing browser infrastructure is not desired, use the Microlink API as a managed alternative.

For accurate rendered markup, use the local html-get skill before running metascraper.
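For the static-page path, a minimal sketch using Node's built-in `fetch` (Node 18+) looks like this; the `getHtml` helper name is illustrative and not part of metascraper:

```js
// Illustrative helper (not part of metascraper): fetch raw HTML for a URL.
// Node 18+ ships a global fetch, so no extra HTTP client is needed.
const getHtml = async targetUrl => {
  const res = await fetch(targetUrl)
  if (!res.ok) throw new Error(`Request failed: ${res.status}`)
  // res.url reflects redirects, which keeps relative-URL resolution accurate.
  return { url: res.url, html: await res.text() }
}

// Usage sketch:
//   const { url, html } = await getHtml('https://example.com/article')
//   const metadata = await metascraper({ url, html })
```

Passing the final (post-redirect) URL rather than the requested one helps rules that resolve relative `image` and `logo` URLs.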
## Core API

Create an instance:

```js
const metascraper = require('metascraper')([/* rule bundles */])
```

Run extraction:

```js
const metadata = await metascraper({
  url,
  html,
  htmlDom,       // optional pre-parsed DOM
  rules,         // optional extra rules at runtime
  pickPropNames, // optional Set of fields to compute
  omitPropNames, // optional Set of fields to skip
  validateUrl    // default: true
})
```
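Runtime rules follow the same shape as bundle rules: a map from field name to an ordered list of resolver functions that receive `{ htmlDom, url }`. The `headline` field below is a hypothetical example, and the exact way it is passed via the `rules` option is assumed from the signature above:

```js
// Hypothetical extra rule resolved at runtime. Each resolver receives the
// cheerio-style DOM as `htmlDom` and returns a value or undefined to pass.
const extraRules = {
  headline: [
    ({ htmlDom: $ }) => $('h1').first().text() || undefined
  ]
}

// Assumed usage, matching the options object above:
//   const metadata = await metascraper({ url, html, rules: [extraRules] })
```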
## Property Selection Patterns

Compute only selected fields:

```js
const metadata = await metascraper({
  url,
  html,
  pickPropNames: new Set(['title', 'description', 'image'])
})
```

Skip expensive or unwanted fields:

```js
const metadata = await metascraper({
  url,
  html,
  omitPropNames: new Set(['logo'])
})
```

When both are present, `pickPropNames` takes precedence.
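The precedence can be modeled with a small standalone helper (illustrative only, not metascraper internals): when both sets are supplied, only the pick set is consulted.

```js
// Illustrative model of pick/omit precedence (not metascraper's code).
const selectProps = (allProps, { pick, omit } = {}) => {
  if (pick) return allProps.filter(p => pick.has(p))   // pick wins outright
  if (omit) return allProps.filter(p => !omit.has(p))
  return allProps
}

selectProps(['title', 'logo'], {
  pick: new Set(['title']),
  omit: new Set(['title'])
}) // → ['title'] — omit is ignored because pick is present
```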
## Rule Bundle Notes

- Rules are evaluated in order.
- The first rule that resolves a field wins.
- Put specific rules before generic fallbacks.
- Use package-level options to configure bundle behavior.

Example with bundle options:

```js
const metascraper = require('metascraper')([
  require('metascraper-logo')({
    filter: imageUrl => imageUrl.endsWith('.png')
  })
])
```
## Common Bundles

Core article fields:
`metascraper-author`, `metascraper-date`, `metascraper-description`, `metascraper-image`, `metascraper-logo`, `metascraper-publisher`, `metascraper-title`, `metascraper-url`

Media and enrichment:
`metascraper-audio`, `metascraper-video`, `metascraper-lang`, `metascraper-iframe`, `metascraper-readability`

Provider-specific examples:
`metascraper-youtube`, `metascraper-spotify`, `metascraper-soundcloud`, `metascraper-x`, `metascraper-instagram`, `metascraper-tiktok`
## Troubleshooting

- Missing metadata: verify the HTML contains rendered tags (Open Graph, Twitter, JSON-LD, etc.).
- Wrong relative URLs: ensure `url` is absolute and matches the fetched page.
- Slow extraction: use `pickPropNames` to compute only needed fields.
- URL validation failures: pass `validateUrl: false` only when input is intentionally non-standard.
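For the missing-metadata case, a rough regex probe (a heuristic sketch, not a parser) can quickly confirm whether common metadata markup survived fetching:

```js
// Heuristic checks for metadata markup in a fetched HTML string.
// A real audit should use an HTML parser; this only flags obvious absences.
const hasOpenGraph = html => /<meta[^>]+property=["']og:/i.test(html)
const hasJsonLd = html =>
  /<script[^>]+type=["']application\/ld\+json["']/i.test(html)
```

If both return `false` for a JS-heavy page, the HTML was likely fetched before rendering; switch to a rendered source such as html-get.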
## Output Fields (Typical)

Common fields include:
`title`, `description`, `author`, `publisher`, `date`, `image`, `logo`, `url`, `lang`, `audio`, `video`

The returned shape depends on loaded rule bundles.