web-crawler
Use this only after the normal web_fetch path fails, returns unusable boilerplate, or the user specifically asks for YouTube video content/transcript. Prefer native tools first; paid fallback calls should be deliberate and scoped.
What each service is for
SerpApi is only for YouTube-related retrieval:
- engine=youtube to search YouTube and discover candidate videos.
- engine=youtube_video to fetch video metadata, description, chapters, related videos, and the transcript discovery link.
- engine=youtube_video_transcript to fetch timestamped transcript segments for AI analysis.
Do not use SerpApi for Google/Bing/general SERP scraping, shopping, maps, news, or any non-YouTube engine. The transparent proxy blocks those engines.
Firecrawl is only a fallback crawler for one web page when ordinary fetching fails. Use POST /v2/scrape with a single url and focused formats like markdown, html, rawHtml, links, summary, or constrained json/question/highlights extraction.
Do not use Firecrawl crawl/map/search/agent/browser endpoints for this fallback skill. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.
Access pattern through transparent proxy
Use Python scripts with core.http_client.proxied_get / proxied_post; include a typed SC-CALLER-ID header for cost tracking.
SerpApi examples:
```python
from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

search = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube", "search_query": "topic keywords"},
    headers=headers,
).json()

video = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video", "v": "VIDEO_ID"},
    headers=headers,
).json()

transcript = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video_transcript", "v": "VIDEO_ID", "language_code": "en"},
    headers=headers,
).json()
```
Firecrawl example:
```python
from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()
```
Decision rules
For a YouTube URL, extract the 11-character video id and call youtube_video_transcript first if the user's goal is content analysis, summarization, quote extraction, or topic mining. This is usually more useful than fetching the public watch page because it returns timestamped transcript segments directly.
If the transcript call returns empty, unavailable, or the wrong language, call youtube_video next to inspect title, description, channel, publication date, and any advertised transcript metadata. Try one obvious language_code change only when the desired language is clear (for example, fall back to en after a request in another language returns no transcript). If no transcript exists, summarize from metadata only and say the transcript was not available.
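The fallback chain above can be sketched as a small decision helper. Note the "transcripts" key is an assumption about the SerpApi response shape; check the real payload before relying on it:

```python
def transcript_fallback(transcript_response: dict, requested_language: str) -> str:
    """Decide the next step after a youtube_video_transcript call.

    Returns "use_transcript", "retry_language", or "metadata_only".
    The "transcripts" key is an assumed response field name.
    """
    segments = transcript_response.get("transcripts") or []
    if segments:
        return "use_transcript"
    # One obvious language change only: fall back to English
    # when the original request asked for another language.
    if requested_language != "en":
        return "retry_language"
    # No transcript available: summarize from youtube_video metadata.
    return "metadata_only"
```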
For a YouTube topic query, call engine=youtube, choose relevant video_results, then call metadata/transcript only for the videos needed. Avoid fetching many transcripts by default.
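Shortlisting can be done mechanically before spending any further paid calls. A sketch assuming the engine=youtube response carries watch-page URLs under video_results[].link; `shortlist_video_ids` is an illustrative name:

```python
from urllib.parse import urlparse, parse_qs

def shortlist_video_ids(search_response: dict, limit: int = 3) -> list[str]:
    """Take the first few distinct video ids from an engine=youtube response.

    Keeping the limit small is what keeps per-call billing under control.
    """
    ids: list[str] = []
    for result in search_response.get("video_results", []):
        qs = parse_qs(urlparse(result.get("link", "")).query)
        if "v" in qs and qs["v"][0] not in ids:
            ids.append(qs["v"][0])
        if len(ids) >= limit:
            break
    return ids
```

Fetch metadata or transcripts only for the ids this returns, never for the whole result page.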
For a blocked or JS-heavy web page, call Firecrawl once with formats:["markdown","links"] and onlyMainContent:true. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.
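Treating the Markdown as a substrate means skimming it generically rather than trusting it wholesale. A minimal, site-agnostic sketch (`skim_markdown` is an illustrative helper, not part of the Firecrawl API):

```python
import re

def skim_markdown(markdown: str) -> dict:
    """Pull obvious fields from Firecrawl markdown: the first heading as a
    title candidate, plus outbound link and image URLs for verification."""
    title_match = re.search(r"^#{1,6}\s+(.+)$", markdown, re.MULTILINE)
    # Links are [text](url); images are ![alt](url), excluded by the lookbehind.
    links = re.findall(r"(?<!!)\[[^\]]*\]\((https?://[^)\s]+)\)", markdown)
    images = re.findall(r"!\[[^\]]*\]\((https?://[^)\s]+)\)", markdown)
    return {
        "title": title_match.group(1).strip() if title_match else None,
        "links": links,
        "images": images,
    }
```

Numbers, specs, and contact hints still need reading in context; this only anchors the structured parts.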
General web-page extraction lessons:
- Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. web_fetch may return boilerplate while Firecrawl can still recover the real main content.
- Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
- Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
- If Markdown misses important layout or structured fields, retry once with rawHtml; use json, question, or highlights only when the user asked for narrow extraction and the schema/prompt is specific.
Cost discipline
SerpApi is billed per successful search-like call, regardless of result count. Firecrawl is billed per page plus expensive modifiers. Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites with this skill.
If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.
If the proxy returns 429, back off; do not parallelize around the limit.
If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.
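The three status-code rules above reduce to a small dispatch, sketched here with illustrative action names:

```python
def handle_proxy_status(status_code: int) -> str:
    """Map transparent-proxy status codes to the actions described above."""
    if status_code == 403:
        # Outside the allowed use case: change approach, do not retry.
        return "change_approach"
    if status_code == 429:
        # Rate limited: back off; never parallelize around the limit.
        return "back_off"
    if status_code >= 400:
        # Upstream failure: report the exact failure; at most one
        # clearly justified parameter change before giving up.
        return "report_failure"
    return "proceed"
```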