Smart Web Fetch

When you need to fetch a web page's main content and convert it to clean Markdown, prefer the script bundled with this skill: scripts/smart_web_fetch.py.

Quick Start

Check the help first (do not start by reading the source):

python3 scripts/smart_web_fetch.py --help

Fetch a page and print Markdown:

python3 scripts/smart_web_fetch.py "https://example.com"

Output structured JSON (with the fields provider, title, source_url, and markdown):

python3 scripts/smart_web_fetch.py "https://example.com" --json
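
A minimal sketch of how a caller might consume the --json output. It assumes the output is a single top-level JSON object carrying the four fields listed above; the helper name parse_fetch_output and the sample payload are hypothetical and not part of the script.

```python
import json

# The four fields named in this document for --json output.
REQUIRED_KEYS = ("provider", "title", "source_url", "markdown")


def parse_fetch_output(raw: str) -> dict:
    """Parse --json output and check that the documented fields are present.

    Assumption: the script prints one top-level JSON object.
    """
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


# Hypothetical sample payload, for illustration only.
sample = json.dumps({
    "provider": "jina",
    "title": "Example Domain",
    "source_url": "https://example.com",
    "markdown": "# Example Domain\n\nSome text.",
})
result = parse_fetch_output(sample)
print(result["provider"], len(result["markdown"]))
```

In practice the raw string would come from running the command above, for example through subprocess.run with capture_output=True and text=True, and reading its stdout.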

For pages that require a logged-in or verified session, pass a Cookie:

python3 scripts/smart_web_fetch.py "https://mp.weixin.qq.com/s?..." --service wechat --cookie "name=value; name2=value2"

You can also read the Cookie from a file:

python3 scripts/smart_web_fetch.py "https://mp.weixin.qq.com/s?..." --service wechat --cookie-file ./wechat.cookie

Providers (Auto Routing)

With the default --service auto, the script requests the following providers in parallel:

jina: https://r.jina.ai/<original_url>
wechat: for mp.weixin.qq.com, calls the local wechat-article-extractor, extracts structured fields, then converts them to Markdown
scrapling: enabled when scrapling and html2text are installed locally; extracts the main content node first, then converts it to Markdown
markdown.new: https://markdown.new/<original_url> (falls back to POST https://markdown.new/ when necessary)
defuddle: https://defuddle.md/<original_url_without_scheme> (for example, defuddle.md/example.com/path)
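
The URL patterns of the three hosted providers can be sketched as below. This is an illustration under assumptions, not the script's actual code: the function name build_provider_urls is hypothetical, and whether the script keeps the query string for defuddle or applies any extra encoding is not stated here.

```python
from urllib.parse import urlsplit


def build_provider_urls(original_url: str) -> dict:
    """Build request URLs following the patterns listed above (hypothetical helper)."""
    parts = urlsplit(original_url)
    # defuddle takes the URL without its scheme: host + path (+ query, an assumption).
    without_scheme = parts.netloc + parts.path
    if parts.query:
        without_scheme += "?" + parts.query
    return {
        "jina": "https://r.jina.ai/" + original_url,
        "markdown.new": "https://markdown.new/" + original_url,
        "defuddle": "https://defuddle.md/" + without_scheme,
    }


for name, url in build_provider_urls("https://example.com/path").items():
    print(name, url)
```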

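The script validates each response (length, error keywords, HTML returned in place of Markdown), picks the first valid result, and cancels the remaining requests.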
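
The "first valid result wins" pattern can be sketched roughly as follows. Everything here is an assumption for illustration: the names is_valid and first_valid, the MIN_CHARS threshold, and the ERROR_KEYWORDS list are hypothetical, and the script itself may manage its requests differently (it depends on curl, so it may stop processes rather than threads).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MIN_CHARS = 200  # hypothetical threshold; the real one is tuned with --min-chars
ERROR_KEYWORDS = ("access denied", "too many requests")  # hypothetical keyword list


def is_valid(text, min_chars=MIN_CHARS):
    """Reject short responses, error pages, and raw HTML returned instead of Markdown."""
    if text is None or len(text.strip()) < min_chars:
        return False
    lowered = text.lower()
    if any(keyword in lowered for keyword in ERROR_KEYWORDS):
        return False
    head = lowered.lstrip()
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return False
    return True


def first_valid(fetchers, min_chars=MIN_CHARS):
    """Run all fetchers in parallel and return (name, text) for the first valid one.

    fetchers maps a provider name to a zero-argument callable returning text.
    Returns None when no provider yields a valid result.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {pool.submit(fn): name for name, fn in fetchers.items()}
    try:
        for future in as_completed(futures):
            try:
                text = future.result()
            except Exception:
                continue  # a failed provider is simply skipped
            if is_valid(text, min_chars):
                return futures[future], text
        return None
    finally:
        # Cancels fetchers that have not started yet (Python 3.9+);
        # threads that are already running are left to finish on their own.
        pool.shutdown(wait=False, cancel_futures=True)
```

A thread pool cannot interrupt a request that is already in flight, so this sketch only approximates the cancellation described above.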

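For mp.weixin.qq.com, the script tries wechat-article-extractor first and then scrapling, so that Jina quota is not wasted. If the article is blocked by WeChat's environment verification, retry with the browser's Cookie request header passed via --cookie or --cookie-file.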
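
The host-based priority described above might look like the sketch below. The function name provider_plan and the GENERAL list are hypothetical; in particular, whether the remaining providers are still tried after wechat and scrapling fail is an assumption, not something this document states.

```python
from urllib.parse import urlsplit

# Providers used for ordinary URLs (requested in parallel under --service auto).
GENERAL = ["jina", "scrapling", "markdown.new", "defuddle"]


def provider_plan(url: str) -> list:
    """Return provider names in the order they would be attempted (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    if host == "mp.weixin.qq.com":
        # WeChat articles: local extractor first, then scrapling.
        # Assumption: the other providers follow only as a last resort.
        rest = [name for name in GENERAL if name != "scrapling"]
        return ["wechat", "scrapling"] + rest
    return list(GENERAL)


print(provider_plan("https://mp.weixin.qq.com/s?..."))
print(provider_plan("https://example.com"))
```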

Notes

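Base dependencies: python3 and curl must be available on the system
WeChat support: node must be available, and the npm dependencies of wechat-article-extractor must be installed
Optional enhancement: installing scrapling and html2text gives stronger anti-bot handling and cleaner main-content extraction
Recommended install: pip install "scrapling[fetchers]" html2text && scrapling install
Not suitable for: login, payment, upload, or heavily interactive pages, or pages blocked by strict anti-scraping measures or CAPTCHAs
For tuning, use: --timeout, --min-chars, --service, --cookie, --cookie-file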
