Web Browse

Fetch web pages and extract readable text content.

Quick fetch (raw HTML)

curl -sL "https://example.com" | head -200

Extract text with Python

curl -sL "https://example.com" | python3 -c "
import sys, html, re
raw = sys.stdin.read()
# Drop script and style blocks first so their contents are not treated as text.
text = re.sub(r'<script[^>]*>.*?</script>', '', raw, flags=re.DOTALL | re.I)
text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL | re.I)
# Replace remaining tags with spaces, decode entities, collapse whitespace.
text = re.sub(r'<[^>]+>', ' ', text)
text = html.unescape(text)
text = re.sub(r'\s+', ' ', text).strip()
print(text[:8000])
"
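Regex-based stripping can misfire on unusual markup, so a parser-based variant may be more robust. The following is a minimal sketch using only the standard library's html.parser; the names TextExtractor and extract_text are hypothetical and not part of this skill.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(raw_html, limit=8000):
    """Return whitespace-collapsed visible text, truncated to `limit` chars."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())[:limit]
```

To use it in a pipeline, one option is to save it to a file and call extract_text(sys.stdin.read()) on the output of curl -sL.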

Get page title and meta

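curl -sL "https://example.com" | python3 -c "
import sys, re, html
h = sys.stdin.read()
# Allow attributes on <title>. \x27 is a single quote, written as an escape
# so it does not end the Python string inside this shell-quoted script.
title = re.search(r'<title[^>]*>(.*?)</title>', h, re.I | re.S)
desc = re.search(r'<meta[^>]*name=[\"\x27]description[\"\x27][^>]*content=([\"\x27])(.*?)\1', h, re.I | re.S)
print('Title:', html.unescape(title.group(1).strip()) if title else 'N/A')
print('Description:', html.unescape(desc.group(2).strip()) if desc else 'N/A')
"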
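The regex above expects the name attribute to appear before content. As a hedged alternative, the sketch below reads both fields with html.parser, which handles either attribute order and either quote style. MetaExtractor and page_meta are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Record the first <title> text and the first description <meta> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name == "description" and self.description is None:
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            if self.title is None:
                self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def page_meta(raw_html):
    """Return (title, description), using 'N/A' for anything missing."""
    parser = MetaExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.title or "N/A", parser.description or "N/A"
```

The 'N/A' fallback mirrors the output of the one-liner, so the two approaches can be swapped without changing what a caller sees.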

Download a file

curl -sL -o /tmp/file.pdf "https://example.com/report.pdf"

Notes

Respect robots.txt, and space out repeated requests to the same site.
Use -L to follow redirects.
For JavaScript-heavy sites, consider the browser skill.
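One way to act on the robots.txt note is to check a URL against the site's rules before fetching it. This is a minimal sketch using the standard library's urllib.robotparser; the helper name allowed_by_robots is hypothetical, and it assumes the robots.txt text has already been fetched (for example with curl -sL "https://example.com/robots.txt").

```python
import urllib.robotparser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching `url`."""
    rp = urllib.robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no network access happens here.
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


# Illustrative rules only, not taken from any real site.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "https://example.com/private/report.pdf"))
print(allowed_by_robots(rules, "https://example.com/index.html"))
```

Passing the rules in as text keeps the check separate from the download step, so the same curl commands shown earlier can supply the input.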
