article-extractor
SKILL.md
Article Extractor
Extract main content from web articles and blog posts, removing navigation, ads, and clutter. Saves clean, readable text.
Prerequisites
This skill requires UV for dependency management. Run from the tapestry-skills project root.
Workflow
URL → Validate → Detect Tool → Extract Content → Sanitize Filename → Save
Tools (in priority order):
reader(Mozilla Readability) - best for most articlestrafilatura- excellent for blogs/news (included in dependencies)- Fallback (
tapestry-extract-html) - works without external tools
Security Requirements
All security utilities are available via UV from the project root.
URL Validation
# Run security validation (checks protocol, blocks SSRF, etc.)
uv run tapestry-validate-url "$URL" || exit 1
Filename Sanitization
# Use tapestry sanitization utility
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
Step 1: Check Available Tools
if command -v reader &> /dev/null; then
TOOL="reader"
else
# trafilatura is included in project dependencies
TOOL="trafilatura"
fi
echo "Using: $TOOL"
Optional: Install reader (npm)
For best results, install reader separately:
npm install -g @aspect/readability-cli
# or
npm install -g reader-cli
Step 2: Extract Content
Using reader
TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT
reader "$URL" > "$TEMP_FILE"
# Get title (first line in markdown format)
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
Using trafilatura (via UV)
TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT
# Get content
uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"
# Get title from metadata
TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
Fallback Method (tapestry-extract-html)
Use the built-in HTML extractor when other tools aren't available:
# Extract content using tapestry's HTML extractor
uv run tapestry-extract-html "$URL" --output article.txt
Or get title and content separately:
TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT
# Fetch and extract
uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"
# Title is on first line (after "# ")
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
Step 3: Save with Clean Filename
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"
# Verify content was extracted
if [ ! -s "$TEMP_FILE" ]; then
echo "Error: No content extracted"
exit 1
fi
mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT
WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')
echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"
Complete Workflow Script
#!/bin/bash
set -e
URL="$1"
# Validate URL
uv run tapestry-validate-url "$URL" || exit 1
# Detect tool
if command -v reader &> /dev/null; then
TOOL="reader"
else
TOOL="trafilatura"
fi
echo "Extracting with: $TOOL"
# Create temp file
TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT
# Extract based on tool
case $TOOL in
reader)
reader "$URL" > "$TEMP_FILE"
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
;;
trafilatura)
uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"
TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
;;
esac
# If extraction failed, try fallback
if [ ! -s "$TEMP_FILE" ]; then
echo "Primary extraction failed, trying fallback..."
uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
fi
# Verify extraction
if [ ! -s "$TEMP_FILE" ]; then
echo "Error: No content extracted. Site may require authentication."
exit 1
fi
# Save with clean filename
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"
mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT
# Show results
WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')
echo ""
echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"
echo ""
echo "Preview:"
head -n 10 "$CONTENT_FILE"
Error Handling
| Issue | Solution |
|---|---|
| UV not installed | Install with curl -LsSf https://astral.sh/uv/install.sh | sh |
| Invalid URL | Reject with clear message |
| Internal URL (SSRF) | Block localhost/private IPs |
| Paywall/login required | Inform user, cannot extract |
| Empty extraction | Try fallback method, inform user |
| Timeout | Fallback uses 30s timeout |
What Gets Extracted
Included:
- Article title
- Author (if available)
- Main text content
- Section headings
Removed:
- Navigation menus
- Ads and promotions
- Newsletter signups
- Related articles
- Comment sections
- Social buttons
- Cookie notices
Tool Comparison
| Tool | Strengths | Availability |
|---|---|---|
| reader | Best overall, Firefox algorithm | npm install separately |
| trafilatura | News/blogs, multi-language | Included in dependencies |
| tapestry-extract-html | No external dependencies | Built-in fallback |
Dependencies
All dependencies are managed via UV and pyproject.toml:
- trafilatura: Article extraction (pinned version)
- tapestry-extract-html: Built-in fallback extractor
Optional (install separately):
- reader: Mozilla Readability CLI (
npm install -g reader-cli)
Security Reference
For complete security guidelines: ../shared/references/security-guidelines.md
Weekly Installs
5
Repository
ryanhudson/tape…ude-codeGitHub Stars
1
First Seen
Jan 23, 2026
Security Audits
Installed on
claude-code4
trae3
gemini-cli3
antigravity3
opencode3
windsurf2