Article Extractor
This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. Saves clean, readable text.
Prerequisites
At least one of:
- Node.js 18+ and npm — required for reader (npm install -g reader-cli)
- Python 3 and pip — required for trafilatura (pip3 install trafilatura)
If neither is installed, the skill falls back to a basic curl-based extraction (no install needed).
When to Use This Skill
Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions
How It Works
Priority Order:
- Check if tools are installed (reader or trafilatura)
- Download and extract article using best available tool
- Clean up the content (remove extra whitespace, format properly)
- Save to file with article title as filename
- Confirm location and show preview
Installation Check
Check for article extraction tools in this order:
Option 1: reader (Recommended - Mozilla's Readability)
command -v reader
If not installed:
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
Option 2: trafilatura (Python-based, very good)
command -v trafilatura
If not installed:
pip3 install trafilatura
Option 3: Fallback (curl + simple parsing)
If neither tool is available, fall back to basic curl + text extraction (less reliable, but needs no extra dependencies).
Extraction Methods
Method 1: Using reader (Best for most articles)
# Extract article
reader "URL" > article.txt
Pros:
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure
Method 2: Using trafilatura (Best for blogs/news)
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt
# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
Pros:
- Very accurate extraction
- Good with various site structures
- Handles multiple languages
Options:
- --no-comments: Skip comment sections
- --no-tables: Skip data tables
- --precision: Favor precision over recall
- --recall: Extract more content (may include some noise)
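For example, a run that favors precision on a cluttered site could combine these flags (a sketch using the options above; tune per site):
# Precision-biased extraction, skipping comments
trafilatura --URL "URL" --output-format txt --no-comments --precision > article.txt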
Method 3: Fallback (curl + basic parsing)
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys
class ArticleExtractor(HTMLParser):
def __init__(self):
super().__init__()
self.in_content = False
self.content = []
self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
self.current_tag = None
def handle_starttag(self, tag, attrs):
if tag not in self.skip_tags:
if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
self.in_content = True
self.current_tag = tag
def handle_data(self, data):
if self.in_content and data.strip():
self.content.append(data.strip())
def get_content(self):
return '\\n\\n'.join(self.content)
parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
Note: This is less reliable but works without dependencies.
Getting Article Title
Extract title for filename:
Using reader:
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
Using trafilatura:
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
Using curl (fallback):
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
Filename Creation
Clean title for filesystem:
# Get title
TITLE="Article Title from Website"
# Clean for filesystem (replace separators with -, delete other special chars, limit length)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')
# Add extension
FILENAME="${FILENAME}.txt"
Complete Workflow
ARTICLE_URL="https://example.com/article"
# Check for tools
if command -v reader &> /dev/null; then
TOOL="reader"
echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
TOOL="trafilatura"
echo "Using trafilatura"
else
TOOL="fallback"
echo "Using fallback method (may be less accurate)"
fi
# Extract article
case $TOOL in
reader)
# Get content
reader "$ARTICLE_URL" > temp_article.txt
# Get title (first line after # in markdown)
TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
;;
trafilatura)
# Get title from metadata
METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
# Get clean content
trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
;;
fallback)
# Get title
TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
TITLE=${TITLE%% - *} # Remove site name
TITLE=${TITLE%% | *} # Remove site name (alternate)
# Get content (basic extraction)
curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys
class ArticleExtractor(HTMLParser):
def __init__(self):
super().__init__()
self.in_content = False
self.content = []
self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
def handle_starttag(self, tag, attrs):
if tag not in self.skip_tags:
if tag in {'p', 'article', 'main'}:
self.in_content = True
if tag in {'h1', 'h2', 'h3'}:
self.content.append('\\n')
def handle_data(self, data):
if self.in_content and data.strip():
self.content.append(data.strip())
def get_content(self):
return '\\n\\n'.join(self.content)
parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
;;
esac
# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr '?' '' | tr '"' '' | tr '<>' '' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"
# Move to final filename
mv temp_article.txt "$FILENAME"
# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
Error Handling
Common Issues
1. Tool not installed
- Try alternate tool (reader → trafilatura → fallback)
- Offer to install: "Install reader with: npm install -g reader-cli"
2. Paywall or login required
- Extraction tools may fail
- Inform user: "This article requires authentication. Cannot extract."
3. Invalid URL
- Check the URL format and that the page is reachable (see the URL check sketch after this list)
- Retry with and without following redirects
4. No content extracted
- Site may use heavy JavaScript
- Try fallback method
- Inform user if extraction fails
5. Special characters in title
- Clean title for filesystem
- Strip the characters / : ? " < > |
- Replace them with - or remove them entirely
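For issues 2 and 3, a minimal pre-flight check with curl can save a failed extraction attempt (a sketch; the status-code handling is illustrative, not exhaustive):
# Follow redirects and report the final HTTP status code
STATUS=$(curl -s -L -o /dev/null -w "%{http_code}" "$ARTICLE_URL")
if [ "$STATUS" = "401" ] || [ "$STATUS" = "403" ]; then
echo "This article appears to require authentication (HTTP $STATUS). Cannot extract."
elif [ "$STATUS" != "200" ]; then
echo "URL check failed (HTTP $STATUS). Verify the URL before retrying."
fi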
Output Format
Saved File Contains:
- Article title (if available)
- Author (if available from the tool; see the metadata sketch after this list)
- Main article text
- Section headings
- No navigation, ads, or clutter
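A sketch for prepending that metadata when using trafilatura, assuming its --json output includes title and author fields (author is often missing):
# Pull title/author from trafilatura metadata and prepend them to the extracted text
METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title') or 'Article')")
AUTHOR=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('author') or '')")
{ echo "$TITLE"; [ -n "$AUTHOR" ] && echo "By $AUTHOR"; echo ""; cat temp_article.txt; } > "$FILENAME"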
What Gets Removed:
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices
Tips for Best Results
1. Use reader for most articles
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs
2. Use trafilatura for:
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content
3. Fallback method limitations:
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites
4. Check extraction quality:
- Always show preview to user
- Ask if it looks correct
- Offer to try different tool if needed
Example Usage
Simple extraction:
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
With error handling:
if ! reader "$URL" > temp.txt 2>/dev/null; then
if command -v trafilatura &> /dev/null; then
trafilatura --URL "$URL" --output-format txt > temp.txt
else
echo "Error: Could not extract article. Install reader or trafilatura."
exit 1
fi
fi
Best Practices
- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving (see the check after this list)
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)
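A minimal verification sketch, assuming the raw extraction is in temp_article.txt; the 50-word threshold is an arbitrary illustration:
# Treat an empty or very short extraction as a failure
WORDS=$(wc -w < temp_article.txt)
if [ "$WORDS" -lt 50 ]; then
echo "Extraction produced only $WORDS words; the site may rely on JavaScript."
echo "Try another tool (reader or trafilatura) or the fallback method."
fi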
After Extraction
Display to user:
- "✓ Extracted: [Article Title]"
- "✓ Saved to: [filename]"
- Show preview (first 10-15 lines)
- File size and location
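A small sketch for that summary, reusing the TITLE and FILENAME variables from the workflow above:
# Summary shown to the user after saving
echo "✓ Extracted: $TITLE"
echo "✓ Saved to: $(pwd)/$FILENAME ($(du -h "$FILENAME" | cut -f1))"
echo ""
head -n 15 "$FILENAME"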
Ask if needed:
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"