Article Extractor Skill

This skill extracts clean article content from web URLs, removing ads, navigation, sidebars, and other clutter to save readable text files.

When to Use This Skill

Downloading article text from a URL
Saving blog posts as clean text
Removing distractions from web articles
Archiving content for offline reading
Extracting content for research
Creating a local reading library

How to Use

Basic Extraction

Extract the article from https://example.com/article

Save to Specific Location

Extract this article and save to ~/reading/
https://example.com/interesting-post

Multiple Articles

Extract these articles:
- https://example.com/post-1
- https://example.com/post-2
- https://example.com/post-3

Extraction Methods

The skill uses multiple tools in priority order:

1. Reader (Mozilla Readability)

Uses Firefox Reader View algorithm
Excellent at removing clutter
Preserves article structure

2. Trafilatura (Python)

Very accurate extraction
Works great for blogs and news
Options: --no-comments, --precision

3. Fallback (curl + parsing)

No dependencies required
Basic HTML parsing
Less reliable but always works

What Gets Preserved

Article text and paragraphs
Section headings
Author information
Publication date
Article structure

What Gets Removed

Navigation bars
Advertisements
Newsletter signup forms
Sidebars
Comments sections
Social sharing buttons
Cookie notices
Related article widgets

Filename Generation

Files are named based on:

Article title (cleaned)
Special characters removed (/, :, ?, ", <, >, |)
Length limited to 80-100 characters
Extension: .txt

Example:

"How to Build a Great Product: A Guide"
  → "How to Build a Great Product - A Guide.txt"

Output Format

After extraction:

Title: [Article Title]
Author: [Author Name]
Date: [Publication Date]
Source: [Original URL]

---

[Clean article content...]

Error Handling

The skill handles:

Paywalled content: Extracts available preview
Missing tools: Falls back to alternatives
Invalid URLs: Provides clear error message
Failed extraction: Suggests manual copy
Filename issues: Auto-sanitizes problematic characters

Advanced Options

With Metadata Only

Extract just the title and author from this URL

Specific Format

Extract this article as markdown

Research Mode

Extract and summarize the key points from this article

Best Practices

Check Output: Always verify extraction quality
Save Originals: Keep the source URL for reference
Organize Files: Use meaningful folder structures
Batch Processing: Extract multiple related articles together
Respect Copyright: Use for personal research only

Dependencies

For best results, install:

# Mozilla Readability
npm install -g @nicolo-ribaudo/readability-cli

# Or Trafilatura (Python)
pip install trafilatura

Without dependencies, the skill uses fallback methods.

article-extractor