ai-tech-fulltext-fetch

AI Tech Fulltext Fetch

Core Goal

Reuse the same SQLite database populated by ai-tech-rss-fetch.
Fetch article body text from each RSS entry URL.
Persist extraction status and text in a companion table (entry_content).
Support incremental runs and safe retries without creating duplicate fulltext rows.

Triggering Conditions

Receive a request to fetch article body/full text for entries already in ai_rss.db.
Receive a request to build a second-stage pipeline after RSS metadata sync.
Need a stable, resumable queue over existing entries rows.
Need URL-based fulltext persistence before chunking, indexing, or summarization.

Workflow

Ensure metadata table exists first.

Run ai-tech-rss-fetch and populate entries in SQLite before using this skill.
This skill requires the entries table to exist.
In multi-agent runtimes, pin the database to the same absolute path used by ai-tech-rss-fetch:

export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
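The precondition above can be checked before any fetch work starts. The following is a minimal sketch, not the skill's actual code: the helper name `require_entries_table` and the error wording are assumptions, and only the `entries` table name comes from this document.

```python
import sqlite3


def require_entries_table(db_path: str) -> None:
    """Raise an actionable error if the entries table is missing (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every table in the database file.
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'entries'"
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise RuntimeError(
            f"entries table not found in {db_path}; run ai-tech-rss-fetch first"
        )
```

Calling it with the value of `AI_RSS_DB_PATH` fails fast when the wrong database file is pinned, instead of silently creating an empty one and reporting nothing to fetch.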

Initialize fulltext table.

python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"

Run incremental fulltext sync.

By default, sync fetches entries that have no full text yet or whose last fetch failed.

python3 scripts/fulltext_fetch.py sync \
  --db "$AI_RSS_DB_PATH" \
  --limit 50 \
  --timeout 20 \
  --min-chars 300
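One way to picture the default queue is as a single query over `entries` joined to `entry_content`. This is a sketch under stated assumptions, not the script's real query: it assumes timestamps are stored as ISO-8601 UTC strings (so string comparison orders them), that newest entries come first by `id`, and that the function name `pending_entry_ids` is hypothetical. It ignores `--force`, `--only-failed`, and `--refetch-days`.

```python
import sqlite3

# Entries with no content row yet, plus failed rows that are still retryable.
PENDING_SQL = """
SELECT e.id
FROM entries AS e
LEFT JOIN entry_content AS c ON c.entry_id = e.id
WHERE c.entry_id IS NULL
   OR (c.status = 'failed'
       AND c.retry_count < :max_retries
       AND (c.next_retry_at IS NULL OR c.next_retry_at <= :now))
ORDER BY e.id DESC          -- assumption: newest first unless --oldest-first
LIMIT :limit
"""


def pending_entry_ids(conn, now, limit=50, max_retries=3):
    """Return entry ids that a default sync run would pick up (hypothetical helper)."""
    rows = conn.execute(
        PENDING_SQL, {"now": now, "limit": limit, "max_retries": max_retries}
    )
    return [r[0] for r in rows]
```

Because rows with status ready never match, rerunning the same command resumes where the previous run stopped rather than refetching everything.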

Fetch one entry on demand.

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$AI_RSS_DB_PATH" \
  --entry-id 1234

Inspect extracted content state.

python3 scripts/fulltext_fetch.py list-content \
  --db "$AI_RSS_DB_PATH" \
  --status ready \
  --limit 100

Data Contract

Reads from existing entries table:
- id, canonical_url, url, title.
Writes to entry_content table:
- entry_id (unique, one row per entry)
- source_url, final_url, http_status
- extractor (trafilatura, html-parser, or none)
- content_text, content_hash, content_length
- status (ready or failed)
- retry_count, last_error, timestamps.
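For orientation, the contract above could map onto a table like the one below. This is only an illustrative sketch: the authoritative definition is in references/schema.md, and the column types, constraints, and the timestamp column names `fetched_at` and `updated_at` are assumptions. `next_retry_at` is taken from the update rules in this document.

```python
import sqlite3

# Illustrative DDL; types and constraints are assumptions, not the real schema.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE REFERENCES entries(id),
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT NOT NULL DEFAULT 'none'
                   CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER NOT NULL DEFAULT 0,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    fetched_at     TEXT,  -- hypothetical timestamp column
    updated_at     TEXT   -- hypothetical timestamp column
)
"""


def init_entry_content(conn: sqlite3.Connection) -> None:
    """Create the companion table if it does not exist yet (safe to call repeatedly)."""
    conn.execute(ENTRY_CONTENT_DDL)
    conn.commit()
```

The UNIQUE constraint on `entry_id` is what enforces one row per entry, so a retry can only update the existing row and never add a duplicate.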

Extraction and Update Rules

URL source priority: canonical_url first, falling back to url.
Attempt trafilatura extraction when the dependency is available; otherwise fall back to the built-in HTML parser.
Upsert by entry_id:
- Success: write/update full text and reset retry_count to 0.
- Failure with existing ready content: keep old text, keep status ready, record last_error.
- Failure without ready content: status becomes failed, increment retry_count, set next_retry_at.
Failed retries are capped by --max-retries (default 3) and paced by --retry-backoff-minutes.
--force allows refetching already ready rows.
--refetch-days N allows refreshing rows older than N days.
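The URL priority and the three upsert outcomes above can be sketched as follows. This is a minimal illustration, not the script's implementation: the function names, the SHA-256 content hash, the fixed (non-exponential) backoff, the 30-minute default, the `updated_at` column, and clearing `last_error` on success are all assumptions.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone


def pick_url(entry):
    """canonical_url first, falling back to url."""
    return entry.get("canonical_url") or entry["url"]


def record_result(conn, entry_id, source_url, text=None, error=None,
                  extractor="none", backoff_minutes=30, now=None):
    """Apply the upsert rules for one fetch attempt (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat()
    row = conn.execute(
        "SELECT status, retry_count FROM entry_content WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    if row is None:
        # First attempt: create the single row, then update it below.
        conn.execute(
            "INSERT INTO entry_content (entry_id, source_url, status, retry_count) "
            "VALUES (?, ?, 'failed', 0)",
            (entry_id, source_url),
        )
        row = ("failed", 0)
    status, retry_count = row

    if text is not None:
        # Success: write the text and reset the retry state.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        conn.execute(
            "UPDATE entry_content SET source_url = ?, extractor = ?, content_text = ?, "
            "content_hash = ?, content_length = ?, status = 'ready', retry_count = 0, "
            "last_error = NULL, next_retry_at = NULL, updated_at = ? WHERE entry_id = ?",
            (source_url, extractor, text, digest, len(text), stamp, entry_id),
        )
    elif status == "ready":
        # Failure with existing ready content: keep the text, only note the error.
        conn.execute(
            "UPDATE entry_content SET last_error = ?, updated_at = ? WHERE entry_id = ?",
            (error, stamp, entry_id),
        )
    else:
        # Failure without ready content: count the retry and schedule the next one.
        next_retry = (now + timedelta(minutes=backoff_minutes)).isoformat()
        conn.execute(
            "UPDATE entry_content SET status = 'failed', retry_count = ?, "
            "last_error = ?, next_retry_at = ?, updated_at = ? WHERE entry_id = ?",
            (retry_count + 1, error, next_retry, stamp, entry_id),
        )
    conn.commit()
```

Keeping the old text when a refetch fails means a temporary outage on the source site does not erase content that downstream chunking or summarization may already depend on.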

Configurable Parameters

--db
AI_RSS_DB_PATH (environment variable; an absolute path is recommended in multi-agent runtimes)
--limit
--force
--only-failed
--refetch-days
--oldest-first
--timeout
--max-bytes
--min-chars
--max-retries
--retry-backoff-minutes
--user-agent
--disable-trafilatura
--fail-on-errors

Error Handling

Missing entries table: return actionable error and stop.
Network/HTTP/parse errors: store failure state and continue processing other entries.
Non-text content types (PDF/image/audio/video/zip): mark as failed for that entry.
Extracted text shorter than --min-chars: treat as a failure so that low-quality body text is not stored.
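The last two rules amount to a small per-entry check run after download and extraction. The sketch below is an assumption-laden illustration: the function name `check_response`, the exact MIME list, the error strings, and treating a missing Content-Type as text are not taken from the script. The 300-character value mirrors the `--min-chars 300` example above and is not confirmed as the default.

```python
NON_TEXT_PREFIXES = ("image/", "audio/", "video/")
NON_TEXT_TYPES = {"application/pdf", "application/zip"}


def check_response(content_type, text, min_chars=300):
    """Return None if the extraction is usable, else a short message for last_error."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime.startswith(NON_TEXT_PREFIXES) or mime in NON_TEXT_TYPES:
        return f"non-text content type: {mime}"
    length = len(text.strip())
    if length < min_chars:
        return f"extracted text too short: {length} < {min_chars} chars"
    return None
```

Returning a message instead of raising fits the rule that a failure is recorded for that entry while the run continues with the remaining entries.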

References

references/schema.md
references/fetch-rules.md

Assets

assets/config.example.json

Scripts

scripts/fulltext_fetch.py

Repository

fadeloo/skills

First Seen

Feb 24, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn
