ai-tech-fulltext-fetch
AI Tech Fulltext Fetch
Core Goal
- Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (`entry_content`).
- Support incremental runs and safe retries without creating duplicate fulltext rows.
Triggering Conditions
- Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
- Receive a request to build a second-stage pipeline after RSS metadata sync.
- Need a stable, resumable queue over existing `entries` rows.
- Need URL-based fulltext persistence before chunking, indexing, or summarization.
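A stable, resumable queue over existing `entries` rows amounts to a query like the following. This is a sketch, not the skill's actual implementation: the function name `pending_entries` is hypothetical, and the real selection logic in `scripts/fulltext_fetch.py` also accounts for retry pacing and the flags described below.

```python
import sqlite3

def pending_entries(conn: sqlite3.Connection, limit: int = 50):
    """Entries with no fulltext row yet, or whose last attempt failed.

    Hypothetical helper mirroring the default sync behavior; resumable
    because already-ready rows are skipped on every run.
    """
    return conn.execute(
        """SELECT e.id, COALESCE(e.canonical_url, e.url) AS fetch_url
             FROM entries e
             LEFT JOIN entry_content c ON c.entry_id = e.id
            WHERE c.entry_id IS NULL OR c.status = 'failed'
            ORDER BY e.id
            LIMIT ?""",
        (limit,),
    ).fetchall()
```

Because selection is driven by the absence of a `ready` row rather than by a cursor, interrupted runs can simply be re-executed without duplicating work.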
Workflow
- Ensure the metadata table exists first.
  - Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
  - This skill requires the `entries` table to exist.
  - In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:
export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
- Initialize the fulltext table.
python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
- Run an incremental fulltext sync.
- By default, sync fetches rows that are missing full text or whose last attempt failed.
python3 scripts/fulltext_fetch.py sync \
--db "$AI_RSS_DB_PATH" \
--limit 50 \
--timeout 20 \
--min-chars 300
- Fetch one entry on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
--db "$AI_RSS_DB_PATH" \
--entry-id 1234
- Inspect extracted content state.
python3 scripts/fulltext_fetch.py list-content \
--db "$AI_RSS_DB_PATH" \
--status ready \
--limit 100
Data Contract
- Reads from the existing `entries` table: `id`, `canonical_url`, `url`, `title`.
- Writes to the `entry_content` table:
  - `entry_id` (unique, one row per entry)
  - `source_url`, `final_url`, `http_status`
  - `extractor` (`trafilatura`, `html-parser`, or `none`)
  - `content_text`, `content_hash`, `content_length`
  - `status` (`ready` or `failed`)
  - `retry_count`, `last_error`, timestamps
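The contract above implies a companion table roughly like the sketch below. Column types and defaults here are assumptions; the authoritative schema is what `fulltext_fetch.py init-db` creates and what `references/schema.md` documents.

```python
import sqlite3

# Sketch of the entry_content schema implied by the data contract.
# Types and defaults are illustrative assumptions, not the canonical DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id        INTEGER PRIMARY KEY,  -- unique: one row per entries.id
    source_url      TEXT,
    final_url       TEXT,
    http_status     INTEGER,
    extractor       TEXT,                 -- 'trafilatura', 'html-parser', or 'none'
    content_text    TEXT,
    content_hash    TEXT,
    content_length  INTEGER,
    status          TEXT,                 -- 'ready' or 'failed'
    retry_count     INTEGER DEFAULT 0,
    last_error      TEXT,
    next_retry_at   TEXT,
    created_at      TEXT DEFAULT (datetime('now')),
    updated_at      TEXT DEFAULT (datetime('now'))
);
"""

def init_db(db_path: str) -> None:
    """Create the companion table if it is missing (idempotent)."""
    with sqlite3.connect(db_path) as conn:
        conn.executescript(SCHEMA)
```

Keying the table on `entry_id` is what makes retries safe: re-running a fetch upserts the same row instead of inserting a duplicate.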
Extraction and Update Rules
- URL source priority: `canonical_url` first, fallback to `url`.
- Attempt `trafilatura` extraction when the dependency is available; fall back to the built-in HTML parser.
- Upsert by `entry_id`:
  - Success: write/update full text and reset `retry_count` to `0`.
  - Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
  - Failure without ready content: status becomes `failed`, increment `retry_count`, set `next_retry_at`.
- Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
- `--force` allows refetching already-`ready` rows; `--refetch-days N` allows refreshing rows older than `N` days.
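The three upsert outcomes above can be sketched as a single helper. The function name `record_result` and the exact SQL are illustrative assumptions; the real logic lives in `scripts/fulltext_fetch.py`.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone

def record_result(conn, entry_id, text=None, error=None, backoff_minutes=60):
    """Hypothetical sketch of the per-entry upsert rules."""
    row = conn.execute(
        "SELECT status, retry_count FROM entry_content WHERE entry_id = ?",
        (entry_id,)).fetchone()
    if text is not None:
        # Success: write/update full text, mark ready, reset retry_count to 0.
        conn.execute(
            """INSERT INTO entry_content
                   (entry_id, content_text, content_hash, content_length,
                    status, retry_count)
               VALUES (?, ?, ?, ?, 'ready', 0)
               ON CONFLICT(entry_id) DO UPDATE SET
                   content_text = excluded.content_text,
                   content_hash = excluded.content_hash,
                   content_length = excluded.content_length,
                   status = 'ready', retry_count = 0, last_error = NULL""",
            (entry_id, text,
             hashlib.sha256(text.encode()).hexdigest(), len(text)))
    elif row and row[0] == "ready":
        # Failure, but ready content exists: keep old text and status,
        # only record the error for later inspection.
        conn.execute(
            "UPDATE entry_content SET last_error = ? WHERE entry_id = ?",
            (error, entry_id))
    else:
        # Failure without ready content: failed, bump retry_count,
        # and schedule the next attempt.
        retries = (row[1] if row else 0) + 1
        next_retry = (datetime.now(timezone.utc)
                      + timedelta(minutes=backoff_minutes)).isoformat()
        conn.execute(
            """INSERT INTO entry_content
                   (entry_id, status, retry_count, last_error, next_retry_at)
               VALUES (?, 'failed', ?, ?, ?)
               ON CONFLICT(entry_id) DO UPDATE SET
                   status = 'failed', retry_count = ?,
                   last_error = ?, next_retry_at = ?""",
            (entry_id, retries, error, next_retry, retries, error, next_retry))
```

Note the asymmetry: a failure never downgrades a `ready` row, so transient network errors cannot destroy previously extracted text.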
Configurable Parameters
- `--db` / `AI_RSS_DB_PATH` (absolute path recommended in multi-agent runtimes)
- `--limit`, `--force`, `--only-failed`, `--refetch-days`, `--oldest-first`
- `--timeout`, `--max-bytes`, `--min-chars`
- `--max-retries`, `--retry-backoff-minutes`
- `--user-agent`, `--disable-trafilatura`, `--fail-on-errors`
Error Handling
- Missing `entries` table: return an actionable error and stop.
- Network/HTTP/parse errors: store the failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark that entry as failed.
- Extraction shorter than `--min-chars`: treat as failure to avoid low-quality body text.
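The last two rules, the content-type gate and the minimum-length gate, can be sketched as one pure function. `classify_response` and `TEXTUAL_TYPES` are hypothetical names, and the accepted-type list is an assumption; only the `--min-chars` default of 300 comes from the skill itself.

```python
# Assumed allow-list of content types worth parsing for body text.
TEXTUAL_TYPES = ("text/html", "application/xhtml+xml", "text/plain")

def classify_response(content_type: str, extracted_text: str,
                      min_chars: int = 300):
    """Return (ok, reason) per the error-handling rules above.

    Hypothetical helper; min_chars mirrors the --min-chars default.
    """
    base = (content_type or "").split(";")[0].strip().lower()
    if base and base not in TEXTUAL_TYPES:
        # PDF/image/audio/video/zip and similar: fail this entry.
        return False, f"non-text content type: {base}"
    if len(extracted_text or "") < min_chars:
        # Too little text usually means boilerplate, a paywall, or a stub.
        return False, f"extraction too short (< {min_chars} chars)"
    return True, "ready"
```

Both failure modes feed the same `failed` status path, so they are retried and capped exactly like network errors.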
References
- references/schema.md
- references/fetch-rules.md
Assets
- assets/config.example.json
Scripts
- scripts/fulltext_fetch.py