# AI Tech Fulltext Fetch

## Core Goal
- Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (`entry_content`).
- Support incremental runs and safe retries without creating duplicate fulltext rows.
## Triggering Conditions

- Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
- Receive a request to build a second-stage pipeline after RSS metadata sync.
- Need a stable, resumable queue over existing `entries` rows.
- Need URL-based fulltext persistence before chunking, indexing, or summarization.
## Workflow

- Ensure the metadata table exists first.
  - Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
  - This skill requires the `entries` table to exist.
  - In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

    ```bash
    export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
    ```
- Initialize the fulltext table.

  ```bash
  python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
  ```
- Run an incremental fulltext sync.
  - Default behavior fetches rows that are missing full text or currently failed.

  ```bash
  python3 scripts/fulltext_fetch.py sync \
    --db "$AI_RSS_DB_PATH" \
    --limit 50 \
    --timeout 20 \
    --min-chars 300
  ```
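The selection implied by the sync step ("missing full text or currently failed") can be sketched as a SQL query. This is a minimal illustration using the table and column names from the Data Contract section; the actual query inside `scripts/fulltext_fetch.py` may differ:

```python
# Hypothetical sketch of the incremental work queue: entries with no
# entry_content row at all, or whose row is 'failed' and due for retry.
import sqlite3

def pending_entries(conn: sqlite3.Connection, limit: int = 50) -> list[tuple]:
    return conn.execute(
        """
        SELECT e.id, COALESCE(e.canonical_url, e.url) AS fetch_url
        FROM entries e
        LEFT JOIN entry_content c ON c.entry_id = e.id
        WHERE c.entry_id IS NULL
           OR (c.status = 'failed'
               AND (c.next_retry_at IS NULL
                    OR c.next_retry_at <= datetime('now')))
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

Because the queue is derived from table state rather than an in-memory cursor, reruns are naturally resumable.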
- Fetch one entry on demand.

  ```bash
  python3 scripts/fulltext_fetch.py fetch-entry \
    --db "$AI_RSS_DB_PATH" \
    --entry-id 1234
  ```
- Inspect extracted content state.

  ```bash
  python3 scripts/fulltext_fetch.py list-content \
    --db "$AI_RSS_DB_PATH" \
    --status ready \
    --limit 100
  ```
## Data Contract

- Reads from the existing `entries` table: `id`, `canonical_url`, `url`, `title`.
- Writes to the `entry_content` table:
  - `entry_id` (unique, one row per entry)
  - `source_url`, `final_url`, `http_status`
  - `extractor` (`trafilatura`, `html-parser`, or `none`)
  - `content_text`, `content_hash`, `content_length`
  - `status` (`ready` or `failed`)
  - `retry_count`, `last_error`, timestamps
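The `entry_content` fields listed above can be mirrored as DDL. The authoritative schema lives in `references/schema.md`; this sketch only assembles the documented columns, and the column types, defaults, and timestamp names are assumptions:

```python
# Hypothetical DDL mirroring the Data Contract fields for entry_content.
import sqlite3

ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE,  -- one row per entries.id
    source_url     TEXT,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT,                     -- 'trafilatura', 'html-parser', or 'none'
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL,            -- 'ready' or 'failed'
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    created_at     TEXT DEFAULT (datetime('now')),
    updated_at     TEXT DEFAULT (datetime('now'))
);
"""

def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(ENTRY_CONTENT_DDL)
```

The `UNIQUE` constraint on `entry_id` is what makes retries safe: a second fetch for the same entry cannot create a duplicate fulltext row.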
## Extraction and Update Rules

- URL source priority: `canonical_url` first, fallback to `url`.
- Attempt `trafilatura` extraction when the dependency is available; fall back to the built-in HTML parser otherwise.
- Upsert by `entry_id`:
  - Success: write/update full text and reset `retry_count` to `0`.
  - Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
  - Failure without ready content: status becomes `failed`, increment `retry_count`, set `next_retry_at`.
- Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
- `--force` allows refetching already-`ready` rows; `--refetch-days N` allows refreshing rows older than `N` days.
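The upsert rules above can be sketched as two write paths: success resets the retry counter, while a failure never clobbers existing `ready` text. Function names and the exact SQL are illustrative, not the script's real API:

```python
# Hypothetical sketch of the upsert-by-entry_id rules.
import sqlite3
from datetime import datetime, timedelta, timezone

def record_success(conn: sqlite3.Connection, entry_id: int, text: str) -> None:
    # Success: write/update full text, status 'ready', retry_count back to 0.
    conn.execute(
        """
        INSERT INTO entry_content (entry_id, content_text, status, retry_count)
        VALUES (?, ?, 'ready', 0)
        ON CONFLICT(entry_id) DO UPDATE SET
            content_text = excluded.content_text,
            status = 'ready',
            retry_count = 0,
            last_error = NULL
        """,
        (entry_id, text),
    )

def record_failure(conn: sqlite3.Connection, entry_id: int,
                   error: str, backoff_minutes: int = 30) -> None:
    row = conn.execute(
        "SELECT status FROM entry_content WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    if row and row[0] == "ready":
        # Failure with existing ready content: keep old text and status,
        # only record the error.
        conn.execute(
            "UPDATE entry_content SET last_error = ? WHERE entry_id = ?",
            (error, entry_id),
        )
        return
    # Failure without ready content: mark failed, bump retry_count,
    # schedule the next attempt (backoff_minutes models --retry-backoff-minutes).
    next_retry = (datetime.now(timezone.utc)
                  + timedelta(minutes=backoff_minutes)).isoformat()
    conn.execute(
        """
        INSERT INTO entry_content (entry_id, status, retry_count, last_error, next_retry_at)
        VALUES (?, 'failed', 1, ?, ?)
        ON CONFLICT(entry_id) DO UPDATE SET
            status = 'failed',
            retry_count = retry_count + 1,
            last_error = excluded.last_error,
            next_retry_at = excluded.next_retry_at
        """,
        (entry_id, error, next_retry),
    )
```

Keeping stale-but-`ready` text on a failed refetch is a deliberate trade-off: a transient network error should not destroy a previously good extraction.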
## Configurable Parameters

- `--db` / `AI_RSS_DB_PATH` (recommended absolute path in multi-agent runtime)
- `--limit`
- `--force`
- `--only-failed`
- `--refetch-days`
- `--oldest-first`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`
## Error Handling

- Missing `entries` table: return an actionable error and stop.
- Network/HTTP/parse errors: store the failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark that entry as failed.
- Extraction shorter than `--min-chars`: treat as a failure to avoid low-quality body text.
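The two per-entry rejection rules above (non-text content types and too-short extractions) can be sketched as a small classifier. The function and constant names are illustrative, not part of the script's real interface; the default threshold mirrors the documented `--min-chars` value:

```python
# Hypothetical sketch of the per-entry failure checks: reject non-text
# Content-Type values, then reject extractions below the minimum length.
NON_TEXT_PREFIXES = ("application/pdf", "image/", "audio/", "video/", "application/zip")

def classify(content_type: str, extracted_text: str, min_chars: int = 300) -> str:
    # Strip any "; charset=..." parameter before matching the media type.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type.startswith(NON_TEXT_PREFIXES):
        return "failed:non-text-content-type"
    if len(extracted_text.strip()) < min_chars:
        return "failed:too-short"
    return "ready"
```

Classifying before persisting lets one bad entry record its failure state while the sync loop continues with the rest of the queue.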
## References

- `references/schema.md`
- `references/fetch-rules.md`
## Assets

- `assets/config.example.json`
## Scripts

- `scripts/fulltext_fetch.py`
Repository: `tiangong-ai/skills`
First seen: Feb 10, 2026