Originally from tiangong-ai/skills
AI Tech Fulltext Fetch

Core Goal

  • Reuse the same SQLite database populated by ai-tech-rss-fetch.
  • Fetch article body text from each RSS entry URL.
  • Persist extraction status and text in a companion table (entry_content).
  • Support incremental runs and safe retries without creating duplicate fulltext rows.

Triggering Conditions

  • Receive a request to fetch article body/full text for entries already in ai_rss.db.
  • Receive a request to build a second-stage pipeline after RSS metadata sync.
  • Need a stable, resumable queue over existing entries rows.
  • Need URL-based fulltext persistence before chunking, indexing, or summarization.

Workflow

  1. Ensure the metadata table exists first.
  • Run ai-tech-rss-fetch and populate entries in SQLite before using this skill.
  • This skill requires the entries table to exist.
  • In multi-agent runtimes, pin the DB to the same absolute path used by ai-tech-rss-fetch:
export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
  2. Initialize the fulltext table.
python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
  3. Run an incremental fulltext sync.
  • By default, sync fetches rows that are missing full text or are currently failed.
python3 scripts/fulltext_fetch.py sync \
  --db "$AI_RSS_DB_PATH" \
  --limit 50 \
  --timeout 20 \
  --min-chars 300
  4. Fetch one entry on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$AI_RSS_DB_PATH" \
  --entry-id 1234
  5. Inspect the extracted content state.
python3 scripts/fulltext_fetch.py list-content \
  --db "$AI_RSS_DB_PATH" \
  --status ready \
  --limit 100
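
Before running the workflow, it can help to verify the two preconditions from the first step: an absolute database path and an existing entries table. The following is a minimal sketch, not part of scripts/fulltext_fetch.py; the helper names resolve_db_path and has_table are hypothetical.

```python
import os
import sqlite3


def resolve_db_path(env=os.environ):
    """Return the DB path from AI_RSS_DB_PATH, or raise if it is unusable."""
    raw = env.get("AI_RSS_DB_PATH", "")
    if not raw:
        raise ValueError("AI_RSS_DB_PATH is not set")
    if not os.path.isabs(raw):
        raise ValueError(f"AI_RSS_DB_PATH must be an absolute path, got: {raw}")
    return raw


def has_table(conn, name):
    """Return True if a table with this name exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = sqlite3.connect(resolve_db_path())
    if not has_table(conn, "entries"):
        raise SystemExit("entries table is missing: run ai-tech-rss-fetch first")
    print("preconditions satisfied")
```

Running this check first surfaces a wrong path or an unpopulated database before any network fetches start.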

Data Contract

  • Reads from existing entries table:
    • id, canonical_url, url, title.
  • Writes to entry_content table:
    • entry_id (unique, one row per entry)
    • source_url, final_url, http_status
    • extractor (trafilatura, html-parser, or none)
    • content_text, content_hash, content_length
    • status (ready or failed)
    • retry_count, last_error, timestamps.
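
To make the contract concrete, here is one way the entry_content table could be declared. This is a sketch under assumptions: the column names come from the list above, but the column types, the CHECK constraints, and the timestamp column names (created_at, updated_at) are guesses. The authoritative definition is in references/schema.md.

```python
import sqlite3

# Hypothetical DDL. Column names follow the data contract; types, constraints,
# and the created_at/updated_at names are assumptions for illustration.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE,
    source_url     TEXT,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    created_at     TEXT,
    updated_at     TEXT
)
"""


def init_entry_content(conn):
    """Create the companion table if it does not exist yet (safe to repeat)."""
    conn.execute(ENTRY_CONTENT_DDL)
    conn.commit()
```

The UNIQUE constraint on entry_id is what lets repeated runs update a row in place instead of creating duplicates.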

Extraction and Update Rules

  • URL source priority: canonical_url first, falling back to url.
  • Attempt trafilatura extraction when the dependency is available; fall back to the built-in HTML parser.
  • Upsert by entry_id:
    • Success: write/update full text and reset retry_count to 0.
    • Failure with existing ready content: keep old text, keep status ready, record last_error.
    • Failure without ready content: status becomes failed, increment retry_count, set next_retry_at.
  • Failed retries are capped by --max-retries (default 3) and paced by --retry-backoff-minutes.
  • --force allows refetching rows that are already ready.
  • --refetch-days N allows refreshing rows older than N days.
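
The URL priority and the three upsert outcomes above can be sketched as plain functions over a row dictionary. This is an illustration, not the script's actual code: the function names, the fixed-interval backoff, the 30-minute default, clearing last_error on success, and the exact retry-cap comparison are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def pick_url(entry):
    """Prefer canonical_url; fall back to url when it is empty or missing."""
    return entry.get("canonical_url") or entry.get("url")


def next_state(existing, ok, text=None, error=None, now=None, backoff_minutes=30):
    """Compute entry_content fields after one fetch attempt.

    existing is the current row as a dict, or None if no row exists yet.
    """
    now = now or datetime.now(timezone.utc)
    row = dict(existing or {"status": None, "content_text": None, "retry_count": 0})
    if ok:
        # Success: store the text and reset retry_count to 0.
        # Clearing last_error and next_retry_at here is an assumption.
        row.update(status="ready", content_text=text, retry_count=0,
                   last_error=None, next_retry_at=None)
    elif row.get("status") == "ready":
        # Failure with existing ready content: keep text and status.
        row["last_error"] = error
    else:
        # Failure without ready content: mark failed and schedule a retry.
        row["status"] = "failed"
        row["retry_count"] = row.get("retry_count", 0) + 1
        row["last_error"] = error
        # Fixed-interval backoff is an assumption about the pacing.
        row["next_retry_at"] = (now + timedelta(minutes=backoff_minutes)).isoformat()
    return row


def retry_allowed(row, now, max_retries=3):
    """True if a failed row is under the retry cap and its backoff has elapsed."""
    if row.get("status") != "failed":
        return False
    if row.get("retry_count", 0) >= max_retries:
        return False
    due = row.get("next_retry_at")
    return due is None or datetime.fromisoformat(due) <= now
```

Keeping the transition logic separate from the SQL makes the rules easy to test without a database or network access.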

Configurable Parameters

  • --db
  • AI_RSS_DB_PATH (environment variable; an absolute path is recommended in multi-agent runtimes)
  • --limit
  • --force
  • --only-failed
  • --refetch-days
  • --oldest-first
  • --timeout
  • --max-bytes
  • --min-chars
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • Missing entries table: return an actionable error and stop.
  • Network/HTTP/parse errors: store failure state and continue processing other entries.
  • Non-text content types (PDF/image/audio/video/zip): mark as failed for that entry.
  • Extracted text shorter than --min-chars: treat as a failure to avoid storing low-quality body text.
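
The last two rules amount to a per-entry validation step. A minimal sketch follows, assuming the check is driven by the response's Content-Type header and the stripped text length; the function name failure_reason, the list of media types, and the default of 300 (taken from the sync example, not a documented default) are assumptions.

```python
# Media types treated as non-text. This list is an assumption that mirrors the
# categories named in the error-handling rules (PDF/image/audio/video/zip).
NON_TEXT_PREFIXES = ("image/", "audio/", "video/")
NON_TEXT_TYPES = {"application/pdf", "application/zip"}


def failure_reason(content_type, text, min_chars=300):
    """Return a short failure reason, or None if the extraction is acceptable."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type.startswith(NON_TEXT_PREFIXES) or media_type in NON_TEXT_TYPES:
        return f"non-text content type: {media_type}"
    if len((text or "").strip()) < min_chars:
        return f"extracted text shorter than {min_chars} chars"
    return None
```

A non-None result would be stored as last_error for that entry while processing continues with the next one.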

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py

Repository: fadeloo/skills
First Seen: Feb 24, 2026