eceee News Fulltext Fetch

Core Goal

  • Discover news article URLs from https://www.eceee.org/all-news/.
  • Persist discovered entry metadata into SQLite.
  • Fetch and extract article body text from each entry page.
  • Persist status and text in a companion table (entry_content) with retry-safe updates.

Triggering Conditions

  • Receive a request to extract full text from eceee news archive pages.
  • Receive a request to run incremental fulltext sync for eceee news links.
  • Need a resilient local SQLite queue for discovery + extraction + retries.

Workflow

  1. Initialize the database.
export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"
  2. Discover links and fetch fulltext incrementally.
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180
  3. Discover only (refresh the URL catalog without fetching bodies).
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only
  4. Fetch one entry on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123

Or by URL:

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"
  5. Inspect stored state.
python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
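For unattended runs, the first two steps above can be driven from one small Python wrapper. A minimal sketch: subcommand and flag names are taken from this SKILL, the script path and fallback DB location are assumptions, and commands are only executed when the script is actually present in the working tree.

```python
import os
import subprocess
from pathlib import Path

# Assumed layout: run from the skill root, next to scripts/.
DB = os.environ.get("ECEEE_NEWS_DB_PATH", "/tmp/eceee_news.db")
SCRIPT = Path("scripts/fulltext_fetch.py")

def cli(*args):
    # Every subcommand shares the --db flag.
    return ["python3", str(SCRIPT), *args, "--db", DB]

steps = [
    cli("init-db"),
    cli("sync", "--index-url", "https://www.eceee.org/all-news/",
        "--limit", "50", "--min-chars", "180"),
]

for cmd in steps:
    if SCRIPT.exists():            # only run inside a real checkout
        subprocess.run(cmd, check=True)
    else:
        print("would run:", " ".join(cmd))
```

Scheduling this wrapper (cron, systemd timer) gives the incremental sync loop the retry-safe SQLite state is designed for.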

Data Contract

  • entries table stores discovery metadata:
    • url, title, published_at
    • discovered_at, last_seen_at
  • entry_content table stores extraction result (one row per entry_id):
    • source_url, final_url, http_status
    • extractor (trafilatura, html-parser, or none)
    • content_text, content_hash, content_length
    • status (ready or failed)
    • retry fields + timestamps
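The contract above can be pictured as concrete DDL. An illustrative sketch only: column names follow the bullets, but the exact types, constraints, and the names of the retry/timestamp fields are assumptions, not the shipped schema (see references/schema.md for the authoritative version).

```python
import sqlite3

# Hypothetical DDL mirroring the documented data contract.
DDL = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT CHECK (status IN ('ready', 'failed')),
    retries        INTEGER NOT NULL DEFAULT 0,   -- assumed retry-field names
    last_error     TEXT,
    next_retry_at  TEXT,
    updated_at     TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(tables)
```

The one-row-per-entry_id rule falls out of making entry_id the primary key of entry_content.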

Extraction and Update Rules

  • Discovery source is https://www.eceee.org/all-news/; entry links are anchors with class newslink whose paths fall under /all-news/news/.
  • Fulltext extraction targets the article's main content region (mainContentColumn) and strips related-news and share blocks.
  • Extraction path:
    1. trafilatura (if installed and not disabled)
    2. built-in HTML parser fallback
  • Upsert by entry_id:
    • Success: set ready, write text/hash/length, reset retry counters.
    • Failure with existing ready content: keep old content, update error/retry metadata.
    • Failure without ready content: set failed, increment retries, set next_retry_at.
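The three upsert rules can be expressed as a single pure decision function. A sketch only: field names mirror the data contract bullets, and the doubling backoff and default window are assumptions about how next_retry_at is computed, not the script's internals.

```python
from datetime import datetime, timedelta, timezone

def apply_result(row, ok, text="", error="", backoff_minutes=30):
    """Apply one fetch outcome to an entry_content row, per the rules above."""
    now = datetime.now(timezone.utc)
    if ok:
        # Success: mark ready, store the body, reset retry state.
        row.update(status="ready", content_text=text,
                   content_length=len(text), retries=0,
                   last_error=None, next_retry_at=None)
    elif row.get("status") == "ready":
        # Failure, but good content already exists: keep the old body,
        # only record the error/retry metadata.
        row.update(last_error=error, retries=row.get("retries", 0) + 1)
    else:
        # Failure with no usable content: mark failed and schedule a retry.
        retries = row.get("retries", 0) + 1
        row.update(status="failed", retries=retries, last_error=error,
                   next_retry_at=now + timedelta(
                       minutes=backoff_minutes * 2 ** (retries - 1)))
    return row
```

Keeping the decision pure (row in, row out) is what makes the update retry-safe: re-running it on the same row is idempotent for successes and monotone for failures.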

Configurable Parameters

  • --db
  • ECEEE_NEWS_DB_PATH
  • --index-url
  • --discover-only
  • --limit
  • --force
  • --only-failed
  • --since-date
  • --refetch-days
  • --oldest-first
  • --timeout
  • --max-bytes
  • --min-chars
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • An index fetch or parse failure returns an actionable error.
  • HTTP, network, and content-type failures are recorded per entry and do not stop the rest of the sync batch.
  • Extracted text shorter than --min-chars is treated as failed, to avoid storing low-quality bodies.
  • The retry queue is governed by --max-retries plus exponential backoff (--retry-backoff-minutes).
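The two quality/retry gates above reduce to small predicates. A sketch under stated assumptions: the default thresholds mirror the documented flags (--min-chars, --max-retries), while the field names and the "due when next_retry_at has elapsed" rule are assumed behavior.

```python
from datetime import datetime, timedelta, timezone

def long_enough(text, min_chars=180):
    # Bodies shorter than --min-chars are rejected as low quality.
    return len(text.strip()) >= min_chars

def retry_due(row, max_retries=5, now=None):
    # A failed row re-enters the queue once it has retries left
    # and its backoff window (next_retry_at) has elapsed.
    now = now or datetime.now(timezone.utc)
    return (row.get("status") == "failed"
            and row.get("retries", 0) < max_retries
            and row.get("next_retry_at") is not None
            and now >= row["next_retry_at"])

# A failed row whose backoff already elapsed is due again:
due = retry_due({"status": "failed", "retries": 1,
                 "next_retry_at": datetime.now(timezone.utc) - timedelta(minutes=1)})
```

Because both checks read only stored fields, a crashed sync can be restarted and will pick up exactly the rows that are still eligible.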

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py