Sustainability Fulltext Fetch

Core Goal

  • Read relevant DOI entries from RSS metadata DB.
  • Write fetched content into a separate fulltext DB.
  • Process only relevant entries (is_relevant=1).
  • Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
  • Fall back to webpage fulltext extraction when API metadata is unavailable.
  • Persist one content row per DOI in entry_content.

Triggering Conditions

  • Receive a request to enrich relevant DOI records with abstract/fulltext content.
  • Receive a request to replace webpage-first crawling with API-first enrichment.
  • Need retry-safe incremental updates without duplicate rows.

Workflow

  1. Ensure upstream DOI/relevance data exists, then initialize the fulltext DB.
     export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
     export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
     python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"
  2. Run incremental sync (API first, webpage fallback).
     python3 scripts/fulltext_fetch.py sync \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --limit 50 \
       --openalex-email "you@example.com" \
       --api-min-chars 80 \
       --min-chars 300
  3. Fetch one DOI on demand.
     python3 scripts/fulltext_fetch.py fetch-entry \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --doi "10.1038/nature12373"
  4. Inspect stored content state.
     python3 scripts/fulltext_fetch.py list-content \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --status ready \
       --limit 100

Data Contract

  • Reads from RSS DB entries:
    • doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
  • Writes to fulltext DB entry_content (primary key doi):
    • source URL/status/extractor
    • content_kind (abstract or fulltext)
    • content_text, content_hash, content_length
    • retry fields and timestamps.
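
As a rough picture of this contract, the sketch below models both sides with Python's built-in sqlite3 module. It is illustrative only: the authoritative schema is in references/schema.md, the use of SQLite is inferred from the .db file names, and column names such as source_url, extractor, retry_count, last_error, and updated_at are assumptions. The helper select_relevant is hypothetical and not part of the script.

```python
import sqlite3

# Hypothetical DDL. Only doi, status, content_kind, content_text,
# content_hash, and content_length are named in the contract;
# every other column name here is an assumption.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    source_url     TEXT,
    status         TEXT NOT NULL,
    extractor      TEXT,
    content_kind   TEXT CHECK (content_kind IN ('abstract', 'fulltext')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    updated_at     TEXT
)
"""


def select_relevant(rss_conn, limit=50):
    """Return relevant rows from the RSS DB `entries` table, ordered by DOI."""
    sql = (
        "SELECT doi, doi_is_surrogate, canonical_url, url, title "
        "FROM entries WHERE is_relevant = 1 AND doi IS NOT NULL "
        "ORDER BY doi LIMIT ?"
    )
    return rss_conn.execute(sql, (limit,)).fetchall()
```

Keeping the two connections separate mirrors the requirement that the RSS DB and the fulltext DB are distinct files.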

Extraction Priority

  1. API metadata path:
    • OpenAlex by DOI.
    • Semantic Scholar fallback by DOI.
    • If the returned text meets --api-min-chars, persist as content_kind=abstract.
  2. Webpage fallback path:
    • Try canonical_url first, then url.
    • Extract with trafilatura when available, else the built-in HTML parser.
    • Persist as content_kind=fulltext.
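
The priority order can be sketched as a small selection function. This is a minimal illustration, not the script's actual code: pick_content and its fetcher callables are hypothetical, the network calls are injected so the logic can be read and tested without HTTP, and the thresholds 80 and 300 are taken from the example sync command rather than from known defaults.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a DOI or URL and returns extracted text, or None on a miss.
Fetcher = Callable[[str], Optional[str]]


def pick_content(
    doi: str,
    urls: Sequence[Optional[str]],
    api_fetchers: Sequence[Tuple[str, Fetcher]],
    page_fetcher: Fetcher,
    api_min_chars: int = 80,
    min_chars: int = 300,
) -> Optional[Tuple[str, str, str]]:
    """Return (content_kind, source, text) for the first acceptable result."""
    # 1. API metadata path: try each API in order (e.g. OpenAlex, then
    #    Semantic Scholar) and accept the first sufficiently long abstract.
    for name, fetch in api_fetchers:
        text = (fetch(doi) or "").strip()
        if len(text) >= api_min_chars:
            return ("abstract", name, text)

    # 2. Webpage fallback path: canonical_url first, then url.
    for url in urls:
        if not url:
            continue
        text = (page_fetcher(url) or "").strip()
        if len(text) >= min_chars:
            return ("fulltext", url, text)

    # Nothing acceptable: the caller would record a failure for this DOI.
    return None
```

Passing the fetchers as arguments keeps the ordering rule separate from the HTTP and parsing details, so either API or the page extractor can be swapped without touching the priority logic.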

Update Semantics

  • Upsert key: doi.
  • Success: set status=ready and reset retry counters.
  • Failure with existing ready row: keep old content, record latest error.
  • Failure without ready row: set status=failed, increment retry state.
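
One way to get these semantics from a single statement per outcome is SQLite's upsert clause (INSERT ... ON CONFLICT, available since SQLite 3.24). The sketch below is an assumption about how such logic could look, not the script's implementation: the table is a reduced stand-in, retry_count and last_error are assumed column names, SHA-256 is an assumed choice for content_hash, and record_success and record_failure are hypothetical helpers.

```python
import hashlib
import sqlite3

# Reduced stand-in table; retry_count and last_error are assumed names.
DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    content_kind   TEXT,
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT
)
"""


def record_success(conn, doi, kind, text):
    """Insert or overwrite the row for `doi`, mark it ready, reset retries."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    conn.execute(
        """
        INSERT INTO entry_content
            (doi, status, content_kind, content_text, content_hash,
             content_length, retry_count, last_error)
        VALUES (?, 'ready', ?, ?, ?, ?, 0, NULL)
        ON CONFLICT(doi) DO UPDATE SET
            status = 'ready',
            content_kind = excluded.content_kind,
            content_text = excluded.content_text,
            content_hash = excluded.content_hash,
            content_length = excluded.content_length,
            retry_count = 0,
            last_error = NULL
        """,
        (doi, kind, text, digest, len(text)),
    )


def record_failure(conn, doi, error):
    """Record an error without discarding content from an existing ready row."""
    conn.execute(
        """
        INSERT INTO entry_content (doi, status, retry_count, last_error)
        VALUES (?, 'failed', 1, ?)
        ON CONFLICT(doi) DO UPDATE SET
            last_error = excluded.last_error,
            retry_count = CASE
                WHEN entry_content.status = 'ready' THEN entry_content.retry_count
                ELSE entry_content.retry_count + 1
            END
        """,
        (doi, error),
    )
```

Because doi is the primary key, repeated runs update the same row instead of adding duplicates, which is what makes the sync retry-safe.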

Configurable Parameters

  • --rss-db
  • --content-db
  • SUSTAIN_RSS_DB_PATH
  • SUSTAIN_FULLTEXT_DB_PATH
  • --limit
  • --force
  • --only-failed
  • --refetch-days
  • --timeout
  • --max-bytes
  • --min-chars
  • --openalex-email / OPENALEX_EMAIL
  • --s2-api-key / S2_API_KEY
  • --api-timeout
  • --api-min-chars
  • --disable-api-metadata
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors
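
For orientation, a subset of these parameters could be wired up with argparse as follows. This is a hedged sketch rather than the script's real parser: build_parser is hypothetical, the rule that a flag overrides its environment variable is an assumption, and the numeric defaults are copied from the example sync command, not from documented defaults.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose path and credential flags default to env vars."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="fulltext_fetch.py sync")
    # Assumed precedence: an explicit flag wins over the environment variable.
    parser.add_argument("--rss-db", default=env.get("SUSTAIN_RSS_DB_PATH"))
    parser.add_argument("--content-db", default=env.get("SUSTAIN_FULLTEXT_DB_PATH"))
    parser.add_argument("--openalex-email", default=env.get("OPENALEX_EMAIL"))
    parser.add_argument("--s2-api-key", default=env.get("S2_API_KEY"))
    # Values below mirror the example command; real defaults may differ.
    parser.add_argument("--limit", type=int, default=50)
    parser.add_argument("--api-min-chars", type=int, default=80)
    parser.add_argument("--min-chars", type=int, default=300)
    # Boolean switches from the parameter list.
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--only-failed", action="store_true")
    parser.add_argument("--disable-api-metadata", action="store_true")
    return parser
```

Accepting env as an argument makes the fallback behavior easy to exercise without changing the process environment.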

Error Handling

  • Missing DOI-keyed entries table: stop with actionable message.
  • RSS DB and fulltext DB path collision: fail fast and require separate files.
  • API/network/HTTP failures: record failures and continue queue.
  • Webpage non-text content: mark failed for that DOI.
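  • Short extraction: fail when the text falls below the configured minimum length (--min-chars, or --api-min-chars on the API path) to avoid storing low-quality content.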
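
Three of these rules are simple enough to sketch directly. The helpers below are hypothetical and only illustrate the checks described above: ensure_distinct_paths covers the path collision, validate_page covers non-text and too-short pages, and run_queue shows per-DOI failures being recorded while the queue continues. The list of accepted media types is an assumption.

```python
import os

# Assumed allow-list of media types treated as extractable text.
TEXT_TYPES = ("text/html", "application/xhtml+xml", "text/plain")


def ensure_distinct_paths(rss_db, content_db):
    """Fail fast when both DB arguments resolve to the same file."""
    if os.path.realpath(rss_db) == os.path.realpath(content_db):
        raise ValueError("RSS DB and fulltext DB must be separate files")


def validate_page(content_type, text, min_chars=300):
    """Reject non-text responses and extractions shorter than min_chars."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type not in TEXT_TYPES:
        raise ValueError(f"non-text content: {media_type or 'unknown'}")
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        raise ValueError(f"extraction too short: {len(cleaned)} < {min_chars}")
    return cleaned


def run_queue(dois, fetch_one):
    """Process every DOI; collect errors instead of aborting the run."""
    failures = {}
    for doi in dois:
        try:
            fetch_one(doi)
        except Exception as exc:  # one bad DOI must not stop the queue
            failures[doi] = str(exc)
    return failures
```

A caller could exit non-zero when the returned failures mapping is non-empty, which is one plausible reading of --fail-on-errors.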

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py