Sustainability Fulltext Fetch

Core Goal

Read relevant DOI entries from RSS metadata DB.
Write fetched content into a separate fulltext DB.
Process only relevant entries (is_relevant=1).
Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
Fall back to webpage fulltext extraction when API metadata is unavailable.
Persist one content row per DOI in entry_content.

Triggering Conditions

Receive a request to enrich relevant DOI records with abstract/fulltext content.
Receive a request to replace webpage-first crawling with API-first enrichment.
Need retry-safe incremental updates without duplicate rows.

Workflow

Ensure upstream DOI/relevance data exists, then set the DB paths and initialize the fulltext DB.

export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"

Run incremental sync (API first, webpage fallback).

python3 scripts/fulltext_fetch.py sync \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --limit 50 \
  --openalex-email "you@example.com" \
  --api-min-chars 80 \
  --min-chars 300

Fetch one DOI on demand.

python3 scripts/fulltext_fetch.py fetch-entry \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --doi "10.1038/nature12373"

Inspect stored content state.

python3 scripts/fulltext_fetch.py list-content \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --status ready \
  --limit 100

Data Contract

Reads from RSS DB entries:
- doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
Writes to fulltext DB entry_content (primary key doi):
- source URL/status/extractor
- content_kind (abstract or fulltext)
- content_text, content_hash, content_length
- retry fields and timestamps.
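A minimal sketch of how the entry_content table could be declared in SQLite. Only doi, content_kind, content_text, content_hash, and content_length are named in the contract above; every other column name (source_url, retry_count, next_retry_at, last_error, fetched_at, updated_at) and the helper init_content_db are assumptions made for illustration. references/schema.md remains the authoritative definition.

```python
import sqlite3

# Hypothetical schema: column names not listed in the data contract are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    source_url     TEXT,
    status         TEXT NOT NULL,
    extractor      TEXT,
    content_kind   TEXT CHECK (content_kind IN ('abstract', 'fulltext')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    next_retry_at  TEXT,
    last_error     TEXT,
    fetched_at     TEXT,
    updated_at     TEXT
);
"""


def init_content_db(path: str) -> sqlite3.Connection:
    """Open (or create) the fulltext DB and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Because doi is the primary key, a second insert for the same DOI is rejected, which is what keeps the table at one content row per DOI.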

Extraction Priority

API metadata path:

OpenAlex by DOI.
Semantic Scholar fallback by DOI.
If the API text meets the --api-min-chars threshold, persist it as content_kind=abstract.
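OpenAlex returns abstracts as an inverted index (the abstract_inverted_index field, mapping each word to its positions) rather than as plain text, so an API-first path has to rebuild the text before applying the length check. The sketch below shows one way to do that; the function names are hypothetical, and whether scripts/fulltext_fetch.py works exactly this way is an assumption.

```python
from typing import Dict, List, Optional


def abstract_from_inverted_index(index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from an OpenAlex-style inverted index."""
    if not index:
        return ""
    positions: Dict[int, str] = {}
    for word, places in index.items():
        for pos in places:
            positions[pos] = word
    # Emit words in position order to recover the original sentence.
    return " ".join(positions[i] for i in sorted(positions))


def accept_api_text(text: str, api_min_chars: int = 80) -> bool:
    """Mirror of the --api-min-chars check (80 is the value used in the workflow example)."""
    return len(text.strip()) >= api_min_chars
```

For example, the index {"Soil": [0], "carbon": [1, 3], "stores": [2]} rebuilds to "Soil carbon stores carbon", which would then be rejected by the 80-character threshold.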

Webpage fallback path:

Use canonical_url then url.
Extract with trafilatura when available, else built-in HTML parser.
Persist as content_kind=fulltext.
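The priority order above can be sketched as a single selection function. This is an illustration under stated assumptions, not the script's actual code: pick_content and the injected fetcher callables are hypothetical names, the default thresholds are copied from the workflow example, and trying url after canonical_url yields too little text (rather than only when canonical_url is missing) is a guess.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[str]]


def pick_content(
    doi: str,
    urls: Sequence[Optional[str]],
    api_fetchers: Sequence[Fetcher],
    page_fetcher: Fetcher,
    api_min_chars: int = 80,
    min_chars: int = 300,
) -> Optional[Tuple[str, str]]:
    """Return (content_kind, text), or None when every source falls short."""
    # API metadata path: fetchers are tried in order (OpenAlex, then Semantic Scholar).
    for fetch in api_fetchers:
        text = (fetch(doi) or "").strip()
        if len(text) >= api_min_chars:
            return ("abstract", text)
    # Webpage fallback path: canonical_url first, then url.
    for url in urls:
        if not url:
            continue
        text = (page_fetcher(url) or "").strip()
        if len(text) >= min_chars:
            return ("fulltext", text)
    return None
```

Passing the fetchers in as callables keeps the ordering logic testable without network access.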

Update Semantics

Upsert key: doi.
Success: set status=ready, reset retry counters.
Failure with existing ready row: keep old content, record latest error.
Failure without ready row: set status=failed, increment retry state.
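A minimal sketch of these update rules against SQLite. The reduced table, the retry_count and last_error column names, the record_result helper, and the use of SHA-256 for content_hash are all assumptions for illustration; the real schema lives in references/schema.md.

```python
import hashlib
import sqlite3
from typing import Optional

# Reduced, hypothetical table for the demo.
DEMO_SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT
)
"""


def record_result(
    conn: sqlite3.Connection,
    doi: str,
    text: Optional[str] = None,
    error: Optional[str] = None,
) -> None:
    """Apply one fetch outcome; text=None means the fetch failed."""
    # Create the row if missing; the primary key keeps it to one row per DOI.
    conn.execute(
        "INSERT OR IGNORE INTO entry_content (doi, status) VALUES (?, 'failed')",
        (doi,),
    )
    if text is not None:
        # Success: store content, mark ready, reset retry state.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        conn.execute(
            "UPDATE entry_content SET status = 'ready', content_text = ?, "
            "content_hash = ?, content_length = ?, retry_count = 0, "
            "last_error = NULL WHERE doi = ?",
            (text, digest, len(text), doi),
        )
    else:
        # Failure: a ready row keeps its content and status and only records
        # the error; any other row becomes failed and counts one more retry.
        conn.execute(
            "UPDATE entry_content SET last_error = ?, "
            "retry_count = CASE WHEN status = 'ready' THEN retry_count "
            "ELSE retry_count + 1 END, "
            "status = CASE WHEN status = 'ready' THEN 'ready' ELSE 'failed' END "
            "WHERE doi = ?",
            (error, doi),
        )
    conn.commit()
```

Running the same DOI through this function repeatedly never adds a second row, which is the property that makes incremental syncs retry-safe.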

Configurable Parameters

--rss-db
--content-db
SUSTAIN_RSS_DB_PATH
SUSTAIN_FULLTEXT_DB_PATH
--limit
--force
--only-failed
--refetch-days
--timeout
--max-bytes
--min-chars
--openalex-email / OPENALEX_EMAIL
--s2-api-key / S2_API_KEY
--api-timeout
--api-min-chars
--disable-api-metadata
--max-retries
--retry-backoff-minutes
--user-agent
--disable-trafilatura
--fail-on-errors
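The flag/environment pairs in this list suggest that a flag, when given, takes precedence over the matching environment variable. The sketch below shows one common way to wire that up with argparse. It is an assumption-laden illustration: the real CLI uses subcommands (init-db, sync, fetch-entry, list-content) that are omitted here, build_parser is a hypothetical name, the pairing of --rss-db and --content-db with the SUSTAIN_* variables is inferred, and the numeric defaults are simply the values from the workflow example.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a flat parser whose defaults come from environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="fulltext_fetch.py")
    # Assumed pairing: the SUSTAIN_* variables back the two DB path flags.
    parser.add_argument("--rss-db", default=env.get("SUSTAIN_RSS_DB_PATH"))
    parser.add_argument("--content-db", default=env.get("SUSTAIN_FULLTEXT_DB_PATH"))
    parser.add_argument("--openalex-email", default=env.get("OPENALEX_EMAIL"))
    parser.add_argument("--s2-api-key", default=env.get("S2_API_KEY"))
    # Defaults taken from the workflow example, not confirmed script defaults.
    parser.add_argument("--api-min-chars", type=int, default=80)
    parser.add_argument("--min-chars", type=int, default=300)
    return parser
```

Accepting the environment as a parameter makes the precedence rule easy to check in isolation.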

Error Handling

Missing DOI-keyed entries table: stop with actionable message.
RSS DB and fulltext DB path collision: fail fast and require separate files.
API/network/HTTP failures: record failures and continue queue.
Webpage non-text content: mark failed for that DOI.
Short extraction: fail when the extracted text is below the length threshold (see --min-chars and --api-min-chars) to avoid storing low-quality content.
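Two of the rules above, the fail-fast path collision check and "record failures and continue queue", can be sketched as follows. Both helper names are hypothetical and the error message wording is invented for the example; the actual script may detect and report these conditions differently.

```python
import os
from typing import Callable, Dict, Iterable


def ensure_distinct_paths(rss_db: str, content_db: str) -> None:
    """Fail fast when both arguments resolve to the same file."""
    if os.path.realpath(rss_db) == os.path.realpath(content_db):
        raise ValueError(
            "RSS DB and fulltext DB must be separate files: " + rss_db
        )


def run_queue(dois: Iterable[str], fetch_one: Callable[[str], None]) -> Dict[str, str]:
    """Process every DOI; collect per-DOI errors instead of aborting the run."""
    errors: Dict[str, str] = {}
    for doi in dois:
        try:
            fetch_one(doi)
        except Exception as exc:  # one bad DOI must not stop the rest
            errors[doi] = str(exc)
    return errors
```

Comparing resolved paths rather than raw strings catches cases such as "a.db" versus "./a.db". A caller could inspect the returned error map to decide on a non-zero exit code, which is roughly what a flag like --fail-on-errors implies.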

References

references/schema.md
references/fetch-rules.md

Assets

assets/config.example.json

Scripts

scripts/fulltext_fetch.py
