sustainability-fulltext-fetch
Originally from tiangong-ai/skills
SKILL.md
Sustainability Fulltext Fetch
Core Goal
- Read relevant DOI entries from RSS metadata DB.
- Write fetched content into a separate fulltext DB.
- Process only relevant entries (is_relevant=1).
- Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
- Fallback to webpage fulltext extraction when API metadata is unavailable.
- Persist one content row per DOI in entry_content.
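The "relevant entries only" rule can be sketched as a single query against the RSS DB. This is a minimal illustration, not the logic of scripts/fulltext_fetch.py: the function name, the ordering, and the handling of empty DOIs are assumptions, and the doi_is_surrogate column is not considered here.

```python
import sqlite3


def select_relevant_dois(conn: sqlite3.Connection, limit: int = 50) -> list[str]:
    """Return DOIs of entries flagged is_relevant=1, skipping rows without a DOI."""
    rows = conn.execute(
        """
        SELECT DISTINCT doi
        FROM entries
        WHERE is_relevant = 1
          AND doi IS NOT NULL
          AND doi != ''
        ORDER BY doi
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the RSS DB (hypothetical sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (doi TEXT, is_relevant INTEGER, url TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?)",
        [
            ("10.1000/a", 1, "https://example.org/a", "Relevant paper"),
            ("10.1000/b", 0, "https://example.org/b", "Irrelevant paper"),
            (None, 1, "https://example.org/c", "Relevant, but no DOI"),
        ],
    )
    return conn
```

With the sample rows above, only the first entry qualifies: the second is not relevant and the third has no DOI to key on.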
Triggering Conditions
- Receive a request to enrich relevant DOI records with abstract/fulltext content.
- Receive a request to replace webpage-first crawling with API-first enrichment.
- Need retry-safe incremental updates without duplicate rows.
Workflow
- Ensure upstream DOI/relevance data exists.
- Set the database paths and initialize the fulltext DB.
export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"
- Run incremental sync (API first, webpage fallback).
python3 scripts/fulltext_fetch.py sync \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --limit 50 \
  --openalex-email "you@example.com" \
  --api-min-chars 80 \
  --min-chars 300
- Fetch one DOI on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --doi "10.1038/nature12373"
- Inspect stored content state.
python3 scripts/fulltext_fetch.py list-content \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --status ready \
  --limit 100
Data Contract
- Reads from RSS DB entries: doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
- Writes to fulltext DB entry_content (primary key doi):
  - source URL/status/extractor
  - content_kind (abstract or fulltext)
  - content_text, content_hash, content_length
  - retry fields and timestamps.
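For orientation, a table with this shape could look like the sketch below. Only doi, content_kind, content_text, content_hash, and content_length are named in the contract above; every other column name (source_url, status, extractor, retry_count, last_error, next_retry_at, created_at, updated_at) and the choice of SHA-256 for the hash are assumptions. The authoritative definition is in references/schema.md.

```python
import hashlib
import sqlite3

# Hypothetical DDL: column names beyond doi and content_* are illustrative guesses.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    source_url     TEXT,
    status         TEXT NOT NULL,
    extractor      TEXT,
    content_kind   TEXT CHECK (content_kind IN ('abstract', 'fulltext')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL
)
"""


def derive_content_fields(text: str) -> tuple[str, int]:
    """Return (content_hash, content_length) for a piece of extracted text.

    SHA-256 over the UTF-8 bytes is an assumption; the script may hash differently.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest, len(text)


def init_demo_db() -> sqlite3.Connection:
    """Create the sketch table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ENTRY_CONTENT_DDL)
    return conn
```

Making doi the primary key is what enforces the one-row-per-DOI rule: a second plain INSERT for the same DOI is rejected by the database.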
Extraction Priority
- API metadata path:
  - OpenAlex by DOI.
  - Semantic Scholar fallback by DOI.
  - If accepted (text length meets --api-min-chars), persist as content_kind=abstract.
- Webpage fallback path:
  - Use canonical_url, then url.
  - Extract with trafilatura when available, else the built-in HTML parser.
  - Persist as content_kind=fulltext.
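The priority order above can be sketched as a small function that tries each source in turn. The fetchers are injected as plain callables so the example runs without network access. The names Extracted and extract_by_priority, the "webpage" extractor label, and the reading of both thresholds as minimum character counts are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A fetcher takes a DOI or URL and returns text, or None when nothing was found.
Fetcher = Callable[[str], Optional[str]]


@dataclass
class Extracted:
    content_kind: str  # "abstract" or "fulltext"
    extractor: str     # hypothetical label for the source that produced the text
    text: str


def extract_by_priority(
    doi: str,
    urls: Sequence[Optional[str]],
    api_fetchers: Sequence[tuple[str, Fetcher]],
    page_fetcher: Fetcher,
    api_min_chars: int = 80,
    min_chars: int = 300,
) -> Optional[Extracted]:
    """Try API metadata first, then webpage extraction; None means all paths failed."""
    # 1. API metadata path, in the given order (OpenAlex, then Semantic Scholar).
    for name, fetch in api_fetchers:
        text = (fetch(doi) or "").strip()
        if len(text) >= api_min_chars:
            return Extracted("abstract", name, text)

    # 2. Webpage fallback path: canonical_url first, then url.
    for url in urls:
        if not url:
            continue
        text = (page_fetcher(url) or "").strip()
        if len(text) >= min_chars:
            return Extracted("fulltext", "webpage", text)

    return None
```

A caller would pass something like `[("openalex", fetch_openalex), ("semantic_scholar", fetch_s2)]` and `[canonical_url, url]`, where fetch_openalex and fetch_s2 are hypothetical wrappers around the respective APIs.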
Update Semantics
- Upsert key: doi.
- Success: status ready, reset retry counters.
- Failure with existing ready row: keep old content, record latest error.
- Failure without ready row: set status=failed, increment retry state.
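These rules map naturally onto SQLite's INSERT ... ON CONFLICT upsert. The sketch below is one possible reading, not the script's actual SQL: the reduced table, the retry_count and last_error column names, and the choice to leave the retry counter untouched when a ready row survives a failure are all assumptions.

```python
import sqlite3

# Reduced hypothetical table; the real schema is in references/schema.md.
DEMO_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi          TEXT PRIMARY KEY,
    status       TEXT NOT NULL,
    content_kind TEXT,
    content_text TEXT,
    retry_count  INTEGER NOT NULL DEFAULT 0,
    last_error   TEXT,
    updated_at   TEXT NOT NULL
)
"""


def record_success(conn: sqlite3.Connection, doi: str, kind: str, text: str, now: str) -> None:
    """Success: store content, set status ready, reset retry state."""
    conn.execute(
        """
        INSERT INTO entry_content
            (doi, status, content_kind, content_text, retry_count, last_error, updated_at)
        VALUES (?, 'ready', ?, ?, 0, NULL, ?)
        ON CONFLICT(doi) DO UPDATE SET
            status = 'ready',
            content_kind = excluded.content_kind,
            content_text = excluded.content_text,
            retry_count = 0,
            last_error = NULL,
            updated_at = excluded.updated_at
        """,
        (doi, kind, text, now),
    )
    conn.commit()


def record_failure(conn: sqlite3.Connection, doi: str, error: str, now: str) -> None:
    """Failure: keep a ready row's content; otherwise mark failed and count the retry."""
    conn.execute(
        """
        INSERT INTO entry_content (doi, status, retry_count, last_error, updated_at)
        VALUES (?, 'failed', 1, ?, ?)
        ON CONFLICT(doi) DO UPDATE SET
            last_error = excluded.last_error,
            updated_at = excluded.updated_at,
            status = CASE WHEN entry_content.status = 'ready'
                          THEN 'ready' ELSE 'failed' END,
            retry_count = CASE WHEN entry_content.status = 'ready'
                               THEN entry_content.retry_count
                               ELSE entry_content.retry_count + 1 END
        """,
        (doi, error, now),
    )
    conn.commit()
```

Because both paths go through the same primary key, rerunning a sync can update a row but never duplicate it, which is what makes the process retry-safe.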
Configurable Parameters
- --rss-db
- --content-db
- SUSTAIN_RSS_DB_PATH
- SUSTAIN_FULLTEXT_DB_PATH
- --limit
- --force
- --only-failed
- --refetch-days
- --timeout
- --max-bytes
- --min-chars
- --openalex-email / OPENALEX_EMAIL
- --s2-api-key / S2_API_KEY
- --api-timeout
- --api-min-chars
- --disable-api-metadata
- --max-retries
- --retry-backoff-minutes
- --user-agent
- --disable-trafilatura
- --fail-on-errors
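Several parameters pair a flag with an environment variable. One common way to wire such a pair is to use the environment value as the argparse default, so an explicit flag wins. The sketch below assumes that precedence and covers only a few of the flags; the default of 50 for --limit is borrowed from the sync example, not a documented default.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults fall back to environment variables.

    Assumed precedence: command-line flag, then environment variable, then None.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="fulltext_fetch.py")
    parser.add_argument("--rss-db", default=env.get("SUSTAIN_RSS_DB_PATH"))
    parser.add_argument("--content-db", default=env.get("SUSTAIN_FULLTEXT_DB_PATH"))
    parser.add_argument("--openalex-email", default=env.get("OPENALEX_EMAIL"))
    parser.add_argument("--s2-api-key", default=env.get("S2_API_KEY"))
    parser.add_argument("--limit", type=int, default=50)  # illustrative default
    return parser
```

Passing the environment as an argument keeps the function easy to test: `build_parser({"OPENALEX_EMAIL": "you@example.com"}).parse_args([])` picks up the email without touching the real environment.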
Error Handling
- Missing DOI-keyed entries table: stop with actionable message.
- RSS DB and fulltext DB path collision: fail fast and require separate files.
- API/network/HTTP failures: record failures and continue queue.
- Webpage non-text content: mark failed for that DOI.
- Short extraction: fail by threshold to avoid low-quality content.
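Two of these behaviors lend themselves to a short sketch: the fail-fast path collision check and the record-and-continue queue loop. The names ConfigError, ensure_separate_db_files, and process_queue are hypothetical, and comparing resolved paths with os.path.realpath is an assumption about how a collision might be detected.

```python
import os
from typing import Callable, Iterable


class ConfigError(Exception):
    """Raised for configuration problems that should stop the run immediately."""


def ensure_separate_db_files(rss_db: str, content_db: str) -> None:
    """Fail fast when both paths resolve to the same file."""
    if os.path.realpath(rss_db) == os.path.realpath(content_db):
        raise ConfigError(
            f"RSS DB and fulltext DB both resolve to {os.path.realpath(rss_db)}; "
            "point --content-db at a separate file."
        )


def process_queue(dois: Iterable[str], handle: Callable[[str], None]) -> dict[str, str]:
    """Run handle() for each DOI; record per-DOI failures and keep going."""
    failures: dict[str, str] = {}
    for doi in dois:
        try:
            handle(doi)
        except Exception as exc:  # one bad DOI must not stop the rest of the queue
            failures[doi] = f"{type(exc).__name__}: {exc}"
    return failures
```

The split mirrors the list above: configuration errors abort before any work starts, while per-DOI errors (network failures, non-text pages, short extractions) are collected so the remaining DOIs are still processed.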
References
references/schema.md
references/fetch-rules.md
Assets
assets/config.example.json
Scripts
scripts/fulltext_fetch.py
Repository: fadeloo/skills
First Seen: Feb 24, 2026