Scrape Status Diagnostic

Check the scrape state for date $ARGUMENTS (default: today).

1. JSON files on disk

Check JSON_data/ for the target date. For each pipeline:

Pipeline	Merged file pattern	Shard file pattern
CM Products	`collected_data_{date}.json`	`collected_data_{date}_*.json`
CM Singles	`collected_singles_{date}.json`	`collected_singles_{date}_*.json`
CM Graded	`collected_singles_graded_{date}.json`	`collected_singles_graded_{date}_*.json`
CM Boosters	`collected_boosters_{date}.json`	—
eBay Topsun	`ebay_topsun_{date}.json`	`ebay_topsun_{date}_*.json`
eBay Carddass	`ebay_carddass_{date}.json`	`ebay_carddass_{date}_*.json`
eBay Products	`ebay_products_{date}.json`	—

For each, report:

Does the merged file exist? How many records?
Are there shard files? How many, and do they all have .done markers?
Stranded data: shard files exist but merged file is missing/stale (fewer records than shards combined) — this means merge didn't run

Use: python -c "import json; d=json.load(open('JSON_data/<file>')); print(len(d) if isinstance(d,list) else sum(len(v) for v in d.values() if isinstance(v,list)))" to count records.

2. Database counts

Query backend/tracker.db for the target date:

SELECT 'price_entries' as t, COUNT(*) FROM price_entries WHERE collection_date = '{date}'
UNION ALL SELECT 'singles_price_entries', COUNT(*) FROM singles_price_entries WHERE collection_date = '{date}'
UNION ALL SELECT 'ebay_price_entries', COUNT(*) FROM ebay_price_entries WHERE collection_date = '{date}'
UNION ALL SELECT 'ebay_carddass_price_entries', COUNT(*) FROM ebay_carddass_price_entries WHERE collection_date = '{date}'
UNION ALL SELECT 'ebay_product_price_entries', COUNT(*) FROM ebay_product_price_entries WHERE collection_date = '{date}';

Also check booster_price_entries if it exists.

3. Expected vs actual

Compare DB counts against config sizes:

Products: from products_config import PRODUCTS — count per language
Singles: from singles_config import SINGLES — total count
Check coverage percentage

4. VPS sync status

Check if data has been pushed to VPS by looking at the scrape_runs table:

SELECT * FROM scrape_runs WHERE run_date = '{date}' ORDER BY scraped_at DESC;

If the scrape_runs table doesn't exist locally, note that push status can only be checked on VPS.

Output format

Scrape Status for {date}
========================

CM Products:   {json_count} scraped → {db_count} in DB  ({pct}% of {expected})
CM Singles:    {json_count} scraped → {db_count} in DB  ({pct}% of {expected})
CM Graded:     ...
CM Boosters:   ...
eBay Topsun:   ...
eBay Carddass: ...
eBay Products: ...

Issues:
  - [STRANDED] Singles: 3 shard files (3362 records) not merged — run: python daily/merge_collected_singles.py {date}
  - [PARTIAL] Graded: shard 3 missing .done marker — was interrupted mid-scrape
  - [NOT PUSHED] 1900 new records not on VPS — run: python deploy/push_to_vps.py --date {date}

Focus on actionable issues. If everything looks clean, say so briefly.

scrape-status

Scrape Status Diagnostic

1. JSON files on disk

2. Database counts

3. Expected vs actual

4. VPS sync status

Output format

More from dangaiden/something-skill

dangaiden-project-overview

review-pending