Data Pipeline
4-stage V2 charity evaluation pipeline with 100-point scoring.
Philosophy: Capture broadly, filter later. Correctness > cost, but we can have both.
Quick Reference (V2 Pipeline)
| Stage | Entry Point | What It Does |
|---|---|---|
| 1. Crawl | crawl.py | Collect data from 5 sources |
| 2. Process Data | process_data.py | Derive fields + reconcile sources |
| 3. Process Baseline | process_baseline.py | Generate baseline narratives + export + verify |
| 4. Process Rich | process_rich.py | Generate rich narratives + export + verify |
Wrapper: ./run_v2.sh runs all 4 stages
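The stage order in the table can be sketched in Python as a list of commands run one after another, stopping at the first failure. This is only an illustration of what a wrapper might do, not the contents of run_v2.sh. The helper names `build_stage_commands` and `run_stages` are hypothetical, the flags mirror the ones shown under CLI Commands, and passing the same charities file to every stage is a simplifying assumption.

```python
import subprocess
from typing import Callable, List, Optional

# Stage scripts in Quick Reference order. The boolean records whether the
# documented CLI example for that script passes --workers.
STAGES = [
    ("crawl.py", True),
    ("process_data.py", False),
    ("process_baseline.py", True),
    ("process_rich.py", True),
]


def build_stage_commands(
    charities_file: str, workers: Optional[int] = None
) -> List[List[str]]:
    """Build one argv list per stage, in pipeline order (hypothetical helper)."""
    commands = []
    for script, takes_workers in STAGES:
        cmd = ["uv", "run", "python", script, "--charities", charities_file]
        if takes_workers and workers is not None:
            cmd += ["--workers", str(workers)]
        commands.append(cmd)
    return commands


def run_stages(commands: List[List[str]], runner: Callable = subprocess.run) -> int:
    """Run commands in order and return how many stages completed.

    Stops at the first non-zero exit code so later stages never run on
    top of a failed earlier stage.
    """
    completed = 0
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            break
        completed += 1
    return completed
```

The `runner` parameter exists only so the ordering logic can be exercised without launching real processes.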
Decision Tree
Working on data collection? → See extraction.md for sources, patterns, red flags
Working on pipeline phases or state machine? → See orchestration.md for workflow, CLI, transitions
Working with database or debugging queries? → See data-pipeline/src/db/ for Supabase repositories
Implementing freshness checks or versioning? → See versioning.md for hashing, TTLs, skip logic
State Machine
NOT_STARTED → COLLECTED → DERIVED → RECONCILED
→ BASELINE_QUEUED → BASELINE_REVIEW
→ RICH_QUEUED → RICH_REVIEW
→ APPROVED (terminal) or REJECTED (terminal)
APPROVED and REJECTED are terminal: transitioning out of either requires force=True.
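A minimal sketch of how these transitions could be enforced is shown here. The state names come from the diagram above; the `State` enum, the `transition` function, and `InvalidTransition` are hypothetical names, not the project's actual API. The sketch also assumes that APPROVED and REJECTED are reachable only from RICH_REVIEW, as drawn, and that force=True bypasses all checks when leaving a terminal state.

```python
from enum import Enum


class State(Enum):
    NOT_STARTED = "NOT_STARTED"
    COLLECTED = "COLLECTED"
    DERIVED = "DERIVED"
    RECONCILED = "RECONCILED"
    BASELINE_QUEUED = "BASELINE_QUEUED"
    BASELINE_REVIEW = "BASELINE_REVIEW"
    RICH_QUEUED = "RICH_QUEUED"
    RICH_REVIEW = "RICH_REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"


TERMINAL = {State.APPROVED, State.REJECTED}

# Linear path from the diagram; each state may only advance to the next one.
_PATH = [
    State.NOT_STARTED, State.COLLECTED, State.DERIVED, State.RECONCILED,
    State.BASELINE_QUEUED, State.BASELINE_REVIEW,
    State.RICH_QUEUED, State.RICH_REVIEW,
]
ALLOWED = {a: {b} for a, b in zip(_PATH, _PATH[1:])}
# Assumption: only RICH_REVIEW leads to a terminal state.
ALLOWED[State.RICH_REVIEW] = set(TERMINAL)


class InvalidTransition(Exception):
    """Raised when a transition is not permitted (hypothetical)."""


def transition(current: State, target: State, force: bool = False) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if current in TERMINAL:
        if not force:
            raise InvalidTransition(f"{current.name} is terminal; pass force=True")
        return target  # assumption: force skips the ordering check
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.name} -> {target.name} skips a phase")
    return target
```

Keeping the allowed moves in one table means an attempt to skip a phase fails loudly instead of silently corrupting state.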
Key Files
data-pipeline/
├── run_v2.sh # Wrapper: all 4 stages
├── crawl.py # Stage 1: Collect data
├── process_data.py # Stage 2: Derive + reconcile
├── process_baseline.py # Stage 3: Baseline narratives
├── process_rich.py # Stage 4: Rich narratives
├── src/
│ ├── collectors/ # 5 data sources
│ ├── evaluators/ # NarrativeEvaluator, Judge
│ ├── scorers/ # V2 scoring (100-point scale)
│ ├── quality_judges/ # LLM-as-judge scorers
│ ├── database/ # Schema, WriteQueue, repository
│ └── cli/wizard.py # Interactive menu (uv run z)
└── pilot_charities.txt # Source of truth for EINs
CLI Commands
# Full V2 pipeline
./run_v2.sh --charities pilot_charities.txt --workers 10
# Individual stages
uv run python crawl.py --charities pilot_charities.txt --workers 10
uv run python process_data.py --charities pilot_charities.txt
uv run python process_baseline.py --charities pilot_charities.txt --workers 5
uv run python process_rich.py --charities rich_charities.txt --workers 3
# Interactive wizard
uv run z
# Status
zakaat status --ein 95-4453134
Critical Patterns
Supabase Repositories
Data access via repository pattern in src/db/:
from src.db import get_client
from src.db.charity_repository import CharityRepository

# Build the client once, then hand it to the repository
client = get_client()
repo = CharityRepository(client)

# Look up a single charity by EIN
charity = repo.get_by_ein("95-4453134")
Pilot Charities
All operations scope to pilot_charities.txt:
from src.cli.wizard import get_pilot_eins
eins = get_pilot_eins()
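The implementation of get_pilot_eins is not shown here. As a rough illustration of reading EINs from a file instead of hardcoding them, the sketch below assumes one EIN per line in the standard NN-NNNNNNN format, with blank lines and `#` comments ignored. The file layout and the names `parse_eins` and `load_eins` are assumptions for illustration only.

```python
import re
from pathlib import Path
from typing import List

# Standard EIN layout: two digits, a hyphen, seven digits.
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")


def parse_eins(text: str) -> List[str]:
    """Parse EINs from text, preserving order and dropping duplicates."""
    eins: List[str] = []
    seen = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not EIN_PATTERN.fullmatch(line):
            raise ValueError(f"line {lineno}: not a valid EIN: {line!r}")
        if line not in seen:
            seen.add(line)
            eins.append(line)
    return eins


def load_eins(path: str = "pilot_charities.txt") -> List[str]:
    """Read and parse the EIN list from disk (hypothetical helper)."""
    return parse_eins(Path(path).read_text(encoding="utf-8"))
```

Rejecting malformed lines up front avoids passing a mistyped EIN to every downstream stage.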
Anti-Patterns
Don't:
- Skip phases (must go in order)
- Hardcode EINs (use pilot_charities.txt)
- Fabricate missing data
Do:
- Use repository pattern for database access
- Track source for every datum
- Check freshness before expensive operations
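The last two items can be sketched together: each value carries its source and fetch time, and an expensive step is skipped only when the stored data is both within its TTL and unchanged. The real logic is described in versioning.md and may differ. `SourcedValue`, `content_hash`, and `should_skip` are hypothetical names, and the choice of SHA-256 over canonical JSON is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class SourcedValue:
    """A datum plus where and when it was obtained (hypothetical type)."""
    value: Any            # None means "missing", never a made-up value
    source: str           # name of the collector that produced it
    fetched_at: datetime  # timezone-aware timestamp


def content_hash(payload: dict) -> str:
    """Hash a JSON-serializable dict independent of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_skip(
    fetched_at: datetime,
    ttl: timedelta,
    stored_hash: str,
    current_hash: str,
    now: Optional[datetime] = None,
) -> bool:
    """Skip the expensive step only if data is fresh AND unchanged."""
    now = now or datetime.now(timezone.utc)
    fresh = (now - fetched_at) < ttl
    unchanged = stored_hash == current_hash
    return fresh and unchanged
```

Requiring both conditions errs on the side of re-running work, in line with the "Correctness > cost" philosophy stated at the top.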
Related Skills
- llm-prompting: Prompt patterns, schema enforcement
- form990-expert: 990 parsing, financial analysis
- zakat-fiqh: Zakat classification, wallet tags