# Data Pipelines Skill

**Version:** 1.0 · **Stack:** Airflow, Step Functions, dbt, general ETL
Pipelines fail silently. A non-idempotent task appends duplicate rows every time it retries. A task without validation passes bad data downstream, and you discover it three stages later when a dashboard shows impossible numbers. A pipeline without parameterized dates can't be backfilled, which means when something goes wrong on Tuesday, you manually reprocess every day since the last known good state.
Idempotent tasks, quality checks between stages, and parameterized execution mean failures are recoverable and errors are caught where they happen, not downstream.
## Scope and Boundaries
This skill covers:
- ETL/ELT architecture decisions
- Orchestration patterns (DAGs, dependencies)
- Idempotency and retry strategies
- Data quality checks
- Incremental vs. full refresh
- Backfill patterns
Defers to other skills:
- `data-python`: Python code within tasks
- `data-sql`: Query patterns within transforms
- `data-aws`: AWS-specific service patterns
**Use this skill when:** Designing or building data pipelines.
## Core Principles
- Idempotent Tasks — Running twice produces the same result.
- Atomic Operations — All-or-nothing; no partial writes.
- Data Contracts — Define schema expectations between stages.
- Observable Pipelines — Log, metric, alert at every stage.
- Incremental When Possible — Process only changed data.
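The "Atomic Operations" principle can be sketched with nothing but the standard library. The `atomic_write_json` helper below is hypothetical (not part of this skill's assets): it writes to a temp file and renames it into place, so a crash mid-write never leaves readers a partial file.

```python
import json
import os
import tempfile

def atomic_write_json(records, dest_path):
    """Write records to dest_path all-or-nothing: readers never see a partial file."""
    # Create the temp file in the destination directory so os.replace
    # stays on one filesystem and remains a single atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)
        os.replace(tmp_path, dest_path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file, then re-raise
        raise
```

`os.replace` either succeeds completely or leaves the old file untouched, which is exactly the "no partial writes" guarantee.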
## Patterns

### Idempotent Writes
```python
from sqlalchemy import text

# Bad - appends duplicate data on retry
def load_data(df, table):
    df.to_sql(table, engine, if_exists="append")

# Good - delete the partition first, then insert; safe to retry
def load_data(df, table, partition_date):
    with engine.begin() as conn:
        # Bind the date as a parameter; table names cannot be bound
        conn.execute(
            text(f"DELETE FROM {table} WHERE date = :d"),
            {"d": partition_date},
        )
        df.to_sql(table, conn, if_exists="append", index=False)
```
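DELETE-then-INSERT replaces a whole partition; when rows carry a natural key, an upsert is an equally idempotent alternative. A minimal sketch using the stdlib `sqlite3` driver (the `orders` table and its columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

def upsert_orders(conn, rows):
    """Idempotent load: re-running with the same rows leaves the table unchanged."""
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

upsert_orders(conn, [(1, 9.99), (2, 5.00)])
upsert_orders(conn, [(1, 9.99), (2, 5.00)])  # retry: still exactly two rows
```

Most warehouses expose the same idea as `MERGE`; the key property is that a retry converges to the same final state instead of appending duplicates.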
### Stage Pattern

```
raw/          # Exact copy of source, immutable
├── source_a/
└── source_b/
staged/       # Cleaned, typed, validated
├── source_a/
└── source_b/
curated/      # Business logic applied, joined
├── dim_users/
├── dim_products/
└── fact_orders/
aggregated/   # Pre-computed metrics
├── daily_revenue/
└── user_cohorts/
```
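One way to keep the stage layout consistent across tasks is a small path helper. The sketch below uses the stage names from the tree above; the date partition and `data/` root are assumptions for illustration:

```python
from datetime import date
from pathlib import Path

STAGES = ("raw", "staged", "curated", "aggregated")

def stage_path(stage, dataset, partition_date, root="data"):
    """Build a partitioned location like data/staged/source_a/date=2026-02-17."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return Path(root) / stage / dataset / f"date={partition_date.isoformat()}"

print(stage_path("staged", "source_a", date(2026, 2, 17)).as_posix())
# → data/staged/source_a/date=2026-02-17
```

Centralizing the layout in one function means a rename of a stage or dataset touches one place, and every task reads and writes the same partition for a given execution date.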
### Data Quality Checks

```python
import pandas as pd

def validate_output(df: pd.DataFrame) -> None:
    """Run after transform, before load."""
    assert not df.empty, "Output is empty"
    assert df["id"].is_unique, "Duplicate IDs found"
    assert df["amount"].notna().all(), "NULL amounts found"
    assert (df["amount"] >= 0).all(), "Negative amounts found"
```
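The same contract can be enforced without pandas. A stdlib sketch over a list of row dicts (the `id` and `amount` field names are the assumed schema), raising instead of asserting so the checks survive `python -O`:

```python
def validate_rows(rows):
    """Fail fast between transform and load; never pass bad data downstream."""
    if not rows:
        raise ValueError("Output is empty")
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate IDs found")
    if any(r.get("amount") is None for r in rows):
        raise ValueError("NULL amounts found")
    if any(r["amount"] < 0 for r in rows):
        raise ValueError("Negative amounts found")

validate_rows([{"id": 1, "amount": 10.0}, {"id": 2, "amount": 0.0}])  # passes
```

Because the check runs between transform and load, a violation stops the pipeline at the stage that produced the bad data, not three stages later.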
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Non-idempotent tasks | Duplicates on retry | DELETE then INSERT, or use upserts |
| Monolithic DAGs | Hard to debug, long recovery | Break into smaller, focused DAGs |
| No data validation | Bad data propagates silently | Add checks between stages |
| Hardcoded dates | Can't backfill | Parameterize execution date |
| Silent failures | Issues discovered too late | Alert on anomalies, not just errors |
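"Parameterize execution date" in the table above is what turns a backfill into a loop rather than a manual chore. A hedged sketch, where `process_partition` is a stand-in for any idempotent per-day task:

```python
from datetime import date, timedelta

def process_partition(run_date):
    # Stand-in for an idempotent per-day task (extract/transform/load one date)
    print(f"processing partition {run_date.isoformat()}")

def backfill(start, end, task=process_partition):
    """Re-run every daily partition from start to end, inclusive."""
    day = start
    while day <= end:
        task(day)  # safe to repeat: each run replaces only its own partition
        day += timedelta(days=1)

backfill(date(2026, 2, 10), date(2026, 2, 12))  # runs 3 partitions
```

Orchestrators such as Airflow provide this loop natively, but only if every task takes its date from the run context instead of calling "today" internally.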
## Checklist

- [ ] All tasks are idempotent
- [ ] Data quality checks between stages
- [ ] Execution date is parameterized
- [ ] Backfill tested and documented
- [ ] Alerting on failures AND anomalies
- [ ] Clear data lineage documented
## References

- `references/idempotency.md` — Idempotent write patterns and testing
## Assets

- `assets/pipeline-checklist.md` — Comprehensive pipeline design checklist