data-pipeline-architect
SKILL.md
Data Pipeline Architect
This skill provides guidance for designing robust, scalable data pipelines that move data reliably from sources to destinations.
Core Competencies
- ETL vs ELT: Traditional Extract-Transform-Load vs modern Extract-Load-Transform patterns
- Orchestration: Airflow, Dagster, Prefect, dbt for workflow management
- Data Quality: Validation, monitoring, lineage tracking
- Scalability: Batch vs streaming, partitioning, parallelization
Pipeline Design Process
1. Requirements Analysis
To begin pipeline design, gather:
- Source systems and data formats (APIs, databases, files, streams)
- Target destinations (data warehouse, lake, lakehouse)
- Freshness requirements (real-time, hourly, daily)
- Data volume and velocity estimates
- Quality and compliance requirements
2. Architecture Selection
Batch Pipelines - For periodic bulk processing:
- Schedule-driven (hourly, daily, weekly)
- Higher latency tolerance
- Simpler error recovery (re-run entire batch)
- Tools: Airflow, dbt, Spark
Streaming Pipelines - For real-time requirements:
- Event-driven processing
- Sub-second to minute latency
- Complex state management
- Tools: Kafka, Flink, Spark Streaming
Hybrid Approaches - Lambda or Kappa architecture:
- Batch layer for completeness
- Speed layer for low latency
- Serving layer for queries
3. ETL vs ELT Decision
ETL (Transform before Load):
- When target has limited compute
- When transformation reduces data volume significantly
- When sensitive data must be masked before landing
- Legacy data warehouse patterns
ELT (Transform after Load):
- Modern cloud warehouses with cheap compute
- When raw data preservation is needed
- When transformations change frequently
- dbt-style transformations in warehouse
4. Pipeline Components
Extraction Layer:
- Full extraction vs incremental (CDC, timestamp-based)
- API pagination and rate limiting
- Connection pooling and retry logic
- Schema detection and drift handling
Transformation Layer:
- Data cleansing and standardization
- Business logic application
- Aggregation and denormalization
- Type casting and null handling
Loading Layer:
- Upsert strategies (merge, delete+insert)
- Partitioning schemes (time, hash, range)
- Index management
- Transaction boundaries
5. Error Handling Patterns
┌─────────────────────────────────────────────────────────┐
│ Pipeline Execution │
├─────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Extract │───▶│ Transform │───▶│ Load │ │
│ └────┬────┘ └─────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Retry │ │ Dead Letter│ │ Rollback │ │
│ │ w/Backoff│ │ Queue │ │ Checkpoint│ │
│ └─────────┘ └───────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
- Retry with backoff: Transient failures (network, rate limits)
- Dead letter queues: Poison messages that can't be processed
- Checkpointing: Resume from last successful point
- Idempotency: Safe to re-run without duplicates
6. Data Quality Framework
Implement checks at each stage:
| Stage | Check Type | Example |
|---|---|---|
| Extract | Completeness | Row count matches source |
| Extract | Freshness | Data timestamp within SLA |
| Transform | Validity | Values in expected ranges |
| Transform | Uniqueness | Primary keys unique |
| Load | Reconciliation | Target matches source totals |
| Load | Integrity | Foreign keys valid |
7. Monitoring and Observability
Essential metrics to track:
- Pipeline duration and trends
- Row counts at each stage
- Error rates and types
- Data freshness (time since last successful run)
- Resource utilization
Alert on:
- SLA breaches (data not fresh)
- Anomalous row counts (±20% from baseline)
- Schema changes in sources
- Repeated failures
Common Patterns
Slowly Changing Dimensions (SCD)
- Type 1: Overwrite (no history)
- Type 2: Add row with validity dates
- Type 3: Previous value column
- Type 4: History table
Incremental Processing
-- Timestamp-based incremental
SELECT * FROM source
WHERE updated_at > {{ last_run_timestamp }}
-- CDC-based (Change Data Capture)
-- Captures inserts, updates, deletes from transaction log
Idempotent Loads
-- Delete + Insert pattern
DELETE FROM target WHERE date_partition = '2024-01-15';
INSERT INTO target SELECT * FROM staging WHERE date_partition = '2024-01-15';
-- Merge/Upsert pattern
MERGE INTO target t
USING staging s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
References
references/orchestration-patterns.md- Airflow, Dagster, Prefect patternsreferences/data-quality-checks.md- Validation frameworks and rulesreferences/pipeline-templates.md- Common pipeline architectures
Weekly Installs
2
Repository
4444j99/a-i--skillsGitHub Stars
3
First Seen
6 days ago
Security Audits
Installed on
amp2
cline2
openclaw2
opencode2
cursor2
kimi-cli2