# Data Pipeline Architect
This skill provides guidance for designing robust, scalable data pipelines that move data reliably from sources to destinations.
## Core Competencies
- ETL vs ELT: Traditional Extract-Transform-Load vs modern Extract-Load-Transform patterns
- Orchestration: Airflow, Dagster, Prefect, dbt for workflow management
- Data Quality: Validation, monitoring, lineage tracking
- Scalability: Batch vs streaming, partitioning, parallelization
## Pipeline Design Process

### 1. Requirements Analysis
To begin pipeline design, gather:
- Source systems and data formats (APIs, databases, files, streams)
- Target destinations (data warehouse, lake, lakehouse)
- Freshness requirements (real-time, hourly, daily)
- Data volume and velocity estimates
- Quality and compliance requirements
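One way to make these requirements concrete is to capture them in a small structured record that downstream design decisions can query. This is an illustrative sketch, not a standard schema; all field names are invented, and the 15-minute streaming threshold is only a rule of thumb.

```python
from dataclasses import dataclass

# Hypothetical requirements record; field names are illustrative only.
@dataclass(frozen=True)
class PipelineRequirements:
    sources: tuple[str, ...]     # e.g. ("orders_db", "billing_api")
    destination: str             # e.g. "warehouse"
    freshness_sla_minutes: int   # maximum acceptable staleness
    est_rows_per_day: int
    pii_fields: tuple[str, ...]  # columns needing masking / compliance handling

    def needs_streaming(self) -> bool:
        # Rule of thumb: sub-15-minute freshness usually rules out batch schedules.
        return self.freshness_sla_minutes < 15

reqs = PipelineRequirements(
    sources=("orders_db",),
    destination="warehouse",
    freshness_sla_minutes=60,
    est_rows_per_day=2_000_000,
    pii_fields=("email",),
)
print(reqs.needs_streaming())  # False: hourly freshness fits a batch schedule
```

Writing requirements down this way forces the freshness and compliance questions to be answered before architecture selection, rather than discovered mid-build.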
### 2. Architecture Selection
Batch Pipelines - For periodic bulk processing:
- Schedule-driven (hourly, daily, weekly)
- Higher latency tolerance
- Simpler error recovery (re-run entire batch)
- Tools: Airflow, dbt, Spark
Streaming Pipelines - For real-time requirements:
- Event-driven processing
- Sub-second to minute latency
- Complex state management
- Tools: Kafka, Flink, Spark Streaming
Hybrid Approaches - Lambda or Kappa architecture:
- Batch layer for completeness
- Speed layer for low latency
- Serving layer for queries
### 3. ETL vs ELT Decision
ETL (Transform before Load):
- When target has limited compute
- When transformation reduces data volume significantly
- When sensitive data must be masked before landing
- Legacy data warehouse patterns
ELT (Transform after Load):
- Modern cloud warehouses with cheap compute
- When raw data preservation is needed
- When transformations change frequently
- dbt-style transformations in warehouse
### 4. Pipeline Components
Extraction Layer:
- Full extraction vs incremental (CDC, timestamp-based)
- API pagination and rate limiting
- Connection pooling and retry logic
- Schema detection and drift handling
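The extraction concerns above can be combined in one loop: cursor-based pagination, retry with exponential backoff for transient errors, and a timestamp watermark for incremental pulls. This is a minimal sketch against a simulated API; `fetch_page`, its page layout, and the single injected failure are all invented for the demo.

```python
import time

def fetch_page(cursor, _failures={"left": 1}):
    """Stand-in for a real paginated API; fails once to exercise the retry path."""
    if _failures["left"] > 0:  # simulate one transient failure
        _failures["left"] -= 1
        raise ConnectionError("transient network error")
    pages = {
        None: ([{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 120}], "p2"),
        "p2": ([{"id": 3, "updated_at": 130}], None),
    }
    return pages[cursor]

def extract_incremental(last_watermark, max_retries=3, base_delay=0.01):
    """Pull all pages newer than last_watermark, retrying transient errors."""
    rows, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, cursor = fetch_page(cursor)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        rows.extend(r for r in page if r["updated_at"] > last_watermark)
        if cursor is None:  # no more pages
            return rows, max((r["updated_at"] for r in rows), default=last_watermark)

new_rows, watermark = extract_incremental(last_watermark=110)
print(len(new_rows), watermark)  # 2 130
```

Persisting the returned watermark (in a state table or the orchestrator's metadata) is what makes the next run incremental rather than full.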
Transformation Layer:
- Data cleansing and standardization
- Business logic application
- Aggregation and denormalization
- Type casting and null handling
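A transformation step typically applies all four of the concerns above to each record. The sketch below is illustrative only; the field names and cleansing rules are invented, not taken from a real schema.

```python
def transform(record):
    """Cleansing sketch: cast types, standardize strings, handle nulls."""
    out = {}
    out["id"] = int(record["id"])  # type casting: source sends strings
    # Standardize email; map empty string to NULL rather than storing "".
    out["email"] = (record.get("email") or "").strip().lower() or None
    # Null handling: missing amount defaults to 0; store cents to avoid float drift.
    amount = record.get("amount")
    out["amount_cents"] = round(float(amount) * 100) if amount is not None else 0
    return out

raw = {"id": "42", "email": "  Ada@Example.COM ", "amount": "19.99"}
print(transform(raw))  # {'id': 42, 'email': 'ada@example.com', 'amount_cents': 1999}
```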
Loading Layer:
- Upsert strategies (merge, delete+insert)
- Partitioning schemes (time, hash, range)
- Index management
- Transaction boundaries
### 5. Error Handling Patterns
```
┌───────────────────────────────────────────────────────┐
│                  Pipeline Execution                   │
├───────────────────────────────────────────────────────┤
│   ┌───────────┐     ┌───────────┐     ┌───────────┐   │
│   │  Extract  │────▶│ Transform │────▶│   Load    │   │
│   └─────┬─────┘     └─────┬─────┘     └─────┬─────┘   │
│         │                 │                 │         │
│         ▼                 ▼                 ▼         │
│   ┌───────────┐     ┌───────────┐     ┌───────────┐   │
│   │   Retry   │     │Dead Letter│     │ Rollback  │   │
│   │ w/Backoff │     │   Queue   │     │ Checkpoint│   │
│   └───────────┘     └───────────┘     └───────────┘   │
└───────────────────────────────────────────────────────┘
```
- Retry with backoff: Transient failures (network, rate limits)
- Dead letter queues: Poison messages that can't be processed
- Checkpointing: Resume from last successful point
- Idempotency: Safe to re-run without duplicates
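Checkpointing and dead-letter routing compose naturally: failed records go to the DLQ instead of aborting the run, and the checkpoint lets a crashed or repeated run resume without reprocessing. A minimal sketch, assuming a file-based checkpoint and an in-memory DLQ; the record shape and `process` logic are invented.

```python
import json
import os
import tempfile

def process(record):
    """Toy processing step; records marked 'bad' simulate poison messages."""
    if record.get("bad"):
        raise ValueError("poison message")
    return record["n"] * 2

def run_with_recovery(records, checkpoint_path, dead_letter):
    start = 0
    if os.path.exists(checkpoint_path):  # resume from last successful point
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    results = []
    for i in range(start, len(records)):
        try:
            results.append(process(records[i]))
        except ValueError as exc:
            # Poison message: park it in the DLQ and keep the pipeline moving.
            dead_letter.append({"record": records[i], "error": str(exc)})
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
    return results

dlq = []
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = run_with_recovery([{"n": 1}, {"bad": True}, {"n": 3}], path, dlq)
print(out, len(dlq))  # [2, 6] 1
```

Re-running the same call is a no-op (it returns `[]`), because the checkpoint already points past the last record: the checkpoint, not the processing logic, is what makes the re-run safe here.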
### 6. Data Quality Framework
Implement checks at each stage:
| Stage | Check Type | Example |
|---|---|---|
| Extract | Completeness | Row count matches source |
| Extract | Freshness | Data timestamp within SLA |
| Transform | Validity | Values in expected ranges |
| Transform | Uniqueness | Primary keys unique |
| Load | Reconciliation | Target matches source totals |
| Load | Integrity | Foreign keys valid |
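The checks in the table can be expressed as a simple runner that returns a list of failures instead of raising, so a single run reports every problem at once. This is a hypothetical check runner; the 1-hour freshness SLA, the amount range, and the row shape are example thresholds, not standards.

```python
import time

def run_checks(rows, source_count, now=None):
    """Run completeness, freshness, uniqueness, and validity checks; return failures."""
    now = now or time.time()
    failures = []
    if len(rows) != source_count:  # completeness: row count matches source
        failures.append("completeness: row count mismatch")
    if rows and now - max(r["loaded_at"] for r in rows) > 3600:  # freshness: 1h SLA
        failures.append("freshness: data older than SLA")
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):  # uniqueness: primary keys unique
        failures.append("uniqueness: duplicate primary keys")
    if any(not (0 <= r["amount"] <= 1_000_000) for r in rows):  # validity: range check
        failures.append("validity: amount out of range")
    return failures

rows = [{"id": 1, "amount": 10, "loaded_at": 1000},
        {"id": 1, "amount": 20, "loaded_at": 1000}]
print(run_checks(rows, source_count=2, now=1500))  # ['uniqueness: duplicate primary keys']
```

In practice frameworks like Great Expectations or dbt tests play this role; the value of the pattern is that a run surfaces all failing checks, not just the first.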
### 7. Monitoring and Observability
Essential metrics to track:
- Pipeline duration and trends
- Row counts at each stage
- Error rates and types
- Data freshness (time since last successful run)
- Resource utilization
Alert on:
- SLA breaches (data not fresh)
- Anomalous row counts (±20% from baseline)
- Schema changes in sources
- Repeated failures
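The ±20% row-count alert above reduces to a one-line deviation check against a baseline. Using the trailing mean of recent runs as the baseline is one reasonable choice, not a standard; seasonal data may need a day-of-week baseline instead.

```python
def row_count_alerts(history, today, tolerance=0.2):
    """Flag a run whose row count deviates more than ±tolerance from the trailing mean."""
    baseline = sum(history) / len(history)
    deviation = abs(today - baseline) / baseline
    return deviation > tolerance, round(deviation, 3)

alert, dev = row_count_alerts([1000, 1040, 980], today=700)
print(alert, dev)  # True 0.305
```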
## Common Patterns

### Slowly Changing Dimensions (SCD)
- Type 1: Overwrite (no history)
- Type 2: Add row with validity dates
- Type 3: Previous value column
- Type 4: History table
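Type 2 is the most involved of the four, so it is worth sketching the write path: close out the current row by setting its end date, then insert the new version. This sketch uses SQLite so it runs anywhere; the `dim_customer` table, its columns, and the `9999-12-31` sentinel end date are illustrative conventions, not requirements.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT, is_current INTEGER)""")

def scd2_upsert(con, cust_id, city, as_of):
    """SCD Type 2: on change, close the current row and add a new versioned row."""
    cur = con.execute(
        "SELECT city FROM dim_customer WHERE id = ? AND is_current = 1", (cust_id,))
    row = cur.fetchone()
    if row and row[0] == city:
        return  # attribute unchanged: nothing to do
    if row:     # close out the old version
        con.execute("""UPDATE dim_customer SET valid_to = ?, is_current = 0
                       WHERE id = ? AND is_current = 1""", (as_of, cust_id))
    con.execute("INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
                (cust_id, city, as_of))

scd2_upsert(con, 1, "Lisbon", "2024-01-01")
scd2_upsert(con, 1, "Porto", "2024-06-01")  # city change: new row, old one closed
rows = con.execute("""SELECT city, valid_to, is_current FROM dim_customer
                      ORDER BY valid_from""").fetchall()
print(rows)  # [('Lisbon', '2024-06-01', 0), ('Porto', '9999-12-31', 1)]
```

A point-in-time query then filters on `valid_from <= date < valid_to` instead of `is_current`, which is what gives Type 2 its full history.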
### Incremental Processing

```sql
-- Timestamp-based incremental
SELECT * FROM source
WHERE updated_at > {{ last_run_timestamp }}

-- CDC-based (Change Data Capture):
-- captures inserts, updates, and deletes from the transaction log
```
### Idempotent Loads

```sql
-- Delete + insert pattern
DELETE FROM target WHERE date_partition = '2024-01-15';
INSERT INTO target SELECT * FROM staging WHERE date_partition = '2024-01-15';

-- Merge/upsert pattern
MERGE INTO target t
USING staging s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```
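The point of the delete + insert pattern is that re-running it produces no duplicates, which is easy to verify. A small demonstration using SQLite; the `target`/`staging` schema is invented, and the `with con:` block supplies the transaction boundary so a failed run leaves the partition untouched.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, date_partition TEXT)")
con.execute("CREATE TABLE staging (id INTEGER, date_partition TEXT)")
con.executemany("INSERT INTO staging VALUES (?, ?)",
                [(1, "2024-01-15"), (2, "2024-01-15"), (3, "2024-01-16")])

def load_partition(con, part):
    """Idempotent delete + insert of one partition, inside a single transaction."""
    with con:  # transaction boundary: delete and insert commit (or roll back) together
        con.execute("DELETE FROM target WHERE date_partition = ?", (part,))
        con.execute(
            "INSERT INTO target SELECT * FROM staging WHERE date_partition = ?", (part,))

load_partition(con, "2024-01-15")
load_partition(con, "2024-01-15")  # re-run: same partition, no duplicates
print(con.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 2
```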
## References

- references/orchestration-patterns.md - Airflow, Dagster, Prefect patterns
- references/data-quality-checks.md - Validation frameworks and rules
- references/pipeline-templates.md - Common pipeline architectures