data-engineering-orchestration
Pipeline Orchestration
Workflow orchestration tools for data pipelines: Prefect, Dagster, and dbt, with FlowerPower as a lightweight alternative. Between them, these tools cover scheduling, dependency resolution, retries, monitoring, and state management for production data pipelines.
Quick Comparison
| Tool | Paradigm | Best For | Learning Curve |
|---|---|---|---|
| Prefect | Flow-based | Pythonic workflows, quick prototypes, cloud-first | Moderate |
| Dagster | Asset-based | Data asset lineage, reproducibility, type checking | Steeper |
| dbt | SQL transformations | Analytics engineering, ELT, data warehouses | Low (SQL-focused) |
| FlowerPower | Hamilton DAGs | Lightweight batch ETL, configuration-driven pipelines | Low-Moderate |
When to Use Which?
- Prefect: You want Python code flexibility, Prefect Cloud UI, and quick setup. Good for general-purpose data pipelines, ETL, and API integrations.
- Dagster: You care about data asset observability, type safety, and reproducibility. Good for complex data platforms with clear asset dependencies.
- dbt: Your transformations are primarily SQL and you're building analytics marts in a data warehouse. Great for analytics engineering teams.
Skill Dependencies
Assumes familiarity with:
- @data-engineering-core - Polars, DuckDB, PyArrow
- @data-engineering-storage-remote-access - Cloud storage for intermediate data
Related:
- @data-engineering-quality - Data validation integrated into orchestration
- @data-engineering-observability - Monitoring and tracing
- @data-engineering-storage-lakehouse - Delta/Iceberg for state management
Detailed Guides
Prefect
See: @data-engineering-orchestration/prefect.md
- Flows and tasks with decorators
- Retries, caching, and parameters
- Prefect Cloud (serverless) vs Prefect Server (self-hosted)
- Deployment patterns
Dagster
See: @data-engineering-orchestration/dagster.md
- Asset-based programming model
- Materialization and partitions
- Type checking with Dagster types
- Sensors and schedules
- Integration with data platforms
dbt (Data Build Tool)
See: @data-engineering-orchestration/dbt.md
- Projects, models, tests, snapshots, seeds
- Jinja templating and macros
- Data testing (schema, cardinality, custom)
- Documentation generation
- Package management (dbt packages)
- Adapters (DuckDB, Postgres, Snowflake, BigQuery, Spark)
FlowerPower (Lightweight Alternative)
FlowerPower is a lightweight DAG orchestration framework built on Apache Hamilton, ideal for batch ETL and data transformation scripts without the overhead of full orchestrators.
Key characteristics:
- Hamilton-based: Define pipelines as Python functions; DAG auto-constructed
- Configuration-driven: YAML files for parameters and execution settings
- Lightweight: No database, no scheduler, no state persistence (batch-only)
- Multiple executors: synchronous, threadpool, processpool, ray, dask
- I/O plugins: Delta Lake, DuckDB, Polars, Pandas, S3, PostgreSQL, and more
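To make the "DAG auto-constructed" idea concrete, here is a minimal standard-library sketch of the general technique: each function name is treated as a node, and its parameter names are treated as upstream dependencies. This is not Hamilton's or FlowerPower's actual API; the node functions and the `build_dag` and `execute` helpers are hypothetical names used only for illustration.

```python
import inspect

# Hypothetical pipeline nodes: the function name is the output,
# and each parameter name refers to an upstream node.
def raw_orders() -> list:
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]

def total_amount(raw_orders: list) -> float:
    return sum(row["amount"] for row in raw_orders)

def order_count(raw_orders: list) -> int:
    return len(raw_orders)

def average_amount(total_amount: float, order_count: int) -> float:
    return total_amount / order_count

def build_dag(functions):
    """Map each node name to the parameter names it depends on."""
    return {fn.__name__: list(inspect.signature(fn).parameters) for fn in functions}

def execute(functions, target, cache=None):
    """Compute `target`, recursively computing its upstream nodes first."""
    cache = {} if cache is None else cache
    nodes = {fn.__name__: fn for fn in functions}
    if target not in cache:
        fn = nodes[target]
        kwargs = {
            name: execute(functions, name, cache)
            for name in inspect.signature(fn).parameters
        }
        cache[target] = fn(**kwargs)
    return cache[target]

NODES = [raw_orders, total_amount, order_count, average_amount]
dag = build_dag(NODES)
result = execute(NODES, "average_amount")
```

Because the graph is derived from signatures, adding a node is just adding a function; there is no separate wiring step to keep in sync.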
When to choose FlowerPower over Prefect/Dagster:
- Simple batch pipelines (daily/hourly ETL)
- Quick prototyping that can grow
- Teams that prefer pipeline logic in code (Python functions) over a UI, with YAML reserved for parameters and execution settings
- No need for sophisticated scheduling, SLA tracking, or long-running state
When NOT to use:
- Production 24/7 workflows requiring reliability guarantees
- Complex dependency graphs with cross-dependencies
- Need for built-in retry policies with circuit breakers
- Workflows requiring checkpoints and state recovery
- Multi-team orchestration with fine-grained permissions
FlowerPower limitations vs. Prefect/Dagster:
| Feature | Prefect/Dagster | FlowerPower |
|---|---|---|
| Scheduling | Native (cron, intervals) | External (cron/systemd) |
| State persistence | Database/cloud | None (ephemeral) |
| Retry policies | Configurable per task | Per-pipeline via YAML |
| Observability | Rich UI, lineage | Basic Hamilton UI |
| Production readiness | High | Moderate (batch jobs) |
Integration with data-engineering stack:
- Uses Polars/DuckDB for DataFrame operations (@data-engineering-core)
- Delta Lake for ACID table formats (@data-engineering-storage-lakehouse)
- fsspec/S3 for cloud storage (@data-engineering-storage-remote-access)
- Pandera for data validation (@data-engineering-quality)
- Follows medallion architecture (@data-engineering-best-practices)
Skill reference: @flowerpower - Complete guide to FlowerPower with advanced production patterns (watermarks, data quality, incremental loads, cloud deployment).
Cloud Storage Integration
See: @data-engineering-orchestration/integrations/cloud-storage.md
- dbt + S3/GCS via HTTPFS (DuckDB), aws_s3 extension (Postgres)
- Configuration patterns for profiles.yml
- Credential management best practices
Common Patterns
Retry Pattern (All Orchestrators)
# Prefect: @task(retries=3, retry_delay_seconds=60)
# Dagster: @asset(retry_policy=RetryPolicy(...))
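# dbt: no per-model retry setting; --fail-fast only stops the run at the first failure. Re-run failed nodes with `dbt retry` (dbt 1.6+) or retry the dbt invocation from the orchestrator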
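As an orchestrator-agnostic illustration of what a retry setting does, the following standard-library sketch re-runs a failing function a fixed number of times with a delay between attempts. The `with_retries` decorator and `flaky_extract` task are hypothetical; the parameter names only mirror the Prefect comment above for readability and do not reproduce any orchestrator's implementation.

```python
import functools
import time

def with_retries(retries=3, retry_delay_seconds=0.0):
    """Re-run the wrapped function up to `retries` extra times on failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # retries exhausted: surface the last error
                    time.sleep(retry_delay_seconds)
        return wrapper
    return decorator

calls = {"count": 0}

@with_retries(retries=3, retry_delay_seconds=0.0)
def flaky_extract():
    """Hypothetical task that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

outcome = flaky_extract()
```

Retries only help with transient failures, and they re-execute the whole task body, which is why the writes inside a retried task need to be idempotent.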
Idempotency
All orchestrators assume idempotent operations - running twice should produce identical results. Design your INSERT, UPDATE, MERGE operations to be idempotent.
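One common way to get idempotent writes is to key each row on a natural identifier and upsert instead of appending. The sketch below uses SQLite from the standard library purely as a stand-in for a warehouse; the `orders` table, its columns, and the `load_batch` helper are assumptions for illustration.

```python
import sqlite3

def load_batch(conn, rows):
    """Write rows keyed by order_id; re-running replaces rather than duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
batch = [(1, 10.0), (2, 5.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

A plain `INSERT` in the same position would leave four rows after the retry; the primary key plus replace semantics is what keeps the second run harmless. In a warehouse, `MERGE` keyed on the same identifier plays the equivalent role.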
State Management
- Prefect: Flow run state persisted to database/cloud
- Dagster: Asset materialization events tracked
- dbt: Model run status is written to `target/run_results.json`; each model is a `SELECT` statement that dbt wraps in the DDL/DML for its materialization (view by default)
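Persisted run state is mainly useful when something downstream reads it. As a rough sketch, the snippet below scans a dbt run results artifact for nodes that did not succeed. The sample JSON is hand-written and heavily trimmed, and the `results`, `unique_id`, and `status` field names are assumptions that should be checked against the artifact schema for your dbt version; `failed_nodes` is a hypothetical helper.

```python
import json

# Hand-written sample; a real artifact contains many more fields.
SAMPLE = """
{
  "results": [
    {"unique_id": "model.shop.stg_orders", "status": "success"},
    {"unique_id": "model.shop.fct_orders", "status": "error"}
  ]
}
"""

def failed_nodes(artifact_text):
    """Return unique_ids whose status is anything other than success or pass."""
    ok_statuses = {"success", "pass"}
    results = json.loads(artifact_text).get("results", [])
    return [r["unique_id"] for r in results if r.get("status") not in ok_statuses]

failures = failed_nodes(SAMPLE)
```

Note that this deliberately treats skipped or warning statuses as worth reporting; tighten or loosen `ok_statuses` to match your alerting policy.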
Dependency Management
- Prefect: Dependencies are inferred from data passed between tasks; use `wait_for=[...]` to order tasks that share no data
- Dagster: Asset dependencies are declared as function arguments or with `@asset(deps=[other_asset])`
- dbt: DAG built from `ref()` calls in models
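Whatever the declaration syntax, each tool ends up with a mapping from node to upstream nodes and runs them in topological order. A minimal sketch of that resolution step, using `graphlib` from the Python standard library (3.9+) and hypothetical model names:

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key maps to the upstream nodes it depends on.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "daily_revenue": {"fct_orders"},
}

# static_order() yields every node after all of its dependencies
# and raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(dependencies).static_order())
```

The two staging models have no ordering constraint between them, which is exactly the slack an orchestrator uses to run independent nodes in parallel.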
Production Recommendations
- Version control everything: Code, configs, dbt models, Prefect/Dagster definitions
- Test locally first: Use unit tests for transformation logic, integration tests for pipeline runs
- Use environment variables for credentials (never hardcode)
- Monitor pipeline runs: Prefect Cloud UI, Dagster UI (formerly Dagit), dbt Cloud or custom alerts
- Alert on failures: Configure email/Slack/webhook notifications
- Log aggregation: Send orchestrator logs to centralized system (Datadog, CloudWatch)
- Idempotent writes: Avoid duplicate data on retries
- Schema evolution: Handle schema changes gracefully (additive only preferred)
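The "additive only" rule for schema evolution can be enforced with a small pre-write check. The sketch below assumes schemas are available as plain column-to-type dictionaries; `schema_change_is_additive` and the example schemas are hypothetical, and a real pipeline would read the schemas from its table format or DataFrame library.

```python
def schema_change_is_additive(old_schema, new_schema):
    """True when every existing column survives with the same type."""
    return all(new_schema.get(col) == dtype for col, dtype in old_schema.items())

old = {"order_id": "int64", "amount": "float64"}
added_column = {"order_id": "int64", "amount": "float64", "currency": "string"}
dropped_column = {"order_id": "int64"}
retyped_column = {"order_id": "int64", "amount": "string"}

checks = {
    "added": schema_change_is_additive(old, added_column),
    "dropped": schema_change_is_additive(old, dropped_column),
    "retyped": schema_change_is_additive(old, retyped_column),
}
```

Running a check like this as the first task of a pipeline turns a silent downstream breakage into an explicit, alertable failure.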