data-orchestration-skill
Data Orchestration Skill for Claude
Expert guidance for orchestrating data pipelines. Dagster-first for greenfield projects, Airflow for brownfield. Covers scheduling, dependencies, monitoring, retries, alerting, and dbt/DLT integration.
When to Use This Skill
Activate when:
- Choosing between orchestration tools (Dagster vs Airflow vs Prefect vs embedded)
- Building Dagster assets, resources, sensors, schedules, or partitions
- Writing Airflow DAGs, operators, TaskFlow tasks, or connections
- Integrating orchestrators with dbt or DLT
- Implementing retry logic, alerting, failure handling, or partitioned backfills
- Deciding whether you need an orchestrator at all
Don't use for: dbt model writing (use dbt-skill), DLT source/destination config (use integration-patterns-skill), Kafka/Flink streaming (use streaming-data-skill), IaC provisioning, or CI/CD pipelines.
Scope Constraints
- Generate orchestration code, DAG definitions, asset configs, and scheduling logic only.
- Credential management: reference environment variables and secrets managers. Never hardcode secrets. See Security & Compliance Patterns.
- Limit scope to orchestration concerns. Hand off transformation logic to dbt-skill and ingestion logic to integration-patterns-skill.
Model Routing
| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Opus, Haiku | Haiku |
Core Principles
- Assets over tasks — Define persistent data artifacts (tables, views, models), not computation steps. Asset lineage is visible, dependencies declarative, backfills targeted, freshness observable.
- Idempotent runs — Use MERGE/upsert, partition by date/key, track state externally. Every run must be safe to re-execute (see the sketch after this list).
- Declarative dependencies — Declare dependencies; the framework resolves execution order.
- Observable pipelines — Log metadata (row counts, schema changes, execution times). Monitor freshness SLAs. Trace lineage from source to dashboard. Alert with actionable context.
- Graceful failure — Retry with backoff for transient failures. Re-run only failed assets. Capture failed records via dead letter patterns. Group alerts to prevent fatigue.
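A minimal Dagster sketch combining these principles; fetch_orders and merge_into_warehouse are hypothetical helpers and the partition start date is arbitrary:

from dagster import Backoff, DailyPartitionsDefinition, MaterializeResult, RetryPolicy, asset

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(
    partitions_def=daily,  # declarative partitions enable targeted backfills
    retry_policy=RetryPolicy(max_retries=3, delay=30, backoff=Backoff.EXPONENTIAL),
)
def raw_orders(context) -> MaterializeResult:
    date = context.partition_key
    rows = fetch_orders(date=date)               # hypothetical extract helper
    merge_into_warehouse(rows, partition=date)   # MERGE/upsert keeps re-runs idempotent
    # Emit metadata so row counts and partitions are observable in the UI.
    return MaterializeResult(metadata={"row_count": len(rows), "partition": date})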
Orchestrator Decision Matrix
| Factor | Dagster | Airflow | Prefect | Embedded |
|---|---|---|---|---|
| Philosophy | Asset-oriented | Task-oriented | Flow-oriented | Tool-native |
| Best for | Greenfield platforms | Brownfield, large DAGs | Python-native, event-driven | Single-tool workflows |
| dbt integration | dagster-dbt (first-class) | cosmos (good) | CLI wrapper | Native (dbt Cloud) |
| DLT integration | dagster-dlt (first-class) | Task wrapper | Task wrapper | N/A |
| Asset lineage | Built-in, UI-native | Via plugins (limited) | Via artifacts | Tool-specific |
| Partitioning | First-class | Dynamic task mapping | Map/reduce | Tool-specific |
| Local dev | dagster dev (full UI) | Local executor (limited) | prefect server start | N/A |
| Managed offering | Dagster Cloud | MWAA, Composer, Astronomer | Prefect Cloud | Built into platform |
See Embedded Orchestration Reference for dbt Cloud, Databricks Workflows, Snowflake Tasks, and Prefect patterns.
The Trifecta: Dagster + DLT + dbt
Each tool handles one concern:
| Layer | Tool | Responsibility |
|---|---|---|
| Orchestration | Dagster | Scheduling, dependency resolution, monitoring, alerting |
| Ingestion | DLT | Extract from sources, load to warehouse (raw layer) |
| Transformation | dbt | SQL transformations (staging, intermediate, marts) |
┌─────────────────────────────────────────────────────┐
│ Dagster (Orchestrator) │
│ ┌───────────┐ ┌────────────┐ ┌──────────┐ │
│ │ DLT Assets│ ──→ │ dbt Assets │ ──→ │ Quality │ │
│ │ (ingest) │ │ (transform)│ │ (assert) │ │
│ └───────────┘ └────────────┘ └──────────┘ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Snowflake / BigQuery │ │
│ │ raw.* → staging.* → intermediate.* → marts.*│ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Key benefits: DLT assets auto-depend on source availability. dbt assets auto-depend on raw tables DLT creates. Dagster shows complete lineage from API to marts. Backfills target specific date ranges across both DLT and dbt. Failures in DLT prevent dbt from running on stale data.
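A minimal sketch of the wiring, assuming the dagster-dlt and dagster-dbt integrations above; api_source, the pipeline and dataset names, the Snowflake destination, and the manifest path are placeholders, and the dbt assets pick up the DLT dependency once dbt sources reference the raw tables:

import dlt
from dagster import Definitions
from dagster_dbt import DbtCliResource, dbt_assets
from dagster_dlt import DagsterDltResource, dlt_assets

# Ingestion: DLT source loaded into the raw schema.
@dlt_assets(
    dlt_source=api_source(),  # placeholder DLT source
    dlt_pipeline=dlt.pipeline(pipeline_name="api_raw", destination="snowflake", dataset_name="raw"),
)
def ingest_assets(context, dlt: DagsterDltResource):
    yield from dlt.run(context=context)

# Transformation: dbt models become assets downstream of the raw tables.
@dbt_assets(manifest="target/manifest.json")
def transform_assets(context, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[ingest_assets, transform_assets],
    resources={"dlt": DagsterDltResource(), "dbt": DbtCliResource(project_dir=".")},
)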
For full trifecta code example, see Dagster Integrations Reference.
Dagster Quickstart
pip install dagster dagster-webserver
dagster dev -f my_pipeline.py # Full UI at http://localhost:3000
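A minimal my_pipeline.py to point that command at (asset names are illustrative):

from dagster import Definitions, asset

@asset
def raw_numbers() -> list[int]:
    return [1, 2, 3]

@asset
def summed_numbers(raw_numbers: list[int]) -> int:
    # The dependency is declared by naming the upstream asset as a parameter.
    return sum(raw_numbers)

defs = Definitions(assets=[raw_numbers, summed_numbers])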
For asset examples (basic, resources, schedules, sensors), see Dagster Patterns Reference. For dbt and DLT integrations, see Dagster Integrations Reference.
Airflow Quickstart
Airflow stores credentials in Connections — never in DAG code. Configure via the Airflow UI, environment variables (AIRFLOW_CONN_*), or secrets backends (Vault, AWS Secrets Manager, GCP Secret Manager).
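For example, a warehouse connection can be supplied as an environment variable and consumed through a provider hook; the connection id, URI, and query below are placeholders:

# export AIRFLOW_CONN_WAREHOUSE_DB='postgres://user:pass@host:5432/analytics'
from airflow.providers.postgres.hooks.postgres import PostgresHook

def count_raw_orders():
    # The hook resolves credentials from the connection id; nothing lives in DAG code.
    hook = PostgresHook(postgres_conn_id="warehouse_db")
    return hook.get_records("SELECT count(*) FROM raw.orders")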
For TaskFlow API, classic operator pattern, dynamic task mapping, cosmos dbt integration, and managed Airflow (MWAA/Composer/Astronomer), see Airflow Patterns Reference.
Common Patterns
Retry Strategy
# Dagster
from dagster import Backoff, RetryPolicy, asset

@asset(retry_policy=RetryPolicy(max_retries=3, delay=30, backoff=Backoff.EXPONENTIAL))
def flaky_api_asset():
    return call_external_api()
# Airflow
from datetime import timedelta

default_args = {"retries": 3, "retry_delay": timedelta(minutes=5),
                "retry_exponential_backoff": True, "max_retry_delay": timedelta(minutes=30)}
# Pass default_args to the DAG so every task inherits the retry settings.
Alerting
# Dagster: freshness policies
from dagster import FreshnessPolicy, asset

@asset(freshness_policy=FreshnessPolicy(maximum_lag_minutes=120))
def fct_orders(stg_orders): ...
# Configure Slack/PagerDuty alerts in Dagster Cloud or via dagster-slack
# Airflow: failure callbacks
def on_failure_callback(context):
    send_slack_alert(
        channel="#data-alerts",
        message=f"Task {context['dag'].dag_id}.{context['task'].task_id} failed on {context['ds']}",
    )
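Register the callback on the DAG (or in default_args) so it fires for any failed task; the dag id, schedule, and the send_slack_alert helper are assumed for illustration:

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args={"on_failure_callback": on_failure_callback},  # applied per task
):
    ...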
Partitioned Backfills
# Dagster: daily partitions
from dagster import DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily_partitions)
def daily_orders(context):
    date = context.partition_key
    orders = fetch_orders(date=date)
    load_to_warehouse(orders, partition=date)
# Backfill via UI (select range → Materialize) or CLI:
# dagster asset materialize --partition "2024-01-01...2024-01-31"
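A rough Airflow counterpart keys each run on the logical date so re-runs stay idempotent, then backfills a range from the CLI; the task and helpers are illustrative:

from airflow.decorators import task

@task
def load_daily_orders(ds=None):
    # ds is the logical date (YYYY-MM-DD) injected by Airflow; re-runs overwrite the same partition.
    orders = fetch_orders(date=ds)
    load_to_warehouse(orders, partition=ds)

# Backfill a date range:
# airflow dags backfill orders_pipeline --start-date 2024-01-01 --end-date 2024-01-31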
Security Posture
This skill generates orchestration code including DAG definitions, asset configurations, and scheduling logic. See Security & Compliance Patterns for the full security framework.
- Credentials required: warehouse connections, API keys, secrets manager access, alerting webhooks
- Where to configure: Dagster EnvVar resources, Airflow Connections, environment variables
- Minimum role/permissions: orchestrator service account with scoped warehouse access
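For instance, a Dagster resource can resolve warehouse credentials from environment variables at runtime; the resource class and variable names below are placeholders:

from dagster import ConfigurableResource, Definitions, EnvVar

class WarehouseResource(ConfigurableResource):
    account: str
    user: str
    password: str

defs = Definitions(
    resources={
        "warehouse": WarehouseResource(
            account=EnvVar("SNOWFLAKE_ACCOUNT"),   # resolved at runtime, not at import
            user=EnvVar("SNOWFLAKE_USER"),
            password=EnvVar("SNOWFLAKE_PASSWORD"),
        )
    },
)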
| Capability | Tier 1 (Cloud-Native) | Tier 2 (Regulated) | Tier 3 (Air-Gapped) |
|---|---|---|---|
| Execute pipelines | Against dev/staging | Generate code for review | Generate code only |
| Configure schedules | Deploy to dev | Generate configs for review | Generate configs only |
| Manage connections | Configure dev connections | Generate templates | Document requirements |
| Backfill data | Execute against dev | Generate plans for review | Generate plans only |
Reference Files
- Dagster Patterns — Assets, resources, sensors, schedules, partitions, backfills, I/O managers, asset checks
- Dagster Integrations — dagster-dbt, dagster-dlt, trifecta example, dagster-k8s, dagster-cloud
- Airflow Patterns — TaskFlow API, operators, connections, dynamic task mapping, cosmos, MWAA/Composer
- Embedded Orchestration — When NOT to add an orchestrator: dbt Cloud, Databricks Workflows, Snowflake Tasks, Prefect
- Consulting Orchestration — Recurring engagement patterns, file-drop sensors, client-specific scheduling, per-client dashboards