# Integration Patterns Skill
Expert guidance for designing, implementing, and troubleshooting enterprise data integrations.
## When to Use
Activate when:
- Designing integrations between SaaS platforms (Salesforce, NetSuite, Stripe, Workday, ServiceNow)
- Evaluating iPaaS (Workato, MuleSoft, Boomi) vs DLT vs custom code
- Implementing CDC with Debezium, Snowflake Streams, or BigQuery CDC
- Building event-driven architectures with Kafka, Pub/Sub, or EventBridge
- Designing webhook receivers, Reverse ETL, or API integrations with pagination/rate limiting
- Building data sync patterns, file-based integrations, or canonical data models
## Scope Constraints
This skill covers enterprise data integration patterns. It does NOT cover: basic SQL, BI tools, infrastructure provisioning, or database optimization.
## Model Routing
| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Sonnet, Opus | Sonnet |
## Core Principles

- **Loose Coupling:** Use queues, event buses, and async patterns. Implement circuit breakers to prevent cascading failures.
- **Idempotency:** Use idempotency keys, upsert patterns, and deduplication logic. Assume network failures and duplicate messages will occur (see the sketch after this list).
- **Contract-First Design:** Use OpenAPI for REST, Protobuf for gRPC, and Avro/JSON Schema for events. Version contracts; enforce schema validation at boundaries.
- **Error Isolation:** Implement dead letter queues, per-integration retry logic, and clear failure boundaries. Log errors with context for troubleshooting.
- **Data Freshness Awareness:** Match the pattern to the SLA: webhooks give sub-second latency, CDC is near-real-time, and batch may be hours old. Monitor actual vs expected freshness.
- **Canonical Data Models:** Map external schemas to a shared canonical model at integration boundaries. Maintain crosswalk tables for ID mapping. Version canonical models and handle schema evolution.
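To make the idempotency principle concrete, here is a minimal Python sketch of duplicate-safe message handling. The in-memory `processed_ids` set and the `upsert` stub are illustrative stand-ins, not part of this skill's reference files; production systems would back both with a durable store.

```python
import hashlib
import json

# In-memory stand-in for a durable dedup store (Redis, a warehouse
# dedup table, etc. in production).
processed_ids: set[str] = set()

def idempotency_key(message: dict) -> str:
    """Prefer the producer's stable message ID; fall back to a content hash."""
    if "id" in message:
        return str(message["id"])
    return hashlib.sha256(
        json.dumps(message, sort_keys=True).encode()
    ).hexdigest()

def upsert(message: dict) -> None:
    # Placeholder: write with ON CONFLICT / MERGE semantics so that
    # reprocessing the same record converges instead of duplicating.
    ...

def handle(message: dict) -> None:
    key = idempotency_key(message)
    if key in processed_ids:
        return  # duplicate delivery; safe to drop
    upsert(message)
    processed_ids.add(key)
```

Deriving the key from the producer's message ID keeps replays convergent even when the same event arrives via different paths.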
## Integration Pattern Decision Matrix
| Pattern | Latency | Volume | Complexity | Best For |
|---|---|---|---|---|
| Request/Reply (REST, gRPC) | Low (ms-sec) | Low-Med | Low | User-facing, CRUD |
| Pub/Sub Events | Low-Med (sec) | High | Medium | Event notifications, fan-out |
| CDC | Low (sec) | High | Med-High | DB replication, real-time analytics |
| Batch/File-based | High (hours) | Very High | Low | Bulk transfers, daily loads |
| Webhooks | Low (sec) | Medium | Medium | SaaS notifications, alerts |
| Reverse ETL | Med (min-hours) | Medium | Medium | Data activation, warehouse to CRM |
| DLT (Python-first) | Med (min) | High | Low-Med | Code-first ingestion, auto-schema |
## iPaaS vs DLT vs Custom Code
| Factor | iPaaS (Fivetran/Airbyte) | DLT | Custom Python |
|---|---|---|---|
| Setup time | Minutes (UI) | Hours (code) | Days (full build) |
| Connectors | 300+ managed | 50+ verified + custom | Unlimited |
| Schema mgmt | Auto-evolve only | Auto-evolve + contracts + Pydantic | Manual |
| Incremental | Built-in | Built-in with cursor state | Manual |
| Cost at scale | Per-MAR ($$+) | Free + compute ($) | Compute only ($) |
| Best for | Standard SaaS, non-technical teams | Custom APIs, nested data, Python teams | Unique protocols, ultra-low-latency |
**Rule of thumb:** Start with iPaaS for standard SaaS connectors. Use DLT for custom logic, complex schemas, or cost control. Fall back to custom Python for unique requirements.
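As a sketch of what the DLT column means in practice, here is a minimal incremental merge pipeline. The endpoint, field names, and `updated_since` parameter are hypothetical, and a real source would also need pagination handling.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's retry-aware requests wrapper

@dlt.resource(primary_key="id", write_disposition="merge")
def tickets(
    updated_at=dlt.sources.incremental(
        "updated_at", initial_value="2024-01-01T00:00:00Z"
    )
):
    # Cursor state is persisted between runs, so only new or changed
    # rows are pulled on each execution.
    response = requests.get(
        "https://api.example.com/tickets",  # hypothetical endpoint
        params={"updated_since": updated_at.last_value},
    )
    response.raise_for_status()
    yield response.json()["results"]

pipeline = dlt.pipeline(
    pipeline_name="tickets_sync",
    destination="duckdb",  # swap for snowflake, bigquery, etc.
    dataset_name="support",
)
print(pipeline.run(tickets))
```

The `write_disposition="merge"` plus `primary_key` combination gives idempotent loads; re-running the pipeline upserts rather than duplicates.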
## iPaaS Platform Comparison
| Platform | Best For | Pricing | Complexity |
|---|---|---|---|
| Workato | Business ops, pre-built connectors | Per-task | Low |
| MuleSoft | Enterprise API management, governance | License + runtime | High |
| Boomi | Multi-cloud, B2B/EDI, MDM | Per-connection | Med-High |
| Zapier | Simple automations, SMB | Per-task | Very Low |
| Tray.io | Advanced logic, developer-friendly | Per-task | Medium |
## Reverse ETL / Data Activation
| Tool | Sync Modes | Warehouse Support | Pricing |
|---|---|---|---|
| Hightouch | Upsert, mirror, append | Snowflake, BigQuery, Redshift, Databricks | Per-row, ~$500/mo |
| Census | Upsert, mirror, append, delete | Snowflake, BigQuery, Redshift, Databricks | Per-row, ~$1000/mo |
| Custom | Full control | Any | Dev + infra costs |
## Error Handling Strategy
Apply these layers for resilient integrations (a retry/DLQ sketch follows the list):

- **Retry with backoff** — use `tenacity` with exponential backoff for transient failures
- **Circuit breaker** — track the failure count; open the circuit after a threshold; auto-recover after a timeout
- **Dead letter queue** — persist failed messages with error context for manual review or replay
- **Idempotency check** — track processed message IDs to prevent duplicate processing
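A minimal sketch combining the retry and dead-letter layers, assuming the `tenacity` library and a hypothetical partner endpoint; the Python list stands in for a real dead letter queue (SQS, a Kafka DLQ topic, or a database table).

```python
import logging

import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

log = logging.getLogger("integration")

@retry(
    # For brevity this retries all request errors; real code would
    # distinguish transient 5xx/timeouts from permanent 4xx failures.
    retry=retry_if_exception_type(requests.exceptions.RequestException),
    wait=wait_exponential(multiplier=1, min=2, max=60),  # 2s, 4s, 8s ... capped at 60s
    stop=stop_after_attempt(5),
    reraise=True,
)
def call_partner_api(payload: dict) -> dict:
    resp = requests.post("https://api.example.com/sync", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

def process(message: dict, dead_letter: list[dict]) -> None:
    try:
        call_partner_api(message)
    except requests.exceptions.RequestException as exc:
        # Retries exhausted: persist the message with error context
        # for manual review or replay.
        dead_letter.append({"message": message, "error": str(exc)})
        log.error("sync failed after retries: %s", exc)
```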
## Security Posture
See Security & Compliance Patterns for the full framework.

**Credentials:** API keys, OAuth tokens, DB connections, and webhook secrets are supplied via environment variables; use a secrets manager in production.
| Capability | Tier 1 (Cloud-Native) | Tier 2 (Regulated) | Tier 3 (Air-Gapped) |
|---|---|---|---|
| API calls | Execute against dev | Generate for review | Generate only |
| CDC config | Deploy to dev | Generate for review | Generate only |
| Webhooks | Deploy and test | Generate with sig verification | Generate only |
| Reverse ETL | Execute against dev | Generate sync configs | Generate only |
**Credential best practices:** Use scoped/restricted keys. Prefer OAuth 2.0 over API keys. Store webhook secrets in env vars and always verify signatures (see the sketch below). Prefer key-pair/IAM auth over passwords. Rotate on a schedule.
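As an illustration of webhook signature verification, here is a minimal sketch of the common HMAC-SHA256-over-raw-body scheme. The header name and exact signing format vary by provider (Stripe and GitHub each define their own), and the `WEBHOOK_SECRET` env var name is an assumption.

```python
import hashlib
import hmac
import os

# Shared secret issued by the webhook sender, loaded from the
# environment per the credential guidance above.
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"].encode()

def verify_signature(raw_body: bytes, signature_header: str) -> bool:
    """Compare the sender's signature against one computed locally.

    Verify against the raw request bytes, not a re-serialized parse,
    since any re-serialization can change the byte stream.
    """
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    # compare_digest prevents timing attacks on the comparison itself.
    return hmac.compare_digest(expected, signature_header)
```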
## Reference Files
Load the appropriate reference for deep-dive guidance:
- DLT Pipelines — Sources, resources, incremental loading, schema contracts, REST API source, testing, orchestration
- Enterprise Connectors — Salesforce, NetSuite, Stripe, Workday, ServiceNow patterns
- Event-Driven Architecture — Kafka, Pub/Sub, EventBridge, schema registry, delivery guarantees
- iPaaS Platforms — Workato, MuleSoft, Boomi comparison, recipes, governance
- CDC Patterns — Debezium, Snowflake Streams, BigQuery CDC, backfill strategies
- Data Mapping & Crosswalks — Canonical models, crosswalk tables, schema drift detection
**Related skills:**
- For file-based DLT extraction and consulting portability, see dlt-extract
- For DLT vs managed connector comparison, see DLT vs Managed Connectors