Senior Data Engineer

Core Capabilities

  • Batch Pipeline Orchestration - Design and implement production-ready ETL/ELT pipelines with Airflow, intelligent dependency resolution, retry logic, and comprehensive monitoring
  • Real-Time Streaming - Build event-driven streaming pipelines on Kafka, Flink, Kinesis, and Spark Streaming with exactly-once semantics and sub-second latency
  • Data Quality Management - Validate batch and streaming data quality across completeness, accuracy, consistency, timeliness, and validity
  • Streaming Quality Monitoring - Track consumer lag, data freshness, schema drift, throughput, and dead letter queue rates for streaming pipelines
  • Performance Optimization - Analyze and optimize pipeline performance with query optimization, Spark tuning, and cost analysis recommendations

Key Workflows

Workflow 1: Build ETL Pipeline

Time: 2-4 hours

Steps:

  1. Design pipeline architecture using Lambda, Kappa, or Medallion pattern
  2. Configure YAML pipeline definition with sources, transformations, targets
  3. Generate Airflow DAG with pipeline_orchestrator.py
  4. Define data quality validation rules
  5. Deploy and configure monitoring/alerting

Expected Output: Production-ready ETL pipeline with 99%+ success rate, automated quality checks, and comprehensive monitoring

Workflow 2: Build Real-Time Streaming Pipeline

Time: 3-5 days

Steps:

  1. Select streaming architecture (Kappa vs Lambda) based on requirements
  2. Configure streaming pipeline YAML (sources, processing, sinks, quality)
  3. Generate Kafka configurations with kafka_config_generator.py
  4. Generate Flink/Spark job scaffolding with stream_processor.py
  5. Deploy and monitor with streaming_quality_validator.py

Expected Output: Streaming pipeline processing 10K+ events/sec with P99 latency < 1s, exactly-once delivery, and real-time quality monitoring

World-class data engineering for production-grade data systems, scalable pipelines, and enterprise data platforms.

Overview

This skill provides comprehensive data engineering expertise, from fundamentals through advanced production patterns. From designing medallion architectures to implementing real-time streaming pipelines, it covers the full spectrum of modern data engineering, including ETL/ELT design, data quality frameworks, pipeline orchestration, and DataOps practices.

What This Skill Provides:

  • Production-ready pipeline templates (Airflow, Spark, dbt)
  • Comprehensive data quality validation framework
  • Performance optimization and cost analysis tools
  • Data architecture patterns (Lambda, Kappa, Medallion)
  • Complete DataOps CI/CD workflows

Best For:

  • Building scalable data pipelines for enterprise systems
  • Implementing data quality and governance frameworks
  • Optimizing ETL performance and cloud costs
  • Designing modern data architectures (lake, warehouse, lakehouse)
  • Production ML/AI data infrastructure

Quick Start

Pipeline Orchestration

# Generate Airflow DAG from configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# Validate pipeline configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --validate

# Use incremental load template
python scripts/pipeline_orchestrator.py --template incremental --output dags/

Data Quality Validation

# Validate CSV file with quality checks
python scripts/data_quality_validator.py --input data/sales.csv --output report.html

# Validate database table with custom rules
python scripts/data_quality_validator.py \
    --connection postgresql://user:pass@host/db \
    --table sales_transactions \
    --rules rules/sales_validation.yaml \
    --threshold 0.95

Performance Optimization

# Analyze pipeline performance and get recommendations
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Analyze Spark job performance
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001

Real-Time Streaming

# Validate streaming pipeline configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka topic and client configurations
python scripts/kafka_config_generator.py \
    --topic user-events \
    --partitions 12 \
    --replication 3 \
    --output kafka/topics/

# Generate exactly-once producer configuration
python scripts/kafka_config_generator.py \
    --producer \
    --profile exactly-once \
    --output kafka/producer.properties

# Generate Flink job scaffolding
python scripts/stream_processor.py \
    --config streaming_config.yaml \
    --mode flink \
    --generate \
    --output flink-jobs/

# Monitor streaming quality
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000 \
    --freshness --topic processed-events --max-latency-ms 5000 \
    --output streaming-health-report.html

Core Workflows

1. Building Production Data Pipelines

Steps:

  1. Design Architecture: Choose pattern (Lambda, Kappa, Medallion) based on requirements
  2. Configure Pipeline: Create YAML configuration with sources, transformations, targets
  3. Generate DAG: python scripts/pipeline_orchestrator.py --config config.yaml
  4. Add Quality Checks: Define validation rules for data quality
  5. Deploy & Monitor: Deploy to Airflow, configure alerts, track metrics

Pipeline Patterns: See frameworks.md for Lambda Architecture, Kappa Architecture, Medallion Architecture (Bronze/Silver/Gold), and Microservices Data patterns.

Templates: See templates.md for complete Airflow DAG templates, Spark job templates, dbt models, and Docker configurations.
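To make step 2 concrete, a pipeline configuration might be shaped like the sketch below. The section and field names (pipeline, sources, transformations, targets, quality) are illustrative assumptions rather than the tool's actual schema; see tools.md for the real configuration format.

# Hypothetical pipeline configuration, loaded and sanity-checked with PyYAML.
# All section and field names are assumptions, not pipeline_orchestrator.py's schema.
import yaml  # pip install pyyaml

EXAMPLE_CONFIG = """
pipeline:
  name: sales_etl_pipeline
  schedule: "0 * * * *"          # hourly
  owner: data-engineering
sources:
  - name: sales_db
    type: postgresql
    table: public.sales_transactions
transformations:
  - name: clean_sales
    type: sql
    file: sql/clean_sales.sql
targets:
  - name: warehouse
    type: snowflake
    table: analytics.fct_sales
quality:
  min_score: 0.95
"""

def basic_checks(cfg):
    """Return a list of obvious structural problems in the config."""
    return [
        f"missing required section: {key}"
        for key in ("pipeline", "sources", "targets")
        if key not in cfg
    ]

if __name__ == "__main__":
    config = yaml.safe_load(EXAMPLE_CONFIG)
    print(basic_checks(config) or "config looks structurally sound")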

2. Data Quality Management

Steps:

  1. Define Rules: Create validation rules covering completeness, accuracy, consistency
  2. Run Validation: python scripts/data_quality_validator.py --rules rules.yaml
  3. Review Results: Analyze quality scores and failed checks
  4. Integrate CI/CD: Add validation to pipeline deployment process
  5. Monitor Trends: Track quality scores over time

Quality Framework: See frameworks.md for complete Data Quality Framework covering all dimensions (completeness, accuracy, consistency, timeliness, validity).

Validation Templates: See templates.md for validation configuration examples and Python API usage.
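At their core, the completeness, validity, and consistency checks from step 1 reduce to assertions like the pandas sketch below. Column names, thresholds, and the scoring approach are illustrative assumptions, not the validator's implementation; rule files are documented in tools.md.

# Minimal sketch of completeness/validity/consistency checks with pandas.
# Column names and the simple averaging score are illustrative assumptions.
import pandas as pd

def quality_report(df):
    checks = {
        # Completeness: key columns should not be null
        "order_id_complete": df["order_id"].notna().mean(),
        # Validity: amounts must be non-negative
        "amount_valid": (df["amount"] >= 0).mean(),
        # Consistency: no duplicate primary keys
        "order_id_unique": 1.0 - df["order_id"].duplicated().mean(),
    }
    checks["overall_score"] = sum(checks.values()) / len(checks)
    return checks

if __name__ == "__main__":
    df = pd.DataFrame({
        "order_id": [1, 2, 2, None],
        "amount": [10.0, -5.0, 20.0, 30.0],
    })
    for name, score in quality_report(df).items():
        print(f"{name}: {score:.2%}")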

3. Data Modeling & Transformation

Steps:

  1. Choose Modeling Approach: Dimensional (Kimball), Data Vault 2.0, or One Big Table
  2. Design Schema: Define fact tables, dimensions, and relationships
  3. Implement with dbt: Create staging, intermediate, and mart models
  4. Handle SCD: Implement slowly changing dimension logic (Type 1/2/3)
  5. Test & Deploy: Run dbt tests, generate documentation, deploy

Modeling Patterns: See frameworks.md for Dimensional Modeling (Kimball), Data Vault 2.0, One Big Table (OBT), and SCD implementations.

dbt Templates: See templates.md for complete dbt model templates including staging, intermediate, fact tables, and SCD Type 2 logic.
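For step 4, the essence of SCD Type 2 is closing out the current row and inserting a new versioned row whenever a tracked attribute changes. The pandas sketch below is purely conceptual (customer_id, address, valid_from, valid_to, and is_current are assumed column names); the production pattern is the dbt SCD Type 2 template in templates.md.

# Conceptual SCD Type 2 update with pandas: expire changed rows, append new versions.
# Column names are assumptions for illustration only.
import pandas as pd

HIGH_DATE = pd.Timestamp("9999-12-31")

def apply_scd2(dim, updates, load_date):
    dim = dim.copy()
    merged = updates.merge(
        dim[dim["is_current"]], on="customer_id", suffixes=("_new", "_old")
    )
    changed_ids = merged.loc[merged["address_new"] != merged["address_old"], "customer_id"]

    # Expire the current version of rows whose tracked attribute changed
    expire_mask = dim["customer_id"].isin(changed_ids) & dim["is_current"]
    dim.loc[expire_mask, "valid_to"] = load_date
    dim.loc[expire_mask, "is_current"] = False

    # Insert new versions for the changed rows
    new_rows = updates[updates["customer_id"].isin(changed_ids)].assign(
        valid_from=load_date, valid_to=HIGH_DATE, is_current=True
    )
    return pd.concat([dim, new_rows], ignore_index=True)

if __name__ == "__main__":
    dim = pd.DataFrame({
        "customer_id": [1, 2],
        "address": ["old st", "elm st"],
        "valid_from": pd.Timestamp("2024-01-01"),
        "valid_to": HIGH_DATE,
        "is_current": True,
    })
    updates = pd.DataFrame({"customer_id": [1], "address": ["new st"]})
    print(apply_scd2(dim, updates, pd.Timestamp("2025-01-15")))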

4. Performance Optimization

Steps:

  1. Profile Pipeline: Run performance analyzer on recent pipeline executions
  2. Identify Bottlenecks: Review execution time breakdown and slow tasks
  3. Apply Optimizations: Implement recommendations (partitioning, indexing, batching)
  4. Tune Spark Jobs: Optimize memory, parallelism, and shuffle settings
  5. Measure Impact: Compare before/after metrics, track cost savings

Optimization Strategies: See frameworks.md for performance best practices including partitioning strategies, query optimization, and Spark tuning.

Analysis Tools: See tools.md for complete documentation on etl_performance_optimizer.py with query analysis and Spark tuning.
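Two of the most common recommendations from step 3 — right-sizing shuffle parallelism and writing partitioned, columnar output — look roughly like the PySpark sketch below. The paths, column names, and the value of 200 shuffle partitions are illustrative assumptions; the optimizer's actual recommendations are described in tools.md.

# Sketch of typical Spark tuning: shuffle parallelism, adaptive execution,
# and partitioned Parquet output. Paths, columns, and values are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sales_etl_optimized")
    # Match shuffle parallelism to cluster size and data volume instead of the default
    .config("spark.sql.shuffle.partitions", "200")
    # Let Spark coalesce small shuffle partitions and mitigate skew at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3://example-bucket/bronze/sales/")

# Prune rows and columns early so less data is shuffled
daily = (
    df.select("order_id", "customer_id", "amount", "event_date")
      .filter("event_date >= '2025-01-01'")
)

# Partitioned, columnar output enables predicate pushdown for downstream readers
(
    daily.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/silver/sales/")
)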

5. Building Real-Time Streaming Pipelines

Steps:

  1. Architecture Selection: Choose Kappa (streaming-only) or Lambda (batch + streaming) architecture
  2. Configure Pipeline: Create YAML config with sources, processing engine, sinks, quality thresholds
  3. Generate Kafka Configs: python scripts/kafka_config_generator.py --topic events --partitions 12
  4. Generate Job Scaffolding: python scripts/stream_processor.py --mode flink --generate
  5. Deploy Infrastructure: Use Docker Compose for local dev, Kubernetes for production
  6. Monitor Quality: python scripts/streaming_quality_validator.py --lag --freshness --throughput

Streaming Patterns: See frameworks.md for stateful processing, stream joins, windowing, exactly-once semantics, and CDC patterns.

Templates: See templates.md for Flink DataStream jobs, Kafka Streams applications, PyFlink templates, and Docker Compose configurations.
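Before generating full Flink scaffolding in step 4, it helps to see the basic shape of a keyed streaming job. The PyFlink sketch below uses an in-memory source so it runs without a Kafka cluster; it is a conceptual sketch, not the generator's output, and real jobs would wire in Kafka sources/sinks and exactly-once sinks as shown in templates.md.

# Minimal PyFlink DataStream sketch: keyed aggregation with checkpointing enabled.
# In-memory source for illustration only; production jobs use Kafka connectors.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
env.enable_checkpointing(60_000)  # checkpoint every 60s; a prerequisite for exactly-once sinks

events = env.from_collection([
    ("user_1", 1),
    ("user_2", 1),
    ("user_1", 1),
])

# Count events per user key -- stateful, keyed processing
counts = events.key_by(lambda e: e[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))

counts.print()
env.execute("keyed_count_example")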

Python Tools

pipeline_orchestrator.py

Automated Airflow DAG generation with intelligent dependency resolution and monitoring.

Key Features:

  • Generate production-ready DAGs from YAML configuration
  • Automatic task dependency resolution
  • Built-in retry logic and error handling
  • Multi-source support (PostgreSQL, S3, BigQuery, Snowflake)
  • Integrated quality checks and alerting

Usage:

# Basic DAG generation
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# With validation
python scripts/pipeline_orchestrator.py --config config.yaml --validate

# From template
python scripts/pipeline_orchestrator.py --template incremental --output dags/

Complete Documentation: See tools.md for full configuration options, templates, and integration examples.
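As a rough picture of what the generated output provides, a hand-written DAG with the same ingredients — retries, failure alerting, and an extract → transform → validate → load ordering — might look like the sketch below. Task names and callables are illustrative assumptions, not the generator's actual output.

# Hand-written sketch of the kind of DAG the orchestrator generates.
# Task names, callables, and schedule are illustrative assumptions.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-engineering",
    "retries": 3,                          # built-in retry logic
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # basic alerting hook
}

def extract(**context): ...
def transform(**context): ...
def validate_quality(**context): ...
def load(**context): ...

with DAG(
    dag_id="sales_etl_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_validate = PythonOperator(task_id="validate_quality", python_callable=validate_quality)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_validate >> t_load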

data_quality_validator.py

Comprehensive data quality validation framework with automated checks and reporting.

Capabilities:

  • Multi-dimensional validation (completeness, accuracy, consistency, timeliness, validity)
  • Great Expectations integration
  • Custom business rule validation
  • HTML/PDF report generation
  • Anomaly detection
  • Historical trend tracking

Usage:

# Validate with custom rules
python scripts/data_quality_validator.py \
    --input data/sales.csv \
    --rules rules/sales_validation.yaml \
    --output report.html

# Database table validation
python scripts/data_quality_validator.py \
    --connection postgresql://host/db \
    --table sales_transactions \
    --threshold 0.95

Complete Documentation: See tools.md for rule configuration, API usage, and integration patterns.

etl_performance_optimizer.py

Pipeline performance analysis with actionable optimization recommendations.

Capabilities:

  • Airflow DAG execution profiling
  • Bottleneck detection and analysis
  • SQL query optimization suggestions
  • Spark job tuning recommendations
  • Cost analysis and optimization
  • Historical performance trending

Usage:

# Analyze Airflow DAG
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Spark job analysis
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001

Complete Documentation: See tools.md for profiling options, optimization strategies, and cost analysis.

stream_processor.py

Streaming pipeline configuration generator and validator for Kafka, Flink, and Kinesis.

Capabilities:

  • Multi-platform support (Kafka, Flink, Kinesis, Spark Streaming)
  • Configuration validation with best practice checks
  • Flink/Spark job scaffolding generation
  • Kafka topic configuration generation
  • Docker Compose for local streaming stacks
  • Exactly-once semantics configuration

Usage:

# Validate configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka configurations
python scripts/stream_processor.py --config streaming_config.yaml --mode kafka --generate

# Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/

# Generate Docker Compose for local development
python scripts/stream_processor.py --config streaming_config.yaml --mode docker --generate

Complete Documentation: See tools.md for configuration format, validation checks, and generated outputs.

streaming_quality_validator.py

Real-time streaming data quality monitoring with comprehensive health scoring.

Capabilities:

  • Consumer lag monitoring with thresholds
  • Data freshness validation (P50/P95/P99 latency)
  • Schema drift detection
  • Throughput analysis (events/sec, bytes/sec)
  • Dead letter queue rate monitoring
  • Overall quality scoring with recommendations
  • Prometheus metrics export

Usage:

# Monitor consumer lag
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000

# Monitor data freshness
python scripts/streaming_quality_validator.py \
    --freshness --topic processed-events --max-latency-ms 5000

# Full quality validation
python scripts/streaming_quality_validator.py \
    --lag --freshness --throughput --dlq \
    --output streaming-health-report.html

Complete Documentation: See tools.md for all monitoring dimensions and integration patterns.
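Under the hood, a consumer-lag check compares each partition's log end offset to the group's committed offset. The standalone sketch below shows that calculation with the kafka-python client; the broker address, topic, group id, and threshold are placeholders, and this is not the validator's implementation.

# Sketch of consumer-lag calculation with kafka-python:
# lag = log end offset - committed offset, summed across partitions.
# Broker, topic, group id, and threshold are illustrative assumptions.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "localhost:9092"
TOPIC = "processed-events"
GROUP = "events-processor"
LAG_THRESHOLD = 10_000

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP, group_id=GROUP, enable_auto_commit=False
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC) or []]
end_offsets = consumer.end_offsets(partitions)

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag={total_lag} -> {'OK' if total_lag <= LAG_THRESHOLD else 'ALERT'}")
consumer.close()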

kafka_config_generator.py

Production-grade Kafka configuration generator with performance and security profiles.

Capabilities:

  • Topic configuration (partitions, replication, retention, compaction)
  • Producer profiles (high-throughput, exactly-once, low-latency, ordered)
  • Consumer profiles (exactly-once, high-throughput, batch)
  • Kafka Streams configuration with state store tuning
  • Security configuration (SASL-PLAIN, SASL-SCRAM, mTLS)
  • Kafka Connect source/sink configurations
  • Multiple output formats (properties, YAML, JSON)

Usage:

# Generate topic configuration
python scripts/kafka_config_generator.py \
    --topic user-events --partitions 12 --replication 3 --retention-hours 168

# Generate exactly-once producer
python scripts/kafka_config_generator.py \
    --producer --profile exactly-once --transactional-id producer-001

# Generate Kafka Streams config
python scripts/kafka_config_generator.py \
    --streams --application-id events-processor --exactly-once

Complete Documentation: See tools.md for all profiles, security options, and Connect configurations.
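To show where the exactly-once producer profile ends up, the sketch below applies the usual settings (idempotence, acks=all, a transactional id) with the confluent-kafka client. The broker address, topic, and transactional id are placeholder assumptions; the generator's actual property files may differ.

# Sketch of exactly-once producer settings applied with confluent-kafka.
# Broker, topic, and transactional.id are illustrative assumptions.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,          # no duplicates on retry within a session
    "acks": "all",                       # wait for all in-sync replicas
    "transactional.id": "producer-001",  # enables transactional, exactly-once writes
}

producer = Producer(conf)
producer.init_transactions()

producer.begin_transaction()
producer.produce("user-events", key="user_1", value='{"event": "click"}')
producer.commit_transaction()  # records become visible atomically to read_committed consumers

With a transactional producer, downstream consumers should run with isolation.level=read_committed so aborted transactions are never observed.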

Reference Documentation

Frameworks (frameworks.md)

Comprehensive data engineering frameworks and patterns:

  • Architecture Patterns: Lambda, Kappa, Medallion, Microservices data architecture
  • Data Modeling: Dimensional (Kimball), Data Vault 2.0, One Big Table
  • ETL/ELT Patterns: Full load, incremental load, CDC, SCD, idempotent pipelines
  • Data Quality: Complete framework covering all quality dimensions
  • DataOps: CI/CD for data pipelines, testing strategies, monitoring
  • Orchestration: Airflow DAG patterns, backfill strategies
  • Real-Time Streaming: Stateful processing, stream joins, windowing strategies, exactly-once semantics, event time processing, watermarks, backpressure, Apache Flink patterns, AWS Kinesis patterns, CDC for streaming
  • Governance: Data catalog, lineage tracking, access control

Templates (templates.md)

Production-ready code templates and examples:

  • Airflow DAGs: Complete ETL DAG, incremental load, dynamic task generation
  • Spark Jobs: Batch processing, streaming, optimized configurations
  • dbt Models: Staging, intermediate, fact tables, dimensions with SCD Type 2
  • SQL Patterns: Incremental merge (upsert), deduplication, date spine, window functions
  • Python Pipelines: Data quality validation class, retry decorators, error handling
  • Real-Time Streaming: Apache Flink DataStream jobs (Java), Kafka Streams applications, PyFlink jobs, AWS Kinesis consumers, Docker Compose for streaming stack
  • Kafka Configs: Producer/consumer properties templates, topic configurations, security configurations
  • Docker: Dockerfiles for data pipelines, Docker Compose for local development including streaming stack (Kafka, Flink, Schema Registry)
  • Configuration: dbt project config, Spark configuration, Airflow variables, streaming pipeline YAML
  • Testing: pytest fixtures, integration tests, data quality tests

Tools (tools.md)

Python automation tool documentation:

  • pipeline_orchestrator.py: Complete usage guide, configuration format, DAG templates
  • data_quality_validator.py: Validation rules, dimension checks, Great Expectations integration
  • etl_performance_optimizer.py: Performance analysis, query optimization, Spark tuning
  • stream_processor.py: Streaming pipeline configuration, validation, job scaffolding generation
  • streaming_quality_validator.py: Consumer lag, data freshness, schema drift, throughput monitoring
  • kafka_config_generator.py: Topic, producer, consumer, Kafka Streams, and Connect configurations
  • Integration Patterns: Airflow, dbt, CI/CD, monitoring systems, Prometheus
  • Best Practices: Configuration management, error handling, performance, monitoring, streaming quality

Tech Stack

Core Technologies:

  • Languages: Python 3.8+, SQL, Scala (Spark), Java (Flink)
  • Orchestration: Apache Airflow, Prefect, Dagster
  • Batch Processing: Apache Spark, dbt, Pandas
  • Stream Processing: Apache Kafka, Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis
  • Storage: PostgreSQL, BigQuery, Snowflake, Redshift, S3, GCS
  • Schema Management: Confluent Schema Registry, AWS Glue Schema Registry
  • Containerization: Docker, Kubernetes
  • Monitoring: Datadog, Prometheus, Grafana, Kafka UI

Data Platforms:

  • Cloud Data Warehouses: Snowflake, BigQuery, Redshift
  • Data Lakes: Delta Lake, Apache Iceberg, Apache Hudi
  • Streaming Platforms: Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
  • Stream Processing Engines: Apache Flink, Kafka Streams, Spark Structured Streaming
  • Workflow: Airflow, Prefect, Dagster

Integration Points

This skill integrates with:

  • Orchestration: Airflow, Prefect, Dagster for workflow management
  • Transformation: dbt for SQL transformations and testing
  • Quality: Great Expectations for data validation
  • Monitoring: Datadog, Prometheus for pipeline monitoring
  • BI Tools: Looker, Tableau, Power BI for analytics
  • ML Platforms: MLflow, Kubeflow for ML pipeline integration
  • Version Control: Git for pipeline code and configuration

See tools.md for detailed integration patterns and examples.

Best Practices

Pipeline Design:

  1. Idempotent operations for safe reruns
  2. Incremental processing where possible
  3. Clear data lineage and documentation
  4. Comprehensive error handling
  5. Automated recovery mechanisms
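A concrete way to get the idempotency in point 1 is to have each run replace exactly one partition inside a single transaction, so rerunning the same window never duplicates rows. A minimal sketch with SQLAlchemy, assuming a date-partitioned fact table (connection string, table, and columns are placeholders):

# Idempotent daily load sketch: delete-then-insert one date partition in one transaction.
# Connection string and table/column names are illustrative assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/warehouse")

def load_partition(rows, load_date):
    with engine.begin() as conn:  # commits on success, rolls back on error
        conn.execute(
            text("DELETE FROM fct_sales WHERE event_date = :d"), {"d": load_date}
        )
        conn.execute(
            text(
                "INSERT INTO fct_sales (order_id, amount, event_date) "
                "VALUES (:order_id, :amount, :event_date)"
            ),
            rows,  # list of dicts -> executemany
        )

load_partition(
    rows=[{"order_id": 1, "amount": 10.0, "event_date": "2025-01-15"}],
    load_date="2025-01-15",
)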

Data Quality:

  1. Define quality rules early
  2. Validate at every pipeline stage
  3. Automate quality monitoring
  4. Track quality trends over time
  5. Block bad data from downstream

Performance:

  1. Partition large tables by date/region
  2. Use columnar formats (Parquet, ORC)
  3. Leverage predicate pushdown
  4. Optimize for your query patterns
  5. Monitor and tune regularly
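Points 1-3 often come down to one storage decision: partitioned, columnar files that readers can filter before scanning. A small pandas/pyarrow sketch (paths and columns are placeholder assumptions):

# Write date-partitioned Parquet, then read back with a pushed-down filter so
# only the matching partition is scanned. Paths and columns are assumptions.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 20.0, 30.0],
    "event_date": ["2025-01-14", "2025-01-14", "2025-01-15"],
})

# Columnar, partitioned layout: one directory per event_date value
df.to_parquet("warehouse/fct_sales", partition_cols=["event_date"], engine="pyarrow")

# Only the 2025-01-15 partition is read thanks to the filter
recent = pd.read_parquet(
    "warehouse/fct_sales",
    filters=[("event_date", "=", "2025-01-15")],
    engine="pyarrow",
)
print(recent)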

Operations:

  1. Version control everything
  2. Automate testing and deployment
  3. Implement comprehensive monitoring
  4. Document runbooks for incidents
  5. Regular performance reviews

Performance Targets

Batch Pipeline Execution:

  • P50 latency: < 5 minutes (hourly pipelines)
  • P95 latency: < 15 minutes
  • Success rate: > 99%
  • Data freshness: < 1 hour behind source

Streaming Pipeline Execution:

  • Throughput: 10K+ events/second sustained
  • End-to-end latency: P99 < 1 second
  • Consumer lag: < 10K records behind
  • Exactly-once delivery: Zero duplicates or losses

Data Quality (Batch):

  • Quality score: > 95%
  • Completeness: > 99%
  • Timeliness: < 2 hours data lag
  • Zero critical failures

Streaming Quality:

  • Data freshness: P95 < 5 minutes from event generation
  • Late data rate: < 5% outside watermark window
  • Dead letter queue rate: < 1%
  • Schema compatibility: 100% backward/forward compatible changes

Cost Efficiency:

  • Cost per GB processed: < $0.10
  • Cloud cost trend: Stable or decreasing
  • Resource utilization: > 70%

Version: 2.0.0
Last Updated: December 16, 2025
Documentation Structure: Progressive disclosure with comprehensive references
Streaming Enhancement: Task #8 - Real-time streaming capabilities added
