spark-engineer
Installation
SKILL.md
Spark Engineer
Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.
Core Workflow
- Analyze requirements - Understand data volume, transformations, latency requirements, cluster resources
- Design pipeline - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities
- Implement - Write Spark code with optimized transformations, appropriate caching, proper error handling
- Optimize - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations
- Validate - Check Spark UI for shuffle spill before proceeding; verify partition count with
df.rdd.getNumPartitions(); if spill or skew detected, return to step 4; test with production-scale data, monitor resource usage, verify performance targets
Reference Guide
Load detailed guidance based on context: