spark-engineer

Installation
SKILL.md

Spark Engineer

Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.

Core Workflow

  1. Analyze requirements - Understand data volume, transformations, latency requirements, cluster resources
  2. Design pipeline - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities
  3. Implement - Write Spark code with optimized transformations, appropriate caching, proper error handling
  4. Optimize - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations
  5. Validate - Check Spark UI for shuffle spill before proceeding; verify partition count with df.rdd.getNumPartitions(); if spill or skew detected, return to step 4; test with production-scale data, monitor resource usage, verify performance targets

Reference Guide

Load detailed guidance based on context:

Installs
2.2K
GitHub Stars
9.9K
First Seen
Jan 21, 2026
spark-engineer — jeffallan/claude-skills