
spark-version-upgrade

Upgrade Apache Spark applications between major versions with a structured, phase-by-phase workflow.

When to Use

  • Migrating from Spark 2.x → 3.x or Spark 3.x → 4.x
  • Updating PySpark, Spark SQL, or Structured Streaming applications
  • Resolving deprecation warnings before a Spark version bump

Workflow Overview

  1. Inventory & Impact Analysis — Scan the codebase and assess scope
  2. Build File Updates — Bump Spark/Scala/Java dependencies
  3. API Migration — Replace deprecated and removed APIs
  4. Configuration Migration — Update Spark config properties
  5. SQL & DataFrame Migration — Fix query-level breaking changes
  6. Test Validation — Compile, run tests, verify results

Phase 1: Inventory & Impact Analysis

Before changing any code, assess what needs to change. Read the official Apache Spark migration guide for the target version — it documents every API removal, config rename, and behavioral change per release: https://spark.apache.org/docs/latest/migration-guide.html

Checklist

  • Read the migration guide section for the target Spark version
  • Identify current Spark version (check pom.xml, build.sbt, build.gradle, or requirements.txt)
  • Identify target Spark version
  • Locate Spark API usage sites (candidates for deprecated or removed APIs): grep -rn 'import org.apache.spark' --include='*.scala' --include='*.java' --include='*.py'
  • List all Spark config properties: grep -rn 'spark\.' --include='*.conf' --include='*.properties' --include='*.scala' --include='*.java' --include='*.py' | grep -v 'test'
  • Check for custom SparkSession or SparkContext extensions
  • Identify connector dependencies (Hive, Kafka, Cassandra, Delta, Iceberg)
  • Document findings in spark_upgrade_impact.md

Output

spark_upgrade_impact.md   # Summary of affected files, APIs, and configs
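
An illustrative skeleton for the report (file contents only; the section names, paths, and versions below are hypothetical examples, not a required format):

# Spark Upgrade Impact: 3.3.x to 3.5.1
## Build files
- pom.xml: spark.version, scala.version, connector versions to bump
## Deprecated / removed APIs
- src/main/scala/jobs/IngestJob.scala: SQLContext usage (3 call sites)
## Config properties
- conf/spark-defaults.conf: 2 renamed keys
## Connectors
- spark-sql-kafka-0-10, delta: confirm versions compatible with the target Spark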

Phase 2: Build File Updates

Update dependency versions, then attempt a build; compilation errors at this stage map directly to the API work in Phase 3.

Maven (pom.xml)

<!-- Update Spark version property -->
<spark.version>3.5.1</spark.version>    <!-- or 4.0.0 -->
<scala.version>2.13.12</scala.version>  <!-- Spark 3.x: 2.12/2.13; Spark 4.x: 2.13 -->

<!-- Update artifact IDs if Scala cross-version changed -->
<artifactId>spark-core_2.13</artifactId>
<artifactId>spark-sql_2.13</artifactId>

SBT (build.sbt)

val sparkVersion = "3.5.1" // or "4.0.0"
scalaVersion := "2.13.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion

Gradle (build.gradle)

ext {
    sparkVersion = '3.5.1' // or '4.0.0'
}
dependencies {
    implementation "org.apache.spark:spark-core_2.13:${sparkVersion}"
    implementation "org.apache.spark:spark-sql_2.13:${sparkVersion}"
}

PySpark (requirements.txt / pyproject.toml)

pyspark==3.5.1   # or 4.0.0

Checklist

  • Update Spark version in build file
  • Update Scala version if crossing 2.12→2.13 boundary
  • Update Java source/target level if required (Spark 4.x requires Java 17+)
  • Update connector library versions to match new Spark version
  • Resolve dependency conflicts (mvn dependency:tree / sbt dependencyTree)
  • Attempt a compile (errors at this stage are expected — they guide Phase 3)

Phase 3: API Migration

Replace removed and deprecated APIs. Work through compiler errors systematically.

Common Patterns

Consult the official Apache Spark migration guide for the complete list of changes for each version: https://spark.apache.org/docs/latest/migration-guide.html

SparkSession Creation (2.x → 3.x)

// BEFORE (Spark 1.x/2.x)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// AFTER (Spark 2.x+/3.x)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport() // only if Hive metastore access is needed
  .getOrCreate()
val sc = spark.sparkContext

RDD to DataFrame (2.x → 3.x)

// BEFORE
rdd.toDF()  // implicit from SQLContext

// AFTER
import spark.implicits._
rdd.toDF()  // implicit from SparkSession

Accumulator API (2.x → 3.x)

// BEFORE
val acc = sc.accumulator(0)

// AFTER
val acc = sc.longAccumulator("name")

Checklist

  • Replace SQLContext / HiveContext with SparkSession
  • Replace deprecated Accumulator with AccumulatorV2
  • Update DataFrame → Dataset[Row] where needed
  • Replace removed RDD.mapPartitionsWithContext with mapPartitions
  • Fix SparkConf deprecated setters
  • Update custom UserDefinedFunction registration (see the sketch after this list)
  • Migrate Experimental / DeveloperApi usages that were removed
  • Verify all compilation errors from Phase 2 are resolved
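
For the UDF registration item above, a minimal sketch, assuming an active SparkSession named spark and an existing DataFrame df; the names normalize_name, toUpper, and the customers table are illustrative:

import org.apache.spark.sql.functions.{col, udf}

// Null-safe example UDF
val toUpper = udf((s: String) => if (s == null) null else s.trim.toUpperCase)

// Register for SQL use (replaces older sqlContext.udf.register calls)
spark.udf.register("normalize_name", toUpper)

// Callable from the DataFrame API and from SQL
df.select(toUpper(col("name")))
spark.sql("SELECT normalize_name(name) FROM customers")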

Phase 4: Configuration Migration

Spark renames and removes configuration properties between versions. The official migration guide documents every renamed and removed property per release: https://spark.apache.org/docs/latest/migration-guide.html

Checklist

  • Rename deprecated config keys (e.g., spark.shuffle.file.buffer.kb → spark.shuffle.file.buffer; see the sketch after this list)
  • Update removed configs to their replacements
  • Review spark-defaults.conf, application code, and submit scripts
  • Check for hardcoded config values in test fixtures
  • Verify SparkSession.builder().config(...) calls use current property names
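
A minimal sketch of the rename called out above, assuming config is set in application code when the session is built (the buffer value is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("upgraded-job")
  // old key (removed): .config("spark.shuffle.file.buffer.kb", "64")
  .config("spark.shuffle.file.buffer", "64k") // renamed key takes a size string with units
  .getOrCreate()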

Phase 5: SQL & DataFrame Migration

Spark SQL behavior changes between versions can silently alter query results.

Key Breaking Changes (2.x → 3.x)

  • CAST to integer no longer truncates silently — set spark.sql.ansi.enabled if needed
  • FROM clause is required in SELECT (no more SELECT 1)
  • Column resolution order changed in subqueries
  • spark.sql.legacy.timeParserPolicy controls date/time parsing behavior
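
A minimal sketch for the parser policy above, assuming an active SparkSession named spark; the accepted values are EXCEPTION (the default, which fails queries where the old and new parsers would disagree), CORRECTED (the new java.time parser), and LEGACY (Spark 2.x behavior):

// Pin parsing behavior while date/time format patterns are audited
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED") // or "LEGACY" during a phased rollout
spark.sql("SELECT to_date('2024-07-15', 'yyyy-MM-dd')").show()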

Key Breaking Changes (3.x → 4.x)

  • ANSI mode is default (spark.sql.ansi.enabled=true)
  • Stricter type coercion in comparisons
  • Many spark.sql.legacy.* compatibility flags from earlier releases are removed

Checklist

  • Audit SQL strings and DataFrame expressions for changed behavior
  • Add explicit CAST where implicit coercion relied on legacy behavior
  • Update date/time format patterns to match new parser
  • Test SQL queries with representative data and compare output to pre-upgrade baseline
  • Set spark.sql.legacy.* flags temporarily if needed for phased migration
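
A sketch of a phased rollout for the last item, assuming an active SparkSession named spark; confirm each property still exists in the target version before relying on it:

// Temporarily relax a new default while queries are audited, then remove the override
spark.conf.set("spark.sql.ansi.enabled", "false") // ANSI mode is the Spark 4.x default; opt out only during migration
// Record any spark.sql.legacy.* override adopted here in spark_upgrade_impact.md with a removal date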

Phase 6: Test Validation

Checklist

  • All code compiles without errors
  • All existing unit tests pass
  • All existing integration tests pass
  • Run Spark jobs locally with sample data and compare output to the pre-upgrade baseline (see the sketch after this list)
  • No deprecation warnings remain (or are documented with a migration timeline)
  • Update CI/CD pipeline to use new Spark version
  • Document any spark.sql.legacy.* flags that are set temporarily
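
A minimal local sketch for the baseline comparison item above; the Parquet format and the two paths are assumptions about how the job writes its output:

import org.apache.spark.sql.SparkSession

object BaselineCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("baseline-check").getOrCreate()

    // Hypothetical paths: a snapshot written by the pre-upgrade job and the fresh post-upgrade output
    val baseline = spark.read.parquet("data/baseline_output")
    val current  = spark.read.parquet("data/current_output")

    // exceptAll keeps duplicates, so both differences being empty means the outputs match row-for-row
    val missing = baseline.exceptAll(current).count()
    val extra   = current.exceptAll(baseline).count()
    assert(missing == 0 && extra == 0, s"Output drift after upgrade: $missing missing rows, $extra extra rows")

    spark.stop()
  }
}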

Done When

✓ Project compiles against target Spark version
✓ All tests pass
✓ No removed APIs remain in code
✓ Configuration properties are current
✓ SQL queries produce correct results
✓ Upgrade impact documented in spark_upgrade_impact.md
