Migrating Spark codebases from Scala 2.12 to 2.13
2024-11-15

With Apache Spark 4.0 coming soon and preview releases having been available for a few months, you might want to anticipate the migration to Scala 2.13 and avoid dealing with both the Scala and Spark upgrades at the same time. While the Spark API hasn't changed much since the late 2.x versions, moving from Scala 2.12 to 2.13 can be painful, given that the new collections framework introduced many breaking changes in the standard library. If you previously migrated from 2.11 to 2.12, for instance when moving a codebase from Spark 2.4 to 3.0, that was probably fairly smooth. Be warned: it won't necessarily be the case this time.
Groundwork
Spark 3 has been cross-built against Scala 2.12 and 2.13 since version 3.2, but you should probably bump to the latest 3.5 patch version before starting. Realistically, if you cannot move your codebase to Spark 3.5, it's unlikely you'll be ready for the move to 4.0 later on.
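As a rough sketch, the corresponding sbt settings could look like this (the patch versions below are indicative, check for the latest releases):

```scala
// build.sbt: cross-build while migrating, drop 2.12 once you're done.
// Versions are indicative; prefer the latest patch releases.
ThisBuild / crossScalaVersions := Seq("2.12.20", "2.13.15")
ThisBuild / scalaVersion       := "2.13.15"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.3" % Provided
```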
Let the compiler help
Flags
Over the years, many sanity checks have been added to scalac itself. Keeping up with the matrix of flags across compiler versions is not easy, so use the sbt plugin typelevel/sbt-tpolecat (a setup sketch follows the list below). It enables a sizeable set of flags; a few particularly useful ones:
- `-Xlint:infer-any`: warns when a type argument is inferred to be `Any`
- `-Xfatal-warnings`: turns warnings into compilation errors
- `-Wdead-code`: warns on dead code
- `-Wvalue-discard`: warns when a non-`Unit` expression result is discarded
- `-Wnonunit-statement`: warns when a statement evaluates to a non-`Unit` value
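For reference, a minimal installation sketch of the plugin (the version is indicative):

```scala
// project/plugins.sbt: check for the latest release of sbt-tpolecat
addSbtPlugin("org.typelevel" % "sbt-tpolecat" % "0.5.2")
```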
False positives
In a real-life codebase, it’s almost impossible not to run into false positives when you enable scalac’s most pedantic settings.
While you can manually silence a warning using the `@nowarn` annotation, a few build settings might help.
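For completeness, a minimal sketch of targeted `@nowarn` usage (the method names are made up for illustration):

```scala
import scala.annotation.nowarn

@deprecated("use newApi instead", "1.0")
def oldApi(): Int = 42

// Silences only deprecation warnings inside this method;
// other warning categories are still reported.
@nowarn("cat=deprecation")
def legacyEntryPoint(): Int = oldApi()
```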
Generated code and the Scala collection compatibility library might trigger unused warnings, so we can exclude them. Also, let's make sure unused checks run after macro expansion rather than before:
scalacOptions ++= Seq(
  "-Ywarn-macros:after",
  s"-Wconf:src=${target.value}/.*:silent",
  "-Wconf:origin=scala.collection.compat.*:silent"
)
If you use ScalaTest, its assertions are seen as discarded values by the compiler, so we can skip those checks in the `Test` scope:
Test / tpolecatExcludeOptions ++= Set(
  ScalacOptions.warnNonUnitStatement,
  ScalacOptions.warnDeadCode
)
Finally, if your own Scaladoc refers to external classes, it’s seen as detached by the compiler:
Compile / doc / tpolecatExcludeOptions ++=
  ScalacOptions.fatalWarningOptions + ScalacOptions.lintDocDetached
Compiler plugins
Additionally, two linters exist as compiler plugins.
Wartremover is well maintained and supports Scala 3. Its rules are a bit limited in my experience and trigger false positives with higher-kinded types.
Scapegoat is more comprehensive but in a state of minimal maintenance, and it has typically been slower to keep up with patch versions of the compiler. There’s limited Scala 3 support.
Personal opinion: they are not worth the hassle anymore.
Scalafix
Taking a less powerful but saner approach than compiler plugins, Scalafix uses Scalameta and SemanticDB to inspect and rewrite your code.
It was initially developed for rewrite rules: the new Scala 2.13 collections, the Scala 3 migration, third-party libraries going through breaking changes, and so on.
But today, it’s mostly used for its linting rules. It comes with a bunch of them built in:
| Rule | Description |
|---|---|
| ExplicitResultTypes | Inserts type annotations for inferred public members. Only compatible with Scala 2.x. |
| NoAutoTupling | Inserts explicit tuples for adapted argument lists, for compatibility with `-Yno-adapted-args`. |
| OrganizeImports | Organizes import statements. |
| RemoveUnused | Removes unused imports and terms reported by the compiler under `-Wunused`. |
| DisableSyntax | Reports an error for disabled features such as `var` or XML literals. |
| LeakingImplicitClassVal | Adds `private` to the `val` parameter of implicit value classes. |
| NoValInForComprehension | Removes deprecated `val` inside for-comprehension binders. |
| ProcedureSyntax | Replaces deprecated procedure syntax with an explicit `Unit` return type. |
| RedundantSyntax | Removes redundant syntax such as `final` modifiers on an object. |
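The semantic rules (e.g. RemoveUnused, ExplicitResultTypes) need SemanticDB output; a minimal sketch, assuming the sbt-scalafix plugin is already installed:

```scala
// build.sbt: emit SemanticDB so Scalafix's semantic rules can run
ThisBuild / semanticdbEnabled := true
ThisBuild / semanticdbVersion := scalafixSemanticdb.revision
```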
Scalafmt
If you use Scalafmt (and you should), keep in mind that its behavior might not always be idempotent. It’s a good idea to run it after Scalafix, not before.
In the future¹ it will also help you to migrate the codebase to Scala 3's syntax.
Specific Spark migration pains
Now that we have covered general settings, let’s get into the main topic.
First, make sure all your dependencies are available for Scala 2.13. In the Spark ecosystem there are still a few offenders, for instance Microsoft SynapseML and AWS Labs Deequ.
If the library is small, you might want to consider upgrading or cross-compiling it yourself, and possibly submitting a PR upstream. Otherwise, well, you're stuck. But with Spark 4.0 coming, convincing upstream maintainers shouldn't be too hard.
Auto application
To prepare for Scala 3, auto-application results in a warning in recent versions of Scala 2.13.
def foo: T
def bar(): Unit
Both need to be called consistently with how they are declared, i.e., `foo()` and `bar` are invalid.
Exception: nullary Java methods can still be called without application:
foo.toString.length
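To make the call-site consequences concrete, a small sketch (the `Config` class is made up for illustration):

```scala
class Config {
  def name: String = "spark-app" // nullary: call it as `name`
  def reload(): Unit = ()        // empty parameter list: call it as `reload()`
}

val config = new Config
config.name      // OK; `config.name()` does not compile
config.reload()  // OK; `config.reload` triggers an auto-application warning
```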
But you're in luck: this is covered by the recent actionable diagnostics initiative and can be fixed by the compiler itself!
scalacOptions += "-quickfix:any"
And let the magic happen on recompile.
Deprecated symbol literals
If you are using the symbol literal syntax for column selection in SparkSQL expressions, it will be rewritten to `Symbol(...)` by `-quickfix`.
This is a bit ugly, and you might want to run a project-wide search and replace to use `functions.col(...)` or the shorthand `$"..."` syntax instead.
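A minimal before/after sketch (the SparkSession setup, DataFrame, and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("symbol-literals").getOrCreate()
import spark.implicits._

val df = Seq(("apples", 3), ("pears", 5)).toDF("name", "quantity")

// Before: symbol literals, deprecated since Scala 2.13
// df.select('name, 'quantity)

// After: the functions API, or the $ interpolator from spark.implicits._
df.select(col("name"), col("quantity")).show()
df.select($"name", $"quantity").show()
```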
Pattern matching exhaustivity check
Scala 2.13's exhaustivity checker is pickier. Take, for example, sliding over a collection:
val check = foo.sliding(2).forall {
  case List(t1, t2) => t1 <= t2
}
This won't compile without a warning. You can add the match error explicitly if you know your code can't fail:
val check = foo.sliding(2).forall {
  case List(t1, t2) => t1 <= t2
  case other => throw new MatchError(other)
}
Minor compiler regressions
In a few edge cases², code that destructured nested tuples fine with Scala 2.12 fails to compile. For instance, you might need to rewrite:
data.reduceByKey {
  case ((left1, left2), (right1, right2)) => ...
}
Into:
data.reduceByKey { (left, right) =>
  val ((left1, left2), (right1, right2)) = (left, right)
  ...
}
Explicit conversions from and to views
Views are no longer implicitly materialized via `CanBuildFrom`:
data.groupBy(_.id).mapValues(_.distinct)
Has to become explicit:
data.groupBy(_.id).view.mapValues(_.distinct).toMap
This can be rewritten somewhat automatically by the Scalafix rules in scala-collection-compat.
Moreover, `collection.breakOut` is gone; be careful about the materialization order if you are chaining several transformations on lazy views between sets, maps, and sequences.
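A minimal sketch of one possible breakOut replacement (the collection and names are made up for illustration):

```scala
val words = List("spark", "scala", "upgrade")

// Scala 2.12 style, no longer compiles in 2.13:
// val byLength: Map[Int, String] = words.map(w => w.length -> w)(collection.breakOut)

// Scala 2.13: stay lazy with a view, then build the target collection explicitly.
val byLength: Map[Int, String] = words.view.map(w => w.length -> w).to(Map)
```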
scala.Seq is now immutable.Seq
For instance, you cannot pass a `mutable.Buffer[T]` to a method taking a `Seq[T]`. Moreover, there's no more implicit conversion from `Array[T]` to `Seq[T]`.
In many cases, an explicit call to `.toSeq` solves the problem. In performance-sensitive code, you might want to use `ArraySeq.unsafeWrapArray(...)` in combination with a `mutable.ArrayBuilder`.
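A minimal sketch of that pattern (the function and its contents are made up for illustration):

```scala
import scala.collection.immutable.ArraySeq
import scala.collection.mutable

// Builds an immutable Seq[Int] without copying the backing array at the end.
def squares(n: Int): Seq[Int] = {
  val builder = mutable.ArrayBuilder.make[Int]
  builder.sizeHint(n)
  var i = 0
  while (i < n) {
    builder += i * i
    i += 1
  }
  // Safe because the underlying array never escapes after being wrapped.
  ArraySeq.unsafeWrapArray(builder.result())
}
```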
No more CanBuildFrom
If you have defined custom collections, you’re in for a rewrite. See:
- The official documentation example: Implementing custom collections - RNA sequences.
- Stefan Zeiger's new collections demo and his talk at Scala Days 2019.
In my case, the standard library `LongMap` implementation was very useful as inspiration.
Spark on Scala 3... who am I kidding?
See issue SPARK-30195.