Migrating Spark codebases from Scala 2.12 to 2.13

With Apache Spark 4.0 coming soon and preview releases available for a few months, you might want to anticipate the migration to Scala 2.13 and avoid dealing with both the Scala and Spark upgrades at the same time. While the Spark API hasn’t changed that much since the late 2.x versions, moving from Scala 2.12 to 2.13 can be painful, given that the redesigned collections introduced many breaking changes in the standard library. If you previously migrated from 2.11 to 2.12, for instance moving a codebase from Spark 2.4 to 3.0, that was probably fairly smooth. Be warned: it won’t necessarily be the case this time.

Groundwork

Spark 3 has been cross-built against Scala 2.12 and 2.13 since version 3.2, but you should probably bump to the latest 3.5 patch release before starting. Realistically, if you cannot move your codebase to Spark 3.5, it’s unlikely you’ll be ready for the move to 4.0 later on.
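
A minimal build.sbt sketch for this groundwork (the version numbers are examples; pick the latest patches available when you migrate):

// Cross-build while the migration is in progress
ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.14")
ThisBuild / scalaVersion := "2.13.14"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % Provided,
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % Provided
)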

Let the compiler help

Flags

Over the years, many sanity checks have been added to scalac itself. Keeping up with the matrix of flags across compiler versions is not easy, so use the sbt plugin typelevel/sbt-tpolecat. It enables a pretty sizeable set of flags; let’s highlight a few particularly useful ones, illustrated right after the list:

-Xlint:infer-any
-Xfatal-warnings
-Wdead-code
-Wvalue-discard
-Wnonunit-statement
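
To give an idea of what they catch, here is a small contrived snippet (not from any real codebase) that these settings reject once warnings are fatal:

object Pedantic {
  def parse(s: String): Option[Int] =
    try Some(s.trim.toInt)
    catch { case _: NumberFormatException => None }
  def count(values: List[String]): Int = {
    values.map(parse) // -Wnonunit-statement: non-Unit result silently dropped in statement position
    List(1, "two")    // -Xlint:infer-any: the element type is inferred to be Any
    values.size
  }
  def fail(): Int = {
    throw new IllegalStateException("boom")
    42                // -Wdead-code: unreachable
  }
}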

False positives

In a real-life codebase, it’s almost impossible not to run into false positives when you enable scalac’s most pedantic settings.

While you can manually silence a warning using the @nowarn annotation, a few build settings might help.

Generated code and the Scala collection compatibility library might trigger unused warnings; we can exclude them. Also, let’s make sure warnings are only checked after macro expansion:

scalacOptions ++= Seq(
  "-Ywarn-macros:after",
  s"-Wconf:src=${target.value}/.*:silent",
  "-Wconf:origin=scala.collection.compat.*:silent"
)

If you use ScalaTest, its assertions return non-Unit values that the compiler flags as discarded; we can skip those checks in the Test scope:

Test / tpolecatExcludeOptions ++= Set(
  ScalacOptions.warnNonUnitStatement,
  ScalacOptions.warnDeadCode
)

Finally, if your own Scaladoc comments refer to external classes, the compiler may flag them as detached:

Compile / doc / tpolecatExcludeOptions ++=
  ScalacOptions.fatalWarningOptions +
  ScalacOptions.lintDocDetached

Compiler plugins

Additionally, two linters exist as compiler plugins.

Wartremover is well maintained and supports Scala 3. Its rules are a bit limited in my experience and trigger false positives with higher-kinded types.

Scapegoat is more comprehensive but in a state of minimal maintenance, and it has typically been slower to keep up with patch versions of the compiler. There’s limited Scala 3 support.

Personal opinion: they are not worth the hassle anymore.

Scalafix

Taking a less powerful but saner approach than compiler plugins, Scalafix uses Scalameta and SemanticDB to inspect and rewrite your code.

It was initially developed for rewrite rules: the new Scala 2.13 collections, the Scala 3 migration, third-party libraries going through breaking changes...

But today, it’s mostly used for its linting rules. It comes with a bunch of them built in:

ExplicitResultTypes: Inserts type annotations for inferred public members. Only compatible with Scala 2.x.
NoAutoTupling: Inserts explicit tuples for adapted argument lists, for compatibility with -Yno-adapted-args.
OrganizeImports: Organizes import statements.
RemoveUnused: Removes unused imports and terms reported by the compiler under -Wunused.
DisableSyntax: Reports an error for disabled features such as var or XML literals.
LeakingImplicitClassVal: Adds private to val parameters of implicit value classes.
NoValInForComprehension: Removes deprecated val inside for-comprehension binders.
ProcedureSyntax: Replaces deprecated procedure syntax with an explicit Unit return type.
RedundantSyntax: Removes redundant syntax such as final modifiers on an object.
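
Semantic rules (the ones that need type information) require SemanticDB to be enabled. Assuming the sbt-scalafix plugin is already in project/plugins.sbt, a minimal setup looks something like:

// build.sbt
ThisBuild / semanticdbEnabled := true
ThisBuild / semanticdbVersion := scalafixSemanticdb.revision
// then, from the sbt shell, e.g.: scalafix RemoveUnused
// (RemoveUnused relies on the -Wunused warnings enabled earlier)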

Scalafmt

If you use Scalafmt (and you should), keep in mind that its behavior might not always be idempotent. It’s a good idea to run it after Scalafix, not before.

In the future [1] it will also help you to migrate the codebase to Scala 3’s syntax.

Specific Spark migration pains

Now that we have covered general settings, let’s get into the main topic.

First, make sure all your dependencies are available for Scala 2.13. In the Spark ecosystem there are still a few offenders, for instance Microsoft SynapseML and AWS Labs Deequ.

If the library is small, you might want to consider upgrading or cross-compiling it yourself, and possibly submitting a PR upstream. Otherwise, well, you’re stuck. But with Spark 4.0 coming, convincing upstream maintainers shouldn’t be too hard.

Auto application

To prepare for Scala 3, recent versions of Scala 2.13 warn on auto-application, i.e., calling a method declared with an empty parameter list without parentheses:

def foo: T
def bar(): Unit

Both need to be called consistently, i.e., foo() or bar are invalid.

Exception: nullary Java-defined methods can still be called without parentheses:

foo.toString.length

But you’re in luck: this is covered by the recent actionable diagnostics initiative and can be fixed by the compiler itself!

scalacOptions += "-quickfix:any"

And let the magic happen on recompile.

Deprecated symbol literals

If you are using symbol literals for column selection in Spark SQL expressions, -quickfix will rewrite them to Symbol(...).

This is a bit ugly, and you might want to run a project-wide search and replace to use functions.col(...) or the shorthand $"..." syntax instead.
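
For a hypothetical DataFrame df with a name column (and spark being your SparkSession), the variants look like this:

import org.apache.spark.sql.functions.col
import spark.implicits._       // brings in the $"..." interpolator

df.select('name)               // deprecated symbol literal
df.select(Symbol("name"))      // what -quickfix produces
df.select(col("name"))         // explicit function call
df.select($"name")             // or the interpolator short-hand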

Pattern matching exhaustivity check

Scala 2.13’s exhaustivity checking is pickier. An example with collection sliding:

val check = foo.sliding(2).forall {
  case List(t1, t2) => t1 <= t2
}

This no longer compiles without a warning, which -Xfatal-warnings turns into an error. You can add the MatchError explicitly if you know your code can’t fail:

val check = foo.sliding(2).forall {
  case List(t1, t2) => t1 <= t2
  case other => throw new MatchError(other)
}

Minor compiler regressions

In a few edge cases [2], code that destructured nested tuples fine with Scala 2.12 fails to compile. For instance, you might need to rewrite:

data.reduceByKey {
  case ((left1, left2), (right1, right2)) => ...
}

Into:

data.reduceByKey { (left, right) =>
  val ((left1, left2), (right1, right2)) = (left, right)
  ...
}

Explicit conversions from and to views

There is no more implicit materialization of views through CanBuildFrom:

data.groupBy(_.id).mapValues(_.distinct)

Has to become explicit:

data.groupBy(_.id).view.mapValues(_.distinct).toMap

This can be rewritten somewhat automatically by the Scalafix rules in scala-collection-compat.

Moreover, collection.breakOut is gone; be careful about materialization order if you are chaining several transformations on lazy views between sets, maps, and sequences.
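
For instance, a 2.12-era breakOut that built a Map directly while mapping can be replaced by a view and a single, explicit materialization (items, id and name below are hypothetical):

// Scala 2.12:
// val index: Map[Int, String] = items.map(i => i.id -> i.name)(collection.breakOut)
// Scala 2.13: stay lazy while transforming, materialize once at the end
val index: Map[Int, String] = items.view.map(i => i.id -> i.name).toMap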

scala.Seq is now immutable.Seq

For instance, you cannot pass a mutable.Buffer[T] to a method taking a Seq[T]. Moreover, there’s no more implicit conversion from Array[T] to Seq[T].

In many cases, an explicit call to .toSeq solves your problem. In performance-sensitive code, you might want to use ArraySeq.unsafeWrapArray(...) in combination with a mutable.ArrayBuilder.
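
A sketch of the performance-minded variant, with a hypothetical buildIds:

import scala.collection.immutable.ArraySeq
import scala.collection.mutable
def buildIds(n: Int): Seq[Long] = {
  val builder = mutable.ArrayBuilder.make[Long]
  builder.sizeHint(n)
  var i = 0
  while (i < n) { builder += i.toLong; i += 1 }
  // wraps the backing array without copying; safe here because it never escapes as an Array
  ArraySeq.unsafeWrapArray(builder.result())
}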

No more CanBuildFrom

If you have defined custom collections, you’re in for a rewrite.

In my case, the standard library LongMap implementation was very useful as inspiration.
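
Custom collection types aside, generic helpers that used to take an implicit CanBuildFrom also need attention; the 2.13 replacement is scala.collection.BuildFrom. A minimal sketch with a hypothetical pairWithIndex:

import scala.collection.BuildFrom
// Preserves the collection type of the input: List in, List out; Vector in, Vector out
def pairWithIndex[A, CC[X] <: Iterable[X]](xs: CC[A])(
    implicit bf: BuildFrom[CC[A], (A, Int), CC[(A, Int)]]
): CC[(A, Int)] =
  bf.fromSpecific(xs)(xs.iterator.zipWithIndex)

pairWithIndex(Vector("a", "b")) // Vector((a,0), (b,1))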


[1] Spark on Scala 3... who am I kidding?

[2] See issue SPARK-30195.