Why is zipped faster than zip in Scala?

Question

None of the other answers mention the primary reason for the difference in speed, which is that the zipped version avoids 10,000 tuple allocations. As a couple of the other answers do note, the zip version involves an intermediate array, while the zipped version doesn’t, but allocating an array for 10,000 elements isn’t what makes the zip version so much worse—it’s the 10,000 short-lived tuples that are being put into that array. These are represented by objects on the JVM, so you’re doing a bunch of object allocations for things that you’re immediately going to throw away.

The rest of this answer just goes into a little more detail about how you can confirm this.

Better benchmarking

You really want to be using a framework like jmh to do any kind of benchmarking responsibly on the JVM, and even then the responsibly part is hard, although setting up jmh itself isn’t too bad. If you have a project/plugins.sbt like this:

addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")

And a build.sbt like this (I’m using 2.11.8 since you mention that’s what you’re using):

scalaVersion := "2.11.8"

enablePlugins(JmhPlugin)

Then you can write your benchmark like this:

package zipped_bench

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
  val arr1 = Array.fill(10000)(math.random)
  val arr2 = Array.fill(10000)(math.random)

  def ES(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    arr.zip(arr1).map(x => x._1 + x._2)

  def ES1(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    (arr, arr1).zipped.map((x, y) => x + y)

  @Benchmark def withZip: Array[Double] = ES(arr1, arr2)
  @Benchmark def withZipped: Array[Double] = ES1(arr1, arr2)
}

And run it with sbt "jmh:run -i 10 -wi 10 -f 2 -t 1 zipped_bench.ZippedBench":

Benchmark                Mode  Cnt     Score    Error  Units
ZippedBench.withZip     thrpt   20  4902.519 ± 41.733  ops/s
ZippedBench.withZipped  thrpt   20  8736.251 ± 36.730  ops/s

Which shows that the zipped version gets about 80% more throughput, which is probably more or less the same as your measurements.

Measuring allocations

You can also ask jmh to measure allocations with -prof gc:

Benchmark                                                 Mode  Cnt        Score       Error   Units
ZippedBench.withZip                                      thrpt    5     4894.197 ±   119.519   ops/s
ZippedBench.withZip:·gc.alloc.rate                       thrpt    5     4801.158 ±   117.157  MB/sec
ZippedBench.withZip:·gc.alloc.rate.norm                  thrpt    5  1080120.009 ±     0.001    B/op
ZippedBench.withZip:·gc.churn.PS_Eden_Space              thrpt    5     4808.028 ±    87.804  MB/sec
ZippedBench.withZip:·gc.churn.PS_Eden_Space.norm         thrpt    5  1081677.156 ± 12639.416    B/op
ZippedBench.withZip:·gc.churn.PS_Survivor_Space          thrpt    5        2.129 ±     0.794  MB/sec
ZippedBench.withZip:·gc.churn.PS_Survivor_Space.norm     thrpt    5      479.009 ±   179.575    B/op
ZippedBench.withZip:·gc.count                            thrpt    5      714.000              counts
ZippedBench.withZip:·gc.time                             thrpt    5      476.000                  ms
ZippedBench.withZipped                                   thrpt    5    11248.964 ±    43.728   ops/s
ZippedBench.withZipped:·gc.alloc.rate                    thrpt    5     3270.856 ±    12.729  MB/sec
ZippedBench.withZipped:·gc.alloc.rate.norm               thrpt    5   320152.004 ±     0.001    B/op
ZippedBench.withZipped:·gc.churn.PS_Eden_Space           thrpt    5     3277.158 ±    32.327  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Eden_Space.norm      thrpt    5   320769.044 ±  3216.092    B/op
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space       thrpt    5        0.360 ±     0.166  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space.norm  thrpt    5       35.245 ±    16.365    B/op
ZippedBench.withZipped:·gc.count                         thrpt    5      863.000              counts
ZippedBench.withZipped:·gc.time                          thrpt    5      447.000                  ms

…where gc.alloc.rate.norm is probably the most interesting part, showing that the zip version is allocating over three times as much as zipped.

Imperative implementations

If I knew that this method was going to be called in extremely performance-sensitive contexts, I’d probably implement it like this:

  def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
    val minSize = math.min(arr.length, arr1.length)
    val newArr = new Array[Double](minSize)
    var i = 0
    while (i < minSize) {
      newArr(i) = arr(i) + arr1(i)
      i += 1
    }
    newArr
  }

Note that unlike the optimized version in one of the other answers, this uses while instead of a for since the for will still desugar into Scala collections operations. We can compare this implementation (withWhile), the other answer’s optimized (but not in-place) implementation (withFor), and the two original implementations:

Benchmark                Mode  Cnt       Score      Error  Units
ZippedBench.withFor     thrpt   20  118426.044 ± 2173.310  ops/s
ZippedBench.withWhile   thrpt   20  119834.409 ±  527.589  ops/s
ZippedBench.withZip     thrpt   20    4886.624 ±   75.567  ops/s
ZippedBench.withZipped  thrpt   20    9961.668 ± 1104.937  ops/s

That’s a really huge difference between the imperative and functional versions, and all of these method signatures are exactly identical and the implementations have the same semantics. It’s not like the imperative implementations are using global state, etc. While the zip and zipped versions are more readable, I personally don’t think there’s any sense in which the imperative versions are against the “spirit of Scala”, and I wouldn’t hesitate to use them myself.

With tabulate

Update: I added a tabulate implementation to the benchmark based on a comment in another answer:

def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  Array.tabulate(minSize)(i => arr(i) + arr1(i))
}

It’s much faster than the zip versions, although still much slower than the imperative ones:

Benchmark                  Mode  Cnt      Score     Error  Units
ZippedBench.withTabulate  thrpt   20  32326.051 ± 535.677  ops/s
ZippedBench.withZip       thrpt   20   4902.027 ±  47.931  ops/s

This is what I’d expect, since there’s nothing inherently expensive about calling a function, and because accessing array elements by index is very cheap.

Better benchmarking

Measuring allocations

Imperative implementations

With tabulate

Leave a Comment Cancel reply