Spark final task takes 100x longer than the first 199: how to improve

Spark >= 3.0: Since 3.0, Spark provides built-in optimizations for handling skewed joins, which can be enabled using the spark.sql.adaptive.skewJoin.enabled property (adaptive query execution, spark.sql.adaptive.enabled, must be on as well). See SPARK-29544 for details. Spark < 3.0: You clearly have a problem with a huge right data skew. Let’s take a look at the statistics you’ve provided: df1 = [mean=4.989209978967438, stddev=2255.654165352454, count=2400088] df2 = … Read more
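
As a rough illustration of the Spark >= 3.0 route, the sketch below turns on adaptive query execution together with skew-join handling when the session is built. It assumes Spark 3.x on the classpath, where the released property name is spark.sql.adaptive.skewJoin.enabled. The application name and the local master are placeholders for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object SkewJoinConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-join-sketch")   // hypothetical name, for illustration only
      .master("local[*]")            // assumption: local test run
      // Skew-join handling is part of adaptive query execution,
      // so both switches are set here.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    // Read the setting back to confirm it was applied to this session.
    println(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))

    spark.stop()
  }
}
```

The same two properties can also be passed with --conf on spark-submit instead of being set in code.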

Integrating native system libraries with SBT

From the research I’ve done in the past, there are only two ways to get native libraries loaded: modifying java.library.path and using System.loadLibrary (I feel like most people do this), or using System.load with an absolute path. As you’ve alluded to, messing with java.library.path can be annoying in terms of configuring SBT and Eclipse, and … Read more
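
A minimal sketch of the second option, System.load with an absolute path, might look like the following. The object name NativeLoader and its two helpers are hypothetical, introduced only for this illustration. It assumes the native library sits in a known directory and uses System.mapLibraryName to get the platform-specific file name.

```scala
import java.io.File

object NativeLoader {
  // Resolve the platform-specific file name (e.g. libfoo.so, foo.dll)
  // inside `dir` and return it as an absolute path.
  def libraryFile(dir: File, name: String): File =
    new File(dir, System.mapLibraryName(name)).getAbsoluteFile

  // Load the library via its absolute path, bypassing java.library.path.
  // Returns false instead of throwing when the file is not present.
  def load(dir: File, name: String): Boolean = {
    val lib = libraryFile(dir, name)
    if (lib.isFile) {
      System.load(lib.getPath)
      true
    } else {
      false
    }
  }
}
```

Because the path is resolved at run time, nothing in the SBT or Eclipse launch configuration has to change; the trade-off is that the code must know where the library lives.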

Spark : Read file only if the path exists

You can filter out the irrelevant files as in @Psidom’s answer. In Spark, the best way to do so is to use the internal Spark Hadoop configuration. Given that the Spark session variable is called “spark”, you can do: import org.apache.hadoop.fs.FileSystem import org.apache.hadoop.fs.Path val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration) def testDirExist(path: String): Boolean = { val p … Read more

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

What you’re looking for is Seq[o.a.s.sql.Row]: import org.apache.spark.sql.Row val my_size = udf { subjects: Seq[Row] => subjects.size } Explanation: Current representation of ArrayType is, as you already know, WrappedArray so Array won’t work and it is better to stay on the safe side. According to the official specification, the local (external) type for StructType is … Read more

Difference between Scala’s existential types and Java’s wildcard by example?

This is Martin Odersky’s answer on the Scala-users mailing list: The original Java wildcard types (as described in the ECOOP paper by Igarashi and Viroli) were indeed just shorthands for existential types. I am told and I have read in the FOOL ’05 paper on Wild FJ that the final version of wildcards has some … Read more
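
To make the correspondence concrete, here is a small sketch of how a Java wildcard type shows up on the Scala side. The object WildcardDemo is hypothetical. In Scala 2, the wildcard form JList[_] is shorthand for the existential type JList[T] forSome { type T }; the explicit forSome syntax was dropped in Scala 3, so only the wildcard form is used here.

```scala
import java.util.{ArrayList, List => JList}

object WildcardDemo {
  // Java's List<?> is seen from Scala as JList[_]:
  // a list of some unknown element type.
  def size(xs: JList[_]): Int = xs.size()

  // Any concrete parameterization can be passed where the wildcard is expected.
  def sample(): Int = {
    val strings = new ArrayList[String]()
    strings.add("a")
    strings.add("b")
    size(strings)
  }
}
```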