Spark final task takes 100x longer than the first 199, how to improve

Spark >= 3.0: Since 3.0, Spark provides built-in optimizations for handling skewed joins, which can be enabled via the spark.sql.adaptive.skewJoin.enabled property (see SPARK-29544 for details). Spark < 3.0: You clearly have a problem with huge data skew on the right side of the join. Let's take a look at the statistics you've provided: df1 = [mean=4.989209978967438, stddev=2255.654165352454, count=2400088] df2 = … Read more
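A minimal sketch of enabling the Spark >= 3.0 path, assuming an existing SparkSession named `spark`. The property names below are the ones shipped in released Spark 3.x; adaptive query execution must be on for the skew-join optimization to take effect, and the two threshold values shown are the defaults, listed here only to show what you would tune:

```scala
// Enable adaptive query execution and its skewed-join optimization (Spark >= 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// Knobs controlling when a partition counts as skewed (defaults shown; tune to your data).
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```

With these set, Spark splits oversized join partitions at runtime instead of funneling the skewed keys into one long-running final task.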

Integrating native system libraries with SBT

From the research I’ve done in the past, there are only two ways to get native libraries loaded: modifying java.library.path and using System.loadLibrary (I feel like most people do this), or using System.load with an absolute path. As you’ve alluded to, messing with java.library.path can be annoying in terms of configuring SBT and Eclipse, and … Read more
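A small sketch of the second approach (loading by absolute path, so java.library.path never needs configuring). The `NativeLoader` object and the `/opt/native` directory are illustrative, not from the answer; `System.mapLibraryName` is the standard JDK call that turns a base name into the platform-specific file name (`libfoo.so`, `foo.dll`, `libfoo.dylib`):

```scala
import java.io.File

object NativeLoader {
  // Build the platform-specific absolute path for a native library base name.
  def nativeLibPath(dir: String, base: String): String =
    new File(dir, System.mapLibraryName(base)).getAbsolutePath

  // Load by absolute path: no java.library.path juggling in SBT or Eclipse.
  def loadFrom(dir: String, base: String): Unit =
    System.load(nativeLibPath(dir, base))
}
```

For example, `NativeLoader.loadFrom("/opt/native", "mylib")` would load `/opt/native/libmylib.so` on Linux.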

Spark : Read file only if the path exists

You can filter out the irrelevant files as in @Psidom’s answer. In Spark, the best way to do so is to use Spark’s internal Hadoop configuration. Given that the Spark session variable is called “spark”, you can do: import org.apache.hadoop.fs.FileSystem import org.apache.hadoop.fs.Path val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration) def testDirExist(path: String): Boolean = { val p … Read more
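The filtering pattern itself can be sketched independently of Hadoop. The hypothetical helper below takes the existence check as a parameter, so the same logic serves local files (`java.nio.file.Files.exists`) or HDFS, where you would pass `p => hadoopfs.exists(new Path(p))` using the `FileSystem` handle from the answer:

```scala
// Keep only the paths that actually exist; the existence check is pluggable,
// so it works for local filesystems and HDFS alike.
def existingPaths(paths: Seq[String], exists: String => Boolean): Seq[String] =
  paths.filter(exists)
```

You could then read only the surviving paths, e.g. `spark.read.parquet(existingPaths(candidates, p => hadoopfs.exists(new Path(p))): _*)` (names here are illustrative).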

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

What you’re looking for is Seq[o.a.s.sql.Row]: import org.apache.spark.sql.Row val my_size = udf { subjects: Seq[Row] => subjects.size } Explanation: The current representation of ArrayType is, as you already know, WrappedArray, so Array won’t work and it is better to stay on the safe side. According to the official specification, the local (external) type for StructType is … Read more
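A plain-Scala sketch of why declaring the parameter as Seq is the safe choice: Spark hands an ArrayType column to a UDF as a wrapped array, which is a Seq but not an Array, so a function written against Seq accepts it (and any other sequence) without caring about the concrete class:

```scala
// A function written against Seq accepts whatever concrete sequence the
// caller passes, including the wrapped arrays Spark uses for ArrayType.
val mySize: Seq[Any] => Int = _.size

// Arrays are not Seqs themselves, but Scala wraps them implicitly into one
// (WrappedArray in Scala 2.12, ArraySeq in 2.13).
val wrapped: Seq[Int] = Array(1, 2, 3)
```

Declaring the UDF argument as `Array[Row]` would fail at runtime precisely because the value Spark passes is the wrapper, not a raw array.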

Difference between Scala’s existential types and Java’s wildcard by example?

This is Martin Odersky’s answer on the Scala-users mailing list: The original Java wildcard types (as described in the ECOOP paper by Igarashi and Viroli) were indeed just shorthands for existential types. I am told and I have read in the FOOL ’05 paper on Wild FJ that the final version of wildcards has some … Read more
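The correspondence in its simplest form (a sketch, not from the mailing-list post): Java’s wildcard `List<?>` is shorthand for an existential type, which Scala 2 writes as `List[_]`, itself sugar for `List[T] forSome { type T }`:

```scala
// Java:    static int size(List<?> xs) { return xs.size(); }
// Scala 2: List[_] is sugar for the existential List[T] forSome { type T }.
def size(xs: List[_]): Int = xs.size
```

Both versions say the same thing: the list holds elements of *some* unknown type, so only operations that do not depend on that type (like `size`) are available.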

Scala: circular references in immutable data types?

Let’s try to work it out step by step. As a rule of thumb, when creating an immutable object, all constructor parameters should be known at the point of instantiation — but let’s cheat and pass constructor parameters by name, then use lazy fields to delay evaluation, so we can create a bidirectional link between elements: … Read more
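The by-name-plus-lazy trick described above can be sketched as follows (the `Person` class and the names are illustrative, not from the answer). Neither constructor argument is evaluated until the corresponding `lazy val` is first accessed, by which point both objects exist:

```scala
object Circular {
  // A by-name constructor parameter plus a lazy field delays evaluation,
  // letting two immutable objects reference each other.
  final class Person(val name: String, friend0: => Person) {
    lazy val friend: Person = friend0
  }

  lazy val alice: Person = new Person("Alice", bob)
  lazy val bob: Person   = new Person("Bob", alice)
}
```

Following the cycle round-trips to the same instance: `Circular.alice.friend.friend` is `Circular.alice` itself.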

Cake pattern with Java8 possible?

With inspiration from other answers I came up with the following (rough) class hierarchy that is similar to the cake pattern in Scala: interface UserRepository { String authenticate(String username, String password); } interface UserRepositoryComponent { UserRepository getUserRepository(); } interface UserServiceComponent extends UserRepositoryComponent { default UserService getUserService() { return new UserService(getUserRepository()); } } class UserService { … Read more
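For comparison, here is a minimal Scala version of the wiring the Java 8 sketch imitates — self-types and an abstract member instead of default interface methods. The names mirror the excerpt; the stub repository returning a token string is an assumption for illustration:

```scala
trait UserRepository { def authenticate(user: String, password: String): String }

// A component declares the dependency it provides.
trait UserRepositoryComponent {
  def userRepository: UserRepository
}

// The self-type says: this component can only be mixed in where a
// UserRepositoryComponent is also present.
trait UserServiceComponent { this: UserRepositoryComponent =>
  class UserService {
    def login(user: String, password: String): String =
      userRepository.authenticate(user, password)
  }
  lazy val userService = new UserService
}

// "Baking the cake": mix the components and supply a concrete repository.
object App extends UserRepositoryComponent with UserServiceComponent {
  val userRepository: UserRepository = new UserRepository {
    def authenticate(user: String, password: String): String = "token-" + user
  }
}
```

The Java 8 version replaces the self-type with interface inheritance and the `lazy val` with a `default` method, which is as close as Java gets.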

Google guava vs Scala collection framework comparison

Google Guava is a fantastic library, there’s no doubt about it. However, it’s implemented in Java and suffers from all the restrictions that that implies: no immutable collection interface in the standard library; no lambda literals (closures), so there’s some heavy boilerplate around the SAM types needed for e.g. predicates; lots of duplication in type … Read more
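A sketch of the first two points: Scala’s standard library gives you immutable collections and closures directly, where pre-lambda Java with Guava needed an anonymous `Predicate` class (the Guava call shown in the comment is `Iterables.filter`):

```scala
// Immutable by default; filter takes a plain closure instead of a SAM type.
val evens: List[Int] = List(1, 2, 3, 4).filter(_ % 2 == 0)

// The pre-lambda Java + Guava equivalent:
//   Iterables.filter(list, new Predicate<Integer>() {
//     public boolean apply(Integer n) { return n % 2 == 0; }
//   });
```

The result is a new immutable list; the original is untouched.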

Get companion object of class by given generic type Scala

A gist by Miles Sabin may give you a hint: trait Companion[T] { type C def apply() : C } object Companion { implicit def companion[T](implicit comp : Companion[T]) = comp() } object TestCompanion { trait Foo object Foo { def bar = "wibble" // Per-companion boilerplate for access via implicit resolution implicit def companion … Read more
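A self-contained sketch of the gist’s idea, completed here under the assumption that the per-companion boilerplate exposes an implicit with a refined `C` member (`Foo` and `bar` follow the excerpt; the `Companion.apply` summoner is my addition):

```scala
trait Companion[T] {
  type C
  def apply(): C
}

object Companion {
  // Companion[Foo] summons Foo's companion object via implicit resolution.
  def apply[T](implicit comp: Companion[T]): comp.C = comp()
}

trait Foo
object Foo {
  def bar = "wibble"
  // Per-companion boilerplate for access via implicit resolution; the
  // refinement { type C = Foo.type } preserves the companion's precise type.
  implicit val companion: Companion[Foo] { type C = Foo.type } =
    new Companion[Foo] {
      type C = Foo.type
      def apply(): Foo.type = Foo
    }
}
```

Because the implicit lives in `Foo`’s companion object, it is found automatically whenever a `Companion[Foo]` is required, so `Companion[Foo].bar` resolves to the real companion’s member.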