Cleanest, most efficient syntax to perform DataFrame self-join in Spark

There are at least two different ways you can approach this, either by aliasing:

df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")

or using name-based equality joins:

// Note that it will result in ambiguous column names
// so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))

df.join(df, Seq("foo"))  

In general, column renaming, while the ugliest approach, is the safest practice across all versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.
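For completeness, here is a minimal sketch of the renaming approach; the join key foo and the _r suffix are just placeholders for illustration:

// Rename every column on one side so the joined result has no duplicate names.
// Assumes a DataFrame df with a join key column "foo".
val dfRenamed = df.columns.foldLeft(df) {
  (acc, c) => acc.withColumnRenamed(c, c + "_r")
}

df.join(dfRenamed, df("foo") === dfRenamed("foo_r"))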

Personally I prefer aliases because of their resemblance to idiomatic SQL and the ability to use them outside the scope of a specific DataFrame object.
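For example, once the aliases are in place the qualified names can be referenced from anywhere downstream, e.g. via col, not only through the original DataFrame reference (the bar column is illustrative):

import org.apache.spark.sql.functions.col

df.as("df1")
  .join(df.as("df2"), col("df1.foo") === col("df2.foo"))
  .select(col("df1.foo"), col("df2.bar"))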

Regarding performance, unless you’re interested in close-to-real-time processing, there should be no performance difference whatsoever. All of these approaches should generate the same execution plan.
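If you want to double-check on your version, you can compare the plans yourself with explain (assuming spark.implicits._ is in scope for the $ syntax):

// Print and compare the extended plans of the two variants.
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo").explain(true)
df.join(df, Seq("foo")).explain(true)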
