apache-spark-1.4 – Tarik Billa

DataFrame join optimization – Broadcast Hash Join

May 5, 2023 by Tarik

Broadcast hash joins (similar to a map-side join or map-side combine in MapReduce): In Spark SQL you can see the type of join being performed by calling queryExecution.executedPlan. As with core Spark, if one of the tables is much smaller than the other you may want a broadcast hash join. You can hint to Spark … Read more
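To make the idea concrete, here is a minimal plain-Python sketch of what a broadcast hash join does conceptually. It does not use Spark's API. The function name `broadcast_hash_join`, the sample tables, and the column names are all hypothetical, chosen only for illustration. The small table is turned into an in-memory hash map (the part that would be "broadcast" to every worker), and each partition of the large table is then probed against it without any shuffle.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key_large, key_small):
    """Inner-join partitions of a large table against a small table.

    Illustrative only: in a real cluster the lookup table would be shipped
    to every executor; here everything runs in a single process.
    """
    # Build phase: hash the small table on its join key.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key_small]].append(row)

    # Probe phase: each partition is scanned independently, so the large
    # table never has to be repartitioned by key.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key_large], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small dimension table and a partitioned fact table.
countries = [
    {"code": "US", "name": "United States"},
    {"code": "FR", "name": "France"},
]
orders_partitions = [
    [{"order_id": 1, "code": "US"}, {"order_id": 2, "code": "DE"}],
    [{"order_id": 3, "code": "FR"}],
]

result = broadcast_hash_join(orders_partitions, countries, "code", "code")
for row in result:
    print(row)
```

The order with code "DE" is dropped because this sketch implements an inner join. The approach only pays off when the small table fits comfortably in memory on every worker.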

How to optimize shuffle spill in Apache Spark application

April 21, 2023 by Tarik

Learning to performance-tune Spark takes quite a bit of investigation. There are a few good resources, including this video. Spark 1.4 has better diagnostics and visualisation in the interface, which can help you. In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the … Read more