Spark final task takes 100x times longer than first 199, how to improve
Spark >= 3.0 Since 3.0 Spark provides built-in optimizations for handling skewed joins – which can be enabled using spark.sql.adaptive.optimizeSkewedJoin.enabled property. See SPARK-29544 for details. Spark < 3.0 You clearly have a problem with a huge right data skew. Lets take a look a the statistics you’ve provided: df1 = [mean=4.989209978967438, stddev=2255.654165352454, count=2400088] df2 = … Read more