DataFrame / Dataset groupBy behaviour/optimization
Yes, it is "smart enough". A groupBy performed on a DataFrame is not the same operation as a groupBy performed on a plain RDD. In the scenario you've described there is no need to move the raw data at all. Let's create a small example to illustrate that: val df = sc.parallelize(Seq( ("a", "foo", 1), ("a", "foo", 3), …
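Since the example above is cut off, here is a minimal sketch of how one might check the claim for oneself. It is not the original answer's code: the sample rows, the column names x, y and z, the object name GroupByPlanSketch and the helper totalsByKey are all hypothetical, and it assumes Spark 2.x or later running in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupByPlanSketch {
  // Hypothetical helper: group by column "x" and sum column "z".
  // The column names are assumptions made for this sketch.
  def totalsByKey(df: DataFrame): DataFrame =
    df.groupBy("x").agg(sum("z").as("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("groupBy-plan-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, not the data from the original answer.
    val df = Seq(
      ("a", "foo", 1),
      ("a", "foo", 3),
      ("b", "bar", 5)
    ).toDF("x", "y", "z")

    val aggregated = totalsByKey(df)

    // Print the physical plan so the position of the shuffle (Exchange)
    // relative to the aggregation steps can be inspected.
    aggregated.explain()

    aggregated.show()
    spark.stop()
  }
}
```

In the printed physical plan one would typically expect a partial aggregation step, then an Exchange (the shuffle), then a final aggregation step. That ordering is what it means for the raw rows to stay where they are: each partition is reduced to one partial result per key before anything crosses the network. The exact operator names vary between Spark versions, so treat the plan text as something to inspect rather than match literally.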