What is shuffle read & shuffle write in Apache Spark

Shuffling is the redistribution of data between multiple Spark stages. “Shuffle Write” is the sum of all serialized data written on all executors before transmitting (normally at the end of a stage), and “Shuffle Read” is the sum of serialized data read on all executors at the beginning of a stage.

Your program has only one stage, triggered by the “collect” operation. No shuffling is required, because you have only a chain of consecutive map operations, which are pipelined within a single stage.
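To make the distinction concrete, here is a minimal plain-Python sketch (not Spark API; all function names are illustrative) of why consecutive maps pipeline within one stage while a key-based regrouping forces a shuffle:

```python
def pipelined_maps(partition, *fns):
    # Consecutive maps are fused: each record flows through all
    # functions locally, with no data movement between "executors".
    out = []
    for record in partition:
        for f in fns:
            record = f(record)
        out.append(record)
    return out

def shuffle(partitions, num_out):
    # A regrouping (e.g. groupByKey/reduceByKey) must move records so
    # that equal keys land in the same output partition: each input
    # partition "writes" its records bucketed by hash(key) (shuffle
    # write), and each output partition "reads" its bucket from every
    # input partition (shuffle read).
    buckets = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            buckets[hash(key) % num_out].append((key, value))
    return buckets

# Two input partitions of (key, value) records.
parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Map-only pipeline: no record crosses a partition boundary,
# so shuffle read and shuffle write stay at zero.
mapped = [pipelined_maps(p, lambda kv: (kv[0], kv[1] * 10)) for p in parts]

# Regrouping by key: every record may move to a different partition.
regrouped = shuffle(mapped, num_out=2)
```

In a program like yours, which contains only the first kind of operation followed by `collect`, Spark never reaches the second step, so both shuffle metrics remain zero.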

Try to take a look at these slides:
http://de.slideshare.net/colorant/spark-shuffle-introduction

It could also help to read chapter 5 of the original paper:
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
