It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a `sortBy` operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like `map`, which leaves every element in the same partition and position. That said, I find it a little too easy to accidentally violate the assumptions that `zip` depends on (the same number of partitions, and the same number of elements in each partition), since they’re a little subtle, but it certainly has a purpose.
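
To make these points concrete, here is a minimal sketch in Scala, assuming a local Spark installation on the classpath. The object name `RddOrderSketch`, the `run` helper, and the sample data are made up for illustration; only the `SparkContext` and RDD calls (`parallelize`, `sortBy`, `getNumPartitions`, `glom`, `map`, `filter`, `zip`, `collect`) come from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Hypothetical demo object; names and data are illustrative only.
object RddOrderSketch {
  // Returns (sorted elements, zipped pairs, whether the mismatched zip failed).
  def run(sc: SparkContext): (Seq[Int], Seq[(Int, Int)], Boolean) = {
    // Not a set: the duplicate 1s and 3s are kept.
    val nums = sc.parallelize(Seq(3, 1, 2, 3, 1), numSlices = 2)

    // sortBy gives a defined order; collect() returns elements in that order.
    val sorted = nums.sortBy(x => x)

    // Partitioning can be queried (and changed, e.g. with repartition).
    println(s"partitions: ${sorted.getNumPartitions}")
    println(sorted.glom().collect().map(_.mkString("[", ",", "]")).mkString(" "))

    // map keeps every element in place, so sorted and doubled line up for zip.
    val doubled = sorted.map(_ * 2)
    val zipped = sorted.zip(doubled).collect().toSeq

    // filter changes the per-partition element counts, which breaks zip's
    // assumption; the failure only shows up once an action runs.
    val mismatched = Try(sorted.zip(sorted.filter(_ > 1)).collect())

    (sorted.collect().toSeq, zipped, mismatched.isFailure)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-order-sketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The last case is the subtle one: the `zip` call itself succeeds, and the mismatch is only reported when `collect()` forces evaluation, which is why it is easy to trip over.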