With cache(), you use only the default storage level:
- MEMORY_ONLY for RDD
- MEMORY_AND_DISK for Dataset
With persist(), you can specify which storage level you want for both RDD and Dataset.
From the official docs:
- You can mark an RDD to be persisted using the persist() or cache() methods on it.
- Each persisted RDD can be stored using a different storage level.
- The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than the default:
- MEMORY_ONLY for an RDD
- MEMORY_AND_DISK for a Dataset
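To make the difference concrete, here is a minimal sketch in Scala. It assumes a local SparkSession; the object name, app name, and sample data are made up for illustration, and the alternative levels (MEMORY_AND_DISK_SER, DISK_ONLY) are just two arbitrary picks from StorageLevel.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; names and data are illustrative only.
object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    // Local session, assumed here only so the sketch can run standalone.
    val spark = SparkSession.builder()
      .appName("cache-vs-persist")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // cache() takes no argument: the RDD gets its default level.
    val cachedRdd = spark.sparkContext.parallelize(1 to 100).cache()
    // Compare the assigned level with the documented RDD default.
    println(cachedRdd.getStorageLevel == StorageLevel.MEMORY_ONLY)

    // cache() on a Dataset: again no argument, Dataset default level.
    val cachedDs = Seq(1, 2, 3).toDS().cache()
    // Compare the assigned level with the documented Dataset default.
    println(cachedDs.storageLevel == StorageLevel.MEMORY_AND_DISK)

    // persist(level) lets you choose the level explicitly, for both types.
    val persistedRdd = spark.sparkContext
      .parallelize(1 to 100)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    val persistedDs = Seq(1, 2, 3).toDS().persist(StorageLevel.DISK_ONLY)

    println(persistedRdd.getStorageLevel)
    println(persistedDs.storageLevel)

    // Release the cached data once it is no longer needed.
    cachedRdd.unpersist()
    cachedDs.unpersist()
    persistedRdd.unpersist()
    persistedDs.unpersist()
    spark.stop()
  }
}
```

Note that marking an RDD or Dataset with cache() or persist() only records the storage level; the data is actually stored the first time an action computes it.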
Interesting link from the official documentation: which storage level to choose