Spark DataFrame – Select n random rows

August 29, 2023 by Tarik

In Scala, you can shuffle the rows by sorting on a random column and then take the first n:

import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
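
To try the snippet above end to end, here is a minimal sketch that wraps it in a small program. The object name RandomRows, the helper randomRows, the local SparkSession settings, and the sample data are all assumptions made for illustration; only orderBy(rand()).limit(n) comes from the snippet itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.rand

object RandomRows {
  // Hypothetical helper: sort by a random value per row, then keep the first n rows.
  def randomRows(df: DataFrame, n: Int): DataFrame =
    df.orderBy(rand()).limit(n)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("random-rows-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample data: a single "id" column holding 0 to 99.
    val dataset = spark.range(100).toDF("id")
    val n = 5

    // Each run is expected to print a different set of n rows.
    randomRows(dataset, n).show()

    spark.stop()
  }
}
```

Keep in mind that orderBy(rand()) sorts the whole dataset, so this approach shuffles every row across the cluster just to keep n of them. That is usually fine for small or medium data but can be costly on very large tables.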