How do I take a random row from a PySpark DataFrame?

You can simply call takeSample on the underlying RDD (note that this collects the sampled rows to the driver):

df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]

If you don’t want to collect rows to the driver, you can sample a higher fraction and then limit the result:

df.sample(False, 0.1, seed=0).limit(1)

Don’t pass a seed, and you should get a different DataFrame each time.
