What’s the most efficient way to filter a DataFrame

Question

My code (following the description of your first method) runs normally in Spark 1.4.0-SNAPSHOT on these two configurations:

Intellij IDEA's test
Spark Standalone cluster with 8 nodes (1 master, 7 worker)

Please check if any differences exists

val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")

val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)
val sqlfunc = udf(func)
val filtered = xdf.filter(sqlfunc(col("name")))

xdf.show()
filtered.show()

Output

name cnt
login1 192
login2 146
login3 72

name cnt
login3 72

Leave a Comment Cancel reply