How to filter out a null value from a Spark DataFrame

Let’s say you have this data setup (so that results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
// toDF needs the SQL implicits in scope (spark-shell imports them for you)
val df = sc.parallelize(employees).toDF

The data looks like this:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+

To select the employees whose id is null, use:

df.filter("id is null").show

This returns only the two rows with a null id:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+
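If you want the opposite, that is, to drop the rows with a null id and keep the rest, the same predicate can be inverted. Below is a minimal sketch, assuming Spark 2.x or later; it builds its own small DataFrame so it can be pasted into spark-shell on its own, and the names `rows`, `people`, `viaSql`, `viaColumn` and `viaNa` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() simply returns the session that already exists
val spark = SparkSession.builder().master("local[*]").appName("null-filter-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: None becomes a null id in the DataFrame
val rows: Seq[(String, Option[String])] =
  Seq(("n1", None), ("n2", Some("2")), ("n5", None), ("n6", Some("6")))
val people = rows.toDF("name", "id")

// Three ways to keep only the rows whose id is not null
val viaSql    = people.filter("id is not null")   // SQL expression string
val viaColumn = people.filter($"id".isNotNull)    // Column API
val viaNa     = people.na.drop(Seq("id"))         // drop rows with a null in the listed columns

viaSql.show()
```

All three variants should leave only n2 and n6 here; which one to use is mostly a matter of taste, although `na.drop` is convenient when several columns have to be checked at once.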

As for the second part of your question, you can replace the null ids with 0 and all other values with 1 like this:

import org.apache.spark.sql.functions.when  // when() lives here; $ needs the SQL implicits
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+
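Note that this overwrites the original id values. If you need to keep them, one option is to write the flag into a separate column instead. A minimal sketch, assuming Spark 2.x or later; the column name `hasId` and the names `rows`, `people` and `flagged` are hypothetical, chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// In spark-shell, getOrCreate() simply returns the session that already exists
val spark = SparkSession.builder().master("local[*]").appName("null-flag-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: None becomes a null id in the DataFrame
val rows: Seq[(String, Option[String])] = Seq(("n1", None), ("n2", Some("2")))
val people = rows.toDF("name", "id")

// Same when/otherwise expression, written to a new column so id stays untouched
val flagged = people.withColumn("hasId", when($"id".isNull, 0).otherwise(1))

flagged.show()
```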
