Pyspark: Filter dataframe based on multiple conditions

Question

Your logic condition is wrong. IIUC, what you want is:

import pyspark.sql.functions as f

df.filter((f.col('d')<5))\
    .filter(
        ((f.col('col1') != f.col('col3')) | 
         (f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )\
    .show()

I broke the filter() step into 2 calls for readability, but you could equivalently do it in one line.

Output:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+

Leave a Comment Cancel reply