How to get the length of the lists in one column of a Spark DataFrame?

Question

PySpark has a built-in function called size that does exactly what you want: it returns the number of elements in an array (or map) column. See http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size
To add the length as a column, call size in your select statement.

from pyspark.sql.functions import size

countdf = df.select('*', size('products').alias('product_cnt'))

Filtering works exactly as @titiro89 described. You can also call size directly inside the filter, which lets you skip adding the extra column if you don't need it:

filterdf = df.filter(size('products') == given_products_length)
