GroupBy and concat array columns pyspark

You need a flattening UDF; starting from your own df:

spark.version
# u'2.2.0'

from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))

df2 = df.groupBy("store").agg(F.collect_list("values"))
df2.show(truncate=False)
# +-----+----------------------------------------------+
# |store|collect_list(values)                          |
# +-----+----------------------------------------------+
# |1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
# |2    |[WrappedArray(2), WrappedArray(3)]            |
# +-----+----------------------------------------------+

df3 = df2.select("store", flattenUdf("collect_list(values)").alias("values"))
df3.show(truncate=False)
# +-----+------------------+
# |store|values            |
# +-----+------------------+
# |1    |[1, 2, 3, 4, 5, 6]|
# |2    |[2, 3]            |
# +-----+------------------+
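For reference, the UDF is doing nothing Spark-specific: `reduce` with `+` just concatenates the inner lists pairwise. A plain-Python sketch of the same operation, with `itertools.chain` as an equivalent linear-time alternative (repeated list concatenation is quadratic in the total length):

```python
from functools import reduce
from itertools import chain

nested = [[1, 2, 3], [4, 5, 6]]

# reduce concatenates pairwise: ([1, 2, 3] + [4, 5, 6]) -> [1, 2, 3, 4, 5, 6]
flat_reduce = reduce(lambda x, y: x + y, nested)

# chain.from_iterable yields the same elements in a single linear pass
flat_chain = list(chain.from_iterable(nested))

assert flat_reduce == flat_chain == [1, 2, 3, 4, 5, 6]
```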

UPDATE (after comment):

The snippet above works only in Python 2, where reduce is a built-in. In Python 3, reduce lives in functools, so modify the UDF as follows:

import functools

def fudf(val):
    return functools.reduce(lambda x, y: x + y, val)

Tested with Spark 2.4.4.
