pyspark: isin vs join

Assuming import pyspark.sql.functions as psf, there are two types of broadcasting: sc.broadcast(), which copies Python objects to every node for more efficient use of psf.isin; and psf.broadcast inside a join, which copies your PySpark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). The latter is usually used for cartesian products (CROSS JOIN in Pig). … Read more
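A minimal sketch of both patterns, assuming df1 and df2 exist and share a hypothetical key column:

import pyspark.sql.functions as psf

# (1) sc.broadcast: copy a plain Python object to every node,
#     here backing an isin membership filter
valid_keys = sc.broadcast(["a", "b", "c"])
df1.where(psf.col("key").isin(valid_keys.value)).show()

# (2) psf.broadcast: mark the small DataFrame so the join replicates it
#     to every node instead of shuffling the big one
df1.join(psf.broadcast(df2), on="key").show()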

How can I tear down a SparkSession and create a new one within one application?

Long story short, Spark (including PySpark) is not designed to handle multiple contexts in a single application. If you're interested in the JVM side of the story, I would recommend reading SPARK-2243 (resolved as Won't Fix). There are a number of design decisions made in PySpark that reflect that, including, but not limited to, a singleton … Read more
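A minimal sketch of the stop-and-recreate pattern, assuming one session at a time within the application:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first").getOrCreate()
# ... run the first stage of the job ...

spark.stop()  # stops the SparkSession and its underlying SparkContext

# With the old context stopped, getOrCreate() builds a fresh one
spark = SparkSession.builder.appName("second").getOrCreate()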

Apache Spark dealing with case statements

These are a few ways to write an If-Else / When-Then-Else / When-Otherwise expression in PySpark. Sample dataframe:

df = spark.createDataFrame([(1,1),(2,2),(3,3)],['id','value'])
df.show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    1|
#|  2|    2|
#|  3|    3|
#+---+-----+

#Desired Output:
#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+

Option #1: withColumn() … Read more
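A plausible shape of that withColumn() option, sketched with when/otherwise (the excerpt cuts off before the code):

from pyspark.sql import functions as F

# Map value 1 -> "one", 2 -> "two", everything else -> "other"
df.withColumn(
    "value_desc",
    F.when(F.col("value") == 1, "one")
     .when(F.col("value") == 2, "two")
     .otherwise("other")
).show()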

GroupBy and concat array columns pyspark

You need a flattening UDF; starting from your own df:

spark.version
# u'2.2.0'

from functools import reduce
from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))

df2 = df.groupBy("store").agg(F.collect_list("values"))
df2.show(truncate=False)
# +-----+----------------------------------------------+
# |store|collect_list(values)                          |
# +-----+----------------------------------------------+
# |1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
… Read more
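As a side note, on Spark 2.4+ the built-in flatten function makes the UDF unnecessary; a minimal sketch, assuming the same df (the answer above targets 2.2.0, where flatten is unavailable):

from pyspark.sql import functions as F

# flatten() concatenates the nested arrays produced by collect_list
df2 = df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values"))
df2.show(truncate=False)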

How to extract an element from an array in PySpark

Create sample data:

from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123,234])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract an element from the array column, like this; in your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", … Read more
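The excerpt truncates before the getItem call; here is a minimal sketch, with index 0 as a hypothetical choice of element:

# Extract the first element (index 0) of the col4 array into a new column
df = df.withColumn("col5", df["col4"].getItem(0))
df.show()
#+----+----+----+----------+----+
#|col1|col2|col3|      col4|col5|
#+----+----+----+----------+----+
#|  xx|  yy|  zz|[123, 234]| 123|
#+----+----+----+----------+----+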

How to convert list of dictionaries into Pyspark DataFrame

In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,")

As this warning message says, you should use pyspark.sql.Row instead. … Read more
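A minimal sketch of the Row-based approach the warning points to, reusing the same mylist:

from pyspark.sql import Row

# Unpack each dict's key/value pairs into a Row, then build the DataFrame
df = spark.createDataFrame([Row(**d) for d in mylist])
df.show()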
