apache-spark – Tarik Billa

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

April 11, 2024 by Tarik

I think this problem is similar to Write to multiple outputs by key Spark – one Spark job Please refer the answer there. import org.apache.hadoop.io.NullWritable import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] { override def generateActualKey(key: Any, value: Any): Any = NullWritable.get() override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String … Read more

pyspark: isin vs join

April 11, 2024 by Tarik

Considering import pyspark.sql.functions as psf There are two types of broadcasting: sc.broadcast() to copy python objects to every node for a more efficient use of psf.isin psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in pig). … Read more

How can I tear down a SparkSession and create a new one within one application?

April 10, 2024 by Tarik

Long story short, Spark (including PySpark) is not designed to handle multiple contexts in a single application. If you’re interested in JVM side of the story I would recommend reading SPARK-2243 (resolved as won’t fix). There is a number of design decisions made in PySpark which reflects that including, but not limited to a singleton … Read more

How to access values in array column?

April 8, 2024 by Tarik

Since Spark 2.4.0, there is a new function element_at($array_column, $index). See Spark docs

Pyspark dataframe convert multiple columns to float

April 8, 2024 by Tarik

float() is not a Spark function, you need the function cast(): from pyspark.sql.functions import col df_temp.select(*(col(c).cast(“float”).alias(c) for c in df_temp.columns))

Apache spark dealing with case statements

April 8, 2024 by Tarik

These are few ways to write If-Else / When-Then-Else / When-Otherwise expression in pyspark. Sample dataframe df = spark.createDataFrame([(1,1),(2,2),(3,3)],[‘id’,’value’]) df.show() #+—+—–+ #| id|value| #+—+—–+ #| 1| 1| #| 2| 2| #| 3| 3| #+—+—–+ #Desired Output: #+—+—–+———-+ #| id|value|value_desc| #+—+—–+———-+ #| 1| 1| one| #| 2| 2| two| #| 3| 3| other| #+—+—–+———-+ Option#1: withColumn() … Read more

Sum operation on PySpark DataFrame giving TypeError when type is fine

April 6, 2024 by Tarik

You are not using the correct sum function but the built-in function sum (by default). So the reason why the build-in function won’t work is that’s it takes an iterable as an argument where as here the name of the column passed is a string and the built-in function can’t be applied on a string. … Read more

How to extract an element from an array in PySpark

April 2, 2024 by Tarik

Create sample data: from pyspark.sql import Row x = [Row(col1=”xx”, col2=”yy”, col3=”zz”, col4=[123,234])] rdd = sc.parallelize([Row(col1=”xx”, col2=”yy”, col3=”zz”, col4=[123,234])]) df = spark.createDataFrame(rdd) df.show() #+—-+—-+—-+———-+ #|col1|col2|col3| col4| #+—-+—-+—-+———-+ #| xx| yy| zz|[123, 234]| #+—-+—-+—-+———-+ Use getItem to extract element from the array column as this, in your actual case replace col4 with collect_set(TIMESTAMP): df = df.withColumn(“col5”, … Read more

Spark / Scala: forward fill with last observation

January 9, 2024 by Tarik

Initial answer (a single time series assumption): First of all try avoid window functions if you cannot provide PARTITION BY clause. It moves data to a single partition so most of the time it is simply not feasible. What you can do is to fill gaps on RDD using mapPartitionsWithIndex. Since you didn’t provide an … Read more

Spark final task takes 100x times longer than first 199, how to improve

January 9, 2024 by Tarik

Spark >= 3.0 Since 3.0 Spark provides built-in optimizations for handling skewed joins – which can be enabled using spark.sql.adaptive.optimizeSkewedJoin.enabled property. See SPARK-29544 for details. Spark < 3.0 You clearly have a problem with a huge right data skew. Lets take a look a the statistics you’ve provided: df1 = [mean=4.989209978967438, stddev=2255.654165352454, count=2400088] df2 = … Read more