Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I think this problem is similar to Write to multiple outputs by key Spark – one Spark job. Please refer to the answer there.

import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String … Read more
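The truncated Scala code presumably wires this custom OutputFormat into saveAsHadoopFile. If you are on the DataFrame API instead, a similar one-directory-per-key layout can be had with partitionBy on write; a minimal PySpark sketch (the key/value column names, sample data and output path below are placeholders, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical pair data: (key, value) records that should end up in separate outputs per key.
pairs = [("a", "1"), ("a", "2"), ("b", "3")]
df = spark.createDataFrame(pairs, ["key", "value"])

# partitionBy writes one subdirectory per distinct key (out/key=a/, out/key=b/, ...),
# which gives a per-key split without a custom Hadoop OutputFormat.
df.write.partitionBy("key").mode("overwrite").text("out")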

pyspark: isin vs join

Considering import pyspark.sql.functions as psf, there are two types of broadcasting:

sc.broadcast() to copy Python objects to every node for a more efficient use of psf.isin

psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig). … Read more
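A short sketch contrasting the two approaches; the dataframes and column names here are made up for illustration:

import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical dataframes: df1 is large, df2 is a small lookup of ids.
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "x"])
df2 = spark.createDataFrame([(1,), (3,)], ["id"])

# Option 1: broadcast a plain Python collection and filter with isin.
small_ids = sc.broadcast([r.id for r in df2.collect()])
filtered = df1.filter(psf.col("id").isin(small_ids.value))

# Option 2: broadcast the small dataframe inside a join.
joined = df1.join(psf.broadcast(df2), on="id", how="inner")

filtered.show()
joined.show()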

Apache Spark dealing with case statements

These are a few ways to write an If-Else / When-Then-Else / When-Otherwise expression in pyspark.

Sample dataframe

df = spark.createDataFrame([(1,1),(2,2),(3,3)],['id','value'])
df.show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    1|
#|  2|    2|
#|  3|    3|
#+---+-----+

#Desired Output:
#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+

Option#1: withColumn() … Read more
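The withColumn option is cut off in this excerpt; the usual pattern behind it is when/otherwise, sketched below (the exact code in the full answer may differ):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1), (2, 2), (3, 3)], ["id", "value"])

# Map value -> description with chained when() calls and a final otherwise().
df = df.withColumn(
    "value_desc",
    F.when(F.col("value") == 1, "one")
     .when(F.col("value") == 2, "two")
     .otherwise("other"),
)
df.show()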

What’s the most efficient way to filter a DataFrame

My code (following the description of your first method) runs normally in Spark 1.4.0-SNAPSHOT on these two configurations: IntelliJ IDEA's test, and a Spark Standalone cluster with 8 nodes (1 master, 7 workers). Please check if any differences exist.

val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", … Read more
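For reference, a rough PySpark version of the same broadcast-plus-isin filter (the second column name cnt is invented here, since the original toDF call is cut off):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Broadcast the small set of names to filter on (mirrors the Scala bc variable).
bc = sc.broadcast(["login3", "login4"])

x = [("login1", 192), ("login2", 146), ("login3", 72)]
xdf = spark.createDataFrame(x, ["name", "cnt"])

# Filter the DataFrame against the broadcast list with isin.
xdf.filter(F.col("name").isin(bc.value)).show()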

Unpivot in Spark SQL / PySpark

You can use the built-in stack function, for example in Scala:

scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]

scala> df.show
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+

scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', … Read more
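A rough PySpark equivalent of the same unpivot via stack, using selectExpr (a sketch; the column aliasing in the full Scala answer may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("G", 4, 2, None), ("H", None, 4, 5)], ["A", "X", "Y", "Z"])

# stack(3, ...) emits one (col, value) row per listed column, for each input row.
unpivoted = df.selectExpr("A", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (col, value)")
unpivoted.show()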

Read all files in a nested folder in Spark

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use the * wildcard for each level of nesting as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt', … Read more
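If the nesting depth is not fixed, a per-level glob gets awkward; on Spark 3.0+ the DataFrame reader's recursiveFileLookup option is an alternative (a sketch, not part of the quoted answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Picks up every *.txt file under /folder regardless of nesting depth (Spark 3.0+).
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("pathGlobFilter", "*.txt")
      .text("/folder"))
df.show(truncate=False)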

How do I stop a spark streaming job?

You can stop your streaming context in cluster mode by running the following command, without needing to send a SIGTERM. This will stop the streaming context without you needing to explicitly stop it using a thread hook.

$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID

- $MASTER_REST_URL is the REST URL of the spark driver, i.e. something like spark://localhost:6066 … Read more
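If you control the driver code yourself, the DStream API also allows a programmatic, graceful stop; a minimal self-contained sketch (the socket source and timeout below are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stoppable-stream")
ssc = StreamingContext(sc, batchDuration=5)

# Hypothetical source: a socket stream on localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()

ssc.start()
# Run for a bounded time, then shut down gracefully instead of sending a signal.
ssc.awaitTerminationOrTimeout(60)
ssc.stop(stopSparkContext=True, stopGraceFully=True)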
