pyspark – Page 2 – Tarik Billa

Unpivot in Spark SQL / PySpark

January 2, 2024 by Tarik

You can use the built-in stack function, for example in Scala:

    scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y","Z")
    df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]

    scala> df.show
    +---+----+---+----+
    |  A|   X|  Y|   Z|
    +---+----+---+----+
    |  G|   4|  2|null|
    |  H|null|  4|   5|
    +---+----+---+----+

    scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', … Read more
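
The wide-to-long reshape that an unpivot performs can be sketched in plain Python, for readers without a Spark shell at hand. This only illustrates the semantics and is not Spark code: the helper name unpivot and the output keys key and value are hypothetical, and the rows mirror the df shown in the excerpt.

```python
# Plain-Python illustration of a wide-to-long reshape (no Spark required).
# `unpivot`, "key", and "value" are hypothetical names chosen for this sketch.

def unpivot(rows, id_col, value_cols):
    """Emit one output row per (input row, value column) pair."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append({id_col: row[id_col], "key": col, "value": row[col]})
    return long_rows

# Same data as the DataFrame in the excerpt; None stands in for null.
wide = [
    {"A": "G", "X": 4, "Y": 2, "Z": None},
    {"A": "H", "X": None, "Y": 4, "Z": 5},
]

long_rows = unpivot(wide, "A", ["X", "Y", "Z"])

# This sketch keeps null values; filter them out if they are not wanted.
non_null = [r for r in long_rows if r["value"] is not None]
```

Each input row yields three output rows, one per value column, so the two rows above become six (four after dropping nulls).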

Pyspark – converting json string to DataFrame

January 2, 2024 by Tarik

You can do the following:

    newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
    df = spark.read.json(sc.parallelize([newJson]))
    df.show(truncate=False)

which should give:

    +------+-----------+------+---------+-------------------------+
    |Author|BlogEntries|Caller|Name     |Url                      |
    +------+-----------+------+---------+-------------------------+
    |jangcy|100        |jangcy|something|https://stackoverflow.com|
    +------+-----------+------+---------+-------------------------+

Improve PySpark DataFrame.show output to fit Jupyter notebook

January 2, 2024 by Tarik

After playing around with my table, which has a lot of columns, I decided the best way to get a feel for the data is to use:

    df.show(n=5, truncate=False, vertical=True)

This displays each row vertically without truncation and is the cleanest view I could come up with.

to_date fails to parse date in Spark 3.0

January 2, 2024 by Tarik

    from pyspark.sql.functions import col, to_date

    spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
    df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy")).show()

    +--------------+----------+
    |   InvoiceDate|      date|
    +--------------+----------+
    |12/1/2010 8:26|2010-12-01|
    +--------------+----------+

In the code above, spark refers to the SparkSession.
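
One plausible reason the stricter parser rejects this value is that the pattern MM/dd/yyyy does not account for the trailing time in "12/1/2010 8:26". As a loose analogy only, Python's datetime.strptime behaves in a similarly strict way. The sketch below is plain Python rather than Spark, and Python's format codes are not the same as Spark's pattern letters.

```python
from datetime import datetime

raw = "12/1/2010 8:26"  # the InvoiceDate value from the excerpt

# A date-only format leaves " 8:26" unparsed, so strict parsing fails.
try:
    datetime.strptime(raw, "%m/%d/%Y")
    date_only_ok = True
except ValueError:
    date_only_ok = False

# Describing the whole string, time included, parses cleanly.
parsed = datetime.strptime(raw, "%m/%d/%Y %H:%M")
invoice_date = parsed.date()
```

Here date_only_ok ends up False, while invoice_date holds the calendar date 2010-12-01, matching the date column in the output above.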

Spark context ‘sc’ not defined

December 30, 2023 by Tarik

You need to do the following once pyspark is on your path:

    from pyspark import SparkContext
    sc = SparkContext()

Syntax while setting schema for Pyspark.sql using StructType

December 27, 2023 by Tarik

It indicates whether the column allows null values: true for nullable, false for not nullable.

StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this field can … Read more

How to pivot on multiple columns in Spark SQL?

December 27, 2023 by Tarik

Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates).

    from pyspark.sql import functions as F

    dff = mydf.groupBy('id').pivot('day').agg(F.first('price').alias('price'), F.first('units').alias('unit'))

Here’s the result (apologies for the non-matching ordering and naming):

    +---+-------+------+-------+------+-------+------+-------+------+
    | id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
    +---+-------+------+-------+------+-------+------+-------+------+
    |100|     23|    10|     45|    11|     67|    12|     78|    13|
    |101|     23|    10|     45|    13|     67|    14|     78|    15|
    … Read more
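
To see where column names such as 1_price and 1_unit come from, the single pivot with two aliased aggregates can be mimicked in plain Python. This is a sketch of the semantics, not Spark code: pivot_first is a hypothetical helper, and the input rows are an assumption reconstructed from the id=100 row of the result above.

```python
# Plain-Python illustration of pivoting on one column with two aggregates.
# Input rows are an assumption, reconstructed from the id=100 result row.
rows = [
    {"id": 100, "day": 1, "price": 23, "units": 10},
    {"id": 100, "day": 2, "price": 45, "units": 11},
    {"id": 100, "day": 3, "price": 67, "units": 12},
    {"id": 100, "day": 4, "price": 78, "units": 13},
]

def pivot_first(rows, key, pivot_col, aggs):
    """Pivot `pivot_col` into columns named '<pivot value>_<alias>'.

    `aggs` maps a source column to the alias used in the output name.
    """
    out = {}
    for row in rows:
        wide = out.setdefault(row[key], {key: row[key]})
        for src, alias in aggs.items():
            # setdefault keeps the first value seen, like a "first" aggregate.
            wide.setdefault(f"{row[pivot_col]}_{alias}", row[src])
    return list(out.values())

result = pivot_first(rows, "id", "day", {"price": "price", "units": "unit"})
```

Each distinct day contributes one output column per aggregate, which is why four days and two aggregates give eight value columns.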

Save content of Spark DataFrame as a single CSV file [duplicate]

December 25, 2023 by Tarik

Just solved this myself using pyspark with dbutils to get the .csv and rename it to the wanted filename.

    save_location = "s3a://landing-bucket-test/export/" + year
    csv_location = save_location + "temp.folder"
    file_location = save_location + 'export.csv'

    df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

    file = dbutils.fs.ls(csv_location)[-1].path
    dbutils.fs.cp(file, file_location)
    dbutils.fs.rm(csv_location, recurse=True)

This answer can be improved by not using [-1], but the .csv seems to always be last in the … Read more
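
One way to avoid relying on [-1] is to pick the file by its extension. The helper below is a minimal sketch under stated assumptions: pick_csv_part is a hypothetical name, it works on a plain list of path strings, and the listing (including the "2023" year and the part-file names) is made up for illustration.

```python
def pick_csv_part(paths):
    """Return the single path ending in .csv; raise if there is not exactly one."""
    matches = [p for p in paths if p.endswith(".csv")]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one .csv file, found {len(matches)}")
    return matches[0]

# Hypothetical listing of the temporary output folder; real names will differ.
listing = [
    "s3a://landing-bucket-test/export/2023temp.folder/_SUCCESS",
    "s3a://landing-bucket-test/export/2023temp.folder/part-00000-example.csv",
    "s3a://landing-bucket-test/export/2023temp.folder/_committed_example",
]

chosen = pick_csv_part(listing)
```

With the code from the excerpt, the input could be built as [f.path for f in dbutils.fs.ls(csv_location)]. Raising on zero or several matches also guards against leftover part files when the folder is written with mode="append".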

How to do left outer join in spark sql?

December 24, 2023 by Tarik

I don’t see any issues in your code. Both “left join” and “left outer join” will work fine. Please check the data again: the data you are showing is for matches. You can also perform a Spark SQL join by using:

    # Left outer join, explicit
    df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
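
What a left outer join should return is easy to check on a tiny example. The following plain-Python sketch illustrates the semantics only and is not Spark code: left_outer_join, the sample rows, and the column names a and b are hypothetical, and unlike a Spark join on an expression it keeps a single copy of the key column.

```python
def left_outer_join(left, right, key):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    right_cols = sorted({c for r in right for c in r if c != key})
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)

    joined = []
    for lrow in left:
        matches = index.get(lrow[key])
        if matches:
            for m in matches:
                joined.append({**lrow, **{c: m.get(c) for c in right_cols}})
        else:
            # No match on the right: the left row survives with nulls.
            joined.append({**lrow, **{c: None for c in right_cols}})
    return joined

df1_rows = [{"col1": 1, "a": "x"}, {"col1": 2, "a": "y"}]
df2_rows = [{"col1": 1, "b": "p"}]

result = left_outer_join(df1_rows, df2_rows, "col1")
```

The row with col1 = 2 has no match in df2_rows, so it appears in result with b set to None. If the output contains only matching rows, the unmatched keys are probably missing from the left side to begin with.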

Pyspark: Extract date from Datetime value

December 22, 2023 by Tarik

PySpark has a to_date function to extract the date from a timestamp. In your example, you could create a new column with just the date by doing the following:

    from pyspark.sql.functions import col, to_date
    df = df.withColumn('date_only', to_date(col('date_time')))

If the column you are trying to convert is a string you can set the format parameter … Read more