Unpivot in Spark SQL / PySpark

You can use the built in stack function, for example in Scala: scala> val df = Seq((“G”,Some(4),2,None),(“H”,None,4,Some(5))).toDF(“A”,”X”,”Y”, “Z”) df: org.apache.spark.sql.DataFrame = [A: string, X: int … 2 more fields] scala> df.show +—+—-+—+—-+ | A| X| Y| Z| +—+—-+—+—-+ | G| 4| 2|null| | H|null| 4| 5| +—+—-+—+—-+ scala> df.select($”A”, expr(“stack(3, ‘X’, X, ‘Y’, Y, ‘Z’, … Read more

Pyspark – converting json string to DataFrame

You can do the following newJson = ‘{“Name”:”something”,”Url”:”https://stackoverflow.com”,”Author”:”jangcy”,”BlogEntries”:100,”Caller”:”jangcy”}’ df = spark.read.json(sc.parallelize([newJson])) df.show(truncate=False) which should give +——+———–+——+———+————————-+ |Author|BlogEntries|Caller|Name |Url | +——+———–+——+———+————————-+ |jangcy|100 |jangcy|something|https://stackoverflow.com| +——+———–+——+———+————————-+

How to pivot on multiple columns in Spark SQL?

Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates). dff = mydf.groupBy(‘id’).pivot(‘day’).agg(F.first(‘price’).alias(‘price’),F.first(‘units’).alias(‘unit’)) Here’s the result (apologies for the non-matching ordering and naming): +—+——-+——+——-+——+——-+——+——-+——+ | id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit| +—+——-+——+——-+——+——-+——+——-+——+ |100| 23| 10| 45| 11| 67| 12| 78| 13| |101| 23| 10| 45| 13| 67| 14| 78| 15| … Read more

Save content of Spark DataFrame as a single CSV file [duplicate]

Just solved this myself using pyspark with dbutils to get the .csv and rename to the wanted filename. save_location= “s3a://landing-bucket-test/export/”+year csv_location = save_location+”temp.folder” file_location = save_location+’export.csv’ df.repartition(1).write.csv(path=csv_location, mode=”append”, header=”true”) file = dbutils.fs.ls(csv_location)[-1].path dbutils.fs.cp(file, file_location) dbutils.fs.rm(csv_location, recurse=True) This answer can be improved by not using [-1], but the .csv seems to always be last in the … Read more

Pyspark: Extract date from Datetime value

Pyspark has a to_date function to extract the date from a timestamp. In your example you could create a new column with just the date by doing the following: from pyspark.sql.functions import col, to_date df = df.withColumn(‘date_only’, to_date(col(‘date_time’))) If the column you are trying to convert is a string you can set the format parameter … Read more

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)