How to save a DataFrame as compressed (gzipped) CSV?

This code works for Spark 2.1, where .codec is not available. df.write .format(“com.databricks.spark.csv”) .option(“codec”, “org.apache.hadoop.io.compress.GzipCodec”) .save(my_directory) For Spark 2.2, you can use the df.write.csv(…,codec=”gzip”) option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

You may try either of these two ways. Option-1: JSON in single line as answered above by @Avishek Bhattacharya. Option-2: Add option to read multi line JSON in the code as follows. You could read the nested attribute also as shown below. val df = spark.read.option(“multiline”,”true”).json(“C:\\data\\nested-data.json”) df.select(“a.b”).show() Here is the output for Option-2. 20/07/29 23:14:35 … Read more

How to pivot on multiple columns in Spark SQL?

Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates). dff = mydf.groupBy(‘id’).pivot(‘day’).agg(F.first(‘price’).alias(‘price’),F.first(‘units’).alias(‘unit’)) Here’s the result (apologies for the non-matching ordering and naming): +—+——-+——+——-+——+——-+——+——-+——+ | id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit| +—+——-+——+——-+——+——-+——+——-+——+ |100| 23| 10| 45| 11| 67| 12| 78| 13| |101| 23| 10| 45| 13| 67| 14| 78| 15| … Read more

Spark 2.0 Dataset vs DataFrame

Difference between df.select(“foo”) and df.select($”foo”) is signature. The former one takes at least one String, the later one zero or more Columns. There is no practical difference beyond that. myDataSet.map(foo.someVal) type checks, but as any Dataset operation uses RDD of objects, and compared to DataFrame operations, there is a significant overhead. Let’s take a look … Read more

Why does Spark report “java.net.URISyntaxException: Relative path in absolute URI” when working with DataFrames?

It’s the SPARK-15565 issue in Spark 2.0 on Windows with a simple solution (that appears to be part of Spark’s codebase that may soon be released as 2.0.2 or 2.1.0). The solution in Spark 2.0.0 is to set spark.sql.warehouse.dir to some properly-referenced directory, say file:///c:/Spark/spark-2.0.0-bin-hadoop2.7/spark-warehouse that uses /// (triple slashes). Start spark-shell with –conf argument … Read more

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)