apache-spark-sql – Page 2

to_date fails to parse date in Spark 3.0

January 2, 2024 by Tarik

Setting the legacy time parser policy lets to_date parse the value:

    spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
    df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy")).show()

    +--------------+----------+
    |   InvoiceDate|      date|
    +--------------+----------+
    |12/1/2010 8:26|2010-12-01|
    +--------------+----------+

In the code above, spark refers to the SparkSession.
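As an alternative to the legacy parser, a pattern that covers the whole string, including the time part, should also parse under the default policy in Spark 3.0. The following is a minimal PySpark sketch, assuming the column holds values shaped like 12/1/2010 8:26. The pattern M/d/yyyy H:mm and the helper name add_invoice_date are illustrative assumptions, not part of the original answer.

```python
def add_invoice_date(df, source_col="InvoiceDate", target_col="date"):
    """Return df with an extra date column parsed from source_col.

    Assumes values look like '12/1/2010 8:26' (single-digit day and hour
    allowed), hence the pattern 'M/d/yyyy H:mm' rather than 'MM/dd/yyyy'.
    """
    # to_date is called as a Spark SQL function inside selectExpr, so no
    # import from pyspark.sql.functions is needed in this helper.
    expression = f"to_date({source_col}, 'M/d/yyyy H:mm') AS {target_col}"
    return df.selectExpr("*", expression)
```

Usage would be add_invoice_date(df).show() on the DataFrame from the example above.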

How to save a DataFrame as compressed (gzipped) CSV?

December 30, 2023 by Tarik

This code works for Spark 2.1, where .codec is not available:

    df.write
      .format("com.databricks.spark.csv")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save(my_directory)

From Spark 2.2, you can pass the codec directly to df.write.csv. In PySpark the keyword argument is named compression, for example df.write.csv(…, compression="gzip"), as described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
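For Spark 2.2 and later, the write can be wrapped in a small helper. This is a minimal PySpark sketch. The function name write_gzipped_csv and the choice to write a header row are assumptions made for illustration.

```python
def write_gzipped_csv(df, output_dir):
    """Write df as gzip-compressed CSV part files under output_dir."""
    # compression="gzip" is the PySpark keyword on DataFrameWriter.csv.
    # header=True is an extra assumption so the column names are kept.
    df.write.csv(output_dir, compression="gzip", header=True)
```

Spark writes a directory of part files, so output_dir ends up containing files ending in .csv.gz rather than a single CSV file.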

How to convert Timestamp to Date format in DataFrame?

December 30, 2023 by Tarik

You can cast the column to date:

Scala:

    import org.apache.spark.sql.types.DateType
    val newDF = df.withColumn("dateColumn", df("timestampColumn").cast(DateType))

PySpark:

    df = df.withColumn('dateColumn', df['timestampColumn'].cast('date'))

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

December 29, 2023 by Tarik

You may try either of these two ways.

Option 1: Put the JSON on a single line, as answered above by @Avishek Bhattacharya.

Option 2: Add the option to read multi-line JSON, as follows. You can also read a nested attribute, as shown below.

    val df = spark.read.option("multiline", "true").json("C:\\data\\nested-data.json")
    df.select("a.b").show()

Here is the output for Option 2:

    20/07/29 23:14:35 …

How to pivot on multiple columns in Spark SQL?

December 27, 2023 by Tarik

Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates).

    from pyspark.sql import functions as F  # F is assumed to be this alias
    dff = mydf.groupBy('id').pivot('day').agg(F.first('price').alias('price'), F.first('units').alias('unit'))

Here’s the result (apologies for the non-matching ordering and naming):

    +---+-------+------+-------+------+-------+------+-------+------+
    | id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
    +---+-------+------+-------+------+-------+------+-------+------+
    |100|     23|    10|     45|    11|     67|    12|     78|    13|
    |101|     23|    10|     45|    13|     67|    14|     78|    15|
    …

Spark 2.0 Dataset vs DataFrame

December 26, 2023 by Tarik

The difference between df.select("foo") and df.select($"foo") is the signature. The former takes at least one String, the latter zero or more Columns. There is no practical difference beyond that. myDataSet.map(foo.someVal) type checks, but like any Dataset operation it uses an RDD of objects, which adds significant overhead compared to DataFrame operations. Let’s take a look …

Why does Spark report “java.net.URISyntaxException: Relative path in absolute URI” when working with DataFrames?

December 25, 2023 by Tarik

It’s the SPARK-15565 issue in Spark 2.0 on Windows, and it has a simple solution (the fix appears to be part of Spark’s codebase and may soon be released as 2.0.2 or 2.1.0). The solution in Spark 2.0.0 is to set spark.sql.warehouse.dir to a properly referenced directory, say file:///c:/Spark/spark-2.0.0-bin-hadoop2.7/spark-warehouse, which uses /// (triple slashes). Start spark-shell with the --conf argument …

How to do left outer join in spark sql?

December 24, 2023 by Tarik

I don’t see any issues in your code. Both "left join" and "left outer join" will work fine. Please check the data again; the data you are showing is for matches. You can also perform the Spark SQL join by using:

    # Left outer join, explicit
    df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
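One way to check the data is to list the left-side rows that have no match at all, since those are the rows a left outer join keeps and an inner join drops. This is a minimal PySpark sketch using a left anti join. The helper name unmatched_left_rows is hypothetical, and it assumes both DataFrames share a join column named col1.

```python
def unmatched_left_rows(df1, df2, key="col1"):
    """Return the rows of df1 whose key has no match in df2."""
    # "left_anti" keeps only left rows without a matching right row.
    return df1.join(df2, on=key, how="left_anti")
```

If this returns no rows, every left row has a match, and a left outer join will look the same as an inner join on that data.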

Pyspark: filter dataframe by regex with string formatting?

December 22, 2023 by Tarik

From neeraj’s hint, it seems like the correct way to do this in pyspark is:

    expr = "Arizona.*hot"
    dk = dx.filter(dx["keyword"].rlike(expr))

Note that dx.filter($"keyword" …) did not work, since my version of pyspark didn’t seem to support the $ nomenclature out of the box.
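To build the pattern with string formatting, as the question title asks, something like the following sketch could work. The helper names build_pattern and filter_by_keyword are hypothetical, and the search terms are assumed to be plain words without regex metacharacters.

```python
import re


def build_pattern(first, second):
    """Build a regex matching `first`, then anything, then `second`."""
    # Assumes both terms are plain words without regex metacharacters.
    return f"{first}.*{second}"


def filter_by_keyword(df, pattern, column="keyword"):
    """Keep rows whose column contains a match for pattern (Column.rlike)."""
    return df.filter(df[column].rlike(pattern))


# Local sanity check with Python's re module. rlike uses Java regular
# expressions, which agree with re.search for a simple pattern like this.
pattern = build_pattern("Arizona", "hot")
print(pattern)                                          # Arizona.*hot
print(bool(re.search(pattern, "Arizona is very hot")))  # True
```

With a DataFrame dx as in the answer, the call would be dk = filter_by_keyword(dx, pattern).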

Spark Option: inferSchema vs header = true

December 21, 2023 by Tarik

The header and schema are separate things. Header: If the CSV file has a header (column names in the first row), then set header=true. This will use the first row in the CSV file as the dataframe’s column names. Setting header=false (the default option) will result in a dataframe with default column names: _c0, _c1, _c2, …
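The two options can be set independently when reading. This is a minimal PySpark sketch. The helper name read_csv_with_header is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def read_csv_with_header(spark, path, infer_schema=True):
    """Read a CSV file whose first row holds the column names."""
    # header=True: use the first row as column names instead of _c0, _c1, ...
    # inferSchema: when True, Spark scans the data to guess column types;
    # when False, every column is read as a string.
    return spark.read.csv(path, header=True, inferSchema=infer_schema)
```

Calling printSchema() on the result with infer_schema set to True and then False shows the same column names in both cases, with only the column types differing.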