java.lang.ClassCastException using lambda expressions in spark job on remote server

What you have here is a follow-up error which masks the original error. When lambda instances are serialized, they use writeReplace to swap their JRE-specific implementation for a persistent form, a SerializedLambda instance. When the SerializedLambda instance has been restored, its readResolve method will be invoked to reconstitute the appropriate lambda instance. … Read more
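
The excerpt above is cut off before the remedy. A common cause of this class of error in remote Spark jobs is that the jar defining the lambda's enclosing class never reaches the executors, so readResolve cannot find the class there. A minimal, hedged Scala sketch of that remedy (master URL and jar path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LambdaSerializationSketch {
  def main(args: Array[String]): Unit = {
    // Ship the application jar that defines the closures/lambdas to the executors,
    // so the deserializer can locate the defining class on the remote side.
    val conf = new SparkConf()
      .setAppName("lambda-serialization-sketch")
      .setMaster("spark://master-host:7077")   // hypothetical remote master
      .setJars(Seq("target/my-app.jar"))       // hypothetical path to the built application jar

    val sc = new SparkContext(conf)
    try {
      val doubled = sc.parallelize(1 to 10).map(_ * 2).collect()
      println(doubled.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```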

How to suppress Spark logging in unit tests?

Add the following code to the log4j.properties file inside the src/test/resources dir (create the file and directory if they do not exist):

```properties
# Change this to set Spark log level
log4j.logger.org.apache.spark=WARN

# Silence akka remoting
log4j.logger.Remoting=WARN

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
```

When I run my unit tests (I'm using JUnit … Read more
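
If editing log4j.properties is not an option, the same loggers can usually be silenced programmatically from a test fixture. A minimal sketch, assuming log4j 1.x is on the test classpath (as it is with Spark 1.x/2.x); the object name is made up:

```scala
import org.apache.log4j.{Level, Logger}

object QuietSparkLogging {
  // Call once before creating the SparkContext/SparkSession in tests.
  def silence(): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("Remoting").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty").setLevel(Level.WARN)
  }
}
```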

Read all files in a nested folder in Spark

If the directory structure is regular, let's say something like this:

```
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
```

you can use the * wildcard for each level of nesting, as shown below:

```python
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
```

… Read more
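
The same wildcard pattern works from the Scala API, and on Spark 3.0+ the reader can walk the tree itself via the recursiveFileLookup option. A minimal sketch (paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object ReadNestedFolders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-read").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One * per directory level, mirroring the PySpark example above.
    sc.wholeTextFiles("/folder/*/*/*.txt").map(_._1).collect().foreach(println)

    // Spark 3.0+: recurse through arbitrary nesting without spelling out each level.
    val lines = spark.read
      .option("recursiveFileLookup", "true")
      .textFile("/folder")
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}
```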

How to save a DataFrame as compressed (gzipped) CSV?

This code works for Spark 2.1, where .codec is not available:

```
df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)
```

For Spark 2.2, you can use the df.write.csv(..., codec="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
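
On Spark 2.0+ the built-in CSV writer can also compress output directly through its compression option, without the external com.databricks.spark.csv package. A minimal Scala sketch (the output directory is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object GzipCsvWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gzip-csv").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Each part file in the output directory ends up as part-*.csv.gz.
    df.write
      .option("header", "true")
      .option("compression", "gzip")   // built-in alternative to the full Hadoop codec class name
      .csv("/tmp/my_directory_gz")     // hypothetical output directory

    spark.stop()
  }
}
```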

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

You may try either of these two ways.

Option-1: JSON in a single line, as answered above by @Avishek Bhattacharya.

Option-2: Add an option to read multi-line JSON in the code, as follows. You can also read nested attributes, as shown below:

```scala
val df = spark.read.option("multiline", "true").json("C:\\data\\nested-data.json")
df.select("a.b").show()
```

Here is the output for Option-2:

20/07/29 23:14:35 … Read more
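
The error message quoted in the title also points at a second workaround when you really do want to look only at the corrupt rows: cache (or save) the parsed DataFrame before querying the internal _corrupt_record column on its own. A minimal sketch reusing the same hypothetical file path:

```scala
import org.apache.spark.sql.SparkSession

object CorruptRecordInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("corrupt-record").master("local[*]").getOrCreate()

    // Parse once and cache; without this, Spark 2.3+ rejects queries that reference
    // only the internal corrupt record column.
    val df = spark.read.json("C:\\data\\nested-data.json").cache()

    df.filter(df("_corrupt_record").isNotNull)
      .select("_corrupt_record")
      .show(truncate = false)

    spark.stop()
  }
}
```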

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)