java.lang.ClassCastException using lambda expressions in spark job on remote server

What you have here is a follow-up error which masks the original error. When lambda instances are serialized, they use writeReplace to swap their JRE-specific implementation for a persistent form, a SerializedLambda instance. When the SerializedLambda instance has been restored, its readResolve method will be invoked to reconstitute the appropriate lambda instance. … Read more
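
The excerpt above is cut off before the remedy. A common cause of this class of error in remote Spark jobs is that the jar defining the lambda's enclosing class never reaches the executors, so readResolve cannot find the class there. A minimal, hedged Scala sketch of that remedy (master URL and jar path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LambdaSerializationSketch {
  def main(args: Array[String]): Unit = {
    // Ship the application jar that defines the closures/lambdas to the executors,
    // so the deserializer can locate the defining class on the remote side.
    val conf = new SparkConf()
      .setAppName("lambda-serialization-sketch")
      .setMaster("spark://master-host:7077")   // hypothetical remote master
      .setJars(Seq("target/my-app.jar"))       // hypothetical path to the built application jar

    val sc = new SparkContext(conf)
    try {
      val doubled = sc.parallelize(1 to 10).map(_ * 2).collect()
      println(doubled.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```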

How to suppress Spark logging in unit tests?

Add the following code to the log4j.properties file inside the src/test/resources dir (create the file and directory if they do not exist):

```properties
# Change this to set Spark log level
log4j.logger.org.apache.spark=WARN

# Silence akka remoting
log4j.logger.Remoting=WARN

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
```

When I run my unit tests (I'm using JUnit … Read more
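
If editing log4j.properties is not an option, the same loggers can usually be silenced programmatically from a test fixture. A minimal sketch, assuming log4j 1.x is on the test classpath (as it is with Spark 1.x/2.x); the object name is made up:

```scala
import org.apache.log4j.{Level, Logger}

object QuietSparkLogging {
  // Call once before creating the SparkContext/SparkSession in tests.
  def silence(): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("Remoting").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty").setLevel(Level.WARN)
  }
}
```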

Read all files in a nested folder in Spark

If the directory structure is regular, let's say something like this:

```
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
```

you can use the * wildcard for each level of nesting, as shown below:

```python
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
```

… Read more
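
The same wildcard pattern works from the Scala API, and on Spark 3.0+ the reader can walk the tree itself via the recursiveFileLookup option. A minimal sketch (paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object ReadNestedFolders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-read").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One * per directory level, mirroring the PySpark example above.
    sc.wholeTextFiles("/folder/*/*/*.txt").map(_._1).collect().foreach(println)

    // Spark 3.0+: recurse through arbitrary nesting without spelling out each level.
    val lines = spark.read
      .option("recursiveFileLookup", "true")
      .textFile("/folder")
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}
```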

How to save a DataFrame as compressed (gzipped) CSV?

This code works for Spark 2.1, where .codec is not available:

```
df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)
```

For Spark 2.2, you can use the df.write.csv(..., codec="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
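
On Spark 2.0+ the built-in CSV writer can also compress output directly through its compression option, without the external com.databricks.spark.csv package. A minimal Scala sketch (the output directory is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object GzipCsvWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gzip-csv").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Each part file in the output directory ends up as part-*.csv.gz.
    df.write
      .option("header", "true")
      .option("compression", "gzip")   // built-in alternative to the full Hadoop codec class name
      .csv("/tmp/my_directory_gz")     // hypothetical output directory

    spark.stop()
  }
}
```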

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

You may try either of these two ways.

Option-1: JSON in a single line, as answered above by @Avishek Bhattacharya.

Option-2: Add an option to read multi-line JSON in the code, as follows. You can also read nested attributes, as shown below:

```scala
val df = spark.read.option("multiline", "true").json("C:\\data\\nested-data.json")
df.select("a.b").show()
```

Here is the output for Option-2:

20/07/29 23:14:35 … Read more
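
The error message quoted in the title also points at a second workaround when you really do want to look only at the corrupt rows: cache (or save) the parsed DataFrame before querying the internal _corrupt_record column on its own. A minimal sketch reusing the same hypothetical file path:

```scala
import org.apache.spark.sql.SparkSession

object CorruptRecordInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("corrupt-record").master("local[*]").getOrCreate()

    // Parse once and cache; without this, Spark 2.3+ rejects queries that reference
    // only the internal corrupt record column.
    val df = spark.read.json("C:\\data\\nested-data.json").cache()

    df.filter(df("_corrupt_record").isNotNull)
      .select("_corrupt_record")
      .show(truncate = false)

    spark.stop()
  }
}
```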

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)