Spark: Read file only if the path exists

You can filter out the irrelevant files as in @Psidom's answer. In Spark, the best way to do so is to use the internal Spark Hadoop configuration. Given that the Spark session variable is called "spark", you can do: import org.apache.hadoop.fs.FileSystem import org.apache.hadoop.fs.Path val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration) def testDirExist(path: String): Boolean = { val p … Read more
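A minimal sketch of how the truncated helper might be completed and used, assuming it simply wraps FileSystem.exists; the input paths below are hypothetical:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object ReadIfExists {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-if-exists").getOrCreate()

    // FileSystem backing the default scheme in the Hadoop configuration
    // (HDFS, S3A, local, ...).
    val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    def testDirExist(path: String): Boolean = hadoopfs.exists(new Path(path))

    // Hypothetical input paths; only the ones that exist are read.
    val candidates = Seq("/data/day1", "/data/day2")
    val existing = candidates.filter(testDirExist)

    if (existing.nonEmpty) {
      spark.read.parquet(existing: _*).show()
    }
  }
}
```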

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

What you're looking for is Seq[o.a.s.sql.Row]: import org.apache.spark.sql.Row val my_size = udf { subjects: Seq[Row] => subjects.size } Explanation: The current representation of ArrayType is, as you already know, WrappedArray, so Array won't work and it is better to stay on the safe side. According to the official specification, the local (external) type for StructType is … Read more
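A self-contained sketch of that UDF in action; the subjects column and its (name, grade) tuples are invented for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

object ArrayUdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("array-udf").getOrCreate()
    import spark.implicits._

    // An array-of-tuples column is encoded as ArrayType(StructType(...)),
    // so the UDF receives it as Seq[Row].
    val df = Seq(
      (1, Seq(("math", 5), ("physics", 4))),
      (2, Seq(("biology", 3)))
    ).toDF("id", "subjects")

    val my_size = udf { subjects: Seq[Row] => subjects.size }

    df.withColumn("n_subjects", my_size($"subjects")).show()
  }
}
```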

Scala: circular references in immutable data types?

Let's try to work it out step by step. As a rule of thumb, when creating an immutable object, all constructor parameters should be known at the point of instantiation. But let's cheat: pass the constructor parameters by name, then use lazy fields to delay evaluation, so we can create a bidirectional link between elements: … Read more
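A minimal, self-contained sketch of the by-name-plus-lazy trick; the Parent/Child names are invented for illustration:

```scala
// Constructor parameters are passed by name (=> Type), so they are not
// evaluated at construction time; lazy vals cache them on first access.
class Parent(childArg: => Child) {
  lazy val child: Child = childArg
}

class Child(parentArg: => Parent) {
  lazy val parent: Parent = parentArg
}

object CircularDemo extends App {
  // Neither side is evaluated until its lazy field is first touched,
  // so the mutual references resolve without a cycle at construction time.
  lazy val parent: Parent = new Parent(child)
  lazy val child: Child = new Child(parent)

  assert(parent.child.parent eq parent) // the link is truly bidirectional
  println("bidirectional link established")
}
```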

How to use Environment Variables in build.sbt?

You can get env variables with the commands mentioned in other responses, like: sys.env.get("USERNAME") sys.env.get("PASSWORD") but they return a value of type Option[String]. To turn this into a String you need to either pattern match or simply use sys.env.get("USERNAME").get, or sys.env.get("USERNAME").getOrElse("some default value") if you need to set a default value. Warning! Calling .get on an Option … Read more
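In a build.sbt this might look like the sketch below; the variable names, realm, and host are hypothetical:

```scala
// build.sbt
// getOrElse is the safe way to unwrap: it never throws on a missing variable.
val repoUser: String = sys.env.getOrElse("USERNAME", "anonymous")
val repoPass: String = sys.env.getOrElse("PASSWORD", "")

// By contrast, sys.env.get("USERNAME").get throws NoSuchElementException
// when the variable is unset.

credentials += Credentials(
  "Artifactory Realm",   // hypothetical realm
  "repo.example.com",    // hypothetical host
  repoUser,
  repoPass
)
```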

Using scala.util.control.Exception

Indeed – I also find it pretty confusing! Here's a problem where I have some property which may be a parseable date: def parse(s: String): Date = new SimpleDateFormat("yyyy-MM-dd").parse(s) def parseDate = parse(System.getProperty("foo.bar")) type PE = ParseException import scala.util.control.Exception._ val d1 = try { parseDate } catch { case e: PE => new Date … Read more
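A compilable sketch of how scala.util.control.Exception can replace that try/catch; the fallback to new Date mirrors the excerpt:

```scala
import java.text.{ParseException, SimpleDateFormat}
import java.util.Date
import scala.util.control.Exception._

object CatchingDemo extends App {
  def parse(s: String): Date = new SimpleDateFormat("yyyy-MM-dd").parse(s)

  // catching(...).opt converts a thrown ParseException into None.
  def parseDateOpt(s: String): Option[Date] =
    catching(classOf[ParseException]) opt parse(s)

  println(parseDateOpt("2024-01-15")) // Some(...)
  println(parseDateOpt("not-a-date")) // None

  // withApply supplies a fallback value, like the catch block in the excerpt.
  val d1: Date = catching(classOf[ParseException]).withApply(_ => new Date)(parse("nope"))
  println(d1)
}
```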

Spark: save (write) parquet to only one file

Use coalesce before the write operation: dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet") EDIT-1: Upon a closer look, the docs do warn about coalesce: "However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)." Therefore, as … Read more
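A runnable sketch contrasting coalesce(1) with the shuffle-based alternative; the output paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SingleFileWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("single-file-write").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).toDF("n")

    // coalesce(1): no shuffle, but upstream computation may collapse onto one node.
    df.coalesce(1).write.format("parquet").mode("append").save("/tmp/temp.parquet")

    // repartition(1): adds a shuffle, so upstream stages keep their parallelism
    // and only the final write runs as a single task.
    df.repartition(1).write.format("parquet").mode("overwrite").save("/tmp/temp2.parquet")
  }
}
```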

How does `isInstanceOf` work?

isInstanceOf checks whether there is a corresponding entry in the inheritance chain. The chain of A with T includes A, B, and T, so a.isInstanceOf[B] must be true. Edit: Actually, the generated bytecode calls Java's instanceof, so it would be a instanceof B in Java. A slightly more complex call like a.isInstanceOf[A with T] … Read more
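A minimal sketch using the A, B, and T names from the excerpt; the hierarchy itself is assumed:

```scala
class B
class A extends B
trait T

object InstanceOfDemo extends App {
  val a = new A with T

  // Both B and T appear in the inheritance chain of `A with T`,
  // so both checks compile down to Java's instanceof and return true.
  println(a.isInstanceOf[B]) // true
  println(a.isInstanceOf[T]) // true
}
```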

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)