Spark: Read file only if the path exists

You can filter out the irrelevant files as in @Psidom's answer. In Spark, the best way to do so is to use the internal Spark Hadoop configuration. Given that the Spark session variable is called "spark", you can do:

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

def testDirExist(path: String): Boolean = { val p … Read more
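For the PySpark side of the same idea, here is a minimal sketch. It assumes an active SparkSession named spark and reaches the Hadoop FileSystem through the py4j gateway (spark._jvm / spark._jsc), which is technically internal API; the candidate paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def path_exists(path: str) -> bool:
    # Get the Hadoop FileSystem bound to the session's Hadoop configuration
    # and test whether the given path exists on it.
    jvm = spark._jvm
    jconf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jconf)
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

candidate_paths = ["/data/day=2017-01-01", "/data/day=2017-01-02"]  # hypothetical paths
existing = [p for p in candidate_paths if path_exists(p)]
df = spark.read.parquet(*existing) if existing else None
```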

Parquet without Hadoop?

Investigating the same question, I found that apparently it's not possible at the moment. I found this git issue, which proposes decoupling Parquet from the Hadoop API. Apparently that has not been done yet. In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It … Read more

Spark save(write) parquet only one file

Use coalesce before the write operation:

dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")

EDIT-1: Upon a closer look, the docs do warn about coalesce:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

Therefore as … Read more
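As a hedged PySpark sketch of the same approach (the DataFrame and output path below are made up): coalescing to a single partition yields a single part file inside the output directory, at the cost of pushing all the data through one task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "value")  # stand-in DataFrame

(df.coalesce(1)                  # one partition -> one part-*.parquet file
   .write
   .format("parquet")
   .mode("append")
   .save("temp.parquet"))        # Spark still writes a directory containing that single part file
```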

Read multiple parquet files in a folder and write to single csv file using python

I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it’s not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on … Read more
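A minimal pandas sketch of what the answer is getting at, assuming the pyarrow engine is installed and using hypothetical file names: pandas can read a (possibly partitioned) Parquet directory directly, after which writing a single CSV is one call.

```python
import pandas as pd

# pd.read_parquet accepts a directory of parquet files (including partitioned
# datasets) when the pyarrow engine is available.
df = pd.read_parquet("parquet_folder/", engine="pyarrow")

df.to_csv("combined.csv", index=False)
```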

How to handle null values when writing to parquet from Spark

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per the comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS … Read more
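The same fix sketched in PySpark (column name and output path are illustrative): give the null an explicit type with a cast so the schema becomes a nullable DOUBLE instead of an untyped null column, which Parquet cannot store.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

untyped = spark.sql("SELECT null AS comments")
untyped.printSchema()  # the column has no real type, so a parquet write would fail

typed = spark.sql("SELECT CAST(null AS DOUBLE) AS comments")
# equivalently, on an existing DataFrame:
typed = untyped.withColumn("comments", lit(None).cast(DoubleType()))
typed.printSchema()    # comments: double (nullable = true)

typed.write.mode("overwrite").parquet("/tmp/comments_parquet")  # hypothetical path
```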

Reading parquet files from multiple directories in Pyspark

A little late, but I found this while I was searching and it may help someone else… You might also try unpacking the argument list to spark.read.parquet():

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is cool because you don't … Read more
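A short note on why the basePath option matters here, sketched with the same hypothetical bucket: when the globbed paths are resolved relative to basePath, Spark can infer the partition columns from the directory names and expose them in the schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

basePath = "s3://bucket/"  # hypothetical bucket, as in the excerpt above
paths = [
    "s3://bucket/partition_value1=*/partition_value2=2017-04-*",
    "s3://bucket/partition_value1=*/partition_value2=2017-05-*",
]

df = spark.read.option("basePath", basePath).parquet(*paths)
# With basePath set, partition_value1 and partition_value2 show up as columns
# in df.printSchema(), alongside the data columns stored in the parquet files.
```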