Spark: Read file only if the path exists

You can filter out the irrelevant files as in @Psidom's answer. In Spark, the best way to do so is to use the internal Spark Hadoop configuration. Given that the Spark session variable is called "spark", you can do:

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

def testDirExist(path: String): Boolean = { val p … Read more
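For the PySpark side of the same idea, here is a minimal sketch. It assumes an active SparkSession named spark and reaches the Hadoop FileSystem through the py4j gateway (spark._jvm / spark._jsc), which is technically internal API; the candidate paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def path_exists(path: str) -> bool:
    # Get the Hadoop FileSystem bound to the session's Hadoop configuration
    # and test whether the given path exists on it.
    jvm = spark._jvm
    jconf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jconf)
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

candidate_paths = ["/data/day=2017-01-01", "/data/day=2017-01-02"]  # hypothetical paths
existing = [p for p in candidate_paths if path_exists(p)]
df = spark.read.parquet(*existing) if existing else None
```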

Parquet without Hadoop?

Investigating the same question, I found that apparently it's not possible at the moment. I found this git issue, which proposes decoupling Parquet from the Hadoop API. Apparently that has not been done yet. In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It … Read more

Spark save(write) parquet only one file

Use coalesce before the write operation:

dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")

EDIT-1: Upon a closer look, the docs do warn about coalesce:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

Therefore as … Read more
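As a hedged PySpark sketch of the same approach (the DataFrame and output path below are made up): coalescing to a single partition yields a single part file inside the output directory, at the cost of pushing all the data through one task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "value")  # stand-in DataFrame

(df.coalesce(1)                  # one partition -> one part-*.parquet file
   .write
   .format("parquet")
   .mode("append")
   .save("temp.parquet"))        # Spark still writes a directory containing that single part file
```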

Read multiple parquet files in a folder and write to single csv file using python

I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it’s not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on … Read more
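A minimal pandas sketch of what the answer is getting at, assuming the pyarrow engine is installed and using hypothetical file names: pandas can read a (possibly partitioned) Parquet directory directly, after which writing a single CSV is one call.

```python
import pandas as pd

# pd.read_parquet accepts a directory of parquet files (including partitioned
# datasets) when the pyarrow engine is available.
df = pd.read_parquet("parquet_folder/", engine="pyarrow")

df.to_csv("combined.csv", index=False)
```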

How to handle null values when writing to parquet from Spark

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per the comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS … Read more
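The same fix sketched in PySpark (column name and output path are illustrative): give the null an explicit type with a cast so the schema becomes a nullable DOUBLE instead of an untyped null column, which Parquet cannot store.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

untyped = spark.sql("SELECT null AS comments")
untyped.printSchema()  # the column has no real type, so a parquet write would fail

typed = spark.sql("SELECT CAST(null AS DOUBLE) AS comments")
# equivalently, on an existing DataFrame:
typed = untyped.withColumn("comments", lit(None).cast(DoubleType()))
typed.printSchema()    # comments: double (nullable = true)

typed.write.mode("overwrite").parquet("/tmp/comments_parquet")  # hypothetical path
```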

Reading parquet files from multiple directories in Pyspark

A little late, but I found this while I was searching and it may help someone else… You might also try unpacking the argument list to spark.read.parquet():

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is cool because you don't … Read more
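A short note on why the basePath option matters here, sketched with the same hypothetical bucket: when the globbed paths are resolved relative to basePath, Spark can infer the partition columns from the directory names and expose them in the schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

basePath = "s3://bucket/"  # hypothetical bucket, as in the excerpt above
paths = [
    "s3://bucket/partition_value1=*/partition_value2=2017-04-*",
    "s3://bucket/partition_value1=*/partition_value2=2017-05-*",
]

df = spark.read.option("basePath", basePath).parquet(*paths)
# With basePath set, partition_value1 and partition_value2 show up as columns
# in df.printSchema(), alongside the data columns stored in the parquet files.
```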