What are the pros and cons of the Apache Parquet format compared to other formats?

I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we’re all used to — text files and delimited formats like CSV or TSV. Avro is slightly cooler than those because its schema can change over time, e.g. adding or removing columns from a record. Other … Read more
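A minimal sketch of the practical upside of a column-oriented format, using hypothetical file and column names: Parquet lets a reader pull only the columns it needs, while a CSV reader still has to parse every row in full.

import pandas as pd

# Reads only the two requested columns from the Parquet file's column chunks.
df_cols = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])

# A CSV has no columnar layout; usecols still parses every line of the file.
df_csv = pd.read_csv("events.csv", usecols=["user_id", "timestamp"])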

Convert Parquet to CSV

You can do this by using the Python packages pandas and pyarrow (pyarrow is an optional dependency of pandas that you need for this feature).

import pandas as pd
df = pd.read_parquet('filename.parquet')
df.to_csv('filename.csv')

When you need to make modifications to the contents of the file, you can use standard pandas operations on df.
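A short sketch of that workflow with a hypothetical filter step standing in for "standard pandas operations on df"; the column name is an assumption for illustration.

import pandas as pd

df = pd.read_parquet('filename.parquet')       # requires pyarrow (or fastparquet)
df = df[df['amount'] > 0]                      # hypothetical modification before export
df.to_csv('filename.csv', index=False)         # index=False avoids an extra index column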

Is it better to have one large parquet file or lots of smaller parquet files?

Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default), because snappy-compressed Parquet files are splittable (2). Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered. .option("compression", "gzip") is the option to override … Read more
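A rough PySpark sketch of both points: repartitioning toward the ~1 GB-per-file guideline and overriding the default codec. The paths, DataFrame, and file count are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetSizing").getOrCreate()
df = spark.read.parquet("s3://bucket/input/")      # hypothetical source

# Repartition so each output file lands near the ~1 GB guideline.
num_files = 8                                      # e.g. estimated total size / 1 GB
(df.repartition(num_files)
   .write
   .option("compression", "gzip")                  # override the snappy default
   .parquet("s3://bucket/output/"))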

Spark SQL – difference between gzip vs snappy vs lzo compression formats

Compression ratio: GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. General usage: GZIP is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently. Snappy often performs better than LZO. … Read more
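A small pandas/pyarrow sketch for eyeballing the ratio-vs-CPU trade-off on your own data; the DataFrame here is synthetic, and LZO is not built into pyarrow, so only snappy and gzip are compared.

import os
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randn(1_000_000)})

for codec in ["snappy", "gzip"]:
    path = f"sample_{codec}.parquet"
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")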

Methods for writing Parquet files using Python?

Update (March 2017): There are currently two libraries capable of writing Parquet files: fastparquet and pyarrow. Both of them still seem to be under heavy development and they come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need. OLD ANSWER: As of … Read more
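A minimal sketch of both libraries, driven either through pandas' engine parameter or by calling pyarrow directly; the DataFrame is a toy example and both packages must be installed.

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

df.to_parquet("out_pyarrow.parquet", engine="pyarrow")
df.to_parquet("out_fastparquet.parquet", engine="fastparquet")

# Or call pyarrow directly:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(table, "out_direct.parquet")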

Difference between Apache parquet and arrow

Parquet is a columnar file format for data serialization. Reading a Parquet file requires decompressing and decoding its contents into some kind of in-memory data structure. It is designed to be space/IO-efficient at the expense of CPU utilization for decoding. It does not provide any data structures for in-memory computing. Parquet is a streaming format … Read more
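A short sketch of that relationship: pyarrow decodes the on-disk Parquet file (storage format) into an Arrow Table (the in-memory columnar structure used for computing). The file name is hypothetical.

import pyarrow.parquet as pq

table = pq.read_table("data.parquet")   # Parquet bytes -> Arrow Table in memory
print(table.schema)
df = table.to_pandas()                  # further conversion for pandas-style compute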

How to partition and write DataFrame in Spark without deleting partitions with no new data?

This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

So, my Spark session is configured like this:

spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
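A short sketch of what the setting does in practice: with dynamic mode, an overwrite of a partitioned table only replaces the partitions present in the new data, leaving the rest untouched. Paths and partition columns are assumptions for illustration.

from pyspark.sql import SparkSession

# Reuses the dynamic-overwrite configuration shown above.
spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

new_data = spark.read.parquet('s3://bucket/staging/')       # hypothetical input

(new_data.write
    .mode('overwrite')
    .partitionBy('year', 'month', 'day')                    # hypothetical partition columns
    .parquet('s3://bucket/warehouse/events/'))               # only matching partitions are replaced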

Reading DataFrame from partitioned parquet file

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass two paths like:

val dataframe = sqlContext
  .read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
                "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, like say country=XX, country will automatically be added as a column in the dataframe. EDIT: As of Spark 1.6 one needs to … Read more
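For reference, a PySpark equivalent of the Scala snippet above, assuming the same hypothetical partition layout.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.parquet(
    "file:///your/path/data=jDD/year=2015/month=10/day=5/",
    "file:///your/path/data=jDD/year=2015/month=10/day=6/",
)
# Any deeper partition directories (e.g. country=XX) show up as columns.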