How to read partitioned Parquet files from S3 using PyArrow in Python

I managed to get this working with the latest releases of fastparquet and s3fs. Here is the code:

    import s3fs
    import fastparquet as fp

    s3 = s3fs.S3FileSystem()
    fs = s3fs.core.S3FileSystem()

    # mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
    s3_path = "mybucket/data_folder/*/*/*.parquet"
    all_paths_from_s3 = fs.glob(path=s3_path)

    myopen = s3.open
    # use s3fs as the filesystem
    fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
    # convert to pandas dataframe
    …
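The example path in the comment above uses Hive-style `key=value` directory names. As a rough sketch of how those partition values could be recovered from the paths a glob returns, the helper below splits a path into its partition columns using only the standard library. `partition_values` is a hypothetical name introduced here for illustration; it is not part of s3fs or fastparquet, and it keeps every value as a string.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Return the key=value directory components of a path as a dict.

    Hypothetical helper for illustration; values are kept as strings.
    """
    values = {}
    # Skip the final component, which is the file name.
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


example = "mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet"
print(partition_values(example))
```

Directories without an `=` (such as the bucket and `data_folder`) are ignored, so a path with no partition directories yields an empty dict.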

Inspect Parquet from command line

You can use parquet-tools with the cat command and the --json option to view the files in JSON format without making a local copy. Here is an example:

    parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

This prints out the data in JSON format:

    {"name":"gil","age":48,"city":"london"}
    {"name":"jane","age":30,"city":"new york"}
    {"name":"jordan","age":18,"city":"toronto"}

Disclaimer: this was tested in Cloudera CDH 5.12.0.
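If the JSON output needs further processing, it can be parsed one line at a time. The sketch below assumes the tool emits one JSON object per line, as in the sample output above; `parse_json_lines` is a hypothetical helper name, and the sample text is pasted in rather than captured from a real command.

```python
import json

# Sample output copied from the example above, one JSON object per line.
output = """\
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
"""


def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(output)
# Keep only the names of people older than 20.
names_over_20 = [r["name"] for r in records if r["age"] > 20]
print(names_over_20)
```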

Avro vs. Parquet

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.

HBase is useful when frequent updating …
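The row versus column distinction can be pictured with plain Python structures. This is only a toy model of the two layouts, not how either format is encoded on disk; the records are made up, and `to_columns` is a hypothetical helper that assumes every record has the same keys.

```python
# Three made-up records stored row by row (the row-based layout).
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 0.7},
    {"id": 3, "name": "c", "score": 0.9},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Reading one field from the row layout walks through every whole record,
scores_from_rows = [row["score"] for row in rows]
# while the column layout already holds that field as a single list.
scores_from_columns = columns["score"]

print(scores_from_rows == scores_from_columns)
```

Fetching a whole record is a single lookup in `rows`, while fetching a single field is a single lookup in `columns`, which mirrors the trade-off described above.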

How to read a Parquet file into Pandas DataFrame?

pandas 0.21 introduced new functions for Parquet:

    import pandas as pd
    pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

    import pandas as pd
    pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

What are the differences between feather and parquet?

The Parquet format is designed for long-term storage, whereas Arrow is intended more for short-term or ephemeral storage. (Arrow may be more suitable for long-term storage after the 1.0.0 release, since the binary format will be stable then.)

Parquet is more expensive to write than Feather as it features more layers of encoding and …

What are the pros and cons of parquet format compared to other formats?

I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to: text files and delimited formats like CSV and TSV. Avro is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other …