parquet – Page 4 – Tarik Billa

How to read partitioned parquet files from S3 using pyarrow in python

March 26, 2023 by Tarik

I managed to get this working with the latest release of fastparquet & s3fs. Below is the code for the same:

    import s3fs
    import fastparquet as fp

    s3 = s3fs.S3FileSystem()
    fs = s3fs.core.S3FileSystem()

    # mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
    s3_path = "mybucket/data_folder/*/*/*.parquet"
    all_paths_from_s3 = fs.glob(path=s3_path)

    myopen = s3.open
    # use s3fs as the filesystem
    fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
    # convert to pandas dataframe

… Read more
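The example path in the comment above uses key=value directory names (serial_number=1, cur_date=20-12-2012). As a minimal sketch of how those partition values could be recovered from a path returned by the glob, here is a small standard-library helper. The function name partition_values is hypothetical and is not part of fastparquet or s3fs.

```python
def partition_values(path: str) -> dict:
    """Collect key=value directory segments from a partitioned path."""
    values = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


# Path taken from the comment in the excerpt above.
example = (
    "mybucket/data_folder/serial_number=1/cur_date=20-12-2012/"
    "abcdsd0324324.snappy.parquet"
)
print(partition_values(example))
```

The values come back as strings; converting them to numbers or dates is left to the caller.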

How to view Apache Parquet file in Windows? [closed]

March 19, 2023 by Tarik

What is Apache Parquet? Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table, where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a … Read more

A comparison between fastparquet and pyarrow?

March 1, 2023 by Tarik

I used both fastparquet and pyarrow for converting protobuf data to Parquet and for querying it in S3 using Athena. Both worked. However, in my use case, which is a Lambda function, the package zip file has to be lightweight, so I went ahead with fastparquet. (The fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, … Read more

Parquet vs ORC vs ORC with Snappy

January 19, 2023 by Tarik

I would say that both of these formats have their own advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does (See here). Apache ORC might be better if your file structure is flattened. And as far as I know, Parquet does not … Read more

Inspect Parquet from command line

December 29, 2022 by Tarik

You can use parquet-tools with the command cat and the --json option in order to view the files without a local copy and in the JSON format. Here is an example:

    parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

This prints out the data in JSON format:

    {"name":"gil","age":48,"city":"london"}
    {"name":"jane","age":30,"city":"new york"}
    {"name":"jordan","age":18,"city":"toronto"}

Disclaimer: this was tested in Cloudera CDH 5.12.0
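Since the output above is one JSON object per line, it can be loaded back into Python with the standard json module. The following is a rough sketch that parses the sample output shown above; the helper name parse_json_lines is hypothetical, and the text is pasted in as a string rather than captured from a real parquet-tools run.

```python
import json

# Sample output copied from the example above.
sample_output = """\
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
"""


def parse_json_lines(text):
    """Turn one-JSON-object-per-line text into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(sample_output)
print([record["name"] for record in records])
```

In practice the text would come from the command's standard output, for example by piping it to a script that reads sys.stdin.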

Avro vs. Parquet

December 18, 2022 by Tarik

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of a lot of columns but you are interested in a subset of columns, then you can use Parquet.

HBase is useful when frequent updating … Read more
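The row-based versus column-based distinction can be pictured with plain Python structures. This is only an in-memory illustration with made-up records, not the actual Avro or Parquet encoding; the helper names to_columns and read_columns are hypothetical.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"name": "gil", "age": 48, "city": "london"},
    {"name": "jane", "age": 30, "city": "new york"},
    {"name": "jordan", "age": 18, "city": "toronto"},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    return {key: [record[key] for record in records] for key in records[0]}


def read_columns(columns, wanted):
    """Pick only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


# Column-oriented: all values of one field sit together.
columns = to_columns(rows)
print(read_columns(columns, ["age"]))
```

Reading a whole record is natural in the row layout, while reading a subset of fields across all records is natural in the column layout.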

How to read a Parquet file into Pandas DataFrame?

November 23, 2022 by Tarik

pandas 0.21 introduces new functions for Parquet:

    import pandas as pd
    pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

    import pandas as pd
    pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains: These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
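If it is not known in advance which engine is installed, one option is to check before calling read_parquet. The sketch below uses only the standard library to do that check; pick_engine and read_any are hypothetical helper names, and the pandas call inside read_any assumes pandas plus at least one engine is installed. pandas' own default, engine="auto", behaves similarly by trying pyarrow first and falling back to fastparquet.

```python
import importlib.util


def pick_engine():
    """Return the first installed Parquet engine name, or None."""
    for name in ("pyarrow", "fastparquet"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def read_any(path):
    """Read a Parquet file with whichever engine is available."""
    engine = pick_engine()
    if engine is None:
        raise ImportError("neither pyarrow nor fastparquet is installed")
    import pandas as pd  # imported lazily so the check above runs without pandas

    return pd.read_parquet(path, engine=engine)


print(pick_engine())
```

Calling read_any('example_pa.parquet') would then return a DataFrame using the detected engine.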

What are the differences between feather and parquet?

October 26, 2022 by Tarik

Parquet format is designed for long-term storage, whereas Arrow is intended more for short-term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then). Parquet is more expensive to write than Feather as it features more layers of encoding and … Read more

What are the pros and cons of parquet format compared to other formats?

October 21, 2022 by Tarik

I think the main difference I can describe relates to record oriented vs. column oriented formats. Record oriented formats are what we’re all used to — text files, delimited formats like CSV, TSV. AVRO is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other … Read more