Using pyarrow how do you append to parquet file?

I ran into the same issue and I think I was able to solve it using the following:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # this is the number of lines
pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk ...
```

… Read more

How to read partitioned parquet files from S3 using pyarrow in python

I managed to get this working with the latest release of fastparquet & s3fs. Below is the code for the same:

```python
import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()
fs = s3fs.core.S3FileSystem()

# mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
s3_path = "mybucket/data_folder/*/*/*.parquet"
all_paths_from_s3 = fs.glob(path=s3_path)

myopen = s3.open
# use s3fs as the filesystem
fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
# convert to pandas dataframe ...
```

… Read more

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

You should use the s3fs module as proposed by yjk21. However, calling ParquetDataset gives you a pyarrow.parquet.ParquetDataset object, not a DataFrame. To get the pandas DataFrame, apply .read_pandas().to_pandas() to it:

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
```

What are the differences between feather and parquet?

Parquet format is designed for long-term storage, whereas Arrow is more intended for short-term or ephemeral storage. (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then.) Parquet is more expensive to write than Feather as it features more layers of encoding and … Read more