I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it’s not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on the size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
pd.read_parquet(parquet_file)
for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
df = pd.read_parquet(parquet_path)
write_header = i == 0 # write header only on the 0th file
write_mode="w" if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file that opens the target CSV file in "a+"
mode at the onset, keeping the file handle scanned to the end of the file for each write/append (I believe this works, but haven’t actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
df = pd.read_parquet(parquet_path)
write_header = i == 0 # write header only on the 0th file
df.to_csv(csv_handle, header=write_header)