What is the fastest way to output large DataFrame into a CSV file?

Question

Lev. Pandas has rewritten to_csv to make a big improvement in native speed. The process is now i/o bound, accounts for many subtle dtype issues, and quote cases. Here is our performance results vs. 0.10.1 (in the upcoming 0.11) release. These are in ms, lower ratio is better.

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_to_csv2 (100k) rows                 190.5260   2244.4260     0.0849
write_csv_standard  (10k rows)             38.1940    234.2570     0.1630
frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286
frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951

So Throughput for a single dtype (e.g. floats), not too wide is about 20M rows / min, here is your example from above.

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype="float64")}) 
In [13]: df['B'] = df['A'] + 1.0   
In [14]: df['C'] = df['A'] + 2.0
In [15]: df['D'] = df['A'] + 2.0
In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop

Leave a Comment Cancel reply