Lev, pandas has rewritten to_csv, which gives a big improvement in native speed. The process is now I/O bound, and the rewrite handles many subtle dtype issues and quoting cases. Here are our performance results against 0.10.1 (measured on the upcoming 0.11 release). Times are in ms; a lower ratio is better.
Results:
name                                     t_head  t_baseline   ratio
frame_to_csv2 (100k rows)              190.5260   2244.4260  0.0849
write_csv_standard (10k rows)           38.1940    234.2570  0.1630
frame_to_csv_mixed (10k rows, mixed)   369.0670   1123.0412  0.3286
frame_to_csv (3k rows, wide)           112.2720    226.7549  0.4951
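For reading the table: the ratio column is t_head / t_baseline, so its inverse is the speedup over 0.10.1. A quick sketch that recomputes both from the numbers above (the dictionary and the helper name ratio_and_speedup are mine for illustration, not part of the benchmark suite):

```python
# (t_head ms, t_baseline ms) copied from the results table above
timings = {
    "frame_to_csv2": (190.5260, 2244.4260),
    "write_csv_standard": (38.1940, 234.2570),
    "frame_to_csv_mixed": (369.0670, 1123.0412),
    "frame_to_csv": (112.2720, 226.7549),
}


def ratio_and_speedup(t_head, t_baseline):
    """Return (ratio, speedup); ratio < 1 means the new code is faster."""
    ratio = t_head / t_baseline
    return ratio, 1.0 / ratio


for name, (t_head, t_baseline) in timings.items():
    ratio, speedup = ratio_and_speedup(t_head, t_baseline)
    print(f"{name:<20} ratio={ratio:.4f} speedup={speedup:.1f}x")
```

So the 100k-row case comes out a bit under 12x faster, and the wide 3k-row case about 2x.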
So throughput for a single dtype (e.g. floats) on a frame that is not too wide is about 20M rows / min. Here is your example from above:
In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype="float64")})
In [13]: df['B'] = df['A'] + 1.0
In [14]: df['C'] = df['A'] + 2.0
In [15]: df['D'] = df['A'] + 2.0
In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop
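To tie that timing back to the rows / min figure, here is the arithmetic as a small sketch (rows_per_minute is just a helper name I made up; the row count and the 119 s are taken from the session above):

```python
def rows_per_minute(n_rows, seconds):
    """Convert a row count and wall-clock time into rows per minute."""
    return n_rows / seconds * 60.0


n_rows = 45_000_000   # rows in df above
elapsed = 119.0       # seconds reported by %timeit above

rate = rows_per_minute(n_rows, elapsed)
print(f"{rate / 1e6:.1f}M rows / min")  # prints 22.7M rows / min
```

That is for four float64 columns plus the index, so it is in line with the roughly 20M rows / min estimate.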