Convert large csv to hdf5

Question

Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename="/tmp/test.h5"

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode="w", format="table")
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

Note that you need to use format="table" in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.

Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.

Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:

import numpy as np
import pandas as pd

filename="/tmp/test.h5"
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

Leave a Comment Cancel reply