Convert large CSV to HDF5

Use append=True in the call to to_hdf:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df
```

…
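The excerpt cuts off before the append step itself. As a minimal sketch of the continuation (not the original answer's code; the CSV path and chunk size here are hypothetical), the large CSV would be read in chunks, with each chunk appended to the same table:

```python
# Hypothetical continuation: stream the CSV in chunks and append each
# chunk to the 'data' table created above.
for chunk in pd.read_csv('/tmp/big.csv', chunksize=10**5):
    chunk.to_hdf(filename, key='data', append=True, format='table')
```

Because the initial write used format='table', every appended chunk lands in the same queryable PyTables table rather than overwriting it.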

Combining HDF5 files

This is actually one of the use cases of HDF5. If you just want to be able to access all the datasets from a single file, and don’t care how they’re actually stored on disk, you can use external links. From the HDF5 website: External links allow a group to include objects in another HDF5 file …
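As a minimal sketch of external links with h5py (the file and object names are hypothetical): combined.h5 gains a link that transparently resolves to a dataset physically stored in part1.h5.

```python
import h5py
import numpy as np

# Hypothetical files: create part1.h5 with a dataset in it...
with h5py.File('part1.h5', 'w') as f:
    f['data'] = np.arange(10)

# ...then make combined.h5 point at that dataset without copying bytes.
with h5py.File('combined.h5', 'w') as f:
    f['part1_data'] = h5py.ExternalLink('part1.h5', '/data')

# Accessing the link transparently opens part1.h5 behind the scenes.
with h5py.File('combined.h5', 'r') as f:
    print(f['part1_data'][:])  # [0 1 2 3 4 5 6 7 8 9]
```

Note that the link is resolved at access time, so part1.h5 must still be reachable at the recorded path whenever combined.h5 is read.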

Incremental writes to HDF5 with h5py

Per the FAQ, you can expand the dataset using dset.resize. For example:

```python
import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)

with h5py.File(path, 'a') as f:
    # maxshape=(None,) makes the axis growable; chunked storage is
    # required for resizable datasets.
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)  # (100000,)

    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
```

…
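In practice this grow-and-write pattern is often wrapped in a small helper. Here is a sketch (not from the original answer; the function name and file path are hypothetical) that creates the dataset on first use and extends it on every later call:

```python
import h5py
import numpy as np

def append_block(f, name, block, chunk=10**4):
    """Append the 1-D array `block` to the resizable dataset `name` in `f`."""
    if name not in f:
        # First call: create a growable, chunked dataset from the block.
        f.create_dataset(name, data=block, maxshape=(None,), chunks=(chunk,))
    else:
        dset = f[name]
        n = dset.shape[0]
        dset.resize(n + block.shape[0], axis=0)
        dset[n:] = block

with h5py.File('/tmp/stream.h5', 'w') as f:
    for _ in range(5):
        append_block(f, 'voltage', np.random.random(1000))
    print(f['voltage'].shape)  # (5000,)
```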

Improve pandas (PyTables?) HDF5 table write performance

Here is a similar comparison I just did. It's about 1/3 of the data: 10M rows. The final size is about 1.3 GB. I define 3 timing functions:

Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format:

```python
def f(df):
    store = pd.HDFStore('test.h5', 'w')
    store['df'] = df
    store.close()
```

Write in the Table …
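The excerpt stops before the remaining timing functions. As a sketch of what the Table-format test plausibly looks like (not the original code; reconstructed from the standard HDFStore API), store.append writes a queryable PyTables Table instead of a fixed Array:

```python
import pandas as pd

# Sketch of the Table-format timing function: append() always writes
# in PyTables Table format, which is slower to write but supports
# queries and incremental appends.
def g(df):
    store = pd.HDFStore('test.h5', 'w')
    store.append('df', df)
    store.close()
```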

Read HDF5 file into numpy array

The easiest thing is to use the .value attribute of the HDF5 dataset:

```python
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value  # `data` is now an ndarray.
```

You can also slice the dataset, which produces an actual ndarray with the requested data:

```python
>>> hf['dataset_name'][:10]  # produces ndarray as well
```

But keep in mind that …
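One caveat worth adding: Dataset.value was deprecated for years and has been removed in h5py 3.x. A sketch of the modern equivalent ('dataset_name' is a placeholder):

```python
import h5py

with h5py.File('/path/to/file', 'r') as hf:
    data = hf['dataset_name'][()]   # whole dataset as an ndarray
    head = hf['dataset_name'][:10]  # first 10 entries as an ndarray
```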

HDF5 taking more space than CSV?

Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651

Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are really more efficiently represented in binary (than as a text representation). …
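To see the binary-versus-text effect directly, here is a minimal sketch (the file paths are hypothetical and exact sizes vary by platform): a float64 occupies 8 bytes in HDF5 but typically 18+ characters, plus delimiters, in a CSV.

```python
import os
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10**6, 3), columns=list('abc'))
df.to_csv('/tmp/demo.csv', index=False)
df.to_hdf('/tmp/demo.h5', key='df', mode='w', format='table')

print(os.path.getsize('/tmp/demo.csv'))  # text: roughly 19 bytes per float
print(os.path.getsize('/tmp/demo.h5'))   # binary: 8 bytes per float + overhead
```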