Convert large CSV to HDF5

Use append=True in the call to to_hdf:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df
```

…
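The excerpt cuts off before the append step itself. As a minimal sketch of the continuation (not the original answer's code; the CSV path and chunk size here are hypothetical), the large CSV would be read in chunks, with each chunk appended to the same table:

```python
# Hypothetical continuation: stream the CSV in chunks and append each
# chunk to the 'data' table created above.
for chunk in pd.read_csv('/tmp/big.csv', chunksize=10**5):
    chunk.to_hdf(filename, key='data', append=True, format='table')
```

Because the initial write used format='table', every appended chunk lands in the same queryable PyTables table rather than overwriting it.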

Combining HDF5 files

This is actually one of the use cases of HDF5. If you just want to be able to access all the datasets from a single file, and don’t care how they’re actually stored on disk, you can use external links. From the HDF5 website: External links allow a group to include objects in another HDF5 file …
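As a minimal sketch of external links with h5py (the file and object names are hypothetical): combined.h5 gains a link that transparently resolves to a dataset physically stored in part1.h5.

```python
import h5py
import numpy as np

# Hypothetical files: create part1.h5 with a dataset in it...
with h5py.File('part1.h5', 'w') as f:
    f['data'] = np.arange(10)

# ...then make combined.h5 point at that dataset without copying bytes.
with h5py.File('combined.h5', 'w') as f:
    f['part1_data'] = h5py.ExternalLink('part1.h5', '/data')

# Accessing the link transparently opens part1.h5 behind the scenes.
with h5py.File('combined.h5', 'r') as f:
    print(f['part1_data'][:])  # [0 1 2 3 4 5 6 7 8 9]
```

Note that the link is resolved at access time, so part1.h5 must still be reachable at the recorded path whenever combined.h5 is read.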

Incremental writes to HDF5 with h5py

Per the FAQ, you can expand the dataset using dset.resize. For example:

```python
import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)

with h5py.File(path, 'a') as f:
    # maxshape=(None,) makes the axis growable; chunked storage is
    # required for resizable datasets.
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)  # (100000,)

    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
```

…
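In practice this grow-and-write pattern is often wrapped in a small helper. Here is a sketch (not from the original answer; the function name and file path are hypothetical) that creates the dataset on first use and extends it on every later call:

```python
import h5py
import numpy as np

def append_block(f, name, block, chunk=10**4):
    """Append the 1-D array `block` to the resizable dataset `name` in `f`."""
    if name not in f:
        # First call: create a growable, chunked dataset from the block.
        f.create_dataset(name, data=block, maxshape=(None,), chunks=(chunk,))
    else:
        dset = f[name]
        n = dset.shape[0]
        dset.resize(n + block.shape[0], axis=0)
        dset[n:] = block

with h5py.File('/tmp/stream.h5', 'w') as f:
    for _ in range(5):
        append_block(f, 'voltage', np.random.random(1000))
    print(f['voltage'].shape)  # (5000,)
```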

Improve pandas (PyTables?) HDF5 table write performance

Here is a similar comparison I just did. It's about 1/3 of the data: 10M rows. The final size is about 1.3 GB. I define 3 timing functions:

Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format:

```python
def f(df):
    store = pd.HDFStore('test.h5', 'w')
    store['df'] = df
    store.close()
```

Write in the Table …
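The excerpt stops before the remaining timing functions. As a sketch of what the Table-format test plausibly looks like (not the original code; reconstructed from the standard HDFStore API), store.append writes a queryable PyTables Table instead of a fixed Array:

```python
import pandas as pd

# Sketch of the Table-format timing function: append() always writes
# in PyTables Table format, which is slower to write but supports
# queries and incremental appends.
def g(df):
    store = pd.HDFStore('test.h5', 'w')
    store.append('df', df)
    store.close()
```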

Read HDF5 file into numpy array

The easiest thing is to use the .value attribute of the HDF5 dataset:

```python
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value  # `data` is now an ndarray.
```

You can also slice the dataset, which produces an actual ndarray with the requested data:

```python
>>> hf['dataset_name'][:10]  # produces ndarray as well
```

But keep in mind that …
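One caveat worth adding: Dataset.value was deprecated for years and has been removed in h5py 3.x. A sketch of the modern equivalent ('dataset_name' is a placeholder):

```python
import h5py

with h5py.File('/path/to/file', 'r') as hf:
    data = hf['dataset_name'][()]   # whole dataset as an ndarray
    head = hf['dataset_name'][:10]  # first 10 entries as an ndarray
```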

HDF5 taking more space than CSV?

Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651

Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are really more efficiently represented in binary (than as a text representation). …
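To see the binary-versus-text effect directly, here is a minimal sketch (the file paths are hypothetical and exact sizes vary by platform): a float64 occupies 8 bytes in HDF5 but typically 18+ characters, plus delimiters, in a CSV.

```python
import os
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10**6, 3), columns=list('abc'))
df.to_csv('/tmp/demo.csv', index=False)
df.to_hdf('/tmp/demo.h5', key='df', mode='w', format='table')

print(os.path.getsize('/tmp/demo.csv'))  # text: roughly 19 bytes per float
print(os.path.getsize('/tmp/demo.h5'))   # binary: 8 bytes per float + overhead
```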