Converting numpy solution into dask (numpy indexing doesn’t work in dask)

Chunk random_days_panel instead of historical_data and use da.map_blocks:

def dask_way(sim_count, sim_days, hist_days):
    # shared historical data
    # on a cluster you’d load this on each worker, e.g. from a NPZ file
    historical_data = np.random.normal(111.51, 10, size=hist_days)

    random_days_panel = da.random.randint(
        1, hist_days, size=(1, 1, sim_count, sim_days)
    )
    future_panel = da.map_blocks(
        lambda chunk: historical_data[chunk], random_days_panel, dtype=float
    )
    … Read more
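A self-contained sketch that completes the truncated snippet above; the imports, return statement, compute() call, and example sizes are my additions, not part of the quoted answer:

import numpy as np
import dask.array as da

def dask_way(sim_count, sim_days, hist_days):
    # shared historical data; on a cluster each worker would load its own copy
    historical_data = np.random.normal(111.51, 10, size=hist_days)

    # chunk the panel of random indices, not the historical data
    random_days_panel = da.random.randint(
        1, hist_days, size=(1, 1, sim_count, sim_days)
    )

    # each block is a plain numpy array of indices, so numpy fancy indexing works
    future_panel = da.map_blocks(
        lambda chunk: historical_data[chunk], random_days_panel, dtype=float
    )
    return future_panel

result = dask_way(sim_count=1000, sim_days=252, hist_days=5000).compute()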

Can dask parallelize reading from a csv file?

Yes, dask.dataframe can read in parallel. However, you’re running into two problems:

Pandas.read_csv only partially releases the GIL
By default dask.dataframe parallelizes with threads because most of Pandas can run in parallel in multiple threads (releases the GIL). Pandas.read_csv is an exception, especially if your resulting dataframes use object dtypes for text.

dask.dataframe.to_hdf(filename) forces sequential … Read more
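One common workaround for the GIL point (my suggestion, not spelled out in the excerpt) is to keep dask.dataframe.read_csv but compute with the process-based scheduler instead of threads. The filenames and column names below are hypothetical:

import dask.dataframe as dd

df = dd.read_csv('data/2024-*.csv')   # hypothetical glob of CSV parts

# pandas.read_csv on text-heavy data holds the GIL, so threads give little
# speedup; the multiprocessing scheduler sidesteps that at the cost of some
# serialization overhead.
result = df.groupby('name')['amount'].mean().compute(scheduler='processes')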

How to see progress of Dask compute task?

If you’re using the single machine scheduler then do this:

from dask.diagnostics import ProgressBar
ProgressBar().register()

http://dask.pydata.org/en/latest/diagnostics-local.html

If you’re using the distributed scheduler then do this:

from dask.distributed import progress
result = df.id.count().persist()
progress(result)

Or just use the dashboard

http://dask.pydata.org/en/latest/diagnostics-distributed.html

Read a large csv into a sparse pandas dataframe in a memory-efficient way

I would probably address this by using dask to load your data in a streaming fashion. For example, you can create a dask dataframe as follows:

import dask.dataframe as ddf
data = ddf.read_csv('test.csv')

This data object hasn’t actually done anything at this point; it just contains a “recipe” of sorts to read the dataframe from … Read more
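To connect this to the sparse-dataframe goal, a minimal sketch might convert each partition to a pandas sparse dtype as it streams in. This assumes test.csv contains only numeric columns; the Sparse[float64] conversion and the toy aggregation are my additions, not part of the quoted answer:

import dask.dataframe as ddf

data = ddf.read_csv('test.csv')          # lazy: nothing is read yet

# convert each pandas partition to a sparse dtype as it is loaded
sparse = data.map_partitions(lambda part: part.astype('Sparse[float64]'))

# any reduction then streams through the file chunk by chunk
total = sparse.sum().compute()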

Writing Dask partitions into a single file

Short answer: No, dask.dataframe.to_csv only writes CSV files to different files, one file per partition. However, there are ways around this.

Concatenate afterwards: perhaps just concatenate the files after dask.dataframe writes them? This is likely to be near-optimal in terms of performance.

df.to_csv('/path/to/myfiles.*.csv')

from glob import glob
filenames = glob('/path/to/myfiles.*.csv')
with open('outfile.csv', 'w') as out:
    … Read more
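The excerpt cuts off inside the concatenation loop; one way to finish it looks like the sketch below. The loop body, the sorting of part files, and the header-skipping logic are my assumptions, not the quoted answer:

from glob import glob
import shutil

filenames = sorted(glob('/path/to/myfiles.*.csv'))  # the dask-written parts
with open('outfile.csv', 'w') as out:
    for i, fn in enumerate(filenames):
        with open(fn) as f:
            if i > 0:
                f.readline()            # skip the repeated header in later parts
            shutil.copyfileobj(f, out)  # stream the rest of the part into the output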

Convert Pandas dataframe to Dask dataframe

I think you can use dask.dataframe.from_pandas:

from dask import dataframe as dd
sd = dd.from_pandas(df, npartitions=3)
print(sd)
dd.DataFrame<from_pa…, npartitions=2, divisions=(0, 1, 2)>

EDIT: I found a solution:

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import make_meta

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
dsk = {('x', 0): df}
meta = make_meta({'a': 'i8', 'b': 'i8'}, index=pd.Index([], 'i8'))
d = … Read more
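For completeness, a self-contained sketch of the from_pandas route with a round trip back to pandas; the toy dataframe and the compute() call are mine, added for illustration:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
sd = dd.from_pandas(df, npartitions=2)  # split the pandas frame into partitions
print(sd.npartitions)                   # number of partitions actually created
print(sd.compute())                     # collect back into a plain pandas DataFrame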

python dask DataFrame, support for (trivially parallelizable) row apply?

map_partitions

You can apply your function to all of the partitions of your dataframe with the map_partitions function.

df.map_partitions(func, columns=…)

Note that func will be given only part of the dataset at a time, not the entire dataset like with pandas apply (which presumably you wouldn’t want if you want parallelism).

map / … Read more
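A minimal runnable sketch of the map_partitions approach; the toy dataframe, the row function, and the meta= hint (which newer dask versions use in place of the columns= shown above) are my additions:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(10), 'y': range(10)})
df = dd.from_pandas(pdf, npartitions=2)

def row_sum(part):
    # runs on one pandas partition at a time, so plain pandas code works here
    return part.x + part.y

result = df.map_partitions(row_sum, meta=('row_sum', 'i8')).compute()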
