How to transform Dask.DataFrame to pd.DataFrame?
You can call the .compute() method to transform a dask.dataframe to a pandas dataframe: df = df.compute()
I used both fastparquet and pyarrow to convert protobuf data to Parquet and to query it in S3 using Athena. Both worked; however, my use case is a Lambda function, where the package zip file has to be lightweight, so I went ahead with fastparquet (the fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, …
You may want to read Dask's comparison to Apache Spark. Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more. It runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. It was originally optimized for the bulk data ingest and querying common in data engineering …
You may use the swifter package: pip install swifter (note that you may want to install it in a virtualenv to avoid version conflicts with existing dependencies). Swifter works as a plugin for pandas, allowing you to reuse the apply function: import swifter; def some_function(data): return data * 10; data['out'] = data['in'].swifter.apply(some_function). It will automatically …