You may use the swifter
package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply
function:
import swifter
def some_function(data):
return data * 10
data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it’s vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fallback to a “simple” Pandas apply
, which will not be parallel. In this case, even forcing it to use dask
will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing
.