Remainder function (%) runtime on numpy arrays is far longer than manual remainder calculation

My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here’s why.


First, I can’t reproduce your results on a normal pip-installed version of NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):

$ python3.6 asdf.py
0.0006543657302856445
0.0006025806903839111

I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:

$ python3.6 asdf.py
0.00242799973487854
0.0006397026300430298

This suggests that my pip-installed build’s % is better optimized than my manual build or what you have installed.
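
If you want to repeat the comparison on another build, a script along these lines should do. This is only a sketch, not the original asdf.py, which isn't reproduced here. The array size and divisor are borrowed from the timeit session further down, and the names manual_remainder and time_both are made up for illustration.

```python
import timeit

import numpy as np


def manual_remainder(a, b):
    """Floored remainder computed by hand: a - b*floor(a/b)."""
    return a - b * np.floor(a / b)


def time_both(n=8000, divisor=3, repeats=1000):
    """Return (seconds for a % divisor, seconds for the manual formula)."""
    a = np.arange(1, n, dtype=float)
    t_builtin = timeit.timeit(lambda: a % divisor, number=repeats)
    t_manual = timeit.timeit(lambda: manual_remainder(a, divisor), number=repeats)
    return t_builtin, t_manual


if __name__ == "__main__":
    t_builtin, t_manual = time_both()
    print(f"a % 3:            {t_builtin:.4f} s")
    print(f"a - 3*floor(a/3): {t_manual:.4f} s")
```

Running it under both a pip-installed NumPy and a source build should show whether the gap follows the build rather than the code.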


Looking under the hood, it’s tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):

NPY_NO_EXPORT void
@TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    BINARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        const @type@ in2 = *(@type@ *)ip2;
        npy_divmod@c@(in1, in2, (@type@ *)op1);
    }
}

but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.
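
At the Python level, numpy.divmod is the ufunc that hands back both results at once, which makes it easy to see what "calculates both // and %" means. Whether it goes through exactly the same C code path as % in your NumPy version is an assumption on my part; the sketch below only checks that the values agree.

```python
import numpy as np

a = np.arange(1, 10, dtype=float)

# divmod returns the floored quotient and the remainder together.
quotient, remainder = np.divmod(a, 3)

# They match what // and % give separately.
assert np.array_equal(quotient, a // 3)
assert np.array_equal(remainder, a % 3)

# And they satisfy the usual identity a == 3*q + r.
assert np.array_equal(3 * quotient + remainder, a)
```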

Rather than the floordiv, let’s focus on just one line in npy_divmod@c@, the fmod call:

mod = npy_fmod@c@(a, b);

This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. Here is how % compares with numpy.fmod on my manual build:

>>> import timeit
>>> import numpy
>>> a = numpy.arange(1, 8000, dtype=float)
>>> timeit.timeit('a % 3', globals=globals(), number=1000)
0.3510419335216284
>>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
0.33593094255775213
>>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
0.07980139832943678

We see that fmod appears to be responsible for almost the entire runtime of %.
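
To make the relationship between the two concrete, here is a rough Python rendering of the "fmod first, then fix the sign" idea. It is a simplified sketch, not a translation of npy_divmod@c@: it skips the zero-divisor and signed-zero special cases, and the helper name remainder_from_fmod is made up for illustration.

```python
import numpy as np


def remainder_from_fmod(a, b):
    """Rebuild the floored remainder (like %) from the truncated one (fmod).

    Simplified: ignores the zero-divisor and signed-zero special cases.
    """
    mod = np.fmod(a, b)
    # fmod's result takes the sign of a. When a nonzero result and b have
    # different signs, shift the result by b so it takes the sign of b.
    needs_fix = (mod != 0) & ((mod < 0) != (b < 0))
    return np.where(needs_fix, mod + b, mod)


a = np.array([-7.0, -1.0, 0.0, 1.0, 7.0])
b = np.array([3.0, 3.0, 3.0, -3.0, -3.0])

print(np.fmod(a, b))              # sign follows the left operand
print(a % b)                      # sign follows the right operand
print(remainder_from_fmod(a, b))  # same values as a % b
```

For -7 and 3, for example, fmod gives -1 while % gives 2. The sign fix is cheap elementwise work, which fits the timings above: almost all of the cost is in the fmod call itself.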


I haven’t disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don’t have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.
