numpy.dot delegates to a BLAS vector-vector multiply here, while numpy.sum uses a pairwise summation routine, switching over to an 8x-unrolled summation loop at a block size of 128 elements.
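To make the pairwise scheme concrete, here is a toy Python sketch of the recursion (the 128-element block size matches NumPy's documented behavior; the real routine is in C, where the base case is the 8x-unrolled loop):

```python
import numpy as np

def pairwise_sum(a, block=128):
    # Recursively split the array in half; small blocks fall through
    # to a plain accumulation loop (unrolled 8x in NumPy's C code).
    if len(a) <= block:
        return sum(a, 0.0)
    mid = len(a) // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

x = np.random.rand(10_000)
# Pairwise summation agrees closely with np.sum (which uses the same idea).
assert np.isclose(pairwise_sum(x), np.sum(x))
```

The halving keeps rounding error growth at O(log n) instead of O(n) for a naive left-to-right loop, at essentially no extra cost.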
I don’t know which BLAS library your NumPy uses, but a good BLAS will generally take advantage of SIMD operations, while numpy.sum doesn’t, as far as I can see. Any SIMD usage in the numpy.sum code would have to come from compiler autovectorization, which is likely less efficient than a hand-tuned BLAS kernel.
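A minimal timing sketch of the comparison, assuming the two operations being measured are np.sum(x) and np.dot(x, ones) (my assumed setup; they compute the same total, one via pairwise summation and one via BLAS ddot):

```python
import timeit
import numpy as np

n = 10**6
x = np.random.rand(n)   # 8 MB of float64
ones = np.ones(n)       # a second 8 MB array: dot touches ~16 MB total

t_sum = min(timeit.repeat(lambda: np.sum(x), number=100, repeat=5))
t_dot = min(timeit.repeat(lambda: np.dot(x, ones), number=100, repeat=5))
print(f"sum: {t_sum:.4f}s   dot: {t_dot:.4f}s")

# Both reduce to the same value, up to floating-point rounding order.
assert np.isclose(np.sum(x), np.dot(x, ones))
```

Absolute numbers will depend on which BLAS (OpenBLAS, MKL, etc.) your NumPy was built against.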
When you raise the array sizes to 1 million elements, you’re probably hitting a cache threshold. The dot code is working with about 16 MB of data (two 8 MB float64 arrays), and the sum code is working with about 8 MB (one array). The dot code might be getting data bumped to a slower cache level or to RAM, or perhaps both dot and sum are working out of a slower cache level and dot performs worse simply because it needs to read twice as much data. If I raise the array sizes gradually, the timings are more consistent with some sort of threshold effect than with sum having higher per-element performance.
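The gradual size sweep described above can be sketched like this (again assuming the np.sum(x) vs. np.dot(x, ones) setup); a cache threshold shows up as a jump in per-element cost at some size, rather than a uniformly higher cost for one routine:

```python
import timeit
import numpy as np

for n in [2**k for k in range(16, 22)]:  # 64K .. 2M elements
    x = np.random.rand(n)
    ones = np.ones(n)
    reps = max(1, 10**7 // n)  # keep total work roughly constant
    t_sum = min(timeit.repeat(lambda: np.sum(x), number=reps, repeat=3)) / reps
    t_dot = min(timeit.repeat(lambda: np.dot(x, ones), number=reps, repeat=3)) / reps
    # Report nanoseconds per element so sizes are directly comparable.
    print(f"n={n:>8}  sum: {t_sum / n * 1e9:6.3f} ns/elem  "
          f"dot: {t_dot / n * 1e9:6.3f} ns/elem")
```

If dot's per-element time jumps once its ~2n*8 bytes of working set outgrows a cache level while sum's ~n*8 bytes still fits, that's the threshold effect.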