numpy.dot delegates to a BLAS vector-vector multiply here, while numpy.sum uses a pairwise summation routine, switching over to an 8x-unrolled summation loop at a block size of 128 elements.
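To make the pairwise scheme concrete, here is a toy Python sketch of the recursion (the 128-element block size matches NumPy's documented behavior; the real routine is in C, where the base case is the 8x-unrolled loop):

```python
import numpy as np

def pairwise_sum(a, block=128):
    # Recursively split the array in half; small blocks fall through
    # to a plain accumulation loop (unrolled 8x in NumPy's C code).
    if len(a) <= block:
        return sum(a, 0.0)
    mid = len(a) // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

x = np.random.rand(10_000)
# Pairwise summation agrees closely with np.sum (which uses the same idea).
assert np.isclose(pairwise_sum(x), np.sum(x))
```

The halving keeps rounding error growth at O(log n) instead of O(n) for a naive left-to-right loop, at essentially no extra cost.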
I don’t know which BLAS library your NumPy uses, but a good BLAS will generally take advantage of SIMD operations, while numpy.sum doesn’t, as far as I can see. Any SIMD usage in the numpy.sum code would have to come from compiler autovectorization, which is likely less efficient than a hand-tuned BLAS kernel.
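A minimal timing sketch of the comparison, assuming the two operations being measured are np.sum(x) and np.dot(x, ones) (my assumed setup; they compute the same total, one via pairwise summation and one via BLAS ddot):

```python
import timeit
import numpy as np

n = 10**6
x = np.random.rand(n)   # 8 MB of float64
ones = np.ones(n)       # a second 8 MB array: dot touches ~16 MB total

t_sum = min(timeit.repeat(lambda: np.sum(x), number=100, repeat=5))
t_dot = min(timeit.repeat(lambda: np.dot(x, ones), number=100, repeat=5))
print(f"sum: {t_sum:.4f}s   dot: {t_dot:.4f}s")

# Both reduce to the same value, up to floating-point rounding order.
assert np.isclose(np.sum(x), np.dot(x, ones))
```

Absolute numbers will depend on which BLAS (OpenBLAS, MKL, etc.) your NumPy was built against.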
When you raise the array sizes to 1 million elements, you’re probably hitting a cache threshold. The dot code is working with about 16 MB of data (two 8 MB float64 arrays), and the sum code is working with about 8 MB (one array). The dot code might be getting data bumped to a slower cache level or to RAM, or perhaps both dot and sum are working out of a slower cache level and dot performs worse simply because it needs to read twice as much data. If I raise the array sizes gradually, the timings are more consistent with some sort of threshold effect than with sum having higher per-element performance.
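The gradual size sweep described above can be sketched like this (again assuming the np.sum(x) vs. np.dot(x, ones) setup); a cache threshold shows up as a jump in per-element cost at some size, rather than a uniformly higher cost for one routine:

```python
import timeit
import numpy as np

for n in [2**k for k in range(16, 22)]:  # 64K .. 2M elements
    x = np.random.rand(n)
    ones = np.ones(n)
    reps = max(1, 10**7 // n)  # keep total work roughly constant
    t_sum = min(timeit.repeat(lambda: np.sum(x), number=reps, repeat=3)) / reps
    t_dot = min(timeit.repeat(lambda: np.dot(x, ones), number=reps, repeat=3)) / reps
    # Report nanoseconds per element so sizes are directly comparable.
    print(f"n={n:>8}  sum: {t_sum / n * 1e9:6.3f} ns/elem  "
          f"dot: {t_dot / n * 1e9:6.3f} ns/elem")
```

If dot's per-element time jumps once its ~2n*8 bytes of working set outgrows a cache level while sum's ~n*8 bytes still fits, that's the threshold effect.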