Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)
Related: "AVX2: Computing dot product of 512 float arrays" has a good manually vectorized dot-product loop that uses multiple accumulators with FMA intrinsics. The rest of the answer explains why that’s a good thing, with CPU-architecture / asm details. "Dot Product of Vectors with SIMD" shows that with the right compiler options, some compilers will auto-vectorize that …
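To make the multiple-accumulator idea concrete, here is a minimal scalar sketch in portable C. It is not the code from either linked answer: the function name `dot_unrolled4` and the choice of four accumulators are assumptions made for illustration, and a real implementation would normally combine this with SIMD/FMA intrinsics. With a single accumulator, every add has to wait for the previous add to finish, so the loop runs at one iteration per FP-add latency. Splitting the sum across independent accumulators gives the CPU several dependency chains to overlap.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of a[0..n) and b[0..n) using four
 * independent accumulators, so that four add dependency chains can be
 * in flight at once instead of one serial chain. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator depends only on its own previous value. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums once, outside the hot loop. */
    float sum = (s0 + s1) + (s2 + s3);

    /* Scalar cleanup for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because floating-point addition is not associative, this sums the elements in a different order than a naive loop, so the result can differ in the last bits. That is also why compilers generally will not make this transformation on their own unless options such as `-ffast-math` permit reassociation. The best number of accumulators depends on the add (or FMA) latency and throughput of the target CPU; four is only a placeholder here.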