In what situation would the AVX2 gather instructions be faster than individually loading the data?

Newer microarchitectures have shifted the odds towards gather instructions. On an Intel Xeon Gold 6138 CPU @ 2.00 GHz with Skylake microarchitecture, we get for your benchmark: 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 … Read more

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

EDIT: Problem solved. I have to shout out a huge Sorry to the community and a big thank you for your hints. Sorry to user anonymous, who seems to be involved into kernel development. What happened? We spent another 2 days debugging and fiddling around with the program code. No implementation problems were found. BUT: … Read more

AVX2 what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.) We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle. We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we … Read more

tech