Per-element atomicity of vector load/store and gather/scatter?
As a few commenters have suggested: a “backend” bottleneck is what you’d expect for this code. That suggests you’re keeping things pretty well fed, which is what you want. Looking at the report, there should be an opportunity in this section: // Lets check if we found any factors, residue 1 == n!-1 found_factor_mask11 = … Read more
The frequency impact depends on the width of the operation and the specific instruction used. There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the “nominal” speed you’ll see written on the box: when the chip says “3.5 GHz turbo”, they are referring to the single-core L0 … Read more
Most compilers will automatically define __SSE__, __SSE2__, __SSE3__, __AVX__, __AVX2__, etc., according to whatever command-line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:
$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define … Read more
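The same macros can also be tested from inside the source. The following is a minimal sketch, assuming a gcc-compatible compiler; the function names `simd_level` and `print_simd_level` are hypothetical and only illustrate the idea. It reports what the compiler was told to target at build time, not what the CPU it runs on actually supports.

```c
#include <stdio.h>

/* Name of the widest SIMD extension enabled at compile time.
 * Wider levels are checked first because enabling AVX2 also
 * defines the narrower __AVX__ and __SSE*__ macros. */
const char *simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}

/* Hypothetical helper: print the compile-time target to stdout. */
void print_simd_level(void)
{
    printf("compiled for: %s\n", simd_level());
}
```

Calling `print_simd_level()` from `main` and rebuilding with a different switch such as `-msse3` or `-mavx2` should change the reported level, which matches what the `-dM -E` dump shows.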
The hardware prefetcher is tuned differently on server vs workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you’re experiencing, but from the other side of the coin: Hardware Prefetcher Aggressiveness Controllers: Do We Need … Read more