avx512 – Tarik Billa

Per-element atomicity of vector load/store and gather/scatter?

September 24, 2023 by Tarik

Fast AVX512 modulo when same divisor

August 26, 2023 by Tarik

As a few commenters have suggested, a “backend” bottleneck is what you’d expect for this code. That suggests you’re keeping things pretty well fed, which is what you want. Looking at the report, there should be an opportunity in this section:

    // Lets check if we found any factors, residue 1 == n!-1
    found_factor_mask11 = …

SIMD instructions lowering CPU frequency

April 24, 2023 by Tarik

The frequency impact depends on the width of the operation and the specific instruction used. There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the “nominal” speed you’ll see written on the box: when the chip says “3.5 GHz turbo”, they are referring to the single-core L0 …

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

February 5, 2023 by Tarik

Most compilers will automatically define __SSE__, __SSE2__, __SSE3__, __AVX__, __AVX2__, etc., according to whatever command-line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:

    $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
    #define __SSE__ 1
    #define __SSE2__ 1
    #define …
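The same predefined macros can be tested directly in source code with the preprocessor. The following is a minimal sketch, assuming GCC or Clang targeting x86; the function names simd_level, simd_width_bytes and print_simd_report are hypothetical helpers made up for this illustration. Note that it only reports what the compiler was allowed to emit for this translation unit, not what the CPU running the binary actually supports.

```c
#include <stdio.h>

/* Name of the highest SIMD extension enabled by the compiler flags.
   Macro names are the ones GCC and Clang predefine (assumption: x86 target). */
const char *simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}

/* Widest vector register, in bytes, that the compiler may use here. */
int simd_width_bytes(void)
{
#if defined(__AVX512F__)
    return 64;   /* 512-bit ZMM registers */
#elif defined(__AVX__)
    return 32;   /* 256-bit YMM registers */
#elif defined(__SSE__)
    return 16;   /* 128-bit XMM registers */
#else
    return 0;    /* no SIMD macro defined */
#endif
}

/* Print a one-line summary of the compile-time SIMD configuration. */
void print_simd_report(void)
{
    printf("compiled for: %s (%d-byte vectors)\n",
           simd_level(), simd_width_bytes());
}
```

Calling print_simd_report() from main and rebuilding with different switches (for example -msse3, -mavx2 or -march=native) should change the reported level, since the result is fixed when the file is compiled.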

memory bandwidth for many channels x86 systems

January 28, 2023 by Tarik

The hardware prefetcher is tuned differently on server vs workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you’re experiencing, but from the other side of the coin: Hardware Prefetcher Aggressiveness Controllers: Do We Need …