How to choose AVX compare predicate variants

Ordered vs. unordered determines whether the comparison is true when one of the operands is a NaN (see "What does ordered / unordered comparison mean?"). Signaling (S) vs. non-signaling (Q, for quiet) determines whether an exception is raised when an operand is a NaN. From a performance perspective, these should all …
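
The ordered/unordered distinction can be sketched in plain C, without the AVX intrinsics themselves. This is only an analogy under stated assumptions: the helper names `lt_ordered` and `nge_unordered` are hypothetical, and the mapping to the `_CMP_LT_OQ` and `_CMP_NGE_UQ` predicates is meant as an illustration of the semantics, not as what the intrinsics compile to.

```c
#include <math.h>
#include <stdbool.h>

/* Ordered, quiet less-than: false whenever either operand is NaN.
 * Roughly the scalar analogue of the _CMP_LT_OQ predicate.
 * isless() is the C99 quiet comparison macro (no "invalid" exception
 * on a quiet NaN). */
bool lt_ordered(double a, double b)
{
    return isless(a, b);
}

/* Unordered, quiet not-greater-or-equal: true when a < b OR when either
 * operand is NaN. Roughly the scalar analogue of _CMP_NGE_UQ. */
bool nge_unordered(double a, double b)
{
    return !isgreaterequal(a, b);
}
```

With these helpers, `lt_ordered(NAN, 2.0)` is false while `nge_unordered(NAN, 2.0)` is true; for non-NaN inputs the two agree. The built-in `<` operator, by contrast, is allowed to raise the invalid exception on a quiet NaN, which is closer in spirit to the signaling `_CMP_LT_OS` variant.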

Using AVX CPU instructions: Poor performance without “/arch:AVX”

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did. The behavior that you are seeing is the result of expensive state-switching. See page 102 of Agner Fog’s microarchitecture manual (http://www.agner.org/optimize/microarchitecture.pdf). Every time you improperly switch back and forth between SSE and AVX instructions, you …
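
A minimal sketch of where a manual _mm256_zeroupper() would go, assuming a hypothetical helper `sum_floats` that is not part of the original answer. The AVX path is only compiled when the compiler defines `__AVX__`; otherwise the function falls back to a scalar loop, so the block builds either way. Current compilers typically insert the equivalent instruction themselves, so the explicit call is shown purely for illustration.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: sums n floats, using 256-bit AVX adds when available. */
float sum_floats(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    /* Reduce the eight lanes to one scalar. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; ++k)
        total += lanes[k];

    /* Clear the upper halves of the YMM registers before any legacy
     * (non-VEX) SSE code runs, to avoid the state-switching penalty. */
    _mm256_zeroupper();
#endif
    /* Scalar tail (or the whole array when AVX is not enabled). */
    for (; i < n; ++i)
        total += p[i];
    return total;
}
```

Calling `sum_floats(v, 10)` on an array holding 1 through 10 returns 55 on both paths; the only difference is whether the 256-bit registers are touched, and therefore whether the zeroupper matters.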

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

Here are the theoretical maximum FLOPs per cycle (per core) for a number of recent processor microarchitectures, with an explanation of how to achieve them. In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or in any other microbenchmark result, and multiply (FMAs per clock) * (vector elements / instruction) * 2 (FLOPs …
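
The multiplication above can be written out as a small helper. This is a sketch only: the function name `peak_flops_per_cycle` and its parameters are hypothetical, and it assumes each FMA counts as 2 FLOPs (one multiply plus one add).

```c
/* Peak FLOPs per cycle per core:
 *   (FMAs per clock) * (vector elements per instruction) * 2
 * vector_bits / element_bits gives the number of elements per vector,
 * e.g. 256 / 32 = 8 single-precision lanes. */
unsigned peak_flops_per_cycle(unsigned fma_per_clock,
                              unsigned vector_bits,
                              unsigned element_bits)
{
    unsigned lanes = vector_bits / element_bits;
    return fma_per_clock * lanes * 2u;
}
```

For Haswell, which can issue two 256-bit FMAs per clock, `peak_flops_per_cycle(2, 256, 32)` gives 32 single-precision FLOPs per cycle and `peak_flops_per_cycle(2, 256, 64)` gives 16 double-precision FLOPs per cycle.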

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions! Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn’t, or vice versa. That is, you never …

Optimizations for pow() with const non-integer exponent?

Another answer, because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebyshev terms. It’s best to keep the order odd, as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
    #include <stdlib.h>
    #include <math.h>
    // Returns x^(5/12) for …

How to check if a CPU supports the SSE3 instruction set?

I’ve created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector Here’s a shorter version. First you need to access the CPUID instruction:
    #ifdef _WIN32
    // Windows
    #define cpuid(info, x) __cpuidex(info, x, 0)
    #else
    // GCC Intrinsics
    #include <cpuid.h>
    void cpuid(int info[4], int InfoType){
        __cpuid_count(InfoType, 0, …
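
For the SSE3 case specifically, a minimal sketch might look like the following. It is not the code from the linked repository: the function name `cpu_has_sse3` is hypothetical, it covers only GCC-compatible compilers through `__get_cpuid` from `<cpuid.h>`, and on non-x86 targets it simply reports false. SSE3 support is reported in bit 0 of ECX for CPUID leaf 1.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang helper header; MSVC would use __cpuid from <intrin.h> */
#endif

/* Hypothetical helper: returns true if the CPU reports SSE3 support. */
bool cpu_has_sse3(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & 1u) != 0;   /* CPUID.1:ECX bit 0 = SSE3 */
#else
    return false;             /* not an x86 target */
#endif
}
```

This checks CPU support only. For extensions that add register state, such as AVX, the operating system must also enable that state, which requires additional checks beyond this sketch.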

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define:
    __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__
etc., according to whatever command-line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:
    $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
    #define __SSE__ 1
    #define __SSE2__ 1
    #define …
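
Inside source code, the same macros can drive conditional compilation. The sketch below assumes a hypothetical helper `compiled_simd_level` that reports the highest extension the translation unit was built for; it reflects compiler flags only, not what the CPU running the binary actually supports.

```c
/* Hypothetical helper: names the widest SIMD extension enabled at
 * compile time, based on the compiler's predefined macros. */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Building the same file with different switches (for example `-msse3` versus `-mavx2`) changes the returned string, which makes it a quick way to confirm what a given build configuration enables.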

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

What is this warning about? Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. From Wikipedia: Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and …
