avx – Page 2 – Tarik Billa

How to choose AVX compare predicate variants

April 2, 2023 by Tarik

Ordered vs Unordered has to do with whether the comparison is true if one of the operands contains a NaN (see What does ordered / unordered comparison mean?). Signaling (S) vs non-signaling (Q for quiet?) will determine whether an exception is raised if an operand contains a NaN. From a performance perspective, these should all … Read more
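
The ordered/unordered distinction can be modeled in scalar C without any intrinsics. The sketch below is an illustration only: the helper names are hypothetical, and it mirrors the NaN behavior of the quiet predicates _CMP_LT_OQ and _CMP_NGE_UQ from <immintrin.h> without raising or modeling any floating-point exceptions.

```c
#include <math.h>
#include <stdbool.h>

/* Ordered less-than (models _CMP_LT_OQ):
   false whenever either operand is NaN. */
bool lt_ordered(float a, float b)
{
    return !isnan(a) && !isnan(b) && a < b;
}

/* Unordered "not greater-or-equal" (models _CMP_NGE_UQ):
   true whenever either operand is NaN, otherwise the same as a < b. */
bool lt_unordered(float a, float b)
{
    return isnan(a) || isnan(b) || a < b;
}
```

With no NaN involved both helpers agree; they differ only when an operand is NaN, which is the case the predicate choice is meant to settle.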

Using AVX CPU instructions: Poor performance without “/arch:AVX”

April 1, 2023 by Tarik

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did. The behavior that you are seeing is the result of expensive state-switching. See page 102 of Agner Fog’s manual: http://www.agner.org/optimize/microarchitecture.pdf Every time you improperly switch back and forth between SSE and AVX instructions, you … Read more

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

March 28, 2023 by Tarik

Here are theoretical max FLOPs counts (per core) for a number of recent processor microarchitectures and explanation how to achieve them. In general, to calculate this look up the throughput of the FMA instruction(s) e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply (FMAs per clock) * (vector elements / instruction) * 2 (FLOPs … Read more
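
The multiplication described above is simple enough to write down as a helper. This is a minimal sketch, not code from the post: the function name and parameters are hypothetical, and the FMA throughput you pass in has to come from a table or microbenchmark for your CPU.

```c
#include <assert.h>

/* Theoretical peak FLOPs per cycle per core:
   (FMAs per clock) * (vector elements per instruction) * 2,
   where the factor 2 counts the multiply and the add of one FMA. */
unsigned peak_flops_per_cycle(unsigned fmas_per_clock,
                              unsigned vector_bits,
                              unsigned element_bits)
{
    assert(element_bits != 0);                      /* avoid division by zero */
    unsigned elements = vector_bits / element_bits; /* lanes per vector */
    return fmas_per_clock * elements * 2u;
}
```

For example, assuming 2 FMAs per clock on 256-bit vectors (the commonly cited Haswell figure), peak_flops_per_cycle(2, 256, 32) gives 32 single-precision FLOPs per cycle and peak_flops_per_cycle(2, 256, 64) gives 16 double-precision FLOPs per cycle.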

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

March 26, 2023 by Tarik

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions! Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used vex to code that didn’t, or vice-versa. That is, you never … Read more

Optimizations for pow() with const non-integer exponent?

March 26, 2023 by Tarik

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It’s best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon. #include <stdlib.h> #include <math.h> // Returns x^(5/12) for … Read more

How to check if a CPU supports the SSE3 instruction set?

February 21, 2023 by Tarik

I’ve created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector Here’s a shorter version: First you need to access the CPUID instruction: #ifdef _WIN32 // Windows #define cpuid(info, x) __cpuidex(info, x, 0) #else // GCC Intrinsics #include <cpuid.h> void cpuid(int info[4], int InfoType){ __cpuid_count(InfoType, 0, … Read more
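
The excerpt stops before any feature bit is read, so here is a separate, hedged sketch of the SSE3 check itself rather than a continuation of the author's code. It assumes GCC or Clang on x86 and uses __get_cpuid from <cpuid.h>; the function names are hypothetical. On x86, CPUID leaf 1 reports SSE3 in bit 0 of ECX. On other compilers or architectures the sketch simply reports "unknown".

```c
#include <stdbool.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_GNU_CPUID 1
#endif

/* SSE3 is reported in bit 0 of ECX for CPUID leaf 1. */
bool ecx_has_sse3(unsigned ecx)
{
    return (ecx & 1u) != 0;
}

/* Returns 1 if the CPU reports SSE3, 0 if it does not,
   and -1 if CPUID is unavailable in this build. */
int cpu_supports_sse3(void)
{
#ifdef HAVE_GNU_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;                 /* leaf 1 not supported */
    return ecx_has_sse3(ecx) ? 1 : 0;
#else
    return -1;                     /* not an x86 GCC/Clang build */
#endif
}
```

Keeping the bit test in its own function makes it easy to check without depending on the machine the code happens to run on.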

C# and SIMD: High and low speedups. What is happening?

February 16, 2023 by Tarik

I am not going to try to answer the question about SIMD speedup, but provide some detailed comments on poor coding in the scalar version that carried over to the vector version, in a way that doesn’t fit in an SO comment. This code in Intersect(Circle) is just absurd: // Step 3: compute the substitutions, … Read more

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

February 5, 2023 by Tarik

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more
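
Inside a source file, the same macros can be tested with the preprocessor. The following is a small sketch under the assumption of a GCC-compatible compiler; the function name is hypothetical, and the result reflects the compiler flags used for the build, not what the CPU running the program supports.

```c
/* Reports the widest SIMD level this translation unit was compiled for,
   based only on the predefined macros listed above. */
const char *simd_level_at_compile_time(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Building the same file with -mavx2 and then with -msse3 should change the returned string, which is a quick way to confirm which switches actually reach the compiler.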

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

September 8, 2022 by Tarik

What is this warning about? Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. From the Wikipedia: Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and … Read more