Using AVX CPU instructions: Poor performance without “/arch:AVX”

Question

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.

The behavior that you are seeing is the result of expensive state-switching.

See page 102 of Agner Fog’s manual:

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high penalty of roughly 70 cycles.

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you’ll get code that has both SSE and AVX instructions – which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you’re seeing.)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

It sounds like you’re trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses between them based on the hardware it’s running on.
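A rough sketch of what such a dispatcher could look like is below. The names sum_sse, sum_avx, cpu_has_avx, select_sum and dispatch_sum are made up for illustration, and the two kernels are plain scalar stand-ins so the example fits in one file. In a real project each kernel would sit in its own .cpp file, with only the AVX one compiled with /arch:AVX. The detection code assumes an x86 or x86-64 target and MSVC, GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical kernels. Scalar loops stand in for the real SSE / AVX bodies;
// in practice sum_avx would live in a file built with /arch:AVX.
float sum_sse(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

float sum_avx(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// True only if both the CPU and the OS support AVX (the OS must save YMM state).
bool cpu_has_avx() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int r[4];
    __cpuid(r, 1);
    const bool osxsave = (r[2] & (1 << 27)) != 0;  // ECX bit 27
    const bool avx     = (r[2] & (1 << 28)) != 0;  // ECX bit 28
    return osxsave && avx && ((_xgetbv(0) & 6) == 6);
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // unknown platform: stay on the baseline path
#endif
}

using SumFn = float (*)(const float*, std::size_t);

// Pick the implementation that matches the hardware.
SumFn select_sum() {
    return cpu_has_avx() ? sum_avx : sum_sse;
}

// Public entry point: the choice is made once, on the first call.
float dispatch_sum(const float* p, std::size_t n) {
    assert(n == 0 || p != nullptr);
    static const SumFn impl = select_sum();
    return impl(p, n);
}
```

Callers only ever see dispatch_sum, so no SSE and AVX code ends up interleaved inside the same function.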

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
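As an illustration of that last point, here is a minimal sketch of an AVX routine that clears the upper halves of the YMM registers before control returns to SSE code. The function names (add8_avx, hsum4_sse, add_then_sum_first4, cpu_has_avx) and the AVX_FN macro are hypothetical, and the code assumes an x86 or x86-64 target. On GCC and Clang the target attribute stands in for /arch:AVX on that one function; on MSVC the intrinsics compile without it. Newer compilers may insert the zeroupper for you, so treat this as the manual pattern rather than something every toolchain requires.

```cpp
#include <cassert>
#include <immintrin.h>  // AVX and SSE intrinsics, _xgetbv on MSVC
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#endif

#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))  // enable AVX for one function
#else
#define AVX_FN
#endif

// True only if both the CPU and the OS support AVX.
bool cpu_has_avx() {
#if defined(_MSC_VER)
    int r[4];
    __cpuid(r, 1);
    const bool osxsave = (r[2] & (1 << 27)) != 0;
    const bool avx     = (r[2] & (1 << 28)) != 0;
    return osxsave && avx && ((_xgetbv(0) & 6) == 6);
#else
    return __builtin_cpu_supports("avx") != 0;
#endif
}

// AVX part: out[i] = a[i] + b[i] for 8 floats.
AVX_FN void add8_avx(const float* a, const float* b, float* out) {
    assert(a && b && out);
    const __m256 va = _mm256_loadu_ps(a);
    const __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
    _mm256_zeroupper();  // leave the "upper halves dirty" state before SSE runs
}

// SSE part: horizontal sum of 4 floats.
float hsum4_sse(const float* p) {
    const __m128 v  = _mm_loadu_ps(p);
    const __m128 hi = _mm_movehl_ps(v, v);      // {v2, v3, v2, v3}
    const __m128 s  = _mm_add_ps(v, hi);        // lane0 = v0+v2, lane1 = v1+v3
    const __m128 s1 = _mm_shuffle_ps(s, s, 1);  // move lane1 into lane0
    return _mm_cvtss_f32(_mm_add_ss(s, s1));
}

// Mixed caller: AVX add (scalar fallback without AVX), then SSE reduction.
float add_then_sum_first4(const float* a, const float* b) {
    float tmp[8];
    if (cpu_has_avx()) {
        add8_avx(a, b, tmp);
    } else {
        for (int i = 0; i < 8; ++i) tmp[i] = a[i] + b[i];
    }
    return hsum4_sse(tmp);
}
```

The zeroupper sits at the end of the AVX routine, so whatever SSE code runs afterwards starts from a clean state.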
