2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior that you are seeing is the result of expensive state-switching.
See page 102 of Agner Fog’s manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high penalty of roughly 70 cycles.
When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you’ll get code that has both SSE and AVX instructions, which will incur those state-switching penalties. (VS2010 knows this, so it emits that warning you’re seeing.)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.
It sounds like you’re trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses between them based on the hardware it’s running on.
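A minimal sketch of such a dispatcher follows. All names here (sum_sse, sum_avx, cpu_has_avx, select_sum, sum) are hypothetical, and the detection method is my assumption rather than something from the question: CPUID plus XGETBV on MSVC, __builtin_cpu_supports on GCC/Clang. The two kernels have plain scalar bodies so the sketch builds anywhere; in a real project each would live in its own source file, with only the AVX one compiled with /arch:AVX.

```cpp
#include <cstddef>
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical kernels. In a real build, sum_sse would sit in a file compiled
// without /arch:AVX and sum_avx in a file compiled with /arch:AVX.
// Scalar stand-in bodies are used here so the sketch is self-contained.
float sum_sse(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

float sum_avx(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Returns true if both the CPU and the OS support AVX.
bool cpu_has_avx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // ECX bit 27: OSXSAVE
    const bool avx     = (regs[2] & (1 << 28)) != 0;  // ECX bit 28: AVX
    if (!osxsave || !avx) return false;
    // XCR0 bits 1 and 2 must both be set: the OS saves XMM and YMM state.
    return (_xgetbv(0) & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // Unknown platform: fall back to the SSE path.
#endif
}

using SumFn = float (*)(const float*, std::size_t);

// Picks the implementation for the current machine.
SumFn select_sum() {
    return cpu_has_avx() ? sum_avx : sum_sse;
}

// Public entry point: resolves the implementation once, then reuses it.
float sum(const float* p, std::size_t n) {
    static const SumFn fn = select_sum();
    return fn(p, n);
}
```

Callers only ever use sum(), so the SSE/AVX choice is made once, at the first call, and no caller needs to know which path was taken.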
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
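As a rough illustration, here is one way that might look: an AVX routine that clears the upper halves of the YMM registers before returning to a caller that may run legacy SSE code. The function add_avx and the TARGET_AVX macro are hypothetical names of mine; the macro only exists so GCC/Clang accept the intrinsics without a global AVX flag. The sketch assumes n is a multiple of 8 and that the machine supports AVX.

```cpp
#include <immintrin.h>
#include <cstddef>

// On GCC/Clang, enable AVX for this one function only.
// MSVC accepts AVX intrinsics without /arch:AVX, so the macro is empty there.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Adds b into a, eight floats at a time.
// Assumption: n is a multiple of 8.
TARGET_AVX void add_avx(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
    // Zero the upper 128 bits of every YMM register before handing control
    // back to code that may execute legacy SSE instructions.
    _mm256_zeroupper();
}
```

The call goes at the boundary, after the last 256-bit instruction and before any SSE code runs, not inside the loop.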