Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)

Related: AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that’s a good thing, with cpu-architecture / asm details. Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that … Read more
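The multiple-accumulator idea can be sketched in plain SSE (the linked answer uses AVX2 + FMA intrinsics; this simplified version uses separate multiply and add so it compiles with baseline SSE, and `dot4acc` is a hypothetical name for illustration). Four independent accumulators let the adds from different iterations overlap in the pipeline instead of serializing on one loop-carried dependency chain:

```cpp
#include <xmmintrin.h>  // SSE
#include <cstddef>

// Dot product with four independent accumulators: a simplified sketch,
// not the exact code from the linked answer.
float dot4acc(const float *a, const float *b, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {   // 4 vectors = 16 floats per iteration
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a+i),    _mm_loadu_ps(b+i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a+i+4),  _mm_loadu_ps(b+i+4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a+i+8),  _mm_loadu_ps(b+i+8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a+i+12), _mm_loadu_ps(b+i+12)));
    }
    // Combine the four accumulators, then reduce horizontally.
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
    return sum;
}
```

With FMA available, each `_mm_add_ps`/`_mm_mul_ps` pair would collapse into one fused multiply-add, shortening the dependency chain further.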

SSE SSE2 and SSE3 for GNU C++ [closed]

Sorry, I don’t know of a tutorial. Your best bet (IMHO) is to use SSE via the “intrinsic” functions Intel provides to wrap (generally) single SSE instructions. These are made available via a set of include files named *mmintrin.h, e.g. xmmintrin.h for the original SSE instruction set. Become familiar with the contents of Intel’s Optimization Reference … Read more
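A minimal taste of the intrinsics style, assuming only baseline SSE (the `add4` helper name is mine, for illustration). Each intrinsic maps roughly one-to-one onto an instruction, e.g. `_mm_add_ps` → ADDPS:

```cpp
#include <xmmintrin.h>  // SSE, from the *mmintrin.h family

// Add two arrays of 4 floats in a single SIMD operation.
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);        // unaligned 128-bit load
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // ADDPS, then store
}
```

The compiler handles register allocation and scheduling, so this is far less brittle than inline asm while still giving per-instruction control.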

How is a vector’s data aligned?

The C++ standard requires allocation functions (malloc() and operator new()) to return memory suitably aligned for any standard type. Since these functions don’t receive the alignment requirement as an argument, in practice this means that the alignment for all allocations is the same: that of the standard type with the largest alignment requirement, which … Read more
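When you need stricter alignment than malloc() guarantees (e.g. 32 bytes for AVX loads), one classic portable technique is to over-allocate, round the pointer up, and stash the original pointer just before the aligned block so it can be freed later. A sketch, assuming the alignment is a power of two (`aligned_malloc`/`aligned_free` are hypothetical names; C++17 code can use std::aligned_alloc instead):

```cpp
#include <cstdlib>
#include <cstdint>

// Over-allocate, round up to `align` (must be a power of 2), and store the
// pointer malloc returned just before the aligned block.
void *aligned_malloc(std::size_t size, std::size_t align) {
    void *raw = std::malloc(size + align + sizeof(void*));
    if (!raw) return nullptr;
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    p = (p + align - 1) & ~static_cast<std::uintptr_t>(align - 1);
    reinterpret_cast<void**>(p)[-1] = raw;   // remember the real allocation
    return reinterpret_cast<void*>(p);
}

void aligned_free(void *p) {
    if (p) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

To use this with std::vector you would plug it into a custom allocator; the rounding trick itself is the part the excerpt's reasoning implies.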

Using AVX CPU instructions: Poor performance without “/arch:AVX”

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did. The behavior that you are seeing is the result of expensive state-switching. See page 102 of Agner Fog’s manual: http://www.agner.org/optimize/microarchitecture.pdf Every time you improperly switch back and forth between SSE and AVX instructions, you … Read more
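The manual fix on old toolchains is to end every AVX region with `_mm256_zeroupper()` so the upper YMM halves are clean before any legacy-SSE code runs. A hedged sketch (`copy8` is a hypothetical helper; the `#ifdef __AVX__` guard lets it also build when AVX isn't enabled at compile time, which is the MSVC-without-/arch:AVX situation the question describes):

```cpp
#include <immintrin.h>

// Copy 8 floats. With AVX enabled, do it in one 256-bit op and then clear
// the upper YMM state so following non-VEX SSE code pays no transition cost.
void copy8(float *dst, const float *src) {
#ifdef __AVX__
    _mm256_storeu_ps(dst, _mm256_loadu_ps(src));
    _mm256_zeroupper();   // dirty-upper -> clean before any legacy-SSE code
#else
    for (int i = 0; i < 8; ++i) dst[i] = src[i];  // fallback when AVX is off
#endif
}
```

Modern compilers insert the vzeroupper themselves at function boundaries, which is why the 2021 update says the manual call is no longer needed.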

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions! Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn’t, or vice versa. That is, you never … Read more

Fast method to copy memory with translation – ARGB to BGR

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages. Editor’s note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on … Read more
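The core translation those four versions implement can be shown as a scalar reference (the answer's fast versions use SSSE3 byte shuffles and inline asm; this sketch only fixes the semantics, and it assumes ARGB is stored as the bytes A,R,G,B in memory, with `argb_to_bgr` a hypothetical name):

```cpp
#include <cstdint>
#include <cstddef>

// Drop the alpha byte and reverse the channel order:
// in-memory A,R,G,B (4 bytes/pixel) -> B,G,R (3 bytes/pixel).
void argb_to_bgr(const std::uint8_t *src, std::uint8_t *dst, std::size_t npix) {
    for (std::size_t i = 0; i < npix; ++i) {
        dst[3*i + 0] = src[4*i + 3];  // B
        dst[3*i + 1] = src[4*i + 2];  // G
        dst[3*i + 2] = src[4*i + 1];  // R
    }
}
```

The SIMD versions in the answer do the same permutation 16 bytes at a time with PSHUFB, which is why -mssse3 was needed.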

How to check if a CPU supports the SSE3 instruction set?

I’ve created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector Here’s a shorter version: First you need to access the CPUID instruction: #ifdef _WIN32 // Windows #define cpuid(info, x) __cpuidex(info, x, 0) #else // GCC Intrinsics #include <cpuid.h> void cpuid(int info[4], int InfoType){ __cpuid_count(InfoType, 0, … Read more
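Put together, a complete SSE3 check looks roughly like this (a sketch in the spirit of the excerpt, not the FeatureDetector code itself; SSE3 is reported in CPUID leaf 1, ECX bit 0):

```cpp
#ifdef _WIN32
#include <intrin.h>
static void cpuid(int info[4], int leaf) { __cpuidex(info, leaf, 0); }
#else
#include <cpuid.h>
static void cpuid(int info[4], int leaf) {
    __cpuid_count(leaf, 0, info[0], info[1], info[2], info[3]);
}
#endif

bool has_sse3() {
    int info[4];
    cpuid(info, 0);                 // EAX = highest supported leaf
    if (info[0] < 1) return false;  // leaf 1 not available
    cpuid(info, 1);
    return (info[2] & 1) != 0;      // ECX bit 0 = SSE3 (a.k.a. PNI)
}
```

For AVX and wider you also have to check OS support via XGETBV, which is exactly what the linked repo handles.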

Fastest way to do horizontal SSE vector sum (or other reduction)

In general for any kind of vector horizontal reduction, extract / shuffle the high half to line up with the low, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there’s just a single element (with high garbage in the rest of the vector). If you start with vectors wider than 128-bit, narrow in half until you get … Read more
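The narrowing-in-half pattern for a 128-bit float sum can be sketched with SSE1-only shuffles (the canonical answer has variants using SSE3 movshdup and movehl; `hsum_ps` is an illustrative name):

```cpp
#include <xmmintrin.h>

// Horizontal sum of 4 floats: shuffle high half down to the low half,
// add, then repeat on the remaining pair -- log2(width) steps total.
float hsum_ps(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)); // [v2,v3,v0,v1]
    __m128 sums = _mm_add_ps(v, shuf);                           // [v0+v2, v1+v3, ...]
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(2, 3, 0, 1));  // swap within pair
    sums = _mm_add_ss(sums, shuf);       // lane 0 = (v0+v2) + (v1+v3)
    return _mm_cvtss_f32(sums);          // extract lane 0, no extra instruction
}
```

For 256-bit input you would first `_mm256_extractf128_ps` the high half, add it to the low half, then fall through to this 128-bit reduction.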

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep “SSE|AVX” | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more
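In source code, those predefined macros let you select an implementation at compile time (a minimal sketch; `simd_level` is an illustrative name, and the ladder only covers a few of the macros listed above):

```cpp
// Pick a code path from the macros the compiler defines for the enabled
// target features (-msse3, -mavx2, /arch:AVX2, ...).
#if defined(__AVX2__)
const char *simd_level() { return "AVX2"; }
#elif defined(__SSE3__)
const char *simd_level() { return "SSE3"; }
#elif defined(__SSE2__)
const char *simd_level() { return "SSE2"; }
#else
const char *simd_level() { return "scalar"; }
#endif
```

Note this is purely compile-time: it tells you what the binary was built for, not what the CPU it runs on supports, so it complements the runtime CPUID check above.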
