Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to “hide”/“protect” things that would be UB if they executed. The only thing an if(false) can’t save you from is an ill-formed program, e.g. … Read more
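As a minimal sketch of the point about a not-taken if(), assume a hypothetical helper maybe_clear() that stores through its pointer only when a flag is set. The names here are illustrative and are not taken from the original question. Calling it on a const object with the flag false is well-defined, because the abstract machine never performs the store; a compiler that turned the guarded store into an unconditional one would be inventing a write.

```cpp
// Hypothetical helper: stores through p only when `writable` is true.
void maybe_clear(int *p, bool writable) {
    if (writable) {
        *p = 0;  // reached only when the caller says the object is writable
    }
}

// A genuinely const object; it may be placed in read-only memory.
static const int kReadOnly = 42;

int read_after_guarded_call() {
    // Casting away const is allowed; *writing* through the result would be UB.
    // With writable == false the store is never executed, so there is no UB.
    maybe_clear(const_cast<int *>(&kReadOnly), false);
    return kReadOnly;
}

int clear_mutable_value() {
    int value = 7;
    maybe_clear(&value, true);  // ordinary, writable object: store happens
    return value;
}
```

If the guarded store were hoisted out of the branch, the first call could fault on a platform that maps const data read-only, even though the source never asks for a write.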

Why ARM NEON not faster than plain C++?

The NEON pipeline on Cortex-A8 executes in order and has limited hit-under-miss (no renaming), so you’re limited by memory latency (as you’re using more than L1/L2 cache size). Your code has immediate dependencies on the values loaded from memory, so it’ll stall constantly waiting for memory. This would explain why the NEON code is slightly … Read more

What are the best instruction sequences to generate vector constants on the fly?

All-zero: pxor xmm0,xmm0 (or xorps xmm0,xmm0, one instruction-byte shorter.) There isn’t much difference on modern CPUs, but on Nehalem (before xor-zero elimination), the xorps uop could only run on port 5. I think that’s why compilers favour pxor-zeroing even for registers that will be used with FP instructions. All-ones: pcmpeqw xmm0,xmm0. This is the usual … Read more
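For readers working from C or C++ rather than assembly, the two idioms above can be sketched with SSE2 intrinsics. This assumes an x86 target with SSE2 and the <emmintrin.h> header; the function names all_zero, all_ones, and low64 are made up for illustration. Which instruction the compiler actually emits (pxor vs. xorps, pcmpeqw vs. pcmpeqd) is its own choice and is not guaranteed by the intrinsic.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// All-zero: compilers normally materialise this with an xor-zeroing idiom.
inline __m128i all_zero() {
    return _mm_setzero_si128();
}

// All-ones: comparing a register with itself for equality sets every bit,
// which mirrors the pcmpeqw xmm0,xmm0 idiom.
inline __m128i all_ones() {
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi16(x, x);
}

// Illustrative helper: copy out the low 64 bits so the result can be checked.
inline uint64_t low64(__m128i v) {
    uint64_t halves[2];
    std::memcpy(halves, &v, sizeof(halves));
    return halves[0];
}
```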

print a __m128i variable

Use this function to print them:

    #include <emmintrin.h>  // __m128i
    #include <stdint.h>
    #include <stdio.h>      // printf
    #include <string.h>     // memcpy

    void print128_num(__m128i var)
    {
        uint16_t val[8];
        memcpy(val, &var, sizeof(val));  // copy out the eight 16-bit lanes
        printf("Numerical: %i %i %i %i %i %i %i %i \n",
               val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7]);
    }

You split the 128 bits into 16-bit (or 32-bit) pieces before printing them. This is a way of 64-bit splitting and … Read more

Implementation of __builtin_clz

Yes, and no. CLZ (count leading zeros) and BSR (bit-scan reverse) are related but different. CLZ equals (type bit width less one) - BSR. CTZ (count trailing zeros), also known as FFS (find first set), is the same as BSF (bit-scan forward). Note that all of these are undefined when operating on zero! In answer … Read more
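The relationship between CLZ and BSR can be checked with a small portable sketch. The functions bsr32, clz32, and bsf32 below are hypothetical reference implementations written with plain loops for 32-bit values; they are not how __builtin_clz or the hardware instructions are implemented. Like the real operations, they require a non-zero input.

```cpp
#include <cstdint>

// Index of the highest set bit, i.e. what BSR reports. Precondition: x != 0.
inline int bsr32(uint32_t x) {
    int index = 0;
    while (x >>= 1) {
        ++index;
    }
    return index;
}

// Count of leading zeros: (bit width - 1) - BSR. Precondition: x != 0.
inline int clz32(uint32_t x) {
    return 31 - bsr32(x);
}

// Index of the lowest set bit, i.e. what BSF / CTZ report. Precondition: x != 0.
inline int bsf32(uint32_t x) {
    int index = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++index;
    }
    return index;
}
```

For 0x00F0, the highest set bit is bit 7 and the lowest is bit 4, so bsr32 gives 7, clz32 gives 24, and bsf32 gives 4.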

How to check if compiled code uses SSE and AVX instructions?

Under Linux, you could also disassemble your binary:

    objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

    awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2si|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

    awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, which were first introduced in 80386): awk '/[ … Read more

Why vectorizing the loop over 64-bit elements does not have performance improvement over large buffers?

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date. See the end of this answer for the 2017 update. Original Answer (2013): Because you’re bottlenecked by memory bandwidth. While vectorization and other micro-optimizations can improve the speed of computation, … Read more
