INC instruction vs ADD 1: Does it matter?

Update: the Efficiency cores on Alder Lake are Gracemont, and run inc reg as a single uop, but at only 1/clock, vs. 4/clock for add reg, 1 (https://uops.info/). This may be a false dependency on FLAGS like P4 had; the uops.info tests didn't try adding a dep-breaking instruction. Other than the TL;DR, I haven't updated …
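A FLAGS dependency is possible because of an architectural difference. inc produces the same result and the same ZF/SF/OF as add reg, 1, but it leaves CF unchanged. The toy C++ model below is my own sketch, not an emulator. The Flags struct and the add1/inc names are hypothetical, and AF/PF are not modelled. It shows that inc has to carry the old CF through. So a core that renames FLAGS as a single unit may need the previous flags as an input. The uops.info numbers alone don't establish that this is what Gracemont actually does.

```cpp
#include <cstdint>

// Toy model of the x86 arithmetic flags relevant here (AF and PF not modelled).
struct Flags {
    bool cf = false; // carry
    bool zf = false; // zero
    bool sf = false; // sign
    bool of = false; // signed overflow
};

// add r32, 1: writes CF, ZF, SF and OF from the new result.
std::uint32_t add1(std::uint32_t r, Flags& f) {
    std::uint32_t res = r + 1u;
    f.cf = (res == 0);             // unsigned wrap-around 0xFFFFFFFF -> 0
    f.zf = (res == 0);
    f.sf = (res >> 31) != 0;
    f.of = (r == 0x7FFFFFFFu);     // INT32_MAX + 1 overflows
    return res;
}

// inc r32: same result and same ZF/SF/OF, but CF keeps its old value,
// so the previous CF is effectively an extra input of the instruction.
std::uint32_t inc(std::uint32_t r, Flags& f) {
    bool old_cf = f.cf;
    std::uint32_t res = add1(r, f);
    f.cf = old_cf;
    return res;
}
```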

“enter” vs “push ebp; mov ebp, esp; sub esp, imm” and “leave” vs “mov esp, ebp; pop ebp”

There is a performance difference, especially for enter. On modern processors, enter decodes to some 10 to 20 µops, while the three-instruction sequence is about 4 to 6, depending on the architecture. For details, consult Agner Fog's instruction tables. Additionally, enter usually has quite a high latency, for example 8 clocks on …
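For reference, here is a toy C++ model of what the instructions do architecturally. It is a sketch. Cpu, enter, prologue, leave and same_state are hypothetical names, and only nesting level 0 is modelled. It shows that enter size, 0 leaves ESP, EBP and the stack in the same state as push ebp; mov ebp, esp; sub esp, size. It also shows that leave undoes the frame the same way mov esp, ebp; pop ebp would. With a nonzero nesting level, enter also copies frame pointers from the enclosing frame. That is extra work the simple sequence never does.

```cpp
#include <cstdint>
#include <map>

// Toy 32-bit stack: ESP, EBP and a sparse dword-addressed memory.
struct Cpu {
    std::uint32_t esp = 0x1000;
    std::uint32_t ebp = 0xBEEF;
    std::map<std::uint32_t, std::uint32_t> mem;

    void push(std::uint32_t v) { esp -= 4; mem[esp] = v; }
    std::uint32_t pop() { std::uint32_t v = mem[esp]; esp += 4; return v; }
};

// enter size, 0 (nesting level 0 only), following the frame_temp structure
// of the instruction's documented operation.
void enter(Cpu& c, std::uint32_t size) {
    c.push(c.ebp);
    std::uint32_t frame_temp = c.esp;
    c.ebp = frame_temp;
    c.esp -= size;
}

// push ebp; mov ebp, esp; sub esp, size
void prologue(Cpu& c, std::uint32_t size) {
    c.push(c.ebp);
    c.ebp = c.esp;
    c.esp -= size;
}

// leave, i.e. mov esp, ebp; pop ebp
void leave(Cpu& c) {
    c.esp = c.ebp;
    c.ebp = c.pop();
}

bool same_state(const Cpu& a, const Cpu& b) {
    return a.esp == b.esp && a.ebp == b.ebp && a.mem == b.mem;
}
```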

Divide by 10 using bit shifts?

Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182 (Godbolt). It is exact for inputs smaller than 0x40000005, though, which may be sufficient for some uses. Compilers (including MSVC) do use fixed-point multiplicative inverses for constant …
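A fixed-point multiplicative inverse for 32-bit unsigned inputs can be sketched as follows. The name div10_mulinv is hypothetical. The constant 0xCCCCCCCD is ceil(2^35 / 10), and 10 * 0xCCCCCCCD = 2^35 + 2. As a result, (n * 0xCCCCCCCD) >> 35 equals n/10 plus an error term of 2n / (10 * 2^35). That error is below 1/10 for every n < 2^34. Because the fractional part of n/10 is at most 9/10, the floor never changes. So the result is exact for all uint32_t values, including 1073741829. As far as I know, this is the constant GCC and Clang typically emit for unsigned n / 10 on x86-64, but check your own compiler's output.

```cpp
#include <cstdint>

// Exact n / 10 for every 32-bit unsigned n:
// multiply by ceil(2^35 / 10) in 64-bit arithmetic, then shift right by 35.
std::uint32_t div10_mulinv(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 0xCCCCCCCDu) >> 35);
}
```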

How to force GCC to assume that a floating-point expression is non-negative?

You can write assert(x*x >= 0.f) as a compile-time promise instead of a runtime check as follows in GNU C:

#include <cmath>
float test1 (float x)
{
    float tmp = x*x;
    if (!(tmp >= 0.0f))
        __builtin_unreachable();
    return std::sqrt(tmp);
}

(related: What optimizations does __builtin_unreachable facilitate? You could also wrap if(!x)__builtin_unreachable() in a macro and call …
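Wrapping the idiom in a macro might look like this. ASSUME and sqrt_of_square are hypothetical names, and the code requires GCC or Clang because of __builtin_unreachable. The promise has to actually hold. If x is NaN, x*x is NaN, the comparison is false, and reaching __builtin_unreachable() is undefined behaviour. So only use it where callers cannot pass NaN.

```cpp
#include <cmath>

// Tells the compiler that cond is always true; no runtime check is emitted.
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

float sqrt_of_square(float x) {
    float tmp = x * x;
    ASSUME(tmp >= 0.0f); // lets std::sqrt skip its negative-input (errno) path
    return std::sqrt(tmp);
}
```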

Does using xor reg, reg give advantage over mov reg, 0? [duplicate]

An actual answer for you: the Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.7, is where you want to look. In short, there are situations where an xor or a mov may be preferred. The issues center on dependency chains and preservation of condition codes. In processors based on Intel Core microarchitecture, a number …
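The condition-code side of this can be illustrated with a toy C++ model. It is a sketch. Flags, cmp, xor_zero and mov_zero are hypothetical names, and only CF/ZF/SF/OF are tracked. xor reg, reg writes the flags (it clears CF and OF and sets ZF), while mov reg, 0 leaves them alone. So if a register must be zeroed between a flag-setting instruction and the instruction that reads those flags, mov is the safe choice. The other option is to do the xor before the compare.

```cpp
#include <cstdint>

// Only the flags relevant to the comparison are modelled (no PF/AF).
struct Flags {
    bool cf = false, zf = false, sf = false, of = false;
};

// cmp a, b: flags from a - b; CF is the unsigned borrow.
void cmp(std::uint32_t a, std::uint32_t b, Flags& f) {
    std::uint32_t d = a - b;
    f.cf = a < b;
    f.zf = (d == 0);
    f.sf = (d >> 31) != 0;
    f.of = (((a ^ b) & (a ^ d)) >> 31) != 0; // signed overflow of a - b
}

// xor r32, r32: result 0; CF and OF cleared, ZF set, SF cleared.
std::uint32_t xor_zero(Flags& f) {
    f.cf = false; f.of = false; f.zf = true; f.sf = false;
    return 0;
}

// mov r32, 0: result 0; flags are not touched.
std::uint32_t mov_zero(const Flags& /*unchanged*/) { return 0; }
```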

Fast method to copy memory with translation – ARGB to BGR

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages. Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on …
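A plain scalar version is handy as a correctness baseline when timing faster variants. It is not one of the four versions above. The byte layout here is an assumption: each source pixel is taken to be 4 bytes in memory order A, R, G, B, and each destination pixel 3 bytes in order B, G, R. argb_to_bgr is a hypothetical name. If your data is a little-endian 0xAARRGGBB word, the source byte order differs and the indices need adjusting.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: src = A,R,G,B per pixel; dst = B,G,R per pixel.
// dst must have room for 3 * pixels bytes.
void argb_to_bgr(const std::uint8_t* src, std::uint8_t* dst, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        const std::uint8_t* s = src + 4 * i;
        std::uint8_t* d = dst + 3 * i;
        d[0] = s[3]; // B
        d[1] = s[2]; // G
        d[2] = s[1]; // R (alpha in s[0] is dropped)
    }
}
```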

Is reading the `length` property of an array really that expensive an operation in JavaScript?

Well, I would have said it was expensive, but then I wrote a little test at jsperf.com and, to my surprise, using i<array.length was actually faster in Chrome, and in Firefox 4 it didn't matter. My suspicion is that length is stored as an integer (Uint32). From the ECMAScript specification (ECMA-262, 5th edition, page 121): Every Array …

Is it possible to tell the branch predictor how likely it is to follow the branch?

Yes, but it will have no effect. The exceptions are older (obsolete) architectures pre-NetBurst, and even then it doesn't do anything measurable. There is a "branch hint" opcode Intel introduced with the NetBurst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted not taken) on some older architectures. GCC …
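At the source level, GCC and Clang accept likelihood information through __builtin_expect, and C++20 adds the [[likely]] and [[unlikely]] attributes. As far as I know, on current x86 targets this mostly influences code layout, such as which path falls through, rather than emitting a hardware hint prefix. UNLIKELY and sum_nonneg below are hypothetical names chosen for the sketch.

```cpp
#include <cstddef>

// GCC/Clang builtin: the expression is expected to evaluate to 0 (false).
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Sums the values, treating a negative entry as a rare error (returns -1).
long sum_nonneg(const int* v, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (UNLIKELY(v[i] < 0)) return -1; // cold path
        total += v[i];
    }
    return total;
}
```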
