micro-optimization – Page 2

Cycles/cost for L1 Cache hit vs. Register on x86?

September 19, 2023 by Tarik

Here’s a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1 To answer your question – an L1 cache hit is cheap, typically only a few cycles, so it comes close to (though is not quite as cheap as) a register access. And of course a cache miss is quite costly 😉 PS: The specifics will vary, but this link has some good ballpark figures: Approximate cost to access various caches … Read more

x86_64 best way to reduce 64 bit register to 32 bit retaining zero or non-zero status

August 24, 2023 by Tarik

Fewest uops (front-end bandwidth): 1 uop, latency 3c (Intel) or 1c (Zen). Also smallest code-size, 5 bytes.

    popcnt %rax, %rax    # 5 bytes, 1 uop for one port
    # if using a different dst, note the output dependency on Intel before ICL

On most CPUs that have it at all, it’s 3c latency, 1c throughput … Read more
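As a rough C++ illustration of the idea in this excerpt (a sketch, not the answer's own code): a population count is zero exactly when its input is zero, and its result (at most 64) always fits in 32 bits. The helper names below are hypothetical. `std::bitset::count` is used as a portable stand-in for the `popcnt` instruction; whether a compiler actually emits `popcnt` depends on the target flags.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of set bits, which is nonzero exactly when x is nonzero.
inline std::uint32_t reduce_popcount(std::uint64_t x) {
    return static_cast<std::uint32_t>(std::bitset<64>(x).count());
}

// Hypothetical alternative: OR the two 32-bit halves together.
inline std::uint32_t reduce_or_halves(std::uint64_t x) {
    return static_cast<std::uint32_t>(x) | static_cast<std::uint32_t>(x >> 32);
}

// Usage check: both helpers agree on zero vs. non-zero.
inline void check_reduce() {
    const std::uint64_t high_only = std::uint64_t{1} << 40;  // low 32 bits are all zero
    assert(reduce_popcount(0) == 0 && reduce_or_halves(0) == 0);
    assert(reduce_popcount(high_only) != 0 && reduce_or_halves(high_only) != 0);
}
```

A plain truncation to 32 bits would lose the `high_only` case, which is why some reduction step is needed at all.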

Why are DateTime.Now DateTime.UtcNow so slow/expensive

August 21, 2023 by Tarik

TickCount just reads a constantly increasing counter. It’s just about the simplest thing you can do. DateTime.UtcNow needs to query the system time – and don’t forget that while TickCount is blissfully ignorant of things like the user changing the clock, or NTP, UtcNow has to take this into account. Now you’ve expressed a performance … Read more
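The question is about .NET, but the same split exists in C++, so here is a loose analogy rather than the C# API itself: `std::chrono::steady_clock` plays roughly the role of a tick counter (monotonic, unaffected by clock changes), while `std::chrono::system_clock` reports wall-clock time that can be adjusted. The helper names are hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: milliseconds from a monotonic clock that never steps backwards.
inline long long monotonic_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Hypothetical helper: milliseconds since the system clock's epoch.
// This value can jump if the user or a time-sync service adjusts the clock.
inline long long wall_clock_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
}

// Measuring an interval: use the monotonic clock, not the wall clock.
inline long long elapsed_since(long long start_ms) {
    const long long delta = monotonic_ms() - start_ms;
    assert(delta >= 0);  // follows from steady_clock being monotonic
    return delta;
}
```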

x > -1 vs x >= 0, is there a performance difference

August 11, 2023 by Tarik

It is very much dependent on the underlying architecture, but any difference will be minuscule. If anything, I’d expect (x >= 0) to be slightly faster, as comparison with 0 comes for free on some instruction sets (such as ARM). Of course, any sensible compiler will choose the best implementation regardless of which variant is … Read more
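For integer types the two comparisons are the same predicate, which is what lets a compiler pick whichever encoding is cheaper. A minimal C++ sketch (function names are hypothetical) that checks the equivalence at compile time and at run time:

```cpp
#include <cassert>
#include <climits>

constexpr bool non_negative_gt(int x) { return x > -1; }
constexpr bool non_negative_ge(int x) { return x >= 0; }

// Compile-time checks at the boundary values.
static_assert(non_negative_gt(INT_MIN) == non_negative_ge(INT_MIN), "INT_MIN");
static_assert(non_negative_gt(-1) == non_negative_ge(-1), "-1");
static_assert(non_negative_gt(0) == non_negative_ge(0), "0");
static_assert(non_negative_gt(INT_MAX) == non_negative_ge(INT_MAX), "INT_MAX");

// Run-time check over a small range around zero.
inline void check_equivalence() {
    for (int x = -1000; x <= 1000; ++x) {
        assert(non_negative_gt(x) == non_negative_ge(x));
    }
}
```

Note that this only holds for integers; for floating-point values such as -0.5 the two tests give different answers.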

Why does breaking the “output dependency” of LZCNT matter?

August 3, 2023 by Tarik

This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous CPUs. It has been fixed for tzcnt and lzcnt as of Skylake-S (client), but the issue remained for popcnt until it was fixed in Cannon Lake. On those micro-architectures the destination operand for tzcnt, lzcnt and popcnt is treated … Read more

Why are loops always compiled into “do…while” style (tail jump)?

July 25, 2023 by Tarik

Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)

Terminology: Wikipedia says “loop inversion” is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.

Fewer instructions / uops inside the loop = better. Structuring the code outside the loop … Read more
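The transformation described here can be written out by hand in C++. This is only a source-level sketch of what the compiler does internally, with hypothetical function names; both versions compute the same sum, but the inverted one tests the condition once up front and then only at the bottom of the loop.

```cpp
#include <cassert>
#include <cstddef>

// Condition at the top: as written in source.
inline long sum_while(const int* p, std::size_t n) {
    long total = 0;
    std::size_t i = 0;
    while (i < n) {
        total += p[i];
        ++i;
    }
    return total;
}

// Loop inversion: if (cond) do { ... } while (cond);
inline long sum_inverted(const int* p, std::size_t n) {
    long total = 0;
    std::size_t i = 0;
    if (i < n) {            // guard runs once, outside the loop
        do {
            total += p[i];
            ++i;
        } while (i < n);    // single conditional branch at the bottom
    }
    return total;
}

inline void check_loops() {
    const int values[] = {1, 2, 3, 4};
    assert(sum_while(values, 4) == sum_inverted(values, 4));
    assert(sum_inverted(values, 0) == 0);  // the guard handles the empty case
}
```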

Why do none of the major compilers optimize this conditional store that checks if the value is already set?

July 13, 2023 by Tarik

The object might be const

It wouldn’t be safe for static const int val = 1; living in read-only memory. The unconditional-store version will segfault trying to write to read-only memory. The version that checks first is safe to call on that object in the C++ abstract machine (via const_cast), so the optimizer has to … Read more
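A small sketch of the two versions being compared, with hypothetical names. The point is that the checked version performs no write when the value is already 1, so calling it on a genuinely const object through `const_cast` is well defined, whereas the unconditional version would modify a const object.

```cpp
#include <cassert>

// Stores only if the value differs: no write happens when v is already 1.
inline void set_checked(int& v) {
    if (v != 1) v = 1;
}

// Always stores, even when the value is already 1.
inline void set_unconditional(int& v) {
    v = 1;
}

// A const object that the implementation may place in read-only memory.
static const int kAlreadyOne = 1;

inline void check_conditional_store() {
    int counter = 0;
    set_checked(counter);
    assert(counter == 1);
    set_unconditional(counter);
    assert(counter == 1);

    // Well defined: the value is already 1, so only a read occurs.
    set_checked(const_cast<int&>(kAlreadyOne));

    // NOT done here: set_unconditional(const_cast<int&>(kAlreadyOne));
    // That would write to a const object, which is undefined behavior
    // and may fault if the object lives in read-only memory.
}
```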

Passing null pointer to placement new

June 9, 2023 by Tarik

While I can’t see much of a question in there except “Has anyone ever needed placement new to correctly handle the null pointer case?” (I haven’t), I think the case is interesting enough to spill some thoughts on the issue. I consider the standard broken or incomplete wrt the placement new function and requirements to … Read more

Does calculating Sqrt(x) as x * InvSqrt(x) make any sense in the Doom 3 BFG code?

June 9, 2023 by Tarik

I can see two reasons for doing it this way: firstly, the “fast invSqrt” method (really Newton-Raphson) is now the method used in a lot of hardware, so this approach leaves open the possibility of taking advantage of such hardware (and doing potentially four or more such operations at once). This article discusses it … Read more
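To make the pattern concrete, here is a scalar C++ sketch of computing sqrt(x) as x * invSqrt(x). It is not the Doom 3 BFG code: the function names are hypothetical, and the initial guess uses the widely known bit-trick constant as an assumption, followed by two Newton-Raphson refinement steps. On real hardware the initial estimate would more likely come from a reciprocal-square-root instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Approximate 1/sqrt(x) for x > 0: rough initial guess, then Newton-Raphson.
inline float inv_sqrt_newton(float x) {
    assert(x > 0.0f);
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the float's bits safely
    bits = 0x5f3759dfu - (bits >> 1);      // assumed magic-constant initial guess
    float y;
    std::memcpy(&y, &bits, sizeof y);
    for (int i = 0; i < 2; ++i) {
        y = y * (1.5f - 0.5f * x * y * y); // Newton-Raphson step for 1/sqrt(x)
    }
    return y;
}

// sqrt(x) = x * (1/sqrt(x)); non-positive inputs return 0 in this sketch,
// which avoids the 0 * infinity case a naive version would hit at x == 0.
inline float sqrt_via_inv_sqrt(float x) {
    return x > 0.0f ? x * inv_sqrt_newton(x) : 0.0f;
}
```

The result is an approximation, so it should be compared against a tolerance rather than for exact equality.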

Go: multiple len() calls vs performance?

June 6, 2023 by Tarik

There are two cases:

- Local slice: length will be cached and there is no overhead
- Global slice or passed (by reference): length cannot be cached and there is overhead

No overhead for local slices

For locally defined slices the length is cached, so there is no runtime overhead. You can see this in the assembly … Read more