micro-optimization
Java: if-return-if-return vs if-return-elseif-return
The generated byte code is identical for those two cases, so it’s purely a matter of style. I produced two methods e1 and e2 and both produced this byte code (read using javap -v): public boolean e1(java.lang.Object); Code: Stack=2, Locals=2, Args_size=2 0: aload_0 1: aload_1 2: if_acmpne 7 5: iconst_1 6: ireturn 7: aload_1 8: … Read more
Modern x86 cost model
The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples. Another excellent reference is Agner Fog’s optimization resources, which have the virtue of also covering AMD cores. Note that specific cost models … Read more
How expensive is it to convert between int and double?
Here’s what I could dig up myself, for x86-64 doing FP math with SSE2 (not legacy x87 where changing the rounding mode for C++’s truncation semantics was expensive): When I take a look at the generated assembly from clang and gcc, it looks like the cast int to double, it boils down to one instruction: … Read more
‘Correct’ unsigned integer comparison
Well, you’ve correctly typified the situation: C/C++ have no way of doing a full signed int/unsigned int comparison with a single compare. I would be surprised if promotion to int64 was faster than doing two comparisons. In my experience, compilers are quite good at realizing that a subexpression like that is pure (has no side … Read more
Is there a faster algorithm for max(ctz(x), ctz(y))?
I don’t think there’s anything better than the naive approach for the maximum. One attempt is using the identity x + y = min(x, y) + max(x, y) and thus max(ctz(x), ctz(y)) = ctz(x) + ctz(y) – min(ctz(x), ctz(y)) This way, we can reduce the max function to the min function we already optimized, albeit … Read more
Why does n++ execute faster than n=n+1?
That would be true if you are working on a “stone-age” compiler… In case of “stone-age”: ++n is faster than n++ is faster than n=n+1 Machine usually have increment x as well as add const to x In case of n++, you will have 2 memory access only (read n, inc n, write n ) … Read more
Is performance reduced when executing loops whose uop count is not a multiple of processor width?
I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell. Other architectures may vary0, and to help sort it all out I welcome additional … Read more