micro-optimization – Page 3

fastest way to negate a number

June 6, 2023 by Tarik

Use something that is readable, such as a *= -1; or a = -a; Leave the rest to the optimizer.
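A minimal C sketch of the two readable forms mentioned above. The function names are hypothetical and exist only for illustration. With optimization enabled, mainstream compilers generally treat both forms the same way, but it is worth confirming that in your own compiler's output rather than assuming it.

```c
/* Two readable ways to negate an int; the names are illustrative only. */
int negate_unary(int a)
{
    a = -a;      /* unary minus */
    return a;
}

int negate_multiply(int a)
{
    a *= -1;     /* multiply by -1; same result for every value except INT_MIN */
    return a;
}
```

Both forms overflow for INT_MIN, which is undefined behavior for signed int in C, so neither is a way around that edge case.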

Comparing two values in the form (a + sqrt(b)) as fast as possible?

June 3, 2023 by Tarik

Here’s a version without sqrt, though I’m not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of values). Here’s the math (how to remove both sqrts):

    ad = a2-a1
    bd = b2-b1
    a1+sqrt(b1) < a2+sqrt(b2)   // subtract a1
    sqrt(b1) < ad+sqrt(b2)      // square it

… Read more
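The excerpt is cut off, so what follows is not the author's derivation. It is one way to finish the idea, sketched in C under the assumptions that b1 and b2 are non-negative and that the squared terms neither overflow nor lose too much precision in double. The function name is hypothetical; ad and bd follow the definitions in the excerpt. The approach moves both square roots to one side, checks signs first, and squares only when both sides share a sign.

```c
#include <stdbool.h>

/* Returns true when a1 + sqrt(b1) < a2 + sqrt(b2), without calling sqrt.
   Assumes b1 >= 0 and b2 >= 0. Hypothetical helper, not the original code. */
bool less_a_plus_sqrt_b(double a1, double b1, double a2, double b2)
{
    double ad = a2 - a1;
    double bd = b2 - b1;
    /* The question is now: sqrt(b1) - sqrt(b2) < ad ?
       Squaring the left side gives b1 + b2 - 2*sqrt(b1*b2), so define: */
    double t = b1 + b2 - ad * ad;

    if (bd > 0.0) {                 /* left side is negative */
        if (ad >= 0.0)
            return true;            /* negative < non-negative */
        /* Both negative: need |left| > |ad|, i.e. t > 2*sqrt(b1*b2). */
        return t > 0.0 && t * t > 4.0 * b1 * b2;
    }

    /* Left side is non-negative. */
    if (ad <= 0.0)
        return false;               /* non-negative is never < non-positive */
    /* Both positive: need left^2 < ad^2, i.e. t < 2*sqrt(b1*b2). */
    return t < 0.0 || t * t < 4.0 * b1 * b2;
}
```

Because the products are rounded, inputs that are very close to a tie may compare differently than a version that calls sqrt. Whether the extra branches beat one or two sqrt calls depends on the data, as the excerpt itself notes.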

Why does Intel’s compiler prefer NEG+ADD over SUB?

May 29, 2023 by Tarik

Strangely, I have a simple answer: because ICC isn’t optimal. When you write your own compiler, you start with a very basic set of operation codes: NOP, MOV, ADD… up to 10 opcodes. You don’t use SUB for a while because it can easily be replaced by an ADD of a negated (NEG) operand. NEG isn’t basic either, … Read more

Avoiding the overhead of C# virtual calls

May 28, 2023 by Tarik

You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic.

    public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
    {
        private readonly TMathFunction mathFunction_;

        public double SomeWork(double input, double step)
        {
            var f = mathFunction_.Calculate(input);
            var dv = mathFunction_.Derivate(input);
            return f - (dv * step);
        }
    }

    // … var … Read more

Can x86’s MOV really be “free”? Why can’t I reproduce this at all?

May 22, 2023 by Tarik

Register-copy is never free for the front-end, only eliminated from actually executing in the back-end (with zero latency) by the issue/rename stage on the following CPUs: AMD Bulldozer family for XMM vector registers, not integer. AMD Zen family for integer and XMM vector registers. (And YMM in Zen2 and later) (See Agner Fog’s microarch guide … Read more

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

May 8, 2023 by Tarik

Other answers welcome to address Sandybridge and IvyBridge in more detail. I don’t have access to that hardware. I haven’t found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I’ve tested so far supports this model: AL is never renamed separately from RAX (or r15b from r15). So if you … Read more

Do java finals help the compiler create more efficient bytecode? [duplicate]

May 7, 2023 by Tarik

The bytecodes are not significantly more or less efficient if you use final, because Java bytecode compilers typically do little in the way of optimization. The efficiency bonus (if any) will be in the native code produced by the JIT compiler. In theory, using final provides a hint to the JIT compiler that should help … Read more

What does `rep ret` mean?

April 23, 2023 by Tarik

There’s a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/ Basically, there was an issue in AMD’s branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and a few other situations), and the workaround was to add the … Read more

‘ … != null’ or ‘null != ….’ best performance?

April 22, 2023 by Tarik

Comparing the generated bytecodes is mostly meaningless, since most of the optimization happens at run time in the JIT compiler. I’m going to guess that in this case, either expression is equally fast. If there’s any difference, it’s negligible. This is not something that you need to worry about. Look for big-picture optimizations.

Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)

April 21, 2023 by Tarik

Related: AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that’s a good thing, with cpu-architecture / asm details. Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that … Read more
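As a rough scalar illustration of the multiple-accumulator idea named in the title (not the FMA-intrinsics loop from the linked answer), the C sketch below splits a dot product across four independent sums, so consecutive additions do not all wait on a single dependency chain. The function name and the choice of four accumulators are assumptions; a suitable count depends on the FP add or FMA latency and throughput of the target CPU.

```c
#include <stddef.h>

/* Dot product with four independent accumulators (illustrative sketch). */
float dot_4acc(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator forms its own dependency chain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the 0 to 3 leftover elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Floating-point addition is not associative, so this can round differently from a single-accumulator loop. That is also why a compiler will not make this transformation on its own unless it is told reassociation is acceptable, for example with GCC's -ffast-math.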