fastest way to negate a number
Use something that is readable, such as a *= -1; or a = -a; Leave the rest to the optimizer.
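As a minimal sketch of that advice, the two readable spellings can be written as small functions. The names negate_unary and negate_multiply are made up for this illustration, and the expectation that an optimizing compiler generates equivalent code for both is an assumption that is best checked by looking at the generated assembly.

```cpp
#include <cstdint>

// Hypothetical helper names, used only for this illustration.

// Unary minus: the most direct way to say "negate".
constexpr std::int32_t negate_unary(std::int32_t a) {
    return -a;
}

// Multiply by -1: also readable; an optimizing compiler is expected
// (assumption) to treat it the same as unary minus.
constexpr std::int32_t negate_multiply(std::int32_t a) {
    a *= -1;
    return a;
}

// Note: for signed types, negating the most negative value overflows,
// whichever spelling is used.
```

Both functions return the same value for every input that does not overflow, so the choice between them is about readability.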
Here’s a version without sqrt, though I’m not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of values). Here’s the math (how to remove both sqrts):
ad = a2 - a1
bd = b2 - b1
a1 + sqrt(b1) < a2 + sqrt(b2) // subtract a1
sqrt(b1) < ad + sqrt(b2) // square it
…
Strangely, I have a simple answer: because ICC isn’t optimal. When you write your own compiler, you get started with a very basic set of operation codes: NOP, MOV, ADD… up to 10 opcodes. You don’t use SUB for a while because it can easily be replaced by ADD NEGgative operand. NEG isn’t basic either, …
You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic. public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction { private readonly TMathFunction mathFunction_; public double SomeWork(double input, double step) { var f = mathFunction_.Calculate(input); var dv = mathFunction_.Derivate(input); return f - (dv * step); } } // … var …
Register-copy is never free for the front-end, only eliminated from actually executing in the back-end (with zero latency) by the issue/rename stage on the following CPUs: AMD Bulldozer family for XMM vector registers, not integer. AMD Zen family for integer and XMM vector registers. (And YMM in Zen2 and later) (See Agner Fog’s microarch guide …
Other answers welcome to address Sandybridge and IvyBridge in more detail. I don’t have access to that hardware. I haven’t found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I’ve tested so far supports this model: AL is never renamed separately from RAX (or r15b from r15). So if you …
The bytecodes are not significantly more or less efficient if you use final, because Java bytecode compilers typically do little in the way of optimization. The efficiency bonus (if any) will be in the native code produced by the JIT compiler. In theory, using final provides a hint to the JIT compiler that should help …
There’s a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/ Basically, there was an issue in AMD’s branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and a few other situations), and the workaround was to add the …
Comparing the generated bytecodes is mostly meaningless, since most of the optimization happens at run time in the JIT compiler. I’m going to guess that in this case, either expression is equally fast. If there’s any difference, it’s negligible. This is not something that you need to worry about. Look for big-picture optimizations.
Related: AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that’s a good thing, with cpu-architecture / asm details. Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that …
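To illustrate the multiple-accumulator idea mentioned above without depending on a particular instruction set, here is a minimal scalar sketch in portable C++. It is not the intrinsics loop from the linked answer: the function name dot_multi_acc and the choice of four accumulators are assumptions made for this example.

```cpp
#include <cstddef>

// Scalar dot product with four independent accumulators.
// Each accumulator forms its own dependency chain, so consecutive
// multiply-adds do not all have to wait on one running sum.
float dot_multi_acc(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;

    // Main loop: four elements per iteration, one per accumulator.
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    // Combine the partial sums, then handle the leftover elements.
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because floating-point addition is not associative, the result can differ in rounding from a single-accumulator loop. That reordering is also why compilers generally need relaxed floating-point options before they will make this transformation themselves.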