Comparing two values in the form (a + sqrt(b)) as fast as possible?

Here’s a version without sqrt, though I’m not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of values). Here’s the math (how to remove both sqrts):
ad = a2-a1
bd = b2-b1
a1+sqrt(b1) < a2+sqrt(b2) // subtract a1
sqrt(b1) < ad+sqrt(b2) // square it
… Read more

Avoiding the overhead of C# virtual calls

You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic.
public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
{
    private readonly TMathFunction mathFunction_;
    public double SomeWork(double input, double step)
    {
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}
// …
var … Read more

Can x86’s MOV really be “free”? Why can’t I reproduce this at all?

Register-copy is never free for the front-end, only eliminated from actually executing in the back-end (with zero latency) by the issue/rename stage on the following CPUs: AMD Bulldozer family for XMM vector registers, not integer. AMD Zen family for integer and XMM vector registers. (And YMM in Zen2 and later) (See Agner Fog’s microarch guide … Read more

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Other answers welcome to address Sandybridge and IvyBridge in more detail. I don’t have access to that hardware. I haven’t found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I’ve tested so far supports this model: AL is never renamed separately from RAX (or r15b from r15). So if you … Read more

Do java finals help the compiler create more efficient bytecode? [duplicate]

The bytecodes are not significantly more or less efficient if you use final, because Java bytecode compilers typically do little in the way of optimization. The efficiency bonus (if any) will be in the native code produced by the JIT compiler. In theory, using final provides a hint to the JIT compiler that should help … Read more

What does `rep ret` mean?

There’s a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/ Basically, there was an issue in AMD’s branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and in a few other situations), and the workaround was to add the … Read more

‘ … != null’ or ‘null != ….’ best performance?

Comparing the generated bytecodes is mostly meaningless, since most of the optimization happens at run time in the JIT compiler. I’m going to guess that in this case, either expression is equally fast. If there’s any difference, it’s negligible. This is not something that you need to worry about. Look for big-picture optimizations.

Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)

Related: AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that’s a good thing, with cpu-architecture / asm details. Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that … Read more
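The multiple-accumulator idea mentioned above can be sketched in plain C++ without intrinsics. This is a scalar illustration, not the linked answer's AVX2/FMA loop: the function name `dot4` and the choice of four accumulators are assumptions made for the example, and the best unroll factor depends on the latency and throughput of the FP multiply, add, or FMA units on the target CPU. Splitting the sum also changes the order of the floating-point additions, so the result can differ slightly in rounding from a single-accumulator loop.

```cpp
#include <cstddef>

// Dot product with four independent accumulators, so consecutive
// multiply-adds do not all wait on one loop-carried dependency chain.
float dot4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;

    // Main loop: each accumulator forms its own dependency chain.
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    // Tail: handle the last n % 4 elements.
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }

    // Combine the partial sums once, after the loop.
    return (s0 + s1) + (s2 + s3);
}
```

With a single accumulator, every addition has to wait for the previous one to finish, so the loop runs at one element per FP-add (or FMA) latency. Independent accumulators let several of those operations be in flight at once.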
