How are x86 uops scheduled, exactly?

Your questions are tough for a couple of reasons: The answer depends a lot on the microarchitecture of the processor which can vary significantly from generation to generation. These are fine-grained details which Intel doesn’t generally release to the public. Nevertheless, I’ll try to answer… When multiple uops are ready in the reservation station, in … Read more

SIMD instructions lowering CPU frequency

The frequency impact depends on the width of the operation and the specific instruction used. There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the “nominal” speed you’ll see written on the box: when the chip says “3.5 GHz turbo”, they are referring to the single-core L0 … Read more

Why is Numpy with Ryzen Threadripper so much slower than Xeon?

As of 2021, Intel unfortunately removed the MKL_DEBUG_CPU_TYPE environment variable to prevent people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works, and AMD users have to either switch to OpenBLAS or keep using MKL. To use the workaround, follow this method: Create a conda environment with conda’s … Read more

Micro fusion and addressing modes

In the decoders and uop-cache, addressing mode doesn’t affect micro-fusion (except that an instruction with an immediate operand can’t micro-fuse a RIP-relative addressing mode). But some combinations of uop and addressing mode can’t stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs “un-laminate” when necessary, at some point before the issue/rename … Read more

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

Here are theoretical max FLOP counts (per core) for a number of recent processor microarchitectures, with an explanation of how to achieve them. In general, to calculate this look up the throughput of the FMA instruction(s) e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply (FMAs per clock) * (vector elements / instruction) * 2 (FLOPs … Read more
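The multiplication described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `peak_flops_per_cycle` is made up for this example, and the Haswell inputs (two 256-bit FMAs per clock) are the commonly cited figures, which you should check against a throughput table for your own CPU.

```python
def peak_flops_per_cycle(fmas_per_clock, vector_bits, element_bits):
    """Theoretical peak FLOPs per cycle per core.

    (FMAs per clock) * (vector elements per instruction) * 2,
    where the factor of 2 counts the multiply and the add in each FMA.
    """
    elements_per_vector = vector_bits // element_bits
    return fmas_per_clock * elements_per_vector * 2


# Assumed Haswell figures: 2 FMAs per clock on 256-bit vectors.
haswell_single = peak_flops_per_cycle(2, 256, 32)  # 8 floats per vector
haswell_double = peak_flops_per_cycle(2, 256, 64)  # 4 doubles per vector

print("single precision:", haswell_single)
print("double precision:", haswell_double)
```

Multiplying the result by the sustained clock frequency and the core count gives a whole-chip upper bound; real code rarely reaches it.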

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions! Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn’t, or vice-versa. That is, you never … Read more
