intel – Page 3 – Tarik Billa

How are x86 uops scheduled, exactly?

May 8, 2023 by Tarik

Your questions are tough for a couple of reasons: The answer depends a lot on the microarchitecture of the processor which can vary significantly from generation to generation. These are fine-grained details which Intel doesn’t generally release to the public. Nevertheless, I’ll try to answer… When multiple uops are ready in the reservation station, in … Read more

What was the original reason for the design of AT&T assembly syntax?

April 29, 2023 by Tarik

UNIX was for a long time developed on the PDP-11, a 16-bit computer from DEC, which had a fairly simple instruction set. Nearly every instruction has two operands, each of which can have one of the following eight addressing modes, here shown in the MACRO-11 assembly language: 0n Rn register 1n (Rn) deferred … Read more
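As a rough illustration of the "mode digit, register digit" notation in the excerpt above, here is a minimal Python sketch that splits a 6-bit PDP-11 operand field into its two octal digits. The excerpt only shows modes 0 and 1; the remaining six entries in the table are taken from general PDP-11 documentation rather than from this post, and the names `MODE_SYNTAX` and `decode_operand` are made up for this example.

```python
# Sketch only: maps a 6-bit PDP-11 operand field (two octal digits,
# mode then register) to its usual assembler notation.
# Modes 2-7 are not in the excerpt; they are filled in from general
# PDP-11 documentation. All names here are hypothetical.
MODE_SYNTAX = {
    0: ("register", "R{n}"),
    1: ("register deferred", "(R{n})"),
    2: ("autoincrement", "(R{n})+"),
    3: ("autoincrement deferred", "@(R{n})+"),
    4: ("autodecrement", "-(R{n})"),
    5: ("autodecrement deferred", "@-(R{n})"),
    6: ("index", "X(R{n})"),
    7: ("index deferred", "@X(R{n})"),
}


def decode_operand(field):
    """Return (mode name, assembler syntax) for a 6-bit operand field."""
    if not 0 <= field <= 0o77:
        raise ValueError("operand field must fit in 6 bits")
    mode = field >> 3   # high octal digit selects the addressing mode
    reg = field & 0o7   # low octal digit selects the register
    name, template = MODE_SYNTAX[mode]
    return name, template.format(n=reg)


if __name__ == "__main__":
    for field in (0o01, 0o12, 0o65):
        print(oct(field), decode_operand(field))
```

Writing the field in octal makes the two digits line up with the "0n", "1n" notation used in the excerpt.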

Where is the L1 memory cache of Intel x86 processors documented?

April 27, 2023 by Tarik

It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn’t find specs. But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the … Read more

SIMD instructions lowering CPU frequency

April 24, 2023 by Tarik

The frequency impact depends on the width of the operation and the specific instruction used. There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the “nominal” speed you’ll see written on the box: when the chip says “3.5 GHz turbo”, they are referring to the single-core L0 … Read more

Why is Numpy with Ryzen Threadripper so much slower than Xeon?

April 14, 2023 by Tarik

As of 2021, Intel unfortunately removed the MKL_DEBUG_CPU_TYPE to prevent people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works, and AMD users have to either switch to OpenBLAS or keep using MKL. To use the workaround, follow this method: Create a conda environment with conda's … Read more

How are denormalized floats handled in C#?

April 9, 2023 by Tarik

There is no such option. The FPU control word in a C# app is initialized by the CLR at startup. Changing it is not an option provided by the framework. Even if you try to change it by P/Invoking _control87_2(), it is not going to last long; any exception will cause the control word … Read more

How are cache memories shared in multicore Intel CPUs?

April 9, 2023 by Tarik

In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.) does each CPU core/processor have its own cache memory (data and program cache)? Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction … Read more
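One way to check which caches are private and which are shared on a particular machine is to read the Linux sysfs cache directory. The sketch below assumes a Linux system that exposes `/sys/devices/system/cpu/cpuN/cache/indexM/`; the function name `read_cache_topology` and its `sysfs_root` parameter are hypothetical, added so the code can be pointed at a different directory. On systems without that directory it simply returns an empty list.

```python
from pathlib import Path


def read_cache_topology(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """List the caches visible to one logical CPU via Linux sysfs.

    Sketch under the assumption that the kernel exposes
    <sysfs_root>/cpu<N>/cache/index<M>/ with the usual attribute files.
    Returns an empty list when that directory is missing.
    """
    cache_dir = Path(sysfs_root) / f"cpu{cpu}" / "cache"
    if not cache_dir.is_dir():
        return []

    entries = []
    for index_dir in sorted(cache_dir.glob("index*")):
        entry = {}
        for name in ("level", "type", "size", "shared_cpu_list"):
            path = index_dir / name
            # Missing attribute files are recorded as None, not guessed.
            entry[name] = path.read_text().strip() if path.is_file() else None
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for cache in read_cache_topology(cpu=0):
        print(cache)
```

A cache whose `shared_cpu_list` names only the logical CPUs of one core is private to that core; a list covering many cores indicates a shared level.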

Micro fusion and addressing modes

April 2, 2023 by Tarik

In the decoders and uop-cache, addressing mode doesn’t affect micro-fusion (except that an instruction with an immediate operand can’t micro-fuse a RIP-relative addressing mode). But some combinations of uop and addressing mode can’t stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs “un-laminate” when necessary, at some point before the issue/rename … Read more

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

March 28, 2023 by Tarik

Here are theoretical max FLOP counts (per core) for a number of recent processor microarchitectures and an explanation of how to achieve them. In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply (FMAs per clock) * (vector elements / instruction) * 2 (FLOPs … Read more
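The multiplication described above can be sketched in a few lines of Python. The function name `peak_flops_per_cycle` and its parameters are made up for this example, and the sample inputs (2 FMAs per clock on 256-bit vectors, as commonly reported for Haswell) are illustrative values rather than figures quoted from this excerpt.

```python
def peak_flops_per_cycle(fmas_per_clock, vector_bits, element_bits):
    """Theoretical per-core peak: FMAs/clock * elements/vector * 2.

    The factor 2 counts one multiply and one add per fused multiply-add.
    Hypothetical helper for illustration only.
    """
    if vector_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of element width")
    elements_per_vector = vector_bits // element_bits
    return fmas_per_clock * elements_per_vector * 2


if __name__ == "__main__":
    # Assumed inputs: 2 FMAs per clock on 256-bit vectors.
    print(peak_flops_per_cycle(2, 256, 32))  # single precision -> 32
    print(peak_flops_per_cycle(2, 256, 64))  # double precision -> 16
```

Multiplying the result by clock frequency and core count gives a whole-chip figure, subject to any frequency reduction under wide-vector load.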

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

March 26, 2023 by Tarik

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions! Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn’t, or vice-versa. That is, you never … Read more