Why does my Intel Skylake / Kaby Lake CPU incur a mysterious factor 3 slowdown in a simple hash table implementation?
Summary: The TL;DR is that loads that miss all levels of the TLB (and so require a page walk) and that are separated by address-unknown stores can't execute in parallel; that is, the loads are serialized and the memory-level parallelism (MLP) factor is capped at 1. Effectively, the stores fence the loads, much as …
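To make the access pattern in the summary concrete, here is a minimal sketch in C. It is not the code from the original question: the function names, the xorshift index generator, and the `table`/`out` arrays are all hypothetical, chosen only to show what "loads separated by address-unknown stores" can look like. In the first function the store address is computed from the value just loaded, so the store's address is unknown until that load completes. In the second, the store address depends only on the loop counter. Whether a compiler keeps this shape, and whether any slowdown shows up, depends on the hardware, the optimization level, and on `table` being large enough that the loads actually miss the TLB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: xorshift64 generator used only to produce
 * scattered indices. The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Pattern A (illustrative): each random load is followed by a store whose
 * address is derived from the loaded value. Until the load completes, the
 * store's address is unknown to the CPU. */
uint64_t loads_with_dependent_stores(const uint32_t *table, size_t table_len,
                                     uint32_t *out, size_t out_len,
                                     size_t iters, uint64_t seed) {
    assert(table_len > 0 && out_len > 0 && seed != 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        size_t idx = (size_t)(xorshift64(&seed) % table_len);
        uint32_t v = table[idx];         /* random load; may miss the TLB */
        out[v % out_len] = (uint32_t)i;  /* store address depends on v */
        sum += v;
    }
    return sum;
}

/* Pattern B (illustrative): same loads, but the store address depends only
 * on the loop counter, so it is known without waiting for the load. */
uint64_t loads_with_independent_stores(const uint32_t *table, size_t table_len,
                                       uint32_t *out, size_t out_len,
                                       size_t iters, uint64_t seed) {
    assert(table_len > 0 && out_len > 0 && seed != 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        size_t idx = (size_t)(xorshift64(&seed) % table_len);
        uint32_t v = table[idx];         /* same random load as pattern A */
        out[i % out_len] = (uint32_t)i;  /* store address independent of v */
        sum += v;
    }
    return sum;
}
```

Both functions visit the same indices for the same seed and return the same sum, so they can be timed against each other on a `table` far larger than the TLB can cover. Any difference observed that way is only suggestive; confirming the mechanism would require performance counters for page walks and outstanding misses.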