Why does vectorizing the loop over 64-bit elements not improve performance on large buffers?

The original answer was valid in 2013. On 2017 hardware, things have changed enough that both the question and the answer are out of date. See the end of this answer for the 2017 update. Original Answer (2013): Because you’re bottlenecked by memory bandwidth. While vectorization and other micro-optimizations can improve the speed of computation, …

Memory bandwidth for many-channel x86 systems

The hardware prefetcher is tuned differently on server and workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you’re experiencing, but from the other side of the coin: Hardware Prefetcher Aggressiveness Controllers: Do We Need …

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array for the other ones, with 4 bytes per entry (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value). When you look …
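A minimal C++ sketch of the layout described above, under stated assumptions. The excerpt is cut off before it explains the lookup, so the escape code (3, meaning "consult the exceptions table"), the binary search, and the names `CompactArray`, `kEscape`, `get`, and `exception_count` are all hypothetical choices made for illustration, not taken from the original answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container illustrating the 2-bit + exceptions-table idea.
class CompactArray {
public:
    // Assumed sentinel: a 2-bit code of 3 means "value is in the exceptions table".
    static constexpr std::uint8_t kEscape = 3;

    explicit CompactArray(const std::vector<std::uint8_t>& values)
        : packed_((values.size() + 3) / 4, 0) {
        // The index must fit in the 24 bits reserved for it.
        assert(values.size() <= (std::size_t{1} << 24));
        for (std::size_t i = 0; i < values.size(); ++i) {
            const std::uint8_t v = values[i];
            const std::uint8_t code = v < kEscape ? v : kEscape;
            // Four 2-bit codes are packed into each byte.
            packed_[i / 4] |= static_cast<std::uint8_t>(code << ((i % 4) * 2));
            if (code == kEscape) {
                // Indices only increase, so the table is sorted as it is built.
                exceptions_.push_back((static_cast<std::uint32_t>(i) << 8) | v);
            }
        }
    }

    // Returns the value originally stored at index i (i must be in range).
    std::uint8_t get(std::size_t i) const {
        const std::uint8_t code = (packed_[i / 4] >> ((i % 4) * 2)) & 0x3;
        if (code != kEscape) {
            return code;  // common case: answered from the 2-bit array alone
        }
        // Rare case: binary search for the entry whose upper 24 bits equal i.
        const std::uint32_t key = static_cast<std::uint32_t>(i) << 8;
        const auto it = std::lower_bound(exceptions_.begin(), exceptions_.end(), key);
        return static_cast<std::uint8_t>(*it & 0xFF);
    }

    std::size_t exception_count() const { return exceptions_.size(); }

private:
    std::vector<std::uint8_t> packed_;       // 2 bits per element
    std::vector<std::uint32_t> exceptions_;  // (idx << 8) | value, sorted by idx
};
```

With this layout the common values cost a quarter of a byte each, and only the rare values pay for a 4-byte entry plus a binary search. Constructing `CompactArray a(values)` and calling `a.get(i)` should return `values[i]` for every in-range index.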
