memory-bandwidth – Tarik Billa

Why does vectorizing the loop over 64-bit elements not improve performance on large buffers?

August 6, 2023 by Tarik

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date. See the end of this answer for the 2017 update. Original Answer (2013): Because you’re bottlenecked by memory bandwidth. While vectorization and other micro-optimizations can improve the speed of computation, … Read more

How to increase performance of memcpy

April 29, 2023 by Tarik

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this … Read more
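The splitting idea in the excerpt above can be sketched roughly as follows. This is not the author's code: the function name `parallel_memcpy`, the `num_threads` parameter, and the even chunking policy are all assumptions made for illustration, and whether it beats a plain `memcpy` depends on the machine's memory bandwidth and core count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): copy n bytes from src
// to dst by giving each of num_threads threads one contiguous slice.
// The regions must not overlap, exactly as with std::memcpy.
void parallel_memcpy(void* dst, const void* src, std::size_t n,
                     unsigned num_threads) {
    assert(num_threads > 0);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

    // Slice size rounded up, so the last thread may get a shorter slice.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t len = std::min(chunk, n - begin);
        if (len == 0) break;  // nothing left to hand out
        workers.emplace_back(
            [d, s, begin, len] { std::memcpy(d + begin, s + begin, len); });
    }
    for (auto& w : workers) w.join();
}
```

A call such as `parallel_memcpy(dst, src, size, 4)` would copy `size` bytes with four threads. Spawning threads on every call has a cost, so a sketch like this only has a chance of paying off for large blocks.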

Memory bandwidth for many-channel x86 systems

January 28, 2023 by Tarik

The hardware prefetcher is tuned differently on server vs workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you’re experiencing, but from the other side of the coin: Hardware Prefetcher Aggressiveness Controllers: Do We Need … Read more

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

December 10, 2022 by Tarik

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value) for the other ones. When you look … Read more
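A minimal sketch of that layout, under stated assumptions: the excerpt is cut off before it describes the lookup, so the rule used here (2-bit codes 0 to 2 hold the value directly, code 3 means "search the sorted exception array") is an assumption, as are the names `CompactArray`, `kEscape`, `get`, and `exception_count`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed escape code: a 2-bit slot holding 3 means the real value is
// stored in the exception array instead.
constexpr std::uint8_t kEscape = 3;

// Hypothetical container: 2 bits per element, plus a sorted array of
// (idx << 8) | value entries for elements that do not fit in the 2-bit slot.
class CompactArray {
public:
    explicit CompactArray(const std::vector<std::uint8_t>& values)
        : packed_((values.size() + 3) / 4, 0) {
        // The index has to fit in the upper 24 bits of an exception entry.
        assert(values.size() <= (std::size_t{1} << 24));
        for (std::size_t i = 0; i < values.size(); ++i) {
            const std::uint8_t v = values[i];
            const std::uint8_t code = (v < kEscape) ? v : kEscape;
            packed_[i / 4] |= static_cast<std::uint8_t>(code << ((i % 4) * 2));
            if (code == kEscape) {
                // Appended in increasing index order, so the array stays sorted.
                exceptions_.push_back((static_cast<std::uint32_t>(i) << 8) | v);
            }
        }
    }

    std::uint8_t get(std::size_t i) const {
        const auto code =
            static_cast<std::uint8_t>((packed_[i / 4] >> ((i % 4) * 2)) & 3u);
        if (code != kEscape) return code;  // common case: one byte load
        // Rare case: binary search for the entry whose upper 24 bits equal i.
        const std::uint32_t key = static_cast<std::uint32_t>(i) << 8;
        const auto it =
            std::lower_bound(exceptions_.begin(), exceptions_.end(), key);
        return static_cast<std::uint8_t>(*it & 0xFFu);
    }

    std::size_t exception_count() const { return exceptions_.size(); }

private:
    std::vector<std::uint8_t> packed_;       // four 2-bit codes per byte
    std::vector<std::uint32_t> exceptions_;  // sorted (idx << 8) | value
};
```

With this layout the common lookup touches a quarter of the memory that a byte-per-value array would, and only the uncommon values pay for a binary search.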