How to check whether memory bandwidth has become a bottleneck?

I have had this problem myself on a NUMA machine with 96×8 cores.

90% of the time the problem is memory/cache synchronisation. If you call synchronisation routines frequently (atomics, mutexes), the corresponding cache line has to be invalidated on all sockets, which can lock down the entire memory bus for many cycles.
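To see the effect for yourself, here is a minimal sketch (my illustration, not the original poster's code) that contrasts one contended atomic counter with per-thread counters padded to separate cache lines. On a multi-socket machine the contended version is dramatically slower, because every increment forces the owning cache line to bounce between sockets:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr long kIters  = 1'000'000;

std::atomic<long> shared_counter{0};     // one line fought over by all threads

struct alignas(64) PaddedCounter {       // one cache line per counter
    std::atomic<long> value{0};
};
PaddedCounter local_counters[kThreads];

template <typename Fn>
double time_ms(Fn fn) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t) pool.emplace_back(fn, t);
    for (auto& th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    double contended = time_ms([](int) {
        for (long i = 0; i < kIters; ++i)
            shared_counter.fetch_add(1, std::memory_order_relaxed);
    });
    double padded = time_ms([](int t) {
        for (long i = 0; i < kIters; ++i)
            local_counters[t].value.fetch_add(1, std::memory_order_relaxed);
    });
    std::printf("contended: %.1f ms, padded: %.1f ms\n", contended, padded);
}
```

The gap between the two numbers is roughly the price you pay for cross-socket cache-line invalidation.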

You can profile this by running a profiler such as Intel VTune or PerfSuite and having it record how long your atomics take. Used properly, they should take somewhere between 10 and 40 cycles. The worst case I saw was 300 cycles, when scaling my multithreaded application to 8 sockets (8×8 cores on Intel Xeon).
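If you want a quick number without firing up a full profiler, a rough sketch like the following measures the cost of an atomic increment in cycles via RDTSC. It assumes x86 and a GCC/Clang toolchain; this single-threaded loop measures the uncontended case, so run it concurrently from threads pinned to different sockets to see the contended cost. Frequency scaling and out-of-order execution make this an estimate only:

```cpp
#include <atomic>
#include <cstdio>
#include <x86intrin.h>

int main() {
    std::atomic<long> x{0};
    constexpr long kIters = 1'000'000;

    unsigned long long start = __rdtsc();
    for (long i = 0; i < kIters; ++i)
        x.fetch_add(1, std::memory_order_seq_cst);   // locked read-modify-write
    unsigned long long end = __rdtsc();

    std::printf("~%.1f cycles per atomic increment (uncontended)\n",
                double(end - start) / kIters);
}
```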

Another easy profiling step: compile without any atomics/mutexes (if your code permits it) and run it across multiple sockets. It should run fast (incorrect, but fast).
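One way to set up that experiment is to hide the choice behind a compile-time switch, so the same code builds with plain integers instead of atomics. A minimal sketch (the `SYNC_FREE` macro name is just an example):

```cpp
#include <atomic>

#ifdef SYNC_FREE
using counter_t = long;                  // racy, but no bus locking
#else
using counter_t = std::atomic<long>;     // correct, pays for cache coherence
#endif

counter_t hits{0};

void record_hit() {
    // With SYNC_FREE defined this compiles to a plain increment;
    // otherwise it is a locked read-modify-write.
    hits += 1;
}
```

Build once with `-DSYNC_FREE` and once without. If the sync-free binary scales across sockets and the normal one does not, synchronisation, not memory bandwidth, is your bottleneck.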

The reason your code runs fast on 8 cores is that Intel processors use cache locking when executing atomics, as long as everything stays on the same physical chip (socket). Once a lock has to go out over the memory bus, things get ugly.
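A quick way to confirm this on Linux (assuming `numactl` is installed) is to pin the whole process to a single socket, e.g. `numactl --cpunodebind=0 --membind=0 ./app`, and compare against an unpinned run across all sockets. If per-core throughput collapses only in the unpinned run, cross-socket lock traffic is the culprit.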

The only thing I can suggest is: scale down how often you call atomics/synchronisation routines.

As for my application: I had to implement a virtually lock-free data structure to scale my code beyond one socket. Every thread accumulates actions that require a lock and regularly checks whether it is its turn to flush them. The threads pass a token around and take turns flushing the accumulated synchronisation actions (see the sketch below). Obviously this only works if you have sufficient other work to do while waiting for your turn.
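A very rough sketch of that token-passing idea (my reconstruction, not the poster's actual code; names like `Action` and `kThreads` are illustrative): each thread buffers its lock-requiring actions locally and only flushes them when a round-robin token says it is that thread's turn. Only the token holder ever touches the shared state, so no mutex is needed, and the single atomic token is mostly read from local cache and written just once per turn:

```cpp
#include <atomic>
#include <functional>
#include <vector>

constexpr int kThreads = 8;
using Action = std::function<void()>;    // a deferred, lock-requiring action

std::atomic<int> token{0};               // id of the thread whose turn it is
thread_local std::vector<Action> pending;

// Called from the worker loop instead of taking a lock immediately.
void defer(Action a) { pending.push_back(std::move(a)); }

// Polled regularly between chunks of real work.
void maybe_flush(int my_id) {
    if (token.load(std::memory_order_acquire) != my_id)
        return;                          // not our turn; keep working
    for (auto& a : pending) a();         // we are the sole flusher right now
    pending.clear();
    // Pass the token to the next thread.
    token.store((my_id + 1) % kThreads, std::memory_order_release);
}
```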
