[I would make this a comment, but do not have enough reputation to do so.]
I have a similar system and see similar results, but can add a few data points:
- If you reverse the direction of your naive `memcpy` (i.e. convert to `*p_dest-- = *p_src--`), then you may get much worse performance than for the forward direction (~637 ms for me). There was a change in `memcpy()` in glibc 2.12 that exposed several bugs for calling `memcpy` on overlapping buffers (http://lwn.net/Articles/414467/), and I believe the issue was caused by switching to a version of `memcpy` that operates backwards. So, backward versus forward copies may explain the `memcpy()`/`memmove()` disparity.
- It seems to be better to not use non-temporal stores. Many optimized `memcpy()` implementations switch to non-temporal stores (which are not cached) for large buffers (i.e. larger than the last level cache). I tested Agner Fog's version of memcpy (http://www.agner.org/optimize/#asmlib) and found that it was approximately the same speed as the version in `glibc`. However, `asmlib` has a function (`SetMemcpyCacheLimit`) that allows setting the threshold above which non-temporal stores are used. Setting that limit to 8 GiB (or just larger than the 1 GiB buffer) to avoid the non-temporal stores doubled performance in my case (time down to 176 ms). Of course, that only matched the forward-direction naive performance, so it is not stellar.
- The BIOS on those systems allows four different hardware prefetchers to be enabled/disabled (MLC Streamer Prefetcher, MLC Spatial Prefetcher, DCU Streamer Prefetcher, and DCU IP Prefetcher). I tried disabling each, but doing so at best maintained performance parity and reduced performance for a few of the settings.
- Disabling the running average power limit (RAPL) DRAM mode has no impact.
- I have access to other Supermicro systems running Fedora 19 (glibc 2.17). With a Supermicro X9DRG-HF board, Fedora 19, and Xeon E5-2670 CPUs, I see similar performance as above. On a Supermicro X10SLM-F single socket board running a Xeon E3-1275 v3 (Haswell) and Fedora 19, I see 9.6 GB/s for `memcpy` (104 ms). The RAM on the Haswell system is DDR3-1600 (same as the other systems).
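For anyone who wants to reproduce the forward versus backward comparison from the first point, here is a minimal sketch of the two naive loops. The function names and the 64-bit element type are my own assumptions for illustration (the original loop may use a different type), and the backward loop is written with pre-decrement so the pointers never step below the start of the buffer; it walks memory in the same direction as the `*p_dest-- = *p_src--` form.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward naive copy: walks both buffers from the lowest address upward.
   Element type (uint64_t) is an assumption made for this sketch. */
void copy_forward(uint64_t *dest, const uint64_t *src, size_t count)
{
    uint64_t *p_dest = dest;
    const uint64_t *p_src = src;

    for (size_t i = 0; i < count; ++i) {
        *p_dest++ = *p_src++;
    }
}

/* Backward naive copy: starts one past the last element and walks downward.
   Pre-decrement keeps both pointers inside (or one past) the buffers. */
void copy_backward(uint64_t *dest, const uint64_t *src, size_t count)
{
    uint64_t *p_dest = dest + count;
    const uint64_t *p_src = src + count;

    for (size_t i = 0; i < count; ++i) {
        *--p_dest = *--p_src;
    }
}
```

Timing each function over the same pair of large, non-overlapping buffers should be enough to see whether the direction matters on a given machine; the compiler and optimization level may change the outcome, so treat any single number with caution.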
UPDATES
- I set the CPU power management to Max Performance and disabled hyperthreading in the BIOS. Based on `/proc/cpuinfo`, the cores were then clocked at 3 GHz. However, this oddly decreased memory performance by around 10%.
- memtest86+ 4.10 reports bandwidth to main memory of 9091 MB/s. I could not find if this corresponds to read, write, or copy.
- The STREAM benchmark reports 13422 MB/s for copy, but they count bytes as both read and written, so that corresponds to ~6.5 GB/s if we want to compare to the above results.
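
To make the numbers above easier to compare, here is a rough sketch of the arithmetic. It assumes that the timings are for a single copy of the 1 GiB buffer, that "GB/s" in my figures means 2^30 bytes per second, and that STREAM reports decimal MB/s (10^6 bytes) while counting every byte once as read and once as written. The function names are made up for this example.

```c
/* Bytes in one GiB (2^30). */
#define GIB 1073741824.0

/* Copy bandwidth in GiB/s for `bytes` copied in `ms` milliseconds,
   counting each copied byte once. */
double copy_gib_per_s(double bytes, double ms)
{
    return bytes / GIB / (ms / 1000.0);
}

/* Convert a STREAM "copy" rate (decimal MB/s, bytes counted as both
   read and written) into GiB/s with each byte counted once. */
double stream_copy_to_gib_per_s(double stream_mb_per_s)
{
    return stream_mb_per_s * 1e6 / 2.0 / GIB;
}
```

With these assumptions, `copy_gib_per_s(GIB, 104.0)` gives about 9.6, which matches the Haswell figure, and 176 ms and 637 ms work out to roughly 5.7 and 1.6. `stream_copy_to_gib_per_s(13422.0)` gives about 6.25 (or about 6.7 if GB is read as 10^9 bytes), so the ~6.5 GB/s estimate is in the right range either way.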