First, double-check which cores are on each node. I don’t recall cores and nodes being interleaved like that.
Also, you should see 16 hardware threads because of Hyper-Threading (unless you disabled it).
The socket 1366 Xeon machines are only slightly NUMA, so the difference will be hard to see. The NUMA effect is much more noticeable on the 4P Opterons.
On systems like yours, the node-to-node bandwidth is actually higher than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, the hardware prefetcher can keep the pipeline fed, and you get the full bandwidth regardless of whether or not the data is local. A better thing to measure is latency. Try accessing a 1 GB block at random instead of streaming through it sequentially.
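One common way to measure latency rather than bandwidth is pointer chasing: each load depends on the result of the previous one, so the prefetcher cannot run ahead. The following is only a rough sketch, not your code. The helper names (build_random_cycle, chase, ns_per_load) are made up for illustration, and it assumes a POSIX system for clock_gettime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Small xorshift PRNG so the shuffle does not depend on RAND_MAX. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill next[] with one random cycle covering all n slots (Sattolo's
 * algorithm), so a chase starting anywhere visits every slot once
 * before returning to its start. */
static void build_random_cycle(size_t *next, size_t n, uint64_t seed) {
    assert(n > 0);
    uint64_t state = seed ? seed : 1;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&state) % i);  /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final index. */
static size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent load. The final index is written to
 * *sink so the compiler cannot discard the chase. */
static double ns_per_load(const size_t *next, size_t steps, size_t *sink) {
    assert(steps > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(next, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To use it, allocate the 1 GB block as an array of size_t (that is, (1 GB / sizeof(size_t)) entries), call build_random_cycle on it once, then time a few tens of millions of steps with ns_per_load and print the sink value. Running it once with the memory on the local node and once on the remote node (for example under numactl with --cpunodebind and --membind) should show the latency gap, if there is one. Keep in mind that with a block this large, part of what you measure is TLB misses, not just DRAM latency.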
Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn’t do anything:
c = ((char*)x)[j]; ((char*)x)[j] = c;
Something like this is much less likely to be eliminated by the compiler, because it actually changes the data:
((char*)x)[j] += 1;
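To be on the safe side, it also helps to actually use the modified data afterwards, so that the stores have an observable effect. Here is a small sketch of that idea. The function names touch_bytes and checksum are mine, and I use unsigned char instead of char so the wrap-around from 255 to 0 is well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Increment every byte in the buffer: one read-modify-write per element. */
static void touch_bytes(void *x, size_t n) {
    assert(x != NULL || n == 0);
    unsigned char *p = (unsigned char *)x;
    for (size_t j = 0; j < n; j++)
        p[j] += 1;
}

/* Sum the buffer. Printing or returning this value after the timed loop
 * makes the earlier stores observable to the rest of the program. */
static unsigned long checksum(const void *x, size_t n) {
    assert(x != NULL || n == 0);
    const unsigned char *p = (const unsigned char *)x;
    unsigned long sum = 0;
    for (size_t j = 0; j < n; j++)
        sum += p[j];
    return sum;
}
```

Time only the touch_bytes call, then print the checksum outside the timed region. It is still worth looking at the generated assembly if the numbers seem too good.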