SSE instructions: which CPUs can do atomic 16B memory operations?

Question

In the Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.1.1 that:

The Intel486 processor (and newer processors since) guarantees that
the following basic memory operations will always be carried out
atomically:

Reading or writing a byte.

Reading or writing a word aligned on a 16-bit boundary.

Reading or writing a doubleword aligned on a 32-bit boundary. The Pentium processor (and newer processors since) guarantees that the
following additional memory operations will always be carried out
atomically:

Reading or writing a quadword aligned on a 64-bit boundary.

16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that
the following additional memory operation will always be carried out
atomically:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the
feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte
memory operations performed by the following instructions will always
be carried out atomically:

MOVAPD, MOVAPS, and MOVDQA.

VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.

VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} are probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not atomic.

On AMD processors, AMD64 Architecture Programmer’s Manual, Section 3.9.1.3 states that

Single load or store operations (from instructions that do just a single load or store) are naturally
atomic on any AMD64 processor as long as they do not cross an aligned 8-byte boundary. Accesses up
to eight bytes in size which do cross such a boundary may be performed atomically using certain
instructions with a lock prefix, such as XCHG, CMPXCHG or CMPXCHG8B, as long as all such
accesses are done using the same technique. (Note that misaligned locked accesses may be subject to
heavy performance penalties.) CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-
bit mode (with certain alignment restrictions).

AMD processors thus do not guarantee that AVX instructions provide 16-byte atomicity.

On Intel processors that don’t support AVX and on AMD processor, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the “CX16” feature bit).

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

Note that the Intel Xeon X3450 and Xeon 5150 don’t support AVX. The Opteron 2435 is an AMD processor.

Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It’s not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the “numactl” tool specifying that each thread runs on a separate socket, I got:

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of “GJ.”

Here’s the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:


.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn’t matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.

Leave a Comment Cancel reply