Efficiency on a 64-bit platform: pointer vs 32-bit array indexing

Question

The problem with low-level advice like this (even coming from Andrei Alexandrescu) is that it ignores the fact that compilers optimize.

Modern compilers optimize so aggressively (and, in general, successfully) that it really becomes a mug’s game to try to second-guess them. On the whole, writing clear, readable code will help you, your colleagues and your compilers analyze the code. And I honestly believe that is the best general advice that can be given.

One of the well-known optimizations which modern compilers use is the conversion between index- and pointer-based loops. In the particular case of your benchmark, with most optimization settings, gcc will compile both the pointer-based and the 32-bit-index-based loop to the same assembler output.

In the following, I replaced the chrono stuff with ++sentry where sentry is a volatile in order to reduce the code size. The assembly output corresponds to:

for (int* q{p}; q != p + n; ++q) ++(*q);
++sentry;
for (int i{0}; i < n; i++) p[i] += 1;

Compile with -O2, this produced the following: (%rdi and %ebp were still initialized from the loop which populated p)

        movq    %rdi, %rdx
        cmpq    %rcx, %rdi
        je      .L10
.L16:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rcx, %rdx
        jne     .L16
.L10:
        movl    sentry(%rip), %eax
        movq    %rdi, %rdx
        addl    $1, %eax
        movl    %eax, sentry(%rip)
        testl   %ebp, %ebp
        jle     .L8
.L14:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne     .L14
.L8:

You can see that there is no difference at all between the loops at .L16 and .L14.

Different optimization settings produce different results, of course. With -O3 the loops are vectorized using SIMD instructions and Duff’s device, but again are almost identical. clang does this optimization at -O2

None of that denies the point being made, which is that the compiler may need to work harder to prove that a pointer which is being written through cannot modify arbitrary memory.

But in this case, as in many cases, the loop index is a local variable and the loop is simple enough that the compiler can fully analyze it, thus allowing strength reduction, unrolling, and vectorization; whether the control variable is a pointer or an index is then irrelevant.

A more interesting example (possibly) is a loop over two arrays where the base elements are different sizes. Given the following two functions:

void d2f_ptr(float* out, const double* in, int n) {
  for (auto lim = out + n; out < lim;) *out++ = *in++;
}
void d2f_idx(float out[], const double in[], int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i];
}

gcc (v5.3.0, -O2) does produce different loops and the index-based loop is one instruction shorter:

d2f_ptr(float*, double const*, int):    d2f_idx(float*, double const*, int):
        movslq  %edx, %rdx                      xorl    %eax, %eax
        leaq    (%rdi,%rdx,4), %rax             testl   %edx, %edx
        cmpq    %rax, %rdi                      jle     .L16
        jnb     .L11
.L15:                                   .L20:
        addq    $4, %rdi                        pxor    %xmm0, %xmm0
        addq    $8, %rsi                        cvtsd2ss (%rsi,%rax,8), %xmm0
        pxor    %xmm0, %xmm0                    movss   %xmm0, (%rdi,%rax,4)
        cvtsd2ss -8(%rsi), %xmm0                addq    $1, %rax
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpl    %eax, %edx
        ja      .L15                            jg      .L20
.L11:                                   .L16:
        ret                                     ret

But change the double and float to objects whose sizes no longer permit the use of the Intel chip’s indexed addressing mode, and the compiler once again converts the index-based code to a pointer-based variant.

Here the code is essentially the same as before, but the double has been padded to 48 bytes:

struct Big { double val; char padding[40]; };
struct Small {
  float val;
  Small& operator=(const Big& other) {
    val = other.val;
    return *this;
  }
};

d2f_ptr(Small*, Big const*, int):       d2f_idx(Small*, Big const*, int):
        movslq  %edx, %rdx                      testl   %edx, %edx
        leaq    (%rdi,%rdx,4), %rax             jle     .L26
        cmpq    %rax, %rdi                      leal    -1(%rdx), %eax
        jnb     .L21                            leaq    4(%rdi,%rax,4), %rax
.L25:                                   .L29:
        addq    $48, %rsi                       pxor    %xmm0, %xmm0
        addq    $4, %rdi                        addq    $4, %rdi
        pxor    %xmm0, %xmm0                    cvtsd2ss (%rsi), %xmm0
        cvtsd2ss -48(%rsi), %xmm0               addq    $48, %rsi
        movss   %xmm0, -4(%rdi)                 movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpq    %rax, %rdi
        ja      .L25                            jne     .L29
.L21:                                   .L26:
        ret                                     ret

It’s possibly worth adding that for compilers, it is not necessarily more difficult to analyze which object a particular pointer write will modify. [Edited: There was a quote from Alexandrescu here, but it wasn’t as relevant as I thought, so I removed it leaving this section to be mostly a strawman.]

In fact, if a pointer is only directly assigned to once, and all other modifications are through increment and decrement operations (including += and -=), then the compiler is totally within its rights to assume that the pointer always points within the same object. If some additive modification of the pointer were to overshoot into some other object, that would be Undefined Behaviour and the compiler can discard that possibility. It’s easy enough to track assign and inc/dec operations in a flow graph, so in cases where the pointer could have been replaced with an index expression, it is quite possible for a compiler to figure that out and thus know that other objects are not being randomly mutated by writes through the pointer.

Leave a Comment Cancel reply