Why do C++ optimizers have problems with these temporary variables or rather why `v[]` should be avoided in tight loops?

Question

It seems the compiler lacks knowledge about the relationship between std::vector<>::size() and the size of the vector’s internal buffer. Suppose std::vector were replaced by a custom, vector-like bugged_vector with a slight bug: its ::size() can sometimes be one more than the internal buffer size n, but only when v[n-2] >= v[n-1].

Then the two snippets again have different semantics: the first has undefined behavior, because it accesses element v[v.size() - 1]. The second does not: thanks to the short-circuit evaluation of &&, v[v.size() - 1] is never read on the last iteration.
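The two snippets themselves are not reproduced in this answer. As a minimal sketch, assuming they look roughly like the following (the names f0 and f1 are taken from the assembly listings; the bodies and the `i + 1 < v.size()` loop bound are my assumptions, not the asker's exact code), the difference is whether v[i + 1] is read before or after the first comparison:

```cpp
#include <cstddef>
#include <vector>

// Assumed shape of the first snippet: all three elements are loaded into
// temporaries, so v[i + 1] is read unconditionally on every iteration.
int f0(const std::vector<int>& v) {
    int n = 0;
    for (std::size_t i = 1; i + 1 < v.size(); ++i) {
        int a = v[i - 1];
        int b = v[i];
        int c = v[i + 1];
        if (a < b && b < c) ++n;
    }
    return n;
}

// Assumed shape of the second snippet: v[i + 1] appears on the right-hand
// side of &&, so it is read only when v[i - 1] < v[i] is true.
int f1(const std::vector<int>& v) {
    int n = 0;
    for (std::size_t i = 1; i + 1 < v.size(); ++i) {
        if (v[i - 1] < v[i] && v[i] < v[i + 1]) ++n;
    }
    return n;
}
```

For a well-behaved std::vector both functions return the same count; they differ only in which reads the abstract machine is guaranteed to perform.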

So, if the compiler can’t prove that our v is not a bugged_vector, it must short-circuit, which introduces an additional jump in the machine code.
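To make the thought experiment concrete, here is one possible sketch of such a bugged_vector. The class layout and the helper count_short_circuit are hypothetical and exist only for illustration; only the behavior of size() follows the description above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type from the thought experiment; not a real library class.
class bugged_vector {
public:
    explicit bugged_vector(std::vector<int> data) : buf_(std::move(data)) {}

    // The "bug": reports one element too many, but only when the last two
    // real elements are not strictly increasing.
    std::size_t size() const {
        const std::size_t n = buf_.size();
        const bool overreport = n >= 2 && buf_[n - 2] >= buf_[n - 1];
        return overreport ? n + 1 : n;
    }

    // Unchecked access, like std::vector::operator[].
    int operator[](std::size_t i) const { return buf_[i]; }

private:
    std::vector<int> buf_;
};

// Short-circuit form: v[i + 1] is read only after v[i - 1] < v[i] holds,
// so the out-of-range element is never touched even when size() lies.
int count_short_circuit(const bugged_vector& v) {
    int n = 0;
    for (std::size_t i = 1; i + 1 < v.size(); ++i) {
        if (v[i - 1] < v[i] && v[i] < v[i + 1]) ++n;
    }
    return n;
}
```

With a buffer of {1, 2, 3, 3}, size() reports 5, yet the loop stays in bounds: on the last iteration 3 < 3 is false and the right-hand side of && is skipped. A version that loads v[i + 1] into a temporary first would read past the buffer on that same input.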

By looking at assembly output from clang, we can see that it actually happens.

From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:

### f0: just the loop
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
    mov     edi, ecx
    cmp     edx, edi
    setl    r10b
    mov     ecx, dword ptr [r8 + 4*rsi + 4]
    lea     rsi, [rsi + 1]
    cmp     edi, ecx
    setl    dl
    and     dl, r10b
    movzx   edx, dl
    add     eax, edx
    cmp     rsi, r9
    mov     edx, edi
    jb      .LBB1_2

And for f1:

### f1: just the loop
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
    mov     esi, r10d
    mov     r10d, dword ptr [r9 + 4*rdi]
    lea     rcx, [rdi + 1]
    cmp     esi, r10d
    jge     .LBB2_4                     # <== This is Extra Jump
    cmp     r10d, dword ptr [r9 + 4*rdi + 4]
    setl    dl
    movzx   edx, dl
    add     eax, edx
.LBB2_4:                                # %._crit_edge.3
    cmp     rcx, r8
    mov     rdi, rcx
    jb      .LBB2_2

I’ve pointed out the extra jump in f1. And as we (hopefully) know, conditional jumps in tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)
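If the extra jump matters, one option is to say at the source level that both comparisons may always be evaluated. The sketch below is my own illustration (count_branch_free is a hypothetical name), and it assumes v is a genuine std::vector, so reading v[i + 1] inside the loop bound is valid. The compiler is still free to choose whatever instructions it likes, so inspect the generated assembly rather than rely on this.

```cpp
#include <cstddef>
#include <vector>

// Combines the two comparison results with bitwise & instead of &&, so no
// short-circuit is required and both comparisons are always evaluated.
int count_branch_free(const std::vector<int>& v) {
    int n = 0;
    for (std::size_t i = 1; i + 1 < v.size(); ++i) {
        const int a = v[i - 1];
        const int b = v[i];
        const int c = v[i + 1];
        n += static_cast<int>(a < b) & static_cast<int>(b < c);
    }
    return n;
}
```

This mirrors the setl / and / add sequence in the f0 loop above: each comparison yields 0 or 1, and the two results are ANDed before being added to the counter.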

~~GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets.~~
Edit: It turns out clang does a better job of optimizing the code. None of the three compilers can prove that it is safe to read v[i + 1] before the comparison in the second example (or they choose not to), but only clang manages to optimize the first example using the additional information that reading v[i + 1] is either valid or UB.

A performance difference of 2% is negligible and can be explained by a different order or choice of some instructions.
