Why does this C++ function produce so many branch mispredictions?

Here are my thoughts on this after staring at it for a while. First of all,
the issue is easily reproducible with -O2, so it’s better to use that as a
reference, as it generates simple, non-unrolled code that is easy to
analyse. The problem with -O3 is essentially the same, just a bit less obvious.
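
For reference, here’s a minimal reconstruction of the code under discussion. The body of test1 is taken from the diffs below; the globals, the data layout and the driver loop in main() are my assumptions, so the original harness may differ in details:

 #include <cstdio>
 #include <cstdlib>

 int *A;
 int size, half, s;

 void test1(int curIndex){
     if(curIndex == half) return;
     if(A[curIndex] == 1) return;

     test1(curIndex+1);
     test1(size - curIndex - 1);

     s += A[curIndex+1] + A[size-curIndex-1];
 }

 int main(int argc, char **argv){
     size = atoi(argv[1]);             // e.g. 111111
     half = size / 2;
     A = new int[size];
     for(int i = 0; i < size; i++)     // half-zeroes with half-ones pattern
         A[i] = (i < half) ? 0 : 1;
     for(int i = 0; i < 100; i++)      // the 100 runs referenced below
         test1(0);
     printf("%d\n", s);
     return 0;
 }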

So, for the first case (the half-zeroes with half-ones pattern) the compiler
generates this code:

 0000000000400a80 <_Z5test1i>:
   400a80:       55                      push   %rbp
   400a81:       53                      push   %rbx
   400a82:       89 fb                   mov    %edi,%ebx
   400a84:       48 83 ec 08             sub    $0x8,%rsp
   400a88:       3b 3d 0e 07 20 00       cmp    0x20070e(%rip),%edi        # 60119c <half>
   400a8e:       74 4f                   je     400adf <_Z5test1i+0x5f>
   400a90:       48 8b 15 09 07 20 00    mov    0x200709(%rip),%rdx        # 6011a0 <A>
   400a97:       48 63 c7                movslq %edi,%rax
   400a9a:       48 8d 2c 85 00 00 00    lea    0x0(,%rax,4),%rbp
   400aa1:       00 
   400aa2:       83 3c 82 01             cmpl   $0x1,(%rdx,%rax,4)
   400aa6:       74 37                   je     400adf <_Z5test1i+0x5f>
   400aa8:       8d 7f 01                lea    0x1(%rdi),%edi
   400aab:       e8 d0 ff ff ff          callq  400a80 <_Z5test1i>
   400ab0:       89 df                   mov    %ebx,%edi
   400ab2:       f7 d7                   not    %edi
   400ab4:       03 3d ee 06 20 00       add    0x2006ee(%rip),%edi        # 6011a8 <size>
   400aba:       e8 c1 ff ff ff          callq  400a80 <_Z5test1i>
   400abf:       8b 05 e3 06 20 00       mov    0x2006e3(%rip),%eax        # 6011a8 <size>
   400ac5:       48 8b 15 d4 06 20 00    mov    0x2006d4(%rip),%rdx        # 6011a0 <A>
   400acc:       29 d8                   sub    %ebx,%eax
   400ace:       48 63 c8                movslq %eax,%rcx
   400ad1:       8b 44 2a 04             mov    0x4(%rdx,%rbp,1),%eax
   400ad5:       03 44 8a fc             add    -0x4(%rdx,%rcx,4),%eax
   400ad9:       01 05 b9 06 20 00       add    %eax,0x2006b9(%rip)        # 601198 <s>
   400adf:       48 83 c4 08             add    $0x8,%rsp
   400ae3:       5b                      pop    %rbx
   400ae4:       5d                      pop    %rbp
   400ae5:       c3                      retq
   400ae6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   400aed:       00 00 00 

Very simple, kind of what you would expect — two conditional branches, two
calls. It gives us these (or similar) statistics on a Core 2 Duo T6570, an AMD
Phenom II X4 925 and a Core i7-4770:

$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45,216,754      branches                                                    
         5,588,484      branch-misses             #   12.36% of all branches        

       0.098535791 seconds time elapsed

If you make this change, moving the assignment before the recursive calls:

 --- file.cpp.orig  2016-09-22 22:59:20.744678438 +0300
 +++ file.cpp   2016-09-22 22:59:36.492583925 +0300
 @@ -15,10 +15,10 @@
      if(curIndex == half) return;
      if(A[curIndex] == 1) return;

 +    s += A[curIndex+1] + A[size-curIndex-1];
      test1(curIndex+1);
      test1(size - curIndex - 1);

 -    s += A[curIndex+1] + A[size-curIndex-1];

  }

The picture changes:

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         39,495,804      branches                                                    
             54,430      branch-misses             #    0.14% of all branches        

        0.039522259 seconds time elapsed

And yes, as was already noted, this is directly related to tail recursion
optimization: if you compile the patched code with
-fno-optimize-sibling-calls you get the same “bad” results. So let’s
look at what we have in the assembly with tail call optimization:

 0000000000400a80 <_Z5test1i>:
   400a80:       3b 3d 16 07 20 00       cmp    0x200716(%rip),%edi        # 60119c <half>
   400a86:       53                      push   %rbx
   400a87:       89 fb                   mov    %edi,%ebx
   400a89:       74 5f                   je     400aea <_Z5test1i+0x6a>
   400a8b:       48 8b 05 0e 07 20 00    mov    0x20070e(%rip),%rax        # 6011a0 <A>
   400a92:       48 63 d7                movslq %edi,%rdx
   400a95:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400a99:       74 4f                   je     400aea <_Z5test1i+0x6a>
   400a9b:       8b 0d 07 07 20 00       mov    0x200707(%rip),%ecx        # 6011a8 <size>
   400aa1:       eb 15                   jmp    400ab8 <_Z5test1i+0x38>
   400aa3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   400aa8:       48 8b 05 f1 06 20 00    mov    0x2006f1(%rip),%rax        # 6011a0 <A>
   400aaf:       48 63 d3                movslq %ebx,%rdx
   400ab2:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400ab6:       74 32                   je     400aea <_Z5test1i+0x6a>
   400ab8:       29 d9                   sub    %ebx,%ecx
   400aba:       8d 7b 01                lea    0x1(%rbx),%edi
   400abd:       8b 54 90 04             mov    0x4(%rax,%rdx,4),%edx
   400ac1:       48 63 c9                movslq %ecx,%rcx
   400ac4:       03 54 88 fc             add    -0x4(%rax,%rcx,4),%edx
   400ac8:       01 15 ca 06 20 00       add    %edx,0x2006ca(%rip)        # 601198 <s>
   400ace:       e8 ad ff ff ff          callq  400a80 <_Z5test1i>
   400ad3:       8b 0d cf 06 20 00       mov    0x2006cf(%rip),%ecx        # 6011a8 <size>
   400ad9:       89 c8                   mov    %ecx,%eax
   400adb:       29 d8                   sub    %ebx,%eax
   400add:       89 c3                   mov    %eax,%ebx
   400adf:       83 eb 01                sub    $0x1,%ebx
   400ae2:       39 1d b4 06 20 00       cmp    %ebx,0x2006b4(%rip)        # 60119c <half>
   400ae8:       75 be                   jne    400aa8 <_Z5test1i+0x28>
   400aea:       5b                      pop    %rbx
   400aeb:       c3                      retq
   400aec:       0f 1f 40 00             nopl   0x0(%rax)

It has four conditional branches and one call. So let’s analyse the data
we’ve got so far.

First of all, what is a branching instruction from the processor’s perspective? It’s any of call, ret, j* (including direct jmp) and loop. Counting call and jmp as branches may look unintuitive, but it is crucial for getting the numbers right.
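
Concretely, these are the instructions in the -O2 listing above that perf counts as branches:

   400a8e:  je      - conditional branch
   400aa6:  je      - conditional branch
   400aab:  callq   - unconditional, but still counted
   400aba:  callq   - unconditional, but still counted
   400ae5:  retq    - indirect branch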

Overall, we expect this function to be called 11111100 times, one for each
element, which is roughly 11M. In the non-tail-call-optimized version we see
about 45M branches; initialization in main() is just 111K and all the other
contributions are minor, so the bulk of this number comes from our function.
The function is call-ed, it evaluates the first je, which is true in all
cases except one, then it evaluates the second je, which is true half of the
time, and then it either calls itself recursively (but we’ve already counted
each invocation once) or returns (as it does after the recursive calls). So
that’s 4 branching instructions per call, times 11M calls, which is exactly
the number we see.

Out of these, around 5.5M branches are mispredicted. That suggests all the
misses come from a single instruction: either something that’s evaluated 11M
times and mispredicted around 50% of the time, or something that’s evaluated
half of the time and mispredicted always.
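
A quick back-of-the-envelope check against the perf numbers above:

   11,111,100 calls × 4 branching instructions each ≈ 44.4M  (perf: ~45M branches)
   11,111,100 × ~50% missed on one instruction      ≈  5.6M  (perf: ~5.5M misses)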

What do we have in the tail-call-optimized version? The function is called
around 5.5M times, but now each invocation incurs one incoming call, then two
conditional branches up front (the first is true in all cases except one and
the second is always false because of our data), then a jmp, then a call (but
we’ve already counted that we have 5.5M calls), then the branch at 400ae8 and
the branch at 400ab6 (always true because of our data), then a return. So, on
average, that’s four conditional branches, one unconditional jump, a call and
one indirect branch (the return from the function); 5.5M times 7 gives us an
overall count of around 39M branches, exactly what we see in the perf output.
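
The same back-of-the-envelope check for this version:

   5,555,500 invocations × 7 branching instructions each ≈ 38.9M  (perf: ~39M branches)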

What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches), but it has problems with two function calls. This suggests that the problem lies in the returns from the function.

Unfortunately, we know very little about the details of how exactly the branch
predictors of modern processors work. The best analysis that I could find
is this one, and it suggests that processors have a return stack buffer of
around 16 entries. If we return to our data with this finding at hand, things
start to clarify a bit.

When you have the half-zeroes with half-ones pattern, you recurse very
deeply into test1(curIndex+1), but then you start returning back and
calling test1(size-curIndex-1). That recursion is never deeper than one
call, so its returns are predicted perfectly. But remember that we’re
now 55555 invocations deep and the processor only remembers the last 16, so
it’s not surprising that it can’t guess our returns from 55539 levels deep
onwards; it’s more surprising that it manages to do so in the
tail-call-optimized version.

Actually, the behaviour of the tail-call-optimized version suggests that,
lacking any other information about returns, the processor just assumes that
the right one is the last one seen. The behaviour of the
non-tail-call-optimized version bears this out too: it goes 55555 calls deep
into test1(curIndex+1), and then upon each return it goes one level deep into
test1(size-curIndex-1). Once we have unwound from 55555 deep past the last 16
levels the buffer remembers (down to 55539, or whatever your processor’s
return buffer holds), each frame calls into test1(size-curIndex-1), returns
from that, and then has absolutely no information about its own return. So it
assumes we’re to return to the last seen address (which is the address we
returned to from test1(size-curIndex-1)), and it’s obviously wrong. 55539
times wrong. With the 100 runs of the function, that’s exactly the 5.5M
branch prediction misses we see.
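
To make this hypothesis concrete, here is a toy simulator of a 16-entry return stack buffer. The overflow and underflow rules (drop the oldest entry on overflow, fall back to the last return address seen on underflow) are my assumptions reverse-engineered from the measurements, not documented microarchitecture; SITE1 and SITE2 are the two return addresses from the -O2 listing, MAIN is a hypothetical return address into main(). It replays the call/return trace of one run of the non-tail-call-optimized version:

 #include <cstdio>
 #include <deque>

 // Toy model of a 16-entry return stack buffer (RSB).
 struct RSB {
     std::deque<unsigned> buf;   // most recent return address at the back
     unsigned lastSeen = 0;      // fallback prediction when buf is empty
     long mispredicts = 0;

     void call(unsigned retAddr) {
         buf.push_back(retAddr);
         if (buf.size() > 16) buf.pop_front();   // overflow: lose the oldest
     }
     void ret(unsigned actual) {
         unsigned predicted = lastSeen;
         if (!buf.empty()) { predicted = buf.back(); buf.pop_back(); }
         if (predicted != actual) ++mispredicts;
         lastSeen = actual;
     }
 };

 int main() {
     const unsigned SITE1 = 0x400ab0;  // return address of test1(curIndex+1)
     const unsigned SITE2 = 0x400abf;  // return address of test1(size-curIndex-1)
     const unsigned MAIN  = 0x400900;  // hypothetical return address into main()
     const int depth = 55555;

     RSB rsb;
     for (int i = 0; i < depth; ++i) rsb.call(SITE1);  // deep descent
     rsb.ret(SITE1);                                   // deepest frame returns
     for (int i = 0; i < depth; ++i) {
         rsb.call(SITE2);                              // shallow call...
         rsb.ret(SITE2);                               // ...returns immediately
         rsb.ret(i + 1 < depth ? SITE1 : MAIN);        // frame returns to parent
     }
     std::printf("mispredicted returns: %ld\n", rsb.mispredicts);
     // prints 55540: one miss for each level beyond the 16 the buffer keeps;
     // times the 100 runs, that's the ~5.5M misses perf reports
     return 0;
 }

The same model also explains the swapped-calls experiment further below: after the swap, every return during the unwind goes to the address after the deep call site, so the “last seen” fallback happens to be correct.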

Now let’s get to your alternating pattern and the code for that. This code is
actually very different if you analyse how it recurses into the depth. Here
test2(curIndex+1) always returns immediately and test2(curIndex+2) always
goes deeper. So the returns from test2(curIndex+1) are always predicted
perfectly (they just don’t go deep enough), and when we finish our recursion
into test2(curIndex+2), it always returns to the same point, all 55555 times,
so the processor has no problems with that.
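
For reference, test2 presumably looks something like this (a reconstruction from the call sites mentioned above; in particular the termination checks are copied from test1 and may differ in the original):

 void test2(int curIndex){
     if(curIndex == half) return;
     if(A[curIndex] == 1) return;

     test2(curIndex+1);    // with alternating data: lands on a 1, returns immediately
     test2(curIndex+2);    // lands on the next 0, recurses deeper

     s += A[curIndex+1] + A[size-curIndex-1];
 }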

This can be demonstrated further with this little change to your original half-zeroes with half-ones code:

--- file.cpp.orig       2016-09-23 11:00:26.917977032 +0300
+++ file.cpp    2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
   if(curIndex == half) return;
   if(A[curIndex] == 1) return;

-  test1(curIndex+1);
   test1(size - curIndex - 1);
+  test1(curIndex+1);

   s += A[curIndex+1] + A[size-curIndex-1];

The generated code is still not tail-call-optimized (assembly-wise it’s very similar to the original), but you get something like this in the perf output:

$ perf stat -B -e branches,branch-misses ./a.out 111111 
5555500

 Performance counter stats for './a.out 111111':

        45 308 579      branches                                                    
            75 927      branch-misses             #    0,17% of all branches        

       0,026271402 seconds time elapsed

As expected, the first call now always returns immediately and the second call goes 55555 deep and then only returns to the same point.

Now, with that solved, let me show something from up my sleeve. On one system,
a Core i5-5200U, the non-tail-call-optimized original half-zeroes with
half-ones version shows these results:

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         45 331 670      branches                                                    
             16 349      branch-misses             #    0,04% of all branches        

        0,043351547 seconds time elapsed

So, apparently, Broadwell handles this pattern easily, which brings us back to
the question of how much we really know about the branch prediction logic of
our modern processors.
