The most likely cause of the speed improvement is that:
- inserting a MOV shifts the subsequent instructions to different memory addresses
- one of those moved instructions was an important conditional branch
- that branch was being incorrectly predicted due to aliasing in the branch prediction table
- moving the branch eliminated the alias and allowed the branch to be predicted correctly
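To make the chain of reasoning above concrete, here is a rough sketch of how a two-byte MOV can change which prediction-table slot a later branch lands in. The table size, the two branch addresses, and the `table_index` helper are all made up for illustration; they are not the real Core2 parameters.

```python
# Hypothetical table size; the real Core2 tables are organized differently.
TABLE_ENTRIES = 4096

def table_index(branch_address: int) -> int:
    """Index a prediction table by the low-order bits of the branch address."""
    return branch_address % TABLE_ENTRIES

# Made-up addresses of two unrelated conditional branches.
loop_branch = 0x401010
other_branch = 0x403010

# Assumed encoding length of a register-to-register MOV inserted
# somewhere between the two branches.
MOV_SIZE = 2

# Without the MOV, both branches map to the same slot (an alias).
before = (table_index(loop_branch), table_index(other_branch))

# With the MOV, the later branch moves up by two bytes and gets its own slot.
after = (table_index(loop_branch), table_index(other_branch + MOV_SIZE))

print("slots before MOV:", before)
print("slots after MOV: ", after)
```

The only point here is that a tiny shift in code layout is enough to change the index, which is why an otherwise useless instruction can have a measurable effect.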
Your Core2 doesn’t keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The buffer is indexed by the low-order bits of the address of the branch instruction. This works well unless two important uncorrelated branches share the same low-order bits. In that case, you end up with aliasing, which causes many mispredicted branches. Each misprediction stalls the instruction pipeline and slows your program.
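The cost of aliasing can be seen in a toy simulation. This is a deliberately simplified model, assuming a table of 2-bit saturating counters indexed only by address bits, with no global history; a real Core2 predictor is more elaborate. The function name, table size, and addresses are hypothetical.

```python
def count_mispredictions(branches, table_entries, iterations):
    """Simulate a table of 2-bit saturating counters.

    branches: list of (address, taken) pairs executed once per iteration.
    Returns the total number of mispredicted branches.
    """
    counters = [2] * table_entries  # every slot starts at "weakly taken"
    mispredicted = 0
    for _ in range(iterations):
        for address, taken in branches:
            slot = address % table_entries       # index by low address bits
            predicted_taken = counters[slot] >= 2
            if predicted_taken != taken:
                mispredicted += 1
            # Nudge the counter toward the actual outcome, saturating at 0 and 3.
            if taken:
                counters[slot] = min(3, counters[slot] + 1)
            else:
                counters[slot] = max(0, counters[slot] - 1)
    return mispredicted

# Two uncorrelated branches: one always taken, one never taken.
aliased = [(0x401010, True), (0x403010, False)]   # same low 12 bits
shifted = [(0x401010, True), (0x403012, False)]   # second branch moved by 2 bytes

aliased_misses = count_mispredictions(aliased, 4096, 1000)
shifted_misses = count_mispredictions(shifted, 4096, 1000)
print("aliased:", aliased_misses, "shifted:", shifted_misses)
```

In this model the aliased pair keeps fighting over one counter, so the never-taken branch is mispredicted on every pass, while the shifted pair settles after a single warm-up miss.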
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don’t have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.