Such a large difference is the product of two conditions.
The first condition concerns the original code. The in-place merge is efficient enough that it would be difficult to devise anything significantly faster, even when coding by hand at the assembly level. The application of generics is straightforward, so the compiler ** produced the same assembly with or without it. Because the algorithm's implementation is already efficient, even a few machine instructions added inside the loop are enough to produce the large proportional change noted in the question.
** Compilation specifics throughout this answer are for g++ 6.2.1 20160916 (the default Fedora 24 dnf package) under Linux kernel 4.8.8-200.fc24.x86_64, running on an Intel i7-2600 with 8 MB cache, and also for an Atmel SAM3X8E ARM Cortex-M3 using arm-none-eabi-g++ 4.8.3-2014q1.
The second condition concerns how the second trick, described in paragraph 3, sentence 2 of the question, is compiled. The first trick, reducing the number of types in the template, produced no significant change in the assembly language. The second trick produced performance-affecting differences at the assembly level in the compiler output for the two calls.
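Neither function body is reproduced here, so the following is only a sketch of the two shapes being compared: it assumes the original version compares and swaps through the two input iterators directly, while the second trick keeps both iterators in a small array, store[ ], and selects between them with the comparison result. The function names, signatures, and loop framing are assumptions for illustration, not the question's exact code.

    #include <algorithm>   // std::iter_swap

    // Sketch only: assumed shapes of the two variants, not the question's exact code.
    template <typename InputIterator, typename OutputIterator, typename Compare>
    void half_inplace_merge_orig(InputIterator first1, InputIterator last1,
                                 InputIterator first2, InputIterator last2,
                                 OutputIterator result, Compare compare)
    {
        for (; first1 != last1 && first2 != last2; ++ result) {
            // compare and swap through the two input iterators directly
            if (compare(*first2, *first1)) {
                std::iter_swap(result, first2 ++);
            } else {
                std::iter_swap(result, first1 ++);
            }
        }
        // handling of whichever range still has elements is omitted here
    }

    template <typename InputIterator, typename OutputIterator, typename Compare>
    void half_inplace_merge_slow(InputIterator first1, InputIterator last1,
                                 InputIterator first2, InputIterator last2,
                                 OutputIterator result, Compare compare)
    {
        // the second trick: both iterators live in an array initialized once
        // outside the loop, and the comparison result selects between them
        InputIterator store[] = { first1, first2 };
        for (; store[0] != last1 && store[1] != last2; ++ result) {
            bool cmp = compare(*store[1], *store[0]);
            std::iter_swap(result, store[cmp] ++);
        }
        // handling of whichever range still has elements is omitted here
    }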
This preprocessor hack can ease testing.
#ifdef ORIG
#define half_inplace_merge half_inplace_merge_orig
#else // ORIG
#define half_inplace_merge half_inplace_merge_slow
#endif // ORIG
...
half_inplace_merge(niInA.begin(), niInA.end(),
                   niInB.begin(), niInB.end(),
                   niOut.begin(), compare);
Running these commands in a bash shell and comparing the resulting assembly exploits the preprocessor hack.
g++ -DORIG -S -fverbose-asm -o /tmp/qq.orig.s /tmp/qq.cpp
g++ -DSLOW -S -fverbose-asm -o /tmp/qq.slow.s /tmp/qq.cpp
araxis.sh /tmp/qq.orig.s /tmp/qq.slow.s # to run Araxis Merge in Wine
These instructions are a result of initializing the InputIterator array store[ ], but that initialization occurs outside the loop.
leaq -48(%rbp), %rax #, _4
movq -64(%rbp), %rdx # first1, tmp104
movq %rdx, (%rax) # tmp104, *_5
leaq 8(%rax), %rdx #, _9
movq -96(%rbp), %rax # first2, tmp105
movq %rax, (%rdx) # tmp105, *_9
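Assuming the declaration in the sketch above, the single line that plausibly accounts for these two stores is:

    InputIterator store[] = { first1, first2 };   // two 8-byte stores, made once before the loop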
The primary slowdown comes from dereferencing the two items held in store[ ], as needed by the compare and the swap, and that dereferencing is inside the loop. These instructions do not exist in the version without the second trick.
movb %al, -17(%rbp) # _27, cmp
movzbl -17(%rbp), %eax # cmp, _29
cltq
...
movzbl -17(%rbp), %edx # cmp, _31
leaq -48(%rbp), %rax #, tmp121
movslq %edx, %rdx # _31, tmp122
salq $3, %rdx #, tmp123
addq %rdx, %rax # tmp123, _32
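With reference to the sketch above, the statements that plausibly generate this sequence are the write and reload of the comparison result and the indexed access into store[ ]; the salq $3 scales the index by the 8-byte size of an iterator:

    bool cmp = compare(*store[1], *store[0]);   // written to the stack, then reloaded
    std::iter_swap(result, store[cmp] ++);      // index scaled by 8 and added to the base of store[]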
Although there is duplication of code in the bodies of the conditional in the version without the trick, that duplication only affects the compactness of the code, adding two calls, five moves, and one compare instruction. The number of CPU cycles required to perform the in-place merge is the same for the two branches that follow the compare, and both branches lack the instructions listed above.
For each of several syntax permutations tried, removing the redundancy in the branches to improve compactness inescapably introduces additional instructions along the execution path.
The details of the instruction sequences for the various permutations discussed so far will vary with the compiler, the optimization options selected, and even the conditions under which the functions are called.
It is theoretically possible for a compiler to employ an AST (abstract syntax tree) refactoring rule (or the equivalent) to detect and reduce both the program memory and the CPU cycle requirements for either version of the function. Such rules have antecedents (search patterns) that match the pattern to be optimized within the code.
Optimizing speed for the code with the second trick would require a rule antecedent that matches the atypical store[ ] abstraction both inside and outside of the loop. Detecting the branch redundancy without the second trick is a more reasonable goal.
Integrating the two statements within each branch, one can see how the two like patterns in the AST might be simple enough for a refactoring rule antecedent to match and perform the desired code size reduction. There would be very little gain in speed for this case, if any.
if (compare(*first2, *first1)) {
    std::iter_swap(result, first2 ++);
} else {
    std::iter_swap(result, first1 ++);
}
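For comparison, a plausible pre-integration form, assuming each branch originally performed the swap and the increment as two separate statements, would be:

    if (compare(*first2, *first1)) {
        std::iter_swap(result, first2);
        ++ first2;
    } else {
        std::iter_swap(result, first1);
        ++ first1;
    }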