The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case.
Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But, built-in functions for divisions/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).
If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:
- taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
- the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A
divinstruction is still used in most possible paths (especially in this case); - The execution time of the
div/idivinstructions covers most of the overall execution time because of their very high latencies. Thediv/idivinstructions cannot be executed in parallel because of the loop dependencies. However, the latency of adivlower than aidivmaking the former faster.
Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv instruction).
Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).