Why does clang produce a much faster code than gcc for this simple function involving exponentiation?

Question

From this godbolt session clang is able to perform all the pow calculations at compile time. It knows at compile time what the values of k and n are and it just constant folds the calculation:

.LCPI0_0:
    .quad   4604480259023595110     # double 0.69999999999999996
.LCPI0_1:
    .quad   4602498675187552091     # double 0.48999999999999994
.LCPI0_2:
    .quad   4599850558606658239     # double 0.34299999999999992
.LCPI0_3:
    .quad   4597818534454788671     # double 0.24009999999999995
.LCPI0_4:
    .quad   4595223380205512696     # double 0.16806999999999994
.LCPI0_5:
    .quad   4593141924544133109     # double 0.11764899999999996
.LCPI0_6:
    .quad   4590598673379842654     # double 0.082354299999999963
.LCPI0_7:
    .quad   4588468774839143248     # double 0.057648009999999972
.LCPI0_8:
    .quad   4585976388698138603     # double 0.040353606999999979
.LCPI0_9:
    .quad   4583799016135705775     # double 0.028247524899999984
.LCPI0_10:
    .quad   4581356477717521223     # double 0.019773267429999988
.LCPI0_11:
    .quad   4579132580613789641     # double 0.01384128720099999
.LCPI0_12:
    .quad   4576738892963968780     # double 0.0096889010406999918
.LCPI0_13:
    .quad   4574469401809764420     # double 0.0067822307284899942
.LCPI0_14:
    .quad   4572123587912939977     # double 0.0047475615099429958

and it unrolls the inner loop:

.LBB0_2:                                # %.preheader
    faddl   .LCPI0_0(%rip)
    faddl   .LCPI0_1(%rip)
    faddl   .LCPI0_2(%rip)
    faddl   .LCPI0_3(%rip)
    faddl   .LCPI0_4(%rip)
    faddl   .LCPI0_5(%rip)
    faddl   .LCPI0_6(%rip)
    faddl   .LCPI0_7(%rip)
    faddl   .LCPI0_8(%rip)
    faddl   .LCPI0_9(%rip)
    faddl   .LCPI0_10(%rip)
    faddl   .LCPI0_11(%rip)
    faddl   .LCPI0_12(%rip)
    faddl   .LCPI0_13(%rip)
    faddl   .LCPI0_14(%rip)

Note, that it is using a builtin function(gcc documents theirs here) to calculate pow at compile time and if we use -fno-builtin it no longer performs this optimization.

If you change k to 1.0 then gcc is able to perform the same optimization:

.L3:
    fadd    %st, %st(1) #,
    addl    $1, %eax    #, t
    cmpl    %eax, %edi  # t, num
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    jne .L3 #,

Although it is a simpler case.

If you change the condition for the inner loop to n < 4 then gcc seems willing to optimize when k = 0.7. As indicated in the comments to the question, if the compiler does not believe unrolling will help then it will likely be conservative in how much unrolling it will do since there is a code size trade off.

As indicated in the comments I am using a modified version of the OP’s code in the godbolt examples but it does not change the underlying conclusion.

Note as indicated in a comment above if we use -fno-math-errno, which stops errno from being set, gcc does apply a similar optimization.

Leave a Comment Cancel reply