From this godbolt session clang is able to perform all the pow
calculations at compile time. It knows at compile time what the values of k
and n
are and it just constant folds the calculation:
.LCPI0_0:
.quad 4604480259023595110 # double 0.69999999999999996
.LCPI0_1:
.quad 4602498675187552091 # double 0.48999999999999994
.LCPI0_2:
.quad 4599850558606658239 # double 0.34299999999999992
.LCPI0_3:
.quad 4597818534454788671 # double 0.24009999999999995
.LCPI0_4:
.quad 4595223380205512696 # double 0.16806999999999994
.LCPI0_5:
.quad 4593141924544133109 # double 0.11764899999999996
.LCPI0_6:
.quad 4590598673379842654 # double 0.082354299999999963
.LCPI0_7:
.quad 4588468774839143248 # double 0.057648009999999972
.LCPI0_8:
.quad 4585976388698138603 # double 0.040353606999999979
.LCPI0_9:
.quad 4583799016135705775 # double 0.028247524899999984
.LCPI0_10:
.quad 4581356477717521223 # double 0.019773267429999988
.LCPI0_11:
.quad 4579132580613789641 # double 0.01384128720099999
.LCPI0_12:
.quad 4576738892963968780 # double 0.0096889010406999918
.LCPI0_13:
.quad 4574469401809764420 # double 0.0067822307284899942
.LCPI0_14:
.quad 4572123587912939977 # double 0.0047475615099429958
and it unrolls the inner loop:
.LBB0_2: # %.preheader
faddl .LCPI0_0(%rip)
faddl .LCPI0_1(%rip)
faddl .LCPI0_2(%rip)
faddl .LCPI0_3(%rip)
faddl .LCPI0_4(%rip)
faddl .LCPI0_5(%rip)
faddl .LCPI0_6(%rip)
faddl .LCPI0_7(%rip)
faddl .LCPI0_8(%rip)
faddl .LCPI0_9(%rip)
faddl .LCPI0_10(%rip)
faddl .LCPI0_11(%rip)
faddl .LCPI0_12(%rip)
faddl .LCPI0_13(%rip)
faddl .LCPI0_14(%rip)
Note, that it is using a builtin function(gcc documents theirs here) to calculate pow
at compile time and if we use -fno-builtin it no longer performs this optimization.
If you change k
to 1.0
then gcc is able to perform the same optimization:
.L3:
fadd %st, %st(1) #,
addl $1, %eax #, t
cmpl %eax, %edi # t, num
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
jne .L3 #,
Although it is a simpler case.
If you change the condition for the inner loop to n < 4
then gcc seems willing to optimize when k = 0.7
. As indicated in the comments to the question, if the compiler does not believe unrolling will help then it will likely be conservative in how much unrolling it will do since there is a code size trade off.
As indicated in the comments I am using a modified version of the OP’s code in the godbolt examples but it does not change the underlying conclusion.
Note as indicated in a comment above if we use -fno-math-errno, which stops errno
from being set, gcc does apply a similar optimization.