TL;DR
By default, the current GCC 7 snapshot behaves inconsistently, while previous versions have a default limit set by PARAM_MAX_COMPLETELY_PEEL_TIMES, which is 16. It can be overridden from the command line.
The rationale for the limit is to prevent overly aggressive loop unrolling, which can be a double-edged sword.
GCC version <= 6.3.0
The relevant optimization option for GCC is -fpeel-loops, which is enabled indirectly along with the -Ofast flag (emphasis is mine):
Peels loops for which there is enough information that they do not
roll much (from profile feedback or static analysis). It also turns on
complete loop peeling (i.e. complete removal of loops with small
constant number of iterations). Enabled with -O3 and/or -fprofile-use.
More details can be obtained by adding -fdump-tree-cunroll:
$ head test.c.151t.cunroll
;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)
Not peeling: upper bound is known so can unroll completely
The message is from /gcc/tree-ssa-loop-ivcanon.c:
if (maxiter >= 0 && maxiter <= npeel)
  {
    if (dump_file)
      fprintf (dump_file, "Not peeling: upper bound is known so can "
               "unroll completely\n");
    return false;
  }
Hence the try_peel_loop function returns false.
More verbose output can be obtained with -fdump-tree-cunroll-details:
Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely
It is possible to tweak the limits by playing with the max-completely-peeled-insns=n and max-completely-peel-times=n params:
max-completely-peeled-insns
The maximum number of insns of a completely peeled loop.
max-completely-peel-times
The maximum number of iterations of a loop to be suitable for complete
peeling.
To learn more about insns, you can refer to the GCC Internals Manual.
For instance, if you compile with the following options:
-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000
then the code turns into:
f:
vmovss xmm0, DWORD PTR .LC0[rip]
ret
.LC0:
.long 1148207104
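For readers who want to reproduce this, here is a minimal sketch of a test function. It is an assumption, not the original source: only the name f comes from the assembly label above, while the signature, the start value 1.0f, and the 960-iteration count are chosen so that the result is 961, whose single-precision bit pattern is the 1148207104 constant in the listing.

```c
/* Hypothetical test case: the body is a reconstruction chosen to
   match the constant in the assembly, not the original source. */
float f(void)
{
    float p = 1.0f;                /* assumed start value */
    for (int i = 0; i < 960; i++)  /* far more iterations than the default limit of 16 */
        p += 1.0f;
    return p;                      /* 961.0f, bit pattern 1148207104 */
}
```

Compiling this with and without the two --param options, and with -fdump-tree-cunroll-details added, should make it possible to compare the dump messages against those quoted above.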
Clang
I am not sure what Clang actually does or how to tweak its limits, but as I observed, you can force it to evaluate the final value by marking the loop with the unroll pragma, after which it removes the loop completely:
#pragma unroll
for (int i = 0; i < 960; i++)
p++;
results in:
.LCPI0_0:
.long 1148207104 # float 961
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
ret
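To show where the pragma sits in a complete function, here is a small sketch. The function signature and the start value of 1.0f are assumptions made so that the result matches the "float 961" comment in the listing; the __clang__ guard is also an addition, so that other compilers simply skip the hint instead of warning about an unknown pragma.

```c
/* Hypothetical wrapper around the loop shown above. */
float f(void)
{
    float p = 1.0f;                /* assumed start value */
#ifdef __clang__
#pragma unroll                     /* Clang loop hint, applies to the next loop */
#endif
    for (int i = 0; i < 960; i++)
        p++;
    return p;                      /* 961.0f */
}
```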