Why is a simple loop optimized when the limit is 959 but not 960?

TL;DR

By default, the current GCC 7 snapshot behaves inconsistently, while previous versions apply a default limit set by PARAM_MAX_COMPLETELY_PEEL_TIMES, which is 16. The limit can be overridden from the command line.

The rationale for the limit is to prevent overly aggressive loop unrolling, which can be a double-edged sword.

GCC version <= 6.3.0

The relevant GCC optimization option is -fpeel-loops, which is enabled indirectly by the -Ofast flag (emphasis is mine):

Peels loops for which there is enough information that they do not
roll much (from profile feedback or static analysis). It also turns on
complete loop peeling (i.e. complete removal of loops with small
constant number of iterations).

Enabled with -O3 and/or -fprofile-use.

More details can be obtained by adding -fdump-tree-cunroll:

$ head test.c.151t.cunroll 

;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)

Not peeling: upper bound is known so can unroll completely

The message is from /gcc/tree-ssa-loop-ivcanon.c:

if (maxiter >= 0 && maxiter <= npeel)
    {
      if (dump_file)
        fprintf (dump_file, "Not peeling: upper bound is known so can "
         "unroll completely\n");
      return false;
    }

Hence the try_peel_loop function returns false.

More verbose output can be obtained with -fdump-tree-cunroll-details:

Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely
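To experiment with these dumps yourself, a test.c along the following lines could be used. This is only a sketch: the function body is an assumed reconstruction (the starting value 0 and the loop bound 960 are my assumptions, chosen to match the loop shown in the Clang section), not necessarily the exact code from the question.

```c
/* test.c: hypothetical reconstruction of the function under discussion.
 * Could be compiled with, for example:
 *   gcc -march=core-avx2 -Ofast -fdump-tree-cunroll-details -S test.c
 */
float f(void)
{
    float p = 0.0f;              /* assumed starting value */

    /* Constant trip count: the compiler can either fold the whole loop
     * into a single constant or keep it as a real loop, depending on
     * the peeling/unrolling limits. */
    for (int i = 0; i < 960; i++)
        p++;

    return p;
}
```

Every intermediate value here is a small integer that a float represents exactly, so f() returns 960.0f regardless of whether the loop survives in the generated code; only the shape of the assembly changes.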

It is possible to tweak the limits by playing with the max-completely-peeled-insns=n and max-completely-peel-times=n params:

max-completely-peeled-insns
The maximum number of insns of a completely peeled loop.
max-completely-peel-times
The maximum number of iterations of a loop to be suitable for complete
peeling.

To learn more about insns, you can refer to the GCC Internals Manual.

For instance, if you compile with the following options:

-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000

then the code turns into:

f:
        vmovss  xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1148207104

Clang

I am not sure what Clang actually does or how to tweak its limits, but as I observed, you can force it to evaluate the final value by marking the loop with the unroll pragma, and it will then remove the loop completely:

#pragma unroll
for (int i = 0; i < 960; i++)
    p++;

results in:

.LCPI0_0:
        .long   1148207104              # float 960
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret
