`mov` + `adc $-1, %eax` is more efficient than `xor`-zero + `setc` + 3-component `lea` for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.^{1}

**This looks like a gcc missed optimization**: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the `adc` pattern recognition from happening.

I don’t know what exactly it saw / was looking for, so yes, you should report this as a missed-optimization bug. Or, if you want to dig deeper yourself and know something about GCC’s internal representations, you could look at the GIMPLE or RTL output after the optimization passes and see what happens. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as “clone compiler”.

The fact that clang compiles it with `adc` proves that it’s legal, i.e. that the asm you want does match the C++ source, and you didn’t miss some special case that’s stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

That problem can certainly happen if you’re not careful. For example, writing a general-case `adc` function that takes carry-in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry, so you can’t just use the `sum < a+b` idiom after adding the carry to one of the inputs. I’m not sure it’s possible to get gcc or clang to emit `add/adc/adc` where the middle `adc` has to take carry-in and produce carry-out.

e.g. `0xff...ff + 1` wraps around to 0, so `sum = a+b+carry_in` / `carry_out = sum < a` can’t optimize to an `adc` because it needs to *ignore* carry in the special case where `a = -1` and `carry_in = 1`.
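
To make that special case concrete, here is a minimal sketch in portable C. The helper names (`add_carry_naive`, `add_carry_full`, `check_special_case`) are made up for illustration, and `carry_in` is assumed to be 0 or 1. The naive version uses the single-compare idiom; the full version checks each of the two additions separately, which is what a real `adc` carry-out corresponds to.

```c
#include <assert.h>
#include <stdint.h>

// Naive idiom: fold carry_in into the sum, then compare once against a.
// Wrong whenever b + carry_in itself wraps to 0 (b = 0xFFFFFFFF, carry_in = 1),
// because then sum == a and the compare reports no carry.
uint32_t add_carry_naive(uint32_t a, uint32_t b, unsigned carry_in,
                         unsigned *carry_out)
{
    uint32_t sum = a + b + carry_in;
    *carry_out = sum < a;
    return sum;
}

// Two-step version: either addition can carry, so check both.
// At most one of them can carry for any given input, so OR is enough.
uint32_t add_carry_full(uint32_t a, uint32_t b, unsigned carry_in,
                        unsigned *carry_out)
{
    uint32_t partial = a + b;
    unsigned c1 = partial < a;        // carry from a + b
    uint32_t sum = partial + carry_in;
    unsigned c2 = sum < partial;      // carry from adding carry_in
    *carry_out = c1 | c2;
    return sum;
}

// Hypothetical self-check for the all-ones special case.
void check_special_case(void)
{
    unsigned c_naive, c_full;
    uint32_t s_naive = add_carry_naive(UINT32_MAX, UINT32_MAX, 1, &c_naive);
    uint32_t s_full  = add_carry_full(UINT32_MAX, UINT32_MAX, 1, &c_full);
    assert(s_naive == s_full);  // the sums agree (both 0xFFFFFFFF)
    assert(c_full == 1);        // the true 33-bit result has a carry-out
    assert(c_naive == 0);       // the single compare misses it
}
```

With both inputs all-ones and `carry_in = 1`, the two functions return the same sum but disagree on the carry, so a compiler can't legally turn the naive source into a plain `adc`.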

So another guess is that maybe gcc considered doing the `+ X` earlier, and shot itself in the foot because of that special case. That doesn’t make a lot of sense, though.

> What’s the point of using it since it’s up to me to provide the carry flag?

You’re using `_addcarry_u32` correctly.

The point of its existence is to let you express an add with carry *in* as well as carry *out*, which is hard in pure C. GCC and clang don’t optimize it well, though: they often fail to simply keep the carry result in CF between steps.

If you only want carry-out, you can provide a `0` as the carry in and it will optimize to `add` instead of `adc`, but still give you the carry-out as a C variable.
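
As a small sketch of that carry-out-only usage (x86 only; the wrapper name `add_with_carry_out` is hypothetical, only `_addcarry_u32` itself is the real intrinsic):

```c
#include <immintrin.h>   // _addcarry_u32 (x86 / x86-64 compilers only)

// Hypothetical wrapper: plain 32-bit add that also reports the carry-out.
// Passing a constant 0 as carry-in means there is nothing to feed into CF,
// so the compiler is free to use add rather than adc.
unsigned add_with_carry_out(unsigned a, unsigned b, unsigned char *carry_out)
{
    unsigned sum;
    *carry_out = _addcarry_u32(0, a, b, &sum);
    return sum;
}
```

For example, `add_with_carry_out(0xFFFFFFFFu, 1, &c)` returns 0 and sets `c` to 1, while `add_with_carry_out(2, 3, &c)` returns 5 and sets `c` to 0.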

e.g. to add two 128-bit integers in 32-bit chunks, you can do this:

```
#include <immintrin.h>   // _addcarry_u32

// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);      // lowest chunk: no carry-in
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);  // final carry-out is discarded
}
```

(**On Godbolt with GCC/clang/ICC**)

That’s very inefficient vs. `unsigned __int128`, where compilers would just use 64-bit add/adc, but it does get clang and ICC to emit a chain of `add`/`adc`/`adc`/`adc`. GCC makes a mess, using `setcc` to store CF to an integer for some of the steps, then `add dl, -1` to put it back into CF for an `adc`.
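
For comparison, here is a sketch of the two 64-bit alternatives mentioned above, assuming x86-64 with GCC or clang (`unsigned __int128` is a compiler extension on 64-bit targets, and the function names are made up for illustration). The limbs in the second version are assumed to be stored least-significant first.

```c
#include <immintrin.h>   // _addcarry_u64 (x86-64 only)

// Let the compiler handle the carry: this is the form that is expected
// to compile to a 64-bit add/adc pair.
void add_128bit_int128(unsigned __int128 *dst, const unsigned __int128 *src)
{
    *dst += *src;
}

// Same addition as two 64-bit limbs, dst[0] / src[0] being the low halves.
void add_128bit_u64(unsigned long long *__restrict dst,
                    const unsigned long long *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u64(0, dst[0], src[0], &dst[0]);      // low half: no carry-in
    (void)_addcarry_u64(carry, dst[1], src[1], &dst[1]);    // high half: carry-out dropped
}
```

Whether the intrinsic version matches the `__int128` code generation depends on the compiler and version, so it is worth checking the asm on Godbolt rather than assuming.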

GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.

**Footnote 1**: or for uop count: equal on Intel Haswell and earlier where `adc` is 2 uops, except with a zero immediate where Sandybridge-family’s decoders special case that as 1 uop.

But the 3-component LEA with a `base + index + disp` makes it a 3-cycle latency instruction on Intel CPUs, so it’s definitely worse.

On Intel Broadwell and later, `adc` is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

So equal total uop count but worse latency means that `adc` would still be a better choice.

https://agner.org/optimize/