TL;DR: You’re hobbled by bad defaults and by compatibility with obsolete machines. The bad default is gcc setting errno for floating-point computations (even though the C language does not require this); the compatibility problem is support for x86-64 machines that have no SSE instructions newer than SSE2. If you want decent code generation, add -fno-math-errno -msse4 to the compiler flags.
Modern processor architectures with floating-point hardware typically offer square root as a primitive operation (an instruction), which sets an error flag in the floating-point environment if its operand is out of range (negative). Old instruction set architectures, on the other hand, may have had no floating-point instructions at all, or no hardware square root, so the C language permits an implementation to set errno on an out-of-range argument instead. But errno is a thread-local memory location, which practically prevents any sane architecture from setting it directly from the square root instruction. To get decent performance, gcc inlines the square root calculation by emitting the hardware instruction (sqrtsd), but to set errno it separately checks the sign of the argument and calls the library function only if the argument was negative, so that the library function can set errno. Yes, this is crazy, but that in turn is par for the course. You can avoid this brain damage, which nobody ever needs or wants, by adding -fno-math-errno to the compiler flags.
Reasonably recent x86-64 processors have more instructions than the original x86-64 as first developed by AMD, which included only SSE2 vector/floating-point instructions. Among the added instructions are ones that round a floating-point value to an integer with a selectable rounding mode (roundss/roundsd, added in SSE4.1), so rounding and truncation don’t have to be implemented in software. You can get gcc to use these newer instructions by specifying a target that supports them, for example with the -msse4 compiler flag. Note that the generated program will then fault if it is run on a processor that doesn’t support these instructions, so the binary is less portable (though the portability of the source code is obviously unaffected).