ieee-754 – Page 3 – Tarik Billa

32-bit to 16-bit Floating Point Conversion

May 14, 2023 by Tarik

Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it’s branchless. It makes use of the fact that -true == ~0 to perform branchless selections (GCC converts if statements into an unholy mess of conditional jumps, while Clang just converts them to conditional moves). Update (2019-11-04): … Read more
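As a rough illustration of the -true == ~0 trick mentioned above, here is a minimal sketch of a branchless select. The helper name select_u32 is hypothetical and is not taken from the post's conversion code; it only shows how a 0-or-1 condition can be turned into an all-zeros or all-ones mask.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post):
   returns a when cond is nonzero, otherwise b, without an if statement. */
uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    /* (cond != 0) is 0 or 1; subtracting it from 0 in unsigned arithmetic
       yields 0x00000000 or 0xFFFFFFFF, i.e. "-true == ~0". */
    uint32_t mask = 0u - (uint32_t)(cond != 0);

    /* Keep the bits of a where the mask is set and the bits of b elsewhere. */
    return (a & mask) | (b & ~mask);
}
```

Whether the compiler emits a conditional move, a branch, or plain bitwise operations for this still depends on the compiler and its flags.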

Why is Number.MAX_SAFE_INTEGER 9,007,199,254,740,991 and not 9,007,199,254,740,992?

May 4, 2023 by Tarik

I would say it’s because, while Math.pow(2, 53) is the largest directly representable integer, it’s unsafe in that it’s also the first value whose representation is also an approximation of another value: `9007199254740992 == 9007199254740993 // true`. In contrast to Math.pow(2, 53) - 1: `9007199254740991 == 9007199254740993 // false`
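The same effect can be reproduced outside JavaScript, since a JavaScript Number is an IEEE 754 double. The sketch below is in C and assumes the platform's double is IEEE 754 binary64 with the default round-to-nearest mode; the helper name increments_exactly is made up for this illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: true if x + 1.0 is a different double than x.
   The volatile store forces the sum to be rounded to double precision
   even on targets that compute in wider registers. */
bool increments_exactly(double x)
{
    volatile double next = x + 1.0;
    return next != x;
}
```

Under those assumptions, increments_exactly(9007199254740991.0) is true, while increments_exactly(9007199254740992.0) is false, because 2^53 + 1 has no double representation and rounds back to 2^53.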

Difference between Java’s `Double.MIN_NORMAL` and `Double.MIN_VALUE`?

April 25, 2023 by Tarik

The answer can be found in the IEEE specification of floating point representation: For the single format, the difference between a normal number and a subnormal number is that the leading bit of the significand (the bit to the left of the binary point) of a normal number is 1, whereas the leading bit of the … Read more
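To make the normal/subnormal distinction concrete, here is a small C sketch for the double format (the quoted passage talks about the single format, but the idea is the same). It assumes IEEE 754 binary64 doubles; DBL_MIN from <float.h> plays the role of Java's Double.MIN_NORMAL, and the literal 4.9e-324 corresponds to Double.MIN_VALUE. The helper names are illustrative only.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Returns the 11-bit biased exponent field of a double. */
unsigned exponent_field(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bits */
    return (unsigned)((bits >> 52) & 0x7FFu);
}

/* A nonzero double is subnormal when its exponent field is all zeros,
   which is the case where the implicit leading significand bit is 0. */
int is_subnormal(double x)
{
    return x != 0.0 && exponent_field(x) == 0;
}
```

With these helpers, DBL_MIN has exponent field 1 and is normal, whereas DBL_MIN / 2 and 4.9e-324 both report as subnormal.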

Why does division by zero in the IEEE 754 standard result in an infinite value?

April 24, 2023 by Tarik

It’s a nonsense from the mathematical perspective. Yes. No. Sort of. The thing is: Floating-point numbers are approximations. You want to use a wide range of exponents and a limited number of digits and get results which are not completely wrong. 🙂 The idea behind IEEE-754 is that every operation could trigger “traps” which indicate … Read more
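A short C sketch of the default (non-trapping) behaviour, assuming an implementation that follows IEEE 754 for double arithmetic (C leaves floating-point division by zero undefined otherwise). The function names are placeholders for this example.

```c
#include <stdint.h>
#include <string.h>

/* Plain division; the operands are parameters so the result is
   computed according to the platform's floating-point rules. */
double divide(double num, double den)
{
    return num / den;
}

/* Returns the raw 64-bit pattern of a double. */
uint64_t bits_of(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

Under that assumption, divide(1.0, 0.0) has the bit pattern of +infinity (0x7FF0000000000000), divide(-1.0, 0.0) that of -infinity (0xFFF0000000000000), and divide(0.0, 0.0) is a NaN, which can be detected because it compares unequal to itself.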

How does this float square root approximation work?

April 23, 2023 by Tarik

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root. Why almost? In IEEE-754, the actual exponent is e - 127. To divide this by two, we’d need e/2 - 64, but the above approximation only gives us e/2 – … Read more
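The effect of the shift on the exponent field alone can be inspected with a small C sketch. It assumes a 32-bit IEEE 754 float and uses memcpy instead of the pointer cast to stay clear of aliasing problems; the helper names are invented for this illustration, and the sketch deliberately stops short of the full approximation.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw 32-bit pattern of a float. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Extracts the 8-bit biased exponent field e from a bit pattern. */
unsigned biased_exponent(uint32_t bits)
{
    return (unsigned)((bits >> 23) & 0xFFu);
}
```

For f = 16.0f the exponent field is e = 131 (that is, 4 + 127). After shifting the bit pattern right by one, the field reads 65, which is e/2 rounded down. Because the bias was halved along with the exponent, the shifted pattern needs a correction before it can be read back as a float.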

Usefulness of signaling NaN?

April 9, 2023 by Tarik

As I understand it, the purpose of signaling NaN is to initialize data structures, but, of course, runtime initialization in C runs the risk of having the NaN loaded into a float register as part of initialization, thereby triggering the signal because the compiler isn’t aware that this float value needs to be copied … Read more

Double precision – decimal places

April 4, 2023 by Tarik

An IEEE double has 53 significant bits (that’s the value of DBL_MANT_DIG in <cfloat>). That’s approximately 15.95 decimal digits (log10(2^53)); the implementation sets DBL_DIG to 15, not 16, because it has to round down. So you have nearly an extra decimal digit of precision (beyond what’s implied by DBL_DIG==15) because of that. The nextafter() function … Read more
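The arithmetic behind the 15.95 figure can be checked with a few lines of C. This is only a sketch: it assumes the platform's double is IEEE 754 binary64, and the constant LOG10_OF_2 and the function decimal_digits are names introduced here, with log10(2) written out so that no math library is needed.

```c
#include <float.h>

/* log10(2), written out as a constant (an assumption of this sketch,
   used instead of calling log10 from <math.h>). */
#define LOG10_OF_2 0.30102999566398120

/* Decimal digits carried by a binary significand of the given width:
   log10(2^bits) = bits * log10(2). */
double decimal_digits(int significand_bits)
{
    return significand_bits * LOG10_OF_2;
}
```

With DBL_MANT_DIG equal to 53, decimal_digits(DBL_MANT_DIG) comes out just above 15.95, which is why DBL_DIG is 15 while almost 16 digits are actually available.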

Are the bit patterns of NaNs really hardware-dependent?

March 30, 2023 by Tarik

This is what §2.3.2 of the JVM 7 spec has to say about it: The elements of the double value set are exactly the values that can be represented using the double floating-point format defined in the IEEE 754 standard, except that there is only one NaN value (IEEE 754 specifies 2^53 - 2 distinct NaN values). … Read more
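The count of distinct NaN values follows from the binary64 layout: the 11-bit exponent field must be all ones and the 52-bit fraction must be nonzero, with either sign. The C sketch below works on raw 64-bit patterns, so it does not depend on how any particular hardware produces NaNs; the function names are illustrative.

```c
#include <stdint.h>

/* True if the 64-bit pattern encodes a NaN in IEEE 754 binary64:
   exponent field all ones and a nonzero 52-bit fraction. */
int is_nan_pattern(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FFu;
    uint64_t fraction = bits & UINT64_C(0xFFFFFFFFFFFFF);
    return exponent == 0x7FFu && fraction != 0;
}

/* Number of NaN patterns: 2 signs times (2^52 - 1) nonzero fractions,
   which equals 2^53 - 2. */
uint64_t nan_pattern_count(void)
{
    return 2u * ((UINT64_C(1) << 52) - 1);
}
```

nan_pattern_count() returns 9007199254740990, and the two infinities (fraction equal to zero) are correctly excluded by is_nan_pattern.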

Why does IEEE 754 reserve so many NaN values?

March 21, 2023 by Tarik

The IEEE-754 standard defines a NaN as a number with all ones in the exponent, and a non-zero significand. The highest-order bit in the significand specifies whether the NaN is a signaling or quiet one. The remaining bits of the significand form what is referred to as the payload of the NaN. Whenever one of … Read more
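The layout described above can be decoded from a raw 64-bit pattern as follows. This C sketch is for binary64 and assumes the common convention in which a set top fraction bit means quiet (as recommended by IEEE 754-2008 and used on x86 and ARM); some older architectures use the opposite polarity. The function names are made up for the example, and both functions expect a pattern that is already known to be a NaN.

```c
#include <stdint.h>

/* Bit 51 is the highest-order fraction bit. Under the assumed convention,
   1 means quiet NaN and 0 means signaling NaN. */
int nan_is_quiet(uint64_t bits)
{
    return (int)((bits >> 51) & 1u);
}

/* The remaining 51 fraction bits form the payload. */
uint64_t nan_payload(uint64_t bits)
{
    return bits & UINT64_C(0x7FFFFFFFFFFFF);
}
```

For example, 0x7FF8000000000000 decodes as a quiet NaN with payload 0, and 0x7FF0000000000001 as a signaling NaN with payload 1.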

Why does MSVS not optimize away +0? [duplicate]

March 11, 2023 by Tarik

The compiler cannot eliminate the addition of a floating-point positive zero because it is not an identity operation. By IEEE 754 rules, the result of adding +0. to −0. is not −0.; it is +0. The compiler may eliminate the subtraction of +0. or the addition of −0. because those are identity operations. For example, … Read more
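The three cases can be checked directly. The sketch below is in C and assumes IEEE 754 doubles, the default round-to-nearest mode, and no value-changing options such as -ffast-math; the function names are chosen for this illustration only.

```c
#include <stdint.h>
#include <string.h>

/* True only for the bit pattern of -0.0 (sign bit set, all else zero). */
int is_negative_zero(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits == UINT64_C(0x8000000000000000);
}

/* Not an identity: -0.0 + 0.0 gives +0.0. */
double add_positive_zero(double x)      { return x + 0.0; }

/* Identity: x - 0.0 preserves the sign of a zero. */
double subtract_positive_zero(double x) { return x - 0.0; }

/* Identity: x + (-0.0) preserves the sign of a zero. */
double add_negative_zero(double x)      { return x + -0.0; }
```

Passing -0.0 to add_positive_zero returns +0.0, so the sign is lost, while the other two functions return -0.0 unchanged. That is why only those two can be removed by a conforming optimizer.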