Floating point is a giant mess. There are known best practices for most issues that come up in everyday use (e.g., using Kahan summation or adding stochastic noise to reduce aggregate numerical error), but there are still corner cases. Different libraries and implementations give different results in those corner cases because there’s no generally accepted standard for how to handle them.

That’s even true for basic things like transitivity: if a == b and b == c, then a == c. It’s so fundamental that we assumed it as an axiom in a number of math classes I’ve taken.

What happens when we check this on our computer using our handy Haskell REPL?

$ ghci
Prelude> -- Hi! I'm a comment!
Prelude> -- 2^53 == 2.0^53
Prelude>  9007199254740992 == 9007199254740992.0
True
Prelude> -- 2.0^53 == 2^53 + 1
Prelude>  9007199254740992.0 == 9007199254740993
True
Prelude> -- 2^53 == 2^53 + 1
Prelude>  9007199254740992 == 9007199254740993
False

That’s weird. Looks like == isn’t transitive in Haskell. What happens if we try another language?

$ julia
julia> 9007199254740992 == 9007199254740992.0
true
julia> 9007199254740992.0 == 9007199254740993
false
julia> 9007199254740993 == 9007199254740992
false

Ok, so == is transitive over floats and ints in Julia but not in Haskell. How about Octave/Matlab?

$ octave
octave:1> true
ans = 1
octave:2> false
ans = 0
octave:3> 9007199254740992 == 9007199254740993.0
ans =  1
octave:4> 9007199254740992.0 == 9007199254740993
ans =  1
octave:5> 9007199254740992 == 9007199254740993
ans =  1

That’s pretty weird: different ints seem to be equal. According to the docs, you can explicitly cast to an int type. Let’s try that!

octave:6> uint64(9007199254740992) == uint64(9007199254740993)
ans =  1
octave:7> eq(uint64(9007199254740992),uint64(9007199254740993))
ans =  1
octave:8> 0 == 1
ans = 0

Hmmm.

Let’s try a few more languages to see what we get. In the table below, Transitive means the particular operations we tried are transitive, Unique means the two distinct ints stay distinct rather than comparing equal, and Allowed means ints and floats can be compared with == at all.

| Language         | Transitive | Unique | Allowed |
|------------------|------------|--------|---------|
| C gcc 4.6.4      |    No      |   Yes  |   Yes   |
| Clojure 1.1      |    No      |   Yes  |   Yes   |
| C# Mono 2.10.8.1 |    No      |   Yes  |   Yes   |
| Haskell 7.4.1    |    No      |   Yes  |   Yes   |
| Java 1.7         |    No      |   Yes  |   Yes   |
| perl 5.14.2      |    No      |   Yes  |   Yes   |
| Ruby 1.8.7       |    No      |   Yes  |   Yes   |
| Scala 2.9.2      |    No      |   Yes  |   Yes   |
| Scheme (racket)  |    No      |   Yes  |   Yes   |
| Go 1.3           |    Yes     |   Yes  |   Yes   |
| Julia 0.3 rc1    |    Yes     |   Yes  |   Yes   |
| Lisp SBCL 1.0.55 |    Yes     |   Yes  |   Yes   |
| Python 2.7.3     |    Yes     |   Yes  |   Yes   |
| Ruby 2.0         |    Yes     |   Yes  |   Yes   |
| JavaScript       |    Yes     |   No   |   Yes   |
| Lua              |    Yes     |   No   |   Yes   |
| Octave 3.6.2     |    Yes     |   No   |   Yes   |
| Ocaml 3.12.1     |     -      |   -    |   No    |

Hmm. After trying a handful of languages, it looks like there are four groups of languages.

  1. Languages where == isn’t transitive over ints and floats, like C and Haskell.
  2. Languages where == is transitive over ints and floats (at least on this one example), like Go and Julia.
  3. Languages where == is transitive, but only because different ints are equal, like JavaScript [1].
  4. Languages where == isn’t allowed between ints and floats.

What’s going on here? Well, there are two things that combine to cause this weird behavior.

First, (double precision) floating point numbers are represented by 2 raised to an exponent, multiplied by a 52-bit fraction (with an implicit leading 1). As a more familiar example, let’s think about how that would work with decimal numbers in base 10 for a second. Say we had a floating point decimal with a 2 digit (signed) exponent and a 3 digit fraction. Our number would then be 10^(ee) * 1.(fff). So, if our exponent was 01 and our fraction was 000, we’d get 10^01 * 1.000 = 10.00. If we were to increment the fraction to 001, we’d get 10^01 * 1.001 = 10.01. If, instead, the exponent were 02, we’d have 10^02 * 1.000 = 100.0 and 10^02 * 1.001 = 100.1. Notice that the smallest possible increment is the value of the exponent times the least significant digit in the fraction, i.e., (value of exponent) * 10^-(number of digits in fraction); in the last example, that works out to be 10^2 * 10^-3 = 10^-1 = 0.1.

Instead of base 10, floating point numbers operate in base 2. The limit on the smallest possible change we can make still holds, except that for a 64-bit floating point value, we have 11 bits (base-2 digits) of exponent, and 52 bits of fraction (plus 1 bit for the sign). This implies that the smallest increment is (value of exponent) * 2^-52. Now, if we set the value of our exponent to be 53 and our fraction to 0, we get 9007199254740992. The next representable floating point value is 2^53 * 2^-52 = 2 more, 9007199254740994.
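
We can check that spacing directly in ghci. At 2^53, adding 1 has no visible effect, since the exact sum rounds back down under the default round-to-nearest mode, while adding 2 lands on the next representable value:

$ ghci
Prelude> 9007199254740992.0 + 1 == 9007199254740992.0
True
Prelude> 9007199254740992.0 + 2 == 9007199254740994.0
True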

Second, you can’t compare an int and a float with == without converting them to the same type. It’s natural to want to convert both to floats and compare, since floats are a more general type; 1.0 + 2 will give you 3.0 in many languages. But if we promote the int to a float, distinct ints can get promoted to the same float, resulting in a violation of transitivity, as we saw in Haskell [2]. If, instead, we convert floats to ints, the problem goes away, as we saw in Julia.
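
Here’s the collapse in action in ghci; promoting 2^53 + 1 to a double silently turns it into 2^53:

$ ghci
Prelude> fromIntegral (9007199254740993 :: Integer) :: Double
9.007199254740992e15
Prelude> (fromIntegral (9007199254740993 :: Integer) :: Double) == 9007199254740992.0
True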

Well, that problem goes away, at least. But that still leaves a couple other problems. First, we end up with statements like 9007199254740993.0 == 9007199254740993 and int(9007199254740993.0) > 9007199254740992 being false.
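
We can mimic the float-to-int direction in ghci with floor (a sketch of the semantics, not what Julia literally does); note that the literal 9007199254740993.0 has already rounded to 2^53 before any comparison happens:

$ ghci
Prelude> (floor 9007199254740993.0 :: Integer) == 9007199254740993
False
Prelude> (floor 9007199254740993.0 :: Integer) > 9007199254740992
False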

Second, what’s supposed to happen when you compare to a value that’s larger than any int? For a 64-bit float, the exponent can go up to 1023, so values can get close to 2^1024, way more than we can represent in a 64-bit int (which tops out below 2^63). Languages like Python and Ruby that convert values larger than a machine int into some kind of BigInt by default avoid the issue at the cost of some performance. Languages like Julia, which give you machine ints by default, have to deal with overflow.
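
Haskell’s default Integer takes the BigInt route; converting a finite double to Integer is exact at any magnitude, so there’s no overflow to worry about. A quick check in ghci (2^80 is a power of two, so it’s exactly representable as a double):

$ ghci
Prelude> floor (2 ^ 80 :: Double) :: Integer
1208925819614629174706176
Prelude> (floor (2 ^ 80 :: Double) :: Integer) == 2 ^ 80
True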

On overflow, Julia throws an exception if you try to convert a float that’s bigger than 2^63. However, the == comparison always returns false if you compare an int to a float that’s bigger than 2^63. It might seem surprising that int(a) == b throws an exception while a == b is false; that’s part of the case for forcing explicit conversions on comparisons, but given that mixed-type comparisons are allowed in a language, throwing an exception on conversion and returning the correct result without an exception on comparison is the only way to avoid incorrect results.
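
For contrast, here’s a minimal Haskell sketch of a comparison that returns the correct result with no exception and no overflow, by comparing exact rational values instead of converting either side to the other’s machine type (eqIntDouble is my own name for an illustrative helper, not a standard function):

-- Exact Integer/Double equality. Promoting the Integer to Double can
-- collapse distinct values; converting the Double to a machine int can
-- overflow. Comparing exact Rationals avoids both problems.
eqIntDouble :: Integer -> Double -> Bool
eqIntDouble n x
  | isNaN x || isInfinite x = False   -- NaN and infinities equal no integer
  | otherwise               = toRational n == toRational x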

And then there are the languages that make all numbers floats, which is fine if you don’t mind that 9007199254740992 == 9007199254740993. Personally, I mind that 9007199254740992 == 9007199254740993.

Since there’s no way to handle this well, how about just not allowing the comparison to happen? Of the languages I tried, only OCaml actually considers that to be an error. I like this approach a lot, but people often complain about how explicit you have to be with numeric types in OCaml. Technically, languages like Go and Haskell will also error out if you compare an int variable to a float variable, but if you just type an int in, as in the examples above, they’ll helpfully convert to a float for the comparison.
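
For example, giving the operands concrete types in ghci turns our earlier comparison into a type error (the exact message depends on the GHC version, but it looks roughly like this):

$ ghci
Prelude> let a = 9007199254740992 :: Int
Prelude> let b = 9007199254740992.0 :: Double
Prelude> a == b
<interactive>: Couldn't match expected type `Int' with actual type `Double'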

With the == example, it’s possible to maintain consistency by not allowing cross-type comparisons, but there are strange corners in floating point math even in pure floating point land.

What happens when we try to compute 0/0 or Infinity / Infinity? We get a special value reserved for values that are not a number, NaN. NaN is supposed to fail all comparisons, leading to all kinds of fun stuff, like:

$ ghci
Prelude> let nan = 0/0
Prelude> nan == nan
False
Prelude> nan < nan || nan == nan || nan > nan
False
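
Haskell’s three-way compare falls into the same trap. Every comparison against NaN fails, so compare falls through to its last case (this is GHC’s default Ord instance for Double; worth verifying on your version):

Prelude> compare nan nan
GT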

One of the justifications for having a special NaN value is that the NaN is supposed to propagate, indicating that an error occurred at some point in the calculation.

$ ghci
Prelude> let nan = 0/0
Prelude> min nan 1
1.0
Prelude> min 1 nan
NaN

But it doesn’t. The NaN gets suppressed, sometimes, depending on the order of the arguments. This makes sense if you think of min as being implemented by an if/else comparison.

function min(a,b)
  if a < b
    return a
  else
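    # `a < b` is false whenever either argument is NaN,
    # so a NaN in either position lands us here and we return b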
    return b
  end
end

Since the comparison always fails, the else clause always wins.

Not all languages do that. Kahan once proposed that NaN always gets suppressed, matching C’s behavior, treating NaN as a missing value. Other folks think it’s better to propagate the NaN to avoid suppressing errors. But, no matter what happens, reasonable people can disagree on whether or not the result makes sense. Some people want min to be a consistent drop-in replacement for the naive way of coding it up, and some people want to avoid the possibility of missing an error. Some people even want to suppress the error. I don’t understand that one, but Kahan is smarter than me and has thought about this more, so there’s probably a good reason that I just don’t understand.
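
To make the two positions concrete, here are both behaviors sketched in Haskell (minPropagate and minSuppress are names I made up for illustration):

-- Propagating min: any NaN input poisons the result.
minPropagate :: Double -> Double -> Double
minPropagate a b
  | isNaN a || isNaN b = 0 / 0   -- produce a NaN
  | otherwise          = min a b

-- Suppressing min (C's fmin behavior): treat NaN as missing data.
minSuppress :: Double -> Double -> Double
minSuppress a b
  | isNaN a   = b
  | isNaN b   = a
  | otherwise = min a b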

We’ve covered Haskell and C. Let’s see what happens in other languages when we give min a NaN for one argument and 1.0 for the other. The languages we’ve tried do one of the following: always return the 1st argument, always return the 2nd argument, always return a NaN, always return the non-NaN, or throw an exception.

| Language         | 1st | 2nd | NaN | non-NaN | Exception |
|------------------|-----|-----|-----|---------|-----------|
| Scheme (racket)  |     |     |  X  |         |           |
| Clojure 1.1      |     |  X  |     |         |           |
| C# Mono 2.10.8.1 |     |     |  X  |         |           |
| Ruby 1.8.7       |     |     |  X  |         |     X     |
| Java 1.7         |     |     |  X  |         |           |
| Scala 2.9.2      |     |     |  X  |         |           |
| Haskell 7.4.1    |     |  X  |     |         |           |
| C gcc 4.6.4      |     |     |     |    X    |           |
| perl 5.14.2      |  X  |     |     |         |           |
| Julia 0.3 rc1    |     |     |     |    X    |           |
| Go 1.3           |     |     |  X  |         |           |
| Lisp SBCL 1.0.55 |     |     |     |         |     X     |
| Ruby 2.0         |     |     |     |         |     X     |
| Python 2.7.3     |  X  |     |     |         |           |
| JavaScript       |     |     |  X  |         |           |
| Lua              |  X  |     |     |         |           |
| Octave 3.6.2     |     |     |  X  |         |           |
| Ocaml 3.12.1     |     |  X  |     |         |           |

And then there are the folks who just want whatever’s fastest [3], i.e., whatever the hardware does. On x86, the SSE/AVX min and max instructions act like a normal if/else comparison and return the second operand if there’s a NaN. To get either of the alternatives above (NaN propagation or NaN suppression), you need to turn an instruction like

MINPD

into an entire sequence of instructions like

MINPD
CMPNEQPD
BLENDVPD

If that’s in your critical path on x86, getting NaN to propagate is a noticeable performance hit. The equivalent ARM instruction, VPMIN, propagates NaN, so you’re forced to give up on cross-platform consistency if you want the best possible performance.

If you want to propagate NaN, this corner case has its own corner case, which is what to do for min(-Infinity, NaN). The argument for propagating NaN is the same as before. The case for propagating -Infinity is that min(-Infinity, foo) should be -Infinity for all foo. This is analogous to a decades-old unsettled debate in the hardware design world, X-optimism vs. X-pessimism. Unlike in hardware design, where X-ism is central to verifying the correctness of hardware, NaN vs. -Infinity is just a corner case of a corner case, so we haven’t had decades of debate, but that doesn’t make the issue any simpler. I don’t want to replay the entire debate, but suffice it to say that it’s possible to come up with arbitrarily many examples that make either decision look unreasonable.

To sum it up, different languages pick different, totally reasonable behaviors. And if you want good performance, you’ll get different results on different hardware. And that doesn’t even scratch the surface of floating point weirdness. In addition to the problems mentioned in Goldberg’s classic What Every Computer Scientist Should Know About Floating-Point Arithmetic, there are tens or hundreds of weird corner cases that can cause bugs [4].


Thanks to Carter for kicking off a discussion with Stefan about NaN comparisons, to Stefan for his comments on NaN comparisons as well as floating point transitivity, to Erik for comments on some floating point nuances that I missed, and to Alyssa for pointing out some omissions in my comments about Haskell.

  1. This table just refers to this specific case. == is famously not transitive in js for all sorts of things. 

  2. Note that our in between value, 9007199254740993, will get converted to 9007199254740992.0 and not 9007199254740994.0 under the default rounding mode on most platforms. This is because the normal default rounding mode is round to nearest, breaking ties by making the last digit even (i.e., 0 and not 1). That, in turn, is because the standard grade school algorithm of rounding ties up introduces a rounding bias. 

  3. If you read through the IEEE standards meeting notes, you can see that they were pretty concerned about performance: “Roger’s experiments (on Itanium, x87, & SSE/SSE2) suggest that the cost of special handling of +/-0 & NaNs is high (3x, 2x, & 12x, respectively)” 

  4. Here’s something to think about next time someone proposes we standardize on just high-enough precision floating point for all calculations, like in js. What happens when you calculate log(x+1) with log1p(x) = if (1.0 + x == 1.0) then x else log(1.0 + x) * x / ((1.0 + x) - 1.0) on a machine with extended precision floating point vs. one with single or double precision?