Coding by Hand

Floats and Precision

A computer storing a decimal number is a kitchen scale with a dial that has a fixed number of tick marks. Drop a feather on the pan and the needle wobbles to the nearest tick. Drop a bag of flour and the needle still wobbles to the nearest tick, but now each tick is a whole spoonful apart instead of a grain. The dial cannot show more than its tick marks allow. Every decimal you write in a Rust program lands on the nearest tick on the same kind of dial, and once you understand which tick it landed on, the strange answers Rust gives back stop being strange.

An analog kitchen scale whose dial face has a fixed number of tick marks — every weight rounds to the nearest tick.

The dial Rust uses is a 1985 standard called IEEE 754, hammered out over eight years by a committee whose loudest voice was a Berkeley professor named William Kahan. Before 754, every chip vendor invented their own float format. DEC's VAX rounded one way, IBM's mainframe rounded another, an Intel 8087 math chip did a third thing, and code that ran fine on one machine spit out garbage on the next. Boeing engineers refused to use computers from competing vendors for the same calculation because they could not trust the answers would line up. Kahan walked into the IEEE meetings with a pile of bug reports and a design that fixed them. The standard he pushed through is the one in your laptop right now. Rust's f32 and f64 are IEEE 754 to the bit. So is Python's float, Java's double, JavaScript's only number type. Everybody finally agreed on one dial.

Before we look at the bits, the easiest way to see the dial rounding is to add two numbers any fifth grader can add in their head. One-tenth plus two-tenths. A few lines of Rust.

fn show_dial() {
    let a: f64 = 0.1;
    let b: f64 = 0.2;
    let sum = a + b;
    println!("0.1 + 0.2 = {sum}");
    println!("printed wide: {sum:.20}");
    println!("equal to 0.3? {}", sum == 0.3);
}

The dial has three parts. A sign bit that says positive or negative. An exponent that says where on the number line the dial is hovering — near zero, near a million, near a quintillion. And a mantissa that says which tick on the local dial you landed on. The sign is one bit. The exponent in f64 is 11 bits. The mantissa in f64 is 52 bits. Sixty-four bits total, which is why f64 is called a 64-bit float. The next program lays a few familiar numbers out across those fields so you can see the dial in plain sight.

fn show_bits() {
    let samples: [(&str, f64); 4] = [
        ("0.0  ", 0.0),
        ("1.0  ", 1.0),
        ("0.1  ", 0.1),
        ("1e20 ", 1e20),
    ];
    println!("value  | sign | exponent     | mantissa");
    println!("-------+------+--------------+---------------------------------------------------");
    for (label, value) in samples {
        let bits = value.to_bits();
        let sign = (bits >> 63) & 1;
        let exponent = (bits >> 52) & 0x7ff;
        let mantissa = bits & 0x000f_ffff_ffff_ffff;
        println!("{label}  |  {sign}   | {exponent:011b}  | {mantissa:052b}");
    }
}

The numbers we picked are zero, one, one-tenth, and a hundred quintillion. The to_bits method asks Rust to hand back the raw 64 bits as an integer so we can pull the sign, exponent, and mantissa apart with shifts and masks. Run the program and the table tells the whole story.

0.1 + 0.2 = 0.30000000000000004
printed wide: 0.30000000000000004441
equal to 0.3? false

value  | sign | exponent     | mantissa
-------+------+--------------+---------------------------------------------------
0.0    |  0   | 00000000000  | 0000000000000000000000000000000000000000000000000000
1.0    |  0   | 01111111111  | 0000000000000000000000000000000000000000000000000000
0.1    |  0   | 01111111011  | 1001100110011001100110011001100110011001100110011010
1e20   |  0   | 10001000001  | 0101101011110001110101111000101101011000110001000000

nan == nan?      false
nan.is_nan()?    true
1.0 / 0.0 =      inf
0.0 / 0.0 =      NaN
inf - inf =      NaN

true total       = 10100000
naive total      = 10000000
naive lost       = 100000
kahan total      = 10100000
kahan lost       = 0

Zero is all zeros: no sign, no exponent, no mantissa. One has a stored exponent of 1023, which is the IEEE bias: the standard subtracts 1023 from the stored field, so 1023 means a true exponent of zero and the dial is hovering at 2 to the 0, which is 1. Its mantissa is all zeros, because 1 is exactly on a tick. The interesting row is 0.1. In binary, one-tenth is 0.000110011001100… forever; the expansion never ends, the same way one-third never ends in decimal as 0.333…. The dial rounds the infinite tail off at 52 bits and stores the closest tick it can, which is why the mantissa ends in 1010 instead of continuing 1001. That rounded tick, converted back to decimal, is 0.100000000000000005551…. Off by a hair. Add it to 0.2 and you get the most famous wrong answer in computing: the first three lines of the output, where 0.1 + 0.2 comes back as 0.30000000000000004 and Rust correctly reports that this is not equal to 0.3. Three roundings, one for each stored input and one for the sum, stacked up and leaked a four into the seventeenth decimal place.
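To make the three fields concrete, here is a minimal sketch that rebuilds a value from its raw bits using the formula for normal numbers: sign × 2^(exponent − 1023) × (1 + mantissa / 2^52). The name decode_normal is made up for this illustration, and the sketch deliberately skips the special encodings (zero, subnormals, infinity, NaN) by returning None for them.

```rust
/// Rebuilds a normal f64 from its sign, exponent, and mantissa fields.
/// Hypothetical helper for illustration only: stored exponents 0 and 2047
/// mark special encodings (zero, subnormals, inf, NaN) and are not decoded.
fn decode_normal(bits: u64) -> Option<f64> {
    let sign = (bits >> 63) & 1;
    let exponent = ((bits >> 52) & 0x7ff) as i32;
    let mantissa = bits & 0x000f_ffff_ffff_ffff;
    if exponent == 0 || exponent == 0x7ff {
        return None;
    }
    // Normal numbers carry an implicit leading 1 in front of the 52 stored bits.
    let fraction = 1.0 + (mantissa as f64) / ((1u64 << 52) as f64);
    // Remove the bias of 1023 to get the true power of two.
    let magnitude = fraction * 2.0_f64.powi(exponent - 1023);
    Some(if sign == 1 { -magnitude } else { magnitude })
}

fn main() {
    for value in [1.0_f64, 0.1, 1e20, -2.5] {
        println!("{value} decodes to {:?}", decode_normal(value.to_bits()));
    }
}
```

Every step in the sketch is exact for normal inputs, so the decoded value should match the original bit for bit, including the slightly-off 0.1.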

The 64 bits of an IEEE 754 double, split into sign, exponent, and mantissa fields.

Kahan's committee knew about that leak and decided to live with it. The standard buys you something in return. Each basic operation (add, subtract, multiply, divide, square root) is required to return the closest tick to the true answer, as if the result had been computed exactly and then rounded once. That guarantee is what let Boeing trust the same calculation across vendors. It does not mean the answer is right. It means each operation is wrong by a known, bounded amount: at most half the gap between two neighboring ticks.
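The size of that bound depends on where the dial is hovering. As a rough sketch, the gap to the next tick can be measured by adding one to the raw bits, which for a positive finite f64 lands on the adjacent representable value. The helper name gap_above is an assumption of this example, not a standard library method.

```rust
/// Distance from a positive, finite f64 to the next tick above it.
/// Hypothetical helper: for positive finite inputs, incrementing the raw
/// bit pattern by one yields the adjacent representable value.
fn gap_above(x: f64) -> f64 {
    assert!(x.is_finite() && x > 0.0, "sketch only handles positive finite values");
    f64::from_bits(x.to_bits() + 1) - x
}

fn main() {
    println!("gap above 1.0  = {:e}", gap_above(1.0));
    println!("gap above 1e20 = {}", gap_above(1e20));

    // How many ticks separate the computed sum from the literal 0.3?
    let ticks_apart = (0.1_f64 + 0.2).to_bits() - 0.3_f64.to_bits();
    println!("0.1 + 0.2 sits {ticks_apart} tick(s) above 0.3");
}
```

The gap above 1.0 is the constant Rust exposes as f64::EPSILON. Near 1e20 the ticks are thousands apart, which is the kitchen scale's "spoonful instead of a grain" in numeric form.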

The standard also carved out two trapdoors for when the dial runs out of room. Divide a nonzero float by zero and you do not crash. You get a value called infinity that Rust prints as inf. (Integer division by zero is a different story: Rust panics.) Subtract infinity from infinity, or divide zero by zero, and you get NaN, which stands for "not a number." NaN is the trickiest value in the language because it is the one float that is not equal to itself. The committee made that choice on purpose. If a NaN compared equal to anything, including another NaN, then a single bad division upstream would silently flow into downstream if statements as a normal value and corrupt every result. By making NaN refuse all comparisons, IEEE 754 forces your code to notice. Watch.

fn show_nan() {
    let nan = 0.0_f64 / 0.0_f64;
    let inf = 1.0_f64 / 0.0_f64;
    let other_nan = inf - inf;
    println!("nan == nan?      {}", nan == other_nan);
    println!("nan.is_nan()?    {}", nan.is_nan());
    println!("1.0 / 0.0 =      {inf}");
    println!("0.0 / 0.0 =      {nan}");
    println!("inf - inf =      {other_nan}");
}

The output shows it clearly — nan == nan is false, but nan.is_nan() is true. The lesson is to never test a float for "is it a NaN?" with ==. Always use the .is_nan() method.
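One way to put that lesson to work is to check a result before it travels any further. The sketch below wraps division in a function that returns None when the quotient is NaN or infinite. The name checked_div is invented for this example (f64 has no such method); is_finite and is_nan are the real standard library calls.

```rust
/// Divides two floats, returning None when the result is NaN or infinite.
/// `checked_div` is a made-up name for this sketch, not a method on f64.
fn checked_div(a: f64, b: f64) -> Option<f64> {
    let quotient = a / b;
    if quotient.is_finite() {
        Some(quotient)
    } else {
        None
    }
}

fn main() {
    let nan = 0.0_f64 / 0.0_f64;
    // Ordered comparisons involving NaN are all false; only != is true.
    println!("nan < 1.0   : {}", nan < 1.0);
    println!("nan >= 1.0  : {}", nan >= 1.0);
    println!("nan != nan  : {}", nan != nan);

    println!("6.0 / 3.0 -> {:?}", checked_div(6.0, 3.0));
    println!("0.0 / 0.0 -> {:?}", checked_div(0.0, 0.0));
    println!("1.0 / 0.0 -> {:?}", checked_div(1.0, 0.0));
}
```

Returning an Option moves the "did this go wrong?" question into the type system, so a caller cannot forget to handle the bad case the way it can forget an is_nan() check.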

Now the part that actually breaks programs. Go back to the kitchen scale. Suppose you want to weigh a sack of flour and then add a single grain of salt to the pan. The dial reads 10 million grains for the flour. You drop a salt grain on top. The needle does not move. The grain is real, but it is smaller than the gap between two ticks at the 10-million mark, so the scale rounds it away. Do that a million times and a million grains have vanished. This is exactly what happens when you add many small floats to a running total that has already grown large. The IEEE 754 rule about rounding to the nearest tick is honest about each individual addition. It is not honest about the sum.
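The vanishing grain can be checked directly. This small sketch, with helper names invented for the example, asks whether adding a grain to a total leaves the total unchanged, once in f32 and once in f64. It assumes the 10-million figure from the paragraph above.

```rust
/// True when adding `grain` to `total` changes nothing in f32.
/// Hypothetical helper for this illustration.
fn absorbed_f32(total: f32, grain: f32) -> bool {
    total + grain == total
}

/// Same check in f64, where ticks at this magnitude are far finer.
fn absorbed_f64(total: f64, grain: f64) -> bool {
    total + grain == total
}

fn main() {
    // f32 ticks near 10 million are 1.0 apart, so 0.1 rounds away.
    println!("f32 swallows the grain? {}", absorbed_f32(1.0e7, 0.1));
    // f64 ticks there are roughly two billionths apart, so 0.1 registers.
    println!("f64 swallows the grain? {}", absorbed_f64(1.0e7, 0.1));

    let next_tick = f32::from_bits(1.0e7_f32.to_bits() + 1);
    println!("f32 gap at 1e7 = {}", next_tick - 1.0e7_f32);
}
```

The same effect exists in f64, only at much larger totals or much smaller grains.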

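The program below demonstrates it. It uses f32 instead of f64 because f32 has only 23 mantissa bits, so its ticks at the 10-million mark are a full 1.0 apart; in f64 the ticks there are about two billionths apart and a 0.1 would register. We push the number 10 million into a vector, then push the number 0.1 one million times. The true total is 10 million plus 100,000, which is 10,100,000. The naive loop just adds them up. The Kahan loop does a trick. Watch the difference.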

fn naive_sum(values: &[f32]) -> f32 {
    let mut total = 0.0_f32;
    for v in values {
        total += v;
    }
    total
}

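fn kahan_sum(values: &[f32]) -> f32 {
    let mut total = 0.0_f32;
    // The notepad: rounding error recorded from the previous addition.
    let mut leftover = 0.0_f32;
    for v in values {
        // Correct the incoming value by the error recorded last time.
        let adjusted = v - leftover;
        // This addition may round away the low-order part of `adjusted`.
        let next = total + adjusted;
        // (next - total) is what actually got added; subtracting `adjusted`
        // leaves the error, which is negative when grains were dropped.
        leftover = (next - total) - adjusted;
        total = next;
    }
    total
}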

fn show_sums() {
    let mut numbers: Vec<f32> = Vec::with_capacity(1_000_001);
    numbers.push(1.0e7);
    for _ in 0..1_000_000 {
        numbers.push(0.1);
    }
    let truth: f64 = 1.0e7_f64 + 1_000_000.0 * 0.1_f64;
    let naive = naive_sum(&numbers);
    let kahan = kahan_sum(&numbers);
    println!("true total       = {truth}");
    println!("naive total      = {naive}");
    println!("naive lost       = {}", truth - naive as f64);
    println!("kahan total      = {kahan}");
    println!("kahan lost       = {}", truth - kahan as f64);
}

The Kahan trick is to keep a tiny notepad called leftover next to the scale. Each time you add a number, you measure how much of it the scale rounded away, write that amount on the notepad, and subtract the notepad's value from the next number before you add it. The notepad rescues the grains the dial would have lost. Kahan published this in a 1965 paper while he was a young numerical analyst at the University of Toronto. He had been called in to debug a Fortran weather model that drifted further from reality the longer it ran, and the four extra lines you see in kahan_sum were what stopped the drift. The paper is one page long.
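To watch the notepad fill up and empty, the sketch below runs the same four lines for a handful of steps and records the total and the leftover after each one. The function kahan_trace is a hypothetical helper written for this illustration; it starts from a total that is already large so the effect shows immediately.

```rust
/// Runs the compensated loop for `steps` additions of `grain`, recording
/// (total, leftover) after each one. Hypothetical helper for illustration.
fn kahan_trace(start: f32, grain: f32, steps: usize) -> Vec<(f32, f32)> {
    let mut total = start;
    let mut leftover = 0.0_f32;
    let mut trace = Vec::with_capacity(steps);
    for _ in 0..steps {
        let adjusted = grain - leftover;
        let next = total + adjusted;
        // Negative leftover means part of `adjusted` was rounded away.
        leftover = (next - total) - adjusted;
        total = next;
        trace.push((total, leftover));
    }
    trace
}

fn main() {
    let trace = kahan_trace(1.0e7, 0.1, 10);
    for (step, (total, leftover)) in trace.into_iter().enumerate() {
        println!("step {:2}: total = {total}, leftover = {leftover}", step + 1);
    }
}
```

In the trace, the leftover grows more negative while the total sits still, then flips sign on the step where the accumulated grains finally push the needle to the next tick.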

Run the program and the dial truth comes out — look at the last block of the output above. The naive sum lands on exactly 10 million. A million tiny additions added up to zero, because each 0.1 was smaller than the gap between ticks at the 10-million mark and got rounded away on contact. The Kahan sum lands on 10,100,000 — the true answer — because the notepad caught every grain. Same data, same loop, same float type. The only difference is four extra lines of arithmetic that respect the dial's ticks instead of fighting them.
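Kahan's notepad is not the only way to rescue the grains. A simpler alternative, and a different technique from Kahan summation, is to keep the running total on a finer dial and round to f32 only once at the end. The sketch below assumes f64 is available and fast enough, which is usually true on desktop hardware; wide_sum is a name invented for this example.

```rust
/// Sums f32 values in an f64 accumulator and rounds once at the end.
/// This is a wider-accumulator approach, not Kahan summation.
/// `wide_sum` is a hypothetical name for this sketch.
fn wide_sum(values: &[f32]) -> f32 {
    let mut total = 0.0_f64;
    for &v in values {
        // Widening f32 to f64 is exact, so nothing is lost in the conversion.
        total += f64::from(v);
    }
    total as f32
}

fn main() {
    // Same data as the lesson's example: 10 million, then a million 0.1s.
    let mut numbers = vec![1.0e7_f32];
    numbers.extend(std::iter::repeat(0.1_f32).take(1_000_000));
    println!("wide total = {}", wide_sum(&numbers));
}
```

This only postpones the problem: an f64 total has ticks of its own, and a long enough list of small enough numbers will hit them. Kahan's compensation still applies when the widest available type is already in use.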

The question worth asking is which one you should use in real code. A reasonable rule of thumb is naive whenever the list is short and the inputs are roughly the same size, Kahan whenever you are summing a long list of small numbers into a growing total. Game physics engines, machine-learning training loops, scientific simulations: anywhere a million tiny corrections feed into one big number is a candidate for Kahan or one of its descendants. Your bank's interest calculation typically is not, because financial code tends to store money in integer cents to sidestep the dial entirely.
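The integer-cents idea fits in a few lines. This sketch adds ten dimes two ways: as f64 dollars and as i64 cents. Both function names are invented for the example, and real financial code would also need to settle rounding rules for interest and division, which are out of scope here.

```rust
/// Adds `count` dimes as floating-point dollars (0.1 each).
/// Hypothetical helper for this illustration.
fn dimes_as_dollars(count: u32) -> f64 {
    let mut dollars = 0.0_f64;
    for _ in 0..count {
        dollars += 0.1;
    }
    dollars
}

/// Adds `count` dimes as integer cents (10 each); integer addition is exact.
fn dimes_as_cents(count: u32) -> i64 {
    let mut cents = 0_i64;
    for _ in 0..count {
        cents += 10;
    }
    cents
}

fn main() {
    println!("float dollars == 1.0? {}", dimes_as_dollars(10) == 1.0);
    println!("integer cents == 100? {}", dimes_as_cents(10) == 100);
}
```

Integers have no ticks between whole values, so as long as every amount is a whole number of cents and the total stays within the integer's range, nothing rounds.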

Next lesson — how Rust represents the other primitive you use without thinking about it, the text you keep typing in those println! calls, and why the language gives you two completely different types for it.