18.1. Appendix A: Floating-Point Arithmetic#
Modern computers perform calculations with incredible speed, but they do so with finite resources. This appendix explores how computers represent integers and real numbers using a fixed number of binary digits (bits). Understanding this foundation is crucial for anyone doing numerical work, as it explains many common sources of error and surprising behavior in scientific computing.
18.1.1. How Computers Represent Integers#
At the lowest level, all data in a computer is stored in binary, a base-2 number system using only the digits 0 and 1. For example, the binary number \(1011001_2\) corresponds to the decimal number:
\[
1\cdot 2^6 + 0\cdot 2^5 + 1\cdot 2^4 + 1\cdot 2^3 + 0\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 = 64 + 16 + 8 + 1 = 89.
\]
In Julia, you can specify a number in binary using the 0b prefix:
# The `0b` prefix tells Julia to interpret the number as binary.
Int(0b1011001)
89
18.1.1.1. Unsigned Integers#
Unsigned integers (like Julia’s UInt8, UInt16, etc.) use all their bits to represent the magnitude of a positive number. We can view the raw binary representation using the bitstring function.
# A UInt8 uses 8 bits. The number 89 is padded with a leading zero.
bitstring(UInt8(89))
"01011001"
For a UInt64, which uses 64 bits, the largest possible number is one where all bits are 1. This corresponds to the value \(2^{64}-1\).
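For example, we can check this with typemax, using BigInt arithmetic so that \(2^{64}\) does not overflow in the comparison:
# The all-ones pattern of a UInt8 is 2^8 - 1 = 255.
bitstring(typemax(UInt8))                 # "11111111"
# For UInt64, compare against 2^64 - 1 using BigInt to avoid overflow.
big(typemax(UInt64)) == big(2)^64 - 1     # true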
18.1.1.2. Signed Integers and Two’s Complement#
To represent negative numbers, computers most commonly use the two’s complement format. In this system, the most significant (leftmost) bit acts as a sign indicator: if it’s 0, the number is positive; if it’s 1, the number is negative.
The negative of a number \(x\) is defined as the value \(y\) such that \(x+y = 2^n\) for an \(n\)-bit integer. For example, with 8 bits, the positive number 89 is 01011001. Its two’s complement is 10100111, because:
\[
01011001_2 + 10100111_2 = 100000000_2 = 2^8.
\]
Therefore, the bit pattern 10100111 represents \(-89\). A major advantage of this system is that the same hardware can be used for both addition and subtraction.
(A quick way to find the two’s complement is to invert all the bits and add one.)
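We can check this shortcut with the bitwise NOT operator ~:
# Invert every bit of 89 and add one; the result is -89 in two's complement.
~Int8(89) + Int8(1)                                     # -89
bitstring(~Int8(89) + Int8(1)) == bitstring(Int8(-89))  # true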
# The bitstring for a positive Int8. The first bit is 0.
bitstring(Int8(89))
"01011001"
# The bitstring for a negative Int8. The first bit is 1.
bitstring(Int8(-89))
"10100111"
Using this format, an Int64 can represent numbers from \(-2^{63}\) to \(2^{63}-1\).
# Verify the minimum and maximum values for a 64-bit signed integer.
[typemin(Int64) typemax(Int64); -2^63 2^63-1]
2×2 Matrix{Int64}:
-9223372036854775808 9223372036854775807
-9223372036854775808 9223372036854775807
18.1.1.3. Fixed-Point Numbers#
One way to represent fractional numbers is to fix the position of the binary point. For example, with three bits reserved for the fraction,
\[
101.101_2 = 4 + 1 + \tfrac{1}{2} + \tfrac{1}{8} = 5.625.
\]
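As a small sketch, a fixed-point value with three fractional bits can be stored as a plain integer and recovered by dividing by \(2^3\):
# Fixed-point sketch with 3 fractional bits: (101.101)_2 is stored as the
# integer (101101)_2 = 45, and the true value is that integer divided by 2^3.
stored = 0b101101      # 45
value  = stored / 2^3  # 45 / 8 = 5.625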
While simple, this fixed-point representation has a major drawback for scientific use: it can’t efficiently represent both very large and very small numbers simultaneously. To solve this, computers use a floating-point representation.
18.1.2. Floating-Point Numbers: Precision for Range#
Floating-point numbers trade a fixed amount of precision for a vastly larger range of representable magnitudes, much like scientific notation.
18.1.2.1. Analogy to Scientific Notation#
Recall that a number in scientific notation has three parts: a sign, a significand (the digits), and an exponent. For example, \(-6.02\times 10^{23}\) has sign \(-\), significand \(6.02\), and exponent \(23\).
Floating-point numbers are represented in a similar way, as
\[
\pm \left(d_0 . d_1 d_2 \cdots d_{p-1}\right)_\beta \times \beta^e, \qquad 0 \le d_i \le \beta - 1,
\]
with base \(\beta\) and precision \(p\). The number is normalized if \(d_0\ne0\) (a special case is used to represent \(0\)).
18.1.2.2. Properties of Floating-Point Numbers#
This representation has some non-intuitive consequences:
Uneven Spacing: The gaps between representable numbers are not uniform. They are smallest near zero and grow larger as the magnitude of the numbers increases.
Relative Error: For any real number \(x\), there’s a nearby floating-point number \(x'\) such that the error relative to \(x\) is small: \(|x-x'| \le \epsilon_\mathrm{machine} |x|\). This \(\epsilon_\mathrm{machine}\) is a fundamental constant for a given floating-point type.
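As a quick sanity check (using BigFloat as a stand-in for the exact value), the relative error of rounding \(\pi\) to Float64 stays below \(\epsilon_\mathrm{machine}\):
# The relative error of rounding π to Float64 is (well) below machine epsilon.
exact   = big(pi)        # high-precision stand-in for the exact value
rounded = Float64(pi)    # nearest double-precision value
abs(exact - rounded) / abs(exact) <= eps(Float64)   # true (in fact it is ≤ eps/2)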
For example, the number line below shows all the representable numbers for the case \(\beta=2, p=3, e_\mathrm{min}=-1, e_\mathrm{max}=2\).
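The positive numbers of this toy system are easy to enumerate directly; here is a small sketch that loops over the allowed digits and exponents (normalized numbers only, so \(d_0=1\)):
# Enumerate every positive normalized number (1.d1 d2)_2 × 2^e for e = -1, ..., 2.
vals = Float64[]
for e in -1:2, d1 in 0:1, d2 in 0:1
    push!(vals, (1 + d1/2 + d2/4) * 2.0^e)
end
sort!(vals)
# Gives 0.5, 0.625, 0.75, ..., 1.75, 2.0, 2.5, ..., 4.0, 5.0, 6.0, 7.0:
# the gap between neighbors doubles each time the exponent increases.
println(vals)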

18.1.2.3. The IEEE 754 Standard#
Modern computers follow the IEEE 754 standard for floating-point arithmetic. This standard defines the layout of the bits, how to handle special values, and the rules for rounding.
A standard single-precision number (Float32 in Julia) uses 32 bits, allocated as:
1 sign bit (S): 0 for positive, 1 for negative.
8 exponent bits (E): stores the exponent in a biased format.
23 significand bits (M, also called the mantissa): stores the fractional part of the number.
The value of a normalized number is given by:
\[
(-1)^S \times (1.M)_2 \times 2^{E-127}.
\]
Notice the 1. in (1.M). This is a clever optimization: since the first digit of a normalized binary number is always 1, it doesn’t need to be stored. This is called the hidden bit and it gives us an extra bit of precision for free!
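As a sanity check of this formula, the following sketch (normalized values only; zeros, subnormals, Inf, and NaN follow the special rules below) extracts the three fields of a Float32 with reinterpret and rebuilds the value:
# Decode a normalized Float32 by hand: extract S, E, and M, then apply the formula.
x32 = 6.5f0
bits = reinterpret(UInt32, x32)
S = (bits >> 31) & 0x1          # 1 sign bit
E = (bits >> 23) & 0xff         # 8 exponent bits, biased by 127
M = bits & 0x007fffff           # 23 significand bits (the leading 1 is hidden)
value = (-1.0)^S * (1 + M / 2^23) * 2.0^(Int(E) - 127)
value == x32                    # true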
The standard also defines bit patterns for special quantities:
All-zero exponent, zero significand: signed zero (\(\pm 0\)).
All-zero exponent, nonzero significand: denormalized (subnormal) numbers.
All-one exponent, zero significand: \(\pm\infty\).
All-one exponent, nonzero significand: NaN (Not a Number).
Double precision (Float64 in Julia) works the same way but uses 64 bits total (1 sign, 11 exponent, 52 significand), offering much greater precision and range.
18.1.3. Floating-Point Demo#
18.1.3.1. Comparing Floating-Point Numbers#
Because of tiny representation errors, you should never use == to check if two floating-point numbers are equal.
# Mathematically, this should be exactly 5/3.
x = (1 - 2/3) * 5
# But due to small binary representation errors, it's not.
x == 5/3
false
Instead, check if the numbers are “close enough” by testing if their absolute difference is smaller than a small tolerance.
# The error is tiny, but non-zero.
abs(x - 5/3)
2.220446049250313e-16
Julia provides the function isapprox (and the convenient operator ≈, typed \approx + Tab) which does this correctly by checking both relative and absolute tolerances.
# This is the correct way to compare floats for approximate equality.
x ≈ 5/3
true
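One caveat: isapprox uses a relative tolerance by default, so comparing a tiny number against exactly 0.0 returns false unless you also pass an absolute tolerance atol.
# A relative tolerance is meaningless when the reference value is 0.0,
# so pass an absolute tolerance when comparing against zero.
isapprox(1e-12, 0.0)              # false
isapprox(1e-12, 0.0; atol=1e-8)   # true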
18.1.3.2. Overflow and Underflow#
Overflow occurs when a calculation results in a number larger than the maximum representable value, which becomes Inf (infinity). Underflow occurs when a number is too small (too close to zero) to be represented, which becomes 0.0.
# The largest representable Float64 is about 1.8e308, so 1e308 is near the top of the range.
1e308
1.0e308
# Multiplying it by 2 causes an overflow.
2 * 1e308
Inf
# The smallest positive normalized Float64 is about 2.2e-308, so 1e-308 is near the bottom of the range.
1e-308
1.0e-308
# Dividing further by powers of two first gives denormalized (subnormal) values, then underflows to zero.
smallest = 1e-308
smallest / 2^51 # Still representable as a denormalized number
5.0e-324
smallest / 2^52 # Too small, underflows to zero.
0.0
18.1.3.3. Catastrophic Cancellation#
Subtracting two nearly-equal numbers can cause a massive loss of relative precision. The leading, most significant digits cancel out, leaving only the noisy, least significant digits.
# Two random numbers and their difference.
x = rand()
y = rand()
z = x - y
-0.44964093315542963
# Add a large number to both x and y. They are now nearly equal.
x1 = x + 1e12
y1 = y + 1e12
# Their difference should still be z, but precision was already lost when 1e12 was added.
z1 = x1 - y1
-0.4495849609375
# The new result `z1` differs from the true result `z`.
z1 - z
5.597221792963403e-5
18.1.3.4. Machine Epsilon#
Machine epsilon is the distance between 1.0 and the next larger representable floating-point number. It sets the smallest relative change that a floating-point type can register.
# For Float64, epsilon is ~2.2e-16. Anything smaller added to 1.0 is lost.
eps()
2.220446049250313e-16
# Adding a number smaller than epsilon has no effect.
1.0 + 1e-17 == 1.0
true
# We can calculate epsilon by finding the smallest power of 2 that `1.0` can resolve.
e = 1.0
while 1.0 + e > 1.0
    e = e / 2
end
e * 2 # The last successful value
2.220446049250313e-16
# The gap between numbers is relative to their magnitude.
# `eps(x)` gives the gap at `x`.
eps(1.0)
2.220446049250313e-16
eps(2.0^100)
2.81474976710656e14
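Equivalently, eps(x) is the gap between x and the next representable number above it, which nextfloat confirms:
# For these values, eps(x) equals the distance from x up to nextfloat(x).
nextfloat(1.0) - 1.0 == eps(1.0)                # true
nextfloat(2.0^100) - 2.0^100 == eps(2.0^100)    # true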
18.1.3.5. Special Values: ±0, Inf, and NaN#
The IEEE 754 standard includes several special quantities to handle edge cases gracefully.
18.1.3.5.1. Signed Zeros#
There are distinct positive (+0.0) and negative (-0.0) zeros. They compare as equal but can produce different results in some calculations.
bitstring(0.0)
"0000000000000000000000000000000000000000000000000000000000000000"
bitstring(-0.0)
"1000000000000000000000000000000000000000000000000000000000000000"
# The sign of zero can matter!
1.0 / 0.0
Inf
1.0 / -0.0
-Inf
18.1.3.5.2. Infinity (Inf)#
Infinity is the result of overflow or of dividing a nonzero number by zero.
10.0^10.0^10.0
Inf
1 / Inf
0.0
Inf + Inf
Inf
18.1.3.5.3. Not-a-Number (NaN)#
NaN is the result of undefined operations, such as 0/0 or Inf - Inf. Any operation involving NaN results in NaN.
0.0 / 0.0
NaN
Inf - Inf
NaN
# NaN is "contagious".
NaN + 123
NaN
A unique property of NaN is that it is not equal to anything, including itself. Therefore, you must use the isnan() function to check for it.
# This is a defining feature of NaN!
NaN == NaN
false
# Use the `isnan` function to test for NaN.
isnan(NaN)
true
18.1.3.6. Rounding Behavior#
IEEE 754 specifies a round-to-nearest, ties-to-even rule. If a number is exactly halfway between two representable values, it is rounded to the one whose last bit is zero (the “even” one). This avoids the statistical bias of always rounding .5 up.
e = eps()/2 # This is exactly half the gap after 1.0
# `1.0 + e` is halfway between 1.0 and `1.0 + 2e`. It rounds down to 1.0 (even mantissa).
1.0 + e
1.0
# `1.0 + 3e` is halfway between `1.0 + 2e` and `1.0 + 4e`.
# `1.0 + 2e` has an odd mantissa.
# `1.0 + 4e` has an even mantissa, so it rounds up.
1.0 + 3*e
1.0000000000000004
# We can see the pattern: 1, 3, 5... round up, while 0, 2, 4... round down.
println("Multiple | Result")
println("-----------------")
for mul in 0:10
    result = ((1.0 + mul * e) - 1.0) / e
    println(rpad(mul, 8), " | ", result)
end
Multiple | Result
-----------------
0 | 0.0
1 | 0.0
2 | 2.0
3 | 4.0
4 | 4.0
5 | 4.0
6 | 6.0
7 | 8.0
8 | 8.0
9 | 8.0
10 | 10.0
18.1.3.7. Viewing Bit-Level Representations#
This helper function lets us inspect the bit patterns of Float32 numbers to see these rules in action.
using Printf
# A helper function to format a 32-bit string into Sign | Exponent | Mantissa.
split32(s) = s[1] * " " * s[2:9] * " " * s[10:32]
# A function to print a number and its Float32 bit pattern.
showbits(x) = @printf("%12.8g = %s\n", x, split32(bitstring(Float32(x))))
println("--- Special Values ---")
showbits.([0.0, -0.0, Inf, -Inf, NaN]);
println("\n--- Integers ---")
showbits.(1:5);
println("\n--- Numbers Just Above 1.0 ---")
showbits.(1 .+ (0:5).*2^-23);
--- Special Values ---
0 = 0 00000000 00000000000000000000000
-0 = 1 00000000 00000000000000000000000
Inf = 0 11111111 00000000000000000000000
-Inf = 1 11111111 00000000000000000000000
NaN = 0 11111111 10000000000000000000000
--- Integers ---
1 = 0 01111111 00000000000000000000000
2 = 0 10000000 00000000000000000000000
3 = 0 10000000 10000000000000000000000
4 = 0 10000001 00000000000000000000000
5 = 0 10000001 01000000000000000000000
--- Numbers Just Above 1.0 ---
1 = 0 01111111 00000000000000000000000
1.0000001 = 0 01111111 00000000000000000000001
1.0000002 = 0 01111111 00000000000000000000010
1.0000004 = 0 01111111 00000000000000000000011
1.0000005 = 0 01111111 00000000000000000000100
1.0000006 = 0 01111111 00000000000000000000101