Floating Point Arithmetic

18.1. Floating Point Arithmetic#

18.1.1. Binary numbers#

Integers in a computer are stored as binary numbers, that is, using base-2. For example, the binary number \(1011001_2\) corresponds to the decimal number

\[\begin{split} \begin{align*} 1011001_2 &= 1\times 2^6 + 0\times 2^5 + 1\times 2^4 + 1\times 2^3 + 0\times 2^2 + 0\times 2^1 + 1 \times 2^0 \\ &= 2^6 + 2^4 + 2^3 + 2^0 = 89_{10} \end{align*} \end{split}\]

or, in Julia:

Int(0b1011001)

Unsigned integers can be represented in Julia using the types UInt8, UInt16, etc. The actual binary representation in the computer can be seen as a bitstring:

bitstring(UInt8(89))

"01011001"

So, for example the UInt64 type can represent numbers between \(0\) and \(1111\ldots 1111_2 = 2^{63} + 2^{62} + \cdots + 2^1 + 2^0 = 2^{64} - 1\)

18.1.1.1. Signed numbers#

There are various ways of representing negative numbers, for example by using one of the bits as a “sign bit” (0 meaning positive, 1 meaning negative). However, the so-called two’s complement format is widely used to store signed integers. When the most significant bit (representing the highest power) is equal to 1, the number is negative. For \(n\) bits, the two’s complement is defined by the sum of the number and its two’s complement equals \(2^n\). For example, with 8 bits, the two’s complement of \(01011001_2\) is \(10100111_2\), since

\[ 01011001_2 + 10100111_2 = 100000000_2 = 2^8 = 256_{10} \]

This means that with signed integers, the number \(10100111_2\) represents the negative of \(01011001_2\), that is, \(-89_{10}\).

In Julia, signed integers are represented using the types Int8, Int16, etc:

bitstring(Int8(89))

"01011001"

bitstring(Int8(-89))

"10100111"

Using this format, the type Int64 can for example represent numbers between \(11111\ldots1111_2 = -2^{63}\) and \(01111\ldots1111_2 = 2^{63} - 1\).

[typemin(Int64) typemax(Int64); -2^63 2^63-1]

2×2 Matrix{Int64}:
 -9223372036854775808  9223372036854775807
 -9223372036854775808  9223372036854775807

18.1.1.2. Fixed point numbers#

One way to represent fractional numbers is as fixed-points. For example,

\[ 101.1011_2 = 2^2 + 2^0 + 2^{-1} + 2^{-3} + 2^{-4} = 5.6875_{10} \]

This could be useful for applications that need a fixed number of binary digits. But since most scientific applications need numbers with wide ranges of magnitudes, computers use floating point instead of fixed point to represent fractional numbers.

18.1.2. Floating point numbers#

18.1.2.1. Scientific notation#

Recall how a (decimal) number is written using scientific notation:

\[ {\LARGE \begin{align*} \underbrace{-}_{\large \text{sign}} \underbrace{1.602\rule[-2pt]{0pt}{0pt}}_{\large \text{significand}} \times \underbrace{10\rule[-2pt]{0pt}{0pt}}_{\large \text{base}} \underbrace{\,^{-19}\rule[-2pt]{0pt}{0pt}}_{\large \text{exponent}} \end{align*} } \]

Floating point numbers are represented in a similar way, as

\[ \begin{align*} \pm \left( d_0 + d_1\beta^{-1} + \ldots + d_{p-1} \beta^{-(p-1)} \right) \beta^e,\quad 0\le d_i<\beta \end{align*} \]

with base \(\beta\) and precision \(p\). The number is normalized if \(d_0\ne0\) (use a special case to represent \(0\)).

18.1.2.2. Properties of floating point numbers#

The gaps between adjacent numbers scale with the size of the numbers
For all \(x\), there exists a floating point \(x'\) such that \(|x-x'| \le \epsilon_\mathrm{machine} |x|\)
Example: \(\beta=2, p=3, e_\mathrm{min}=-1, e_\mathrm{max}=2\)

floating_point_number_line

18.1.2.3. Special quantities#

\(\pm\infty\) is returned when an operation overflows
\(x/\pm\infty=0\) for any number \(x\), \(x/0=\pm\infty\) for any nonzero number \(x\)
Operations with infinity are defined as limits, e.g. \( 4-\infty = \lim_{x\rightarrow\infty} 4-x = -\infty \)
NaN (Not a Number) is returned when the an operation has no well-defined finite or infinite result
Examples: \(\infty-\infty\), \(\infty/\infty\), \(0/0\), \(\sqrt{-1}\), \(\mathrm{NaN}\odot x\)

18.1.2.4. IEEE Single Precision#

1 sign bit, 8 exponent bits, 23 significand bits:

\[\begin{split} \begin{array}{c|c|c} \text{x} & \text{xxxxxxxx} & \text{xxxxxxxxxxxxxxxxxxxxxxx} \\ \hline S & E & C \end{array} \end{split}\]

Represented number:

\[ {\LARGE \begin{align*} (-1)^S \times 1.M \times 2^{E-127} \end{align*} } \]

Special cases:

\[\begin{split} \begin{array}{c|c|c|c} & E=0 & 0<E<255 & E=255 \\ \hline M=0 & \pm0 & \text{Powers of 2} & \pm\infty \\ \hline M\ne0 & \text{Denormalized} & \text{Ordinary numbers} & \text{NaN} \end{array} \end{split}\]

Examples:

\[\begin{split} \begin{array}{c|c|c|l} S & E & M & \text{Quantity} \\ \hline 0 & 11111111 & 00000100000000000000000 & \text{NaN} \\ \hline 1 & 11111111 & 00100010001001010101010 & \text{NaN} \\ \hline 0 & 11111111 & 00000000000000000000000 & \infty \\ \hline 0 & 10000001 & 10100000000000000000000 & +1 \cdot 2^{129-127} \cdot 1.101 = 6.5 \\ \hline 0 & 10000000 & 00000000000000000000000 & +1 \cdot 2^{128-127} \cdot 1.0 = 2 \\ \hline 0 & 00000001 & 00000000000000000000000 & +1 \cdot 2^{1-127} \cdot 1.0 = 2^{-126} \\ \hline 0 & 00000000 & 10000000000000000000000 & +1 \cdot 2^{-126} \cdot 0.1 = 2^{-127} \\ \hline 0 & 00000000 & 00000000000000000000001 & +1 \cdot 2^{-126} \cdot 2^{-23} = 2^{-149} \\ \hline 0 & 00000000 & 00000000000000000000000 & 0 \\ \hline 1 & 00000000 & 00000000000000000000000 & -0 \\ \hline 1 & 10000001 & 10100000000000000000000 & -1 \cdot 2^{129-127} \cdot 1.101 = -6.5 \\ \hline 1 & 11111111 & 00000000000000000000000 & -\infty \end{array} \end{split}\]

18.1.2.5. IEEE Floating Point Data Types#

\[\begin{split} \begin{array}{l|l|l} & \text{Single precision} & \text{Double precision} \\ \hline \text{Significand size} (p) & \text{24 bits} & \text{53 bits} \\ \hline \text{Exponent size} & \text{8 bits} & \text{11 bits} \\ \hline \text{Total size} & \text{32 bits} & \text{64 bits} \\ \hline e_\mathrm{max} & +127 & +1023 \\ \hline e_\mathrm{min} & -126 & -1022 \\ \hline \text{Smallest normalized} & 2^{-126}\approx 10^{-38} & 2^{-1022}\approx 10^{-308} \\ \hline \text{Largest normalized} & 2^{127}\approx 10^{38} & 2^{1023}\approx 10^{308} \\ \hline \epsilon_\mathrm{machine} & 2^{-24}\approx 6\cdots 10^{-8} & 2^{-53} \approx 10^{-16} \end{array} \end{split}\]

18.1.3. IEEE Floating point demo#

18.1.3.1. Comparing floating point numbers approximately#

Always expect small relative errors in floating point computations:

x = (1 - 2/3) * 5     # Approximately 5/3
x == 5/3              # Not exactly 5/3

false

To check for equality, instead compute the error and check that it is small:

abs(x - 5/3)

2.220446049250313e-16

Alternatively, use the julia isapprox or ≈ method, which uses a combination of an appropriate relative and absolute tolerance:

x ≈ 5/3

true

18.1.3.2. Overflow/underflow#

1.0

1.0

-1.0

-1.0

1e100

1.0e100

1e-100

1.0e-100

1e308

1.0e308

1e308 * 2

Inf

1e-308

1.0e-308

1e-308 / 2^51

5.0e-324

1e-308 / 2^52

0.0

18.1.3.3. Cancellation#

x = rand()
y = rand()
z = x - y

0.2504304619588281

x1 = x + 1e10
y1 = y + 1e10
z1 = x1 - y1

0.2504310607910156

z1 - z

5.988321875038949e-7

18.1.3.4. Epsilon#

1 + 1e-20

1.0

(1 + 1e-20) - 1

0.0

1 + 1e-16

1.0

1 + 2e-16

1.0000000000000002

(1 + 2e-16) - 1

2.220446049250313e-16

e = 1.0
while 1 + e > 1
    e = e/2
    println(e)
end

5
25
125
0625
03125
015625
0078125
00390625
001953125
0009765625
00048828125
000244140625
0001220703125
103515625e-5
0517578125e-5
52587890625e-5
62939453125e-6
814697265625e-6
9073486328125e-6
5367431640625e-7
76837158203125e-7
384185791015625e-7
1920928955078125e-7
960464477539063e-8
9802322387695312e-8
4901161193847656e-8
450580596923828e-9
725290298461914e-9
862645149230957e-9
313225746154785e-10
656612873077393e-10
3283064365386963e-10
1641532182693481e-10
820766091346741e-11
9103830456733704e-11
4551915228366852e-11
275957614183426e-12
637978807091713e-12
8189894035458565e-12
094947017729282e-13
547473508864641e-13
2737367544323206e-13
1368683772161603e-13
684341886080802e-14
842170943040401e-14
4210854715202004e-14
105427357601002e-15
552713678800501e-15
7763568394002505e-15
881784197001252e-16
440892098500626e-16
220446049250313e-16
1102230246251565e-16

eps()

2.220446049250313e-16

b = 2.0^100

1.2676506002282294e30

(b + eps()*b) - b

2.81474976710656e14

(b + eps()/2*b) - b

0.0

18.1.3.5. Signed zeros#

0.0

0.0

+0.0

0.0

-0.0

-0.0

18.1.3.6. Infinity#

1.0/0.0

Inf

-1.0/0.0

-Inf

0.0/0.0

NaN

Inf

Inf

1/Inf

0.0

-1/Inf

-0.0

-1/-Inf

0.0

2*Inf

Inf

Inf + Inf

Inf

Inf^Inf

Inf

18.1.3.7. NaN (Not-a-Number)#

Inf-Inf

NaN

Inf/Inf

NaN

0.0/0.0

NaN

NaN + 123

NaN

18.1.3.8. Check for NaN#

x = NaN
x == NaN

false

x == x

false

isnan.([1, 2, 3, NaN, Inf])

5-element BitVector:
 0
 0
 0
 1
 0

isinf.([1, 2, 3, NaN, Inf])

5-element BitVector:
 0
 0
 0
 0
 1

18.1.3.9. Round to even#

e = eps()/2
1 + e

1.0

1 + 2*e

1.0000000000000002

((1 + 2*e)-1) / e

2.0

((1 + 3*e)-1) / e

4.0

((1 + 4*e)-1) / e

4.0

for mul = 0:16
    println([mul, ((1 + mul*e) - 1) / e])
end

[0.0, 0.0]
[1.0, 0.0]
[2.0, 2.0]
[3.0, 4.0]
[4.0, 4.0]
[5.0, 4.0]
[6.0, 6.0]
[7.0, 8.0]
[8.0, 8.0]
[9.0, 8.0]
[10.0, 10.0]
[11.0, 12.0]
[12.0, 12.0]
[13.0, 12.0]
[14.0, 14.0]
[15.0, 16.0]
[16.0, 16.0]

18.1.3.10. View hex/bin representations#

using Printf
split32(s) = s[1] * " " * s[2:9] * " " * s[10:32]
showbits(x) = @printf("%10.8g  =  %s\n", x, split32(bitstring(Float32(x))))

showbits.([0, -0, Inf, -Inf, NaN, -NaN]);
showbits.(1:10);
showbits.(1 .+ (0:10).*2^-23);
showbits.(2 .- (10:-1:0).*2^-23);

         0  =  0 00000000 00000000000000000000000
         0  =  0 00000000 00000000000000000000000
       Inf  =  0 11111111 00000000000000000000000
      -Inf  =  1 11111111 00000000000000000000000
       NaN  =  0 11111111 10000000000000000000000
       NaN  =  1 11111111 10000000000000000000000
         1  =  0 01111111 00000000000000000000000
         2  =  0 10000000 00000000000000000000000
         3  =  0 10000000 10000000000000000000000
         4  =  0 10000001 00000000000000000000000
         5  =  0 10000001 01000000000000000000000
         6  =  0 10000001 10000000000000000000000
         7  =  0 10000001 11000000000000000000000
         8  =  0 10000010 00000000000000000000000
         9  =  0 10000010 00100000000000000000000
        10  =  0 10000010 01000000000000000000000
         1  =  0 01111111 00000000000000000000000
 1.0000001  =  0 01111111 00000000000000000000001
 1.0000002  =  0 01111111 00000000000000000000010
 1.0000004  =  0 01111111 00000000000000000000011
 1.0000005  =  0 01111111 00000000000000000000100
 1.0000006  =  0 01111111 00000000000000000000101
 1.0000007  =  0 01111111 00000000000000000000110
 1.0000008  =  0 01111111 00000000000000000000111
  1.000001  =  0 01111111 00000000000000000001000
 1.0000011  =  0 01111111 00000000000000000001001
 1.0000012  =  0 01111111 00000000000000000001010
 1.9999988  =  0 01111111 11111111111111111110110
 1.9999989  =  0 01111111 11111111111111111110111
  1.999999  =  0 01111111 11111111111111111111000
 1.9999992  =  0 01111111 11111111111111111111001
 1.9999993  =  0 01111111 11111111111111111111010
 1.9999994  =  0 01111111 11111111111111111111011
 1.9999995  =  0 01111111 11111111111111111111100
 1.9999996  =  0 01111111 11111111111111111111101
 1.9999998  =  0 01111111 11111111111111111111110
 1.9999999  =  0 01111111 11111111111111111111111
         2  =  0 10000000 00000000000000000000000