# 18.1. Floating Point Arithmetic¶

## 18.1.1. Binary numbers¶

Integers in a computer are stored as binary numbers, that is, using base-2. For example, the binary number $$1011001_2$$ corresponds to the decimal number

\begin{split} \begin{align*} 1011001_2 &= 1\times 2^6 + 0\times 2^5 + 1\times 2^4 + 1\times 2^3 + 0\times 2^2 + 0\times 2^1 + 1 \times 2^0 \\ &= 2^6 + 2^4 + 2^3 + 2^0 = 89_{10} \end{align*} \end{split}

or, in Julia:

Int(0b1011001)

89


Unsigned integers can be represented in Julia using the types UInt8, UInt16, etc. The actual binary representation in the computer can be seen as a bitstring:

bitstring(UInt8(89))

"01011001"


So, for example the UInt64 type can represent numbers between $$0$$ and $$1111\ldots 1111_2 = 2^{63} + 2^{62} + \cdots + 2^1 + 2^0 = 2^{64} - 1$$

### 18.1.1.1. Signed numbers¶

There are various ways of representing negative numbers, for example by using one of the bits as a “sign bit” (0 meaning positive, 1 meaning negative). However, the so-called two’s complement format is widely used to store signed integers. When the most significant bit (representing the highest power) is equal to 1, the number is negative. For $$n$$ bits, the two’s complement is defined by the sum of the number and its two’s complement equals $$2^n$$. For example, with 8 bits, the two’s complement of $$01011001_2$$ is $$10100111_2$$, since

$01011001_2 + 10100111_2 = 100000000_2 = 2^8 = 256_{10}$

This means that with signed integers, the number $$10100111_2$$ represents the negative of $$01011001_2$$, that is, $$-89_{10}$$.

In Julia, signed integers are represented using the types Int8, Int16, etc:

bitstring(Int8(89))

"01011001"

bitstring(Int8(-89))

"10100111"


Using this format, the type Int64 can for example represent numbers between $$11111\ldots1111_2 = -2^{63}$$ and $$01111\ldots1111_2 = 2^{63} - 1$$.

[typemin(Int64) typemax(Int64); -2^63 2^63-1]

2×2 Matrix{Int64}:
-9223372036854775808  9223372036854775807
-9223372036854775808  9223372036854775807


### 18.1.1.2. Fixed point numbers¶

One way to represent fractional numbers is as fixed-points. For example,

$101.1011_2 = 2^2 + 2^0 + 2^{-1} + 2^{-3} + 2^{-4} = 5.6875_{10}$

This could be useful for applications that need a fixed number of binary digits. But since most scientific applications need numbers with wide ranges of magnitudes, computers use floating point instead of fixed point to represent fractional numbers.

## 18.1.2. Floating point numbers¶

### 18.1.2.1. Scientific notation¶

Recall how a (decimal) number is written using scientific notation:

{\LARGE \begin{align*} \underbrace{-}_{\large \text{sign}} \underbrace{1.602\rule[-2pt]{0pt}{0pt}}_{\large \text{significand}} \times \underbrace{10\rule[-2pt]{0pt}{0pt}}_{\large \text{base}} \underbrace{\,^{-19}\rule[-2pt]{0pt}{0pt}}_{\large \text{exponent}} \end{align*} }

Floating point numbers are represented in a similar way, as

\begin{align*} \pm \left( d_0 + d_1\beta^{-1} + \ldots + d_{p-1} \beta^{-(p-1)} \right) \beta^e,\quad 0\le d_i<\beta \end{align*}

with base $$\beta$$ and precision $$p$$. The number is normalized if $$d_0\ne0$$ (use a special case to represent $$0$$).

### 18.1.2.2. Properties of floating point numbers¶

• The gaps between adjacent numbers scale with the size of the numbers

• For all $$x$$, there exists a floating point $$x'$$ such that $$|x-x'| \le \epsilon_\mathrm{machine} |x|$$

• Example: $$\beta=2, p=3, e_\mathrm{min}=-1, e_\mathrm{max}=2$$ ### 18.1.2.3. Special quantities¶

• $$\pm\infty$$ is returned when an operation overflows

• $$x/\pm\infty=0$$ for any number $$x$$, $$x/0=\pm\infty$$ for any nonzero number $$x$$

• Operations with infinity are defined as limits, e.g. $$4-\infty = \lim_{x\rightarrow\infty} 4-x = -\infty$$

• NaN (Not a Number) is returned when the an operation has no well-defined finite or infinite result

• Examples: $$\infty-\infty$$, $$\infty/\infty$$, $$0/0$$, $$\sqrt{-1}$$, $$\mathrm{NaN}\odot x$$

### 18.1.2.4. IEEE Single Precision¶

• 1 sign bit, 8 exponent bits, 23 significand bits:

$\begin{split} \begin{array}{c|c|c} \text{x} & \text{xxxxxxxx} & \text{xxxxxxxxxxxxxxxxxxxxxxx} \\ \hline S & E & C \end{array} \end{split}$
• Represented number:

{\LARGE \begin{align*} (-1)^S \times 1.M \times 2^{E-127} \end{align*} }
• Special cases:

$\begin{split} \begin{array}{c|c|c|c} & E=0 & 0<E<255 & E=255 \\ \hline M=0 & \pm0 & \text{Powers of 2} & \pm\infty \\ \hline M\ne0 & \text{Denormalized} & \text{Ordinary numbers} & \text{NaN} \end{array} \end{split}$
• Examples:

$\begin{split} \begin{array}{c|c|c|l} S & E & M & \text{Quantity} \\ \hline 0 & 11111111 & 00000100000000000000000 & \text{NaN} \\ \hline 1 & 11111111 & 00100010001001010101010 & \text{NaN} \\ \hline 0 & 11111111 & 00000000000000000000000 & \infty \\ \hline 0 & 10000001 & 10100000000000000000000 & +1 \cdot 2^{129-127} \cdot 1.101 = 6.5 \\ \hline 0 & 10000000 & 00000000000000000000000 & +1 \cdot 2^{128-127} \cdot 1.0 = 2 \\ \hline 0 & 00000001 & 00000000000000000000000 & +1 \cdot 2^{1-127} \cdot 1.0 = 2^{-126} \\ \hline 0 & 00000000 & 10000000000000000000000 & +1 \cdot 2^{-126} \cdot 0.1 = 2^{-127} \\ \hline 0 & 00000000 & 00000000000000000000001 & +1 \cdot 2^{-126} \cdot 2^{-23} = 2^{-149} \\ \hline 0 & 00000000 & 00000000000000000000000 & 0 \\ \hline 1 & 00000000 & 00000000000000000000000 & -0 \\ \hline 1 & 10000001 & 10100000000000000000000 & -1 \cdot 2^{129-127} \cdot 1.101 = -6.5 \\ \hline 1 & 11111111 & 00000000000000000000000 & -\infty \end{array} \end{split}$

### 18.1.2.5. IEEE Floating Point Data Types¶

$\begin{split} \begin{array}{l|l|l} & \text{Single precision} & \text{Double precision} \\ \hline \text{Significand size} (p) & \text{24 bits} & \text{53 bits} \\ \hline \text{Exponent size} & \text{8 bits} & \text{11 bits} \\ \hline \text{Total size} & \text{32 bits} & \text{64 bits} \\ \hline e_\mathrm{max} & +127 & +1023 \\ \hline e_\mathrm{min} & -126 & -1022 \\ \hline \text{Smallest normalized} & 2^{-126}\approx 10^{-38} & 2^{-1022}\approx 10^{-308} \\ \hline \text{Largest normalized} & 2^{127}\approx 10^{38} & 2^{1023}\approx 10^{308} \\ \hline \epsilon_\mathrm{machine} & 2^{-24}\approx 6\cdots 10^{-8} & 2^{-53} \approx 10^{-16} \end{array} \end{split}$

## 18.1.3. IEEE Floating point demo¶

### 18.1.3.1. Overflow/underflow¶

1.0

1.0

-1.0

-1.0

1e100

1.0e100

1e-100

1.0e-100

1e308

1.0e308

1e308 * 2

Inf

1e-308

1.0e-308

1e-308 / 2^51

5.0e-324

1e-308 / 2^52

0.0


### 18.1.3.2. Cancellation¶

x = rand()
y = rand()
z = x - y

-0.1864368218809287

x1 = x + 1e10
y1 = y + 1e10
z1 = x1 - y1

-0.18643569946289062

z1 - z

1.1224180380686732e-6


### 18.1.3.3. Epsilon¶

1 + 1e-20

1.0

(1 + 1e-20) - 1

0.0

1 + 1e-16

1.0

1 + 2e-16

1.0000000000000002

(1 + 2e-16) - 1

2.220446049250313e-16

e = 1.0
while 1 + e > 1
e = e/2
println(e)
end

0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
6.103515625e-5
3.0517578125e-5
1.52587890625e-5
7.62939453125e-6
3.814697265625e-6
1.9073486328125e-6
9.5367431640625e-7
4.76837158203125e-7
2.384185791015625e-7
1.1920928955078125e-7
5.960464477539063e-8
2.9802322387695312e-8
1.4901161193847656e-8
7.450580596923828e-9
3.725290298461914e-9
1.862645149230957e-9
9.313225746154785e-10
4.656612873077393e-10
2.3283064365386963e-10
1.1641532182693481e-10
5.820766091346741e-11
2.9103830456733704e-11
1.4551915228366852e-11
7.275957614183426e-12
3.637978807091713e-12
1.8189894035458565e-12
9.094947017729282e-13
4.547473508864641e-13
2.2737367544323206e-13
1.1368683772161603e-13
5.684341886080802e-14
2.842170943040401e-14
1.4210854715202004e-14
7.105427357601002e-15
3.552713678800501e-15
1.7763568394002505e-15
8.881784197001252e-16
4.440892098500626e-16
2.220446049250313e-16
1.1102230246251565e-16

eps()

2.220446049250313e-16

b = 2.0^100

1.2676506002282294e30

(b + eps()*b) - b

2.81474976710656e14

(b + eps()/2*b) - b

0.0


### 18.1.3.4. Signed zeros¶

0.0

0.0

+0.0

0.0

-0.0

-0.0


### 18.1.3.5. Infinity¶

1.0/0.0

Inf

-1.0/0.0

-Inf

0.0/0.0

NaN

Inf

Inf

1/Inf

0.0

-1/Inf

-0.0

-1/-Inf

0.0

2*Inf

Inf

Inf + Inf

Inf

Inf^Inf

Inf


### 18.1.3.6. NaN (Not-a-Number)¶

Inf-Inf

NaN

Inf/Inf

NaN

0.0/0.0

NaN

NaN + 123

NaN


### 18.1.3.7. Check for NaN¶

x = NaN
x == NaN

false

x == x

false

isnan.([1, 2, 3, NaN, Inf])

5-element BitVector:
0
0
0
1
0

isinf.([1, 2, 3, NaN, Inf])

5-element BitVector:
0
0
0
0
1


### 18.1.3.8. Round to even¶

e = eps()/2
1 + e

1.0

1 + 2*e

1.0000000000000002

((1 + 2*e)-1) / e

2.0

((1 + 3*e)-1) / e

4.0

((1 + 4*e)-1) / e

4.0

for mul = 0:16
println([mul, ((1 + mul*e) - 1) / e])
end

[0.0, 0.0]
[1.0, 0.0]
[2.0, 2.0]
[3.0, 4.0]
[4.0, 4.0]
[5.0, 4.0]
[6.0, 6.0]
[7.0, 8.0]
[8.0, 8.0]
[9.0, 8.0]
[10.0, 10.0]
[11.0, 12.0]
[12.0, 12.0]
[13.0, 12.0]
[14.0, 14.0]
[15.0, 16.0]
[16.0, 16.0]


### 18.1.3.9. View hex/bin representations¶

using Printf
split32(s) = s * " " * s[2:9] * " " * s[10:32]
showbits(x) = @printf("%10.8g  =  %s\n", x, split32(bitstring(Float32(x))))

showbits.([0, -0, Inf, -Inf, NaN, -NaN]);
showbits.(1:10);
showbits.(1 .+ (0:10).*2^-23);
showbits.(2 .- (10:-1:0).*2^-23);

         0  =  0 00000000 00000000000000000000000
0  =  0 00000000 00000000000000000000000
Inf  =  0 11111111 00000000000000000000000
-Inf  =  1 11111111 00000000000000000000000
NaN  =  0 11111111 10000000000000000000000
NaN  =  1 11111111 10000000000000000000000
1  =  0 01111111 00000000000000000000000
2  =  0 10000000 00000000000000000000000
3  =  0 10000000 10000000000000000000000
4  =  0 10000001 00000000000000000000000
5  =  0 10000001 01000000000000000000000
6  =  0 10000001 10000000000000000000000
7  =  0 10000001 11000000000000000000000
8  =  0 10000010 00000000000000000000000
9  =  0 10000010 00100000000000000000000
10  =  0 10000010 01000000000000000000000
1  =  0 01111111 00000000000000000000000
1.0000001  =  0 01111111 00000000000000000000001
1.0000002  =  0 01111111 00000000000000000000010
1.0000004  =  0 01111111 00000000000000000000011
1.0000005  =  0 01111111 00000000000000000000100
1.0000006  =  0 01111111 00000000000000000000101
1.0000007  =  0 01111111 00000000000000000000110
1.0000008  =  0 01111111 00000000000000000000111
1.000001  =  0 01111111 00000000000000000001000
1.0000011  =  0 01111111 00000000000000000001001
1.0000012  =  0 01111111 00000000000000000001010
1.9999988  =  0 01111111 11111111111111111110110
1.9999989  =  0 01111111 11111111111111111110111
1.999999  =  0 01111111 11111111111111111111000
1.9999992  =  0 01111111 11111111111111111111001
1.9999993  =  0 01111111 11111111111111111111010
1.9999994  =  0 01111111 11111111111111111111011
1.9999995  =  0 01111111 11111111111111111111100
1.9999996  =  0 01111111 11111111111111111111101
1.9999998  =  0 01111111 11111111111111111111110
1.9999999  =  0 01111111 11111111111111111111111
2  =  0 10000000 00000000000000000000000