Two’s complement representation respects modular arithmetic nicely.
Addition of two signed integers is just binary addition of their bit patterns, modulo \(2^M\).
Integer arithmetic (addition, subtraction, multiplication) is exact except for the possibility of overflow and underflow.
The range of integers representable by an \(M\)-bit signed integer is \([-2^{M-1}, 2^{M-1}-1]\).
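For example, here is a quick check in Julia: overflow of a fixed-width integer wraps around silently rather than raising an error.

```julia
# Int8 arithmetic is modulo 2^8: adding 1 to typemax(Int8) = 127
# wraps around to typemin(Int8) = -128
x = Int8(127)
@show x + Int8(1)             # -128
@show bitstring(x)            # "01111111"
@show bitstring(x + Int8(1))  # "10000000"
```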
The Julia functions typemin(T) and typemax(T) give the lowest and highest representable numbers of a type T, respectively.
```julia
typemin(Int64), typemax(Int64)
```
```
(-9223372036854775808, 9223372036854775807)
```
```julia
for T in [Int8, Int16, Int32, Int64, Int128]
    println(T, '\t', typemin(T), '\t', typemax(T))
end
```
In scientific notation, a real number is represented as \[\pm d_0.d_1d_2 \cdots d_p \times b^e.\] On a computer, the base is \(b=2\) and the digits \(d_i\) are 0 or 1.
Normalized vs denormalized numbers. For example, the decimal number 18 can be written as \[ +1.0010 \times 2^4 \quad (\text{normalized})\] or, equivalently, \[ +0.1001 \times 2^5 \quad (\text{denormalized}).\]
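As a quick check, Julia's built-in significand and exponent functions recover the normalized representation:

```julia
# 18 = 1.125 × 2^4, and 1.125 is 1.0010 in binary
@show significand(18.0)  # 1.125
@show exponent(18.0)     # 4
@show bitstring(18.0);
```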
In a floating-point number system, the computer stores
- the sign bit,
- the exponent (offset by a fixed bias so it is stored as a nonnegative integer),
- the fraction (or mantissa, or significand) of the normalized representation.
There are some special rules in IEEE 754 for signed zeros.
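For instance, +0.0 and -0.0 have distinct bit patterns but compare as equal, and the sign is still observable through operations such as division:

```julia
@show bitstring(0.0)       # sign bit 0, all other bits 0
@show bitstring(-0.0)      # sign bit 1, all other bits 0
@show 0.0 == -0.0          # true
@show 1 / 0.0, 1 / -0.0;   # (Inf, -Inf)
```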
The exponent value \(e_{\min}-1\) (all exponent bits 0) combined with a nonzero mantissa encodes numbers of magnitude less than \(b^{e_{\min}}\).
Numbers in the range \((0, b^{e_{\min}})\) are denormalized; this is called graceful underflow.
```julia
@show nextfloat(0.0)              # next representable number
@show bitstring(nextfloat(0.0));  # denormalized
```
Rounding is necessary whenever a number requires more than \(p\) significand bits. Most computer systems use the default IEEE 754 round-to-nearest mode (also called ties-to-even mode). Julia offers several rounding modes, the default being RoundNearest. For example, the number 0.1 in the decimal system cannot be represented exactly as a binary floating-point number: \[0.1 = 1.\underbrace{1001}_{\text{repeats}}\underbrace{1001}_{\text{repeats}}\cdots \times 2^{-4}.\]
```julia
# half precision Float16: ...110(011...) rounds down to ...110
@show bitstring(Float16(0.1))
# single precision Float32: ...100(110...) rounds up to ...101
@show bitstring(0.1f0)
# double precision Float64: ...001(100...) rounds up to ...010
@show bitstring(0.1);
```
For a number whose mantissa ends in …001 followed by 100… (a 1 and then all 0s beyond the last stored bit), the value is exactly halfway between two representable numbers; the tie is broken by rounding to …010, which makes the last stored bit even.
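The ties-to-even rule is easiest to see at the integer level, where 1.5 and 2.5 are both exact ties:

```julia
@show round(1.5)                         # 2.0: rounds to the even integer
@show round(2.5)                         # 2.0: also rounds to the even integer
@show round(2.5, RoundNearestTiesAway);  # 3.0: the grade-school rule, for contrast
```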
4.6 Summary
Double precision: range \(\pm 10^{\pm 308}\) with precision up to 16 decimal digits.
Single precision: range \(\pm 10^{\pm 38}\) with precision up to 7 decimal digits.
Half precision: range \(\pm 10^{\pm 4}\) with precision up to 3 decimal digits.
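A quick way to see these precisions is to convert the same constant to each type:

```julia
# π carried to roughly 16, 7, and 3-4 significant decimal digits
@show Float64(π)   # 3.141592653589793
@show Float32(π)   # 3.1415927f0
@show Float16(π);  # Float16(3.14)
```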
Floating-point numbers are not spaced uniformly over the real number line. Each power-of-2 interval contains the same number of representable numbers, so the absolute spacing grows with magnitude; the exception is the interval around 0, which is filled in by the denormalized numbers (graceful underflow).
Machine epsilons are the spacings of the representable numbers around 1: \[\epsilon_{\min}=b^{-p}, \quad \epsilon_{\max} = b^{1-p},\] where \(\epsilon_{\min}\) is the gap between 1 and the next smaller representable number and \(\epsilon_{\max}\) is the gap to the next larger one.
```julia
@show eps(Float32)  # machine epsilon for a floating-point type
@show eps(Float64)  # same as eps()
# eps(x) is the spacing after x
@show eps(100.0)
@show eps(0.0)      # graceful underflow
# nextfloat(x) and prevfloat(x) give the neighbors of x
@show x = 1.25f0
@show prevfloat(x), x, nextfloat(x)
@show bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x));
```
Julia provides Float16 (half precision), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
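A small sketch of BigFloat usage (its default precision is 256 bits, adjustable with setprecision):

```julia
@show precision(BigFloat)  # 256 bits by default
@show big"0.1"             # 0.1 correctly rounded to ~77 decimal digits
# temporarily raise the working precision to 1024 bits
setprecision(BigFloat, 1024) do
    @show BigFloat(1) / 3
end;
```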
4.7 Overflow and underflow of floating-point numbers
For double precision, the range is \(\pm 10^{\pm 308}\). In most situations, underflow (the magnitude of the result is less than \(10^{-308}\)) is preferable to overflow (the magnitude of the result is larger than \(10^{308}\)): overflow produces \(\pm\infty\) (Inf in Julia), while underflow yields zeros or denormalized numbers, which are still usable.
E.g., the inverse logit (logistic) function is \[p(x) = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.\] For large \(x^T \beta\), the former expression easily produces Inf / Inf = NaN, while the latter expression underflows gracefully.
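A minimal check of the two expressions at a large linear predictor, here taking \(x^T \beta = 800\):

```julia
xβ = 800.0
@show exp(xβ)                   # overflow: Inf
@show exp(xβ) / (1 + exp(xβ))   # Inf / Inf = NaN
@show 1 / (1 + exp(-xβ));       # exp(-800) underflows gracefully to 0, giving 1.0
```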
The floatmin and floatmax functions give the smallest positive normalized number and the largest finite number representable by a type, respectively.
```julia
for T in [Float16, Float32, Float64]
    println(T, '\t', floatmin(T), '\t', floatmax(T), '\t',
        typemin(T), '\t', typemax(T), '\t', eps(T))
end
```
Scenario 1: Addition or subtraction of two numbers of widely different magnitudes: \(a+b\) or \(a-b\) where \(a \gg b\) or \(a \ll b\). We lose the precision carried by the number of smaller magnitude. Consider \[\begin{eqnarray*}
a &=& x.xxx ... \times 2^{30} \\
b &=& y.yyy... \times 2^{-30}
\end{eqnarray*}\] What happens when the computer calculates \(a+b\)? Since the exponents differ by 60, every bit of \(b\) falls below the last significand bit of \(a\), so we get \(a+b=a\)!
```julia
@show a = 2.0^30
@show b = 2.0^-30
@show a + b == a
```
```
a = 2.0 ^ 30 = 1.073741824e9
b = 2.0 ^ -30 = 9.313225746154785e-10
a + b == a = true
true
```
Scenario 2: Subtraction of two nearly equal numbers eliminates the leading significant digits: \(a-b\) where \(a \approx b\). Consider \[\begin{eqnarray*}
a &=& x.xxxxxxxxxx1ssss \\
b &=& x.xxxxxxxxxx0tttt
\end{eqnarray*}\] The leading digits cancel, so the result is \(1.vvvvu \cdots u \times 2^{-11}\) (relative to the common exponent), where the trailing \(u\) digits are unassigned, i.e., essentially noise.
```julia
a = 1.2345678f0  # rounded on input
@show bitstring(a)
b = 1.2345677f0  # rounded on input
@show bitstring(b)
# the correct result should be about 1e-7;
# we see a large error due to catastrophic cancellation
@show a - b
```
```
bitstring(a) = "00111111100111100000011001010001"
bitstring(b) = "00111111100111100000011001010000"
a - b = 1.1920929f-7
1.1920929f-7
```
Implications for numerical computation
Rule 1: add small numbers together before adding larger ones (demonstrated in the sketch after this list).
Rule 2: add numbers of like magnitude together (pairing). When all the numbers are of the same sign and similar magnitude, add them in pairs so that at each stage the summands are of similar magnitude.
Rule 3: avoid subtraction of two numbers that are nearly equal.
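Here is a sketch illustrating Rule 1 in single precision: the sum of 10^7 terms of size 1.0f-8 survives if the small terms are added together first, but adding them one at a time to 1.0f0 loses every term, since each is below half of eps(1.0f0) ≈ 1.2f-7. (Julia's built-in sum also applies Rule 2 internally via pairwise summation.)

```julia
small = fill(1.0f-8, 10^7)             # 10^7 terms; the true total is 0.1
@show sum(small) + 1.0f0               # Rule 1: small terms first, ≈ 1.1f0
@show foldl(+, small; init = 1.0f0);   # large number first: every add is absorbed, 1.0f0
```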
5.1 Algebraic laws
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See Homework 1 problems.
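One concrete instance: addition is not associative in double precision.

```julia
@show (0.1 + 0.2) + 0.3   # 0.6000000000000001
@show 0.1 + (0.2 + 0.3)   # 0.6
@show (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3);  # false
```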