CS2100 Computer Organisation
Floating Point Numbers
2011 Sem 1 Floating Point Numbers 2
FLOATING POINT ARITHMETIC
➤ Representing fractions
➤ Fixed point numbers
➤ Floating point numbers
➤ IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
➤ Floating point addition/subtraction
➤ Floating point multiplication
➤ Rounding and errors
FIXED POINT NUMBERS (RECAP)
➤ In fixed point representation, the binary point is assumed to be at a fixed location.
– For example, if the binary point is at the end of an 8-bit representation, it can represent integers from -128 to +127.
[Figure: 8-bit word with the binary point after the last bit]
FIXED POINT NUMBERS (RECAP)
➤ In general, the binary point may be assumed to be at any pre-fixed location.
– Example: Two fractional bits are assumed, giving a 6-bit integer part and a 2-bit fraction part.
– If 2s complement is used, we can represent values like:
$011010.11_{2s} = 26.75_{10}$
$111110.11_{2s} = -000001.01_{2} = -1.25_{10}$
FLOATING POINT NUMBERS (RECAP)
➤ Fixed point numbers have limited range.
➤ Floating point numbers allow us to represent very large or very small numbers.
➤ Examples:
$0.23 \times 10^{23}$ (very large positive number)
$0.5 \times 10^{-37}$ (very small positive number)
$-0.2397 \times 10^{-18}$ (very small negative number)
FLOATING POINT NUMBERS (RECAP)
➤ 3 parts: sign, mantissa and exponent
[Layout: sign | mantissa | exponent]
➤ The base (radix) is assumed to be 2.
➤ Sign bit: 0 for positive, 1 for negative.
➤ Mantissa is usually in normalised form (the integer part is zero and the fraction part must not begin with zero)
$0.01101 \times 2^{4}$ → normalised → $0.1101 \times 2^{3}$
$101011.0110 \times 2^{-4}$ → normalised → $0.101011011 \times 2^{2}$
➤ Trade-off:
– More bits in mantissa → better precision
– More bits in exponent → larger range of values
FLOATING POINT NUMBERS (RECAP)
➤ Exponent is usually expressed in complement or excess format.
➤ Example: Express $-6.5_{10}$ in base-2 normalised form
$-6.5_{10} = -110.1_{2} = -0.1101_{2} \times 2^{3}$
➤ Assume that the floating-point representation contains a 1-bit sign, a 5-bit normalised mantissa, and a 4-bit exponent. The above example will be stored as
1 11010 0011
if the exponent is in 1s or 2s complement.
FLOATING POINT NUMBERS (RECAP)
➤ Example: Express $0.1875_{10}$ in base-2 normalised form
$0.1875_{10} = 0.0011_{2} = 0.11_{2} \times 2^{-2}$
➤ Assume this floating-point representation: 1-bit sign, 5-bit normalised mantissa, and 4-bit exponent.
➤ The above example will be represented as
0 11000 1101 If exponent is in 1’s complement.
0 11000 1110 If exponent is in 2’s complement.
0 11000 0110 If exponent is in excess-8.
IEEE STANDARD 754 (1/3)
➤ Standardized in the mid-1980s
➤ Most widely-used standard for floating-point computation
➤ Two types of formats
– Normalized numbers
– Denormalized numbers
➤ Special values
– Negative zero
– Infinities
– Not-a-Number (NaN)
IEEE STANDARD 754 (2/3)
IEEE STANDARD 754 (3/3)
➤ A normalized number represented is:
$v = (-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent} - \text{bias}}$
➤ Sign is a single bit
➤ Single precision: exponent bias = +127
➤ Double precision: exponent bias = +1023
➤ Exponent must NOT be 0. It must be in $[1, 2^{e} - 2]$, where e is the number of exponent bits. All-zero and all-one exponents are reserved for special values and are not used for normalized numbers.
➤ A normalized significand (1.fraction) is in the interval [1, 2)
A SINGLE PRECISION EXAMPLE
➤ Bit pattern: 0100 0001 1010 0010 0000 0000 0000 0000
– sign: 0
– exponent: 1000 0011
– fraction: 010 0010 0000 0000 0000 0000
➤ The value represented here is:
$v = (-1)^{0} \times 1.01000100000000000000000_{2} \times 2^{(10000011_{2} - 127)}$
$= 1.01000100000000000000000_{2} \times 2^{(131 - 127)}$
$= 1.01000100000000000000000_{2} \times 2^{4}$
$= 10100.01_{2}$
$= 20.25_{10}$
HOW TO PRINT
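➤ A simple C program to print out the representation in single precision:
#include <stdio.h>
#include <stdint.h>

/* Overlay a float and a 32-bit unsigned integer to read back the bits. */
union { float f; uint32_t i; } u;

int main(void) {
    u.f = 20.25f;
    printf("0x%x %f\n", (unsigned)u.i, u.f);  /* 0x41a20000 20.250000 */
    return 0;
}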
DENORMALIZED NUMBERS
➤ Goal: To represent really small (positive and negative) numbers
➤ The number represented is:
$v = (-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{-\text{bias}+1}$
➤ Exponent must be 0, mantissa must be non-zero
– Note that the exponent bits are not interpreted as (0 – bias).
SPECIAL VALUES
➤ When both the exponent and the fractional part are zero, the value is zero
– Two zeroes: positive and negative
➤ If the exponent bits are all ones and the fractional part is zero, it represents infinity
– Two infinities: positive and negative
➤ If the exponent bits are all ones and the fractional part is non-zero, it represents NaN (Not-a-Number)
– Two types: signalling or non-signalling, depending on user choice
Comparison Rules
➤ Negative and positive zero compare equal
➤ Every NaN compares unequal to every value, including itself
➤ All values except NaN are strictly smaller than +∞ and strictly greater than −∞
EXAMPLES IN SINGLE PRECISION
RANGE OF NUMBERS
(Not drawn to absolute scale)
VERY IMPORTANT CAVEAT
➤ The most important thing you must know about floating point arithmetic is that associativity is not guaranteed (even when the operation is theoretically associative), i.e. it can happen that
(A op B) op C ≠ A op (B op C)
even if “op” is “+”.
ROUNDING
➤ Rounding off destroys associativity!
➤ Rounding is selecting a representable number as the result.
Example: X = 1.000···0 and Y = 1.000···1 (23 fraction bits each) are adjacent representable numbers. The result of computation lies between X and Y and cannot be represented! Report X or Y as result?
THE FOUR ROUNDING MODES
➤ Round to nearest (default)
➤ Round towards zero
➤ Round towards positive infinity
➤ Round towards negative infinity
THE THREE EXTRA BITS
➤ Standard specifies that all arithmetic must be performed with 3 extra bits at the end of the last fractional bit
– The guard bit
– The round bit
– The sticky bit (this is one if any bit to the right of it is one)
ROUND TO NEAREST
[Figure: the result of computation lies between representable numbers X and Y, closer to Y.]
Y is nearer! So report Y!
Least significant bit: the last bit of the fraction
ROUND TOWARDS ZERO
[Figure: number line from 0 to +∞; the result of computation lies between representable numbers X and Y.]
X is nearer 0! So report X!
ROUND TOWARDS +∞
[Figure: number line from 0 to +∞; the result of computation lies between representable numbers X and Y.]
Y is nearer +∞! So report Y!
ROUND TOWARDS −∞
[Figure: number line from 0 to +∞; the result of computation lies between representable numbers X and Y.]
X is nearer −∞! So report X!
ERRORS
➤ Rounding yields a representable floating point number x’ that is an approximation of the real number x
➤ Absolute error = $|x' - x|$
➤ Relative error = $|x' - x| / |x|$ (assuming $x \neq 0$)
➤ Errors will accumulate with more and more operations. Watch your errors!
Floating Point Addition
➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ To perform X + Y, we need to align the decimal point by shifts such that the two exponents are the same
➤ If we assume p > q, then we can adjust Y such that:
– $Y' = (b \text{ with decimal point shifted } p - q \text{ decimal places to the left}) \times 10^{p}$
– This is called a denormalization shift.
➤ Now we can perform the addition
➤ Normalize the result
FLOATING POINT ADDITION
• Do the above in base 2
• Difference in the exponents = denormalization shift amount
• Shift the mantissa of the smaller number right by this amount to align the binary points
• Perform addition
• Have to re-normalize the result by shifting over leading zeros – leading zero detection (the hard part)
Floating Point Multiplication
➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ $X \times Y = (a \times b) \times 10^{(p+q)}$
➤ Normalize the result
➤ More straightforward than addition
FLOATING POINT MULTIPLICATION
• Add the exponents
• Multiply the mantissas
• Since each input mantissa is in [1, 2), the product will be in [1, 4), so at most a 1-bit shift is needed to renormalize
END