cs2100 9 floating point

CS2100 Computer Organisation

Floating Point Numbers

2011 Sem 1 Floating Point Numbers 2

FLOATING POINT ARITHMETIC

➤ Representing fractions
➤ Fixed point numbers
➤ Floating point numbers
➤ IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
➤ Floating point addition/subtraction
➤ Floating point multiplication
➤ Rounding and errors


FIXED POINT NUMBERS (RECAP)

➤ In fixed point representation, the binary point is assumed to be at a fixed location.
– For example, if the binary point is at the end of an 8-bit 2s complement representation, it can represent integers from -128 to +127.


FIXED POINT NUMBERS (RECAP)

➤ In general, the binary point may be assumed to be at any pre-fixed location.
– Example: Two fractional bits are assumed, so an 8-bit value has a 6-bit integer part followed by a 2-bit fraction part.
– If 2s complement is used, we can represent values like:
  $011010.11_{2s} = 26.75_{10}$
  $111110.11_{2s} = -000001.01_2 = -1.25_{10}$


FLOATING POINT NUMBERS (RECAP)

➤ Fixed point numbers have limited range.
➤ Floating point numbers allow us to represent very large or very small numbers.
➤ Examples:
  $0.23 \times 10^{23}$ (very large positive number)
  $0.5 \times 10^{-37}$ (very small positive number)
  $-0.2397 \times 10^{-18}$ (very small negative number)



➤ 3 parts: sign, mantissa and exponent
  Layout: sign | mantissa | exponent
➤ The base (radix) is assumed to be 2.
➤ Sign bit: 0 for positive, 1 for negative.
➤ Mantissa is usually in normalised form (the integer part is zero and the fraction part must not begin with zero)
  $0.01101 \times 2^{4}$ → normalised → $0.1101 \times 2^{3}$
  $101011.0110 \times 2^{-4}$ → normalised → $0.101011011 \times 2^{2}$
➤ Trade-off:
– More bits in mantissa → better precision
– More bits in exponent → larger range of values



➤ Exponent is usually expressed in complement or excess format.

➤ Example: Express $-6.5_{10}$ in base-2 normalised form:
  $-6.5_{10} = -110.1_2 = -0.1101_2 \times 2^{3}$

➤ Assume that the floating-point representation contains a 1-bit sign, a 5-bit normalised mantissa, and a 4-bit exponent. The above example will be stored as follows if the exponent is in 1s or 2s complement:

  1 11010 0011



➤ Example: Express $0.1875_{10}$ in base-2 normalised form:
  $0.1875_{10} = 0.0011_2 = 0.11_2 \times 2^{-2}$

➤ Assume this floating-point representation: 1-bit sign, 5-bit normalised mantissa, and 4-bit exponent.

➤ The above example will be represented as:

  0 11000 1101 if the exponent is in 1's complement.
  0 11000 1110 if the exponent is in 2's complement.
  0 11000 0110 if the exponent is in excess-8.


IEEE STANDARD 754 (1/3)

➤ Standardized in the mid-1980s
➤ Most widely-used standard for floating-point computation
➤ Two types of formats
– Normalized numbers
– Denormalized numbers
➤ Special values
– Negative zero
– Infinities
– Not-a-Number (NaN)



➤ A normalized number represents the value:

  $v = (-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent} - \text{bias}}$

➤ Sign is a single bit
➤ Single precision – exponent bias = +127
➤ Double precision – exponent bias = +1023
➤ Exponent must NOT be 0. It must be in $[1, 2^{e} - 2]$, where e is the number of exponent bits. All-zero and all-one exponents are reserved for special values and are not used for normalized numbers

➤ The normalized significand (1.fraction) is in the interval [1, 2)


A SINGLE PRECISION EXAMPLE

➤ Bit pattern: 0100 0001 1010 0010 0000 0000 0000 0000
– sign (1 bit): 0
– exponent (8 bits): 10000011
– fraction (23 bits): 01000100000000000000000

➤ The value represented here is:
  $v = (-1)^{0} \times 1.01000100000000000000000_2 \times 2^{(10000011_2 - 127)}$
  $= 1.01000100000000000000000_2 \times 2^{(131 - 127)}$
  $= 1.01000100000000000000000_2 \times 2^{4}$
  $= 10100.01_2$
  $= 20.25_{10}$


HOW TO PRINT

➤ A simple C program to print out the representation in single precision:

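    #include <stdio.h>

    /* Overlay a float and an unsigned int so the same bits can be read both ways.
       Assumes a 32-bit int and an IEEE 754 single precision float. */
    union { float f; unsigned int i; } u;

    int main(void) {
        u.f = 20.25f;
        printf("0x%x %f\n", u.i, u.f);  /* prints 0x41a20000 20.250000 */
        return 0;
    }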


DENORMALIZED NUMBERS

➤ Goal: To represent really small (positive and negative) numbers

➤ The number represented is:

  $v = (-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{-\text{bias} + 1}$

➤ Exponent must be 0, mantissa must be non-zero
– Note that the exponent bits are not interpreted as (0 – bias).


SPECIAL VALUES

➤ When both the exponent and the fractional part are zero, the value is zero
– Two zeroes: positive and negative

➤ If the exponent bits are all ones and the fractional part is zero, it represents infinity
– Two infinities: positive and negative

➤ If the exponent bits are all ones and the fractional part is non-zero, it represents NaN (Not-a-Number)
– Two types: signalling or non-signalling, depending on user choice


Comparison Rules

➤ Negative and positive zero compare equal

➤ Every NaN compares unequal to every value, including itself

➤ All finite values are strictly smaller than +∞ and strictly greater than −∞ (and −∞ is strictly smaller than +∞)


EXAMPLES IN SINGLE PRECISION


RANGE OF NUMBERS

(Not drawn to absolute scale)


VERY IMPORTANT CAVEAT

➤ The most important thing you must know about floating point arithmetic is that associativity is not guaranteed (even when the operation is theoretically associative), i.e. in general

(A op B) op C ≠ A op (B op C)

even if “op” is “+”.


ROUNDING

➤ Rounding off destroys associativity!
➤ Rounding is selecting a representable number as the result.

Example: X = 1.000···0 and Y = 1.000···1 (23 fraction bits each) are adjacent representable numbers. The result of a computation falls between X and Y and cannot be represented! Report X or Y as the result?


THE FOUR ROUNDING MODES

➤ Round to nearest (default)

➤ Round towards zero

➤ Round towards positive infinity

➤ Round towards negative infinity


THE THREE EXTRA BITS

➤ Standard specifies that all arithmetic must be performed with 3 extra bits at the end of the last fractional bit
– The guard bit
– The round bit
– The sticky bit (this is one if any bit to the right of it is one)


ROUND TO NEAREST

The result of the computation lies between two adjacent representable numbers X and Y, which differ in the least significant bit (the last bit of the fraction). Y is nearer! So report Y!


ROUND TOWARDS ZERO


Number line from 0 to +∞: the result of the computation lies between X and Y. X is nearer 0! So report X!


ROUND TOWARDS +∞


Number line from 0 to +∞: the result of the computation lies between X and Y. Y is nearer +∞! So report Y!


ROUND TOWARDS −∞


Number line from 0 to +∞: the result of the computation lies between X and Y. X is nearer −∞! So report X!


ERRORS

➤ Rounding yields a representable floating point number x’ that is an approximation of the real number x

➤ Absolute error = |x’ – x|
➤ Relative error = |x’ – x| / |x| (assuming x ≠ 0)
➤ Errors will accumulate with more and more operations – watch your errors!


Floating Point Addition

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$

➤ To perform X + Y, we need to align the decimal point by shifts such that the two exponents are the same

➤ If we assume p > q, then we can adjust Y such that:
– Y’ = (b with decimal point shifted p – q decimal places to the left) $\times 10^{p}$
– This is called a denormalization shift.

➤ Now we can perform the addition
➤ Normalize the result


FLOATING POINT ADDITION

• Do the above in base 2

• Difference in the exponents = denormalization shift amount

• Shift the smaller number by this amount to align the binary points

• Perform addition

• Re-normalize the result by shifting over leading zeros – leading zero detection (the hard part)


Floating Point Multiplication

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$

➤ $X \times Y = (a \times b) \times 10^{(p+q)}$

➤ Normalize the result
➤ More straightforward than addition


FLOATING POINT MULTIPLICATION

• Add the exponents

• Multiply the mantissas

• Since each input mantissa is in [1, 2), the product will be in [1, 4) – at most 1 bit shift to renormalize


END
