cs2100 9 floating point

Post on 25-Dec-2015

CS2100 Computer Organisation

Floating Point Numbers

2011 Sem 1 Floating Point Numbers 2

FLOATING POINT ARITHMETIC

➤ Representing fractions
➤ Fixed point numbers
➤ Floating point numbers
➤ IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
➤ Floating point addition/subtraction
➤ Floating point multiplication
➤ Rounding and errors

FIXED POINT NUMBERS (RECAP)

➤ In fixed point representation, the binary point is assumed to be at a fixed location.
– For example, if the binary point is at the end of an 8-bit 2s complement representation, it can represent integers from -128 to +127.

FIXED POINT NUMBERS (RECAP)

➤ In general, the binary point may be assumed to be at any pre-fixed location.
– Example: Two fractional bits are assumed, so the 8 bits split into an integer part, the assumed binary point, and a fraction part.
– If 2s complement is used, we can represent values like:

$011010.11_{2s} = 26.75_{10}$

$111110.11_{2s} = -000001.01_{2} = -1.25_{10}$

FLOATING POINT NUMBERS (RECAP)

➤ Fixed point numbers have limited range.
➤ Floating point numbers allow us to represent very large or very small numbers.
➤ Examples:

$0.23 \times 10^{23}$ (very large positive number)
$0.5 \times 10^{-37}$ (very small positive number)
$-0.2397 \times 10^{-18}$ (very small negative number)

FLOATING POINT NUMBERS (RECAP)

➤ 3 parts: sign, mantissa and exponent
➤ The base (radix) is assumed to be 2.
➤ Sign bit: 0 for positive, 1 for negative.

Layout: sign | mantissa | exponent

➤ Mantissa is usually in normalised form (the integer part is zero and the fraction part must not begin with zero):
– $0.01101 \times 2^{4}$ normalised is $0.1101 \times 2^{3}$
– $101011.0110 \times 2^{-4}$ normalised is $0.101011011 \times 2^{2}$
➤ Trade-off:
– More bits in mantissa: better precision
– More bits in exponent: larger range of values

FLOATING POINT NUMBERS (RECAP)

➤ Exponent is usually expressed in complement or excess format.
➤ Example: Express $-6.5_{10}$ in base-2 normalised form:

$-6.5_{10} = -110.1_{2} = -0.1101_{2} \times 2^{3}$

➤ Assume that the floating-point representation contains a 1-bit sign, a 5-bit normalised mantissa, and a 4-bit exponent. The above example will be stored as follows if the exponent is in 1s or 2s complement:

1 11010 0011

FLOATING POINT NUMBERS (RECAP)

➤ Example: Express $0.1875_{10}$ in base-2 normalised form:

$0.1875_{10} = 0.0011_{2} = 0.11_{2} \times 2^{-2}$

➤ Assume this floating-point representation: 1-bit sign, 5-bit normalised mantissa, and 4-bit exponent.
➤ The above example will be represented as

0 11000 1101 if the exponent is in 1's complement.
0 11000 1110 if the exponent is in 2's complement.
0 11000 0110 if the exponent is in excess-8.

IEEE STANDARD 754 (1/3)

➤ Standardized in the mid-1980s
➤ Most widely-used standard for floating-point computation
➤ Two types of formats
– Normalized numbers
– Denormalized numbers
➤ Special values
– Negative zero
– Infinities
– Not-a-Number (NaN)

IEEE STANDARD 754 (2/3)

IEEE STANDARD 754 (3/3)

➤ The value represented by a normalized number is:

$v = (-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent} - \text{bias}}$

➤ Sign is a single bit
➤ Single precision: exponent bias = +127
➤ Double precision: exponent bias = +1023
➤ Exponent must NOT be 0. It must be in $[1, 2^{e} - 2]$, where e is the number of exponent bits. All-zero and all-one exponents are reserved for special values and are not used for normalized numbers.
➤ The normalized significand (1.fraction) is in the interval [1, 2)

A SINGLE PRECISION EXAMPLE

0100 0001 1010 0010 0000 0000 0000 0000

sign = 0, exponent = 10000011, fraction = 01000100000000000000000

➤ The value represented here is:

$v = (-1)^{0} \times 1.01000100000000000000000_{2} \times 2^{(10000011_{2} - 127)}$

$= 1.01000100000000000000000_{2} \times 2^{(131 - 127)}$

$= 1.01000100000000000000000_{2} \times 2^{4}$

$= 10100.01_{2}$

$= 20.25_{10}$

HOW TO PRINT

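➤ A simple C program to print out the representation in single precision:

    #include <stdio.h>

    /* Overlay a float and an integer so the float's bit pattern can be read.
       Assumes unsigned int and float are both 32 bits wide. */
    union { float f; unsigned int i; } u;

    int main(void)
    {
        u.f = 20.25f;
        /* %08x keeps leading zeros so all 32 bits are visible. */
        printf("0x%08x %f\n", u.i, u.f);
        return 0;
    }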

DENORMALIZED NUMBERS

➤ Goal: To represent really small (positive and negative) numbers
➤ The number represented is:

$v = (-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{-\text{bias} + 1}$

➤ Exponent must be 0, mantissa must be non-zero
– Note that the exponent bits are not interpreted as (0 – bias).

SPECIAL VALUES

➤ When both the exponent and the fractional part are zero, the value is zero
– Two zeroes: positive and negative
➤ If the exponent bits are all ones and the fractional part is zero, it represents infinity
– Two infinities: positive and negative
➤ If the exponent bits are all ones and the fractional part is non-zero, it represents NaN (Not-a-Number)
– Two types: signalling and non-signalling (quiet)

Comparison Rules

➤ Negative and positive zero compare equal

➤ Every NaN compares unequal to every value, including itself

➤ All values except NaN are strictly smaller than +∞ and strictly greater than −∞

EXAMPLES IN SINGLE PRECISION

RANGE OF NUMBERS

(Not drawn to absolute scale)

VERY IMPORTANT CAVEAT

➤ The most important thing you must know about floating point arithmetic is that associativity is not guaranteed (even when the operation is theoretically associative), i.e. in general

(A op B) op C ≠ A op (B op C)

even if “op” is “+”.

ROUNDING

➤ Rounding off destroys associativity!
➤ Rounding is selecting a representable number as the result.

Example (figure): X = 1.000···0 and Y = 1.000···1 are adjacent representable numbers, each with 23 fraction bits. The result of a computation falls between X and Y and cannot be represented. Report X or Y as the result?

THE FOUR ROUNDING MODES

➤ Round to nearest (default)

➤ Round towards zero

➤ Round towards positive infinity

➤ Round towards negative infinity

THE THREE EXTRA BITS

➤ To round as the standard requires, arithmetic is typically performed with 3 extra bits after the last fractional bit
– The guard bit
– The round bit
– The sticky bit (this is one if any bit to the right of it is one)

ROUND TO NEAREST

(Figure: the result of a computation lies between the representable numbers X and Y, which differ in the least significant bit, the last bit of the fraction. Y is nearer, so report Y.)

ROUND TOWARDS ZERO

(Figure: number line from 0 to +∞ with the result of a computation lying between X and Y. X is nearer 0, so report X.)

ROUND TOWARDS +∞

(Figure: number line from 0 to +∞ with the result of a computation lying between X and Y. Y is nearer +∞, so report Y.)

ROUND TOWARDS −∞

(Figure: number line from 0 to +∞ with the result of a computation lying between X and Y. X is nearer −∞, so report X.)

ERRORS

➤ Rounding yields a representable floating point number x’ that is an approximation of the real number x
➤ Absolute error = $|x' - x|$
➤ Relative error = $|x' - x| \, / \, |x|$ (assuming x ≠ 0)
➤ Errors will accumulate with more and more operations, so watch your errors!

Floating Point Addition

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ To perform X + Y, we need to align the decimal point by shifts such that the two exponents are the same
➤ If we assume p > q, then we can adjust Y such that:
– $Y' = (b \text{ with decimal point shifted } p - q \text{ decimal places to the left}) \times 10^{p}$
– This is called a denormalization shift.
➤ Now we can perform the addition
➤ Normalize the result

FLOATING POINT ADDITION

• Do the above in base 2

• Difference in the exponents = denormalization shift amount

• Shift the smaller number right by this amount to align the binary points

• Perform addition

• Have to re-normalize the result by shifting over leading zeros – leading zero detection (the hard part)

Floating Point Multiplication

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ $X \times Y = (a \times b) \times 10^{(p+q)}$
➤ Normalize the result
➤ More straightforward than addition

FLOATING POINT MULTIPLICATION

• Add the exponents

• Multiply the mantissas

• Since each input is in [1, 2), the result will be in [1, 4), so at most a 1-bit shift is needed to renormalize

END
