Upload: amanda

Post on 25-Dec-2015

CS2100 Computer Organisation

Floating Point Numbers

2011 Sem 1 Floating Point Numbers 2

FLOATING POINT ARITHMETIC

➤ Representing fractions
➤ Fixed point numbers
➤ Floating point numbers
➤ IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
➤ Floating point addition/subtraction
➤ Floating point multiplication
➤ Rounding and errors

FIXED POINT NUMBERS (RECAP)

➤ In fixed point representation, the binary point is assumed to be at a fixed location.
– For example, if the binary point is at the end of an 8-bit 2s complement representation, it can represent integers from -128 to +127.

FIXED POINT NUMBERS (RECAP)

➤ In general, the binary point may be assumed to be at any pre-fixed location.
– Example: Two fractional bits are assumed, so an 8-bit word has a 6-bit integer part followed by a 2-bit fraction part.
– If 2s complement is used, we can represent values like:
$011010.11_{2s} = 26.75_{10}$
$111110.11_{2s} = -000001.01_2 = -1.25_{10}$
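The two values above can be checked with a minimal C sketch. It assumes the 8-bit layout from this slide (6 integer bits, 2 fraction bits, 2s complement); the function name `fixed_6_2_to_double` is made up for illustration.

```c
#include <stdint.h>

/* Interpret an 8-bit pattern as 2s complement fixed point with
   two fraction bits. Dividing by 2^2 places the binary point. */
double fixed_6_2_to_double(uint8_t pattern)
{
    /* Patterns with the top bit set stand for negative integers. */
    int as_integer = pattern < 128 ? pattern : pattern - 256;
    return as_integer / 4.0;
}
```

Calling it with 0x6B (011010.11) gives 26.75, and with 0xFB (111110.11) gives -1.25.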

FLOATING POINT NUMBERS (RECAP)

➤ Fixed point numbers have limited range.
➤ Floating point numbers allow us to represent very large or very small numbers.
➤ Examples:
$0.23 \times 10^{23}$ (very large positive number)
$0.5 \times 10^{-37}$ (very small positive number)
$-0.2397 \times 10^{-18}$ (very small negative number)

FLOATING POINT NUMBERS (RECAP)

➤ 3 parts: sign, mantissa and exponent, laid out as | sign | mantissa | exponent |
➤ The base (radix) is assumed to be 2.
➤ Sign bit: 0 for positive, 1 for negative.
➤ Mantissa is usually in normalised form (the integer part is zero and the fraction part must not begin with zero):
$0.01101 \times 2^{4}$ is normalised to $0.1101 \times 2^{3}$
$101011.0110 \times 2^{-4}$ is normalised to $0.101011011 \times 2^{2}$
➤ Trade-off:
– More bits in mantissa: better precision
– More bits in exponent: larger range of values

FLOATING POINT NUMBERS (RECAP)

➤ Exponent is usually expressed in complement or excess format.
➤ Example: Express $-6.5_{10}$ in base-2 normalised form:
$-6.5_{10} = -110.1_2 = -0.1101_2 \times 2^{3}$
➤ Assume that the floating-point representation contains a 1-bit sign, a 5-bit normalised mantissa, and a 4-bit exponent. If the exponent is in 1s or 2s complement, the above example will be stored as:
1 11010 0011

FLOATING POINT NUMBERS (RECAP)

➤ Example: Express $0.1875_{10}$ in base-2 normalised form:
$0.1875_{10} = 0.0011_2 = 0.11_2 \times 2^{-2}$
➤ Assume this floating-point representation: 1-bit sign, 5-bit normalised mantissa, and 4-bit exponent.
➤ The above example will be represented as:
0 11000 1101 if the exponent is in 1's complement.
0 11000 1110 if the exponent is in 2's complement.
0 11000 0110 if the exponent is in excess-8.
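The encodings above can be reproduced with a short C sketch. It assumes the toy 1-5-4 format from these recap slides; the names `toy_fp`, `exp_code` and `toy_encode` are hypothetical, extra mantissa bits are simply truncated, and exponents outside the 4-bit range are not checked.

```c
typedef struct { unsigned sign, mantissa, exponent; } toy_fp;
typedef enum { CODE_ONES_COMPLEMENT, CODE_TWOS_COMPLEMENT, CODE_EXCESS_8 } exp_code;

/* Encode x as 1-bit sign, 5-bit normalised mantissa (0.1xxxx),
   and 4-bit exponent in the chosen code. */
toy_fp toy_encode(double x, exp_code code)
{
    toy_fp r = {0, 0, 0};
    double m = x < 0 ? -x : x;
    int e = 0, coded;

    r.sign = x < 0;
    if (m == 0.0)
        return r;                 /* assumption: zero is stored as all zeros */

    /* Normalise so that 0.5 <= m < 1, i.e. m = 0.1xxxx in binary. */
    while (m >= 1.0) { m /= 2.0; e++; }
    while (m < 0.5)  { m *= 2.0; e--; }
    r.mantissa = (unsigned)(m * 32.0);   /* keep 5 fraction bits */

    switch (code) {
    case CODE_ONES_COMPLEMENT: coded = e >= 0 ? e : ~(-e); break;
    case CODE_TWOS_COMPLEMENT: coded = e;                  break;
    default:                   coded = e + 8;              break;
    }
    r.exponent = (unsigned)coded & 0xF;  /* keep 4 exponent bits */
    return r;
}
```

For 0.1875 the mantissa field is 11000 in every case and the exponent field is 1101, 1110 or 0110 depending on the code; for -6.5 with a 2s complement exponent the fields are 1, 11010 and 0011.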

IEEE STANDARD 754 (1/3)

➤ Standardized in the mid-1980s
➤ Most widely-used standard for floating-point computation
➤ Two types of formats
– Normalized numbers
– Denormalized numbers
➤ Special values
– Negative zero
– Infinities
– Not-a-Number (NaN)

IEEE STANDARD 754 (2/3)

IEEE STANDARD 754 (3/3)

➤ The value of a normalized number is:
$v = (-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent} - \text{bias}}$
➤ Sign is a single bit
➤ Single precision: exponent bias = +127
➤ Double precision: exponent bias = +1023
➤ Exponent must NOT be 0. It must be in $[1, 2^e - 2]$, where e is the number of exponent bits. All-zero and all-one exponents are reserved for special values and are not used for normalized numbers.
➤ The normalized significand 1.fraction is in the interval [1, 2)

A SINGLE PRECISION EXAMPLE

➤ The bit pattern is:
0100 0001 1010 0010 0000 0000 0000 0000
sign = 0, exponent = 10000011, fraction = 01000100000000000000000
➤ The value represented here is:
$v = (-1)^{0} \times 1.01000100000000000000000_2 \times 2^{(10000011_2 - 127)}$
$= 1.01000100000000000000000_2 \times 2^{(131 - 127)}$
$= 1.01000100000000000000000_2 \times 2^{4}$
$= 10100.01_2$
$= 20.25_{10}$
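The decoding steps of this example can be sketched in C. The sketch assumes that `float` is a 32-bit IEEE 754 single precision value on the target machine; the names `ieee_fields`, `split_float` and `normalized_value` are invented here, and `normalized_value` is only meaningful when the exponent field is in [1, 254].

```c
#include <stdint.h>
#include <string.h>

typedef struct { unsigned sign; unsigned exponent; uint32_t fraction; } ieee_fields;

/* Split a float into its sign, exponent and fraction fields. */
ieee_fields split_float(float f)
{
    uint32_t bits;
    ieee_fields r;
    memcpy(&bits, &f, sizeof bits);      /* copy the raw 32 bits */
    r.sign     = bits >> 31;
    r.exponent = (bits >> 23) & 0xFF;
    r.fraction = bits & 0x7FFFFF;
    return r;
}

/* Rebuild (-1)^sign x 1.fraction x 2^(exponent - 127). */
double normalized_value(ieee_fields p)
{
    double v = 1.0 + p.fraction / 8388608.0;   /* 8388608 = 2^23 */
    int e = (int)p.exponent - 127;
    for (; e > 0; e--) v *= 2.0;
    for (; e < 0; e++) v /= 2.0;
    return p.sign ? -v : v;
}
```

For 20.25f the fields come out as sign 0, exponent 131 and fraction 0x220000, and `normalized_value` returns 20.25 again.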

HOW TO PRINT

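➤ A simple C program to print out the representation in single precision:

#include <stdio.h>
#include <stdint.h>

/* Overlay a float and a 32-bit integer so the raw bits can be read back. */
union { float f; uint32_t i; } u;

int main(void)
{
    u.f = 20.25f;
    /* On an IEEE 754 machine this prints: 0x41a20000 20.250000 */
    printf("0x%x %f\n", (unsigned)u.i, u.f);
    return 0;
}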

DENORMALIZED NUMBERS

➤ Goal: To represent really small (positive and negative) numbers
➤ The number represented is:
$v = (-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{1 - \text{bias}}$
➤ Exponent must be 0, mantissa must be non-zero
– Note that the exponent bits are not interpreted as (0 – bias).
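As a rough check of the denormalized formula in single precision (bias = 127, so the scale is 2^-126), the sketch below rebuilds the value from the 23 fraction bits and compares it with what the hardware reports. It assumes `float` is IEEE 754 single precision; `denormal_value` and `float_from_bits` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Value of a positive single precision denormal:
   0.fraction x 2^(1 - 127). */
double denormal_value(uint32_t fraction)
{
    double v = fraction / 8388608.0;     /* 0.fraction, 8388608 = 2^23 */
    int i;
    for (i = 0; i < 126; i++)
        v /= 2.0;                        /* scale by 2^-126 */
    return v;
}

/* Reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The smallest positive denormal has fraction 000...01 and the largest has fraction 111...11; both should match the values obtained by reinterpreting the bit patterns 0x00000001 and 0x007FFFFF.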

SPECIAL VALUES

➤ When both the exponent and the fractional part are zero, the value is zero
– Two zeroes: positive and negative
➤ If the exponent bits are all ones and the fractional part is zero, it represents infinity
– Two infinities: positive and negative
➤ If the exponent bits are all ones and the fractional part is non-zero, it represents NaN (Not-a-Number)
– Two types: signalling or non-signalling, depending on user choice
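The rules above, together with the normalized and denormalized cases, amount to a small decision table on the exponent and fraction fields. A minimal sketch for single precision follows; the enum and the function name `classify_bits` are made up for illustration.

```c
#include <stdint.h>

typedef enum { KIND_ZERO, KIND_DENORMAL, KIND_NORMAL, KIND_INFINITY, KIND_NAN } value_kind;

/* Classify a 32-bit single precision pattern by its fields. */
value_kind classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t fraction = bits & 0x7FFFFF;

    if (exponent == 0)            /* all-zero exponent */
        return fraction == 0 ? KIND_ZERO : KIND_DENORMAL;
    if (exponent == 0xFF)         /* all-one exponent */
        return fraction == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}
```

The sign bit is ignored on purpose, so 0x00000000 and 0x80000000 are both classified as zero, and 0x7F800000 and 0xFF800000 are both classified as infinity.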

Comparison Rules

➤ Negative and positive zero compare equal

➤ Every NaN compares unequal to every value, including itself

➤ All values except NaN are strictly smaller than +∞ and strictly greater than −∞
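These comparison rules can be observed directly in C, assuming the compiler follows IEEE 754 comparison semantics (for example, no fast-math style options). The helper names are invented for this sketch; `NAN` and `INFINITY` are the standard macros from `<math.h>`.

```c
#include <math.h>
#include <stdbool.h>

/* NaN is the only value that compares unequal to itself. */
bool is_nan_by_self_compare(float x) { return x != x; }

/* True for every value except NaN and +infinity itself. */
bool is_below_pos_infinity(float x) { return x < INFINITY; }

/* +0 and -0 differ in the sign bit but compare equal. */
bool zeros_compare_equal(void) { return 0.0f == -0.0f; }
```

A practical consequence is that a test such as `x == NAN` is always false; the self-comparison above (or the standard `isnan` macro) is the usual way to detect NaN.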

EXAMPLES IN SINGLE PRECISION

RANGE OF NUMBERS

(Not drawn to absolute scale)

VERY IMPORTANT CAVEAT

➤ The most important thing you must know about floating point arithmetic is that associativity is not guaranteed (even when the operation is theoretically associative), i.e. in general
(A op B) op C ≠ A op (B op C)
even if “op” is “+”.
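A small C sketch makes the caveat concrete. The function names are hypothetical; the operand values are chosen so that 1.0 is far below the precision available at magnitude 1e20 and is therefore lost when added to it first.

```c
/* The same three operands, grouped in two different ways. */
float sum_left(float a, float b, float c)  { return (a + b) + c; }
float sum_right(float a, float b, float c) { return a + (b + c); }
```

With a = 1e20f, b = -1e20f and c = 1.0f, `sum_left` returns 1.0 because a + b is exactly 0, while `sum_right` returns 0.0 because b + c rounds back to b.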

ROUNDING

➤ Rounding off destroys associativity!
➤ Rounding is selecting a representable number as the result.
Example: X = 1.000···0 and Y = 1.000···1 (23 fraction bits each) are adjacent representable numbers. The result of computation falls between them and cannot be represented! Report X or Y as the result?

THE FOUR ROUNDING MODES

➤ Round to nearest (default)

➤ Round towards zero

➤ Round towards positive infinity

➤ Round towards negative infinity

THE THREE EXTRA BITS

➤ To round correctly, arithmetic is performed with 3 extra bits after the last fractional bit:
– The guard bit
– The round bit
– The sticky bit (this is one if any bit to the right of it is one)

ROUND TO NEAREST

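(Figure: the result of computation lies between adjacent representable numbers X and Y, which differ only in the least significant bit, the last bit of the fraction. The result is nearer to Y, so report Y!)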

ROUND TOWARDS ZERO

(Figure: on a number line from 0 to +∞, the result of computation lies between X and Y. X is nearer 0, so report X!)

ROUND TOWARDS +∞

(Figure: on a number line from 0 to +∞, the result of computation lies between X and Y. Y is nearer +∞, so report Y!)

ROUND TOWARDS −∞

(Figure: on a number line from 0 to +∞, the result of computation lies between X and Y. X is nearer −∞, so report X!)
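The four rounding modes and the three extra bits can be put together in one integer-only sketch. It is an illustration under stated assumptions, not how any particular FPU is built: `wide` is taken to hold the significand magnitude followed by the guard, round and sticky bits, the names `round_mode` and `round_off_grs` are made up, and round to nearest breaks exact ties towards the even neighbour (the IEEE 754 default, which the slides do not spell out).

```c
#include <stdint.h>

typedef enum { TO_NEAREST, TOWARD_ZERO, TOWARD_POS_INF, TOWARD_NEG_INF } round_mode;

/* Drop the 3 extra bits (guard, round, sticky) from `wide` and
   return the rounded significand magnitude. `negative` is the sign. */
uint32_t round_off_grs(uint32_t wide, int negative, round_mode mode)
{
    uint32_t kept  = wide >> 3;   /* magnitude of the neighbour nearer zero */
    uint32_t extra = wide & 0x7;  /* guard, round, sticky */
    int away_from_zero = 0;       /* 1 means pick the other neighbour */

    switch (mode) {
    case TO_NEAREST:              /* 100 is the exact halfway point */
        away_from_zero = extra > 4 || (extra == 4 && (kept & 1));
        break;
    case TOWARD_ZERO:             /* truncate */
        break;
    case TOWARD_POS_INF:
        away_from_zero = !negative && extra != 0;
        break;
    case TOWARD_NEG_INF:
        away_from_zero = negative && extra != 0;
        break;
    }
    /* A carry out of the top bit would need renormalisation by the caller. */
    return kept + (uint32_t)away_from_zero;
}
```

For example, a positive significand 1011 followed by extra bits 101 rounds to 1100 under round to nearest and round towards +∞, and to 1011 under the other two modes.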

ERRORS

➤ Rounding yields a representable floating point number x’ that is an approximation of the real number x
➤ Absolute error = |x’ – x|
➤ Relative error = |x’ – x| / |x| (assuming x ≠ 0)
➤ Errors will accumulate with more and more operations – watch your errors!
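The two error measures translate directly into C. In this sketch the function names are hypothetical, and the "real number" x is stood in for by a `double`, which is itself only an approximation but a much closer one than a `float`.

```c
/* Absolute error |x' - x|. */
double absolute_error(double approx, double exact)
{
    double d = approx - exact;
    return d < 0 ? -d : d;
}

/* Relative error |x' - x| / |x|; the caller must ensure exact != 0. */
double relative_error(double approx, double exact)
{
    double magnitude = exact < 0 ? -exact : exact;
    return absolute_error(approx, exact) / magnitude;
}
```

For instance, `relative_error((double)0.1f, 0.1)` is small but non-zero because 0.1 has no finite binary expansion, whereas 20.25 is stored exactly and has zero error.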

Floating Point Addition

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ To perform X + Y, we need to align the decimal point by shifts such that the two exponents are the same
➤ If we assume p > q, then we can adjust Y such that:
– Y’ = (b with decimal point shifted p – q decimal places to the left) $\times 10^{p}$
– This is called a denormalization shift.
➤ Now we can perform the addition
➤ Normalize the result

FLOATING POINT ADDITION

• Do the above in base 2
• Difference in the exponents = denormalization shift amount
• Shift the smaller number by this amount to align the binary points
• Perform the addition
• Re-normalize the result by shifting over leading zeros, which requires leading zero detection (the hard part)
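The steps above can be sketched with integer significands. This is a simplified model, not the IEEE 754 algorithm: the type `toy_addend` and the function `toy_add` are invented for illustration, a value is stored as sig × 2^exp with a signed 24-bit significand, bits shifted out during alignment are truncated instead of rounded, and special values are ignored.

```c
#include <stdint.h>

/* value = sig * 2^exp; normalised when 2^23 <= |sig| < 2^24 */
typedef struct { int64_t sig; int exp; } toy_addend;

#define LOW_BOUND  ((int64_t)1 << 23)
#define HIGH_BOUND ((int64_t)1 << 24)

static int64_t magnitude(int64_t v) { return v < 0 ? -v : v; }

toy_addend toy_add(toy_addend x, toy_addend y)
{
    toy_addend r;
    int shift;

    if (x.exp < y.exp) { toy_addend t = x; x = y; y = t; }
    shift = x.exp - y.exp;            /* denormalization shift amount */

    /* Align the smaller operand, then add the significands. */
    r.sig = x.sig + (shift < 62 ? y.sig / ((int64_t)1 << shift) : 0);
    r.exp = x.exp;

    /* Re-normalize: a carry needs a right shift, leading zeros
       (after cancellation) need left shifts. */
    while (magnitude(r.sig) >= HIGH_BOUND) { r.sig /= 2; r.exp++; }
    while (r.sig != 0 && magnitude(r.sig) < LOW_BOUND) { r.sig *= 2; r.exp--; }
    return r;
}
```

With this encoding 1.5 is {0xC00000, -23} and 0.25 is {0x800000, -25}; their sum comes out as {0xE00000, -23}, which is 1.75. Adding 1.5 and -1.25 leaves leading zeros that the second loop shifts out, giving {0x800000, -25}, which is 0.25.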

Floating Point Multiplication

➤ Given two decimal numbers in scientific notation:
– $X = a \times 10^{p}$
– $Y = b \times 10^{q}$
➤ $X \times Y = (a \times b) \times 10^{(p+q)}$
➤ Normalize the result
➤ More straightforward than addition

FLOATING POINT MULTIPLICATION

• Add the exponents
• Multiply the mantissas
• Since each input mantissa is in [1, 2), the product will be in [1, 4), so at most a 1-bit shift is needed to renormalize
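A matching sketch for multiplication shows why at most one shift is needed. As before this is a simplified model under stated assumptions: `toy_factor` and `toy_mul` are hypothetical names, a value is (-1)^sign × sig × 2^exp with a 24-bit significand, the low product bits are truncated instead of rounded, and special values and exponent overflow are ignored.

```c
#include <stdint.h>

/* value = (-1)^sign * sig * 2^exp, with 2^23 <= sig < 2^24 */
typedef struct { unsigned sign; uint32_t sig; int exp; } toy_factor;

toy_factor toy_mul(toy_factor x, toy_factor y)
{
    toy_factor r;
    uint64_t product = (uint64_t)x.sig * y.sig;  /* in [2^46, 2^48) */

    r.sign = x.sign ^ y.sign;       /* signs combine by XOR */
    r.exp  = x.exp + y.exp + 23;    /* add exponents, account for the shift below */
    product >>= 23;                 /* now in [2^23, 2^25) */

    if (product >= ((uint64_t)1 << 24)) {   /* mantissa product was in [2, 4) */
        product >>= 1;              /* the single renormalizing shift */
        r.exp++;
    }
    r.sig = (uint32_t)product;
    return r;
}
```

With 1.5 encoded as {0, 0xC00000, -23}, multiplying 1.5 by -1.5 gives {1, 0x900000, -22}, which is -2.25 and needed the one-bit shift; multiplying 1.0 by 1.5 needs no shift at all.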

END