12.1 Rounding Modes

Rounding: the process of obtaining the best possible floating-point representation for a given real value.

ANSI/IEEE standard: round to the nearest floating-point number; when the value lies exactly midway between two adjacent floating-point numbers, choose the one whose significand has an LSB of 0 (of two adjacent floating-point numbers, the significand of one must end in 0 and the other in 1). This is called round-to-nearest-even.

For example, when rounding to integers, 3.5 and 4.5 are both rounded to 4, the closest even number, under round-to-nearest-even.
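The tie-breaking rule can be observed directly in Python, whose built-in round() and int-to-float conversion both resolve ties toward the even neighbor. This is a small illustrative sketch, assuming IEEE double precision for Python floats; the variable names are not from the slides.

```python
# Integer-level ties: round() picks the even neighbor.
print(round(3.5), round(4.5))  # 4 4

# Significand-level ties: above 2**53, adjacent doubles are 2 apart, so
# 2**53 + 1 and 2**53 + 3 each lie exactly midway between two doubles.
low_tie = float(2**53 + 1)   # neighbors: 2**53 and 2**53 + 2
high_tie = float(2**53 + 3)  # neighbors: 2**53 + 2 and 2**53 + 4

# In both cases the neighbor whose significand ends in 0 is chosen.
print(int(low_tie) - 2**53)   # 0
print(int(high_tie) - 2**53)  # 4
```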

• Other rounding methods
– Round inward (toward 0): choose the nearest value in the same direction as 0.
– Round upward (toward +∞): choose the larger of the two possible values.
– Round downward (toward −∞): choose the smaller of the two possible values.
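The directed modes can be illustrated on actual floating-point values by finding the two adjacent doubles that bracket a real number and picking one of them. The sketch below is only an illustration: it assumes Python 3.9 or later (for math.nextafter), IEEE double precision, and CPython's correctly rounded Fraction-to-float conversion. The helper names neighbors and round_directed are hypothetical.

```python
import math
from fractions import Fraction

def neighbors(exact: Fraction):
    """Return the adjacent doubles (lower, upper) that bracket `exact`.

    Assumes `exact` is not itself exactly representable as a double.
    """
    nearest = float(exact)
    if Fraction(nearest) > exact:
        return math.nextafter(nearest, -math.inf), nearest
    return nearest, math.nextafter(nearest, math.inf)

def round_directed(exact: Fraction, mode: str) -> float:
    lower, upper = neighbors(exact)
    if mode == "upward":      # toward +infinity: the larger candidate
        return upper
    if mode == "downward":    # toward -infinity: the smaller candidate
        return lower
    if mode == "inward":      # toward 0: the candidate nearer to zero
        return lower if exact > 0 else upper
    raise ValueError(f"unknown mode: {mode}")

tenth = Fraction(1, 10)  # 1/10 has no exact binary representation
for mode in ("inward", "upward", "downward"):
    print(mode, repr(round_directed(tenth, mode)), repr(round_directed(-tenth, mode)))
```

For a positive value, inward and downward rounding agree; for a negative value, inward and upward rounding agree.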

Example 12.1 Rounding to the nearest integer

a. Consider the round-to-nearest-even integer corresponding to a real signed-magnitude number x as a function rtnei(x). Plot this round-to-nearest-even-integer function for x in the range [-4,4].

b. Repeat part a for the function rtni(x), that is, the round-to-nearest-integer function, where the midway values are always rounded up.
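The two functions in this example can be tabulated before plotting. The following is a minimal sketch, not the book's solution; it assumes that "rounded up" means toward +∞ (for a signed-magnitude number one could also read it as away from zero, which would change the negative midpoints).

```python
import math

def rtnei(x: float) -> int:
    """Round to the nearest integer; midway values go to the even neighbor."""
    lower = math.floor(x)
    diff = x - lower
    if diff < 0.5:
        return lower
    if diff > 0.5:
        return lower + 1
    return lower if lower % 2 == 0 else lower + 1  # tie: pick the even one

def rtni(x: float) -> int:
    """Round to the nearest integer; midway values go up (assumed: toward +inf)."""
    return math.floor(x + 0.5)

# Sample [-4, 4] in steps of 0.5 so that every midway value is included.
for k in range(-8, 9):
    x = k / 2
    print(f"{x:5.1f}  rtnei={rtnei(x):3d}  rtni={rtni(x):3d}")
```

The two functions differ only at midway values whose lower neighbor is even, such as 0.5 and 2.5.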

Example 12.2 Directed rounding

a. Consider the inward-directed rounding of a real signed-magnitude number x as a function ritni(x). Plot this round-inward-to-nearest-integer function for x in the range [-4,4].

b. Repeat part a for the round-upward-to-nearest-integer function rutni(x).
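Both directed functions map directly onto standard library calls, which gives a quick way to generate the points for the plots. This is a sketch under the assumption that ritni truncates toward 0 and rutni takes the ceiling; the sample points are arbitrary.

```python
import math

def ritni(x: float) -> int:
    """Round inward: drop the fractional part (toward 0)."""
    return math.trunc(x)

def rutni(x: float) -> int:
    """Round upward: smallest integer not less than x (toward +inf)."""
    return math.ceil(x)

for x in (-3.7, -0.2, 0.2, 2.3, 4.0):
    print(f"{x:5.1f}  ritni={ritni(x):3d}  rutni={rutni(x):3d}")
```

The two functions agree for negative non-integers and differ by 1 for positive non-integers.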

Figure 12.3 Two directed round-to-nearest-integer functions for x in [– 4, 4].

12.2 Special Values and Exceptions

• Five special values in the ANSI/IEEE floating-point standard
– ±0: biased exponent = 0, significand = 0 (no hidden 1)
– ±∞: biased exponent = 255 (short) or 2047 (long), significand = 0
– NaN: biased exponent = 255 (short) or 2047 (long), significand ≠ 0
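These encodings can be checked by unpacking the bit fields of a short (32-bit) value. The sketch below uses Python's struct module; the helper names fields32 and classify are hypothetical, and "significand = 0" is taken to mean that the 23 stored fraction bits are all 0.

```python
import math
import struct

def fields32(x: float):
    """Return (sign, biased exponent, fraction) of x stored in short format."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def classify(x: float) -> str:
    _, exponent, fraction = fields32(x)
    if exponent == 0 and fraction == 0:
        return "zero"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "ordinary"  # any other finite nonzero value

for value in (0.0, -0.0, math.inf, -math.inf, math.nan, 1.5):
    print(repr(value), fields32(value), classify(value))
```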

12.3 Floating-Point Addition

Consider the addition of $\pm 2^{e_1} s_1$ and $\pm 2^{e_2} s_2$, where $e_1 > e_2$:

$(\pm 2^{e_1} s_1) + (\pm 2^{e_2} s_2) = \pm 2^{e_1} (s_1 \pm s_2 / 2^{e_1 - e_2})$
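The addition identity (align the operand with the smaller exponent, add the significands, then normalize) can be sketched with exact rational arithmetic. This is an illustration only, not a model of the hardware adder: signs are folded into the significands, no rounding is performed, and the function name fp_add is hypothetical.

```python
from fractions import Fraction

def fp_add(e1: int, s1, e2: int, s2):
    """Add 2**e1 * s1 and 2**e2 * s2; return (exponent, significand)."""
    s1, s2 = Fraction(s1), Fraction(s2)
    if e1 < e2:  # make operand 1 the one with the larger exponent
        e1, s1, e2, s2 = e2, s2, e1, s1
    # Preshift: scale the smaller operand's significand by 2**-(e1 - e2).
    s = s1 + s2 / Fraction(2) ** (e1 - e2)
    e = e1
    # Postshift: normalize so that 1 <= |s| < 2 (unless the sum is 0).
    while s != 0 and abs(s) >= 2:
        s, e = s / 2, e + 1
    while s != 0 and abs(s) < 1:
        s, e = s * 2, e - 1
    return e, s

print(fp_add(3, 1.5, 1, 1.25))   # 12 + 2.5 = 14.5 = 2**3 * 1.8125
print(fp_add(3, 1.5, 3, -1.25))  # 12 - 10 = 2 = 2**1 * 1
```

The second call shows why normalization is needed: subtracting nearly equal values leaves leading zeros in the significand.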

Figure 12.5 Simplified schematic of a floating-point adder.

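12.4 Other Floating-Point Operations

Multiplication of $\pm 2^{e_1} s_1$ and $\pm 2^{e_2} s_2$:

$(\pm 2^{e_1} s_1) \times (\pm 2^{e_2} s_2) = \pm 2^{e_1 + e_2} (s_1 \times s_2)$

Division of $\pm 2^{e_1} s_1$ by $\pm 2^{e_2} s_2$:

$(\pm 2^{e_1} s_1) / (\pm 2^{e_2} s_2) = \pm 2^{e_1 - e_2} (s_1 / s_2)$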
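Multiplication and division need no alignment step: exponents are added or subtracted, significands are multiplied or divided, and the result is normalized. The sketch below uses exact rationals and omits rounding; the names normalize, fp_mul, and fp_div are hypothetical.

```python
from fractions import Fraction

def normalize(e: int, s: Fraction):
    """Shift s until 1 <= |s| < 2, adjusting the exponent to compensate."""
    while s != 0 and abs(s) >= 2:
        s, e = s / 2, e + 1
    while s != 0 and abs(s) < 1:
        s, e = s * 2, e - 1
    return e, s

def fp_mul(e1: int, s1, e2: int, s2):
    return normalize(e1 + e2, Fraction(s1) * Fraction(s2))

def fp_div(e1: int, s1, e2: int, s2):
    return normalize(e1 - e2, Fraction(s1) / Fraction(s2))

print(fp_mul(3, 1.5, 1, 1.5))  # 12 * 3 = 36 = 2**5 * 1.125
print(fp_div(3, 1.0, 1, 1.5))  # 8 / 3 = 2**1 * (4/3)
```

With both input significands in [1, 2), a product lies in [1, 4) and a quotient in (1/2, 2), so at most one normalizing shift is needed in either case.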

Figure 12.6 Simplified schematic of a floating-point multiply/divide unit.

12.5 Floating-Point Instructions

Figure 12.7 The common floating-point instruction format for MiniMIPS and components for arithmetic instructions. The extension (ex) field distinguishes single (* = s) from double (* = d) operands.

10 floating-point arithmetic instructions (5 different operations: add, sub, multiply, divide, negate)

add.s $f0,$f8,$f10 # set $f0 to ($f8)+($f10)

add.d $f0,$f8,$f10 # set ($f0,$f1) to ($f8,$f9)+($f10,$f11)

Single operands can be in any of the floating-point registers. Double operands must be specified in even-numbered registers.

Figure 12.8 Floating-point instructions for format conversion in MiniMIPS.

6 format conversion instructions: integer to single/double, single to double, double to single, and single/double to integer

cvt.s.w $f0,$f8 # set $f0 to single (integer $f8)

cvt.d.w $f0,$f8 # set $f0 to double (integer $f8)

cvt.d.s $f0,$f8 # set $f0 to double ($f8)

cvt.s.d $f0,$f8 # set $f0 to single ($f8,$f9)

cvt.w.s $f0,$f8 # set $f0 to integer ($f8)

cvt.w.d $f0,$f8 # set $f0 to integer ($f8,$f9)

Figure 12.9 Instructions for floating-point data movement in MiniMIPS.

6 data transfer instructions: load/store word to/from coprocessor 1, move single/double from one FP register to another, move (copy) between FP registers and CPU general registers.

lwc1 $f8,40($s3) # load mem[40+($s3)] into $f8

swc1 $f8,A($s3) # store ($f8) into mem[A+($s3)]

mov.s $f0,$f8 # load $f0 with ($f8)

mov.d $f0,$f8 # load $f0,$f1 with ($f8,$f9)

mfc1 $t0,$f12 # load $t0 with ($f12)

mtc1 $f8,$t4 # load $f8 with ($t4)

Figure 12.10 Floating-point branch and comparison instructions in MiniMIPS.

2 branch and 6 comparison instructions. The FP unit has a flag that is set to T or F based on 6 comparisons (equal, less than, or less or equal for single/double data type)

bc1t L # branch on FP flag true

bc1f L # branch on FP flag false

c.eq.* $f0,$f8 # if ($f0)=($f8), set flag to true

c.lt.* $f0,$f8 # if ($f0)<($f8), set flag to true

c.le.* $f0,$f8 # if ($f0)≤($f8), set flag to true

Table 12.1 The 30 MiniMIPS floating-point instructions. Because the op field contains 17 for all but two of the instructions (49 for lwc1 and 50 for swc1), it is not shown.

12.6 Result Precision and Errors

• FP arithmetic can be quite dangerous and must be used with proper care, because results of FP computations are inexact.

• Why?
– Many real numbers do not have an exact binary representation within a finite word format. This is referred to as representation error.
– Even for values that are exactly representable, FP arithmetic can produce inexact results. For example, the product of two short FP numbers has a 48-bit significand that must be rounded to 23 bits (plus the hidden 1). This is called computation error.
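Both kinds of error can be measured exactly with rational arithmetic. The sketch below is illustrative: to_single is a hypothetical helper that rounds a Python double to short format via the struct module, and the operands are chosen so that their exact product needs more significand bits than the short format provides.

```python
import struct
from fractions import Fraction

# Representation error: the double nearest to 0.1 is not exactly 1/10.
representation_error = Fraction(0.1) - Fraction(1, 10)
print(representation_error != 0)  # True
print(0.1 + 0.2 == 0.3)           # False

def to_single(x: float) -> float:
    """Round a double to the nearest short (32-bit) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Computation error: both operands are exactly representable in short format,
# but their exact product 1 + 2**-22 + 2**-46 is not.
a = to_single(1 + 2**-23)
b = to_single(1 + 2**-23)
exact_product = Fraction(a) * Fraction(b)
rounded_product = to_single(a * b)  # a * b is exact in double, then rounded
computation_error = exact_product - Fraction(rounded_product)
print(computation_error == Fraction(1, 2**46))  # True
```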

Example 12.4

The associative law of addition does not hold in general in FP arithmetic. For example, with binary significands:

$a = -2^{5} \times (1.10101011)$

$b = 2^{5} \times (1.10101110)$

$c = -2^{-2} \times (1.01100101)$

Is $(a + b) + c = a + (b + c)$?
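The question can be answered by simulating the arithmetic. The sketch below assumes each sum is rounded to a significand with 8 fraction bits (matching the operands) using round-to-nearest-even; the helper names round_significand and fp_add are hypothetical, and exact rationals stand in for the hardware.

```python
from fractions import Fraction

FRACTION_BITS = 8  # bits kept after the binary point (assumed from the operands)

def round_significand(x: Fraction, bits: int = FRACTION_BITS) -> Fraction:
    """Round x to a 1.f significand with `bits` fraction bits, ties to even."""
    if x == 0:
        return x
    e = abs(x.numerator).bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > abs(x):
        e -= 1                       # now 2**e <= |x| < 2**(e + 1)
    ulp = Fraction(2) ** (e - bits)  # weight of the last kept bit
    return round(x / ulp) * ulp      # round() on a Fraction rounds half to even

def fp_add(x: Fraction, y: Fraction) -> Fraction:
    return round_significand(x + y)

a = -Fraction(0b110101011, 2**8) * 2**5
b = Fraction(0b110101110, 2**8) * 2**5
c = -Fraction(0b101100101, 2**8) / 2**2

left = fp_add(fp_add(a, b), c)   # (a + b) + c
right = fp_add(a, fp_add(b, c))  # a + (b + c)
print(left, right)               # 27/1024 0
```

Under these assumptions (a + b) + c equals 2^-6 × (1.1011), while a + (b + c) is 0: adding c to b first shifts most of c's bits out of range, and rounding turns b + c into exactly -a.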

Figure 12.11 Algebraically equivalent computations may yield different results with floating-point arithmetic.

• Using guard digits to avoid excessive error. For example, in a 10-digit calculator, 1/3 is represented as 0.333 333 333 3; multiplying by 3 yields 0.999 999 999 9, not 1.

However, in a calculator with two guard digits, 1/3 is represented as 0.333 333 333 333 but still displayed as 0.333 333 333 3; multiplying by 3 yields 0.999 999 999 999, which is displayed as 1 after rounding to 10 digits.
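The calculator behavior can be reproduced with Python's decimal module by computing at one precision and displaying at another. This is a sketch of the idea, not a model of any particular calculator; the function name and its parameters are hypothetical.

```python
from decimal import Context, Decimal

def one_third_times_three(working_digits: int, display_digits: int = 10) -> Decimal:
    """Compute (1/3) * 3 with `working_digits`, then round for display."""
    work = Context(prec=working_digits)
    display = Context(prec=display_digits)
    third = work.divide(Decimal(1), Decimal(3))
    product = work.multiply(third, Decimal(3))
    return display.plus(product)  # plus() rounds to the context precision

print(one_third_times_three(10))  # no guard digits: 0.9999999999
print(one_third_times_three(12))  # two guard digits: 1.000000000
```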

Figure 12.12 Function evaluation by table lookup and linear interpolation.
