
Floating Point Arithmetic

Hardware vs. Software

• Can build the ALU (Arithmetic Logic Unit) to perform Floating Point Arithmetic
  – Faster
  – More expensive
    • Less of an issue as technology improves
• Can simulate the operations of Floating Point with multiple integer operations
  – Done by the compiler
  – Slower
  – Cheaper hardware

IEEE Floating Point Layout

• Single Precision: 32 bits
  – Left bit is a sign bit
  – Next 8 are exponent
  – Next 23 are mantissa
• Double Precision: 64 bits
  – Left bit is a sign bit
  – Next 11 are exponent
  – Next 52 are mantissa
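The single precision layout above can be inspected directly. This is a minimal sketch using Python's standard `struct` module; the function name `decode_single` is made up for illustration. It assumes the usual IEEE 754 conventions that the stored exponent carries a bias of 127 and that the mantissa field holds only the fraction after a hidden leading 1.

```python
import struct

def decode_single(x):
    """Split a value, stored as IEEE single precision, into its three fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # leftmost bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits (stored with a bias of 127)
    mantissa = bits & 0x7FFFFF        # last 23 bits (fraction after the hidden 1)
    return sign, exponent, mantissa

# -6.5 is -1.101 (binary) times 2^2
print(decode_single(-6.5))   # (1, 129, 5242880)
```

The stored exponent 129 is the real exponent 2 plus 127, and 5242880 is the bit pattern 101 followed by twenty zeros.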

Floating Point Addition

• Performed in several steps
  – Line up the binary points (the base-2 counterpart of decimal points)
    • Now the exponents are the same
  – Add the mantissas
  – Exponent of the result is the same as the exponents of the operands
  – Normalize if necessary
    • Place in proper scientific notation

Equalizing Exponents

• In math, we can shift the value with the larger exponent left while decreasing the exponent until the exponents are equal
  – But the hardware has no place to shift the value left into
  – The position of the binary point is implied, so there are no spare bits to its left
• We must shift the value with the smaller exponent right and increase its exponent
  – The rightmost bits are lost
    • Insignificant low-order bits; losing them won’t affect the answer much
    • Some hardware has extra bits just for computation, not for the answer
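The right shift described above can be sketched on a toy integer mantissa. The helper `align` and its 4-bit example are hypothetical and only meant to show which bits are lost; real hardware works on the full mantissa and may keep extra guard bits.

```python
def align(sig_small, exp_small, exp_large):
    """Shift the mantissa with the smaller exponent right until the exponents match.

    Returns the shifted mantissa, the shared exponent, and the bits that fell off.
    """
    shift = exp_large - exp_small
    lost = sig_small & ((1 << shift) - 1)   # low-order bits that are shifted out
    return sig_small >> shift, exp_large, lost

# Toy 4-bit mantissa 1011 with exponent 0, aligned to an operand with exponent 2
shifted, exponent, lost = align(0b1011, 0, 2)
print(bin(shifted), exponent, bin(lost))   # 0b10 2 0b11
```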

Adding

• Now the bits for the mantissa can be added.

• Just like adding integers (but with fewer than 32 bits)

• The exponent of the answer is the same exponent as the operands.

Normalizing

• In scientific notation, the mantissa of each operand is at least 1 and less than 2.
• After getting the exponents equal, the shifted mantissa is between 0 and 2.
• So, the result is between 1 and 4
  – Unless the operands have opposite signs; then the magnitude of the result can be anywhere from 0 up to (but below) 2
• We may need to shift the result left to get a 1 bit into the leftmost bit of the answer
• We may need to shift the result right to get the result in the proper range
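The three steps (equalize exponents, add, normalize) can be put together in a small software sketch. It is an illustration under stated assumptions, not how any particular ALU is built: the names `unpack`, `pack` and `fp_add` are made up, inputs must be normal, nonzero single precision values, results are truncated rather than rounded, and zero inputs, infinities, NaN and exponent overflow are ignored.

```python
import struct

def unpack(x):
    """Return (sign, biased exponent, 24-bit mantissa with the hidden 1 restored)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

def pack(sign, exp, sig):
    """Rebuild a single precision value; the hidden 1 is dropped again."""
    bits = (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_add(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    if ea < eb:                       # make a the operand with the larger exponent
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    mb >>= ea - eb                    # step 1: shift the smaller operand right
    total = (-ma if sa else ma) + (-mb if sb else mb)   # step 2: add mantissas
    if total == 0:
        return 0.0
    sign, mag, exp = (1 if total < 0 else 0), abs(total), ea
    while mag >= (1 << 24):           # step 3a: result of 2 or more, shift right
        mag >>= 1
        exp += 1
    while mag < (1 << 23):            # step 3b: result below 1, shift left
        mag <<= 1
        exp -= 1
    return pack(sign, exp, mag)

print(fp_add(1.5, 2.25))         # 3.75
print(fp_add(1.0, -0.75))        # 0.25 (needed two left shifts)
print(fp_add(16777216.0, 1.0))   # 16777216.0 (the 1.0 was shifted out entirely)
```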

Correct Results

• What happens when we add two values of very different magnitude?
  – We must shift one of the values many places
  – The rightmost bits “fall off” the end
  – The answer will not be “exact”, but very close.
• When would this happen?
  – What if we are summing many, many values?
  – Sum = Sum + A[I]
  – Sum can get so big compared to A[I] that Sum does not change.
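The effect is easy to reproduce. The sketch below uses Python floats, which are IEEE double precision (52 mantissa bits), so the running sum stops changing at 2^53. The use of `math.fsum` as a more careful summation is an addition for comparison, not something the slides prescribe.

```python
import math

big = 2.0 ** 53            # neighbouring doubles are 2 apart at this size
print(big + 1.0 == big)    # True: the 1.0 is lost when it is shifted right

total = big
for _ in range(1000):
    total += 1.0           # Sum = Sum + A[I], but Sum never changes
print(total == big)        # True

# fsum tracks the lost low-order bits and returns the correctly rounded sum
exact = math.fsum([big] + [1.0] * 1000)
print(exact == big + 1000.0)   # True
```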

Multiplication

• Actually a little easier
  – Do unsigned multiplication with the mantissas
  – Add the exponents
  – Normalize the result
  – Set the sign bit of the result

Multiplication Details

• We have already done unsigned multiplication.

• To add the exponents we need to look at the notation.
  – The exponents use excess 127 notation (single precision)
  – fpe1 = reale1 + 127
  – fpe2 = reale2 + 127
  – Result = fpe1 + fpe2 = reale1 + reale2 + 127 + 127
  – Need to subtract 127 from the result to get the appropriate value
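A tiny worked example of the exponent arithmetic, assuming single precision (excess 127). The names `BIAS` and `add_biased_exponents` are illustrative only.

```python
BIAS = 127

def add_biased_exponents(fpe1, fpe2):
    """Add two excess-127 exponents and remove the bias that was counted twice."""
    return fpe1 + fpe2 - BIAS

# 2^3 * 2^-1 = 2^2
fpe1 = 3 + BIAS      # stored as 130
fpe2 = -1 + BIAS     # stored as 126
print(add_biased_exponents(fpe1, fpe2))   # 129, which is 2 + 127
```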

Sign

• The sign of the result depends on the sign of the operands

• If both operands have the same sign, the result is positive, otherwise the result is negative.

S1  S2  R
0   0   0
0   1   1
1   0   1
1   1   0

• This is the XOR function
• Of course, must normalize the result
• May have many more shifts
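Putting the multiplication steps together: multiply the mantissas as unsigned integers, add the exponents and remove one bias, normalize, and XOR the signs. This is a sketch under assumptions: `unpack`, `pack` and `fp_mul` are made-up names, inputs must be normal, nonzero single precision values, the result is truncated rather than rounded, and exponent overflow or underflow is not checked.

```python
import struct

BIAS = 127

def unpack(x):
    """Return (sign, biased exponent, 24-bit mantissa with the hidden 1 restored)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

def pack(sign, exp, sig):
    """Rebuild a single precision value; the hidden 1 is dropped again."""
    bits = (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_mul(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    sign = sa ^ sb                 # same signs give 0, different signs give 1
    exp = ea + eb - BIAS           # the bias was added twice, remove one copy
    product = ma * mb              # up to 48 bits, represents a value in [1, 4)
    if product >= (1 << 47):       # value is 2 or more: shift right once
        product >>= 1
        exp += 1
    sig = product >> 23            # keep the top 24 bits (truncation)
    return pack(sign, exp, sig)

print(fp_mul(1.5, 2.5))     # 3.75
print(fp_mul(-3.0, 0.5))    # -1.5
print(fp_mul(1.5, 1.5))     # 2.25 (needed the normalizing shift)
```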

True Division

• Do unsigned division on the mantissas
  – Discussed with integers.
• Subtract the exponents
  – Now need to add 127 to get the correct representation of the value
• Normalize the result
  – Same as previous methods
• Set the sign
  – Same as with multiplication
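The division steps can be sketched the same way. Again this is only an illustration with made-up names (`unpack`, `pack`, `fp_div`), restricted to normal, nonzero single precision inputs, truncating instead of rounding, and ignoring division by zero and exponent range checks.

```python
import struct

BIAS = 127

def unpack(x):
    """Return (sign, biased exponent, 24-bit mantissa with the hidden 1 restored)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

def pack(sign, exp, sig):
    """Rebuild a single precision value; the hidden 1 is dropped again."""
    bits = (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_div(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    sign = sa ^ sb                 # same rule as multiplication
    exp = ea - eb + BIAS           # the biases cancelled, add one back
    q = (ma << 24) // mb           # unsigned division with one extra quotient bit
    if q >= (1 << 24):             # mantissa ratio in [1, 2): drop the extra bit
        q >>= 1
    else:                          # ratio in (0.5, 1): keep the bit, lower exponent
        exp -= 1
    return pack(sign, exp, q)

print(fp_div(7.5, 2.5))      # 3.0
print(fp_div(-3.0, 8.0))     # -0.375
print(fp_div(1.125, 1.5))    # 0.75 (needed the left normalization)
```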

Division by Reciprocal

• Calculate a/b as a * (1/b)
• This is useful only if we can compute (1/b) without using division.
• Use a Newton-Raphson technique (discussed in CSCI 381)
  – Repeat
    • r = r * (2 - r*b)
  – Until r does not change
  – r starts with a first guess at the reciprocal and gets closer with each iteration
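A minimal sketch of the iteration, using only multiplication and subtraction. The function names and the `max_iterations` safety cap are assumptions added here. For positive b the first guess has to lie strictly between 0 and 2/b, otherwise the iteration does not converge; how a good first guess is obtained is not covered by this sketch.

```python
def reciprocal(b, first_guess, max_iterations=50):
    """Approximate 1/b by repeating r = r * (2 - r*b) until r stops changing."""
    r = first_guess
    for _ in range(max_iterations):      # cap added so the loop always terminates
        next_r = r * (2.0 - r * b)
        if next_r == r:                  # r does not change any more
            break
        r = next_r
    return r

def divide(a, b, first_guess):
    """Compute a/b as a * (1/b) without using the division operator."""
    return a * reciprocal(b, first_guess)

print(reciprocal(4.0, 0.1))    # close to 0.25
print(divide(3.0, 4.0, 0.1))   # close to 0.75
```

The error roughly squares on every pass, so only a handful of iterations are needed once the guess is reasonable.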

Errors

• Floating point numbers are not exact

• Do NOT compare floating point numbers for equality. For example, adding 0.1 ten times does not give exactly 1.
  – Instead of using “if (a == b)” when a and b are floating point, use “if (abs(a-b) < .0001)” or some other reasonable measure of “close enough”
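A short demonstration of the comparison advice, using repeated addition of 0.1 in Python (IEEE double precision). The tolerance 0.0001 comes from the slide; `math.isclose` is shown as one existing standard-library alternative that uses a relative tolerance, which is a choice made here and not part of the original advice.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1                  # 0.1 has no exact binary representation

print(total == 1.0)               # False
print(abs(total - 1.0) < 0.0001)  # True: "close enough"
print(math.isclose(total, 1.0, rel_tol=1e-9))   # True
```

A fixed tolerance such as 0.0001 only makes sense when the magnitude of the values is known; a relative tolerance scales with the operands.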

Rounding in Base 2

• Round to the nearest
  – Ties go to the value whose least significant bit is 0 (round half to even)
• Round towards 0
  – Truncation
• Round towards positive infinity
  – Round up (careful with negative values)
• Round towards negative infinity
  – Round down
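The four modes can be compared on a small integer model: drop the k low-order bits of a signed value and decide whether to bump the kept part. The function `round_shift` and its mode names are hypothetical, and it assumes k is at least 1.

```python
def round_shift(value, k, mode):
    """Drop the k low-order bits of the integer `value` using the given rounding mode."""
    sign = -1 if value < 0 else 1
    mag = abs(value)
    kept = mag >> k                    # truncated magnitude
    dropped = mag & ((1 << k) - 1)     # the bits being discarded
    half = 1 << (k - 1)                # exactly halfway between two results
    if mode == "nearest_even":
        # round up when past halfway, or on a tie when the kept value is odd
        if dropped > half or (dropped == half and kept & 1):
            kept += 1
    elif mode == "toward_zero":
        pass                           # plain truncation
    elif mode == "toward_positive":
        if sign > 0 and dropped:
            kept += 1                  # only positive values grow in magnitude
    elif mode == "toward_negative":
        if sign < 0 and dropped:
            kept += 1                  # only negative values grow in magnitude
    else:
        raise ValueError(mode)
    return sign * kept

# 1011 (binary) with two bits dropped is 2.75; 1010 is the tie 2.5
for mode in ("nearest_even", "toward_zero", "toward_positive", "toward_negative"):
    print(mode, round_shift(0b1011, 2, mode), round_shift(-0b1011, 2, mode))
print(round_shift(0b1010, 2, "nearest_even"))   # 2 (tie goes to the even value)
print(round_shift(0b1110, 2, "nearest_even"))   # 4 (3.5 also goes to even)
```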

Overflow and Underflow

• Overflow for integers is when the result is too big to be held with the number of bits allocated.
• The same is true for Floating Point. However, this is determined more by the size of the exponent field than the size of the mantissa field.
• Underflow is when a value becomes so small that it becomes 0.
  – Again, this is related to the exponent field but with negative exponents
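Both effects can be observed with Python floats (IEEE double precision, 11 exponent bits). In this sketch overflow shows up as the special value infinity and underflow as 0; the variable names are illustrative.

```python
import math
import sys

largest = sys.float_info.max          # biggest finite double
overflowed = largest * 2.0            # exponent no longer fits
print(overflowed)                     # inf

smallest = math.ldexp(1.0, -1074)     # 2^-1074, the smallest positive double
underflowed = smallest / 2.0          # too small for the exponent range
print(underflowed)                    # 0.0
```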
