chapter 7 floating-point arithmetic

Chapter 7

Floating-Point Arithmetic

Representation of Floating-Point Numbers

A simple representation of a floating-point (or real) number (N) uses a fraction (F), base (B), and exponent (E), where $N = F \times B^E$.

The base can be any integer larger than 1 and can be implied or explicit.

The fraction and the exponent can be represented in many formats. For example, they can be represented in 2’s complement format, sign-magnitude form, or another number representation.

There are a variety of floating-point formats.

Representation of Floating-Point Numbers: 2’s Complement 1

The base for the exponent is 2. Hence, the value of the number is $N = F \times 2^E$.

In a typical floating-point number system, F is 16 to 64 bits long and E is 8 to 15 bits long.

The sign bit is 0 for positive numbers and 1 for negative numbers.

Example: represent decimal 2.5 in 8-bit 2’s complement floating-point format:

2.5 = 0010.1000 = $1.010 \times 2^1$ (normalized representation) = $0.101 \times 2^2$ (4-bit 2’s complement fraction)

Thus, F = 0.101, E = 0010, $N = 5/8 \times 2^2$

If the number is -2.5, the same exponent can be used, but the fraction must have a negative sign. The 2’s complement representation for the fraction is 1.011.

Thus, F = 1.011, E = 0010, $N = -5/8 \times 2^2$
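The two encodings above can be checked with a short Python sketch. It assumes F is written as a bit string whose first bit is the sign bit (so 0.101 becomes "0101"); the helper names `twos_complement` and `decode` are hypothetical and not part of the original material.

```python
from fractions import Fraction

def twos_complement(bits):
    """Interpret a bit string as a 2's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":              # sign bit set: subtract 2^n
        value -= 1 << len(bits)
    return value

def decode(f_bits, e_bits):
    """Return N = F x 2^E for a sign-plus-fraction F and an integer E."""
    # F has one sign bit followed by len(f_bits) - 1 fraction bits.
    f = Fraction(twos_complement(f_bits), 1 << (len(f_bits) - 1))
    e = twos_complement(e_bits)
    return f * Fraction(2) ** e

print(decode("0101", "0010"))   # F = 0.101, E = 0010 -> 5/2
print(decode("1011", "0010"))   # F = 1.011, E = 0010 -> -5/2
```

Exact rationals are used here only so that the fractions 5/8 and -5/8 print without rounding.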

Normalizing: In order to utilize all the bits in F and have the maximum number of significant figures, F should be normalized so that its magnitude is as large as possible.

If F is not normalized, normalize F by shifting it left until the sign bit and the next bit are different.

Shifting F left is equivalent to multiplying by 2, so for every shift, decrement E by 1 to keep N the same.

After normalization, the magnitude of F will be as large as possible, since any further shifting would change the sign bit.

Examples:

Unnormalized: F = 0.0101, E = 0011, $N = 5/16 \times 2^3 = 5/2$

Normalized: F = 0.101, E = 0010, $N = 5/8 \times 2^2 = 5/2$

Unnormalized: F = 1.11011, E = 1100, $N = -5/32 \times 2^{-4} = -5 \times 2^{-9}$

Shift F left: F = 1.1011, E = 1011, $N = -5/16 \times 2^{-5} = -5 \times 2^{-9}$

Normalized: F = 1.011, E = 1010, $N = -5/8 \times 2^{-6} = -5 \times 2^{-9}$
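The shift-left rule can be sketched in Python as follows. This is a minimal illustration, assuming F is held as a fixed-width bit string with the sign bit first and E as an ordinary integer; the function name `normalize` is hypothetical, and exponent underflow is not checked.

```python
def normalize(f_bits, e):
    """Shift F left until its sign bit and the next bit differ.

    f_bits is a 2's complement bit string whose first character is the
    sign bit; e is the exponent as a Python int. Each left shift doubles
    F, so E is decremented to keep N unchanged.
    """
    if "1" not in f_bits:            # zero cannot be normalized
        return f_bits, e
    while f_bits[0] == f_bits[1]:
        f_bits = f_bits[1:] + "0"    # shift left by one, filling with 0
        e -= 1
    return f_bits, e

print(normalize("00101", 3))     # F = 0.0101, E = 3  -> ('01010', 2)
print(normalize("111011", -4))   # F = 1.11011, E = -4 -> ('101100', -6)
```

Because the width is fixed, the results keep their trailing zeros: "01010" is 0.1010 and "101100" is 1.01100, the same values as 0.101 and 1.011 in the examples.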

Zero cannot be normalized, so F = 0.000 when N = 0.

Any exponent could then be used; however, it is best to have a uniform representation of 0.

In this format, associate the most negative exponent (the one with the largest magnitude) with the fraction 0.

In a 4-bit 2’s complement integer number system, the most negative number is 1000, which represents -8. Thus when F and E are 4 bits, 0 is represented by: F = 0.000, E = 1000, $N = 0.000 \times 2^{-8}$

Some floating-point systems use a biased exponent whereby E = 0 is associated with F = 0.

Representation of Floating-Point Numbers: IEEE 754 Standard 1

IEEE 754 is a floating-point standard established by the IEEE in 1985.

It contains two representations for floating-point numbers:

Single precision: uses 32 bits.

Double precision: uses 64 bits.

Designers of IEEE 754 desired a format that was easy to sort and hence adopted a sign-magnitude system for the fractional part and a biased notation for the exponent.

Representation of Floating-Point Numbers: IEEE 754 2

The IEEE 754 floating-point formats need three subfields: sign, fraction, and exponent.

The fractional part of the number is represented using a sign-magnitude representation in the IEEE floating-point formats.

The sign is 0 for positive numbers and 1 for negative numbers.

Form is: $N = (-1)^S \times (1 + F) \times 2^E$

S is the sign bit, F is the fractional part, and E is the exponent.

Base of the exponent is 2 and implied (not stored).

Magnitude (significand) of the fraction is 1 + F. Often the terms significand and fraction are used interchangeably.


IEEE Single Precision Floating-Point Format (32 bits):

Sign     Exponent   Fraction
1 bit    8 bits     23 bits

IEEE Double Precision Floating-Point Format (64 bits):

Sign     Exponent   Fraction
1 bit    11 bits    52 bits

The exponent in the IEEE floating-point formats uses a biased notation:

The exponent field contains the actual exponent plus 127 for single precision or plus 1023 for double precision.

This converts all single-precision exponents from -126 to +127 into unsigned numbers from 1 to 254, and all double-precision exponents from -1022 to +1023 into unsigned numbers from 1 to 2046.
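The bias arithmetic can be illustrated with a small Python sketch. The names `BIAS`, `WIDTH`, `biased`, and `field_bits` are hypothetical helpers introduced only for this illustration.

```python
# Bias and exponent-field width for each format, as stated above.
BIAS = {"single": 127, "double": 1023}
WIDTH = {"single": 8, "double": 11}

def biased(actual_exponent, precision="single"):
    """Return the unsigned value stored in the exponent field."""
    return actual_exponent + BIAS[precision]

def field_bits(actual_exponent, precision="single"):
    """Return the exponent field as a zero-padded bit string."""
    return format(biased(actual_exponent, precision), "0{}b".format(WIDTH[precision]))

print(biased(-126), biased(127))                        # 1 254
print(biased(-1022, "double"), biased(1023, "double"))  # 1 2046
print(field_bits(3))                                    # 10000010
```

The field values 0 and 255 (or 0 and 2047 for double precision) fall outside these ranges, which leaves them free for other uses.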


Overflow: positive exponent is too large to be represented in the exponent field.

Underflow: negative exponent is too large in magnitude to be represented in the exponent field.


Representation of Floating-Point Numbers: IEEE 754-Example 1

13.45 in IEEE single precision floating-point format:

Converting to binary representation (.45 is a recurring binary fraction): 13.45 = 1101.01 1100 1100 1100 1100 …

Normalized: 13.45 = 1.10101 1100 1100 … $\times 2^3$

As the number is positive, the sign bit is 0.

Exponent in biased notation: 127 + 3 = 130, or 10000010 in binary.

Fraction is 1.10101 1100 1100 … Omitting the leading 1, the 23 bits for the fractional part are: 10101 1100 1100 1100 1100 11

Thus, the 32 bits are: 0 10000010 10101 1100 1100 1100 1100 11

Summarized as:

Sign   Exponent   Fraction
0      10000010   10101 1100 1100 1100 1100 11

In hex format, the 32 bits are: 4157 3333

The number -13.45 can be represented by changing only the sign bit (i.e., the first bit must be 1 instead of 0).

Hence, the hex number C157 3333 represents -13.45 in IEEE 754 single precision format.
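The hand conversion can be cross-checked with Python's standard `struct` module, which packs a value as an IEEE 754 single when given the `f` format code. The helper names `single_hex` and `single_fields` are hypothetical; this is a quick check, not part of the original procedure.

```python
import struct

def single_hex(x):
    """Return the 32-bit IEEE 754 single encoding of x as 8 hex digits."""
    return struct.pack(">f", x).hex().upper()

def single_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x as a single."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]

print(single_hex(13.45))      # 41573333
print(single_hex(-13.45))     # C1573333
print(single_fields(13.45))   # ('0', '10000010', '10101110011001100110011')
```

`struct.pack` rounds to the nearest single-precision value, which here agrees with simply keeping the first 23 fraction bits because the first discarded bit is 0.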


13.45 in IEEE double precision floating-point format:

Converting to binary representation: 13.45 = 1101.01 1100 1100 1100 …

Normalized: 13.45 = 1.10101 1100 1100 … $\times 2^3$

As the number is positive, the sign bit is 0.

Exponent in biased notation: 1023 + 3 = 1026, or 10000000010 in binary.

Fraction is 1.10101 1100 1100 … Omitting the leading 1, the 52 bits of the fractional part are: 10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

Thus the 64 bits are: 0 10000000010 10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

Summarized as:

Sign   Exponent      Fraction
0      10000000010   10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

In hex format, the 64 bits are: 402A E666 6666 6666

The number -13.45 can be represented by changing only the sign bit (i.e., the first bit must be 1 instead of 0).

Hence, the hex number C02A E666 6666 6666 represents -13.45 in IEEE 754 double precision format.
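Going the other way, the 64-bit pattern can be decoded by splitting it into its three fields and evaluating the form given earlier, with the bias of 1023 removed from the stored exponent. The sketch below assumes a normalized number (it does not handle the special exponent values), and `decode_double` is a hypothetical name.

```python
import struct
from fractions import Fraction

def decode_double(hex_word):
    """Evaluate (-1)^S x (1 + F) x 2^(E - 1023) for a normalized double."""
    word = int(hex_word.replace(" ", ""), 16)
    sign = word >> 63                                   # 1 bit
    biased_exp = (word >> 52) & 0x7FF                   # 11 bits
    fraction = Fraction(word & ((1 << 52) - 1), 1 << 52)  # 52 bits
    return (-1) ** sign * (1 + fraction) * Fraction(2) ** (biased_exp - 1023)

print(float(decode_double("402A E666 6666 6666")))    # 13.45
print(float(decode_double("C02A E666 6666 6666")))    # -13.45
print(struct.pack(">d", 13.45).hex().upper())         # 402AE66666666666
```

The decoded rational is the double nearest to 13.45 rather than 13.45 itself, since .45 is a recurring binary fraction.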


Special Cases in IEEE 754 Standard

The smallest and largest exponent values are used to denote these special cases.

Single Precision        Double Precision        Object Represented
Exponent   Fraction     Exponent   Fraction
0          0            0          0            0
0          Nonzero      0          Nonzero      ± denormalized number
255        0            2047       0            ± infinity
255        Nonzero      2047       Nonzero      NaN (not a number)
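The table maps directly onto a small decision function. The following Python sketch takes the stored (biased) exponent field and the fraction field as integers; the name `classify` and the returned labels are hypothetical.

```python
def classify(exponent, fraction, precision="single"):
    """Classify an IEEE 754 value from its exponent and fraction fields."""
    max_exp = 255 if precision == "single" else 2047
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == max_exp:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"

print(classify(0, 0))                 # zero
print(classify(255, 0))               # infinity
print(classify(2047, 1, "double"))    # NaN
print(classify(130, 0x573333))        # normalized
```

The sign bit is not needed for the classification; it only selects between the + and - variants of zero, denormalized numbers, and infinity.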

Guard, Round, and Sticky Bits

When the number of bits available is smaller than the number of bits required to represent a number, rounding is employed.

It is desirable to round to the nearest value.

Guard and round bits: the two extra bits that the IEEE standard requires in intermediate representations in order to facilitate better rounding.

Sticky bit: the third intermediate bit sometimes used in rounding. It is set whenever there are non-zero bits to the right of the round bit.


Round, Truncate, and Unbiased

The IEEE standard has 4 rounding modes:

Round up: round toward positive infinity; round up to the next higher number.

Round down: round toward negative infinity; round down to the nearest smaller number.

Truncate: round toward zero. Ignore bits beyond the allowable number of bits. Same as truncation in sign magnitude.

Unbiased: round to nearest. If the number falls halfway, round up half the time and round down half the time. In order to achieve rounding up half the time, add 1 if the lowest bit retained is 1, and truncate if it is 0.
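The unbiased rule can be sketched in Python on a non-negative integer magnitude whose low-order bits are to be discarded. The function name `round_unbiased` and the integer representation are assumptions made for illustration; `extra_bits` must be at least 1.

```python
def round_unbiased(value, extra_bits):
    """Drop extra_bits low-order bits, rounding to nearest.

    On an exact tie (the discarded bits are 100...0), add 1 only if the
    lowest retained bit is 1; otherwise truncate.
    """
    kept = value >> extra_bits
    remainder = value & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    return kept

print(bin(round_unbiased(0b101110, 2)))   # tie, lowest kept bit 1 -> 0b1100
print(bin(round_unbiased(0b101010, 2)))   # tie, lowest kept bit 0 -> 0b1010
print(bin(round_unbiased(0b101011, 2)))   # above halfway         -> 0b1011
print(bin(round_unbiased(0b101001, 2)))   # below halfway         -> 0b1010
```

After a tie the lowest retained bit is always 0, which is why this mode is also called round to nearest even. Truncation corresponds to returning `value >> extra_bits` unchanged.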


Floating-Point Multiplication 1

Given two floating-point numbers, $F_1 \times 2^{E_1}$ and $F_2 \times 2^{E_2}$, the product is:

$(F_1 \times 2^{E_1}) \times (F_2 \times 2^{E_2}) = (F_1 \times F_2) \times 2^{(E_1 + E_2)} = F \times 2^E$

The fraction part of the product is the product of the fractions, and the exponent part of the product is the sum of the exponents.

A floating-point multiplier consists of two major components:
1. A fraction multiplier
2. An exponent adder


Floating-Point Multiplication 2

Procedure for performing floating-point multiplication:
1. Add the two exponents.
2. Multiply the two fractions (significands).
3. If the product is 0, adjust the representation to the proper representation for 0.
4. a. If the product fraction is too big, normalize by shifting it right and incrementing the exponent.
   b. If the product fraction is too small, normalize by shifting left and decrementing the exponent.
5. If an exponent underflow or overflow occurs, generate an exception or error indicator.
6. Round to the appropriate number of bits. If rounding resulted in loss of normalization, go to step 4 again.
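Steps 1 through 5 can be sketched in Python for the 2's complement format described earlier. This is an illustration under stated assumptions, not the hardware algorithm: fractions are exact `Fraction` values in the range -1 ≤ F < 1, so step 6 (rounding) is omitted, and the exponent range -8 to 7 assumes a 4-bit 2's complement exponent. The names `fp_multiply`, `is_normalized`, `E_MIN`, and `E_MAX` are hypothetical.

```python
from fractions import Fraction

E_MIN, E_MAX = -8, 7    # assumed 4-bit 2's complement exponent range

def is_normalized(f):
    """True if the sign bit and the next bit of a 2's complement F differ."""
    return Fraction(1, 2) <= f < 1 or -1 <= f < Fraction(-1, 2)

def fp_multiply(f1, e1, f2, e2):
    e = e1 + e2                       # step 1: add the exponents
    f = f1 * f2                       # step 2: multiply the fractions
    if f == 0:                        # step 3: uniform representation of 0
        return Fraction(0), E_MIN
    if f >= 1:                        # step 4a: too big, only (-1) x (-1) = +1
        f, e = f / 2, e + 1
    while not is_normalized(f):       # step 4b: too small, shift left
        f, e = f * 2, e - 1
    if not E_MIN <= e <= E_MAX:       # step 5: exponent overflow or underflow
        raise OverflowError("exponent out of range")
    return f, e

# 2.5 x (-2.5): (5/8 x 2^2) x (-5/8 x 2^2) = -25/32 x 2^3 = -6.25
print(fp_multiply(Fraction(5, 8), 2, Fraction(-5, 8), 2))
```

With normalized operands the only product that is too big is (-1) × (-1) = +1, which is why a single right shift suffices in step 4a.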


Flowchart for Floating-Point Multiplication


Hardware Required to Implement the Multiplier

Exponent adder: a 5-bit full adder is used.

Fraction multiplier: implements a shift-and-add multiplier algorithm.

Control unit: provides the signals to perform the appropriate operations of right shifting, left shifting, exponent incrementing/decrementing, and so forth.


SM Chart for Floating-Point Multiplication


The VHDL Behavioral Description for Floating-Point Multiplication 1

The VHDL behavioral description uses three processes:

The main process generates control signals based on the SM chart.

The second process generates the control signals for the fraction multiplier.

The third process tests the control signals and updates the appropriate registers on the rising edge of the clock.


The VHDL Behavioral Description for Floating-Point Multiplication 2

Testing the VHDL code for the floating-point multiplier must be done carefully to account for all the special cases in combination with positive and negative fractions, as well as positive and negative exponents.

When the VHDL code was synthesized for the Xilinx Spartan-3/Virtex-4 architectures using the Xilinx ISE tools, the result was 38 slices, 29 flip-flops, 72 4-input LUTs, 27 I/O blocks, and one global clock circuit.


Floating-Point Addition

Given two floating-point numbers, $F_1 \times 2^{E_1}$ and $F_2 \times 2^{E_2}$, the sum is:

$(F_1 \times 2^{E_1}) + (F_2 \times 2^{E_2}) = F \times 2^E$


Procedure for Performing Floating-Point Addition

Procedure for performing floating-point addition:
1. Compare exponents. If the exponents are not equal, shift the fraction with the smaller exponent right and add 1 to its exponent; repeat until the exponents are equal.
2. Add the fractions (significands).
3. If the result is 0, set the exponent to the appropriate representation for 0 and exit.
4. If fraction overflow occurs, shift right and add 1 to the exponent to correct the overflow.
5. If the fraction is unnormalized, shift left and subtract 1 from the exponent until the fraction is normalized.
6. Check for exponent overflow. Set overflow indicator, if necessary.
7. Round to the appropriate number of bits. Is it still normalized? If not, go back to step 4.


Floating-Point Addition- Example 1

Add $F_1 \times 2^{E_1} = 0.111 \times 2^5$ and $F_2 \times 2^{E_2} = 0.101 \times 2^3$.

Apply the aforementioned steps:
1. Compare exponents. Since E2 does not equal E1, unnormalize the smaller number F2 by shifting right 2 times and adding 2 to the exponent: $0.101 \times 2^3 = 0.0101 \times 2^4 = 0.00101 \times 2^5$
2. Add the fractions: $(0.111 \times 2^5) + (0.00101 \times 2^5) = 01.00001 \times 2^5$
3. If the result is 0, set the exponent to the appropriate representation for 0 and exit. Result is not 0.


Floating-Point Addition- Example 2

4. If fraction overflow occurs, shift right and add 1 to the exponent to correct the overflow: This addition caused an overflow into the sign bit position. The final result is: $F \times 2^E = 0.100001 \times 2^6$

5. If the fraction is unnormalized (or negative), shift left and subtract 1 from the exponent until the fraction is normalized. Example:

$(1.100 \times 2^{-2}) + (0.100 \times 2^{-1})$
$= (1.110 \times 2^{-1}) + (0.100 \times 2^{-1})$ (after shifting F1)
$= 0.010 \times 2^{-1}$ (result of adding fractions unnormalized)
$= 0.100 \times 2^{-2}$ (normalized by shifting left and subtracting 1 from exponent)
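Both worked examples can be reproduced with a short Python sketch of steps 1 through 6. It is an illustration under stated assumptions: fractions are exact `Fraction` values, so no bits are lost in the right shifts and the rounding step is omitted, and the exponent range assumes a 4-bit 2's complement exponent. The names `fp_add`, `is_normalized`, `E_MIN`, and `E_MAX` are hypothetical.

```python
from fractions import Fraction

E_MIN, E_MAX = -8, 7    # assumed 4-bit 2's complement exponent range

def is_normalized(f):
    """True if the sign bit and the next bit of a 2's complement F differ."""
    return Fraction(1, 2) <= f < 1 or -1 <= f < Fraction(-1, 2)

def fp_add(f1, e1, f2, e2):
    while e1 < e2:                    # step 1: align the smaller exponent
        f1, e1 = f1 / 2, e1 + 1
    while e2 < e1:
        f2, e2 = f2 / 2, e2 + 1
    f, e = f1 + f2, e1                # step 2: add the fractions
    if f == 0:                        # step 3: uniform representation of 0
        return Fraction(0), E_MIN
    if f >= 1 or f < -1:              # step 4: fraction overflow
        f, e = f / 2, e + 1
    while not is_normalized(f):       # step 5: shift left until normalized
        f, e = f * 2, e - 1
    if e > E_MAX:                     # step 6: exponent overflow
        raise OverflowError("exponent overflow")
    return f, e

# 0.111 x 2^5 + 0.101 x 2^3 = 0.100001 x 2^6 (33/64 x 2^6)
print(fp_add(Fraction(7, 8), 5, Fraction(5, 8), 3))
# 1.100 x 2^-2 + 0.100 x 2^-1 = 0.100 x 2^-2 (1/2 x 2^-2)
print(fp_add(Fraction(-1, 2), -2, Fraction(1, 2), -1))
```

In hardware the right shifts in step 1 discard low-order bits, which is where the extra bits kept for rounding come into play; the exact arithmetic used here hides that effect.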


Hardware Units Required to Implement a Floating-Point Adder

Adder (subtractor) to compare exponents

Shift register to shift the smaller number to the right

ALU (adder) to add fractions

Bidirectional shifter, incrementer/decrementer

Overflow detector

Rounding hardware

Many of these components can be combined.


Overview of a Floating-Point Addition


Other Floating-Point Operations: Subtraction

The procedure is the same as addition, except you must subtract the fractions instead of adding them.

Other steps remain the same.


Other Floating-Point Operations: Division 1

The quotient of 2 floating-point numbers is:

$(F_1 \times 2^{E_1}) \div (F_2 \times 2^{E_2}) = (F_1 / F_2) \times 2^{(E_1 - E_2)} = F \times 2^E$

The basic procedure is to divide the fractions and subtract the exponents. In addition to considering the special cases already described, also test for divide by 0 before dividing.


Other Floating-Point Operations: Division 2

If F1 and F2 are normalized, then the largest positive quotient (F) will be: 0.1111 … / 0.1000 … = 01.111 …

This is less than $10_2$, so the fraction overflow is easily corrected. For example:

$(0.110101 \times 2^2) \div (0.101 \times 2^{-3}) = 01.010 \times 2^5 = 0.101 \times 2^6$

Alternatively, if F1 ≥ F2, we can shift F1 right before dividing and avoid fraction overflow in the first place.

In the IEEE format, dividing a nonzero number by 0 gives ± infinity, and dividing 0 by 0 gives NaN.
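The division example can be checked with a Python sketch in the same style. It assumes exact `Fraction` arithmetic with a separate truncation helper to mimic a 3-bit fraction, and it does not check the exponent range; `fp_divide`, `truncate`, `is_normalized`, and `E_MIN` are hypothetical names.

```python
import math
from fractions import Fraction

E_MIN = -8    # assumed exponent paired with the fraction 0

def is_normalized(f):
    """True if the sign bit and the next bit of a 2's complement F differ."""
    return Fraction(1, 2) <= f < 1 or -1 <= f < Fraction(-1, 2)

def fp_divide(f1, e1, f2, e2):
    if f2 == 0:                       # test for divide by 0 before dividing
        raise ZeroDivisionError("divide by 0")
    f, e = Fraction(f1) / f2, e1 - e2  # divide fractions, subtract exponents
    if f == 0:
        return Fraction(0), E_MIN
    while f >= 1 or f < -1:           # fraction overflow: shift right
        f, e = f / 2, e + 1
    while not is_normalized(f):       # too small: shift left
        f, e = f * 2, e - 1
    return f, e

def truncate(f, bits):
    """Keep only `bits` fraction bits, rounding toward zero."""
    scale = 1 << bits
    return Fraction(math.trunc(f * scale), scale)

f, e = fp_divide(Fraction(53, 64), 2, Fraction(5, 8), -3)
print(f, e)               # 53/80 6
print(truncate(f, 3))     # 5/8, i.e. 0.101
```

The exact quotient fraction is 53/80; keeping three fraction bits gives 5/8, which matches the 0.101 × 2^6 shown in the example.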
