on the representation of zero in floating-point arithmetic

BIT 4 (1964) 156-161

ON THE REPRESENTATION OF ZERO IN FLOATING-POINT ARITHMETIC

CHRISTIAN GRAM

Abstract. Two representations of zero in floating-point arithmetic are considered in relation to a summation with correction for rounding errors. The common representation with exponent zero is shown in this case to be better than the "academic" representation where the exponent depends on the "history" of the zero.

1. Introduction.

In floating-point arithmetic the representation of zero is not uniquely determined and it is necessary explicitly to define how the number zero behaves arithmetically, and to examine the implications hereof.

The present study arose from an attempt to improve an ALGOL procedure for integration (see Gram (1964)). The tests were carried out on the GIER computer, which has built-in floating-point arithmetic. A switch makes it possible to choose between the two modes of representation of zero described in section 2, and the deviation between results obtained from the same programs by operating in these two modes led to the conclusions of section 4.

Throughout the paper we shall consider a binary floating-point arithmetic having mantissae with 29 significant bits (plus one bit for the sign) and with exponents in the range $-512 \le p < 512$ (corresponding to the situation on GIER).

2. "Civil" and "Academic" Representation of Zero.

We shall consider two ways of representing the floating-point number zero and the impact of this on the arithmetic operations.

2.1 "Civil" Representation.

Probably the most used representation of zero in floating-point arithmetic is to put both the mantissa and the exponent equal to zero, i.e., we always have

(2.1) the number $0 = 0 \times 2^{0}$

and the following arithmetic rules:


(2.2) $a \pm 0 = a$

(2.3) $a - a = 0$

(2.4) $a \times 0 = 0$

(2.5) $0/a = 0$

($= 0 \times 2^{0}$ independent of $a$)

Thus, e.g., $(2^{100} - 2^{100}) + 17 = 17$, but $(2^{100} + 17) - 2^{100} = 0$.

In any arithmetic with a finite number of digits a number represents an interval of the real axis, where the length of this interval depends on the number of digits and, in floating-point arithmetic, also on the absolute value of the number. Loosely speaking, one may say that the "civil" representation of zero makes it correspond to an interval shorter than the interval of any other number.

2.2 "Academic" Representation.

In the other representation considered the exponent of zero depends on the "history" of the number, i.e., it depends on the calculation that has produced the zero; this influences the arithmetic operations according to the following rules:

(2.2a) The value of $a \pm 0 \times 2^{p}$ depends on the exponent, $a_2$, of $a$: If $a_2 > p$, then the result is $a$, but if $a_2 < p$, then the mantissa of $a$ is shifted to the right before the addition/subtraction; hence some digits of $a$ may be lost. (After the operation, the result is normalized as usual so that it gets the exponent of $a$.)

(2.3a) $a_1 \times 2^{a_2} - a_1 \times 2^{a_2} = 0 \times 2^{a_2}$

(2.4a) $(a_1 \times 2^{a_2}) \times (0 \times 2^{p}) = 0 \times 2^{a_2 + p}$ *

(2.5a) $(0 \times 2^{p}) / (a_1 \times 2^{a_2}) = 0 \times 2^{p - a_2}$ *

Thus, e.g., $(2^{50} - 2^{50}) + 2^{15} = 0 \times 2^{50}$ and also $(2^{50} + 2^{15}) - 2^{50} = 0 \times 2^{50}$, because $2^{15}$ is shifted more than 30 places to the right before the addition. A peculiar result of the rules above is that one may get overflow when multiplying a zero with a large exponent by a number with a large exponent, and correspondingly with respect to division.
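The exponent bookkeeping of rules (2.4a) and (2.5a), including the footnoted clamp at -512, can be sketched in Python. This is only an illustration under stated assumptions, not a description of GIER's hardware: the function names are hypothetical, the upper exponent limit of 511 is inferred from the range given in section 1, and signalling overflow with `OverflowError` is a choice made for this sketch.

```python
MIN_EXP, MAX_EXP = -512, 511   # assumed exponent range, after section 1


def _zero_exponent(e: int) -> int:
    """Clamp or reject the exponent computed for an "academic" zero."""
    if e > MAX_EXP:
        raise OverflowError(f"exponent {e} exceeds {MAX_EXP}")
    return max(e, MIN_EXP)     # footnote: exponents below -512 become -512


def zero_times(p: int, a2: int) -> int:
    """Exponent of (a1 x 2^a2) x (0 x 2^p), following rule (2.4a)."""
    return _zero_exponent(a2 + p)


def zero_divided_by(p: int, a2: int) -> int:
    """Exponent of (0 x 2^p) / (a1 x 2^a2), following rule (2.5a)."""
    return _zero_exponent(p - a2)
```

With these definitions, `zero_times(300, 300)` raises `OverflowError` although the mathematical product is zero, which is the peculiarity noted above.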

On the real axis, this means that the length of the interval corresponding to zero varies with the circumstances under which the zero has arisen.
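The addition rules of both representations can be tried out in a toy model. The Python sketch below is an assumption-laden illustration, not GIER's arithmetic: a number is a pair (mantissa, exponent) with value mantissa x 2^(exponent - 28), alignment shifts truncate, there are no guard digits, exponent overflow is ignored, and the names `make`, `add`, and `sub` are hypothetical.

```python
import math

MANT_BITS = 29            # significant mantissa bits, as in the paper
MIN_EXP = -512            # exponent given to an input zero in "academic" mode


def make(v, academic):
    """Convert a Python float to a (mantissa, exponent) pair."""
    if v == 0:
        return (0, MIN_EXP if academic else 0)
    frac, e = math.frexp(v)                        # v = frac * 2**e, 0.5 <= |frac| < 1
    return (int(frac * (1 << MANT_BITS)), e - 1)   # int() truncates extra bits


def _shift(m, n):
    """Shift a signed mantissa n places to the right, truncating."""
    return (abs(m) >> n) * (1 if m >= 0 else -1)


def add(x, y, academic):
    """Add two toy numbers; the smaller-exponent operand is shifted right."""
    (mx, ex), (my, ey) = x, y
    if not academic:                               # rule (2.2): a + 0 = a
        if mx == 0:
            return y
        if my == 0:
            return x
    e = max(ex, ey)
    m = _shift(mx, e - ex) + _shift(my, e - ey)
    if m == 0:                                     # (2.3) gives 0 x 2^0, (2.3a) keeps e
        return (0, e if academic else 0)
    sign, m = (-1 if m < 0 else 1), abs(m)
    while m >= 1 << MANT_BITS:                     # normalize after a carry
        m, e = m >> 1, e + 1
    while m < 1 << (MANT_BITS - 1):                # normalize after cancellation
        m, e = m << 1, e - 1
    return (sign * m, e)


def sub(x, y, academic):
    """Subtract by adding the negated operand."""
    return add(x, (-y[0], y[1]), academic)
```

In this model the "academic" expression `add(sub(a, a, True), b, True)` with `a = make(2.0**50, True)` and `b = make(2.0**15, True)` yields a zero mantissa with exponent 50, in line with the example of section 2.2.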

This representation of zero, which may seem more logical and consistent than the other one, raises the following problems: What should be done with a zero appearing as input or as a constant of a program, and how should function values such as ln(1), sin(0), and sign(0) calculated by subroutines be represented?

* If the calculated exponent is less than $-512$, then it is put equal to $-512$.

As to the first problem, one might prohibit the occurrence of zero so that it can only appear as the difference between two identical numbers, or one might demand an exponent with each zero telling the "size" of that particular zero. But both of these solutions are rather cumbersome and restrictive, so it seems better to define any input zero and any constant zero to have the smallest available exponent, $p = -512$. This means that such an initialized zero represents the shortest possible interval on the real axis, and hence it will act as the usual number zero in all arithmetic (cf. the rules 2.2a-2.5a). For instance, in the ALGOL statement

    if x = 0 then ...

the relation x = 0 will be true only if x has the mantissa 0. On the other hand the relation x = zero, where zero $= 0 \times 2^{10}$, is true for any x with $|x| < 2^{-20}$ because the subtraction zero $-$ x gives $0 \times 2^{10}$.
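The behaviour of the relation x = zero can be sketched in Python. The sketch assumes that the comparison is carried out as the subtraction zero - x followed by a test for a zero mantissa, that mantissae have 29 bits and are normalized to the range [1, 2), and that alignment shifts truncate; the function name is hypothetical.

```python
import math

MANT_BITS = 29


def equals_academic_zero(x: float, zero_exp: int) -> bool:
    """True if zero - x has mantissa 0, where zero = 0 x 2**zero_exp."""
    if x == 0:
        return True
    frac, e = math.frexp(abs(x))              # |x| = frac * 2**e, 0.5 <= frac < 1
    mant = int(frac * (1 << MANT_BITS))       # 29-bit integer mantissa
    shift = zero_exp - (e - 1)                # alignment shift applied to x
    # With no shift, x keeps its nonzero mantissa and the relation is false.
    return shift > 0 and (mant >> shift) == 0
```

In this model every x with |x| < 2^-20 compares equal to 0 x 2^10, while a zero with exponent -512 compares equal only to a zero mantissa. The exact cut-off on a real machine depends on details such as the sign bit and guard digits, which are not modelled here.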

As to the second problem, the answer is also here that standard routines should deliver a zero with the smallest possible exponent, because such a zero can do no harm when used in connection with other numbers. Any other choice might easily spoil other variables according to the rule (2.2a).

If a zero is wanted of the same order of magnitude as a certain number A, then it must be introduced in a statement like

    zeroA := A - A.

3. Impact on a Summation Algorithm.

We shall consider the following algorithm for summation of the elements of an array A with correction of nearly all the rounding errors:

    sum1 := error := 0;
    for j := 1 step 1 until N do
    begin
      sum := sum1 + A[j];
      E1: error := error + (if abs(A[j]) < abs(sum1)
                            then A[j] - (sum - sum1)
                            else sum1 - (sum - A[j]));
      sum1 := sum
    end;
    sum := sum + error;
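For readers who want to experiment, the ALGOL algorithm above can be transcribed into Python roughly as follows. This is a sketch under the assumption of ordinary Python floats (IEEE 754 double precision on common platforms, whose zero behaves like the "civil" one in that a + 0 = a), so it cannot reproduce the "academic" mode; `naive_sum` is a hypothetical helper added only for comparison.

```python
def corrected_sum(values):
    """Sum `values`, accumulating the rounding error of each addition."""
    sum1 = 0.0
    error = 0.0
    for a in values:
        total = sum1 + a
        # The operand of smaller magnitude is the one that loses digits.
        if abs(a) < abs(sum1):
            error += a - (total - sum1)
        else:
            error += sum1 - (total - a)
        sum1 = total
    return sum1 + error


def naive_sum(values):
    """Plain left-to-right summation, for comparison."""
    total = 0.0
    for a in values:
        total += a
    return total
```

For `values = [1e16] + [1.0] * 1000`, each 1.0 is too small to change the running sum, so `naive_sum` returns 1e16, whereas `corrected_sum` collects the lost terms in `error` and returns 1e16 + 1000.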

In the statement labelled E1 we find the rounding error committed in the previous addition and accumulate it under the name error. The conditional expression is necessary because it is always the operand with the smallest absolute value which loses some digits during the addition. The above correction is not complete because it fails in the case where a shift in sum is necessary after the addition (in order to make it a correct normalized floating-point number); this case can also be taken care of (see Møller (1964) for further details), but the algorithm above is sufficient to illustrate an effect of the zero representation:

(a) Consider a summation of positive, decreasing values $A_j$ where, say, the first p values are so large that the sum of these exceeds $2^{29} \times A_{p+1}$. Then, of course, the remaining values of $A_j$ cannot change sum when added one by one, but working with the "civil" representation, they are added to error successively (and error will most often be small enough to be able to assimilate the values of A), and hence the sum of the last N - p terms $A_j$ may, through the addition of error, correctly influence the total sum.

Working with the "academic" representation, these last terms $A_j$ are completely lost in the statement E1 because the difference (sum - sum1) equals zero with an exponent high enough to make $A_j$ - (sum - sum1) equal to zero with the same exponent. Hence the resulting sum consists solely of the sum of the first p values of $A_j$ and a correction for rounding errors from the first p - 1 additions.

This is illustrated by the summation of function values when calculating the integral

$\int_{0.01}^{1.01} x^{-5}\,dx$

by Romberg's method (an extrapolation from trapezoidal integrations with 2, 4, 8, ... subintervals, see Gram (1964)). With $2^{6}$ subintervals the result is independent of the zero representation, but with $2^{7}$ subintervals the results show a relative difference of approximately $10^{-8}$; this difference increases with a growing number of subintervals, the "civil" result being the best in each case. A numerical analysis yields the following result: The last function values of the summation are approximately 0.95 (namely $1.01^{-5}$), and with $2^{6}$ subintervals the sum of the first function values is*

$S_1 \approx 9 \times 10^{7} + 3 \times 10^{6} + \dots \approx 10^{8}.$

Hence the last values are greater than $2^{-28} \times S_1 \approx 4 \times 10^{-9} \times S_1 \approx 0.4$ and the correction statement E1 will "catch" nearly all rounding errors in both cases. But with $2^{7}$ subintervals the sum is

$S_1 \approx 5.6 \times 10^{8} + 3 \times 10^{7} + \dots \approx 6 \times 10^{8}$

and thus the last function values are lost totally in the "academic" case because they are less than $2^{-29} \times S_1 \approx 2 \times 10^{-9} \times S_1 \approx 1.2$.

* Because of the successive halving of intervals, only the new function values are added here, corresponding to mesh point No. 1, 3, 5, ...
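The orders of magnitude quoted in this analysis can be checked with a few lines of Python. The sketch assumes that the new function values for $2^{k}$ subintervals are taken at the odd-numbered mesh points of [0.01, 1.01] with spacing $2^{-k}$; the helper name `new_values` is hypothetical, and ordinary double precision is used only to evaluate the magnitudes, not to imitate the 29-bit arithmetic.

```python
def new_values(k: int):
    """Values of x**-5 at the odd mesh points for 2**k subintervals."""
    n = 2 ** k
    return [(0.01 + i / n) ** -5 for i in range(1, n, 2)]


for k, bits in ((6, 28), (7, 29)):
    vals = new_values(k)
    s1 = sum(vals)
    print(f"2^{k} subintervals: S1 = {s1:.3g}, "
          f"2^-{bits} * S1 = {2.0 ** -bits * s1:.2f}, "
          f"last new value = {vals[-1]:.2f}")
```

Under these assumptions the last new value lies above the threshold for $2^{6}$ subintervals and below it for $2^{7}$, which matches the observation that the two modes first disagree at $2^{7}$ subintervals.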

(b) It may happen that one of the additions is performed without any error so that the difference $A_j$ - (sum - sum1) (or the symmetric expression) is exactly equal to zero. This does no damage in the "civil" case, but in the "academic" case this zero gets a large exponent if $|A_j|$ is large; hence we may lose some or all digits in the already accumulated error, because it is shifted to the right before addition to the zero coming from the if-expression.
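Both effects can be reproduced in a small simulation. The following Python sketch runs the summation algorithm of this section on a toy 29-bit arithmetic in either mode. It is an illustration under assumptions, not a model of GIER: values are pairs (mantissa, exponent) with value mantissa x 2^(exponent - 28), shifts truncate, there are no guard digits, and all names are hypothetical.

```python
import math

MANT_BITS = 29
MIN_EXP = -512


def make(v, academic):
    """Float to (mantissa, exponent); an input zero gets exponent -512 or 0."""
    if v == 0:
        return (0, MIN_EXP if academic else 0)
    frac, e = math.frexp(v)
    return (int(frac * (1 << MANT_BITS)), e - 1)


def value(x):
    """(mantissa, exponent) back to a Python float."""
    return math.ldexp(x[0], x[1] - (MANT_BITS - 1))


def _shift(m, n):
    return (abs(m) >> n) * (1 if m >= 0 else -1)


def add(x, y, academic):
    (mx, ex), (my, ey) = x, y
    if not academic:                       # "civil": a + 0 = a
        if mx == 0:
            return y
        if my == 0:
            return x
    e = max(ex, ey)
    m = _shift(mx, e - ex) + _shift(my, e - ey)
    if m == 0:                             # zero keeps e only in "academic" mode
        return (0, e if academic else 0)
    sign, m = (-1 if m < 0 else 1), abs(m)
    while m >= 1 << MANT_BITS:
        m, e = m >> 1, e + 1
    while m < 1 << (MANT_BITS - 1):
        m, e = m << 1, e - 1
    return (sign * m, e)


def sub(x, y, academic):
    return add(x, (-y[0], y[1]), academic)


def corrected_sum(values, academic):
    """The summation algorithm of section 3 on the toy arithmetic."""
    sum1 = error = make(0.0, academic)
    for v in values:
        a = make(v, academic)
        s = add(sum1, a, academic)
        if abs(value(a)) < abs(value(sum1)):
            corr = sub(a, sub(s, sum1, academic), academic)
        else:
            corr = sub(sum1, sub(s, a, academic), academic)
        error = add(error, corr, academic)     # statement E1
        sum1 = s
    return value(add(sum1, error, academic))
```

For `values = [2.0**30] + [1.0] * 8`, the "civil" run returns 2^30 + 8, while the "academic" run returns 2^30: the first, exact addition already leaves error as a zero with exponent 30 (effect (b)), and every later term is shifted out against a zero of that size (effect (a)).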

4. Conclusion.

In the case considered above the use of the "academic" zero was worse than use of the "civil" zero. Probably, it is possible to change the algorithm so that it works correctly in the "academic" case; furthermore, the material presented here is certainly too small to admit any general conclusions to be drawn. However, it may raise some doubt as to the usefulness of the "academic" zero, and it may be of interest to try other types of numerical calculations in the two modes of operation. We have run a program for calculation of matrix eigenvalues in both modes of operation, and also here the "civil" zero representation worked a little better than the "academic" one, but the results have not been analyzed and explained as yet.

The only decisive conclusion which can be drawn from the present study is that it is necessary to keep the zero representation in mind when programming (and the examples shown here were of course programmed with the common type of zero in mind).

5. Acknowledgements.

It is a pleasure for the author to thank several staff members of Regnecentralen for valuable help. Most of all, many discussions with Erik Jorgensen, Peter Naur, Ole Møller, and Bjarne Svejgaard have helped to clarify my ideas and thoughts.


REFERENCES

Gram, Christian, Definite Integral by Rombergs Method. ALGOL Programming section of BIT 4, p. 54 (1964).

Møller, Ole, On a Quasi Double-Precision Method in Floating-Point Arithmetic. To appear in BIT 4 (1964).

REGNECENTRALEN

COPENHAGEN, DENMARK
