
Multiprecision division: Expanded Version

Eric Rice Richard Hughey

Abstract

This paper presents a study of multiprecision division on processors containing word-by-word multipliers. It compares several algorithms by first optimizing each for the software environment, and then comparing their performances on simple machine models. While the study was originally motivated by floating-point division in the small-word environment, the results are extended to multiprecision floating-point and integer division in general to the extent possible without extensive architecture-specific analysis.

Two algorithms are found to be best for multiprecision division. For many floating-point division problems, and especially for any division by a small divisor, a hybrid of the Newton-Raphson and Byte Division algorithms is optimal, where significant reciprocal refinement is performed before beginning very high radix Byte Division iterations. Low-precision arithmetic and a method of inexpensively boosting accuracy during Newton-Raphson reciprocal refinement improve algorithm efficiency. For other division problems, Restoring Division is best, and is easy to implement. The asymptotic cost of floating-point division for each of these algorithms is the same as that of multiprecision multiplication.


Contents

1 Introduction

2 Problem Overview

3 Low-Precision Arithmetic

4 Overview of Division Algorithms
    4.1 Restoring Division
    4.2 Byte Division
    4.3 Accurate Quotient Algorithm
    4.4 Prescaling and Truncating
    4.5 Newton-Raphson Division
    4.6 Goldschmidt's Algorithm
    4.7 Low-Radix Algorithms

5 Newton-Raphson Reciprocal Refinement
    5.1 Efficient NR Implementation
    5.2 NR Accuracy Boost

6 Floating-point division in small-word processors

7 Multiprecision Division for Arbitrary Parameters
    7.1 Floating-Point Division
    7.2 Integer Division

8 Complications and Observations

9 Conclusions

10 Acknowledgments

11 Affiliation of Authors

A Algorithm Implementations
    A.1 Modified Newton-Raphson Algorithm
    A.2 Byte Division
    A.3 Accurate Quotient Algorithm
    A.4 Prescaling and Rounding Algorithm
    A.5 Goldschmidt's Algorithm

B Derivation of General Formulas

C Assorted issues
    C.1 Newton-Raphson division never optimal
    C.2 Higher multiplication latency
    C.3 Instruction Counts for Two Independently-Programmable Processors
    C.4 Overflow prevention in hybrid algorithms

D Single-precision algorithm for Kestrel

1 Introduction

Multiprecision division is usually performed using a simple Restoring Division algorithm [1]. Although many studies examine other division strategies for hardware implementation, a thorough examination of algorithms for software division has not been available. Several hardware studies have examined the potential for reduced operand sizes in certain algorithms [2, 3, 4], something critical to software efficiency, but most hardware analysis is of little use to software division, and efficient software implementations require significant redesign.

This study attempts to redesign the most promising hardware strategies for division and to evaluate the resulting algorithms in a variety of multiprecision settings. The original motivation of the study was to implement single- and double-precision floating-point division on UCSC's 8-bit Kestrel parallel processor [5, 6]. This work extends the results both to other small-word environments (by direct results) and to multiprecision division in general, including integer division, by obtaining closed-form expressions of algorithm costs.

There are three parts to this study. In the first, we optimize algorithms for software division, including Byte Division [3], the Accurate Quotient algorithm [4], Prescaling and Truncating [7, 8], Newton-Raphson division [9], Goldschmidt's algorithm [10], and Restoring Division [1], and briefly look at lower-radix algorithms. Minimizing operand size and comparing the costs of different implementations when significant modifications are required are the most important parts of this process.

The second part of the study is a semi-exhaustive case study for implementing single- and double-precision division in small-word processors. The primitive operations assumed are addition and integer multiplication, and several multiplication configurations are considered.

The third part of the study considers arbitrary floating-point and integer division. Closed-form expressions estimating algorithm costs are used to compare performances over a variety of operand and target sizes. While the additional primitive of division is assumed available, it is not found useful. In both the second and third parts of the study, algorithms are evaluated based on the costs of performing required arithmetic operations.

The results indicate that two algorithms are most competitive for multiprecision division. A hybrid of the Newton-Raphson and Byte Division algorithms is significantly more efficient than other algorithms for division by divisors containing a small number of words, both in floating-point and integer division. This algorithm, which refines a reciprocal estimate to multiple-word accuracy before beginning very high radix Byte Division iterations, is also optimal for other floating-point division when target accuracy is less than approximately 70 words, though with a smaller margin of superiority. For other division problems, Restoring Division is most efficient, and is easy to implement.

The paper is organized as follows. After an overview of the problem specifications in Section 2, Section 3 looks at the savings available during calculations using low-precision arithmetic. Section 4 reviews several of the most common division algorithms, followed by a closer look at Newton-Raphson reciprocal refinement in Section 5. Section 6 presents the machine model used in the studies and evaluates the algorithms for performing single- and double-precision division on small-word processors, followed by a more general comparison of algorithms for both floating-point and integer division in Section 7. Finally, Section 8 discusses issues that complicate algorithm implementations and running times, including the added costs of higher multiplication latency, loading and storing when insufficient registers are available, and the degree to which algorithms contain instruction-level parallelism.


2 Problem Overview

In this paper we address both integer and floating-point division. The floating-point studies consider the problem:

\[ Q = a/b, \qquad 1/2 \le a < 1, \qquad 1 \le b < 2, \]

leading to

\[ 1/4 < Q < 1. \]

While we do not address the process of normalizing (the cost to all algorithms will be the same), we assume that one quotient bit beyond target accuracy will be calculated so that a normalized quotient will still have the required precision.

We assume that one more quotient bit will be calculated to allow efficient rounding. As an example of this strategy, for single-precision division (where 24 fractional bits are needed after normalizing), we require that:

\[ a/b - 2^{-26} < Q < a/b + 2^{-26}. \]

An adjusted quotient estimate $Q' = Q + 2^{-26}$ then satisfies:

\[ a/b < Q' < a/b + 2^{-25}. \]

Since $Q'$ is an overestimate and will still be within $2^{-24}$ of accurate after normalizing, truncating the normalized $Q'$ to 24 fractional bits will result in an exact quotient when such is representable in 24 bits. (For example, $3/4$ will be represented as $.11000\ldots0$ instead of $.10111\ldots1$.)
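The adjust-and-truncate step can be illustrated with a short Python sketch in exact rational arithmetic. The function name `round_by_overestimate` and its parameters are assumptions introduced here for illustration, not part of the original study; the sketch assumes the incoming estimate lies within $2^{-26}$ of $a/b$ and that $1/4 < Q < 1$.

```python
from fractions import Fraction

def round_by_overestimate(q_est, frac_bits=24, guard=2):
    """Add 2**-(frac_bits + guard), normalize to [1/2, 1), then truncate.

    Returns (mantissa, shift): mantissa holds frac_bits fractional bits of
    the normalized quotient, shift is the number of left shifts applied.
    """
    q = q_est + Fraction(1, 2 ** (frac_bits + guard))   # Q' = Q + 2^-26
    shift = 0
    while q < Fraction(1, 2):        # at most one shift, since Q > 1/4
        q *= 2
        shift += 1
    mantissa = int(q * 2 ** frac_bits)                   # truncate
    return mantissa, shift

# An estimate of 3/4 that is slightly low still yields .11000...0
low = Fraction(3, 4) - Fraction(1, 2 ** 27)
print(round_by_overestimate(low))          # (12582912, 0)
print(int(low * 2 ** 24))                  # 12582911, plain truncation
```

Plain truncation of the slightly low estimate gives $.10111\ldots1$, whereas the adjusted estimate truncates to the exact value.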

While this example produces a result within 1 least significant bit (lsb), an additional quotient bit can be calculated to provide a result within $3/4$ lsb (and another to provide a result within $5/8$ lsb, and so on), where an exact result is again produced when such can be represented in 24 bits.

This rounding scheme is much less expensive than perfect rounding (to within $1/2$ lsb). For double-precision division on an 8-bit processor, perfect rounding requires approximately 45% more instructions than the 1 lsb method above, and asymptotically doubles the cost of the optimal algorithms when the operands contain the same number of words as target accuracy. Because of this, we do not address perfect rounding further, and assume instead that the method above will be used.

We define the following:

    $A$ = size of $a$ (in words).
    $B$ = size of $b$ (in words).
    $W$ = word size (in bits).
    $T$ = target accuracy (in bits).
    $M$ = precision of critical calculations (in words).
    $R$ = accuracy of a refined reciprocal estimate (in words).

The multiprecision variables $A$, $B$, and $M$ determine to a large extent the overall cost of division. Further explanations of $M$ and $R$ are given at the ends of Sections 3 and 4.2, respectively.


Finally, since all algorithms begin with a reciprocal estimate, we assume an initial 1-word reciprocal (under)estimate $r_0$ is available, defined by

\[ r_0 \le 1/b, \qquad r_0 + 2^{-W} > 1/b. \]

3 Low-Precision Arithmetic

Because multiprecision multiplications are usually the most costly part of a division algorithm, it is especially important to perform these as efficiently as possible. Most important in this regard is to minimize the number of words in the operands. The degree to which this can be done depends on the mathematical properties of an algorithm.

Another important opportunity for savings occurs when only the most significant words of a multiprecision multiplication represent useful information. (A common occurrence of this will be when using an $n$-word reciprocal to calculate an $\approx n$-word partial quotient.) In such cases some of the partial products can be omitted without significantly affecting the accuracy of the result. The situation is pictured below:

\[
\begin{array}{cccccc}
 & & & n_2 & n_1 & n_0 \\
\times & & & & m_1 & m_0 \\ \hline
 & & & & \multicolumn{2}{c}{m_0 n_0} \\
 & & & \multicolumn{2}{c}{m_0 n_1} & \\
 & & \multicolumn{2}{c}{m_0 n_2} & & \\
 & & & \multicolumn{2}{c}{m_1 n_0} & \\
 & & \multicolumn{2}{c}{m_1 n_1} & & \\
+ & \multicolumn{2}{c}{m_1 n_2} & & & \\ \hline
 & \multicolumn{5}{c}{\text{result}}
\end{array}
\]

Here, where we assume that only two words of result are needed, the partial product $m_0 n_0$ is of little significance and can be omitted. Although the partial products $m_0 n_1$ and $m_1 n_0$ can each impact the result by nearly 1 lsb of the result, it is usually more efficient to omit these as well. Finally, also potentially representing 1 lsb of result each are the low-word results of $m_0 n_2$ and $m_1 n_1$. On most machines, one or more instructions would be required to accumulate such values (possibly adding one or more carries to a result), making it usually best to omit them.

A low-precision multiplication such as above will be represented as follows (the operands in the example are arbitrary and in base 256):

\[
\begin{array}{rrrrrr}
 & & & 128 & 7 & 9 \\
\times & & & & 32 & 3 \\ \hline
 & & 1 & (128) & \boxtimes & \boxtimes \\
+ & 16 & 0 & (224) & \boxtimes & \\ \hline
 & 16 & 1 & & &
\end{array}
\]

where the $\boxtimes$'s represent uncalculated partial products and the (128) and (224) represent words that are calculated but truncated. The radix point is to the left of the most significant word unless otherwise specified.
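The example above can be reproduced with a small Python sketch. The function `lowprec_mul` is a hypothetical helper written for this illustration; it assumes both operands are fractional word lists (most significant word first, radix point to the left of the first word) and that the low word of each boundary partial product is simply dropped rather than accumulated, as in the layout shown.

```python
def lowprec_mul(x, y, target, base=256):
    """Low-precision product of fractional word lists x and y.

    Word k of a list (counting from 1) has weight base**-k. A partial
    product x[i]*y[j] has its low word at position i + j. Products with
    low word beyond position target + 1 are skipped; for products whose
    low word sits at position target + 1 only the high word is kept.
    """
    acc = 0                                    # scaled by base**target
    for i, xw in enumerate(x, start=1):
        for j, yw in enumerate(y, start=1):
            low_pos = i + j
            if low_pos <= target:
                acc += xw * yw * base ** (target - low_pos)
            elif low_pos == target + 1:
                acc += (xw * yw) // base       # truncate the low word
    return [(acc // base ** k) % base for k in reversed(range(target))]

print(lowprec_mul([128, 7, 9], [32, 3], target=2))   # [16, 1]
```

Of the six partial products, only three are formed here, and two of those contribute only their high words.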

In Newton-Raphson reciprocal refinement as implemented in Section 5, a partial remainder overestimate can lead to a subsequent quotient overestimate, which in turn will lead to a negative partial remainder.


To avoid this, after a low-precision multiplication during partial remainder calculation, a borrow of one or more (depending on the maximum error of the multiplication) can be forced to ensure that the result of a subsequent subtraction is less than the exact result. Alternatively, one can conditionally deal with negative partial remainder calculations when they occur, or can avoid negative partial remainders by making sure that the initial reciprocal estimate is not too good(!).

Although not a low-precision technique as such, unnecessary work can also be avoided when the most significant word(s) of a subtraction must be zero. In this case, common in partial remainder calculation in the second and subsequent iterations of some algorithms, we can save both in the multiplication that produces the word(s) to be subtracted, and in the subtraction itself.

While low-precision methods decrease the accuracy of calculations, they are almost always the most efficient strategy. For example, in $W = 8$ single-precision division, partial remainder calculations aimed at $\lceil T/W \rceil = \lceil 26/8 \rceil = 4$ words can produce the necessary accuracy, where the 6 least significant bits absorb error introduced by low-precision methods.

For $W = 8$ double-precision division, however, aiming partial remainder calculations at $\lceil T/W \rceil = \lceil 55/8 \rceil = 7$ words will not guarantee sufficient accuracy. By maintaining 8 fractional words, however, error introduced by low-precision arithmetic will not have significant impact on the leading 55 bits. For large $T$'s, maintaining two extra words might be necessary, especially when $W$ is small.

We can define the multiprecision variable $M$ to be the minimum number of words that partial remainder calculations can be aimed at using low-precision arithmetic while still meeting target accuracy, usually $\lceil T/W \rceil$ or $\lceil T/W \rceil + 1$.

The only place where full-precision arithmetic is appropriate is in integer division, where maintaining partial remainders to full accuracy efficiently provides the needed final remainder (and an exact $\lfloor a/b \rfloor$).

4 Overview of Division Algorithms

The most appropriate methods for software division are high radix algorithms which use multiplication as a primary operation. Of the six most appropriate software algorithms, the first four (Restoring Division, Byte Division, the Accurate Quotient algorithm, and Prescaling and Truncating) minimize the size of multiplications, but at the cost of achieving only linear convergence. The Newton-Raphson and Goldschmidt algorithms require more costly multiplications but achieve quadratic convergence.

This section presents the optimized division algorithms used in our studies.

4.1 Restoring Division

Restoring division produces non-overlapping partial quotients $q_i$ in each iteration by obtaining an estimate of $q_i$ and then adjusting it (and $P_{i+1}$) if necessary after calculating $P_{i+1} = P_i - b q_i$. The cost of this conditional adjustment is minimized by ensuring that each initial $q_i$ estimate is too large if anything (so that an overestimate of $q_i$ is easy to detect via a negative $P_{i+1}$) and by ensuring that the initial $q_i$ estimate is almost always correct (so that $P_{i+1}$ does not often need adjusting). While this second consideration could be avoided by a non-restoring scheme (where an overestimate of $q_i$ is corrected via a negative $q_{i+1}$), we did not find a method of calculating $q_i$ efficiently enough to overcome the overhead of managing such a system.

Another consideration is how to obtain the initial $q_i$ estimate. The following algorithm uses an efficient method:

• Setup:
    Set initial partial remainder $P_0 = a$.
    Calculate 2-word reciprocal estimate: $r_1 = r_0(2 - b r_0)$.

• Iterate on:
    Calculate quotient estimate: $q_i = r_1 P_i + \delta$ (or, if overflow occurs, $q_i = 2^W - 1$). The value of $\delta$ is chosen to ensure that $q_i$ is an overestimate.
    Calculate new partial remainder: $P_{i+1} = P_i - b q_i$.
    If $P_{i+1} < 0$, decrement $q_i$ and set $P_{i+1} = P_{i+1} + b$ (with $b$ correctly shifted).

While this approach requires precalculating a 2-word reciprocal, the resulting $q_i$ calculation is efficient, especially in processors with pipelined multipliers or superscalar capabilities. If two addition units and a pipelined multiplier are available, each $q_i$ can be obtained in $(5 + L)$ cycles, where $L$ is the multiplication latency.
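The setup and iteration steps listed above can be sketched in Python with exact rational arithmetic standing in for multiword registers. This is an illustrative model only, not the authors' implementation: the function name, the way $r_0$ is formed, and the margin `delta = 8 / 2**W` are assumptions chosen here so that each estimate is either correct or one too large (the bound used requires $W \ge 4$).

```python
from fractions import Fraction
from math import floor

def restoring_divide(a, b, W=8, n_words=4):
    """Divide a in [1/2, 1) by b in [1, 2); return (digits, remainder, fixes).

    digits are base-2**W quotient words, most significant first.
    """
    base = 2 ** W
    # 1-word underestimate of 1/b (clamped so it fits in one word)
    r0 = Fraction(min(floor(base / b), base - 1), base)
    # 2-word reciprocal r1 = r0 * (2 - b*r0), truncated to two words
    r1 = Fraction(floor(r0 * (2 - b * r0) * base ** 2), base ** 2)
    delta = Fraction(8, base)       # assumed margin: estimate never too small
    P, digits, fixes = a, [], 0
    for i in range(1, n_words + 1):
        weight = Fraction(1, base ** i)          # weight of this quotient word
        q = min(floor(r1 * P / weight + delta), base - 1)
        P_next = P - b * q * weight
        if P_next < 0:                           # overestimate: restore
            q -= 1
            P_next += b * weight
            fixes += 1
        digits.append(q)
        P = P_next
    return digits, P, fixes

print(restoring_divide(Fraction(3, 4), Fraction(3, 2)))
```

For the call shown, the quotient $1/2$ is retired in the first word and the remainder is zero. The `fixes` counter reports how often the restoring step was taken.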

Another way to estimate $q_i$ (which requires normalizing $b$ to $b < 1$) is to use a 2-by-1 word integer divide (if such is available) and conditional increments [1]. While this avoids the overhead of reciprocal refinement, it includes the dependent sequence:

    divide → multiply → subtract → conditional,

leading to relatively expensive iterations. (For the special case $B = 1$ the above is not true: a single divide obtains the correct $q_i$ estimate, one that will not need adjusting even after calculating $P_{i+1} = P_i - b q_i$. Unfortunately, this algorithm compares least favorably when $B$ is small, so that this refinement is not particularly useful.)

When low-precision arithmetic is used in Restoring Division partial remainder calculations, the last iterative step checking the sign of $P_{i+1}$ may not accurately reflect the sign of a full-precision $(a - b\sum_i q_i)$, so that $q_i$ might be decremented (or not decremented) inappropriately. This will only occur when $\sum_i q_i$ already meets target accuracy, however, and the rounding method described in Section 2 will still ensure exact representation of the quotient when possible.

Since the 2-word $r_1$ will provide quotient estimates that will be wrong only $\approx 2^{-W}$ of the time, a simplifying assumption made when estimating the cost of this algorithm is that adjustments to $q_i$ and $P_{i+1}$ will never be needed in the 3rd iterative step.

4.2 Byte Division

Byte Division [3] was designed to achieve high radix while limiting the size of required multiplications. While (as the name implies) originally designed for 8-bit words, here we consider any radix that uses the same scheme.

Two modifications from the standard approach are required for efficient software implementation. While shifting is used in hardware to remove leading 0's from each partial remainder, in software there is a better approach (discussed shortly). Also, instead of summing the $q_i$'s iteratively, it is better to wait until after all iterations are finished to do so (to prevent redundant conditional carries). The algorithm for software is:

• Setup:
    Obtain reciprocal estimate $r_0$.

• Iterate on:
    Choose next $p_i$ (a truncated $P_i$).
    Calculate new partial quotient $q_i \approx p_i r_0$.
    Calculate new partial remainder $P_{i+1} = P_i - b q_i$.

• At the end:
    Obtain final quotient $Q = \sum_i q_i$.

(The first two iterative steps compute $q_i \approx P_i r_0$ using low-precision arithmetic.)

A significant issue when implementing Byte Division is how to manage $P_i$ and its truncated counterpart $p_i$. For example, suppose $W = 8$ and we are calculating 255 120 111 / 1. 226 with an initial $r_0$ = 135. Then after choosing a 1-word $p_0$:

\[ q_0 \approx p_0 r_0 = 255 \times 135 = 134\ 121. \]

Here $q_0$ can be truncated to one byte without much loss of accuracy ($r_0$ is accurate to only one byte), making the multiplication $b q_0$ less costly. Then $P_1 = P_0 - b q_0$:

\[
\begin{array}{rrrrl}
 & 255 & 120 & 111 & P_0 \\
- & 252 & 76 & & b q_0 \\ \hline
 & 3 & 44 & 111 & P_1
\end{array}
\]

Because of the small leading byte of this new partial remainder, we can either use a two-byte $p_1$ = 3 44 (leading to a two-byte $q_i$), or shift $P_1$ to provide a useful leading byte:

    3 44 111 << 6 = 203 27 192

where we can use $p_1$ = 203, keeping track of how far we have shifted to be able to later correctly align the partial quotients.
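The arithmetic of this example can be checked with a few lines of Python. The variable names and the helper `to_words` are assumptions made for this illustration; words are base 256, and $b q_0$ is aligned under the two leading words of $P_0$ as in the layout above.

```python
BASE = 256

def to_words(value, count):
    """Split an integer into `count` base-256 words, most significant first."""
    return [(value // BASE ** k) % BASE for k in reversed(range(count))]

P0 = 255 * BASE ** 2 + 120 * BASE + 111    # dividend .255 120 111
b = 1 * BASE + 226                         # divisor 1.226
r0 = 135                                   # one-word reciprocal estimate

p0 = P0 // BASE ** 2                       # one-word truncation of P0: 255
q0 = (p0 * r0) // BASE                     # high word of 255 * 135: 134
bq0 = b * q0                               # two words: 252 76
P1 = P0 - bq0 * BASE                       # subtract under the leading words

print(q0, to_words(bq0, 2), to_words(P1, 3))   # 134 [252, 76] [3, 44, 111]
print(to_words(P1 << 6, 3))                    # [203, 27, 192]
```

The second line of output is the shifted partial remainder whose leading byte, 203, would serve as $p_1$ under the shifting strategy.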

It turns out that the two strategies are nearly identical in efficiency when $B = M$ in floating-point division. While using 2-word $p_i$'s leads to more expensive calculations of $q_i = p_i r_0$ and $P_{i+1} = P_i - b q_i$, the additional costs are almost identical to the costs (if shifting is used) of realigning partial quotients and of shifting itself. For small $B$'s, however, using 2-word $p_i$'s is more efficient, where there will be less added cost to each partial remainder calculation ($\approx B$ partial products instead of needing to shift the entire non-zero partial remainder). In addition, less precision is lost during the calculation $q_i = p_i r_0$ using 2-word $p_i$'s. In the example above where $r_0$ is accurate to $\approx 7$ bits, we will get $\approx 7$ quotient bits per iteration instead of 6 bits if shifting is used. Because of these considerations, we limit our analysis to a non-shifting strategy. (The cost comparison between the two strategies above assumes that multiprecision shifting costs as much as multiprecision multiplication by a 1-word multiplier. When shifting is less expensive, it will be a better strategy when $B$ is large enough.)

A consequence of this approach is that the small leading word of $P_i$ can increase in magnitude in each iteration. For $r_0 = 1/b - \epsilon$:

\[ 1 - b r_0 \le 1 - b(1/b - \epsilon) = b\epsilon. \]

When $\epsilon \approx 2^{-W}$ and $b \approx 2$, $(1 - b r_0)$ can thus approach $2 \cdot 2^{-W}$. Since also

\[ P_{i+1} = P_i - b q_i > P_i - b(P_i r_0) = P_i(1 - b r_0), \]

the leading non-zero word of the partial remainder can nearly double between iterations due only to this analysis, and can in fact more than double due to the effects of low-precision arithmetic. In the case of $W = 8$ this leads to possible overflow in the next-to-last iteration of double-precision division, requiring an extra modified iteration.

While the preceding assumes a radix determined by the accuracy of $r_0$, an important enhancement is to use the Newton-Raphson equations to refine $r_0$ to multiple-word accuracy before beginning iterations, where an $R$-word reciprocal will be able to retire $R$ words per iteration. Shifting can again be avoided by incorporating an extra word in the $p_i$'s and $q_i$'s. (An analysis similar to the one above supports this strategy.) Even with the added cost of reciprocal refinement, this strategy improves performance considerably for $M > 2$, resulting in the most efficient algorithm for most $M < 70$ multiprecision division problems (especially when $B$ is small). We refer to this new hybrid method as the NR-Byte algorithm. For simplicity, we will sometimes consider the original Byte Division algorithm to be a special case of NR-Byte with $R = 1$.

4.3 Accurate Quotient Algorithm

The high-radix Accurate Quotient algorithm proposed by Wong and Flynn [4] differs from Byte Division in how the partial remainders are updated and in how these are used to obtain a quotient. If we choose $q_i = p_i r_0$ (without truncating $q_i$ as in Byte Division), $P_{i+1}$ can be rewritten as:

\[ P_{i+1} = P_i - (b q_i) = P_i - (b r_0) p_i. \]

The Accurate Quotient algorithm uses the fact that $b$ and $r_0$ are constant throughout the iterations and calculates $b r_0$ at the beginning. (This calculation can be aimed at $M$ words in floating-point division, but must be calculated to full precision in integer division.)

This takes the calculation and truncation of $q_i \approx P_i r_0$ out of the critical loop, slightly increasing the convergence in each iteration. As a result, overflow that is possible in the next-to-last iteration of Byte Division for $W = 8$ double-precision division does not occur in the Accurate Quotient algorithm.

Although $q_i = p_i r_0$ is not truncated as in Byte Division, additional expense can be avoided by summing the $p_i$'s before multiplying by $r_0$ (rather than calculating the individual partial quotients $p_i r_0$). Incorporating this, the efficient software implementation of the algorithm for most machines is:

• Setup:
    Obtain reciprocal estimate $r_0$.
    Calculate to full target accuracy $N = b r_0$.

• Iterate on:
    Choose next $p_i$ (truncated $P_i$).
    Calculate new partial remainder $P_{i+1} = P_i - (p_i N)$.

• At the end:
    $Q = r_0 \sum_i p_i$.

As with the Byte Division algorithm (and for the same reasons), we assume that an extra word will be used in each $p_i$ to avoid shifting partial remainders.

For machines with word-wide multiply-accumulate-accumulate capability (Section 6), a further refinement is to update the partial remainder using:

\[ P_{i+1} = P_i - p_i N = (P_i - p_i) + p_i(1 - N). \]

Calculating $(1 - N)$ before beginning iterations saves the cost of all subsequent subtractions, since $(P_i - p_i)$, which is just the low-order words of $P_i$, can be accumulated at no cost during the multiplication $p_i(1 - N)$.

As with Byte Division, the radix (and efficiency) is increased for $M > 2$ by using the NR equations to refine $r_0$ to multiple-word accuracy before beginning Accurate Quotient iterations. We refer to this as the NR-AQ algorithm.


4.4 Prescaling and Truncating

Prescaling both divisor and dividend by the reciprocal estimate allows partial quotients to be obtained directly from the partial remainder [7]. Although in hardware partial quotients are often obtained by rounding to improve convergence slightly [8], maintaining positive partial remainders by simply truncating is more efficient in software:

• Setup:
    Obtain reciprocal estimate $r_0$.
    Calculate initial partial remainder to full target accuracy $P_0 = a r_0$.
    Calculate to full target accuracy $N = b r_0$.

• Iterate on:
    Choose next $q_i$ (truncated $P_i$).
    Calculate new partial remainder $P_{i+1} = P_i - (q_i N)$.

• At the end:
    $Q = \sum_i q_i$.

The only difference between this and the Accurate Quotient algorithm is in when $r_0$ is introduced toward the quotient. While prescaling the dividend at the beginning makes the Prescaling and Truncating (P&T) algorithm unsuitable for integer division due to increased difficulty of calculating a remainder, it is slightly more efficient for floating-point implementation. This is because in the Accurate Quotient algorithm $\sum_i p_i$ can be greater than 1, while in the P&T algorithm $\sum_i q_i < 1$, saving a word (and instruction) during the subtraction and leading to a smaller multiprecision multiplication outside of the iterations ($a r_0$ instead of $r_0 \sum_i p_i$). The algorithms are otherwise identical in size and costs of operations.

An extra word in each $q_i$ is again assumed to avoid shifting, and efficiency is again increased for multiply-accumulate-accumulate machines by calculating $(1 - N)$ at the beginning. Finally, as in the previous two algorithms, calculating a multiple-word reciprocal before beginning high-radix P&T iterations is important for larger $M$'s; we refer to this as the NR-P&T algorithm.

4.5 Newton-Raphson Division

In its most basic form, the well-known Newton-Raphson method of division refines a reciprocal estimate of the divisor and then multiplies it by the dividend to obtain a quotient [9]:

• Setup:
    Obtain reciprocal estimate $r_0$.

• Iterate on:
    $r_{i+1} = r_i(2 - b r_i)$.

• At end:
    $Q = a(r_{\mathrm{final}})$.


Each iteration produces a quadratically converging reciprocal estimate that requires two multiplications and one subtraction, all of which must be performed sequentially.

In the technical report [11] we show that this algorithm is never as efficient as the NR-Byte algorithm with a reciprocal refined to $M/2$-word accuracy. Newton-Raphson reciprocal refinement remains, however, an important part of the NR-Byte algorithm (as well as the other hybrid algorithms), and is examined in detail in Section 5.

4.6 Goldschmidt’s Algorithm

Goldschmidt's algorithm [10], another popular method of implementing division in hardware, obtains a quotient by choosing a series of multipliers $m_0, m_1, \ldots$ to transform $a/b$ to $\approx Q/1$:

\[ \frac{a}{b} = \frac{n_0}{d_0} = \frac{n_0}{d_0} \cdot \frac{m_0}{m_0} = \frac{n_1}{d_1} \cdot \frac{m_1}{m_1} = \cdots \rightarrow \; \approx \frac{Q}{1}. \]

By choosing each multiplier $m_i = 2 - d_i$, the denominator converges quadratically to 1 and the numerator similarly to $a/b$. (The refinement of the denominator uses NR reciprocal refinement of $1/b$ for $b = 1$.)

Given the assumed availability of a reciprocal estimate $r_0$, the performance of Goldschmidt's algorithm can be improved by prescaling $a$ and $b$ before beginning iterations. The complete algorithm is:

• Setup:
    Obtain reciprocal estimate $r_0$.
    Set $n_0 = a r_0$.
    Set $d_0 = b r_0$.

• Iterate on:
    Calculate new multiplier $m_i = 2 - d_i$.
    Refine the numerator $n_{i+1} = n_i(m_i)$.
    Refine the denominator $d_{i+1} = d_i(m_i)$.

• At the end:
    Calculate $m_{\mathrm{final}} = 2 - d_i$.
    $Q = n_i(m_{\mathrm{final}})$.
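The steps above can be sketched in Python using exact rationals, which isolates the convergence behavior from any truncation effects. The function name and signature are assumptions made for illustration; no low-precision truncation of the multipliers is modeled.

```python
from fractions import Fraction

def goldschmidt_divide(a, b, r0, iterations):
    """Goldschmidt division with prescaling by r0, in exact arithmetic."""
    n, d = a * r0, b * r0            # prescale numerator and denominator
    for _ in range(iterations):
        m = 2 - d                    # multiplier for this iteration
        n, d = n * m, d * m          # n/d is unchanged; d moves toward 1
    return n * (2 - d)               # final multiplier applied to n only

a, b = Fraction(3, 4), Fraction(3, 2)
r0 = Fraction(170, 256)              # one-word underestimate of 1/b
q = goldschmidt_divide(a, b, r0, iterations=1)
print(a / b - q)                     # 1/8589934592
```

With $1 - b r_0 = 2^{-8}$ in this example, one iteration plus the final multiplication leaves an error of $(a/b) \cdot 2^{-32}$, consistent with the accuracy doubling at each multiplication by $2 - d_i$.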

The number of words of each $m_i$ significantly affects the efficiency of this algorithm. Because each iteration doubles the number of accurate words, little convergence is lost by truncating $m_i$ to the size of this expected accuracy [2]. Since both $n_i$ and $d_i$ will be multiplied by this same (truncated) multiplier, the value of

\[ \frac{n_{i+1}}{d_{i+1}} = \frac{m_i}{m_i} \cdot \frac{n_i}{d_i} \]

is preserved with considerable savings.

A variation of Goldschmidt's algorithm examined in the technical report is to calculate the multiplier $m_i$ using the extended NR equation,

\[ m_i = 1 + (1 - d_i) + (1 - d_i)^2, \]

where each iteration can triple the number of accurate words. No form of Goldschmidt's algorithm was found to be competitive with other algorithms for multiprecision division, however [11].

4.7 Low-Radix Algorithms

Low-radix algorithms (those of radix significantly less than word size) do not adapt well to software because they cannot take advantage of such techniques as carry-save addition, shifting or efficient multiplexing between values. A low-radix algorithm is likely to be similar to the original Byte Division algorithm (with shifting):

• Setup:
    Set initial partial remainder $P_0 = a$.

• Iterate on:
    Obtain a partial quotient $q_i$.
    Obtain new partial remainder $P_{i+1} = P_i - b q_i$.
    Shift $P_{i+1}$ to remove leading 0's.

• At the end:
    Obtain final quotient $Q = \sum_i q_i$.

The only way a low-radix algorithm can perform these steps significantly faster than Byte Division (which it needs to do to compensate for its lower radix) is to use table lookup of $b q_i$ based on $q_i$. To evaluate this strategy, we can compare its cost with other algorithms.

Letting $m$ represent the cost of processing one partial product in a multiprecision multiplication (i.e., multiplication latency plus accumulations), the cost of precalculating the $b q_i$'s for a radix-$2^k$ algorithm when $B = M$ will be $\approx m 2^k M$. An underestimate of the cost of iterations (based only on the second and third steps) will be $n(W/k)(M^2/2)$, where $n$ is the number of instructions needed in each iteration to update one word of $P_i$ to $P_{i+1}$ (including shifting). Since the best algorithms perform multiprecision division in $\approx m(M^2/2)$ instructions (Section 8), a low-radix algorithm will be practical only when:

\[ m 2^k M + \frac{n W M^2}{2k} < m \frac{M^2}{2}, \]

that is, when

\[ m 2^{k+1} + \frac{n W M}{k} < m M. \]

(A similar result for $B < M/2$ has $m 2^k$ in the first term.)

Even if we relax this constraint to $2^{k+1} < M$ and $nW/k < m$, which ensures only half the performance of other algorithms, the dual constraints

\[ k < \lg(M) - 1, \qquad k > nW/m \]

are difficult to meet.
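To make the difficulty concrete, the inequalities can be evaluated for sample parameters. The Python sketch below is illustrative only: the function names are introduced here, and the values $m = 4$ and $n = 2$ are hypothetical machine costs, not figures from the study.

```python
from math import log2

def low_radix_practical(m, n, W, k, M):
    """Strict test: m*2^k*M + n*(W/k)*(M^2/2) < m*M^2/2."""
    return m * 2 ** k * M + n * (W / k) * (M ** 2 / 2) < m * M ** 2 / 2

def relaxed_k_bounds(m, n, W, M):
    """Relaxed constraints: n*W/m < k < lg(M) - 1."""
    return n * W / m, log2(M) - 1

# Hypothetical costs: m = 4 per partial product, n = 2 per updated word
lower, upper = relaxed_k_bounds(m=4, n=2, W=8, M=8)
print(lower, upper)                                   # 4.0 2.0
print(low_radix_practical(m=4, n=2, W=8, k=2, M=8))   # False
```

For these values the lower bound on $k$ exceeds the upper bound, so no radix satisfies even the relaxed constraints.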

While it is possible to construct conditions where this approach is practical, they are not likely to occur. High multiplication latency, very large targets, small word size, and efficient memory access are all features that make low-radix algorithms more competitive. Also, shifting can be avoided (at the cost of increased precalculations and memory demands) by calculating $b q_i$'s for each position each $q_i$ can occur in a word, reducing the value of $n$ above.

5 Newton-Raphson Reciprocal Refinement

We now take a detailed look at using the Newton-Raphson method to calculate successive approximations to1=b. After obtaining an initial reciprocal estimater0 by some method (such as table look-up or polynomialapproximation), subsequent estimates are typically obtained using the equation:

ri+1 = ri [1 + (1� bri)]= ri [2� bri] ;

(1)

13

whereri converges quadratically toward an accurate value. The Newton-Raphson equation can be extendedto the more general equation:

ri+1 = rih1 + (1� bri) + (1� bri)

2 + : : :+ (1� bri)ni; (2)

where equation (1) corresponds ton = 1 [12]. Each inner term of equation (2) can be calculated from theprevious one with a single multiprecision multiplication. In terms of convergence, every additional termlinearly extends the accuracy of the estimate. Thus, ifri is accurate to one word,ri[1 + (1 � bri)] will beaccurate to two words,ri[1 + (1� bri) + (1 � bri)

2] to three words, and so on.One can thus obtain any degree of accuracy by calculating just one iteration of equation (2) provided

enough inner terms are calculated. In fact, one obtains (theoretically) exactly the same results in this manneras by iterating repeatedly over equation (1). For example, two iterations of equation (1) lead to:

$$r_{i+1} = r_i\,[2 - b r_i]$$

$$\begin{aligned}
r_{i+2} &= r_{i+1}\,[2 - b r_{i+1}] \\
        &= [\,r_i(2 - b r_i)\,] \times [\,2 - b\,[\,r_i(2 - b r_i)\,]\,] \\
        &= r_i\left[1 + (1 - b r_i) + (1 - b r_i)^2 + (1 - b r_i)^3\right],
\end{aligned}$$

which is equation (2) with $n = 3$.

However, although mathematically equivalent, the efficiency of NR reciprocal refinement is highly affected by the choice of equations. The following sections take a closer look at this and other issues.
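The equivalence between two iterations of equation (1) and one iteration of equation (2) with $n = 3$ can be checked with exact rational arithmetic. The following is a minimal sketch; the function names and the sample values of `b` and `r0` are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

def nr_step(b, r):
    """One iteration of equation (1): r * (2 - b*r)."""
    return r * (2 - b * r)

def extended_step(b, r, n):
    """One iteration of equation (2), with inner terms up to (1 - b*r)**n."""
    e = 1 - b * r
    return r * sum(e**k for k in range(n + 1))

b = Fraction(3, 2)    # hypothetical divisor with 1 <= b < 2
r0 = Fraction(5, 8)   # hypothetical rough under-estimate of 1/b

twice = nr_step(b, nr_step(b, r0))   # two iterations of equation (1)
once = extended_step(b, r0, 3)       # one iteration of equation (2), n = 3
print(twice == once, float(1 / b - once))
```

With exact arithmetic both routes leave the same remaining error, $(1 - br_0)^4 / b$; the difference between them only appears once the precision of the intermediate products is limited.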

5.1 Efficient NR Implementation

In Newton-Raphson reciprocal refinement aimed at $R$ words, a new reciprocal estimate can be obtained from the previous estimate and $b$. The precision of all calculations within an iteration can be limited to the expected accuracy of the result of that iteration (this being a less costly process than maintaining all partial remainders to $R$-word accuracy and updating them based on corrections to the reciprocal).

The following example illustrates this method, where a ≈6-byte reciprocal of b = 1. 186 207 145 76 27 173 is calculated in a W=8 processor. Assuming an initial 1-byte estimate of 147 (accurate reciprocal ≈ 148 0 0 0 0 0), in order to obtain 6 bytes we will use the following two iterations:

$$r_1 = r_0\,[1 + (1 - b r_0) + (1 - b r_0)^2]$$

$$r_2 = r_1\,[1 + (1 - b r_1)].$$

Since the first iteration will triple the number of accurate bytes to three, we begin with calculations aimed at producing 3-byte results (⊠ marks an uncalculated partial product):

       1.   186   207   145    76    27   173   b
     ×      147   r0
            254    69    48  (67)     ⊠     ⊠     ⊠   b r0

            255   255   254   ≈1
     −      254    69    48   b r0
              1   186   206   (1 − b r0)

              1   186   206   (1 − b r0)
     ×        1   186   206   (1 − b r0)
                          0 (206)     ⊠     ⊠
                    1    65  (36)     ⊠
     +        0     1   186 (206)
              0     2   251   (1 − b r0)^2

              1   186   206   (1 − b r0)
     +        0     2   251   (1 − b r0)^2
              1   189   201   (1 − b r0) + (1 − b r0)^2

       1.     1   189   201   1 + (1 − b r0) + (1 − b r0)^2
     ×      147   r0
            147   255   250 (107)   r1 = r0[1 + (1 − b r0) + (1 − b r0)^2]

In the second step, we subtracted from 255 255 254 because of omitted partial products in the previous multiplication (a loss of up to ≈1 lsb), and the truncation of what was multiplied (a loss of another ≈1 lsb potentially, though just (67) here), preventing the possibility of an over-estimate of $(1 - br_0)$.

In the second iteration, we expect to double the number of accurate bytes, so we aim our calculations at 6-byte results:

       1.   186   207   145    76    27   173   b
     ×      147   255   250   r1
                    1   176   110   179   228  (56)     ⊠     ⊠
              1   185    20   193   186   206 (229)     ⊠
     +      254    69    48   110   179   228  (87)
            255   255   245   159    34   150   b r1

            255   255   255   255   255   251   ≈1
     −      255   255   245   159    34   150   b r1
              0     0    10    96   221   101   (1 − b r1)

              0     0    10    96   221   101   (1 − b r1)
     ×      147   255   250   r1
                                     10    33 (192)     ⊠     ⊠
                               10    86   124  (35)     ⊠
     +                    5   245   159    32 (255)
              0     0     5   255   255   189   r1(1 − b r1)

            147   255   250   r1
     +        0     0     5   255   255   189   r1(1 − b r1)
            147   255   255   255   255   189   r2 = r1[1 + (1 − b r1)]

The most significant bytes of the first two operations above did not need to be calculated: given the accuracy of $r_0$ (and thus $r_1$), the two leading bytes of $(1 - br_1)$ must be 0.

To achieve the same 6-byte result in a single iteration of the extended NR equation (2) using

$$r_1 = r_0\left[1 + (1 - b r_0) + (1 - b r_0)^2 + \dots + (1 - b r_0)^5\right]$$

would be considerably more costly since there would be more terms to calculate and since each would need to be calculated to full 6-byte target accuracy. In general, it is better to use multiple iterations any time a result needs much more than triple the number of accurate words of an estimate.
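The two-iteration schedule of the example (triple to 3 bytes, then double to 6 bytes) can be sketched as follows. This is only an approximation of the method: each quantity is computed exactly and then truncated downward to the working precision, rather than by the row-by-row low-precision multiplication shown above, so the low-order bytes can differ from those in the worked example. The helper names (`from_bytes`, `truncate`, `to_bytes`, `refine`) are hypothetical.

```python
from fractions import Fraction
from math import floor

W = 8  # word size in bits, as in the example

def from_bytes(int_part, digits):
    """Value of int_part . d1 d2 ... in radix 2**W."""
    x = Fraction(int_part)
    for i, d in enumerate(digits, start=1):
        x += Fraction(d, 1 << (W * i))
    return x

def truncate(x, n):
    """Round x down to n fractional words, so it is never over-estimated."""
    scale = 1 << (W * n)
    return Fraction(floor(x * scale), scale)

def to_bytes(x, n):
    """The first n fractional words of x (0 <= x < 1), most significant first."""
    v = floor(x * (1 << (W * n)))
    return [(v >> (W * (n - 1 - i))) & ((1 << W) - 1) for i in range(n)]

def refine(b, r, terms, n):
    """One step of equation (2) with `terms` inner terms, kept to n words."""
    e = truncate(1 - b * r, n)                    # under-estimate of (1 - b*r)
    series = sum(e**k for k in range(terms + 1))  # 1 + e + ... + e**terms
    return truncate(r * series, n)

b = from_bytes(1, [186, 207, 145, 76, 27, 173])
r0 = from_bytes(0, [147])
r1 = refine(b, r0, terms=2, n=3)   # 1 byte -> about 3 bytes
r2 = refine(b, r1, terms=1, n=6)   # 3 bytes -> about 6 bytes
print(to_bytes(r1, 3), to_bytes(r2, 6), to_bytes(1 / b, 6))
```

Because every truncation rounds down and $(1 - br)$ is positive for an under-estimate, both `r1` and `r2` remain under-estimates of $1/b$, which is the property the subtraction from 255 255 254 protects in the worked example.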

5.2 NR Accuracy Boost

During NR reciprocal refinement, the accuracy of a result can be increased using an extra term from the extended NR equation (2). Consider the case where an $n$-word reciprocal under-estimate $r_i$ has at least one accurate bit in its least significant word:

$$1/b - r_i < 2^{W-1} \times 2^{-nW}$$

$$b\,(1/b - r_i) < b\,(2^{W-1} \times 2^{-nW}).$$

Since $b < 2$,

$$(1 - b r_i) < 2^{W} \times 2^{-nW}$$

$$(1 - b r_i)^2 < 2^{2W} \times 2^{-2nW}.$$

Thus when calculating

$$r_{i+1} = r_i\left[1 + (1 - b r_i) + (1 - b r_i)^2\right], \tag{3}$$

the last term calculated to $2^{-2nW}$ accuracy will contain at most two non-zero words, something inexpensive to calculate as shown below (leading zeros have been omitted from all values):

[Diagram: low-precision multiplication of $(1 - br_i)$ by $(1 - br_i)$, in which most partial products are left uncalculated and the calculated rows are summed to give $(1 - br_i)^2$.]

Since an iteration of equation (3) can triple the number of accurate words when calculations are carried out to sufficient accuracy, the only significant error in $r_{i+1}$ (calculated to only $2n$-word accuracy) will be the result of low-precision arithmetic in the iteration at hand.

A slightly different version of this technique could be used to boost accuracy in the example in the previous section:

$$\begin{aligned}
r' &= r_2 + r_1(1 - b r_1)^2 \\
   &= r_2 + (1 - b r_1)\,[\,r_1(1 - b r_1)\,] \\
   &= r_2 + (0\ 0\ 10\ 96\ 221\ 101) \times (0\ 0\ 5\ 255\ 255\ 189) \\
   &\approx 147\ 255\ 255\ 255\ 255\ 249.
\end{aligned}$$

The correction here was calculated using low-precision arithmetic as pictured above, but even a 1×1-byte multiplication leads to an improved reciprocal estimate of 147 255 255 255 255 239, a slightly better result than would be obtained if all calculations in the original iterations had been carried out to full precision.
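The boost can be reproduced numerically from the byte values quoted in the example. The sketch below compares the cheapest 1×1-byte correction with the same correction computed exactly; the helper names are hypothetical, and the exact variant stands in for the low-precision multiplication of the text, so its low-order byte need not match the 249 quoted above.

```python
from fractions import Fraction
from math import floor

W = 8  # word size in bits

def from_bytes(int_part, digits):
    """Value of int_part . d1 d2 ... in radix 2**W."""
    x = Fraction(int_part)
    for i, d in enumerate(digits, start=1):
        x += Fraction(d, 1 << (W * i))
    return x

def to_bytes(x, n):
    """The first n fractional words of x (0 <= x < 1), most significant first."""
    v = floor(x * (1 << (W * n)))
    return [(v >> (W * (n - 1 - i))) & ((1 << W) - 1) for i in range(n)]

b = from_bytes(1, [186, 207, 145, 76, 27, 173])
r1 = from_bytes(0, [147, 255, 250])
r2 = from_bytes(0, [147, 255, 255, 255, 255, 189])

e1 = 1 - b * r1                     # exact (1 - b*r1)
boosted_exact = r2 + r1 * e1**2     # correction term at full precision

# Cheapest variant from the text: multiply only the leading non-zero bytes,
# 10 (third byte of 1 - b*r1) by 5 (third byte of r1*(1 - b*r1)).
# Both sit in byte position 3, so the product has weight 2**(-6*W).
boosted_1x1 = r2 + Fraction(10 * 5, 1 << (W * 6))

print(to_bytes(boosted_1x1, 6), to_bytes(boosted_exact, 6))
```

Both variants stay below $1/b$ here, so the boost tightens the under-estimate without risking an over-estimate.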

6 Floating-point division in small-word processors

We now present a semi-exhaustive case study for implementing single- and double-precision division in 8-bit and 16-bit processors.

Three multiplier configurations are considered: multiply (mult), multiply-accumulate (mult-acc), and multiply-accumulate-accumulate (mult-acc-acc). In the latter two, a multiply can accumulate the high-order word of a previous multiplication from a special-purpose register (MHI), allowing efficient addition within a row of partial products. In the mult-acc-acc model, a second user-specified register can be added to the product (a feature incorporated in UCSC's Kestrel parallel processor [5]), allowing a previous row of partial products to be accumulated as well. (On an $n$-bit processor the maximum value of the result of such an instruction is $(2^n - 1)^2 + (2^n - 1) + (2^n - 1) = 2^{2n} - 1$, just small enough for two-word representation.)

In all models, it is assumed that each multiply overwrites MHI and that an MHI value can be either explicitly stored, used as an operand in an addition, or accumulated in a mult-acc or mult-acc-acc operation.
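To illustrate why the two-word bound matters, the following sketch models a mult-acc-acc instruction in software and uses it for a schoolbook multiprecision multiplication, issuing one such instruction per partial product. The function names and the little-endian word-list representation are assumptions made for this illustration; they do not describe Kestrel's actual instruction set.

```python
W = 8                   # word size in bits (the 8-bit case)
MASK = (1 << W) - 1     # largest single-word value, 2**W - 1

def mult_acc_acc(x, y, mhi, acc):
    """Model of one instruction: x*y + mhi + acc, returned as (new MHI, low word)."""
    t = x * y + mhi + acc
    assert t <= (1 << (2 * W)) - 1   # the sum always fits in two words
    return t >> W, t & MASK

def multiply(a, b):
    """Full product of two little-endian word lists, one row per word of b."""
    result = [0] * (len(a) + len(b))
    for j, y in enumerate(b):
        mhi = 0
        for i, x in enumerate(a):
            # Accumulate the previous row (result[i + j]) and the previous MHI.
            mhi, result[i + j] = mult_acc_acc(x, y, mhi, result[i + j])
        result[len(a) + j] = mhi     # store the final MHI of this row
    return result

def to_int(words):
    """Integer value of a little-endian word list."""
    return sum(w << (W * k) for k, w in enumerate(words))

a = [173, 27, 76]
b = [250, 255, 147]
print(to_int(multiply(a, b)) == to_int(a) * to_int(b))
```

The internal assertion never fires, even for all-ones operands, which is the $(2^n - 1)^2 + 2(2^n - 1) = 2^{2n} - 1$ bound stated above.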

Table I: Comparison of division algorithms for W=8.

    SINGLE PRECISION (T=26)
                    NR-Byte  NR-Byte  P&T   Gold  Restoring
                    R=2      R=1
    Mult            48       56       56    75    72
    Mult-acc        36       44       42    57    62
    Mult-acc-acc    30       39       34    43    50

    DOUBLE PRECISION (T=55)
                    NR-Byte  NR-Byte  NR-Byte  NR-P&T  NR-P&T  Gold  Restoring
                    R=4      R=2      R=1      R=2     R=1
    Mult            178      184      235      227     210     283   214
    Mult-acc        124      133      174      154     142     196   177
    Mult-acc-acc    92       102      147      106     98      129   149

Table II: Comparison of division algorithms for W=16.

    SINGLE PRECISION (T=26)
                    Byte  P&T  Gold
    Mult            11    15   19
    Mult-acc        10    12   15
    Mult-acc-acc    10    12   14

    DOUBLE PRECISION (T=55)
                    NR-Byte  NR-Byte  P&T  Gold
                    R=2      R=1
    Mult            54       59       63   80
    Mult-acc        39       46       44   59
    Mult-acc-acc    31       41       36   45

In mult machines, it is further assumed that a carry from an addition can be latched and used after a subsequent multiply instruction. This increases efficiency when a row of partial products needs to be added to previously calculated values (such as a previous row of partial products), since then the MHI from each multiplication can be added to a previous value, with the carry latched until the next multiplication's MHI is available, and so on. This process, which can also be used when a multiplication result needs to be subtracted from a previous value, also reduces the number of values needing to be stored at any given time.

Turning to problem specifics, final target accuracies are chosen to be those needed for efficient rounding in single- and double-precision division: 26 and 55 bits. As discussed in Section 4, these targets allow inexpensive rounding to within 1 lsb, and exact quotient when expressible, by adding $2^{-26}$ or $2^{-55}$ before normalizing.

Tables I and II show instruction counts required by the algorithms on $W=8$ and $W=16$ machines for single- and double-precision division ($T=26$ and $T=55$). Note that the Byte algorithm is equivalent to the NR-Byte algorithm with $R=1$ (the same is true for the P&T algorithm). The Accurate Quotient algorithm is not included because of its similarity to the P&T algorithm and because it is always slightly less efficient for floating-point division. Also not included is the Restoring algorithm for $W=16$, where it is even less competitive than for $W=8$. Implementation details of all algorithms can be found in the technical report [11].

The NR-Byte algorithm with an appropriately refined reciprocal performs best in all cases. While the NR-P&T algorithm can also be reasonably efficient, Goldschmidt's algorithm and Restoring Division are not competitive.

In general, when $W$ and $T$ are scaled together by powers of 2 (with $a$ and $b$ also of size $T$), algorithm implementations will be the same, except that for a larger $W$ the effect of low-precision arithmetic will be smaller (each bit of error introduced representing a smaller portion of the least significant word). Such scaling is illustrated by the similarity between the costs of $W=8$ single-precision and $W=16$ double-precision division (Tables I and II), where in each case $M=4$. The differences between the instruction counts result from $a$ and $b$ not being scaled by a factor of 2, requiring 3-word representation for $W=8$ single-precision but 4-word representation for $W=16$ double-precision. (The fact that target accuracies are not exactly doubled here does not affect algorithm costs.)

Table III: Cost of arithmetic operations for algorithms.

Parameter values: $A = B = M$

Restoring: $\frac{1}{2}\left[3M^2 + 35M - 40\right]$
NR-Byte: $\frac{1}{2}\left[3M^2 + 10M + 3R^2 + 7R + 2\log_2 R - 44 + \frac{3M^2 + 6M}{R}\right]$
NR-P&T: $\frac{1}{2}\left[3M^2 + 9MR - 3M - 6R^2 + 32R + 2\log_2 R - 42 + \frac{3M^2 + 4M}{R}\right]$

Parameter values: $A \le M$, $B = M$

Restoring: (same as for $A=B=M$)
NR-Byte: (same as for $A=B=M$)
NR-P&T ($A > R$): $\frac{1}{2}\left[3M^2 + 3MR - 3M + 6AR - 3R^2 + 31R + 2\log_2 R - 44 + \frac{3M^2 + 4M}{R}\right]$
NR-P&T ($A \le R$): $\frac{1}{2}\left[3M^2 + 3MR - 3M + 6AR - 3R^2 + 29R + 2\log_2 R + 2A - 44 + \frac{3M^2 + 4M}{R}\right]$

Parameter values: $A = M$, $B \le M$

Restoring: $\frac{1}{2}\left[6MB + 32M - 3B^2 + 3B - 36\right]$
NR-Byte ($B \le R$): $\frac{1}{2}\left[3MR + 11M + 6BR - 6B^2 + 10B\log_2(R/B) + 4B - 4\log_2 B - 2R + 6\log_2 R - 30 + \frac{3MB^2 + 7MB}{R}\right]$
NR-Byte ($B > R$): $\frac{1}{2}\left[6MB + 12M - 12BR + 15R^2 - 18B + 27R + 2\log_2 R - 30 + \frac{6MB}{R}\right]$
NR-P&T: $\frac{1}{2}\left[6MB + 9MR + 15M - 3B^2 - 6R^2 - 14B + 6R + 2\log_2 R - 50 + \frac{6MB + 8M - 3B^2 - 2B}{R}\right]$

Integer Division

Restoring: $3AB + 16A - 3B^2 - 16B + 9$
NR-Byte ($B \le R$): $\frac{1}{2}\left[3AR + 11A + 3BR - 3B^2 + 10B\log_2(R/B) + R + 6\log_2 R - 4\log_2 B - 15 + \frac{3AB^2 + 7AB - 3B^3 - 4B^2 + 7B}{R}\right]$
NR-Byte ($B > R$): $\frac{1}{2}\left[6AB + 12A - 6B^2 - 18B + 3R^2 + 21R + 2\log_2 R - 22 + \frac{6AB - 6B^2 + 6B}{R}\right]$

7 Multiprecision Division for Arbitrary Parameters

To address integer and floating-point division with arbitrary parameters, we have developed closed-form expressions to approximate the number of add, multiply, store-MHI, and conditional instructions (outside of loop management) required by each algorithm. While the equations are based on mult machines and assume a multiplication latency of one cycle, results for other multiplier configurations would be similar, as discussed in Section 8. The machine model is otherwise the same as presented in the previous section.

We assume that $M = \lceil T/W \rceil + 1$ in this study. This assumption is made to deal with the Restoring algorithm, which will not be required to calculate a final quotient word in the $M$th position, which places it on fairer terms with the hybrid algorithms, which also do not produce an accurate $M$th word.

Equations for floating-point and integer division are presented in Table III for different ranges of $A$, $B$ and $M$.

As defined earlier, $R$ represents the number of words in a reciprocal estimate. The equations for the hybrid algorithms were developed under the assumptions that $R$ is a power of two (used in NR reciprocal refinement) and that $M$ is a multiple of $R$ (used in Byte iterations). Algorithm costs are then obtained by differentiating these equations to find optimal $R$'s, whether or not the conditions above are met, so that results will sometimes be slight underestimates.

For $B < M/2$, two equations are needed for the NR-Byte algorithm depending on the relative sizes of $B$ and $R$ so that the multiplication $br$ can be calculated efficiently, using the smaller value as multiplier. For the NR-P&T algorithm, optimal $R$ is always less than $B$, but two equations are needed depending on the relative sizes of $A$ and $R$ when $A < M/2$.

Goldschmidt's algorithm was dropped from consideration in this study after preliminary analysis showed it to be unpromising. Derivation of the equations, including assumptions made relating to overflow prevention and other issues, is provided in the technical report [11].
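As an illustration of how the Table III expressions can be used, the sketch below evaluates the $A=B=M$ costs of Restoring Division and NR-Byte and searches for the smallest $M$ at which Restoring is no more expensive. It assumes the formulas read as reconstructed in Table III above, and it replaces the differentiation step with a simple grid scan over real-valued $R$; the function names and the step size are hypothetical.

```python
from math import log2

def restoring_cost(M):
    """Table III, A = B = M, Restoring Division."""
    return (3 * M**2 + 35 * M - 40) / 2

def nr_byte_cost(M, R):
    """Table III, A = B = M, NR-Byte with an R-word reciprocal."""
    return (3 * M**2 + 10 * M + 3 * R**2 + 7 * R + 2 * log2(R) - 44
            + (3 * M**2 + 6 * M) / R) / 2

def best_nr_byte(M, step=0.05):
    """Minimum NR-Byte cost over real R in [1, M], found by a grid scan."""
    steps = int((M - 1) / step)
    return min(nr_byte_cost(M, 1 + k * step) for k in range(steps + 1))

# Smallest M at which Restoring Division is at least as cheap as NR-Byte.
crossover = next(M for M in range(4, 200)
                 if restoring_cost(M) <= best_nr_byte(M))
print(crossover, restoring_cost(15), best_nr_byte(15))
```

Because $R$ is treated as a continuous variable here, the NR-Byte figures are slight underestimates in the same sense as described above for the differentiated equations.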

7.1 Floating-Point Division

The graphs in Figure 1 compare algorithm costs for floating-point division, where $A$ and $B$ (the number of words of $a$ and $b$) are either equal to or less than $M$ (the number of target words).

[Figure 1 plots: instruction count (Instrs) against one parameter per panel. (a) $A=B=M$, plotted against $M$ up to 15; (b) $A=B=M$, plotted against $M$ from 20 to 60; (c) $B=M=10$, plotted against $A$; (d) $B=M=25$, plotted against $A$; (e) $A=M=15$, plotted against $B$; (f) $A=M=70$, plotted against $B$.]

Figure 1: Floating-point arithmetic costs of the NR-Byte (solid), Restoring (dashed) and NR-P&T (dotted) algorithms.

For $A=B=M$, the NR-Byte algorithm is optimal for small $M$, as expected from the small-word study (Figure 1 (a) and (b)). The Restoring algorithm becomes increasingly competitive for large $M$'s and becomes optimal for $M \gtrsim 70$. This is because it calculates each $q_i$ at a fixed rate, while the hybrid algorithms require $\approx R/2$ partial products to calculate each word of $q_i$ (a larger $M$ leading to a larger $R$).

As can be derived from Table III, the asymptotic cost of all algorithms is $3M^2/2$, the same as that of $M$-by-$M$ word low-precision multiplication aimed at an $M$-word result. For the hybrid algorithms, this asymptotic cost is achieved for $R = \sqrt{M}$ or $\log M$, etc., but not for $R=M/k$ or $R=k$ for small constant $k > 1$.

For small values of $A$ (Figure 1 (c) and (d)), the NR-P&T algorithm gains the most due to the reduced cost of $ar$, which it must calculate to $M$-word precision. The NR-Byte algorithm gains only when $A<R$, when the first partial quotient is less expensive. (This leads to larger optimal $R$'s for smaller $A$'s.) These effects are not significant, however, the only change in optimal performance being a slight decrease in cost for the NR-Byte algorithm when $A$ is small enough.

For small values of $B$ (Figure 1 (e) and (f)), NR-Byte is significantly better than the other algorithms, due to the efficiency with which it calculates partial remainders. When $B<R$, each partial remainder requires $\approx B^2/2$ partial products (for $R$ words of quotient), instead of $\approx BR$ partial products for Restoring Division and $\approx (BR + R^2/2)$ for NR-P&T. When $B>R$, each $P_{i+1}$ requires $\approx (BR - R^2/2)$ partial products in NR-Byte, diminishing its advantage. As a result, Restoring Division becomes optimal for larger values of $B$ as long as $M$ is large enough (i.e. when $q_i$ calculations in the NR-Byte algorithm become expensive enough).

Savings from smaller values of $A$ and $B$ are independent in the NR-P&T algorithm, and the value of $A$ does not affect the Restoring algorithm. Only the NR-Byte algorithm can gain extra savings from having both $A$ and $B$ small. When $B<R$, after enough iterations have occurred to zero the leading $A$ words of the partial remainder, each subsequent $q_i$ calculation will require a $(B+1)$-by-$R$ word multiplication (instead of an $(R+1)$-by-$R$ word multiplication when either $A$ or $B$ is large). This is because each $P_{i+1} = P_i - bq_i$ calculation will consist of an $(R+1)$-by-$B$ word multiplication resulting in a $(B+R+1)$-word result, followed by a subtraction which zeros the leading $R$ words, leaving a $(B+1)$-word $P_{i+1}$.

[Figure 2 plots: instruction count (Instrs) against $B$ for (a) $A=11$ and (b) $A=31$.]

Figure 2: Integer division arithmetic costs of the NR-Byte (solid) and Restoring (dashed) algorithms.

7.2 Integer Division

In the case of integer division, we assume that both $a$ and $b$ have been left-shifted so that $b$ has a leading word of ‘1’ and $(A-B)$ quotient words must be calculated. (The leading ‘1’ of $b$ is not counted in $B$.) Although we do not address this further, the final remainder will likely need to be right-shifted.

For integer division, partial remainder calculations are best carried out to full precision. If instead they are maintained to only $(A-B+1)$ fractional words, the minimum necessary to obtain a good quotient estimate (low-precision arithmetic introducing error into the least significant word), the expense of a final $(a - bQ)$ needed to check for quotient overestimate and to obtain an accurate final remainder will lead to a less efficient algorithm.

The least significant quotient word in each iteration of NR-Byte is not accurate (usually being corrected by the overlapping leading word of the following $(R+1)$-word partial quotient). Thus, $(A-B+1)$ words of quotient will be needed to produce $(A-B)$ near-accurate words. The accuracy of the $(A-B)$'th word will still need to be checked after calculating a final remainder.

As in the floating-point study, the equations for NR-Byte are derived under the assumption that $R$ is a power of two and that it divides $(A-B+1)$, and can lead to slight underestimates when this is not the case.

The NR-P&T algorithm is not a good choice for integer division because prescaling $a$ makes final remainder calculation difficult. While the NR-AQ algorithm does not have this liability (and is virtually identical in cost otherwise), we do not investigate it further either, due to the high cost of partial remainder calculation, where each $p_i$ must be multiplied by an $(R+B)$-word quantity. Since the corresponding multiplication by $q_i$ in the NR-Byte algorithm involves a smaller $B$-word quantity, and since NR-Byte also calculates quotients more efficiently (requiring $\approx MR/2$ instead of $\approx (MR - R^2/2)$ partial products total), it will always be more efficient.

This leaves only the NR-Byte and Restoring algorithms as competitors for implementing integer division. The NR-Byte algorithm is again most efficient for small divisors (requiring $\approx 60\%$ as many instructions for $B=1$), with the Restoring algorithm optimal for larger values of $B$ (Figure 2 (a) and (b)).

8 Complications and Observations

Three issues that will affect division performance are multiplication latency, storing and loading costs when insufficient registers are available, and the degree to which instruction-level parallelism aids algorithms.

Multiplication latency is the simplest to address. In the small-word study above, the optimal NR-Byte algorithms require the fewest multiplications in all cases, so that their superiority will be maintained with higher multiplication latency.

The same should be true for higher precision division. Since multiprecision multiplication increasingly dominates algorithms for large $M$'s, algorithm costs will largely be determined by the number of partial products needing to be processed, so that again, optimal algorithms will contain the fewest partial products.

While the mult machine used in our study requires three cycles (one multiply and two additions) to process each partial product, other multiplier configurations will lead to similar relative algorithm performance. For example, with a mult-acc machine, only two instructions would be needed, and all algorithms would require $\approx 2/3$ as many instructions as given in the table. A similar analysis, based on the cost of processing a partial product, can be used to predict costs for processors with other multiplication latencies, pipelined multipliers, and/or superscalar capabilities.

Note that the $3M^2/2$ asymptotic cost of the best algorithms for $A=B=M$ floating-point division represents $M^2/2$ partial products, which is an asymptotic lower bound using standard arithmetic techniques since calculating partial remainders requires $[M + (M-1) + (M-2) + \dots + 1] \approx M^2/2$ partial products. When $B<M$, algorithm cost is less dominated by partial remainder calculations. For example, when $B<R$ the asymptotic number of partial products in the NR-Byte algorithm is $\approx MR/2$ for partial quotient calculations plus $\approx MB^2/(2R)$ for partial remainder calculations.
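The count behind this lower bound can be written out as a short worked step, assuming (as for the mult machine described above) three instructions per partial product:

```latex
\sum_{k=1}^{M} k \;=\; \frac{M(M+1)}{2} \;\approx\; \frac{M^2}{2}
\quad\text{partial products},
\qquad
3 \cdot \frac{M^2}{2} \;=\; \frac{3M^2}{2}
\quad\text{instructions}.
```

The lower-order $M/2$ term is dropped because only the asymptotic behavior is being compared.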

The cost of storing and reloading results when insufficient registers are available is less clear. Newton-Raphson reciprocal refinement is favorable because operand sizes are minimal in each iteration, limiting the number of words needing to be saved. Of the hybrid algorithms NR-P&T has an edge over NR-Byte when $B=M$, when it requires only $B$ words to store information about $B$ and $R$ during most of the algorithm, instead of $(B+R)$ for the NR-Byte algorithm. While the Restoring algorithm is always at least as good as the hybrid algorithms in terms of the number of values needing to be saved at any given time, it will still be at a disadvantage with regard to storing and loading costs. While every algorithm must access all (non-zero) words of $P_i$ to calculate $P_{i+1}$, the hybrid algorithms can process $R$ quotient words before (longer-term) storage of each word of $P_{i+1}$ (instead of only one word for Restoring Division). The extent to which these considerations will lead to changes in relative performance is not clear.

In order to measure instruction-level parallelism in the small-word study, we evaluated the algorithms assuming two instructions could be issued per cycle. The NR-Byte algorithm again outperformed the others, and Goldschmidt's algorithm was worst in spite of being able to perform parallel multiprecision multiplications. This is because there is significant opportunity for parallel (or pipelined) computation within a single multiprecision multiplication. Thus, since multiprecision multiplication dominates all algorithms, they will usually be able to take advantage of whatever parallelism is offered by a given architecture. A possible exception is Restoring Division partial remainder calculations, where a carry-propagate add requires at least one cycle per word for a maximum speed-up of three.

Though unlikely to come up, for extremely large target accuracies, low-precision methods must be altered. Since each row of partial products can add 1 lsb error to a result using the suggestions in Section 3, when the number of rows of partial products approaches $2^W$, an algorithm would need to incorporate another column of partial products in calculations to prevent immediate overflow (and for $2^{2W}$, another column still, and so on).

For repeated use of a divisor, obtaining a reciprocal of full target accuracy allows each division to be performed via a single multiprecision multiplication. Surprisingly, this is not always the best approach. When $A=M$ the asymptotic cost of this multiplication is the same as that of the entire hybrid algorithms for $A=B=M$. For smaller $B$, where the hybrid algorithms become more efficient (but the multiplication by a full reciprocal does not), the NR-Byte algorithm is the best choice.

Finally, multiplication can be performed in time on the order of $n \log n$ using Fast Fourier Transforms (FFTs) [1]. While we did not examine this approach in sufficient detail to determine how large the parameters must be to make it practical, we did examine the asymptotic behavior of floating-point division under the assumption that an $n$-by-$n$ word multiplication can be performed in $(cn \log n)$ instructions. When $A=B=M$, the NR-Byte algorithm with a reciprocal refined to $M/2$ words is optimal, with an asymptotic cost of $(7/2)cM \log M$ instructions. The advantage of this algorithm comes from having most of its cost in several large multiplications. (While an $n$-by-$n$ word multiplication requires the same number of instructions as two $(n/\sqrt{2})$-by-$(n/\sqrt{2})$ word multiplications using standard multiplication, the former will require only $cn \log n$ instead of $\approx \sqrt{2}\,cn \log n$ instructions when FFTs are used.)

The NR-Byte algorithm is also asymptotically optimal for small $B$'s when using FFT multiplication. By setting $R=B$, the asymptotic cost is $c(2M+3B)\log B$, where the first term is equally divided between the costs of quotient and partial remainder calculations and the second term is the cost of reciprocal refinement. These results assume that minimum-sized operands are used, as would be used by low-precision arithmetic, but that the FFTs perform the multiplications to full precision.
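The parenthetical comparison between one large multiplication and two smaller ones can be spelled out as follows, taking the $cn\log n$ FFT cost model above as given:

```latex
\text{standard:}\quad
2\left(\frac{n}{\sqrt{2}}\right)^{2} = n^{2},
\qquad
\text{FFT:}\quad
2\,c\,\frac{n}{\sqrt{2}}\,\log\frac{n}{\sqrt{2}}
  = \sqrt{2}\,c\,n\left(\log n - \tfrac{1}{2}\log 2\right)
  \approx \sqrt{2}\,c\,n\log n .
```

Under this model a single $n$-by-$n$ multiplication is therefore cheaper by a factor approaching $\sqrt{2}$, which is why concentrating the work in a few large multiplications is advantageous.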

9 Conclusions

In this study we have optimized several algorithms for software division, and then evaluated their performances using simple machine models in two studies. The first is a semi-exhaustive case study for implementing floating-point division in 8-bit or 16-bit processors, with several multiplier configurations considered. The second evaluates algorithms for arbitrary parameters by obtaining closed-form expressions estimating algorithm costs, both for integer and floating-point division.

The studies indicate that just two algorithms are competitive for multiprecision division. Restoring Division is most efficient for floating-point division with large target accuracies ($M \gtrsim 70$) and for integer division when the divisor is not small. Its main advantage, however, is that it is easy to implement.

For other multiprecision division, a hybrid of the Newton-Raphson and Byte Division algorithms is optimal, where significant reciprocal refinement is often performed before beginning very high-radix Byte Division iterations. It is made more efficient through the use of low-precision arithmetic and a method by which the accuracy of a reciprocal can be boosted at very little cost during Newton-Raphson refinement.

Implementing the NR-Byte algorithm is not simple. While details can be carefully worked out for fixed parameters, efficient implementation for arbitrary parameters will require considerable effort. Integrated strategies must be developed to determine $M$ (based on $T$ and $W$) and $R$ (based on $A$, $B$ and $M$), and to prevent partial remainder overflow for large $M$'s. These strategies must ensure that target accuracies are met without sacrificing too much efficiency.

Because of these complications, the Restoring algorithm is an attractive choice when the performance difference between algorithms is small (Figure 3). For small divisors, however, the factor of $\approx 1.67$ advantage of the NR-Byte algorithm can justify the effort needed to implement it.

10 Acknowledgments

This work was supported in part by NSF grant MIP-9423985 and its REU supplement, a University of California MICRO fellowship, and an ARCS foundation scholarship. The authors thank the Kestrel team for many helpful discussions, and the Arith-13 and Transactions reviewers for excellent and detailed comments. Additional information on the Kestrel project, as well as the associated technical report, can be found at http://www.cse.ucsc.edu/research/kestrel.

[Figure 3 plots: ratio NR-Byte/Restoring (0.0 to 1.0) against (a) $M$ and (b) $A$.]

Figure 3: (a) Ratios of instructions (floating-point division) for NR-Byte vs. Restoring for: $A=B=M$ (top solid); $A=1$, $B=M$ (dashed); $B=1$, $A=M$ (dotted); $A=B=1$ (bottom solid). (b) Ratios of instructions (integer division) for NR-Byte vs. Restoring for (bottom to top): $B=1$, $B=2$, $B=3$, and $B=4$.

References

[1] D. E. Knuth, The Art of Computer Programming, vol. 2. Reading, MA: Addison-Wesley, 2nd ed., 1981.

[2] E. V. Krishnamurthy, “On optimal iterative schemes for high-speed division,” IEEE Trans. Computers, vol. C-19, pp. 227–231, Mar. 1970.

[3] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital System Designers. New York: Holt, Rinehart and Winston, 1982.

[4] D. Wong and M. Flynn, “Fast division using accurate quotient approximations to reduce the number of iterations,” IEEE Trans. Computers, vol. 41, pp. 981–995, Aug. 1992.

[5] D. M. Dahle, J. D. Hirschberg, K. Karplus, H. Keller, E. Rice, D. Speck, D. H. Williams, and R. Hughey, “Kestrel: Design of an 8-bit SIMD parallel processor,” in Proc. 17th Conf. on Advanced Research in VLSI (R. B. Brown and A. T. Ishii, eds.), pp. 145–162, IEEE CS, Sept. 1997.

[6] E. Rice and R. Hughey, “Multiprecision division on an 8-bit processor,” in Proc. 13th IEEE Symp. Computer Arithmetic (T. Lang, J.-M. Muller, and N. Takagi, eds.), pp. 74–81, IEEE CS, July 1997.

[7] A. Svoboda, “An algorithm for division,” in Proc. 9th Symp. Inform. Processing Machines, pp. 25–34, 1963.

[8] M. D. Ercegovac, T. Lang, and P. Montuschi, “Very high radix division with selection by rounding and prescaling,” IEEE Trans. Computers, vol. 43, pp. 909–918, Aug. 1994.

[9] M. J. Flynn, “On division by functional iteration,” IEEE Trans. Computers, vol. C-19, pp. 702–706, Aug. 1970.

[10] R. E. Goldschmidt, “Applications of division by convergence,” Master's thesis, MIT, Cambridge, MA, 1964.

[11] E. Rice and R. Hughey, “Multiprecision division: Expanded version,” Tech. Rep. UCSC-CRL-98-10, University of California, Santa Cruz, CA, Apr. 1998. Also available from http://www.cse.ucsc.edu/research/kestrel.

[12] D. Ferrari, “A division method using a parallel multiplier,” IEEE Trans. Computers, vol. EC-16, pp. 224–226, Apr. 1967.

11 Affiliation of Authors

The authors are with the Department of Computer Engineering, Jack Baskin School of Engineering, University of California, Santa Cruz, CA 95064.

e-mail: felrice, [email protected]

A Algorithm Implementations

In the following presentation of the algorithms, each line describes an operation that typically requires several instructions. On the left of each operation is a label followed by boxes indicating the accuracy to which it is to be calculated, as well as which values need calculating. The following notation is used:

□ : Value needing to be calculated. Radix point is to the left of all symbols unless otherwise specified.

(0) : A leading word whose value is known to be 0 and does not need calculating.

(m) : A word whose value is known to be the maximum possible, namely $(2^{\text{wordsize}} - 1)$, and does not need calculating.

( ) : A word whose value does not need calculating because the same-position result of a subsequent subtraction can be predicted.

⊠ : An uncalculated partial product.

The details of how an operation would be performed can be determined by assuming that all low-precision techniques discussed in the paper are used, and that the order of individual instructions (add, subtract, or multiply) minimizes the overall cost of the operation.

For example, an operation from the first algorithm is:

    Q □ □ □ □ = q0 + t3(r1).

From the left side of the equation, we see that 4-word accuracy is needed in $Q$ and that all must be calculated (by definition of □ above). The number of words of $q_0$, $t_3$ and $r_1$ can be obtained by finding where they appear on the left side of an equation, which turn out to be 2 words ($q_0$), 3 words ($t_3$) and 2 words ($r_1$). The overall structure of the operation will then be:

[Diagram: $q_0$ is added to the rows of partial products of $t_3 \times r_1$, one of which is left uncalculated (⊠), to give $Q$.]

From this, the optimal ordering of instructions must be determined. This optimal ordering is shown below for a mult-acc machine on the left, and a mult-acc-acc machine on the right (numbers give the order in which instructions are issued):

    mult-acc                        mult-acc-acc
    + q0                            + q0
      t3 × r1                         t3 × r1
           3   2   1   ⊠                   3   2   1   ⊠
    + MHI  6   5   4                + MHI  6   5   4
      10   9   8   7   Q               8   7   6   5   Q

Note that after instruction 6 on the left, MHI is left alone while the two rows of partial products are added, so that when MHI is added, the carry from this addition process (as well as the lower word of $q_0$) can be accumulated at the same time. (This strategy on mult-acc machines of leaving MHI while the previous row is accumulated is usually the best strategy.) On the right, since the results of instructions 2 and 3 can be accumulated during instructions 5 and 6 on a mult-acc-acc machine, no further work is needed for the two low-order words of result $Q$.

For a mult machine, things are a bit trickier, especially when the result is to be subtracted from another number as when calculating a new partial remainder. Then the most efficient process is to perform a subtraction when processing an MHI. The following shows how this would work:

           a     b     c     d     Pi
                y1    y2    y3
     ×                       x
           6   5/4   3/2     1
          10     9     8     7     Pi+1

Here, at time 2, we perform $c - \mathrm{MHI}(xy_3)$, latching the carry until time 4 when we perform $b - \mathrm{MHI}(xy_2) - \text{carry}$, etc. Finally, at time 7 we perform $d - (1)$, at time 8 $(2) - (3) - \text{carry}$, etc. Although this fuses the multiplication and subtraction in a combined operation, in the following, we will consider all but the last row of subtractions as being part of the multiplication cost.

Also provided with the algorithms are instruction and partial-product counts for the operations. Here, x/y(z) to the right of each operation represents:

x: Number of instructions required by mult-acc machine.

y: Number of instructions required by mult-acc-acc machine.

z: Number of partial products required by operation.

The algorithm implemented in Subsection A.5.3 triples the number of accurate words per iteration by obtaining a 3-word reciprocal r1, prescaling n = ar and d = br, obtaining a multiplier using m = [1 + (1 − d) + (1 − d)^2] and finally calculating Q = nm. While we developed this algorithm from Goldschmidt's algorithm, since the last three steps calculate

\[ Q = a r_1 m = a r_1 \left[ 1 + (1 - b r_1) + (1 - b r_1)^2 \right], \]

it would be pure Newton-Raphson if r1·m were calculated before a·r1. The motivation for this order is that it allows simultaneous calculation of the multiplications in step 2.
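The tripling of accuracy can be checked with exact rational arithmetic. The sketch below is an illustration only: the function name `tripled_accuracy_quotient`, the use of Python's `fractions.Fraction` in place of word-level arithmetic, and the sample operands are assumptions, not part of the listings. With e = 1 − b·r1, the product (1 − e)(1 + e + e^2) equals 1 − e^3, so the relative error of Q is e^3.

```python
from fractions import Fraction


def tripled_accuracy_quotient(a, b, r1):
    """Prescale by r1, then apply the multiplier m = 1 + (1 - d) + (1 - d)^2."""
    n = a * r1              # prescaled dividend
    d = b * r1              # prescaled divisor, close to 1
    e = 1 - d               # error of the reciprocal estimate
    m = 1 + e + e * e       # multiplier
    return n * m


a = Fraction(3, 2)
b = Fraction(5, 4)
r1 = Fraction(204, 256)     # assumed underestimate of 1/b = 0.8
e = 1 - b * r1
q = tripled_accuracy_quotient(a, b, r1)
relative_error = (a / b - q) / (a / b)
```

Here `relative_error` equals `e ** 3` exactly, which is the sense in which the number of accurate words triples.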

While we are careful in the following implementations to prevent quotient overestimation in the NR and Restoring algorithms, we do not do the same for the other algorithms. This simplification is justified by the fact that they do not compare favorably even with this advantage.


A.1 Modified Newton-Raphson Algorithm

A.1.1 Modified NR/8-Bit/Single

a1 a2 a3 = a
n = b − 1 *

*In this and many of the following algorithms we use n = b − 1 in order to minimize the number of partial products and/or reduce the number of required instructions when multiplying by b (which is assumed to be normalized to 2 > b ≥ 1).

t0 = r + rn    3/3(2)
t1 = 255 254 − t0    2/2
r1 = r + rt1    2/2(2)
q0 = ar1    6/5(3)
t2 ( ) = q0 + nq0    10/7(6)
t3 (0) = (a1) a2 a3−1 255 − t2    3/3
Q = q0 + t3(r1)    10/8(5)

TOTALS: 36/30(18)

A.1.2 Modified NR/8-bit/Double

a1 a2 a3 a4 a5 a6 a7 = a
n = b − 1

t0 = r + rn    5/5(4)

The following 254 prepares for underestimates being subtracted in both the first two iterations (t0 and t1).

d1 = 255 255 255 254 − t0    4/4

The following c1 represents the correction to r.

c1 = rt1    2/2(2)
t1 ( ) = c1 + c1n    4/4(4)
t2 (0) = d1 − t1    3/3
t3 (0) = t2
t4 (0) = t2 + t3(t3)    4/3(1)
r2 = (r + c1) + (r + c1)(t4)    10/8(5)
q0 = ar2    20/14(10)
t5 ( ) ( ) ( ) = q0 + nq0    37/23(22)
t6 (0) (0) (0) = (a1) (a2) (a3) a4 a5 a6 a7−1 251 − t5    5/5
Q = q0 + t6(r2)    30/21(14)

TOTALS: 124/92(62)

A.1.3 Modified NR/16-bit/Single

a1 a2 = a
n = b − 1

q0 = ar    2/2(1)
t0 = q0 + nq0    3/3(2)
t1 = a1 a2−1 − t0    2/2
Q = q0 + t1(r)    3/3(2)

TOTALS: 10/10(5)


A.1.4 Modified NR/16-bit/Double

a1 a2 a3 a4 = a
n = b − 1

t0 = r + rn    3/3(2)
t1 = 65535 65534 − t0    2/2
r1 = r + rt1    2/2(2)
q0 = ar1    6/5(3)
t2 ( ) = q0 + nq0 + 0 0 0 3 *    13/8(7)

*This last term is added to ensure that t2 is not an underestimate. (Low-precision arithmetic could make it ≤ 3 lsb less than accurate.)

t3 (0) = 1 − t2    3/3
Q = q0 + t3(r1)    10/8(5)

TOTALS: 39/31(19)

A.2 Byte Division

A.2.1 Byte/8-bit/Single

n = b − 1
P0 = a

q0 = rP0    2/2(1)
t0 = q0 + nq0    4/4(3)
P1 = P0 − t0    4/4
q1 = rP1    3/3(2)
t1 ( ) = q1 + nq1    10/7(6)
P2 (0) = P1 − t1    3/3
q2 (0) = rP2    3/3(2)
t2 (0) ( ) = q2 + nq2    8/6(5)
P3 (0) (0) = P2 − t2    2/2
q3 (0) (0) = rP3 *    3/3(2)

*Begin calculating Q here.

Q = Σ_{i=0}^{3} qi    2/2

TOTALS: 44/39(21)
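The structure of this listing (partial quotient, product with b, new partial remainder, final sum) can be sketched in Python. This is a rough model under stated assumptions rather than the word-level code counted above: operands are exact `Fraction` values with 1 ≤ b < 2, the one-word reciprocal is taken as floor(256/b)/256, each partial quotient is truncated to one more byte than the last, and the name `byte_division` is hypothetical.

```python
from fractions import Fraction
from math import floor


def byte_division(a, b, iterations=4, word_bits=8):
    """Sum truncated partial quotients q_i = trunc(r * P_i), with P_{i+1} = P_i - b * q_i."""
    radix = 2 ** word_bits
    r = Fraction(floor(radix / b), radix)     # one-word underestimate of 1/b
    p = Fraction(a)                           # P0 = a
    quotient = Fraction(0)
    for i in range(iterations):
        scale = radix ** (i + 1)              # q_i keeps (i + 1) fractional bytes
        q_i = Fraction(floor(r * p * scale), scale)
        p -= b * q_i                          # new partial remainder
        quotient += q_i                       # Q is the sum of the q_i
    return quotient, p


a = Fraction(3, 2)
b = Fraction(5, 4)
q, remainder = byte_division(a, b)
```

Because r never exceeds 1/b and each q_i is truncated, the partial remainder stays non-negative and a = b·Q + P holds exactly at every step, so Q is always an underestimate of a/b.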


A.2.2 Byte/8-bit/Double

n = b − 1
P0 = a

q0 = rP0    2/2(1)
t0 = q0 + nq0    8/8(7)
P1 = P0 − t0    8/8
q1 = rP1    3/3(2)
t1 ( ) = q1 + nq1    22/15(14)
P2 (0) = P1 − t1    7/7
q2 (0) = rP2    3/3(2)
t2 (0) ( ) = q2 + nq2    20/14(13)
P3 (0) (0) = P2 − t2    6/6
q3 (0) (0) = rP3    3/3(2)
t3 (0) (0) ( ) = q3 + nq3    17/12(11)
P4 (0) (0) (0) = P3 − t3    5/5
q4 (0) (0) (0) = rP4    3/3(2)
t4 (0) (0) (0) ( ) = q4 + nq4    14/10(9)
P5 (0) (0) (0) (0) = P4 − t4    4/4
q5 (0) (0) (0) (0) = rP5    3/3(2)
t5 (0) (0) (0) (0) ( ) = q5 + nq5    11/8(7)
P6 (0) (0) (0) (0) (0) = P5 − t5    3/3

The following 1-byte partial quotient k0 will be added to q6.

k0 (0) (0) (0) (0) (0) = rP6    2/2(1)
t6 (0) (0) (0) (0) (0) = q6 + nq6    4/4(3)
P7 (0) (0) (0) (0) (0) = P6 − t6    3/3
q6 (0) (0) (0) (0) (0) = rP7 + k0    3/3(2)
t7 (0) (0) (0) (0) (0) ( ) = q7 + nq7    8/6(5)
P8 (0) (0) (0) (0) (0) (0) = P0 − t7    2/2
q7 (0) (0) (0) (0) (0) (0) = rP1 *    3/3(2)

Q = Σ_{i=0}^{7} qi    7/7

TOTALS: 174/147(85)


A.2.3 Byte/8-bit/2-word Reciprocal/Double

n = b − 1
P0 = a

To get an accurate enough 2-byte estimate, we use a form of r1 = r[1 + (1 − br) + (1 − br)^2]:

u0 = r + rn    4/4(3)
u1 = 255 255 255 − u0    3/3
u2 = u1
u3 = u1
u4 = u2 + u3 × u3    2/1(1)
r1 = r + r(u4)    2/2(2)

q0 = r1P0    6/5(3)
t0 ( ) = q0 + nq0    22/15(14)
P1 (0) = P0 − t0    7/7
q1 (0) = r1P1    9/7(5)
t1 (0) ( ) ( ) = q1 + nq1    28/18(17)
P2 (0) (0) (0) = P1 − t1    5/5
q2 (0) (0) (0) = r1P2    9/7(5)
t2 (0) (0) (0) ( ) ( ) = q2 + nq2    18/12(11)
P3 (0) (0) (0) (0) (0) = P2 − t2    3/3
q3 (0) (0) (0) (0) (0) = r1P3 *    9/7(5)

Q = Σ_{i=0}^{3} qi    6/6

TOTALS: 133/102(66)

A.2.4 Byte/16-bit/Single

(See Modified NR/16-bit/Single, which is identical in this case.)


A.2.5 Byte/16-bit/Double

n = b − 1
P0 = a

q0 = rP0    2/2(1)
t0 = q0 + nq0    5/5(4)
P1 = P0 − t0    4/4
q1 = rP1    3/3(2)
t1 ( ) = q1 + nq1    11/8(7)
P2 (0) = P1 − t1    3/3
q2 (0) = rP2    3/3(2)
t2 (0) ( ) = q2 + nq2    8/6(5)
P3 (0) (0) = P2 − t2    2/2
q3 (0) (0) = rP3 *    3/3(2)

Q = Σ_{i=0}^{3} qi    2/2

TOTALS: 46/41(23)

A.3 Accurate Quotient Algorithm

A.3.1 Accurate Quotient/8-bit/Single/Mult-Acc

n = b − 1
P0 = a

N = r + rn    4(3)
p0 = P0
t0 = Np0    5(4)
P1 = P0 − t0    4
p1 = P1
t1 ( ) = Np1    10(7)
P2 (0) = P1 − t1    3
p2 (0) = P2
t2 (0) ( ) = Np2    7(5)
p3 (0) (0) = P2 − t2    2
Y ≐ Σ_{i=0}^{3} pi    4
Q = rY    5(5)

TOTALS: 44(24)


A.3.2 Accurate Quotient/8-bit/Single/Mult-Acc-Acc

n = b − 1
P0 = a

N = r + rn    4(3)
K = 1 − N    4
p0 = P0
P1 = (P0 − p0) + Kp0    5(4)
p1 = P1
P2 (0) = (P1 − p1) + Kp1    8(7)
p2 (0) = P2
p3 (0) (0) = (P2 − p2) + Kp2    6(5)
Y ≐ Σ_{i=0}^{3} pi    4
Q = rY    5(5)

TOTALS: 36(24)


A.3.3 Accurate Quotient/8-bit/Double/Mult-Acc

n = b − 1
P0 = a

N = r + rn    8(7)
p0 = P0
t0 = Np0    9(8)
P1 = P0 − t0    8
p1 = P1
t1 ( ) = Np1    22(15)
P2 (0) = P1 − t1    7
p2 (0) = P2
t2 (0) ( ) = Np2    19(13)
P3 (0) (0) = P2 − t2    6
p3 (0) (0) = P3
t3 (0) (0) ( ) = Np3    16(11)
P4 (0) (0) (0) = P3 − t3    5
p4 (0) (0) (0) = P4
t4 (0) (0) (0) ( ) = Np4    13(9)
P5 (0) (0) (0) (0) = P4 − t4    4
p5 (0) (0) (0) (0) = P5
t5 (0) (0) (0) (0) ( ) = Np5    10(7)
P6 (0) (0) (0) (0) (0) = P5 − t5    3
p6 (0) (0) (0) (0) (0) = P6
t6 (0) (0) (0) (0) (0) ( ) = Np6    7(5)
p7 (0) (0) (0) (0) (0) (0) = P6 − t6    2
Y ≐ Σ_{i=0}^{7} pi    8
Q = rY    9(8)

TOTALS: 156(83)


A.3.4 Accurate Quotient/8-bit/Double/Mult-Acc-Acc

n = b − 1
P0 = a

N = r + rn    8(7)
K = 1 − N    8
p0 = P0
P1 = (P0 − p0) + Kp0    9(8)
p1 = P1
P2 (0) = (P1 − p1) + Kp1    16(15)
p2 (0) = P2
P3 (0) (0) = (P2 − p2) + Kp2    14(13)
p3 (0) (0) = P3
P4 (0) (0) (0) = (P3 − p3) + Kp3    12(11)
p4 (0) (0) (0) = P4
P5 (0) (0) (0) (0) = (P4 − p4) + Kp4    10(9)
p5 (0) (0) (0) (0) = P5
P6 (0) (0) (0) (0) (0) = (P5 − p5) + Kp5    8(7)
p6 (0) (0) (0) (0) (0) = P6
p7 (0) (0) (0) (0) (0) (0) = (P6 − p6) + Kp6    6(5)
Y ≐ Σ_{i=0}^{7} pi    8
Q = rY    9(8)

TOTALS: 108(83)

A.3.5 Accurate Quotient/8-bit/2-word/Single/Mult-Acc

n = b − 1
P0 = a

Here we use basic NR to get an accurate enough 2-byte estimate r1 = r(2 − br) = r + r[1 − (rn + r)]:

u0 = r + rn    3(2)
u1 = 255 255 − u0    2
r1 = r + r(u1)    2(2)

N (m) = r1 + r1(n)    13(6)
p0 = P0
t0 ( ) = Np0    10(7)
p1 (0) = P0 − t0    3
Y ≐ p0 + p1    3
Q = r1(Y)    13(9)

TOTALS: 49(26)


A.3.6 Accurate Quotient/8-bit/2-word/Single/Mult-Acc-Acc

n = b − 1
P0 = a

Here we use basic NR to get an accurate enough 2-byte estimate r1 = r(2 − br) = r + r[1 − (rn + r)]:

u0 = r + rn    3(2)
u1 = 255 255 − u0    2
r1 = r + r(u1)    2(2)

N (m) = r1 + r1(n)    9(6)
K (0) = 1 − N    4
p0 = P0
p1 (0) = (P0 − p0) + Kp0    7(5)
Y ≐ p0 + p1    3
Q = r1(Y)    10(9)

TOTALS: 40(24)

A.3.7 Accurate Quotient/8-bit/2-word/Double/Mult-Acc

n = b − 1
P0 = a

Here we need more accuracy than for single precision and calculate a 3-byte (1 − br):

u0 = r + rn    4(3)
u1 = 255 255 254 − u0    3
r1 = r + r(u1)    3(3)

N (m) = r1 + r1(n)    22(14)
p0 = P0
t0 ( ) = Np0    22(13)
P1 (0) = P0 − t0    7
p1 (0) = P1
t1 (0) ( ) ( ) = Np1    27(17)
P2 (0) (0) (0) = P1 − t1    5
p2 (0) (0) (0) = P2
t2 (0) (0) (0) ( ) ( ) = Np2    17(11)
p3 (0) (0) (0) (0) (0) = P2 − t2    3
Y ≐ Σ_{i=0}^{3} pi    7
Q = r1(Y)    25(17)

TOTALS: 145(78)


A.3.8 Accurate Quotient/8-bit/2-word/Double/Mult-Acc-Acc

n = b − 1
P0 = a

Here we need more accuracy than for single precision and calculate a 3-byte (1 − br):

u0 = r + rn    4(3)
u1 = 255 255 254 − u0    3
r1 = r + r(u1)    3(3)

N (m) = r1 + r1(n)    15(14)
K (0) = 1 − N    8
p0 = P0
P1 (0) = (P0 − p0) + Kp0    15(13)
p1 (0) = P1
P2 (0) (0) (0) = (P1 − p1) + Kp1    17(15)
p2 (0) (0) (0) = P2
p3 (0) (0) (0) (0) (0) = (P2 − p2) + Kp2    11(9)
Y ≐ Σ_{i=0}^{3} pi    7
Q = r1(Y)    18(17)

TOTALS: 101(74)

A.3.9 Accurate Quotient/16-bit/Single

n = b − 1
P0 = a

N = r + rn    3/3(2)
p0 = P0
t0 = Np0    3/3(2)
p1 = P0 − t0    2/2
Y ≐ p0 + p1    2/2
Q = rY    3/3(3)

TOTALS: 13/13(7)


A.3.10 Accurate Quotient/16-bit/Double/Mult-Acc

n = b − 1
P0 = a

N = r + rn    5(4)
p0 = P0
t0 = Np0    5(4)
P1 = P0 − t0    4
p1 = P1
t1 ( ) = Np1    10(7)
P2 (0) = P1 − t1    3
p2 (0) = P2
t2 (0) ( ) = Np2    7(5)
p3 (0) (0) = P2 − t2    2
Y ≐ Σ_{i=0}^{3} pi    4
Q = rY    5(5)

TOTALS: 45(25)

A.3.11 Accurate Quotient/16-bit/Double/Mult-Acc-Acc

n = b − 1
P0 = a

N = r + rn    5(4)
K = 1 − N    4
p0 = P0
P1 = (P0 − p0) + Kp0    5(4)
p1 = P1
P2 (0) = (P1 − p1) + Kp1    8(7)
p2 (0) = P2
p3 (0) (0) = (P2 − p2) + Kp2    6(5)
Y ≐ Σ_{i=0}^{3} pi    4
Q = rY    5(5)

TOTALS: 37(25)


A.4 Prescaling and Rounding Algorithm

A.4.1 Prescaling and Rounding/8-bit/Single/Mult-Acc

n = b − 1

P0 = ar    4(3)
N = r + rn    4(3)
q0 = P0
t0 = Nq0    5(4)
P1 = P0 − t0    4
q1 = P1
t1 ( ) = Nq1    10(7)
P2 (0) = P1 − t1    3
q2 (0) = P2
t2 (0) ( ) = Nq2    7(5)
q3 (0) (0) = P2 − t2    2
Q = Σ_{i=0}^{3} qi    3

TOTALS: 42(22)

A.4.2 Prescaling and Rounding/8-bit/Single/Mult-Acc-Acc

n = b − 1

P0 = ar    4(3)
N = r + rn    4(3)
K = 1 − N    4
q0 = P0
P1 = (P0 − q0) + Kq0    5(4)
q1 = P1
P2 (0) = (P1 − q1) + Kq1    8(7)
q2 (0) = P2
q3 (0) (0) = (P2 − q2) + Kq2    6(5)
Q = Σ_{i=0}^{3} qi



A.4.3 Prescaling and Rounding/8-bit/Double/Mult-Acc

n = b − 1

P0 = ar    8(7)
N = r + rn    8(7)
q0 = P0
t0 = Nq0    9(8)
P1 = P0 − t0    8
q1 = P1
t1 ( ) = Nq1    22(15)
P2 (0) = P1 − t1    7
q2 (0) = P2
t2 (0) ( ) = Nq2    19(13)
P3 (0) (0) = P2 − t2    6
q3 (0) (0) = P3
t3 (0) (0) ( ) = Nq3    16(11)
P4 (0) (0) (0) = P3 − t3    5
q4 (0) (0) (0) = P4
t4 (0) (0) (0) ( ) = Nq4    13(9)
P5 (0) (0) (0) (0) = P4 − t4    4
q5 (0) (0) (0) (0) = P5
t5 (0) (0) (0) (0) ( ) = Nq5    10(7)
P6 (0) (0) (0) (0) (0) = P5 − t5    3
q6 (0) (0) (0) (0) (0) = P6
t6 (0) (0) (0) (0) (0) ( ) = Nq6    7(5)
q7 (0) (0) (0) (0) (0) (0) = P6 − t6    2
Q = Σ_{i=0}^{7} qi    7

TOTALS: 154(82)


A.4.4 Prescaling and Rounding/8-bit/Double/Mult-Acc-Acc

n = b − 1

P0 = ar    8(7)
N = r + rn    8(7)
K = 1 − N    8
q0 = P0
P1 = (P0 − q0) + Kq0    9(8)
q1 = P1
P2 (0) = (P1 − q1) + Kq1    16(15)
q2 (0) = P2
P3 (0) (0) = (P2 − q2) + Kq2    14(13)
q3 (0) (0) = P3
P4 (0) (0) (0) = (P3 − q3) + Kq3    12(11)
q4 (0) (0) (0) = P4
P5 (0) (0) (0) (0) = (P4 − q4) + Kq4    10(9)
q5 (0) (0) (0) (0) = P5
P6 (0) (0) (0) (0) (0) = (P5 − q5) + Kq5    8(7)
q6 (0) (0) (0) (0) (0) = P6
q7 (0) (0) (0) (0) (0) (0) = (P6 − q6) + Kq6    6(5)
Q = Σ_{i=0}^{7} qi    7

TOTALS: 106(82)

A.4.5 Prescaling and Rounding/8-bit/2-word/Single/Mult-Acc

n = b − 1

Here we use basic NR to get an accurate enough 2-byte estimate r1 = r(2 − br) = r + r[1 − (rn + r)]:

u0 = r + rn    3(2)
u1 = 255 255 − u0    2
r1 = r + r(u1)    2(2)

P0 = ar    11(6)
N (m) = r1 + r1(n)    13(6)
q0 = P0
t0 ( ) = Nq0    10(7)
q1 (0) = P0 − t0    3
Q = q0 + q1    2

TOTALS: 46(23)


A.4.6 Prescaling and Rounding/8-bit/2-word/Single/Mult-Acc-Acc

n = b − 1

Here we use basic NR to get an accurate enough 2-byte estimate r1 = r(2 − br) = r + r[1 − (rn + r)]:

u0 = r + rn    3(2)
u1 = 255 255 − u0    2
r1 = r + r(u1)    2(2)

P0 = ar    8(6)
N (m) = r1 + r1(n)    9(6)
K (0) = 1 − N    4
q0 = P0
q1 (0) = (P0 − q0) + Kq0    7(5)
Q = q0 + q1    2

TOTALS: 37(21)

A.4.7 Prescaling and Rounding/8-bit/2-word/Double/Mult-Acc

n = b − 1

Here we need more accuracy than for single precision and calculate a 3-byte (1 − br):

u0 = r + rn    4(3)
u1 = 255 255 254 − u0    3
r1 = r + r(u1)    3(3)

P0 = ar    23(14)
N (m) = r1 + r1(n)    22(14)
q0 = P0
t0 ( ) = Nq0    22(13)
P1 (0) = P0 − t0    7
q1 (0) = P1
t1 (0) ( ) ( ) = Nq1    27(17)
P2 (0) (0) (0) = P1 − t1    5
q2 (0) (0) (0) = P2
t2 (0) (0) (0) ( ) ( ) = Nq2    17(11)
q3 (0) (0) (0) (0) (0) = P2 − t2    3
Q = Σ_{i=0}^{3} qi    6

TOTALS: 142(75)


A.4.8 Prescaling and Rounding/8-bit/2-word/Double/Mult-Acc-Acc

n = b − 1

Here we need more accuracy than for single precision and calculate a 3-byte (1 − br):

u0 = r + rn    4(3)
u1 = 255 255 254 − u0    3
r1 = r + r(u1)    3(3)

P0 = ar    16(14)
N (m) = r1 + r1(n)    15(14)
K (0) = 1 − N    8
q0 = P0
P1 (0) = (P0 − q0) + Kq0    15(13)
q1 (0) = P1
P2 (0) (0) (0) = (P1 − q1) + Kq1    17(15)
q2 (0) (0) (0) = P2
q3 (0) (0) (0) (0) (0) = (P2 − q2) + Kq2    11(9)
Q = Σ_{i=0}^{3} qi


A.4.9 Prescaling and Rounding/16-bit/Single

n = b − 1

P0 = ar    3/3(2)
N = r + rn    3/3(2)
q0 = P0
t0 = Nq0    3/3(2)
q1 = P0 − t0    2/2
Q ≐ q0 + q1    1/1

TOTALS: 12/12(6)


A.4.10 Prescaling and Rounding/16-bit/Double/Mult-Acc

n = b − 1

P0 = ar    5(4)
N = r + rn    5(4)
q0 = P0
t0 = Nq0    5(4)
P1 = P0 − t0    4
q1 = P1
t1 ( ) = Nq1    10(7)
P2 (0) = P1 − t1    3
q2 (0) = P2
t2 (0) ( ) = Nq2    7(5)
q3 (0) (0) = P2 − t2    2
Q = Σ_{i=0}^{3} qi


A.4.11 Prescaling and Rounding/16-bit/Double/Mult-Acc-Acc

n = b − 1

P0 = ar    5(4)
N = r + rn    5(4)
K = 1 − N    4
q0 = P0
P1 = (P0 − q0) + Kq0    5(4)
q1 = P1
P2 (0) = (P1 − q1) + Kq1    8(7)
q2 (0) = P2
q3 (0) (0) = (P2 − q2) + Kq2    6(5)
Q = Σ_{i=0}^{3} qi



A.5 Goldschmidt’s Algorithm

A.5.1 Goldschmidt/8-bit/Single

(In following Mi represents (mi − 1) from paper...)

n = b − 1

n0 = ar    4/4(3)
d0 = r + nr    4/4(3)
t0 = d0
M0 = 255 255 − t0    2/2
n1 = n0 + n0(M0)    14/10(7)
d1 (m) = d0 + d0(M0)    13/8(7)
M1 (0) = (255) 255 255 255 − d1    3/3
Q = n1 + n1(M1)    16/12(6)

TOTALS: 56/43(26)

For two maskable processors, let C(ki) represent the cost of obtaining ki. Then the total cost would be (d0 being calculated at the same time as n0, etc.):

C(n0) + C(M0) + 2 + C(n1) + C(M1) + C(Q) = 41/33(16)
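The arithmetic behind this listing can be sketched with exact rationals. The following Python is an illustration only: the function name `goldschmidt`, the variable names `num` and `den` (standing for ni and di), the use of `fractions.Fraction` and the sample operands are assumptions, and none of the low-precision word handling is modelled. Each pass multiplies numerator and denominator by 1 + Mi with Mi = 1 − di, so the ratio num/den stays equal to a/b while 1 − den is squared.

```python
from fractions import Fraction


def goldschmidt(a, b, r, iterations=2):
    """Return (num, den) after the given number of Goldschmidt passes."""
    num = a * r                 # n0 = a r
    den = b * r                 # d0 = b r, formed in the listing as r + n r
    for _ in range(iterations):
        m = 1 - den             # M_i = m_i - 1 = 1 - d_i
        num = num + num * m     # n_{i+1} = n_i + n_i M_i
        den = den + den * m     # d_{i+1} = d_i + d_i M_i
    return num, den


a = Fraction(3, 2)
b = Fraction(5, 4)
r = Fraction(204, 256)          # assumed one-byte underestimate of 1/b
e = 1 - b * r
quotient, den = goldschmidt(a, b, r)
```

After two passes 1 − den equals e^4, and the quotient differs from a/b by (a/b)·e^4. The listing omits the last denominator update because only Q is needed.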

A.5.2 Goldschmidt/8-bit/Double

n = b − 1

n0 = ar    8/8(7)
d0 = r + rn    8/8(7)
t0 = d0
M0 = 255 255 − t0    2/2
n1 = n0 + n0(M0)    31/18(15)
d1 (m) = d0 + d0(M0)    29/16(15)
t1 (m) = d1
M1 (0) = (255) 255 255 255 − t1    3/3
n2 = n1 + n1(M1)    40/24(18)
d2 (m) (m) (m) = d1 + d1(M1)    32/18(17)
M2 (0) (0) (0) = 1 − d2    5/5
Q = n2 + n2(M2)    38/27(15)

TOTALS: 196/129(94)

For two maskable processors, the total cost would be:

C(n0) + C(M0) + 2/2(0) + C(n1) + C(M1) + 3/3(0) + C(n2) + C(M2) + C(Q) = 132/92(55)


A.5.3 Goldschmidt’s/8-bit/Extended/Double

n = b − 1

t0 = r + rn    4/4(3)
t1 = 255 255 254 − t0    3/3
t2 = t1 + t1(t1)    15/11(6)
r1 = r + r(t2)    4/4(3)
n0 = a(r1)    36/23(20)
d0 (m) (m) = r1 + r1(n)    32/20(19)
t3 (m) (m) = d0
t4 (0) (0) = (255) (255) 255 255 255 255 − t3    6/6
M0 (0) (0) = t4 + t4(t4)    26/19(10)
Q = d0 + d0(M0)    50/34(21)

176/124(82) − C(d0) + 3/3(0) = 147/107(63)

A.5.4 Goldschmidt’s/16-bit/Single Precision

n = b − 1

n0 = ar    3/3(2)
d0 = r + rn0    3/3(2)
M0 = 65535 65535 − d0    2/2
Q = n0 + N0(M0)    7/6(3)

C(n0) + C(M0) + C(Q) = 12/11(5)

A.5.5 Goldschmidt’s/16-bit/Double Precision

n = b − 1

n0 = ar    5/5(4)
d0 = r + nr    5/5(4)
t0 = d0
M0 = 65535 65535 − t0    2/2
n1 = n0 + n0(M0)    15/10(7)
d1 (m) = d0 + d0(M0)    13/8(7)
M1 (0) = 65535 65535 65535 65535 − d1    3/3
Q = n1 + n1(M1)    16/12(6)

C(n0) + C(M0) + 2/2(0) + C(n1) + C(M1) + C(Q) = 43/34(17)


Table IV: Costs for NR reciprocal refinement to K-word accuracy.

B >= K
  Calc (1 − b r_{i−1}): 1/4(3·2^(2i) + 8·2^i)
  Calc r_i = r_{i−1}(2 − b r_{i−1}): 1/8(3·2^(2i) + 18·2^i + 8)
  Calc B-word recip (using results from above): 1/2(3B^2 + 17B + 2 log2 B − 26)

B < K
  Each subsequent (1 − b r_{i−1}) calc: 1/2(3B^2 + 7B + 2)
  Subsequent r_i = r_{i−1}(2 − b r_{i−1}) calcs: 1/2(3B·2^i − 3B^2 + 3B + 3·2^i + 4)

B Derivation of General Formulas

The first step in deriving general formulas for instruction counts (excluding costs of loading, storing and accuracy boosts) was to obtain closed-form expressions for each operation needed within the algorithms. For NR reciprocal refinement, the required steps are given in Table IV. To get final cost equations for calculating a K-word reciprocal requires summing over the required log2(K) iterations.

Expressions for individual operations needed in other algorithms for floating-point and integer division are provided in Tables V and VI. The cost of adding a small constant to the final partial quotient in integer division in the NR-Byte algorithm (to ensure an overestimate if anything) is included in the equation for summing the qi’s. (As is done with Restoring Division, the cost of a final quotient adjustment when an overestimate actually occurs is not included in the equation due to the unlikelihood of such an occurrence and the inexpensiveness of such an adjustment.)

C Assorted issues

C.1 Newton-Raphson division never optimal

To see that Newton-Raphson division is never optimal, consider the steps of the NR-Byte algorithm when R is chosen to be R = M/2:

• Obtain an M/2-word reciprocal using NR reciprocal refinement.

• Calculate q0 = ar.

• Calculate P1 = a − bq0.

• Calculate Q = q0 + rP1.

Since the costs of the last two steps are identical to the costs of a final iteration of NR reciprocal refinement, the cost of this algorithm is equivalent to the cost of reciprocal refinement to M-word accuracy plus the cost of q0 = ar. This makes the algorithm more efficient than NR division, which also requires the cost of reciprocal refinement to M-word accuracy but which then performs a more costly multiplication by the dividend (an M-by-M word multiplication instead of an M/2-by-M word multiplication).
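The four steps listed above can be written out as a short Python sketch. It is an illustration under stated assumptions only: exact `Fraction` arithmetic stands in for the M/2-word and M-word operands, the reciprocal r is simply supplied by the caller rather than refined, and the name `nr_byte_half` is hypothetical. With e = 1 − b·r, the result is a·r·(1 + e), so its relative error is e^2, the same accuracy gain as one further NR refinement of r.

```python
from fractions import Fraction


def nr_byte_half(a, b, r):
    """One NR-Byte correction step using a half-length reciprocal r."""
    q0 = a * r              # first partial quotient
    p1 = a - b * q0         # partial remainder
    return q0 + r * p1      # Q = q0 + r * P1


a = Fraction(3, 2)
b = Fraction(5, 4)
r = Fraction(204, 256)      # assumed underestimate of 1/b
e = 1 - b * r
q = nr_byte_half(a, b, r)
relative_error = (a / b - q) / (a / b)
```

In this sketch `relative_error` equals `e ** 2` exactly. The full-length reciprocal is never formed, which is where the saving over NR division comes from.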

C.2 Higher multiplication latency

The effect of higher multiplication latency on floating-point division in small-word processors is explored in Table VII, which shows the number of partial products required by each algorithm. Since the NR-Byte algorithm requires the fewest in all cases, it will remain the optimal algorithm under higher multiplication latency.


Table V: Equation derivations for floating-point division.

Case A = B = M

NR-Byte
  get R-word reciprocal: 1/2(3R^2 + 17R + 2 log2 R − 26)
  q0 = rP0 (P0 = a): 1/2(3R^2 + 3R − 2)
  qi = rPi (for i > 0): 1/2(3R^2 + 9R − 2)
  total qi cost for i > 0: (M/R − 1)(1/2)(3R^2 + 9R − 2)
  P1 = P0 − bq0: R(3M − 3R + 4)
  Pi = Pi−1 − bqi−1 (for i > 1): 3(R + 1)(M − iR + 1) + R + 1
  total Pi cost for i > 1: sum above for i = 2 → (M/R − 1)
  sum the qi’s: M − R

NR-P&T
  get R-word reciprocal: 1/2(3R^2 + 17R + 2 log2 R − 26)
  P0 = ar: 1/2(6MR − 3R^2 + 3R − 2)
  N = br: 3MR − 3R^2 + 4R − 2
  P1 = P0 − p0N: R(3M − 3R + 4)
  Pi = Pi−1 − pi−1N (for i > 1): 3(R + 1)(M − iR + 1) + R
  total Pi cost for i > 1: sum above for i = 2 → (M/R − 1)
  sum the pi’s: M − R

Restoring
  get 2-word reciprocal: 10
  get each qi: 13
  sum of all Pi calc’s: 1/2(3M^2 + 9M − 34)

Case A <= M; B = M

NR-P&T (A > R)
  get R-word reciprocal: 1/2(3R^2 + 17R − 4 log2 R − 26)
  P0 = ar: 3AR + R − 2
  N = br: 3MR − 3R^2 + 4R − 2
  P1 = P0 − p0N: R(3M − 3R + 4)
  Pi for i > 1: 3(R + 1)(M − iR + 1) + R
  total Pi cost for i > 1: sum above for i = 2 → (M/R − 1)
  sum the qi’s: M − R

NR-P&T (A <= R)
  get R-word reciprocal: 1/2(3R^2 + 17R − 4 log2 R − 26)
  P0 = ar: 3AR + A − 2
  N = br: 3MR − 3R^2 + 4R − 2
  P1 = P0 − p0N: R(3M − 3R + 4)
  Pi for i > 1: 3(R + 1)(M − iR + 1) + R
  total Pi cost for i > 1: sum above for i = 2 → (M/R − 1)
  sum the qi’s: M − R

Restoring
  get 2-word reciprocal: 10
  get each qi: 13
  sum of all Pi calc’s: 1/2(3M^2 + 9M − 34)

Case A = M; B <= M

NR-Byte (B <= R)
  get R-word reciprocal: 1/2(6BR − 3B^2 + 10B log2(R/B) + 6R + 11B + 6 log2 R − 4 log2 B − 26)
  q0 = rP0 (P0 = a): 1/2(3R^2 + 3R − 2)
  qi = rPi (for i > 0): 1/2(3R^2 + 9R − 2)
  total qi cost for i > 0: (M/R − 1)(1/2)(3R^2 + 9R − 2)
  Pi = P0 − bq0: 1/2(3B^2 + 7B + 2)
  total Pi cost: (M/R − 1)(1/2)(3B^2 + 7B + 2)
  sum the qi’s: M − R

NR-Byte (B > R)
  get R-word reciprocal: 1/2(3R^2 + 17R + 2 log2 R − 26)
  q0 = rP0 (P0 = a): 1/2(3R^2 + 3R − 2)
  qi = rPi (for i > 0): 1/2(3R^2 + 9R − 2)
  total qi cost for i > 0: (M/R − 1)(1/2)(3R^2 + 9R − 2)
  P1 = P0 − bq0: 1/2(6BR − 3R^2 + 7R − 2)
  total Pi = Pi−1 − bqi−1 (for i > 1): 1/2(6MB − 3MR + M − 3B^2 − 6BR + 3R^2 − 9R − 2B − 8 + (6MB + 2M + 4B − 3B^2)/R)
  sum the qi’s: M − R

NR-P&T
  get R-word reciprocal: 1/2(3R^2 + 17R − 4 log2 R − 26)
  P0 = ar: 1/2(6MR − 3R^2 + 3R − 2)
  N = br: 1/2(6BR − 3R^2 + 7R − 2)
  P1 = P0 − p0N: R(3M − 3R + 4)
  total Pi = Pi−1 − pi−1N (for i > 1): 1/2(6MB + 3MR + 13M − 6R^2 − 12BR − 3B^2 − 26R − 14B − 16 + (6MB + 8M − 3B^2 − 2B)/R)
  sum the qi’s: M − R

Restoring
  get 2-word reciprocal: 10
  get each qi: 13
  sum of all Pi calc’s: 1/2(6MB + 6M − 3B^2 + 3B − 30)


Table VI: Equation derivations for integer division.

NR-Byte (B <= R)
  get R-word reciprocal: 1/2(6BR − 3B^2 + 10B log2(R/B) + 6R + 11B + 6 log2 R − 4 log2 B − 26)
  q0 = rP0 (P0 = a): 1/2(3R^2 + 3R − 2)
  qi = rPi (for i > 0): 1/2(3R^2 + 9R − 2)
  total qi cost for i > 0: (1/R)(A − B + 1 − R)(1/2)(3R^2 + 9R − 2)
  most Pi’s: 1/2(3B^2 + 7B + 2)
  sum of most Pi’s: ((A − B + 1)/R − 1)(1/2)(3B^2 + 7B + 2)
  last Pi: 1/2(3B^2 + B + 2)
  sum the qi’s: A − B

NR-Byte (B > R)
  get R-word reciprocal: 1/2(3R^2 + 17R + 2 log2 R − 26)
  q0 = rP0 (P0 = a): 1/2(3R^2 + 3R − 2)
  qi = rPi (for i > 0): 1/2(3R^2 + 9R − 2)
  total qi cost for i > 0: 1/2(1/R)(A − B + 1 − R)(3R^2 + 9R − 2)
  P1 = P0 − bq0: 1/2(6BR − 3R^2 + 7R − 2)
  Pi = Pi−1 − bqi−1 (for i > 1): 1/2(6BR + 6B − 3R^2 + R + 2)
  total Pi cost for i > 1: 1/2(1/R)(A − B + 1 − R)(6BR + 6B − 3R^2 + R + 2)
  sum the qi’s: A − B

Restoring
  get 2-word reciprocal: 10
  get each qi: 13
  sum of all Pi calc’s: 3AB − 3B^2 + 16A − 16B + 9

Table VII: Number of partial products required by algorithms.

                                NR-Byte          NR-P&T      Gold     NR/Gold
                                R=1  R=2  R=4    R=1  R=2             (hybrid)
Single Prec on W=8 processor    21   18   –      22   23     26(16)   –
Double Prec on W=8 processor    85   66   62     82   71     94(55)   82(63)
Single Prec on W=16 processor   5    –    –      5    –      7(5)     –
Double Prec on W=16 processor   23   19   –      24   –      28(17)   –


Table VIII: Comparison of algorithms on two independent 8-bit processors.

SINGLE PRECISION
                NR-Byte     NR-P&T      Gold
                R=1  R=2    R=1  R=2
Mult-acc        27   23     25   29     31
Mult-acc-acc    25   20     23   24     22

DOUBLE PRECISION
                NR-Byte          NR-P&T      Gold   NR/Gold
                R=1  R=2  R=4    R=1  R=2           (hybrid)
Mult-acc        91   79   75     85   83     106    95
Mult-acc-acc    84   59   51     78   64     72     74

Table IX: Comparison of algorithms on two independent 16-bit processors.

SINGLE PRECISION
                Byte   P&T   Gold
Mult-acc        8      9     9
Mult-acc-acc    8      9     8

DOUBLE PRECISION
                NR-Byte     P&T   Gold
                R=1  R=2
Mult-acc        28   24     26    32
Mult-acc-acc    26   20     24    25

C.3 Instruction Counts for Two Independently-Programmable Processors

Tables VIII and IX provide the number of instructions required per processor when two independently programmable processors are assigned to a division problem. The NR-Byte algorithm remains best in all cases.

C.4 Overflow prevention in hybrid algorithms

It was pointed out in Section 4 that the leading non-zero word of the partial remainder in Byte Division can more than double in each iteration. The Prescaling and Truncating algorithm is similar in this regard, while in NR division the magnitude can become squared. When this error becomes ≥ 2^W, subsequent partial remainders and partial quotients will require extra word(s) in calculations, and the least significant words of partial quotients will no longer contain significant information. One way to deal with this is to simply drop the least significant word. Another is to use a modified iteration to reduce the size of the leading non-zero word.

While these strategies are about equally efficient, our analysis assumes the latter approach since it maintains regularity in the algorithms, leading to simpler formulas. For the NR algorithm, for example, this leads to reciprocals of 1, 2, 4, ..., 2^k, ... words (rather than, for example, 1, 2, 4, 7, 14, 28, 55, etc.).

In fact, however, although we assume such boosts are made when necessary, the cost of periodic overflow prevention is not added to any of the algorithms in our study, primarily to keep the equations as simple as possible. There are reasons this can be done without losing much accuracy. In the case of NR reciprocal refinement, the cost of each boost is extremely small (see Section 5.2). For example, on a 32-bit processor, a boost to prevent overflow will not be necessary until M ≳ 32, when the 10–20 instructions needed for a boost will have little impact on the overall cost of ≳ 1800 instructions.

For the hybrid algorithms an accuracy boost is more expensive since a new partial remainder must be calculated. Fortunately, the size of M at which the first boost is needed is much greater as well, each iteration retiring multiple words. Since the size of the leading non-zero word of the partial remainder can roughly double in each iteration, if the R-word reciprocal is quite accurate (which can be assured by making a final boost during reciprocal refinement), the size of the initial leading non-zero word will be small and overflow cannot occur on a 32-bit processor until M ≳ 32R. For the NR-Byte algorithm (the only hybrid algorithm which is sometimes optimal), this does not occur until M > 16000, when the cost of a boost (proportional to the number of non-zero words of partial remainder) would be insignificant.


[Figure 4 plot: approximate inverse (in units of 1/256, axis from 128 to 256) against b (from 1 to 2), showing the curve labeled Inverse and the lines 255 − n, 361 − b/2 and 255 − b/4, with pockets marked A and B.]

Figure 4: Near-tangents to 1/b used to obtain reciprocal estimate.

D Single-precision algorithm for Kestrel

We now present the complete algorithm for single-precision division on Kestrel with rounding to ±1 lsb and without normalizing. An initial reciprocal under-estimate is obtained using the following approximation:

\[ r_0 = \frac{1}{256} \max\left(255 - n,\; 361 - b/2,\; 255 - b/4\right), \qquad (4) \]

where n is the leading fractional byte of b, and only the leading fractional bytes of b/2 and b/4 are used to calculate the other terms.

Equation 4 finds the best estimate provided by 3 near-tangents to the curve 1/b (Figure 4). Although clearly the lines y = (255 − b/4) and y = (255 − n) are not optimal in terms of minimizing the error of the estimate (there are pockets at A and B that could be reduced), this choice of lines has the advantages of being easily calculated and of always producing an estimate less than 1/b. Permitting estimates greater than 1/b requires extra steps to deal with the possibility of a negative (1 − b·ri), and does not provide enough increased accuracy to allow omitting any of the refinement steps in our final algorithms (or to allow any cheaper algorithms that we could find). In Kestrel, Equation 4 requires just 3 instructions to evaluate, thanks to its bit-shifter (which obtains b/4 at no cost from b/2) and especially to its comparator, which allows a subtraction and maximization in one instruction.
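Equation 4 is easy to check exhaustively for an 8-bit fractional byte. The Python sketch below is an illustration under stated assumptions: b is taken as 1.n... with 1 ≤ b < 2, the leading fractional bytes of b/2 and b/4 are modelled as 128 + ⌊n/2⌋ and half of that (mirroring a right shift), and the names `initial_reciprocal_units` and `never_overestimates` are hypothetical. The estimate is returned in units of 1/256.

```python
def initial_reciprocal_units(n):
    """Best of the three near-tangents of Equation 4, in units of 1/256."""
    half_byte = 128 + n // 2          # leading fractional byte of b/2
    quarter_byte = half_byte // 2     # leading fractional byte of b/4
    return max(255 - n, 361 - half_byte, 255 - quarter_byte)


def never_overestimates():
    """Check r0 * b < 1 for every leading byte n.

    Any b with leading fractional byte n satisfies b < (257 + n) / 256, so
    r0 * b < units * (257 + n) / 65536, which is at most 1 when the
    integer product does not exceed 65536.
    """
    return all(initial_reciprocal_units(n) * (257 + n) <= 256 * 256
               for n in range(256))
```

Under these assumptions the check passes for all 256 values of n, in line with the statement that the chosen lines always produce an estimate less than 1/b.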

(An alternative approach also not requiring table lookup is polynomial approximation. For an 8-bit processor, a good start is provided by the equation

\[ r_0 = \left( 249 - \left\lfloor \left( 203 - \left\lfloor 81n/256 \right\rfloor \right) \cdot n/256 \right\rfloor \right) / 256, \]

where n is the leading fractional byte of b = 1.n... [6]. This equation provides an underestimate of 1/b accurate to approximately 5 bits and requires 4 instructions in Kestrel if the values of 203 and 249 are already loaded in registers.)

The complete algorithm for single-precision division is presented in Figure 5.


Setup:
      a = a0 a1 a2
      b/2 = b0 b1 b2
      Bit shifter (BS) contains b0

Calculate b:
  3   (1) c0 c1 c2 ← 2 × b0 b1 b2

Calculate initial reciprocal estimate r0:
  1   r0 ← 255 − c0
  1   r0 ← max(r0, 105 − b0); BS ← BS/2
  1   r0 ← max(r0, 255 − BS)

Calculate b·r0:
  3   (0) k0 k1 ( ) ← r0 × 1 c0 c1

Calculate d = 1 − b·r0:
  2   d0 d1 ← 255 254 − k0 k1

Calculate d + d^2, result in f2 f1:
  1   (0) f0 ← 2 × d0
  1   mhi ( ) ← f0 × d1
  1   mhi f1 ← d0 × d0 + mhi + d1
  1   f2 ← mhi + d0

Calculate r1 = r0 + r0(d + d^2), result in r1 r2:
  2   mhi r2 ( ) ← r0 × f2 f1 + r0 0 0
  1   r1 ← mhi

Calculate q0 = a·r1, result in q2 q3:
  1   mhi q1 ← r2 × a0
  1   q0 ← mhi
  2   mhi q3 ( ) ← r1 × a0 a1 + q0 q1
  1   q2 ← mhi

Calculate b·q0, result in ( ) h3 h4 h5:
  4   (mhi) h0 h1 h2 ( ) ← q3 × 1 c0 c1 c2
  3   (mhi) h3 h4 h5 ← q2 × c0 c1 c2 + h0 h1 h2

Calculate a − b·q0 (with forced borrow), result in 0 m0 m1 m2:
  3   m0 m1 m2 ← a1 a2 0 − h3 h4 h5 − 0 0 1

Calculate q1 = q0 + r1(a − b·q0), adding 0 0 0 64 to round, result in q8 q9 q10 q11:
  1   q4 ← 64
  2   mhi q6 q7 ← r2 × m0 m1 + q4 0
  1   q5 ← mhi
  3   mhi q10 q11 ( ) ← r1 × m0 m1 m2 + q5 q6 q7
  2   q8 q9 ← 0 mhi + q2 q3

 42   (total instructions)

Figure 5: Kestrel instruction counts and symbolic code for single-precision division with rounding (but without normalizing).
