
Computer Arithmetic 1, Dept. of EE, Fu Jen Catholic University, Taiwan

Computer Arithmetic Design

Instructor: Kuan Jen Lin
E-Mail: [email protected]
Web: http://vlsi.ee.fju.edu.tw/teacher/kjlin/kjlin.htm
Dept. of EE, FJU, Taiwan
Room: SF 727B


SW & HW

SW = Algorithm + Data Structure + Programming techniques

HW = Algorithm + Architecture + Design Method

Computing

Communication

Pipeline

Systolic array

Low power

Interface

…

Full custom

Cell based

FPGA

System level


Course Objectives
• Learn computer algorithms to do arithmetic operations.
• Learn hardware designs for computer arithmetic.

After completing the course:
• Students are able to implement computer arithmetic hardware designs using HDL.
• Students are able to read research papers about computer arithmetic.


Textbook
• Textbook:
  Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press
• Reference books:
  Ercegovac and Lang, “Digital Arithmetic,” MKP.
  Stine, “Digital Computer Arithmetic Datapath Design Using Verilog HDL,” CAP


Syllabus

• Number representation
• Two-operand Addition
• Multi-operand Addition
• Multiplication
• Division
• Square Root
• Papers reading and presentation


Grading

• Mid Exam (30%)
• Papers reading and presentation (30%)
• Homework (some problems need HDL programming) (30%)
• Attendance and Others (10%)


Number Representation


Most slides are revisions of PowerPoint files obtained from the textbook website.


Numbers and Arithmetic

Chapter Goals
• Define scope and provide motivation
• Set the framework for the rest of the book
• Review positional fixed-point numbers

Chapter Highlights
• What goes on inside your calculator?
• Ways of encoding numbers in k bits
• Radices and digit sets: conventional, exotic
• Conversion from one system to another


What is Computer Arithmetic?

Pentium Division Bug (1994-95): Pentium’s radix-4 SRT algorithm occasionally gave an incorrect quotient.
First noted in 1994 by T. Nicely, who computed sums of reciprocals of twin primes:

1/5 + 1/7 + 1/11 + 1/13 + . . . + 1/p + 1/(p + 2) + . . .

Worst-case example of division error in Pentium:

c = 4 195 835 / 3 145 727
Correct quotient: 1.333 820 44...
circa 1994 Pentium double FLP value: 1.333 739 06... (accurate to only 14 bits; worse than single!)


The Scope of Computer Arithmetic.

Hardware (our focus in this book)
  Design of efficient digital circuits for primitive and other arithmetic operations such as +, –, ×, ÷, √, log, sin, cos
  Issues: Algorithms; Error analysis; Speed/cost trade-offs; Hardware implementation; Testing, verification
  General-purpose: Flexible data paths; Fast primitive operations like +, –, ×, ÷, √; Benchmarking
  Special-purpose: Tailored to applications like: Digital filtering; Image processing; Radar tracking

Software
  Numerical methods for solving systems of linear equations, partial differential equations, etc.
  Issues: Algorithms; Error analysis; Computational complexity; Programming; Testing, verification


A Motivating Example

Using a calculator with √, $x^2$, and $x^y$ functions, compute:
u = √√ … √2 = 1.000 677 131   “1024th root of 2”
v = $2^{1/1024}$ = 1.000 677 131
Save u and v; if you can’t save, recompute values when needed.
x = $(((u^2)^2)...)^2$ = 1.999 999 963
x' = $u^{1024}$ = 1.999 999 973
y = $(((v^2)^2)...)^2$ = 1.999 999 983
y' = $v^{1024}$ = 1.999 999 994
Perhaps v and u are not really the same value:
w = v – u = $1 \times 10^{-11}$   Nonzero due to hidden digits
(u – 1) × 1000 = 0.677 130 680   [Hidden ... (0) 68]
(v – 1) × 1000 = 0.677 130 690   [Hidden ... (0) 69]


Finite Precision Can Lead to Disaster
Example: Failure of Patriot Missile (1991 Feb. 25)
Source: http://www.math.psu.edu/dna/455.f96/disasters.html

An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile.
The Scud struck an American Army barracks, killing 28.
Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot)

Problem specifics:
Time in tenths of a second, as measured by the system’s internal clock, was multiplied by 1/10 to get the time in seconds.
Internal registers were 24 bits wide.
1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)
Error ≈ 0.1100 1100 × $2^{-23}$ ≈ 9.5 × $10^{-8}$

Error in 100-hr operation period ≈ 9.5 × $10^{-8}$ × 100 × 60 × 60 × 10 = 0.34 s

Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
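The error figures above can be reproduced with exact rational arithmetic. The following is a minimal Python sketch, assuming that the stored constant keeps the 23 fractional bits shown in the bit string and that the remaining bits are simply chopped. The helper name `chop` is illustrative and does not come from the slides or the GAO report.

```python
from fractions import Fraction

def chop(value, frac_bits):
    """Truncate a nonnegative fraction to frac_bits binary fractional bits."""
    scaled = value * 2**frac_bits
    return Fraction(scaled.numerator // scaled.denominator, 2**frac_bits)

tenth = Fraction(1, 10)
stored = chop(tenth, 23)            # 0.0001 1001 1001 1001 1001 100 in binary
error = tenth - stored              # representation error of 1/10

ticks = 100 * 60 * 60 * 10          # tenths of a second in 100 hours
drift_seconds = float(error * ticks)
distance_m = drift_seconds * 1676   # Scud speed in m/s, as quoted on the slide

print(f"error per tick  = {float(error):.3e}")
print(f"drift in 100 hr = {drift_seconds:.2f} s")
print(f"distance        = {distance_m:.0f} m")
```

Because this sketch does not round the drift to 0.34 s before multiplying, the distance it prints is a few meters above the slide's rounded 570 m.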


Numbers and Their Encodings

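Some 4-bit number representation formats

[Figure: 4-bit layouts of the following formats]
• Unsigned integer
• Signed integer (sign ± and magnitude)
• Fixed point, 3+1 (radix point before the last bit)
• Signed fraction (sign ± and fraction)
• 2's-compl fraction
• Floating point, fields e and s (exponent in {−2, −1, 0, 1}, significand in {0, 1, 2, 3})
• Logarithmic (base-2 logarithm, log x)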


Encoding Numbers in 4 Bits

[Figure: the values representable in each 4-bit number format, marked on a number line from −16 to 16]
• Unsigned integers
• Signed-magnitude (±)
• 3 + 1 fixed-point, xxx.x
• Signed fraction, ±.xxx
• 2’s-compl. fraction, x.xxx
• 2 + 2 floating-point, s × $2^e$, e in [−2, 1], s in [0, 3]
• 2 + 2 logarithmic (log = xx.xx)


Fixed-Radix Positional Number Systems

$(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$

One can generalize to:
• Arbitrary radix (not necessarily integer, positive, constant)
• Arbitrary digit set, usually {–α, –α+1, . . . , β–1, β} = [–α, β]

Example 1.1. Balanced ternary number system: Radix r = 3, digit set = [–1, 1]

Example 1.2. Negative-radix number systems: Radix –r, r ≥ 2, digit set = [0, r – 1]
The special case with radix –2 and digit set [0, 1] is known as the negabinary number system.
Can it represent all integers?
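The defining sum, and the question of whether negabinary covers all integers, can be explored with a short Python sketch. The function names `positional_value` and `to_negabinary` are illustrative assumptions; the check covers a finite range only, so it illustrates the claim rather than proving it.

```python
def positional_value(digits, radix, frac_len=0):
    """Evaluate sum(x_i * r**i) for digits listed most significant first.

    frac_len is l, the number of digits to the right of the radix point.
    Works for negative radices and for negative digit values.
    """
    k = len(digits) - frac_len          # number of whole digits
    return sum(d * radix**(k - 1 - pos) for pos, d in enumerate(digits))

def to_negabinary(n):
    """Digits (most significant first) of integer n in radix -2, digit set [0, 1]."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, rem = divmod(n, -2)
        if rem < 0:                     # force the remainder into [0, 1]
            n += 1
            rem += 2
        digits.append(rem)
    return digits[::-1]

print(positional_value([1, -1, 0], 3))      # balanced ternary digits 1, -1, 0
print(to_negabinary(-3), to_negabinary(6))
```

Both positive and negative integers come out with digits 0 and 1 only, with no separate sign, which is the point of a negative radix.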


More Examples of Number Systems

Example 1.3. Digit set [–4, 5] for r = 10: (3 –1 5)ten represents 295 = 300 – 10 + 5

Example 1.4. Digit set [–7, 7] for r = 10: (3 –1 5)ten = (3 0 –5)ten = (1 –7 0 –5)ten

Example 1.7. Quater-imaginary number system: radix r = 2j, digit set [0, 3]


Number Radix Conversion

u = w . v   (whole part w, fractional part v)
  = $(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r$   Old
  = $(X_{K-1} X_{K-2} \ldots X_1 X_0 \,.\, X_{-1} X_{-2} \ldots X_{-L})_R$   New

Radix conversion, using arithmetic in the old radix r
Convenient when converting from r = 10

Radix conversion, using arithmetic in the new radix R
Convenient when converting to R = 10

Example: (31)eight = (25)ten   31 Oct. = 25 Dec.   Halloween = Xmas


Radix Conversion: Old-Radix Arithmetic

Converting whole part w: (105)ten = (?)five

Repeatedly divide by five:
  Quotient   Remainder
    105
     21         0
      4         1
      0         4

Therefore, (105)ten = (410)five

Converting fractional part v: (105.486)ten = (410.?)five

Repeatedly multiply by five:
  Whole part   Fraction
                 .486
      2          .430
      2          .150
      0          .750
      3          .750
      3          .750

Therefore, (105.486)ten ≅ (410.22033)five
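The two procedures above (repeated division for the whole part, repeated multiplication for the fraction) can be sketched in Python as follows. The function names are illustrative, and `Fraction` is used here only to keep the decimal fraction .486 exact; it is an implementation choice, not part of the slide.

```python
from fractions import Fraction

def convert_whole(w, new_radix):
    """Repeatedly divide by the new radix; remainders are the digits, LSD first."""
    digits = []
    while w > 0:
        w, rem = divmod(w, new_radix)
        digits.append(rem)
    return digits[::-1] or [0]

def convert_fraction(v, new_radix, num_digits):
    """Repeatedly multiply by the new radix; whole parts are the digits, MSD first."""
    digits = []
    for _ in range(num_digits):
        v *= new_radix
        whole = int(v)          # v is nonnegative, so int() is the floor
        digits.append(whole)
        v -= whole
    return digits

print(convert_whole(105, 5))                           # whole part of 105.486
print(convert_fraction(Fraction(486, 1000), 5, 5))     # first five fraction digits
```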


Radix Conversion: New-Radix Arithmetic

Converting whole part w: (22033)five = (?)ten, using Horner’s rule or formula:

((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3

  2 × 5 = 10
  10 + 2 = 12
  12 × 5 + 0 = 60
  60 × 5 + 3 = 303
  303 × 5 + 3 = 1518

Converting fractional part v: (410.22033)five = (105.?)ten

(0.22033)five × $5^5$ = (22033)five = (1518)ten
1518 / $5^5$ = 1518 / 3125 = 0.48576

Therefore, (410.22033)five = (105.48576)ten


Horner’s Rule for Fractions

Converting fractional part v: (0.22033)five = (?)ten, using Horner’s rule or formula:

(((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5

  3 / 5 = 0.6
  0.6 + 3 = 3.6
  3.6 / 5 + 0 = 0.72
  0.72 / 5 + 2 = 2.144
  2.144 / 5 + 2 = 2.4288
  2.4288 / 5 = 0.48576
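Horner’s rule for both the whole and the fractional part fits in a few lines of Python. This is a sketch with illustrative function names; exact fractions are an assumption made here to avoid floating-point rounding in the check.

```python
from fractions import Fraction

def horner_whole(digits, radix):
    """Whole part: scan digits MSD first, multiply-then-add."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

def horner_fraction(digits, radix):
    """Fractional part: scan digits LSD first, add-then-divide."""
    value = Fraction(0)
    for d in reversed(digits):
        value = (value + d) / radix
    return value

print(horner_whole([2, 2, 0, 3, 3], 5))
print(float(horner_fraction([2, 2, 0, 3, 3], 5)))
```

The fractional version needs one division per digit, whereas the alternative shown earlier (treat the fraction digits as an integer, then divide once by $5^5$) needs a single division at the end.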


Classes of Number Representations

• Signed number
• Redundant number system
• Residue number system
• Real number


2 Representing Signed Numbers

Chapter Goals
• Learn different encodings of the sign info
• Discuss implications for arithmetic design

Chapter Highlights
• Using sign bit, biasing, complementation
• Properties of 2’s-complement numbers
• Signed vs unsigned arithmetic
• Signed numbers, positions, or digits


[Figure: the 16 four-bit patterns arranged around a circle, each labeled with its signed-magnitude value]

Bit pattern (representation)   Signed values (signed magnitude)
0000 to 0111                   0, +1, +2, . . . , +7
1000 to 1111                   -0, -1, -2, . . . , -7

Stepping through the bit patterns increments the value on the positive side and decrements it on the negative side.

Four-bit signed-magnitude number representation system for integers


Four-bit biased integer number representation system with a bias of 8

[Figure: the 16 four-bit patterns arranged around a circle, each labeled with its biased value]

Bit pattern (representation)   Signed values (biased by 8)
0000 to 0111                   -8, -7, -6, . . . , -1
1000 to 1111                   0, +1, +2, . . . , +7

Stepping through the bit patterns increments the value on both sides.


Arithmetic with Biased Numbers

Addition/subtraction of biased numbers:
x + y + bias = (x + bias) + (y + bias) – bias
x – y + bias = (x + bias) – (y + bias) + bias

A power-of-2 (or $2^a - 1$) bias simplifies addition/subtraction.

Comparison of biased numbers:
• Compare like ordinary unsigned numbers
• Find true difference by ordinary subtraction

We seldom perform arbitrary arithmetic on biased numbers.
Main application: Exponent field of floating-point numbers
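The two identities for biased addition and subtraction can be checked with a minimal Python sketch. It assumes the bias of 8 from the four-bit example; the function names are illustrative, and range checking of the results is deliberately left out.

```python
BIAS = 8  # bias of the four-bit example

def encode(x):
    """Biased code of a value in [-8, 7]."""
    return x + BIAS

def decode(code):
    return code - BIAS

def biased_add(cx, cy):
    # (x + bias) + (y + bias) - bias = x + y + bias
    return cx + cy - BIAS

def biased_sub(cx, cy):
    # (x + bias) - (y + bias) + bias = x - y + bias
    return cx - cy + BIAS

print(decode(biased_add(encode(3), encode(-5))))
print(encode(-2) < encode(1))   # codes compare like unsigned numbers
```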


Example and Two Special Cases

Example -- complement system for fixed-point numbers:
Complementation constant M = 12.000
Fixed-point number range [–6.000, +5.999]
Represent –3.258 as 12.000 – 3.258 = 8.742

Auxiliary operations for complement representations:
• Complementation or change of sign (computing M – x)
• Computation of residues mod M

Thus, M must be selected to simplify these operations.

Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits:

Radix complement   M = $r^k$
Digit complement   M = $r^k$ – ulp   (aka diminished radix compl)

ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers


Two’s-Complement Numbers

Two’s complement = radix complement system for r = 2

M = $2^k$

$2^k - x = [(2^k - \text{ulp}) - x] + \text{ulp} = x^{\text{compl}} + \text{ulp}$

Range of representable numbers with k whole bits:
from $-2^{k-1}$ to $2^{k-1}$ – ulp

ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers

[Figure: the 16 four-bit patterns arranged around a circle, each labeled with its 2’s-complement value]

Unsigned representations   Signed values (2’s complement)
0000 to 0111               +0, +1, +2, . . . , +7
1000 to 1111               -8, -7, -6, . . . , -1
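For integers (l = 0, so ulp = 1) the identity "complement every bit, then add ulp" can be sketched in Python as below. The function names are illustrative; bit patterns are held as nonnegative Python integers of width k.

```python
def twos_negate(pattern, k):
    """Negate a k-bit 2's-complement pattern: bitwise complement plus ulp, mod 2**k."""
    mask = (1 << k) - 1
    return ((pattern ^ mask) + 1) & mask

def twos_value(pattern, k):
    """Signed value of a k-bit pattern; the MSB contributes -2**(k-1)."""
    msb = pattern >> (k - 1)
    return pattern - (msb << k)

print(twos_value(0b1101, 4))
print(format(twos_negate(0b0011, 4), "04b"))
```

Negating the most negative pattern (1000 for k = 4) returns the same pattern, since +8 lies outside the range $[-2^{k-1}, 2^{k-1} - 1]$.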


One’s-Complement Number Representation

One’s complement = digit complement (diminished radix complement) system for r = 2

M = $2^k$ – ulp

$(2^k - \text{ulp}) - x = x^{\text{compl}}$

Range of representable numbers with k whole bits:
from $-2^{k-1}$ + ulp to $2^{k-1}$ – ulp

[Figure: the 16 four-bit patterns arranged around a circle, each labeled with its 1’s-complement value]

Unsigned representations   Signed values (1’s complement)
0000 to 0111               +0, +1, +2, . . . , +7
1000 to 1111               -7, -6, -5, . . . , -0


Range/Precision extension for 2’s- and 1’s Complement

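Range/precision extension for 2’s-complement numbers:
$\ldots\ x_{k-1}\ x_{k-1}\ x_{k-1}\ \ x_{k-1}\ x_{k-2} \ldots x_1\ x_0 \,.\, x_{-1}\ x_{-2} \ldots x_{-l}\ \ 0\ 0\ 0 \ldots$
(sign extension on the left repeats the sign bit $x_{k-1}$; the extension bits to the right of the LSD $x_{-l}$ are 0)

Range/precision extension for 1’s-complement numbers:
$\ldots\ x_{k-1}\ x_{k-1}\ x_{k-1}\ \ x_{k-1}\ x_{k-2} \ldots x_1\ x_0 \,.\, x_{-1}\ x_{-2} \ldots x_{-l}\ \ x_{k-1}\ x_{k-1}\ x_{k-1} \ldots$
(sign extension on the left repeats the sign bit $x_{k-1}$; the extension bits to the right of the LSD $x_{-l}$ also equal the sign bit)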


Mod $2^k$ vs Mod $2^k - 1$

Mod-$2^k$ operation needed in 2’s-complement arithmetic is trivial:
Simply drop the carry-out (subtract $2^k$ if result is $2^k$ or greater)

Mod-($2^k$ – ulp) operation needed in 1’s-complement arithmetic is done via end-around carry:
(x + y) – ($2^k$ – ulp)   Connect $c_{out}$ to $c_{in}$

Since the dropped carry is worth $2^k$ units and the inserted carry is worth ulp, the combined effect is to reduce the magnitude by $2^k$ – ulp.
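End-around carry for integer 1’s-complement addition (ulp = 1) can be sketched in Python as follows. The function names are illustrative, and the sketch assumes the true sum lies within the representable range; overflow detection is not included.

```python
def ones_encode(value, k):
    """k-bit 1's-complement pattern of a value in [-(2**(k-1) - 1), 2**(k-1) - 1]."""
    mask = (1 << k) - 1
    return value if value >= 0 else value + mask

def ones_value(pattern, k):
    """Signed value of a k-bit 1's-complement pattern (1111...1 is negative zero)."""
    mask = (1 << k) - 1
    return pattern - mask if pattern >> (k - 1) else pattern

def ones_add(x, y, k):
    """Add two k-bit patterns and feed the carry-out back in as the carry-in."""
    mask = (1 << k) - 1
    total = x + y
    carry_out = total >> k
    return ((total & mask) + carry_out) & mask

print(format(ones_add(0b0101, 0b1100, 4), "04b"))   # (+5) + (-3)
```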


Why 2’s-Complement Is the Universal Choice

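Adder/subtractor architecture for 2’s-complement numbers.

[Figure: a mux controlled by the add/sub signal passes y or y′ (controlled complementation) to an adder whose other input is x. The same control signal drives $c_{in}$: 0 for addition, 1 for subtraction. The adder outputs s = x ± y and $c_{out}$.]

Can replace this mux with k XOR gates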


Interpreting a 2’s-complement number as having a negatively weighted most-significant digit.

x = (1 0 1 0 0 1 1 0)two’s-compl
Weights: $-2^7,\ 2^6,\ 2^5,\ 2^4,\ 2^3,\ 2^2,\ 2^1,\ 2^0$
x = –128 + 32 + 4 + 2 = –90

Check:
x = (1 0 1 0 0 1 1 0)two’s-compl
–x = (0 1 0 1 1 0 1 0)two
Weights: $2^7,\ 2^6,\ 2^5,\ 2^4,\ 2^3,\ 2^2,\ 2^1,\ 2^0$
–x = 64 + 16 + 8 + 2 = 90


Redundant Number Systems

Chapter Goals
• Explore the advantages and drawbacks of using more than r digit values in radix r

Chapter Highlights
• Redundancy eliminates long carry chains
• Redundancy takes many forms: trade-offs
• Conversions between redundant and nonredundant representations
• Redundancy used for end values too?


Coping with the Carry Problem

Ways of dealing with the carry propagation problem:

1. Limit propagation to within a small number of bits (Chapters 3-4)

2. Detect end of propagation; don’t wait for worst case (Chapter 5)

3. Speed up propagation via lookahead etc. (Chapters 6-7)

4. Ideal: Eliminate carry propagation altogether! (Chapter 3)


Use Redundant Number System (1/2)

      5  7  8  2  4  9
+     6  2  9  3  8  9     Operand digits in [0, 9]
----------------------
     11  9 17  5 12 18     Position sums in [0, 18]

But how can we extend this beyond a single addition? Subsequent additions will cause problems.

• The digit values 10 through 18 are redundant.
• A position sum of 10 or more would ordinarily produce a carry; no position sum exceeds 18.


Use Redundant Number System (2/2)

18 18 18 18 18

+ 0 0 0 0 1

Is there still a carry propagation problem?

The sum of the digits in each position is in [0, 36]; each such sum can be decomposed into an interim sum in [0, 16] and a transfer digit (i.e., a carry) in [0, 2].

8 8 8 8 9

1 1 1 1

1 9 9 9 9 9


Example: Addition of Redundant Numbers

Position sum decomposition: [0, 36] = 10 × [0, 2] + [0, 16]
Absorption of transfer digit: [0, 16] + [0, 2] = [0, 18]

     11  9 17 10 12 18
+     6 12  9 10  8 18     Operand digits in [0, 18]
----------------------
     17 21 26 20 20 36     Position sums in [0, 36]
      7 11 16  0 10 16     Interim sums in [0, 16]
   1  1  1  2  1  2        Transfer digits in [0, 2]
----------------------
   1  8 12 18  1 12 16     Sum digits in [0, 18]
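The decomposition into interim sums and transfer digits can be sketched in Python. The rule used here for picking the transfer (`min(p // 10, 2)`) is one valid choice assumed for illustration; the slide picks transfers differently in some positions, so individual sum digits may differ from the slide while the represented value is the same. Function names are illustrative.

```python
def carry_free_add(x, y, radix=10):
    """Add two radix-10 numbers with digits in [0, 18], most significant first.

    Every position is handled independently: no carry ripples past one position.
    """
    position_sums = [a + b for a, b in zip(x, y)]                        # in [0, 36]
    transfers = [min(p // radix, 2) for p in position_sums]              # in [0, 2]
    interim = [p - radix * t for p, t in zip(position_sums, transfers)]  # in [0, 16]
    incoming = transfers[1:] + [0]        # transfer arriving from the position to the right
    return [transfers[0]] + [w + t for w, t in zip(interim, incoming)]

def value(digits, radix=10):
    """Integer represented by a digit vector, most significant first."""
    total = 0
    for d in digits:
        total = total * radix + d
    return total

a = [11, 9, 17, 10, 12, 18]
b = [6, 12, 9, 10, 8, 18]
s = carry_free_add(a, b)
print(s, value(s))
```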


Carry-Free Addition Schemes

[Figure: three addition schemes, each taking operand digits $x_i$, $y_i$ at positions i+1, i, i–1 and producing sum digits $s_{i+1}$, $s_i$, $s_{i-1}$]

(a) Ideal single-stage carry-free. (Impossible for positional system with fixed digit set)
(b) Two-stage carry-free. Labels: operand digits at position i, interim sum at position i, transfer digit $t_i$ into position i.
(c) Single-stage with lookahead.


Redundancy Index

So, redundancy helps us achieve carry-free addition.
But how much redundancy is actually needed? Is [0, 11] enough for r = 10?

     11 10  7 11  3  8
+     7  2  9 10  9  8     Operand digits in [0, 11]
----------------------
     18 12 16 21 12 16     Position sums in [0, 22]
      8  2  6  1  2  6     Interim sums in [0, 9]
   1  1  1  2  1  1        Transfer digits in [0, 2]
----------------------
   1  9  3  8  2  3  6     Sum digits in [0, 11]

Redundancy index ρ = α + β + 1 – r
For example, 0 + 11 + 1 – 10 = 2


Digit Sets and Digit-Set Conversions

Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10

   11  9 17 10 12 18     18 = 10 (carry 1) + 8
   11  9 17 10 13  8     13 = 10 (carry 1) + 3
   11  9 17 11  3  8     11 = 10 (carry 1) + 1
   11  9 18  1  3  8     18 = 10 (carry 1) + 8
   11 10  8  1  3  8     10 = 10 (carry 1) + 0
   12  0  8  1  3  8     12 = 10 (carry 1) + 2
 1  2  0  8  1  3  8     Answer; all digits in [0, 9]

Note: Conversion from redundant to nonredundant representation always involves carry propagation

Thus, the process is sequential and slow
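The right-to-left conversion of Example 3.1 can be sketched in Python; the loop makes the sequential carry dependence explicit. The function name is illustrative.

```python
def to_conventional(digits, radix=10):
    """Convert digits in a redundant set such as [0, 18] to [0, radix - 1].

    digits are most significant first; the carry ripples from right to left.
    """
    result = []
    carry = 0
    for d in reversed(digits):
        carry, digit = divmod(d + carry, radix)
        result.append(digit)
    while carry:                      # extra leading digit(s), if any
        carry, digit = divmod(carry, radix)
        result.append(digit)
    return result[::-1]

print(to_conventional([11, 9, 17, 10, 12, 18]))
```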


Generalized Signed-Digit Numbers

Radix r
Digit set [–α, β]
Requirement: α + β + 1 ≥ r
Redundancy index: ρ = α + β + 1 – r

[Figure: taxonomy of radix-r positional number systems]

Radix-r Positional
  ρ = 0: Non-redundant
    α = 0: Conventional
    α ≥ 1: Non-redundant signed-digit
  ρ ≥ 1: Generalized signed-digit (GSD)
    ρ = 1: Minimal GSD
      α = β (even r): Symmetric minimal GSD
        r = 2: BSD or BSB
      α ≠ β: Asymmetric minimal GSD
        α = 0: Stored-carry (SC)
          r = 2: BSC
        α = 1 (r ≠ 2): Non-binary SB
    ρ ≥ 2: Non-minimal GSD
      α = β: Symmetric non-minimal GSD
        α < r: Ordinary signed-digit (OSD)
          α = ⌊r/2⌋ + 1: Minimally redundant OSD
          α = r – 1: Maximally redundant OSD
      α ≠ β: Asymmetric non-minimal GSD
        α = 0: Unsigned-digit redundant (UDR)
        α = 1, β = r: SCB
          r = 2: BSCB


Binary Signed Digit (BSD)

$x_i$          1    –1   0    –1   0      BSD representation of +6
⟨s, v⟩        01   11   00   11   00     Sign and value encoding
2’s-compl     01   11   00   11   00     2-bit 2’s-complement
⟨n, p⟩        01   10   00   10   00     Negative & positive flags
⟨n, z, p⟩     001  100  010  100  010    1-out-of-3 encoding


Carry-Free Addition Algorithms

Carry-free addition of GSD numbers:
1. Compute the position sums $p_i = x_i + y_i$
2. Divide $p_i$ into a transfer $t_{i+1}$ and interim sum $w_i = p_i - r t_{i+1}$
3. Add incoming transfers to get the sum digits $s_i = w_i + t_i$

[Figure: two-stage carry-free adder; operand digits $x_i$, $y_i$ produce $w_i$ and $t_{i+1}$, and $w_i$ is combined with the incoming $t_i$ to give $s_i$]

If the transfer digits $t_i$ are in [–λ, μ], we must have:

$-\alpha + \lambda \le p_i - r t_{i+1} \le \beta - \mu$

Here the middle term is the interim sum, the left bound is the smallest interim sum if a transfer of –λ is to be absorbable, and the right bound is the largest interim sum if a transfer of μ is to be absorbable.

These constraints lead to:
λ ≥ α / (r – 1)
μ ≥ β / (r – 1)


Is Carry-Free Addition Always Applicable?

No: It requires one of the following two conditions [Parh 90]
a. r > 2, ρ ≥ 3
b. r > 2, ρ = 2, α ≠ 1, β ≠ 1   e.g., not [−1, 10] in radix 10

In other words, it is inapplicable for:
r = 2   Perhaps most useful case
ρ = 1   e.g., carry-save
ρ = 2 with α = 1 or β = 1   e.g., carry/borrow-save

BSD is not two-stage carry-free
[Figure: BSD addition example]


Use Carry-Estimate

A position sum –1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of –1 for incoming transfer in [–1, 0]. This guarantees that ti ≠ wi and thus –1 ≤ si ≤ 1.

          1   –1    0   –1    0     $x_i$ in [–1, 1]
+         0   –1   –1    0    1     $y_i$ in [–1, 1]
-------------------------------
          1   –2   –1   –1    1     $p_i$ in [–2, 2]
 high   low  low  low  high high    $e_i$ in {low: [–1, 0], high: [0, 1]}
          1    0    1   –1   –1     $w_i$ in [–1, 1]
    0    –1   –1    0    1          $t_{i+1}$ in [–1, 1]
-------------------------------
    0     0   –1    1    0   –1     $s_i$ in [–1, 1]
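The carry-estimate rule can be sketched in Python. The estimate used here is an assumption consistent with the example: the incoming transfer is taken as "low" (in [–1, 0]) when either operand digit in the next-lower position is negative, and "high" (in [0, 1]) otherwise. The function names are illustrative.

```python
def bsd_add(x, y):
    """Add two equal-length BSD numbers (digits in {-1, 0, 1}, most significant first)."""
    n = len(x)
    xs, ys = x[::-1], y[::-1]          # xs[i], ys[i] are the digits at position i
    w = [0] * n                        # interim sums
    t = [0] * (n + 1)                  # t[i] is the transfer into position i
    for i in range(n):
        p = xs[i] + ys[i]
        # Estimate of the incoming transfer t[i]: "low" means t[i] is in [-1, 0],
        # which is certain when either digit at position i-1 is negative.
        low = i > 0 and (xs[i - 1] < 0 or ys[i - 1] < 0)
        if p == 2:
            w[i], t[i + 1] = 0, 1
        elif p == 1:
            w[i], t[i + 1] = (1, 0) if low else (-1, 1)
        elif p == -1:
            w[i], t[i + 1] = (1, -1) if low else (-1, 0)
        elif p == -2:
            w[i], t[i + 1] = 0, -1
        # p == 0 leaves w[i] = 0 and t[i + 1] = 0
    s = [w[i] + t[i] for i in range(n)] + [t[n]]
    return s[::-1]

def bsd_value(digits):
    """Integer represented by BSD digits, most significant first."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total

print(bsd_add([1, -1, 0, -1, 0], [0, -1, -1, 0, 1]))
```

Each sum digit depends only on the operand digits at positions i and i–1 (for the transfer) and i–2 (for the estimate used in forming that transfer), so the delay does not grow with the word width.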


Residue Number Systems

Chapter Goals
• Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations

Chapter Highlights
• Moduli, range, arithmetic operations
• Many sets of moduli possible: tradeoffs
• Conversions between RNS and binary
• The Chinese remainder theorem
• Why are RNS applications limited?


RNS Representations and Arithmetic

Chinese puzzle, 1500 years ago:

What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively?

Residues uniquely identify the number, hence they constitute a representation

Pairwise relatively prime moduli: $m_{k-1} > \ldots > m_1 > m_0$

The residue $x_i$ of x wrt the ith modulus $m_i$ (similar to a digit):
$x_i = x \bmod m_i = \langle x \rangle_{m_i}$

RNS representation contains a list of k residues or digits:
x = (2 | 3 | 2)RNS(7|5|3)

Default RNS for this chapter: RNS(8 | 7 | 5 | 3)


RNS Dynamic Range

Product M of the k pairwise relatively prime moduli is the dynamic range:
$M = m_{k-1} \times \ldots \times m_1 \times m_0$
For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840

Negative numbers: Complement relative to M
$\langle -x \rangle_{m_i} = \langle M - x \rangle_{m_i}$
21 = (5 | 0 | 1 | 0)RNS
–21 = (8 – 5 | 0 | 5 – 1 | 0)RNS = (3 | 0 | 4 | 0)RNS

Here are some example numbers in our default RNS(8 | 7 | 5 | 3):
(0 | 0 | 0 | 0)RNS   Represents 0 or 840 or . . .
(1 | 1 | 1 | 1)RNS   Represents 1 or 841 or . . .
(2 | 2 | 2 | 2)RNS   Represents 2 or 842 or . . .
. . .
(0 | 1 | 4 | 1)RNS   Represents 64 or 904 or . . .
(2 | 0 | 0 | 2)RNS   Represents –70 or 770 or . . .
(7 | 6 | 4 | 2)RNS   Represents –1 or 839 or . . .

We can take the range of RNS(8|7|5|3) to be [−420, 419] or any other set of 840 consecutive integers


RNS as Weighted Representation

For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are:
105   120   336   280

Example: (1 | 2 | 4 | 0)RNS represents the number
$\langle 105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0 \rangle_{840} = \langle 1689 \rangle_{840} = 9$

For RNS(7 | 5 | 3), the weights of the 3 positions are:
15   21   70

Example -- Chinese puzzle: (2 | 3 | 2)RNS(7|5|3) represents the number
$\langle 15 \times 2 + 21 \times 3 + 70 \times 2 \rangle_{105} = \langle 233 \rangle_{105} = 23$

We will see later how the weights can be determined for a given RNS


RNS Encoding and Arithmetic Operations

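Binary-coded format for RNS(8 | 7 | 5 | 3).
[Figure: 11-bit word with fields mod 8 (3 bits), mod 7 (3 bits), mod 5 (3 bits), mod 3 (2 bits)]

Arithmetic in RNS(8 | 7 | 5 | 3):
(5 | 5 | 0 | 2)RNS   Represents x = +5
(7 | 6 | 4 | 2)RNS   Represents y = –1
(4 | 4 | 4 | 1)RNS   x + y : $\langle 5 + 7 \rangle_8 = 4$, $\langle 5 + 6 \rangle_7 = 4$, etc.
(6 | 6 | 1 | 0)RNS   x – y : $\langle 5 - 7 \rangle_8 = 6$, $\langle 5 - 6 \rangle_7 = 6$, etc. (alternatively, find –y and add to x)
(3 | 2 | 0 | 1)RNS   x × y : $\langle 5 \times 7 \rangle_8 = 3$, $\langle 5 \times 6 \rangle_7 = 2$, etc.

[Figure: Operand 1 and Operand 2 feed four independent units (Mod-8, Mod-7, Mod-5, Mod-3), with widths of 3, 3, 3, and 2 bits, that together produce the Result]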
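The digit-wise nature of RNS addition, subtraction, and multiplication can be sketched in Python. The names `to_rns` and `rns_op` are illustrative; each residue channel is computed without reference to the others, which mirrors the independent mod units in the figure.

```python
import operator

MODULI = (8, 7, 5, 3)   # default RNS of this chapter

def to_rns(x):
    """Residues of integer x; Python's % already yields nonnegative residues."""
    return tuple(x % m for m in MODULI)

def rns_op(op, a, b):
    """Apply op independently in every residue channel."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, MODULI))

x, y = to_rns(5), to_rns(-1)
print(rns_op(operator.add, x, y))
print(rns_op(operator.sub, x, y))
print(rns_op(operator.mul, x, y))
```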


Choosing the RNS Moduli

Target range for our RNS: Decimal values [0, 100 000]

Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence

Pick $m_0$ = 2, $m_1$ = 3, $m_2$ = 5, etc. After adding $m_5$ = 13:
RNS(13 | 11 | 7 | 5 | 3 | 2)        M = 30 030     Inadequate
RNS(17 | 13 | 11 | 7 | 5 | 3 | 2)   M = 510 510    Too large
RNS(17 | 13 | 11 | 7 | 3 | 2)       M = 102 102    Just right!
5 + 4 + 4 + 3 + 2 + 1 = 19 bits

Fine tuning: Combine pairs of moduli 2 & 13 (26) and 3 & 7 (21)
RNS(26 | 21 | 17 | 11)              M = 102 102


An Improved Strategy

Target range for our RNS: Decimal values [0, 100 000]

Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime

RNS($2^2$ | 3)                          M = 12
RNS($3^2$ | $2^3$ | 7 | 5)              M = 2520
RNS(11 | $3^2$ | $2^3$ | 7 | 5)         M = 27 720
RNS(13 | 11 | $3^2$ | $2^3$ | 7 | 5)    M = 360 360
(remove one 3, combine 3 & 5)
RNS(15 | 13 | 11 | $2^3$ | 7)           M = 120 120
4 + 4 + 4 + 3 + 3 = 18 bits

Fine tuning: Maximize the size of the even modulus within the 4-bit limit
RNS($2^4$ | 13 | 11 | $3^2$ | 7 | 5)    M = 720 720    Too large
We can now remove 5 or 7; not an improvement in this example


Low-Cost RNS ModuliTarget range for our RNS: Decimal values [0, 100 000]

Strategy 3: To simplify the modular reduction (mod $m_i$) operations, choose only moduli of the forms $2^a$ or $2^a - 1$, aka “low-cost moduli”

RNS($2^{a_{k-1}}$ | $2^{a_{k-2}} - 1$ | . . . | $2^{a_1} - 1$ | $2^{a_0} - 1$)

We can have only one even modulus.
$2^{a_i} - 1$ and $2^{a_j} - 1$ are relatively prime iff $a_i$ and $a_j$ are relatively prime.

RNS($2^3$ | $2^3 - 1$ | $2^2 - 1$)               basis: 3, 2      M = 168
RNS($2^4$ | $2^4 - 1$ | $2^3 - 1$)               basis: 4, 3      M = 1680
RNS($2^5$ | $2^5 - 1$ | $2^3 - 1$ | $2^2 - 1$)   basis: 5, 3, 2   M = 20 832
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$)   basis: 5, 4, 3   M = 104 160

Comparison
RNS(15 | 13 | 11 | $2^3$ | 7)                    18 bits   M = 120 120
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$)   17 bits   M = 104 160

Reduction mod $2^k$ and mod $2^k - 1$ is easy.


Encoding and Decoding of Numbers

Conversion from binary/decimal to RNS

Table 4.1 Residues of the first 10 powers of 2

i    $2^i$    $\langle 2^i \rangle_7$    $\langle 2^i \rangle_5$    $\langle 2^i \rangle_3$
0    1        1      1      1
1    2        2      2      2
2    4        4      4      1
3    8        1      3      2
4    16       2      1      1
5    32       4      2      2
6    64       1      4      1
7    128      2      3      2
8    256      4      1      1
9    512      1      2      2

Example 4.1: Represent the number y = (1010 0100)two = (164)ten in RNS(8 | 7 | 5 | 3)

The mod-8 residue is easy to find:
$x_3 = \langle y \rangle_8 = (100)_{two} = 4$

We have $y = 2^7 + 2^5 + 2^2$; thus
$x_2 = \langle y \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = 3$
$x_1 = \langle y \rangle_5 = \langle 3 + 2 + 4 \rangle_5 = 4$
$x_0 = \langle y \rangle_3 = \langle 2 + 2 + 1 \rangle_3 = 2$
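The table-lookup conversion of Example 4.1 can be sketched in Python. The table is built with the built-in three-argument `pow`; the names `POWER_RESIDUES` and `binary_to_rns` are illustrative, and the 10-bit width is an assumption taken from the size of Table 4.1. For uniformity the sketch also uses the table for the mod-8 channel, although reading the three low-order bits directly, as the slide does, gives the same residue.

```python
MODULI = (8, 7, 5, 3)
WIDTH = 10   # Table 4.1 covers the first 10 powers of 2

# POWER_RESIDUES[m][i] is 2**i mod m
POWER_RESIDUES = {m: [pow(2, i, m) for i in range(WIDTH)] for m in MODULI}

def binary_to_rns(y):
    """RNS(8|7|5|3) residues of a nonnegative integer y < 2**WIDTH."""
    residues = []
    for m in MODULI:
        # Add the stored residues of the powers of 2 that are present in y.
        total = sum(POWER_RESIDUES[m][i] for i in range(WIDTH) if (y >> i) & 1)
        residues.append(total % m)
    return tuple(residues)

print(binary_to_rns(0b10100100))   # y = 164
```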


Conversion from RNS to Binary/Decimal

Theorem 4.1 (The Chinese remainder theorem)

$x = (x_{k-1} | \ldots | x_2 | x_1 | x_0)_{RNS} = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M$

where $M_i = M/m_i$ and $\alpha_i = \langle M_i^{-1} \rangle_{m_i}$ (multiplicative inverse of $M_i$ wrt $m_i$)

Implementing CRT-based RNS-to-binary conversion:

$x = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M = \left\langle \sum_i f_i(x_i) \right\rangle_M$

We can use a table to store the $f_i$ values -- $\sum_i m_i$ entries

Table 4.2 Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3)

i    $m_i$    $x_i$    $\langle M_i \langle \alpha_i x_i \rangle_{m_i} \rangle_M$
3    8        0        0
              1        105
              2        210
              3        315
              .        .
              .        .


Intuitive Justification for CRT

Puzzle: What number has the remainders of 2, 3, and 2 when divided by the numbers 7, 5, and 3, respectively?

x = (2 | 3 | 2)RNS(7|5|3) = (?)ten

(1 | 0 | 0)RNS(7|5|3) = multiple of 15 that is 1 mod 7 = 15
(0 | 1 | 0)RNS(7|5|3) = multiple of 21 that is 1 mod 5 = 21
(0 | 0 | 1)RNS(7|5|3) = multiple of 35 that is 1 mod 3 = 70

(2 | 3 | 2)RNS(7|5|3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2)
= 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1)
= 2 × 15 + 3 × 21 + 2 × 70 = 30 + 63 + 140
= 233 = 23 mod 105

Therefore, x = (23)ten
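The weights used in the puzzle can be computed mechanically. The following Python sketch assumes Python 3.8 or later (for `math.prod` and for `pow` with a negative exponent and a modulus); the function names are illustrative. It multiplies each residue by a precomputed weight $\langle M_i \alpha_i \rangle_M$ and reduces the sum mod M, which is equivalent to the formula of Theorem 4.1.

```python
from math import prod

def crt_weights(moduli):
    """Weight of each position: the multiple of M/m_i that is 1 mod m_i."""
    M = prod(moduli)
    weights = []
    for m in moduli:
        Mi = M // m
        alpha = pow(Mi, -1, m)          # multiplicative inverse of Mi modulo m
        weights.append(Mi * alpha % M)
    return weights

def rns_to_int(residues, moduli):
    """Decode residues to the representative in [0, M - 1]."""
    M = prod(moduli)
    return sum(w * x for w, x in zip(crt_weights(moduli), residues)) % M

print(crt_weights((7, 5, 3)))
print(rns_to_int((2, 3, 2), (7, 5, 3)))
```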


Difficult RNS Arithmetic Operations

• Sign test
• Magnitude comparison
• Division

• Could convert back and forth to/from binary.
• Another approach: convert to a mixed-radix system, as numbers in a mixed-radix system are comparable.


Difficult RNS Arithmetic Operations

Example: Of the following RNS(8 | 7 | 5 | 3) numbers:
• Which, if any, are negative?
• Which is the largest?
• Which is the smallest?

Assume a range of [–420, 419]
a = (0 | 1 | 3 | 2)RNS
b = (0 | 1 | 4 | 1)RNS
c = (0 | 6 | 2 | 1)RNS
d = (2 | 0 | 0 | 2)RNS
e = (5 | 0 | 1 | 0)RNS
f = (7 | 6 | 4 | 2)RNS

Answers:
d < c < f < a < e < b
–70 < –8 < –1 < 8 < 21 < 64


General RNS Division

General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions

Scheme proposed in 1994 PhD thesis of Ching-Yu Hung (UCSB):
Use an algorithm that has built-in tolerance to imprecision, and apply the approximate CRT decoding to choose quotient digits

Example -- SRT algorithm (s is the partial remainder)

s < 0   quotient digit = –1
s ≅ 0   quotient digit = 0
s > 0   quotient digit = 1

The BSD quotient can be converted to RNS on the fly


Limits of Fast Arithmetic in RNS

Known results from number theory:

Theorem 4.2: The ith prime $p_i$ is asymptotically i ln i

Theorem 4.3: The number of primes in [1, n] is asymptotically n / ln n

Theorem 4.4: The product of all primes in [1, n] is asymptotically $e^n$

Implications to speed of arithmetic in RNS:

Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits

That is, with fast log-time adders, addition needs O(log log k) time


Hardware Implementation for RNS Representations


[Figure: Operand 1 and Operand 2 feed four independent units (Mod-8, Mod-7, Mod-5, Mod-3), with widths of 3, 3, 3, and 2 bits, that together produce the Result]


Addition/Subtraction


Most slides originate from the textbook author’s PowerPoint presentation files.


II Addition / Subtraction

Review addition schemes and various speedup methods
• Addition is a key op (in itself, and as a building block)
• Subtraction = negation + addition
• Carry propagation speedup: lookahead, skip, select, …
• Two-operand versus multioperand addition

Topics in This Part
Chapter 5 Basic Addition and Counting
Chapter 6 Carry-Lookahead Adders
Chapter 7 Variations in Fast Adders
Chapter 8 Multioperand Addition


Basic Addition and Counting

Chapter Goals
• Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders

Chapter Highlights
• Full adders are versatile building blocks
• Longest carry chain on average: $\log_2 k$ bits
• Fast asynchronous adders are simple
• Counting is relatively easy to speed up


HA and FA Adders

Half-adder (HA): Truth table and block diagram

Inputs    Outputs
x  y      c  s
----------------
0  0      0  0
0  1      0  1
1  0      0  1
1  1      1  0

[Figure: HA block with inputs x, y and outputs c, s]

Full-adder (FA): Truth table and block diagram

Inputs          Outputs
x  y  c_in      c_out  s
----------------------
0  0  0         0      0
0  0  1         0      1
0  1  0         0      1
0  1  1         1      0
1  0  0         0      1
1  0  1         1      0
1  1  0         1      0
1  1  1         1      1

[Figure: FA block with inputs x, y, c_in and outputs c_out, s]


Half-Adder Implementations

[Figure: three gate-level half-adder circuits with inputs x, y and outputs c, s]
(a) AND/XOR half-adder.
(b) NOR-gate half-adder.
(c) NAND-gate half-adder with complemented carry.


Some Full-Adder Details

Logic equations for a full-adder:

$s = x \oplus y \oplus c_{in}$   (odd parity function)
$\;\; = x\,y\,c_{in} \lor x'y'c_{in} \lor x'y\,c_{in}' \lor x\,y'c_{in}'$

$c_{out} = x\,y \lor x\,c_{in} \lor y\,c_{in}$   (majority function)


Full-Adder Implementations

[Figure: three full-adder circuits with inputs x, y, $c_{in}$ and outputs s, $c_{out}$]
(a) Built of half-adders.
(b) Built as an AND-OR circuit.
(c) Suitable for CMOS realization (mux-based).


Bit Serial Adder and Ripple Adder

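[Figure: (a) Bit-serial adder: one FA whose carry output is held in a carry flip-flop (FF); operands x and y and the sum s move through shift registers under a common clock, with $x_i$, $y_i$, $c_i$ producing $s_i$ and $c_{i+1}$. (b) Ripple-carry adder: a chain of FAs for positions 0 through 31, with $c_0 = c_{in}$ and $c_{32} = c_{out}$ (also $s_{32}$).]

(a) Bit-serial adder.
(b) Ripple-carry adder.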


Critical Path Through a Ripple-Carry Adder

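[Figure: chain of k full adders (positions 0 to k–1) with $c_0 = c_{in}$ and $c_k = c_{out}$ (also $s_k$); the critical path runs from $x_0$, $y_0$ through all carry signals to $s_{k-1}$]

Critical path in a k-bit ripple-carry adder.

$T_{\text{ripple-add}} = T_{FA}(x, y \to c_{out}) + (k - 2) \times T_{FA}(c_{in} \to c_{out}) + T_{FA}(c_{in} \to s)$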
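A bit-level model of the ripple-carry adder makes the serial carry dependence visible: each loop iteration needs the carry from the previous one. This is a behavioral Python sketch with illustrative names, not an HDL description and not a timing model.

```python
def full_adder(x, y, c_in):
    """One-bit full adder: sum is the odd parity, carry is the majority."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out

def ripple_carry_add(x_bits, y_bits, c_in=0):
    """Add two equal-length bit lists (least significant bit first)."""
    carry = c_in
    sum_bits = []
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y, carry)   # carry ripples to the next position
        sum_bits.append(s)
    return sum_bits, carry

def to_bits(value, k):
    return [(value >> i) & 1 for i in range(k)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

bits, c_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(bits), c_out)   # low 4 bits of 11 + 6, and the carry-out
```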


Conditions and Exceptions

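$\text{overflow}_{\text{2's-compl}} = x_{k-1}\, y_{k-1}\, s_{k-1}' \lor x_{k-1}'\, y_{k-1}'\, s_{k-1}$

$\text{overflow}_{\text{2's-compl}} = c_k \oplus c_{k-1} = c_k\, c_{k-1}' \lor c_k'\, c_{k-1}$

Overflow occurs when two numbers of like sign are added and a result of the opposite sign is produced.

[Figure: k-bit ripple-carry adder (FA cells for positions 0 to k–1, with $c_0 = c_{in}$ and $c_k = c_{out}$) and additional logic producing the Overflow, Negative, and Zero flags]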
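The two overflow expressions can be compared exhaustively for a small word width. This Python sketch uses illustrative names and obtains $c_{k-1}$ by adding only the low k–1 bits, which is a modeling shortcut rather than how a circuit would tap the carry.

```python
def add_with_flags(x, y, k):
    """Add two k-bit 2's-complement patterns.

    Returns (sum_pattern, overflow_from_signs, overflow_from_carries).
    """
    mask = (1 << k) - 1
    low_mask = mask >> 1                                   # bits 0 .. k-2
    c_km1 = ((x & low_mask) + (y & low_mask)) >> (k - 1)   # carry into position k-1
    total = x + y
    c_k = total >> k                                       # carry out of position k-1
    s = total & mask
    xs, ys, ss = x >> (k - 1), y >> (k - 1), s >> (k - 1)  # sign bits
    ov_signs = (xs & ys & (1 - ss)) | ((1 - xs) & (1 - ys) & ss)
    ov_carries = c_k ^ c_km1
    return s, ov_signs, ov_carries

def signed(pattern, k):
    return pattern - (1 << k) if pattern >> (k - 1) else pattern

print(add_with_flags(0b0111, 0b0001, 4))   # 7 + 1 does not fit in 4 bits
```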


Binary Adders as Versatile Building Blocks (1/2)

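Fig. 5.6 Four-bit binary adder used to realize the logic function f = w + xyz and its complement.

[Figure: four FA cells (Bit 0 to Bit 3) with constant inputs 0 and 1 and variable inputs w, x, y, z; starting from $c_0$ = 0, the carries form xy, then xyz, then w ∨ xyz, and the last sum output gives (w ∨ xyz)′]

For a full adder (FA) with inputs x, y, $c_{in}$ and outputs s, $c_{out}$:
• Set one input to 0: $c_{out}$ = AND of other inputs
• Set one input to 1: $c_{out}$ = OR of other inputs
• Set one input to 0 and another to 1: s = NOT of third input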

Binary Adders as Versatile Building Blocks (2/2)

Inputs          Outputs
x  y  c_in      c_out  s
----------------------
0  0  0         0      0
0  0  1         0      1
0  1  0         0      1
0  1  1         1      0
1  0  0         0      1
1  0  1         1      0
1  1  0         1      0
1  1  1         1      1

[Figure: FA block with inputs x, y, c_in and outputs c_out, s]

Example of Carry Propagation

Bit positions
15 14 13 12   11 10  9  8    7  6  5  4    3  2  1  0
-----------   -----------   -----------   -----------
 1  0  1  1    0  1  1  0    0  1  1  0    1  1  1  0
 0  1  0  1    1  0  0  1    1  1  0  0    0  0  1  1
($c_{out}$ at the left end, $c_{in}$ at the right end)

Carry chains and their lengths, from left to right: 4, 6, 3, 2


Using Probability to Analyze Carry Propagation

Given binary numbers with random bits, for each position i we have:
Probability of carry generation = ¼ (both 1s)
Probability of carry annihilation = ¼ (both 0s)
Probability of carry propagation = ½ (different)

Probability that carry generated at position i propagates through position j – 1 and stops at position j (j > i):

$2^{-(j-1-i)} \times 1/2 = 2^{-(j-i)}$

Expected length of the carry chain that starts at position i:

$\sum_{j=i+1}^{k-1} (j-i)\,2^{-(j-i)} + (k-i)\,2^{-(k-1-i)} = \sum_{l=1}^{k-1-i} l\,2^{-l} + (k-i)\,2^{-(k-1-i)}$

$= 2 - (k-i+1)\,2^{-(k-1-i)} + (k-i)\,2^{-(k-1-i)} = 2 - 2^{-(k-i-1)}$

Because the carry definitely stops at position k, the term for k is not multiplied by ½.
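The closed form $2 - 2^{-(k-i-1)}$ can be checked against the defining sum with exact fractions. This is a small Python sketch; the function name is illustrative.

```python
from fractions import Fraction

def expected_chain_length(i, k):
    """Expected length of a carry chain that starts at position i of a k-bit adder."""
    total = Fraction(0)
    for j in range(i + 1, k):
        # carry propagates through j-1 and stops at j with probability 2**-(j-i)
        total += (j - i) * Fraction(1, 2 ** (j - i))
    # carry reaches position k, where it always stops
    total += (k - i) * Fraction(1, 2 ** (k - 1 - i))
    return total

print(expected_chain_length(0, 8))
```

The expected length stays below 2 for every starting position, which is why the average carry chain is short even though the worst case spans the whole word.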


Carry Completion Detection

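[Figure: carry-completion-detection adder. Each position i has a pair of signals $b_i$ and $c_i$ ($b_i$ = 1: no carry; $c_i$ = 1: carry), formed from $x_i$, $y_i$ and the incoming pair; $b_0$ and $c_0$ are derived from $c_{in}$, and $c_k = c_{out}$. The per-position signals $d_{i+1}$, together with those from other bit positions, are combined into alldone.]

Dual rail coding:
$b_i$  $c_i$
0    0     Carry not yet known
0    1     Carry known to be 1
1    0     Carry known to be 0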


Self-Timed Adder


Self-Timed Adder with Parallel carry Completion Sensing


Addition of a Constant: Counters

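[Figure: counter built from a count register holding x, an incrementer (decrementer) that adds +1 (−1) to form x + 1 (x − 1), and a mux that selects between Data in and the incremented value under the Count/Initialize control. Register inputs: Load, Reset, Clear, Enable, Clock. Outputs: Data out, and the incrementer’s $c_{out}$ as Counter overflow.]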


Implementing a Simple Up Counter

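[Figure: four T flip-flops in a chain; the Increment input drives the first stage and each Q output clocks the next, giving Count Output bits 0 to 3]
Four-bit asynchronous up counter built only of negative-edge-triggered T flip-flops.

[Figure: chain of stages for positions 0 to k–1 with inputs $x_0$ to $x_{k-1}$, carries $c_1$ to $c_k$, and outputs $s_0$ to $s_{k-1}$]
Ripple-carry incrementer for use in an up counter.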


Manchester Carry Chains and Adders

Sum digit in radix r:      $s_i = (x_i + y_i + c_i) \bmod r$
Special case of radix 2:   $s_i = x_i \oplus y_i \oplus c_i$

Computing the carries $c_i$ is thus our central problem. For this, the actual operand digits are not important. What matters is whether in a given position a carry is generated, propagated, or annihilated (absorbed).

For binary addition:
$g_i = x_i y_i$     $p_i = x_i \oplus y_i$     $a_i = x_i' y_i' = (x_i \lor y_i)'$

It is also helpful to define a transfer signal:
$t_i = g_i \lor p_i = a_i' = x_i \lor y_i$

Using these signals, the carry recurrence is written as
$c_{i+1} = g_i \lor c_i p_i = g_i \lor c_i g_i \lor c_i p_i = g_i \lor c_i t_i$

Manchester Carry Network

p

g

a

Logic 1

Logic 0

c

c

i+1

i

i

i

i

0

1

0

1

0 1

(a) Conceptual representation

c'i+1 ic'

Clock

ip

VDD

VSS

ig

(b) Possible CMOS realization.

The worst-case delay of a Manchester carry chain has three components:

1. Latency of forming the switch control signals
2. Set-up time for switches
3. Signal propagation delay through k switches

gi = xi yi pi = xi⊕ yi

ci+1 = gi∨ ci pi


Carry Network is the Essence of a Fast Adder

The main part of an adder is the carry network. The rest is just a set of gates to produce the g and p signals and the sum bits.

[Figure: structure of an adder. Each position i forms g_i and p_i from x_i and y_i. The carry network takes (g_0, p_0), ..., (g_{k−1}, p_{k−1}) and c_0 and produces c_1, ..., c_k. The sum bit s_i is formed from p_i and c_i.]

g_i  p_i  Carry is:
0    0    annihilated or killed
0    1    propagated
1    0    generated
1    1    (impossible)

g_i = x_i y_i    p_i = x_i ⊕ y_i

Carry network types: ripple; skip; lookahead; parallel-prefix


Carry Propagation Network of a Ripple-Carry Adder

[Figure: ripple-carry network. Stage i combines g_i, p_i, and c_i into c_{i+1}, for i = 0, 1, ..., k−2, k−1, producing c_1, c_2, ..., c_{k−1}, c_k from c_0.]

The carry recurrence: c_{i+1} = g_i ∨ p_i c_i

Latency of k-bit adder is roughly 2k gate delays:

1 gate delay for production of p and g signals, plus
2(k – 1) gate delays for carry propagation, plus
1 XOR gate delay for generation of the sum bits


Carry-Lookahead Adders

Chapter Goals
Understand the carry-lookahead method and its many variations used in the design of fast adders

Chapter Highlights
Single- and multilevel carry lookahead
Various designs for log-time adders
Relating the carry determination problem to parallel prefix computation
Implementing fast adders in VLSI


Unrolling the Carry Recurrence

Recall the generate, propagate, annihilate (absorb), and transfer signals:

Signal  Radix r                        Binary
g_i     is 1 iff x_i + y_i ≥ r         x_i y_i
p_i     is 1 iff x_i + y_i = r – 1     x_i ⊕ y_i
a_i     is 1 iff x_i + y_i < r – 1     x_i′ y_i′ = (x_i ∨ y_i)′
t_i     is 1 iff x_i + y_i ≥ r – 1     x_i ∨ y_i
s_i     (x_i + y_i + c_i) mod r        x_i ⊕ y_i ⊕ c_i

The carry recurrence can be unrolled to obtain each carry signal directly from inputs, rather than through propagation:

c_i = g_{i–1} ∨ c_{i–1} p_{i–1}
    = g_{i–1} ∨ (g_{i–2} ∨ c_{i–2} p_{i–2}) p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ c_{i–2} p_{i–2} p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ g_{i–3} p_{i–2} p_{i–1} ∨ c_{i–3} p_{i–3} p_{i–2} p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ g_{i–3} p_{i–2} p_{i–1} ∨ g_{i–4} p_{i–3} p_{i–2} p_{i–1} ∨ c_{i–4} p_{i–4} p_{i–3} p_{i–2} p_{i–1}
    = . . .

Each p_j can be replaced with t_j.
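As a sanity check on the unrolling, the sketch below evaluates the fully unrolled expression for c_i and compares it with the step-by-step recurrence. The function names and the list-of-bits representation are assumptions made for illustration.

```python
def recurrence_carry(g, p, c0, i):
    """c_i obtained by iterating c_{j+1} = g_j OR c_j p_j."""
    c = c0
    for j in range(i):
        c = g[j] | (c & p[j])
    return c


def unrolled_carry(g, p, c0, i):
    """c_i as the OR of product terms from the fully unrolled recurrence."""
    # Term contributed by c_0: c_0 p_0 p_1 ... p_{i-1}
    result = c0
    for m in range(i):
        result &= p[m]
    # Terms contributed by each g_j: g_j p_{j+1} ... p_{i-1}
    for j in range(i):
        term = g[j]
        for m in range(j + 1, i):
            term &= p[m]
        result |= term
    return result
```

Both functions take `g` and `p` as lists indexed by bit position, so `unrolled_carry(g, p, c0, 4)` corresponds to the five-term expression for c_4 in a four-bit lookahead adder.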


Four-Bit Carry-Lookahead Adder (1/2)

Complexity reduced by deriving the carry-out indirectly: c_4 = g_3 ∨ c_3 p_3

[Figure: four-bit lookahead carry network with inputs g_0 ... g_3, p_0 ... p_3, and c_0, and outputs c_1, c_2, c_3, c_4.]

Full carry lookahead is quite practical for a 4-bit adder

c_1 = g_0 ∨ c_0 p_0
c_2 = g_1 ∨ g_0 p_1 ∨ c_0 p_0 p_1
c_3 = g_2 ∨ g_1 p_2 ∨ g_0 p_1 p_2 ∨ c_0 p_0 p_1 p_2
c_4 = g_3 ∨ g_2 p_3 ∨ g_1 p_2 p_3 ∨ g_0 p_1 p_2 p_3 ∨ c_0 p_0 p_1 p_2 p_3


Four-Bit Carry-Lookahead Adder (2/2)

Source: Ercegovac and Lang, “Digital Arithmetic,” MKP


Carry Lookahead Beyond 4 Bits

Consider a 32-bit adder

c_1 = g_0 ∨ c_0 p_0
c_2 = g_1 ∨ g_0 p_1 ∨ c_0 p_0 p_1
c_3 = g_2 ∨ g_1 p_2 ∨ g_0 p_1 p_2 ∨ c_0 p_0 p_1 p_2
. . .
c_31 = g_30 ∨ g_29 p_30 ∨ g_28 p_29 p_30 ∨ g_27 p_28 p_29 p_30 ∨ . . . ∨ c_0 p_0 p_1 p_2 p_3 ... p_29 p_30

[Annotations: the last product term needs a 32-input AND; combining all the terms needs a 32-input OR.]

High fan-ins necessitate tree-structured circuits

For wide words, full carry lookahead is impractical.


Two Schemes to Manage the Complexity

High-radix addition (i.e., radix 2^h)
Increases the latency for generating g and p signals and sum digits, but simplifies the carry network (optimal radix?)

Multilevel lookahead

Example: 16-bit addition
Radix-16 (four digits)
Two-level carry lookahead (four 4-bit blocks)

Either way, the carries c_4, c_8, and c_12 are determined first

c_16 c_15 c_14 c_13 c_12 c_11 c_10 c_9 c_8 c_7 c_6 c_5 c_4 c_3 c_2 c_1 c_0
(c_16 = c_out and c_0 = c_in; c_12, c_8, and c_4 are the unknowns marked "?")


One-Level Carry-Lookahead Adder

Source: Ercegovac and Lang, “Digital Arithmetic”, p. 72.


Block Generate and Propagate Signals

g[i,i+3] = g_{i+3} ∨ g_{i+2} p_{i+3} ∨ g_{i+1} p_{i+2} p_{i+3} ∨ g_i p_{i+1} p_{i+2} p_{i+3}

p[i,i+3] = p_i p_{i+1} p_{i+2} p_{i+3}

[Figure: 4-bit lookahead carry generator with inputs c_i and (g_i, p_i), (g_{i+1}, p_{i+1}), (g_{i+2}, p_{i+2}), (g_{i+3}, p_{i+3}), and outputs c_{i+1}, c_{i+2}, c_{i+3}, g[i,i+3], p[i,i+3].]

Note: g[i,i+3] and p[i,i+3] are unrelated to c_i

c_k = g[0,k−1] ∨ c_0 p[0,k−1]

c_{i+4} = g[i,i+3] ∨ c_i p[i,i+3]
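A small sketch of how block signals yield the block carries c_4, c_8, ... without visiting every bit carry. The helper names are hypothetical, and the loop models only the logic equations (it says nothing about gate-level timing).

```python
def bit_signals(x, y, k):
    """Per-bit generate and propagate lists for k-bit operands."""
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(k)]
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(k)]
    return g, p


def block_gp(g, p, i, width=4):
    """Block signals g[i,i+width-1] and p[i,i+width-1]."""
    bg, bp = 0, 1
    for j in range(i, i + width):
        bg = g[j] | (bg & p[j])   # extend the block by one position
        bp &= p[j]
    return bg, bp


def block_carries(x, y, k, c0=0, width=4):
    """Carries at block boundaries: c_{i+width} = g_block OR c_i p_block."""
    g, p = bit_signals(x, y, k)
    carries = {0: c0}
    for i in range(0, k, width):
        bg, bp = block_gp(g, p, i, width)
        carries[i + width] = bg | (carries[i] & bp)
    return carries
```

The returned dictionary maps a bit position to the carry into that position, so for a 16-bit addition it holds c_0, c_4, c_8, c_12, and c_16.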


4-bit Lookahead Carry Generator

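[Figure: gate-level 4-bit lookahead carry generator. Inputs: g_i, g_{i+1}, g_{i+2}, g_{i+3}, p_i, p_{i+1}, p_{i+2}, p_{i+3}, and c_i. One part produces the intermediate carries c_{i+1}, c_{i+2}, c_{i+3}; the other performs block signal generation, producing g[i,i+3] and p[i,i+3].]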


A Two-Level Carry-Lookahead Adder (64 bits)

[Figure: building a 64-bit carry-lookahead adder. Inside each 16-bit carry-lookahead adder, a 4-bit lookahead carry generator takes (g, p) for blocks [0,3], [4,7], [8,11], [12,15] together with c_0 and produces c_4, c_8, c_12 and the block signals g[0,15], p[0,15]. A second-level 4-bit lookahead carry generator takes (g, p) for [0,15], [16,31], [32,47], [48,63] and produces c_16, c_32, c_48 and g[0,63], p[0,63].]

c_4, c_8, and c_12 are the c_{i+1}, c_{i+2}, and c_{i+3}, respectively, of the previous slide.

c_k = g[0,k−1] ∨ c_0 p[0,k−1]


Latency of a 16-bit 2-Level Carry-Lookahead Adder (1/2)

(Level 1) g and p for individual bit positions: 1 gate level

(Level 1) g and p signals for 4-bit blocks, i.e. g[0,3], p[0,3], ..., g[12,15], p[12,15]: 2 gate levels

(Level 2) Block carry-in signals c_4, c_8, and c_12, and g[0,15], p[0,15]: 2 gate levels

(Level 1) Internal carries within 4-bit blocks, c_1, c_2, c_3, c_5, ...: 2 gate levels
(Level 2) C15 if required

(Level 1) Sum bits (XOR): 2 gate levels???


Latency of a 16-bit 2-Level Carry-Lookahead Adder (2/2)

Total latency for the 16-bit adder is 9 gate levels
Each additional lookahead level adds 4 gate levels of latency (yellow block in last slide)

Latency for k-bit CLA adder: 4 log_4 k + 1 gate levels


Combining of g and p Signals

Combining of g and p signals of two (contiguous or overlapping) blocks B′ and B″ of arbitrary widths into the g and p signals for block B.

[Figure: blocks B′ (signals g′, p′) and B″ (signals g″, p″), with end positions labeled i_0, i_1 and j_0, j_1, merged into block B (signals g, p) by the carry operator ¢.]

(g, p) = (g′, p′) ¢ (g″, p″)

g = g″ + g′p″    p = p′p″


Formulating the Prefix Computation Problem

The problem of carry determination can be formulated as:

Given (g_0, p_0) (g_1, p_1) . . . (g_{k–2}, p_{k–2}) (g_{k–1}, p_{k–1})
Find (g[0,0], p[0,0]) (g[0,1], p[0,1]) . . . (g[0,k–2], p[0,k–2]) (g[0,k–1], p[0,k–1])
whose g components are c_1, c_2, . . . , c_{k–1}, c_k

Carry-in can be viewed as an extra (−1) position: (g_{–1}, p_{–1}) = (c_in, 0)

The desired pairs are found by evaluating all prefixes of
(g_0, p_0) ¢ (g_1, p_1) ¢ . . . ¢ (g_{k–2}, p_{k–2}) ¢ (g_{k–1}, p_{k–1})

The carry operator ¢ is associative, but not commutative:
[(g_1, p_1) ¢ (g_2, p_2)] ¢ (g_3, p_3) = (g_1, p_1) ¢ [(g_2, p_2) ¢ (g_3, p_3)]

Prefix sums analogy:
Given x_0 x_1 x_2 . . . x_{k–1}
Find x_0, x_0+x_1, x_0+x_1+x_2, . . . , x_0+x_1+...+x_{k–1}
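The prefix formulation maps directly onto a running scan. The sketch below is one possible rendering: `carry_op` follows g = g″ + g′p″, p = p′p″ with the less significant block as the first argument, and `prefix_carries` is a hypothetical helper that treats the carry-in as position −1.

```python
from itertools import accumulate


def carry_op(lower, upper):
    """(g', p') ¢ (g'', p''): lower is the less significant block."""
    g1, p1 = lower
    g2, p2 = upper
    return (g2 | (g1 & p2), p1 & p2)


def prefix_carries(x, y, k, c_in=0):
    """Return [c_1, ..., c_k] as the g components of all prefixes."""
    pairs = [(c_in, 0)]                      # position -1 holds (c_in, 0)
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        pairs.append((xi & yi, xi ^ yi))     # (g_i, p_i)
    prefixes = list(accumulate(pairs, carry_op))
    return [g for g, _ in prefixes[1:]]
```

A sequential scan is used here only to show the algebra; because the operator is associative, the same prefixes can be computed by any of the parallel networks discussed in the following slides.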


Prefix-Based Carry Network

[Figure: a four-input prefix sums network built of adders (+), shown next to the four-bit carry lookahead network built of carry operators (¢); both are evaluated in scan order.]

Inputs: (g_0, p_0), (g_1, p_1), (g_2, p_2), (g_3, p_3)

Outputs:
(g[0,0], p[0,0]) = (c_1, --)
(g[0,1], p[0,1]) = (c_2, --)
(g[0,2], p[0,2]) = (c_3, --)
(g[0,3], p[0,3]) = (c_4, --)


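Parallel Prefix Sums Network Built of Two k/2-Input Networks and k/2 Adders (Ladner-Fischer)

Delay recurrence D(k) = D(k/2) + 1 = log_2 k
Cost recurrence C(k) = 2C(k/2) + k/2 = (k/2) log_2 k
Incurs large fanout

[Figure: recursive dividing. Inputs x_{k–1} ... x_{k/2} feed one k/2-input prefix sums network and inputs x_{k/2–1} ... x_0 feed another. The outputs s_{k/2–1} ... s_0 come directly from the lower network; k/2 adders (+) combine the lower network's top output with the upper network's outputs to give s_{k–1} ... s_{k/2}.]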


a is t in the textbook

Source: Ercegovac and Lang, “Digital Arithmetic”, p. 81


Eliminate Large Fanout

Increase the number of levelsIncrease the number of cells


The Brent-Kung Recursive Construction

Delay recurrence D(k) = D(k/2) + 2 = 2 log_2 k – 1 (–2 really)
Cost recurrence C(k) = C(k/2) + k – 1 = 2k – 2 – log_2 k

Parallel prefix sums network built of one k/2-input network and k – 1 adders.
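The recursive construction can be sketched for any associative operator. In this illustration (names are assumptions, and k is assumed to be a power of 2) a one-element list `counter` tallies operator applications so that the cost recurrence can be checked.

```python
def brent_kung(xs, op, counter):
    """Prefix computation via one k/2-input subnetwork and k - 1 operators."""
    k = len(xs)
    if k == 1:
        return list(xs)
    # First row: combine adjacent pairs (k/2 operators)
    pairs = []
    for i in range(0, k, 2):
        pairs.append(op(xs[i], xs[i + 1]))
        counter[0] += 1
    # Recursive k/2-input network gives the prefixes ending at odd positions
    half = brent_kung(pairs, op, counter)
    out = [None] * k
    out[0] = xs[0]
    for i in range(k // 2):
        out[2 * i + 1] = half[i]
    # Last row: remaining even positions (k/2 - 1 operators)
    for i in range(1, k // 2):
        out[2 * i] = op(half[i - 1], xs[2 * i])
        counter[0] += 1
    return out
```

With `operator.add` as `op` this yields ordinary prefix sums; substituting a carry operator on (g, p) pairs would give the carries. For k = 16 the counter ends at 26, matching 2k – 2 – log_2 k.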

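[Figure: inputs x_0, x_1, x_2, x_3, ..., x_{k–2}, x_{k–1} are combined pairwise by adders (+); the pair sums feed a k/2-input prefix sums network; a final row of adders produces the remaining outputs among s_0, s_1, s_2, s_3, ..., s_{k–2}, s_{k–1}.]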


Brent-Kung Carry Network (8-Bit Adder)

[Figure: Brent-Kung carry network for an 8-bit adder. Inputs are the (g, p) pairs for [7,7], [6,6], ..., [0,0]. Intermediate block signals include [0,1], [2,3], [4,5], [6,7], [0,3], and [4,7]. Outputs are the pairs for [0,7], [0,6], ..., [0,0]. Each ¢ cell combines two blocks, e.g. (g[0,0], p[0,0]) and (g[1,1], p[1,1]) into (g[0,1], p[0,1]).]

Source: Ercegovac and Lang, “Digital Arithmetic”, p. 83


Brent-Kung Carry Network (16-Bit Adder)

[Figure: 16-input Brent-Kung network with inputs x_0 ... x_15, outputs s_0 ... s_15, and levels numbered 1 to 6.]

Reason for latency being 2 log_2 k – 2


Kogge-Stone Carry Network (16-Bit Adder)

[Figure: 16-input Kogge-Stone network with inputs x_0 ... x_15 and outputs s_0 ... s_15.]

log_2 k levels (minimum possible)

Cost formula
C(k) = (k – 1) + (k – 2) + (k – 4) + . . . + (k – k/2)
     = k log_2 k – k + 1
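The level structure behind this cost formula can be sketched as follows. The function name and return convention are illustrative assumptions; `op` is any associative operator and k is assumed to be a power of 2.

```python
def kogge_stone(xs, op):
    """Return (prefix results, number of operator cells, number of levels)."""
    k = len(xs)
    cur = list(xs)
    cells = levels = 0
    d = 1
    while d < k:
        nxt = list(cur)
        # At this level, position i combines with position i - d
        for i in range(d, k):
            nxt[i] = op(cur[i - d], cur[i])
            cells += 1
        cur = nxt
        d *= 2
        levels += 1
    return cur, cells, levels
```

For 16 inputs the levels use 15, 14, 12, and 8 cells, giving 49 cells in 4 levels.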


Source: Ercegovac and Lang, “Digital Arithmetic”, p. 84


Speed-Cost Tradeoffs in Carry Networks

Method          Delay            Cost
Ladner-Fischer  log_2 k          (k/2) log_2 k
Kogge-Stone     log_2 k          k log_2 k – k + 1
Brent-Kung      2 log_2 k – 2    2k – 2 – log_2 k

Improving the Ladner/Fischer design

[Figure: the Ladner-Fischer construction (two k/2-input prefix sums networks followed by k/2 adders), with some of the subnetwork outputs marked.]

These outputs can be produced one time unit later without increasing the overall latency

This strategy saves enough to make the overall cost linear (best possible)


Hybrid B-K/K-S Carry Network (16-Bit Adder)

[Figure: three 16-input networks with inputs x_0 ... x_15 and outputs s_0 ... s_15: a Brent-Kung network (levels 1 to 6), a Kogge-Stone network, and a hybrid network that combines Brent-Kung and Kogge-Stone stages.]

Network      Levels  Cells
Brent-Kung   6       26
Kogge-Stone  4       49
Hybrid       5       32


Four-Bit Manchester Carry Chains (Transistor Level)

[Figure: two transistor-level four-bit Manchester carry chains clocked by PH2. (a) Inputs g_0 ... g_3 and p_0 ... p_3, producing g[0,3] and p[0,3]. (b) A version that also produces g[0,1], p[0,1], g[0,2], and p[0,2].]


Variations in Fast Adders

Chapter Goals
Study alternatives to the carry-lookahead method for designing fast adders

Chapter Highlights
Many methods besides CLA are available (both competing and complementary)
Best design is technology-dependent (often hybrid rather than pure)
Knowledge of timing allows optimizations


Simple Carry-Skip Adders

[Figure: (a) Ripple-carry adder drawn as four 4-bit blocks of ripple-carry stages (bits 3 2 1 0 in each block), with carries c_0, c_4, c_8, c_12, c_16. (b) Simple carry-skip adder: the same 4-bit blocks, each with skip logic (2 gates) controlled by the block propagate signals p[0,3], p[4,7], p[8,11], p[12,15].]


Carry-Skip Adder Using MUX



Another View of Carry-Skip Addition

Street/freeway analogy for carry-skip adder.

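[Figure: one 4-bit block covering positions 4j to 4j+3, with signals (g_{4j}, p_{4j}) ... (g_{4j+3}, p_{4j+3}) and carries c_{4j}, c_{4j+1}, c_{4j+2}, c_{4j+3}, c_{4j+4}. The ripple path through the block is the one-way street; the skip path is the freeway.]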


Carry-Skip Adder with Fixed Block Size

Block width b; k/b blocks to form a k-bit adder (assume b divides k)

T_fixed-skip-add = (b – 1) + 0.5 + (k/b – 2) + (b – 1)
                   (in block 0) (OR gate) (skips) (in last block)
                 ≅ 2b + k/b – 3.5 stages

dT/db = 2 – k/b^2 = 0 ⇒ b_opt = √(k/2)

T_opt = 2√(2k) – 3.5

Example: k = 32, b_opt = 4, T_opt = 12.5 stages (contrast with 32 stages for a ripple-carry adder)

1 stage = 2 gate levels
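The delay model above is easy to evaluate numerically. The sketch below simply encodes the three formulas; the function names are made up for illustration.

```python
import math


def fixed_skip_delay(k, b):
    """Delay in stages: ripple in block 0, OR gate, skips, ripple in last block."""
    return (b - 1) + 0.5 + (k / b - 2) + (b - 1)


def optimal_fixed_block(k):
    """Block width minimizing the delay: b_opt = sqrt(k/2)."""
    return math.sqrt(k / 2)


def optimal_fixed_delay(k):
    """Delay at the optimum: T_opt = 2 sqrt(2k) - 3.5."""
    return 2 * math.sqrt(2 * k) - 3.5
```

For k = 32 these give a block width of 4 and a delay of 12.5 stages, and `fixed_skip_delay(32, 2)` or `fixed_skip_delay(32, 8)` shows how the delay grows on either side of the optimum.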


Worst Case Delay

Source: Ercegovac and Lang, “Digital Arithmetic”, pp. 67-68.

Worst case in block 0 (c_0 = 0):
   1111
 + 0001

Worst case in last block (c_12 = 1):
   0111
 + 0000


Carry-Skip Adder with Variable-Width Blocks (1/2)

[Figure: carry-skip adder with block widths b_{t–1}, b_{t–2}, ..., b_1, b_0, showing three carry paths made of ripple and skip segments: carry path (1), carry path (2), and carry path (3).]

Carry path (2) goes through one fewer skip than (1), so block t-2 can be one bit wider than block t-1 without increasing the total delay.

Carry path (3) goes through one fewer skip than (1), so block 1 can be one bit wider than block 0 without increasing the total delay.


Carry-Skip Adder with Variable-Width Blocks (2/2)

The total number of bits in the t blocks is k:

2[b + (b + 1) + . . . + (b + t/2 – 1)] = t(b + t/4 – 1/2) = k

b = k/t – t/4 + 1/2

T_var-skip-add = 2(b – 1) + 0.5 + t – 2 = 2k/t + t/2 – 2.5

dT/dt = –2k/t^2 + 1/2 = 0 ⇒ t_opt = 2√k

T_opt = 2√k – 2.5 (a factor of √2 smaller than for fixed-block)

Let b=1


Multilevel Carry-Skip Adders

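[Figure: multilevel carry-skip adders. A single-level scheme places level-1 skip units (S1) over the blocks between c_in and c_out; multilevel schemes add level-2 skip units (S2) spanning several S1 blocks.]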


Single-Level Carry-Skip Adder (Example 7.1)Assumptions: Each of the following takes one unit of time: generation of gi and pi, generation of level-i skip signal from level-(i–1) skip signals, ripple, skip, and formation of sum bit once the incoming carry is known

Build the widest possible one-level carry-skip adder with total delay of 8

[Figure: seven blocks b_6 ... b_0 between c_out and c_in, each with a level-1 skip unit (S1), annotated with signal arrival times.]

Stage b0 takes 2 time units: one for generating gp and the other for generating carry.

Stage b1 cannot be more than 3 bits, because its output is available at time 3, so it can take one time unit for generating gp and two for propagation across 2 bits.

At the right end, block width is limited by the output timing requirement.


Generalization of Example 7.1 for total time T (even or odd)

Block widths for even T: 1 2 3 . . . T/2 T/2 . . . 4 3 1
Block widths for odd T:  1 2 3 . . . (T + 1)/2 . . . 4 3 1

Thus, for any T, the total width is ⌊(T + 1)^2/4⌋ – 2

Stage b4 cannot be more than 3 bits, because its input becomes available at time 5 and the total adder delay is to be 8 units.

Max adder width = 18 (1 + 2 + 3 + 4 + 4 + 3 + 1)

At the left end, block width is limited by input timing.
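The closed form can be cross-checked against the width sequences listed above. In this sketch the function names are hypothetical, and the sequences are generated in the order written on the slide (rising to the peak, then falling through 3 and ending with 1); it assumes T ≥ 5 so that the pattern is well defined.

```python
def max_width(T):
    """Total width floor((T + 1)^2 / 4) - 2 for total delay T."""
    return (T + 1) ** 2 // 4 - 2


def block_widths(T):
    """Width sequence 1 2 3 ... peak ... 4 3 1 as listed for even or odd T."""
    peak = (T + 1) // 2
    rising = list(range(1, peak + 1))
    if T % 2 == 0:
        falling = list(range(peak, 2, -1))       # peak repeated for even T
    else:
        falling = list(range(peak - 1, 2, -1))   # single peak for odd T
    return rising + falling + [1]
```

For T = 8 this reproduces the widths 1, 2, 3, 4, 4, 3, 1 and the total of 18 bits from Example 7.1.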


Two-Level Carry-Skip Adder (1/2)

Example 7.2

Given the delay pair {β, α} for a level-2 block in Fig. 7.7a, the number of level-1 blocks that can be accommodated is γ = min(β – 1, α)

Width of the i-th level-1 block in the level-2 block characterized by {β, α} is b_i = min(β – γ + i + 1, α – i); the total block width is then Σ_{i=0}^{γ–1} b_i

[Figure: two single-level carry-skip adders built of S1 skip units, one with T_assimilate = α and one with T_produce = β, with their blocks and signal arrival times annotated.]


Two-Level Carry-Skip Adder (2/2)

Max adder width = 30 (4 + 8 + 8 + 6 + 3 + 1)

[Figure: (a) six level-2 blocks (A to F) with S2 skip units between c_in and c_out; the delay pairs {T_produce, T_assimilate} shown are {8, 1}, {7, 2}, {6, 3}, {5, 4}, {4, 5}, {3, 8}. A second panel shows the level-1 blocks inside blocks A to F, with timing running from t = 0 at c_in to t = 8 at c_out.]


Carry-Skip Adder Optimization Scheme

[Figure: a block of b full-adder units with its inputs and a level-h skip, annotated with the timing functions I(b), A(b), G(b), E_h(b), and S_h(b).]


Carry-Select Adders

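C_select-add(k) = 3C_add(k/2) + k/2 + 1

T_select-add(k) = T_add(k/2) + 1

[Figure: the low k/2 bits go through one k/2-bit adder with carry-in c_in, producing c_{k/2}. The high k/2 bits go through two k/2-bit adders, one with carry-in 0 and one with carry-in 1, each producing k/2 + 1 bits; a mux controlled by c_{k/2} selects the high k/2 sum bits and c_out.]

Carry-select adder for k-bit numbers built from three k/2-bit adders.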
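A behavioral sketch of the three-adder arrangement: the helper `plain_add` stands in for any k/2-bit adder, and both names are assumptions made for this example (k is assumed even).

```python
def plain_add(x, y, width, c_in):
    """Stand-in for a width-bit adder: returns (sum bits, carry-out)."""
    total = x + y + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, k, c_in=0):
    """k-bit addition using three k/2-bit adders and a mux."""
    h = k // 2
    mask = (1 << h) - 1
    # Low half: one adder driven by the real carry-in
    low_sum, c_half = plain_add(x & mask, y & mask, h, c_in)
    # High half: two adders, one per possible carry-in value
    hx, hy = (x >> h) & mask, (y >> h) & mask
    sum0, cout0 = plain_add(hx, hy, h, 0)
    sum1, cout1 = plain_add(hx, hy, h, 1)
    # Mux: c_{k/2} selects the precomputed high result
    high_sum, c_out = (sum1, cout1) if c_half else (sum0, cout0)
    return low_sum | (high_sum << h), c_out
```

The two high-half additions do not depend on the low half, which is why the delay is that of one k/2-bit adder plus one mux.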


Two-level Carry-Select Adder Built of k/4-bit adders

[Figure: two-level carry-select adder. The low k/4 bits (positions k/4 – 1 to 0) use one k/4-bit adder with carry-in c_in, producing c_{k/4}. The middle k/4 bits (positions k/2 – 1 to k/4) use k/4-bit adders with carry-in 0 and 1 and a mux, producing c_{k/2}. The high k/2 bits (positions k – 1 to k/2) use k/4-bit adders and muxes in the same way, with a final (k/2 + 1)-bit mux producing the high bits and c_out.]


Conditional Adder



Carry Select Adder



Conditional Sum Adder



16-Bit Conditional Sum Adder

The same as Fig. 7.20 in textbook
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 89


Conditional-Sum Adder

Multilevel carry-select idea carried out to the extreme (to 1-bit blocks).

C(k) ≅ 2C(k/2) + k + 2 ≅ k (log_2 k + 2) + k C(1)

T(k) = T(k/2) + 1 = log_2 k + T(1)

where C(1) and T(1) are the cost and delay of the following circuit for deriving the sum and carry bits with a carry-in of 0 and 1

[Figure: one-bit circuit with inputs x_i and y_i producing (c_{i+1}, s_i) for c_i = 0 and (c_{i+1}, s_i) for c_i = 1.]

k + 2 is an upper bound on number of single-bit 2-to-1 multiplexers needed for combining two k/2-bit adders into a k-bit adder


A Hybrid Carry-Lookahead/Carry-Select Adder

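The most popular hybrid addition scheme:

[Figure: a lookahead carry generator receives the block g, p signals and c_in; carry-select blocks compute results for carry-in 0 and 1, and muxes pick between them; c_out is also produced.]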


Summary



A Hybrid Ripple-Carry/Carry-Lookahead Design

Any Two Addition Schemes Can Be Combined

Other possibilities: hybrid carry-select/ripple-carry, hybrid ripple-carry/carry-select, . . .

[Figure: 16-bit carry-lookahead adders (with carry-out), each built around a 4-bit lookahead carry generator taking (g, p) for [0,3], [4,7], [8,11], [12,15], chained through the carries c_16, c_32, c_48.]


Optimizations in Fast Adders

What looks best at the block diagram or gate level may not be best when a circuit-level design is generated (effects of wire length, signal loading, . . . )

Modern practice: Optimization at the transistor level

Variable-block carry-lookahead adder

Optimizations for average or peak power consumption

Timing-based optimizations (next slide)


Multioperand Addition

Chapter Goals
Learn methods for speeding up the addition of several numbers (needed for multiplication or inner-product)

Chapter Highlights
Running total kept in redundant form
Current total + Next number → New total
Deferred carry assimilation
Wallace/Dadda trees and parallel counters


Some Applications of Multioperand Addition

[Figure: left, the four partial products x_0 a 2^0, x_1 a 2^1, x_2 a 2^2, x_3 a 2^3 of a 4-bit multiplication a × x, summed to the 8-bit product p; right, seven 6-bit operands p^(0) ... p^(6) summed to a 9-bit result s.]

Multioperand addition problems for multiplication or inner-product computation in dot notation.


Serial Implementation with One Adder

T_serial-multi-add = O(n log(k + log n))
                   = O(n log k + n log log n)

Therefore, addition time grows superlinearly with n when k is fixed and logarithmically with k for a given n

[Figure: a single adder adds each k-bit input x^(i) to the (k + log_2 n)-bit partial sum register holding Σ_{j=0}^{i–1} x^(j).]


Pipelined Adder



Parallel Implementation as Tree of Adders

Adding 7 numbers in a binary tree of adders.

[Figure: binary tree of six adders; operand widths grow from k at the inputs to k+1, k+2, and k+3 at successive levels.]

T_tree-fast-multi-add = O(log k + log(k + 1) + . . . + log(k + ⌈log_2 n⌉ – 1))
                      = O(log n log k + log n log log n)

T_tree-ripple-multi-add = O(k + log n) [Justified on the next slide]

⌈log_2 n⌉ adder levels, n – 1 adders


Elaboration on Tree of Ripple-Carry Adders

T_tree-ripple-multi-add = O(k + log n)

Fig. 8.5 Ripple-carry adders at levels i and i + 1 in the tree of adders used for multi-operand addition.

[Figure: HA and FA cells of the level-i and level-(i+1) ripple-carry adders, annotated with signal times t, t+1, t+2, and t+3.]

The absolute best latency that we can hope for is O(log k + log n)

There are kn data bits to process and using any set of computation elements with constant fan-in, this requires O(log(kn)) time

We will see shortly that carry-save adders achieve this optimum time


Carry-Save Adders

[Figure: a row of full adders (FA) drawn twice. Linked through their carries from c_in to c_out, they form a carry-propagate (ripple-carry) adder; cutting the carry links gives a carry-save adder.]

Carry-save adder (CSA) or (3; 2)-counter or 3-to-2 reduction circuit

Specifying full- and half-adder blocks, with their inputs and outputs, in dot notation.


Example of CSA

Also considered as reduction by column [3:2].

[p:q] counter: takes p bits of the same weight and produces q bits of adjacent weights.

Reduction by row: (3:2) counter


Use Dot Notation


[Figure: carry-save adder, i.e. (3; 2)-counter or 3-to-2 reduction circuit, in dot notation.]


Multioperand Addition Using Carry-Save Adders

Tree of carry-save adders reducing seven numbers to two.

[Figure: five CSAs arranged in a tree.]

T_carry-save-multi-add = O(tree height + T_CPA)
                       = O(log n + log k)

C_carry-save-multi-add = (n – 2)C_CSA + C_CPA
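A carry-save adder is three bitwise operations wide, which the following sketch makes concrete. Function names are illustrative; Python's unbounded integers stand in for sufficiently wide registers, and the final `+` plays the role of the carry-propagate adder.

```python
def csa(a, b, c):
    """3-to-2 reduction: returns (sum word, carry word) with a+b+c preserved."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move up one position
    return s, carry


def carry_save_sum(operands):
    """Reduce n >= 2 operands to two with CSAs, then do one ordinary addition.

    Returns (total, number of CSAs used).
    """
    ops = list(operands)
    used = 0
    while len(ops) > 2:
        s, carry = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, carry]   # three operands replaced by two
        used += 1
    return ops[0] + ops[1], used
```

Each CSA removes one operand from the list, so n operands need n – 2 CSAs; this loop does not try to reproduce the balanced tree shape that minimizes the number of levels.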


Serial carry-save addition using a single CSA.

[Figure: the input and the fed-back sum and carry registers enter a single CSA; a CPA combines the sum and carry registers to form the output.]


Reduction by a CSA Tree

Addition of seven 6-bit numbers in dot notation.

[Figure: dot-notation reduction in four levels using 12 FAs, 6 FAs, 6 FAs, and 4 FAs + 1 HA, followed by a 7-bit adder.]

Total cost = 7-bit adder + 28 FAs + 1 HA

Representing a seven-operand addition in tabular form.

8 7 6 5 4 3 2 1 0   Bit position
      7 7 7 7 7 7   6×2 = 12 FAs
    2 5 5 5 5 5 3   6 FAs
    3 4 4 4 4 4 1   6 FAs
  1 2 3 3 3 3 2 1   4 FAs + 1 HA
  2 2 2 2 2 1 2 1   7-bit adder
  --Carry-propagate adder--
1 1 1 1 1 1 1 1 1

A full-adder compacts 3 dots into 2 (compression ratio of 1.5)

A half-adder rearranges 2 dots (no compression, but still useful)


Width of Adders in a CSA Tree

Adding seven k-bit numbers and the CSA/CPA widths required.

Due to the gradual retirement (dropping out) of some of the result bits, CSA widths do not vary much as we go down the tree levels

The index pair [i, j] means that bit positions from i up to j are involved.

[Figure: tree of five k-bit CSAs followed by a k-bit CPA. The seven operands enter as [0, k–1]; CSA outputs appear as [0, k–1] sums and [1, k] carries, then [1, k], [2, k+1], and [1, k–1] at lower levels; the result occupies positions 0 to k+2.]

Bit k+1 does not involve addition


Wallace and Dadda Trees

h(n) = 1 + h(⌈2n/3⌉)

n(h) = ⌊3n(h – 1)/2⌋

2 × 1.5^(h–1) < n(h) ≤ 2 × 1.5^h

[Figure: an h-level CSA tree reducing n inputs to 2 outputs.]

Table 8.1 The maximum number n(h) of inputs for an h-level CSA tree

––––––––––––––––––––––––––––––––––––
 h   n(h)     h   n(h)     h   n(h)
––––––––––––––––––––––––––––––––––––
 0     2      7    28     14    474
 1     3      8    42     15    711
 2     4      9    63     16   1066
 3     6     10    94     17   1599
 4     9     11   141     18   2398
 5    13     12   211     19   3597
 6    19     13   316     20   5395
––––––––––––––––––––––––––––––––––––
n(h): Maximum number of inputs for h levels
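The table follows directly from the recurrence n(h) = ⌊3n(h – 1)/2⌋ with n(0) = 2. A minimal sketch (function names are assumptions):

```python
def max_inputs(h):
    """n(h): maximum number of inputs an h-level CSA tree can reduce to 2."""
    n = 2
    for _ in range(h):
        n = 3 * n // 2
    return n


def min_levels(n):
    """Smallest h with n(h) >= n, for n >= 2 operands."""
    h = 0
    while max_inputs(h) < n:
        h += 1
    return h
```

For instance, `min_levels(7)` is 4, which agrees with the four CSA levels used when adding seven numbers.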


Wallace and Dadda Reduction Trees

[Figure: adding seven 6-bit numbers using Dadda’s strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder.]

[Figure: addition of seven 6-bit numbers using Wallace’s strategy: 12 FAs, 6 FAs, 6 FAs, 4 FAs + 1 HA, then a 7-bit adder.]

Wallace tree: Reduce the number of operands at the earliest possible opportunity

Dadda tree: Postpone the reduction to the extent possible without causing added delay

h   n(h)
2   4
3   6
4   9
5   13
6   19


A Small Optimization in Reduction Trees

[Figure: adding seven 6-bit numbers using Dadda’s strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder.]

[Figure: the same addition taking advantage of the final adder’s carry-in: 6 FAs, 11 FAs, 6 FAs + 1 HA, 3 FAs + 2 HAs, then a 7-bit adder.]



Parallel Counters

A 10-input parallel counter, also known as a (10; 4)-counter.

[Figure: the counter built from FAs and HAs followed by a 3-bit ripple-carry adder; signals are labeled with their bit weights 0 to 3.]

1-bit full-adder = (3; 2)-counter

Circuit reducing 7 bits to their 3-bit sum = (7; 3)-counter

Circuit reducing n bits to their ⌈log_2(n + 1)⌉-bit sum = (n; ⌈log_2(n + 1)⌉)-counter
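As an illustration of building a larger counter from (3; 2)-counters, the sketch below wires four full adders into a (7; 3)-counter. This particular arrangement is one straightforward choice made for the example, not necessarily the circuit drawn on the slide.

```python
def full_adder(a, b, c):
    """(3; 2)-counter: returns (sum bit, carry bit) with a + b + c = s + 2*carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def counter_7_3(bits):
    """Count the 1s among seven input bits; returns (weight-4, weight-2, weight-1)."""
    b0, b1, b2, b3, b4, b5, b6 = bits
    s1, c1 = full_adder(b0, b1, b2)      # first three inputs
    s2, c2 = full_adder(b3, b4, b5)      # next three inputs
    s3, c3 = full_adder(s1, s2, b6)      # weight-1 bits and the last input
    s4, c4 = full_adder(c1, c2, c3)      # the three weight-2 carries
    return c4, s4, s3
```

The output triple read as a binary number equals the number of 1s among the inputs.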


Implementation of [4:2] Counter



Generalized Parallel Counters

[Figure: dot notation for a (5, 5; 4)-counter and the use of such counters for reducing five numbers to two numbers (multicolumn reduction).]

(2, 3; 3)-counter: unequal columns

Generalized parallel counter = Parallel compressor


A General Strategy for Column Compression

(n; 2)-counters

[Figure: one circuit slice for column i takes n inputs, receives ψ_1, ψ_2, ψ_3 transfer bits from columns i – 1, i – 2, i – 3, and sends ψ_1, ψ_2, ψ_3 transfer bits to columns i + 1, i + 2, i + 3.]

n + ψ_1 + ψ_2 + ψ_3 + . . . ≤ 3 + 2ψ_1 + 4ψ_2 + 8ψ_3 + . . .

n – 3 ≤ ψ_1 + 3ψ_2 + 7ψ_3 + . . .

Example: Design a bit-slice of an (11; 2)-counter
Solution: Let’s limit transfers to two stages. Then, 8 ≤ ψ_1 + 3ψ_2
Possible choices include ψ_1 = 5, ψ_2 = 1 or ψ_1 = ψ_2 = 2


Multiplication


Most slides originate from the textbook author’s PowerPoint presentation files.


III Multiplication

Review multiplication schemes and various speedup methods
• Multiplication is heavily used (in arith & array indexing)
• Division = reciprocation + multiplication
• Multiplication speedup: high-radix, tree, . . .
• Bit-serial, modular, and array multipliers

Topics in This Part
Chapter 9 Basic Multiplication Schemes
Chapter 10 High-Radix Multipliers
Chapter 11 Tree and Array Multipliers
Chapter 12 Variations in Multipliers


9 Basic Multiplication Schemes

Chapter Goals
Study shift/add or bit-at-a-time multipliers and set the stage for faster methods and variations to be covered in Chapters 10-12

Chapter Highlights
Multiplication = multioperand addition
Hardware, firmware, software algorithms
Multiplying 2’s-complement numbers
The special case of one constant operand


Shift/Add Multiplication Algorithms

Notation for our discussion of multiplication algorithms:

a  Multiplicand       a_{k–1} a_{k–2} . . . a_1 a_0
x  Multiplier         x_{k–1} x_{k–2} . . . x_1 x_0
p  Product (a × x)    p_{2k–1} p_{2k–2} . . . p_3 p_2 p_1 p_0

Initially, we assume unsigned operands

Multiplication of two 4-bit unsigned binary numbers in dot notation.

[Figure: multiplicand a, multiplier x, the partial products bit-matrix x_0 a 2^0, x_1 a 2^1, x_2 a 2^2, x_3 a 2^3, and the product p.]


Multiplication Recurrence

Multiplication with right shifts: top-to-bottom accumulation (preferred)

p^(j+1) = (p^(j) + x_j a 2^k) 2^(–1)   with p^(0) = 0 and p^(k) = p = ax + p^(0) 2^(–k)
(add: p^(j) + x_j a 2^k; shift right: multiplication by 2^(–1))

Multiplication with left shifts: bottom-to-top accumulation

p^(j+1) = 2p^(j) + x_{k–j–1} a   with p^(0) = 0 and p^(k) = p = ax + p^(0) 2^k
(shift: 2p^(j); add: + x_{k–j–1} a)

Examples of Basic Multiplication

Right-shift algorithm
=========================
a         1 0 1 0
x         1 0 1 1
=========================
p(0)      0 0 0 0
+x0a      1 0 1 0
–––––––––––––––––––––––––
2p(1)   0 1 0 1 0
p(1)      0 1 0 1 0
+x1a      1 0 1 0
–––––––––––––––––––––––––
2p(2)   0 1 1 1 1 0
p(2)      0 1 1 1 1 0
+x2a      0 0 0 0
–––––––––––––––––––––––––
2p(3)   0 0 1 1 1 1 0
p(3)      0 0 1 1 1 1 0
+x3a      1 0 1 0
–––––––––––––––––––––––––
2p(4)   0 1 1 0 1 1 1 0
p(4)      0 1 1 0 1 1 1 0
=========================

Left-shift algorithm
========================
a               1 0 1 0
x               1 0 1 1
========================
p(0)            0 0 0 0
2p(0)         0 0 0 0 0
+x3a            1 0 1 0
––––––––––––––––––––––––
p(1)          0 1 0 1 0
2p(1)       0 1 0 1 0 0
+x2a            0 0 0 0
––––––––––––––––––––––––
p(2)        0 1 0 1 0 0
2p(2)     0 1 0 1 0 0 0
+x1a            1 0 1 0
––––––––––––––––––––––––
p(3)      0 1 1 0 0 1 0
2p(3)   0 1 1 0 0 1 0 0
+x0a            1 0 1 0
––––––––––––––––––––––––
p(4)    0 1 1 0 1 1 1 0
========================


Programmed Using Right-Shift Algorithm

{Using right shifts, multiply unsigned m_cand and m_ier,
 storing the resultant 2k-bit product in p_high and p_low.
 Registers: R0 holds 0      Rc for counter
            Ra for m_cand   Rx for m_ier
            Rp for p_high   Rq for p_low}

{Load operands into registers Ra and Rx}
mult:     load   Ra with m_cand
          load   Rx with m_ier
{Initialize partial product and counter}
          copy   R0 into Rp
          copy   R0 into Rq
          load   k into Rc
{Begin multiplication loop}
m_loop:   shift  Rx right 1       {LSB moves to carry flag}
          branch no_add if carry = 0
          add    Ra to Rp         {carry flag is set to cout}
no_add:   rotate Rp right 1       {carry to MSB, LSB to carry}
          rotate Rq right 1       {carry to MSB, LSB to carry}
          decr   Rc               {decrement counter by 1}
          branch m_loop if Rc ≠ 0
{Store the product}
          store  Rp into p_high
          store  Rq into p_low
m_done:   ...

[Figure: register usage. R0 holds 0, Rc the counter, Ra the multiplicand, Rx the multiplier, Rp the high half of the product, Rq the low half of the product.]


Time Complexity of Programmed Multiplication

Assume k-bit words

k iterations of the main loop
6-7 instructions per iteration, depending on the multiplier bit

Thus, 6k + 3 to 7k + 3 machine instructions, ignoring operand loads and result store

k = 32 implies 200+ instructions on average

This is too slow for many modern applications!
Microprogrammed multiply would be somewhat better


Sequential Multiplication with Right Shifts

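[Figure: hardware realization. The multiplier x is held in a shift register; its current bit x_j controls a mux that selects 0 or the k-bit multiplicand a, giving x_j a. A k-bit adder adds x_j a to the upper part of the double-width partial product p^(j), and the result, including the adder's c_out, is shifted right.]

Clock?

Control path?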


Sequential Multiplication with Left Shifts

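[Figure: hardware realization with left shifts. The multiplier bit x_{k–j–1} controls a mux that selects 0 or the multiplicand a; a 2k-bit adder adds the selected value to the left-shifted 2k-bit partial product p^(j).]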


Multiplication of Signed Numbers

============================
a           1 0 1 1 0
x           0 1 0 1 1
============================
p(0)        0 0 0 0 0
+x0a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(1)     1 1 0 1 1 0
p(1)        1 1 0 1 1 0
+x1a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(2)     1 1 0 0 0 1 0
p(2)        1 1 0 0 0 1 0
+x2a        0 0 0 0 0
–––––––––––––––––––––––––––––
2p(3)     1 1 1 0 0 0 1 0
p(3)        1 1 1 0 0 0 1 0
+x3a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(4)     1 1 0 0 1 0 0 1 0
p(4)        1 1 0 0 1 0 0 1 0
+x4a        0 0 0 0 0
–––––––––––––––––––––––––––––
2p(5)     1 1 1 0 0 1 0 0 1 0
p(5)        1 1 1 0 0 1 0 0 1 0
============================

Negative multiplicand, positive multiplier:

No change, other than looking out for proper sign extension


Multiplication with a Negative Multiplier

============================
a           1 0 1 1 0
x           1 0 1 0 1
============================
p(0)        0 0 0 0 0
+x0a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(1)     1 1 0 1 1 0
p(1)        1 1 0 1 1 0
+x1a        0 0 0 0 0
–––––––––––––––––––––––––––––
2p(2)     1 1 1 0 1 1 0
p(2)        1 1 1 0 1 1 0
+x2a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(3)     1 1 0 0 1 1 1 0
p(3)        1 1 0 0 1 1 1 0
+x3a        0 0 0 0 0
–––––––––––––––––––––––––––––
2p(4)     1 1 1 0 0 1 1 1 0
p(4)        1 1 1 0 0 1 1 1 0
+(−x4a)     0 1 0 1 0
–––––––––––––––––––––––––––––
2p(5)     0 0 0 1 1 0 1 1 1 0
p(5)        0 0 0 1 1 0 1 1 1 0
============================

Negative multiplicand, negative multiplier:

In last step (the sign bit), subtract rather than add

10101 = −1 × 2^4 + 2^2 + 2^0
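The only change needed for a 2's-complement multiplier is the sign of the last term, as this sketch of the right-shift recurrence shows. Function names are assumptions; operands are passed as k-bit patterns, and Python's signed integers with arithmetic right shift take the place of sign extension.

```python
def to_signed(bits, k):
    """Interpret a k-bit pattern as a 2's-complement value."""
    return bits - (1 << k) if (bits >> (k - 1)) & 1 else bits


def multiply_twos_complement(a_bits, x_bits, k):
    """Right-shift multiplication; the sign bit of x has negative weight."""
    a = to_signed(a_bits, k)
    p = 0
    for j in range(k):
        term = ((x_bits >> j) & 1) * a * (1 << k)
        if j == k - 1:
            term = -term          # last step: subtract rather than add
        p = (p + term) >> 1       # exact: the sum is always even
    return p
```

With a = (10110)two = −10, the call with x = (10101)two = −11 returns 110, and with x = (01011)two = 11 it returns −110, matching the two worked examples.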


Booth’s Recoding

–––––––––––––––––––––––––––––––––––––
x_i  x_{i–1}  y_i   Explanation
–––––––––––––––––––––––––––––––––––––
0    0         0    No string of 1s in sight
0    1         1    End of string of 1s in x
1    0        −1    Beginning of string of 1s in x
1    1         0    Continuation of string of 1s in x
–––––––––––––––––––––––––––––––––––––

Example
     1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0   Operand x
(1) −1  0  1  0  0 −1  1  0 −1  1 −1  1  0  0 −1  0   Recoded version y

Justification
2^j + 2^(j–1) + . . . + 2^(i+1) + 2^i = 2^(j+1) – 2^i
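The recoding table amounts to y_i = x_{i–1} – x_i with x_{–1} = 0, which the following sketch applies to a bit list written most significant bit first. The function name and the list representation are assumptions for illustration.

```python
def booth_recode(x_bits):
    """Radix-2 Booth recoding of an MSB-first bit list into digits in {-1, 0, 1}."""
    digits = []
    prev = 0                        # x_{-1} = 0
    for xi in reversed(x_bits):     # scan from the least significant bit
        digits.append(prev - xi)    # y_i = x_{i-1} - x_i
        prev = xi
    return digits[::-1]             # back to MSB-first order
```

Applied to the operand 1001 1101 1010 1110 it reproduces the recoded version shown above (without the parenthesized extra digit), and the recoded digits represent the 2's-complement value of x.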


Example Multiplication with Booth’s Recoding

============================
a           1 0 1 1 0
x           1 0 1 0 1      Multiplier
y          −1 1 −1 1 −1    Booth-recoded
============================
p(0)        0 0 0 0 0
+y0a        0 1 0 1 0
–––––––––––––––––––––––––––––
2p(1)     0 0 1 0 1 0
p(1)        0 0 1 0 1 0
+y1a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(2)     1 1 1 0 1 1 0
p(2)        1 1 1 0 1 1 0
+y2a        0 1 0 1 0
–––––––––––––––––––––––––––––
2p(3)     0 0 0 1 1 1 1 0
p(3)        0 0 0 1 1 1 1 0
+y3a        1 0 1 1 0
–––––––––––––––––––––––––––––
2p(4)     1 1 1 0 0 1 1 1 0
p(4)        1 1 1 0 0 1 1 1 0
+y4a        0 1 0 1 0
–––––––––––––––––––––––––––––
2p(5)     0 0 0 1 1 0 1 1 1 0
p(5)        0 0 0 1 1 0 1 1 1 0
============================

The 2’s complement of 10110 is 01010


Multiplication by Constants

Explicit, e.g. y := 12 ∗ x + 1

Implicit, e.g. A[i, j] := A[i, j] + B[i, j]

Address of A[i, j] = base + n ∗ i + j

[Figure: an m × n array with rows 0 ... m – 1 and columns 0 ... n – 1, highlighting row i and column j.]

Software aspects:
Optimizing compilers replace multiplications by shifts/adds/subs
Produce efficient code using as few registers as possible
Find the best code by a time/space-efficient algorithm

Hardware aspects:
Synthesize special-purpose units such as filters

y[t] = a_0 x[t] + a_1 x[t – 1] + a_2 x[t – 2] + b_1 y[t – 1] + b_2 y[t – 2]


Multiplication Using Binary Expansion

Example: Multiply R1 by the constant 113 = (1 1 1 0 0 0 1)two

R2 ← R1 shift-left 1
R3 ← R2 + R1
R6 ← R3 shift-left 1
R7 ← R6 + R1
R112 ← R7 shift-left 4
R113 ← R112 + R1

Ri: Register that contains i times (R1)

This notation is for clarity; only one register other than R1 is needed

Shorter sequence using shift-and-add instructions

R3 ← R1 shift-left 1 + R1
R7 ← R3 shift-left 1 + R1
R113 ← R7 shift-left 4 + R1
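The step sequence can be generated mechanically from the binary expansion of the constant. The sketch below is a hypothetical helper (its name and step encoding are not from the slides) that scans the bits below the leading 1, merging runs of 0s into a single shift, and tracks which multiple of R1 the working register holds.

```python
def shift_add_steps(constant):
    """Return (steps, multiple) for multiplying R1 by a positive constant.

    steps is a list of ("shift-left", amount) and ("add R1", None) tuples;
    multiple is the multiple of R1 held after executing them.
    """
    bits = bin(constant)[2:]
    steps = []
    multiple = 1                 # the working register starts as a copy of R1
    pending = 0                  # shift amount accumulated over a run of 0s
    for b in bits[1:]:
        pending += 1
        if b == "1":
            steps.append(("shift-left", pending))
            multiple <<= pending
            steps.append(("add R1", None))
            multiple += 1
            pending = 0
    if pending:                  # trailing 0s need one final shift
        steps.append(("shift-left", pending))
        multiple <<= pending
    return steps, multiple
```

For the constant 113 this yields the same six operations as the first sequence above: shift 1, add, shift 1, add, shift 4, add.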


Multiplication via Recoding

Example: Multiply R1 by 113 = (1 1 1 0 0 0 1)two = (1 0 0 −1 0 0 0 1)two

R8 ← R1 shift-left 3
R7 ← R8 – R1
R112 ← R7 shift-left 4
R113 ← R112 + R1

Shorter sequence using shift-and-add/subtract instructions

R7 ← R1 shift-left 3 – R1
R113 ← R7 shift-left 4 + R1

6 shift or add (3 shift-and-add) instructions needed without recoding


Multiplication via Factorization

Example: Multiply R1 by 119 = 7 × 17 = (8 – 1) × (16 + 1)

R8 ← R1 shift-left 3
R7 ← R8 – R1
R112 ← R7 shift-left 4
R119 ← R112 + R7

Shorter sequence using shift-and-add/subtract instructions

R7 ← R1 shift-left 3 – R1
R119 ← R7 shift-left 4 + R7

119 = (1 1 1 0 1 1 1)two = (1 0 0 0 −1 0 0 −1)two

More instructions may be needed without factorization

Requires a scratch register for holding the 7 multiple


High-Radix Multipliers

Chapter Goals
Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, . . .)

Chapter Highlights
High radix gives rise to “difficult” multiples
Recoding (change of digit-set) as remedy
Carry-save addition reduces cycle time
Implementation and optimization methods


Radix-4 Multiplication in Dot Notation

Number of cycles is halved, but now the “difficult” multiple 3a must be dealt with

[Figure: radix-2 dot notation with the four partial products x_0 a 2^0, x_1 a 2^1, x_2 a 2^2, x_3 a 2^3, next to the radix-4 version with the two partial products (x_1 x_0)two a 4^0 and (x_3 x_2)two a 4^1, each showing multiplicand a, multiplier x, and product p.]

Radix-4, or two-bit-at-a-time, multiplication in dot notation


A Possible Design for a Radix-4 Multiplier

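Figure: A mux controlled by the multiplier bits xi+1 xi selects one of 0, a, 2a, or 3a (inputs 00, 01, 10, 11) and sends it to the adder; the multiplier register is shifted by 2 bits per cycle. The multiple 3a is precomputed via shift-and-add (3a = 2a + a).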


Example Radix-4 Multiplication Using 3a

================================
a                     0 1 1 0
3a                0 1 0 0 1 0
x                     1 1 1 0
================================
p(0)                  0 0 0 0
+(x1x0)two a      0 0 1 1 0 0
––––––––––––––––––––––––––––––––
4p(1)             0 0 1 1 0 0
p(1)                  0 0 1 1 0 0
+(x3x2)two a      0 1 0 0 1 0
––––––––––––––––––––––––––––––––
4p(2)             0 1 0 1 0 1 0 0
p(2)              0 1 0 1 0 1 0 0
================================
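The example above can be checked with a short behavioral model. This is a sketch under the assumption of unsigned k-bit operands with k even; the function name `radix4_multiply` is illustrative and does not come from the slides. Like the hardware, it adds the selected multiple to the upper part of the partial product and then shifts right by 2 bits.

```python
def radix4_multiply(a, x, k):
    """Unsigned k-bit multiplication, two multiplier bits per cycle (k even)."""
    multiples = [0, a, 2 * a, 2 * a + a]   # 3a precomputed as 2a + a
    p = 0
    for j in range(k // 2):
        digit = (x >> (2 * j)) & 3         # next radix-4 digit (x_{i+1} x_i)
        # Add the multiple at the upper end, then shift right by 2 bits.
        p = (p + (multiples[digit] << k)) >> 2
    return p


print(radix4_multiply(0b0110, 0b1110, 4))   # a = 6, x = 14
```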


A Second Design for a Radix-4 Multiplier

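xi+1  xi  c    Mux control   Set carry
----  --  -    -----------   ---------
 0    0   0       0 0            0
 0    0   1       0 1            0
 0    1   0       0 1            0
 0    1   1       1 0            0
 1    0   0       1 0            0
 1    0   1       1 1            1
 1    1   0       1 1            1
 1    1   1       0 0            1

This design replaces 3a with 4a (a carry into the next higher radix-4 multiplier digit) and –a.

Figure: A mux with inputs 0, a, 2a, and –a (00, 01, 10, 11) feeds the adder; the multiplier is shifted by 2 bits per cycle, and a carry flip-flop (FF) holds c. The mux control bits are xi+1 ⊕ (xi c) and xi ⊕ c, so the selected multiple is (xi+1 xi)two + c mod 4. The carry FF is set to xi+1(xi ∨ c), that is, if xi+1 = xi = 1 or if xi+1 = c = 1.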
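A behavioral sketch of this second design follows. It assumes unsigned k-bit operands with k even, and it assumes that a carry left over after the last digit is resolved by adding a once more at weight 4^(k/2); that final correction is not shown on the slide. The name `radix4_multiply_no3a` is hypothetical.

```python
def radix4_multiply_no3a(a, x, k):
    """Radix-4 multiplication that never forms 3a (uses -a plus a carry)."""
    p = 0
    c = 0                                   # carry into the next radix-4 digit
    for j in range(k // 2):
        v = ((x >> (2 * j)) & 3) + c        # digit value plus incoming carry: 0..4
        if v <= 2:
            multiple, c = v * a, 0          # select 0, a, or 2a
        elif v == 3:
            multiple, c = -a, 1             # 3a = 4a - a
        else:
            multiple, c = 0, 1              # 4a = carry only
        p += multiple * (4 ** j)
    p += c * a * (4 ** (k // 2))            # assumed final correction step
    return p


print(radix4_multiply_no3a(6, 14, 4))
```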


Radix-4 Booth’s Recoding

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Context            Recoded           Radix-4
                   radix-2 digits    digit
xi+1  xi  xi–1     yi+1  yi          zi/2     Explanation
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 0    0    0         0    0            0      No string of 1s in sight
 0    0    1         0    1            1      End of string of 1s
 0    1    0         0    1            1      Isolated 1
 0    1    1         1    0            2      End of string of 1s
 1    0    0        −1    0           −2      Beginning of string of 1s
 1    0    1        −1    1           −1      End a string, begin new one
 1    1    0         0   −1           −1      Beginning of string of 1s
 1    1    1         0    0            0      Continuation of string of 1s
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

Example

    1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0                Operand x

(1) −1 0 1 0 0 −1 1 0 −1 1 −1 1 0 0 −1 0           Recoded version y

(1) −2 2 −1 2 −1 −1 0 −2                           Radix-4 version z

Only shifting and complementation required
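The recoding table corresponds to the rule zi/2 = –2xi+1 + xi + xi–1 with x–1 = 0. The sketch below applies that rule; the function names are illustrative, and the multiplier is assumed to be a k-bit 2's-complement pattern with k even (for an unsigned operand, the extra leading digit shown in parentheses in the example would also be needed).

```python
def booth_radix4_digits(x, k):
    """Radix-4 Booth digits of the k-bit pattern x, least significant first."""
    digits = []
    prev = 0                                  # x_{-1} = 0
    for i in range(0, k, 2):
        xi = (x >> i) & 1
        xi1 = (x >> (i + 1)) & 1
        digits.append(-2 * xi1 + xi + prev)   # value in {-2, -1, 0, 1, 2}
        prev = xi1
    return digits


def booth_multiply(a, x, k):
    """Multiply integer a by the k-bit 2's-complement pattern x."""
    return sum(z * a * 4 ** j for j, z in enumerate(booth_radix4_digits(x, k)))


print(booth_radix4_digits(0b1001110110101110, 16)[::-1])
print(booth_multiply(6, 0b1010, 4))           # 6 times (-6)
```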


Example Multiplication via Modified Booth’s Recoding

================================
a                     0 1 1 0
x                     1 0 1 0
z                      −1  −2        Radix-4
================================
p(0)              0 0 0 0 0 0
+z0a              1 1 0 1 0 0
––––––––––––––––––––––––––––––––
4p(1)             1 1 0 1 0 0
p(1)              1 1 1 1 0 1 0 0
+z1a              1 1 1 0 1 0
––––––––––––––––––––––––––––––––
4p(2)             1 1 0 1 1 1 0 0
p(2)              1 1 0 1 1 1 0 0
================================


Multiple Generation with Radix-4 Booth’s Recoding

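Figure: Recoding logic examines the multiplier bits xi+1, xi, and xi–1 (xi–1 initially 0; the multiplier is shifted by 2 bits per cycle) and produces the signals neg, two, and non0. A mux selected by two chooses a or 2a from the multiplicand, and non0 enables the result, so that the (k+1)-bit value 0, a, or 2a goes to the adder input; neg is the add/subtract control, so that zi/2 a is accumulated.

Annotations on the figure: “Could have named this signal one/two” (on the select signal); “Sign extension, not 0”.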


Using Carry-Save Adders

Figure: Two muxes, controlled by xi and xi+1, select 0 or a and 0 or 2a. A CSA combines these two multiples with the old cumulative partial product, and an adder produces the new cumulative partial product.


Keeping the Partial Product in Carry-Save Form

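Figure: The partial product is held as separate k-bit sum and carry words. In each cycle, a mux selects the next multiple (0 or the multiplicand), a k-bit CSA adds it to the old partial product, and the result is shifted; a k-bit adder converts the final carry-save value to binary. In dot notation: old PP + next multiple gives the CS sum, which becomes the new PP.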


Carry-Save Multiplier with Radix-4 Booth’s Recoding (1/2)

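Figure: A Booth recoder and selector, driven by xi+1, xi, and xi–1, supplies zi/2 a and an extra “dot” to a CSA, which combines them with the old cumulative partial product to form the new cumulative partial product in carry-save form. A 2-bit adder with a carry FF produces the two bits sent to the lower half of the partial product, and a final adder converts the result to binary.


Carry-Save Multiplier with Radix-4 Booth’s Recoding (2/2)

Figure: Recoding logic produces neg, two, and non0 from xi+2, xi+1, and xi. A mux selected by two chooses a or 2a, and non0 enables it, giving the (k+1)-bit value 0, a, or 2a. A selective-complement stage controlled by neg then produces the (k+2)-bit value 0, a, 2a, or the bitwise complement of a or 2a; neg also supplies the extra “dot” for column i.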


Another Design for Radix-4 Multiplication

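Figure: Two muxes, controlled by xi and xi+1, select 0 or a and 0 or 2a. Two CSAs combine these multiples with the old cumulative partial product, which is kept in carry-save form, to give the new cumulative partial product. A 2-bit adder with a carry FF produces the bits sent to the lower half of the partial product, and a final adder converts the result to binary.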


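Radix-8 and Radix-16 Multipliers

Figure: Radix-16 multiplication. Four muxes controlled by xi+3, xi+2, xi+1, and xi select 0 or 8a, 0 or 4a, 0 or 2a, and 0 or a. A tree of CSAs combines the four multiples with the upper half of the partial product (held as sum and carry), followed by a 4-bit right shift. A 4-bit adder with a carry FF produces the 4 bits sent to the lower half of the partial product.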


A Spectrum of Multiplier Design Choices

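Figure: The design spectrum runs from basic binary (one adder accumulates the next multiple into the partial product), through high-radix or partial-tree designs (several multiples are reduced by a small CSA tree, then an adder), to full-tree designs (all multiples are reduced by a full CSA tree, then an adder). Moving toward the full tree speeds up the multiplier; moving toward basic binary economizes.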


VLSI Complexity Issues

A radix-$2^b$ multiplier requires:

bk two-input AND gates to form the partial products bit-matrix
O(bk) area for the CSA tree
At least Θ(k) area for the final carry-propagate adder

Total area: A = O(bk)
Latency: T = O((k/b) log b + log k)

Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints:

AT grows at least as fast as $k^{3/2}$

$AT^2$ is at least proportional to $k^2$

The preceding radix-$2^b$ implementations are suboptimal, because:

$AT = O(k^2 \log b + bk \log k)$
$AT^2 = O((k^3/b) \log^2 b)$


Comparing High- and Low-Radix Multipliers

Intermediate designs do not yield better AT or $AT^2$ values; the multipliers remain asymptotically suboptimal for any b

          Low-Cost         High Speed           AT- or AT^2-
          b = O(1)         b = O(k)             Optimal
AT        O(k^2)           O(k^2 log k)         O(k^(3/2))
AT^2      O(k^3)           O(k^2 log^2 k)       O(k^2)

$AT = O(k^2 \log b + bk \log k)$     $AT^2 = O((k^3/b) \log^2 b)$

By the AT measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers. Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers

High-radix multiplier latency can be reduced from O((k/b) log b + log k) to O(k/b + log k) through more effective pipelining (Chapter 11)


Tree and Array Multipliers

Chapter Goals
Study the design of multipliers for highest possible performance (speed, throughput)

Chapter Highlights
Tree multiplier = reduction tree + redundant-to-binary converter
Avoiding full sign extension in multiplying signed numbers
Array multiplier = one-sided reduction tree + ripple-carry adder


Full-Tree Multipliers

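Figure: General structure of a full-tree multiplier. Multiple-forming circuits produce the multiples of the multiplicand a selected by the multiplier; a partial-products reduction tree (multi-operand addition tree) reduces them to a redundant result; a redundant-to-binary converter produces the higher-order product bits. Some lower-order product bits are generated directly.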


Full-Tree versus Partial-Tree Multiplier

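Figure: In a full-tree multiplier, all partial products enter a large, log-depth tree of carry-save adders, followed by an adder that produces the product. In a partial-tree multiplier, several partial products at a time enter a small, log-depth tree of carry-save adders, followed by an adder that produces the product.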


Variations in Full-Tree Multiplier Design

Designs are distinguished by variations in three elements:

1. Multiple-forming circuits

2. Partial products reduction tree

3. Redundant-to-binary converter


Example of Variations in CSA Tree Design

Wallace Tree (5 FAs + 3 HAs + 4-Bit Adder)

1 2 3 4 3 2 1
FA FA FA HA
--------------------
1 3 2 3 2 1 1
FA HA FA HA
----------------------
2 2 2 2 1 1 1
4-Bit Adder
----------------------
1 1 1 1 1 1 1 1

Dadda Tree (4 FAs + 2 HAs + 6-Bit Adder)

1 2 3 4 3 2 1
FA FA
--------------------
1 3 2 2 3 2 1
FA HA HA FA
----------------------
2 2 2 2 1 2 1
6-Bit Adder
----------------------
1 1 1 1 1 1 1 1

Two different binary 4 × 4 tree multipliers. Each row of numbers gives the number of dots in each column of the bit-matrix at that stage.

Latency!!


A 7X7 Tree Multiplier

Figure: A 7 × 7 tree multiplier built of 7-bit CSAs, a 10-bit CSA, and a 10-bit CPA. The seven partial products occupy bit positions [0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11], and [6, 12].

The index pair [i, j] means that bit positions from i up to j are involved.


Balanced-Delay Tree for 11 Inputs

Figure: A balanced-delay tree of full adders for 11 inputs, showing the inputs, the level-1, level-2, and level-3 carries, the level-4 carry, and the outputs.

11 + ψ1 = 2ψ1 + 3

Therefore, ψ1 = 8 carries are needed


Binary Tree of 4-to-2 Reduction Modules

Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI

CSA

CSA

4-to-2 4-to-2 4-to-2 4-to-2

4-to-2 4-to-2

4-to-2 reduction module implemented with two levels of (3; 2)-counters


Tree Multipliers for Signed Numbers

From Fig. 8.18 Sign extension in multioperand addition.

---------- Extended positions ----------    Sign    Magnitude positions ---------
xk–1  xk–1  xk–1  xk–1  xk–1                xk–1    xk–2  xk–3  xk–4  . . .
yk–1  yk–1  yk–1  yk–1  yk–1                yk–1    yk–2  yk–3  yk–4  . . .
zk–1  zk–1  zk–1  zk–1  zk–1                zk–1    zk–2  zk–3  zk–4  . . .

Five redundant copies removed

The difference in multiplication is the shifting sign positions

Fig. 11.7 Sharing of full adders to reduce the CSA width in a signed tree multiplier. (In the figure, the sign bits α, β, γ of three operands and their sign extensions feed a row of shared FAs.)


Using the Negative-Weight Property of the Sign Bit

Sign extension is a way of converting negatively weighted bits (negabits) to positively weighted bits (posibits) to facilitate reduction, but there are other methods of accomplishing the same without introducing a lot of extra bits

Baugh and Wooley have contributed two such methods

Figure: Bit-matrices for the 5 × 5 multiplication of a = a4 a3 a2 a1 a0 by x = x4 x3 x2 x1 x0, producing p9 . . . p0: (a) unsigned; (b) 2's-complement, in which the terms a4xj and aix4 (i, j < 4) carry negative weight; (c) Baugh-Wooley; (d) modified Baugh-Wooley.

The Baugh-Wooley Method and Its Modified Form

Fig. 11.8 (panels c and d): Baugh-Wooley and modified Baugh-Wooley bit-matrices for 5 × 5 2's-complement multiplication.

Baugh-Wooley:

$-a_4 x_0 = a_4(1 - x_0) - a_4 = a_4 x_0' - a_4$

The leftover $-a_4$ is replaced by $a_4$ in the same column and $-a_4$ in the next column.

Modified Baugh-Wooley:

$-a_4 x_0 = (1 - a_4 x_0) - 1 = (a_4 x_0)' - 1$

The leftover $-1$ is replaced by $1$ in the same column and $-1$ in the next column.
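The modified Baugh-Wooley identity can be checked exhaustively for the 5 × 5 case. The sketch below assumes that the accumulated –1 terms reduce to constant 1s in columns k and 2k – 1 (columns 5 and 9 here), which follows from summing the identity over all negatively weighted terms modulo 2^10; the function names are illustrative.

```python
def modified_baugh_wooley(a, x, k=5):
    """Product of two k-bit 2's-complement patterns using only positive dots.

    Terms involving exactly one sign bit are complemented, and constant 1s
    are added in columns k and 2k - 1. Returns a 2k-bit pattern.
    """
    a_bits = [(a >> i) & 1 for i in range(k)]
    x_bits = [(x >> j) & 1 for j in range(k)]
    total = 0
    for i in range(k):
        for j in range(k):
            dot = a_bits[i] & x_bits[j]
            if (i == k - 1) != (j == k - 1):   # exactly one sign bit involved
                dot ^= 1                        # use (a_i x_j)' instead
            total += dot << (i + j)
    total += (1 << k) + (1 << (2 * k - 1))      # the two constant 1s
    return total % (1 << (2 * k))


def to_signed(v, bits):
    """Interpret the bits-wide pattern v as a 2's-complement integer."""
    return v - (1 << bits) if v >> (bits - 1) else v


print(to_signed(modified_baugh_wooley(0b11010, 0b00111), 10))   # (-6) * 7
```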


Alternate Views of the Baugh-Wooley Methods

+ 0 0 –a4x3 –a4x2 –a4x1 –a4x0
+ 0 0 –a3x4 –a2x4 –a1x4 –a0x4
--------------------------------------------
– 0 0 a4x3 a4x2 a4x1 a4x0
– 0 0 a3x4 a2x4 a1x4 a0x4
--------------------------------------------
+ 1 1 a4x3 a4x2 a4x1 a4x0
+ 1 1 a3x4 a2x4 a1x4 a0x4
  1
  1
--------------------------------------------
+ a4 a4 a4x3 a4x2 a4x1 a4x0
+ x4 x4 a3x4 a2x4 a1x4 a0x4
  a4x4
--------------------------------------------
  a4 1 x4



Partial-Tree Multipliers

Fig. 11.9 General structure of a partial-tree multiplier.

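Figure: A CSA tree with h inputs reduces h multiples together with the upper part of the cumulative partial product, which is kept in stored-carry form (sum and carry). An h-bit adder with a carry FF produces the lower part of the cumulative partial product, and a final adder converts the upper part to binary.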

High-radix versus partial-tree multipliers: The difference is quantitative, not qualitative

For small h, say ≤ 8 bits, we view the multiplier of Fig. 11.9 as high-radix

When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier

Better design through pipelining to be covered in Section 11.6


Truncated Multipliers

Removing the dots at the right does not lead to much loss of precision.

        . o o o o o o o o          k-by-k fractional
      × . o o o o o o o o          multiplication
      ---------------------------------
        .   o o o o o o o|o
        .     o o o o o o|o o
        .       o o o o o|o o o
        .         o o o o|o o o o
        .           o o o|o o o o o
        .             o o|o o o o o o
        .               o|o o o o o o o
        .                |o o o o o o o o
      ---------------------------------
        . o o o o o o o o|o o o o o o o o

The vertical bar marks the ulp boundary; the dots to its right are removed.

Max error = 8/2 + 7/4 + 6/8 + 5/16 + 4/32 + 3/64 + 2/128 + 1/256 = 7.004 ulp

Mean error = 1.751 ulp
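The two error figures can be reproduced with exact arithmetic. In the sketch below, the maximum error assumes every removed dot is 1; the mean error additionally assumes that each dot is the AND of two independent, uniformly random bits, so its expected value is 1/4. The slides do not state that assumption explicitly, and the function name is illustrative.

```python
from fractions import Fraction


def truncation_error_ulp(k):
    """Max and mean error (in ulp) from dropping all dots right of the ulp."""
    # The m-th removed column has weight 2^-m ulp and holds k - m + 1 dots.
    max_err = sum(Fraction(k - m + 1, 2 ** m) for m in range(1, k + 1))
    mean_err = max_err / 4       # assumed: each dot is 1 with probability 1/4
    return max_err, mean_err


max_err, mean_err = truncation_error_ulp(8)
print(float(max_err), float(mean_err))
```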


Truncated Multipliers with Error Compensation

Constant and variable error compensation for truncated multipliers.

We can introduce additional “dots” on the left-hand side to compensate for the removal of dots from the right-hand side

Constant compensation Variable compensation

. o o o o o o o| . o o o o o o o|

. o o o o o o| . o o o o o o|

. o o o o o| . o o o o o|

. o o o o| . o o o o|

. o o o| . o o o|

. 1 o o| . o o|

. o| . x-1o|

. | . y-1 |

Constant compensation: Max error = +4 ulp, Max error ≅ −3 ulp, Mean error = ? ulp

Variable compensation: Max error = +? ulp, Max error ≅ −? ulp, Mean error = ? ulp


Array Multipliers

A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.

Figure: The rows x0a, x1a, x2a, x3a, and x4a feed a chain of CSAs followed by a ripple-carry adder, which produces the product ax. In the detailed 5 × 5 version, the AND terms ai xj enter an array of FA cells, with 0s on the unused inputs, and the outputs are p0 through p9.

Details of a 5×5 array multiplier using FA blocks.

[3:2] Adder, i.e. a full adder


Signed (2’s-complement) Array Multiplier

The 5 × 5 array multiplier can be modified to deal with 2’s-complement inputs using the Baugh-Wooley method, or to shorten the critical path.


Array Multiplier Built of Modified Full-Adder Cells

Design of a 5 × 5 array multiplier with two additive inputs and full-adder blocks that include AND gates.



Array Multiplier without a Final Carry-Propagate Adder

Figure: Level i of an array multiplier augmented with multiplexers (signals Bi and Bi+1), so that no final carry-propagate adder is needed.

All remaining bits of the final product produced only 2 gate levels after pk–1

See next slide


Extend Bits in Less-Significant Part in a Conditional Adder

The circuit in the right part can be viewed as a conditional adder, in the same way as the circuit in the left part. Source: Ercegovac and Lang, “Digital Arithmetic”, pp. 86-87


Pipelined Tree and Array Multipliers

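General structure of a partial-tree multiplier. (Figure: a CSA tree with h inputs, the upper part of the cumulative partial product held in stored-carry form as sum and carry, an h-bit adder with a carry FF, and a final adder.)

Efficiently pipelined partial-tree multiplier. (Figure: the h inputs enter a pipelined CSA tree with latches between its levels; two further CSAs and a latch accumulate the stored-carry partial product, so that the whole forms an (h + 2)-input CSA tree. The h-bit adder with FF and the final adder are as before.)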


Pipelined Array Multipliers

With latches after every FA level, the maximum throughput is achieved

Latches may be inserted after every h FA levels for an intermediate design

Pipelined 5×5 array multiplier using latched FA blocks. The small shaded boxes are latches. (In the figure, each cell is a latched FA with an AND gate; the inputs are a4 . . . a0 and x4 . . . x0, and the outputs are p9 . . . p0.)

Example: 3-stage pipeline


Variations in Multipliers

Chapter Goals
Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.)

Chapter Highlights
Building a multiplier from smaller units
Performing multiply-add as one operation
Bit-serial and (semi)systolic multipliers
Using a multiplier for squaring is wasteful


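Divide-and-Conquer Designs

Building wide multiplier from narrower ones

Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers.

Figure: The operands are split into b-bit halves, a = (aH, aL) and x = (xH, xL). The b × b multipliers form the four 2b-bit partial products aL xL, aL xH, aH xL, and aH xH. In the rearranged partial products of the 2b-by-2b multiplication, aH xH and aL xL sit side by side in one row, while aL xH and aH xL are each offset by b bits.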
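The decomposition can be written down directly. In this sketch the b × b multiplier is passed in as a function (`mul_b`, a name chosen here for illustration) so that the recursion is visible; plain Python multiplication stands in for the hardware block.

```python
def multiply_2b(a, x, b, mul_b=lambda u, v: u * v):
    """Build a 2b x 2b unsigned multiplier from four b x b multiplications."""
    mask = (1 << b) - 1
    a_lo, a_hi = a & mask, a >> b
    x_lo, x_hi = x & mask, x >> b
    # aH*xH and aL*xL do not overlap; the two cross terms are offset by b bits.
    return ((mul_b(a_hi, x_hi) << (2 * b))
            + ((mul_b(a_hi, x_lo) + mul_b(a_lo, x_hi)) << b)
            + mul_b(a_lo, x_lo))


def multiply_8x8(a, x):
    """8 x 8 multiplier built recursively from 2 x 2 multiplications."""
    mul_4x4 = lambda u, v: multiply_2b(u, v, 2)
    return multiply_2b(a, x, 4, mul_4x4)


print(multiply_8x8(200, 173))
```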


General Structure of a Recursive Multiplier

2b × 2b use (3; 2)-counters
3b × 3b use (5; 2)-counters
4b × 4b use (7; 2)-counters

Using b × b multipliers to synthesize 2b × 2b, 3b × 3b, and 4b × 4b multipliers.


An 8 × 8 Multiplier Using 4 × 4 Multipliers

Figure: Four 4 × 4 multipliers form aH xH, aH xL, aL xH, and aL xL from the operand halves [4, 7] and [0, 3]. Their 8-bit products occupy positions [8, 15], [4, 11], [4, 11], and [0, 7], respectively. Adders combine the overlapping 4-bit fields to produce p[12, 15], p[8, 11], p[4, 7], and p[0, 3].


Additive Multiply Modules

Additive multiply module with 2 × 4 multiplier (ax) plus 4-bit and 2-bit additive inputs (y and z).

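Figure: (a) Block diagram: the 2 × 4 multiplier forms ax, and a 4-bit adder with carry-in cin adds the additive inputs y and z to give p. (b) Dot notation for the same computation.

p = ax + y + z

A b × c AMM has b-bit and c-bit multiplicative inputs, b-bit and c-bit additive inputs, and a (b + c)-bit output. The output width suffices because

$(2^b - 1) \times (2^c - 1) + (2^b - 1) + (2^c - 1) = 2^{b+c} - 1$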
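The identity above, which shows that the largest possible AMM result still fits in b + c bits, follows from expanding the product. The short derivation is:

```latex
\begin{align*}
(2^b - 1)(2^c - 1) &= 2^{b+c} - 2^b - 2^c + 1 \\
(2^b - 1)(2^c - 1) + (2^b - 1) + (2^c - 1)
  &= 2^{b+c} - 2^b - 2^c + 1 + 2^b + 2^c - 2 \\
  &= 2^{b+c} - 1
\end{align*}
```

Since $2^{b+c} - 1$ is the largest unsigned value representable in b + c bits, no overflow can occur.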


Multiplier Built of AMMs

An 8 × 8 multiplier built of 4×2 AMMs. Inputs marked with an asterisk carry 0s.

Figure: Two columns of four 4 × 2 AMMs each multiply a[0, 3] and a[4, 7] by the 2-bit multiplier fields x[0, 1], x[2, 3], x[4, 5], and x[6, 7]. The outputs are p[0, 1], p[2, 3], p[4, 5], p[6, 7], p[8, 9], p[10, 11], and p[12, 15]. Legend: lines carry either 2 bits or 4 bits.

Understanding an 8 × 8 multiplier built of 4 × 2 AMMs using dot notation


Bit-Serial Multipliers

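Figure: A bit-serial adder (LSB first) consists of an FA and a carry FF; it receives x0, x1, x2, … and y0, y1, y2, … and produces s0, s1, s2, …. A bit-serial multiplier receives a0, a1, a2, … and x0, x1, x2, … and produces p0, p1, p2, …; its internal structure is the design question (shown as “?”).

Systolic arrays: synchronous arrays of processing elements that are interconnected by only short, local wires, thus allowing very high clock rates.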


Semisystolic Serial-Parallel Multiplier

Figure: A row of four FA cells with sum and carry feedback. The multiplicand (parallel in) bits a3 a2 a1 a0 are fixed at the cells; the multiplier (serial in, LSB-first) bits x0, x1, x2, x3 are broadcast to all cells; the product emerges serially.

Semi-systolic circuit for 4 × 4 multiplication in 8 clock cycles.

This is called “semisystolic” because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions


Systolic Retiming as a Design Tool

Example of retiming by delaying the inputs to CL and advancing the outputs from CL by d units

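Figure: A cut separates the circuit into a left part CL and a right part CR. With original delays e and f on the edges entering CL and g and h on the edges leaving CL, the adjusted delays are e + d, f + d, g – d, and h – d.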

A semisystolic circuit can be converted to a systolic circuit via retiming, which involves advancing and retarding signals by means of delay removal and delay insertion in such a way that the relative timings of various parts are unaffected


A First Attempt at Retiming

A retimed version of our semi-systolic multiplier.

Figure: The semisystolic multiplier (multiplicand a3 a2 a1 a0 in parallel, multiplier x0, x1, x2, x3 serial in, product serial out) with Cut 1, Cut 2, and Cut 3 marked, and the circuit that results from retiming at these cuts.


Deriving a Fully Systolic Multiplier



Figure: A retimed version of our semi-systolic multiplier, and the fully systolic multiplier derived from it.


A Direct Design for a Bit-Serial Multiplier

Fig. 12.13 Bit-serial multiplier design in dot notation.

Figure: (a) Structure of the bit-matrix. (b) Reduction after each input bit: the terms ai x(i–1), xi a(i–1), and ai xi are added to the part already accumulated into three numbers to form 2p(i), which is shifted right to obtain p(i); bits already output are not revisited.

Building block for a latency-free bit-serial multiplier. (Figure: a cell built around a (5; 3)-counter and a mux, with inputs ai, xi, sin, cin, and tin and outputs sout, cout, and tout.)

The cellular structure of the bit-serial multiplier based on the cell in Fig. 12.11.


Modular Multipliers

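Modulo-($2^b - 1$) carry-save adder. (Figure: a row of FAs in which the carry out of the leftmost position re-enters at the rightmost position.)

Figure: The 4-bit partial products pass through two mod-15 CSAs and a mod-15 CPA. Bits of weight 16 are divided by 16, that is, moved to weight 1, since 16 mod 15 = 1.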

Design of a 4 × 4 modulo-15 multiplier.
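The arithmetic idea behind the mod-15 design, that a bit of weight 16 can be re-entered at weight 1, can be sketched as follows. This models only the number-theoretic folding, not the CSA structure of the figure; the function name is illustrative.

```python
def mod15_multiply(a, x):
    """Multiply two 4-bit numbers modulo 15 by folding high bits back in."""
    p = a * x                        # up to 8 bits for 4-bit operands
    while p > 15:
        # 16 = 1 (mod 15), so the upper bits can be added to the lower 4 bits.
        p = (p & 15) + (p >> 4)
    return 0 if p == 15 else p       # 15 is the second representation of 0


print(mod15_multiply(13, 11))
```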


Other Examples of Modular Multiplication

One way to design a 4 × 4 modulo-13 multiplier.

16 mod 13 = 3, so a dot of weight 16 is replaced by two dots, of weights 2 and 1.


Squaring

Multiply x by x

Figure: The bit-matrix for multiplying x = x4 x3 x2 x1 x0 by itself, producing p9 . . . p0, and its simplified form. The simplification uses xi xi = xi and xi xj + xj xi = 2 xi xj (a single term in the next higher column); in the simplified matrix, p0 = x0 and p1 = 0.

Design of a 5-bit squarer.


Constant Multiplier



Multiple Constant Multiplier

Source: Ercegovac and Lang, “Digital Arithmetic”, pp. 225


Division




Division

Topics in This Part

Chapter 13 Basic Division Schemes

Chapter 14 High-Radix Dividers

Chapter 15 Variations in Dividers

Chapter 16 Division by Convergence

Review division schemes and various speedup methods
• Hardest basic operation (fortunately, also the rarest)
• Division speedup methods: high-radix, array, . . .
• Combined multiplication/division hardware
• Digit-recurrence vs convergence division schemes


13 Basic Division Schemes

Chapter Goals
Study shift/subtract or bit-at-a-time dividers and set the stage for faster methods and variations to be covered in Chapters 14-16

Chapter Highlights
Shift/subtract divide vs shift/add multiply
Hardware, firmware, software algorithms
Dividing 2’s-complement numbers
The special case of a constant divisor


Shift/Subtract Division Algorithms

Notation for our discussion of division algorithms:

z    Dividend                  z2k–1 z2k–2 . . . z3 z2 z1 z0
d    Divisor                   dk–1 dk–2 . . . d1 d0
q    Quotient                  qk–1 qk–2 . . . q1 q0
s    Remainder, z – (d × q)    sk–1 sk–2 . . . s1 s0

Initially, we assume unsigned operands

Division of an 8-bit number by a 4-bit number in dot notation.

Figure: The dividend z, less the subtracted bit-matrix with rows –q3 d 2^3, –q2 d 2^2, –q1 d 2^1, and –q0 d 2^0, leaves the remainder s; d is the divisor and q the quotient.


Division versus Multiplication (1/2)

Division is more complex than multiplication:

Need for quotient digit selection or estimation

Overflow possibility: the high-order k bits of z must be strictly less than d, because otherwise the quotient of a 2k-bit number divided by a k-bit number may have a width of more than k bits.


Division versus Multiplication (2/2)

Pentium III latencies

Instruction                   Latency   Cycles/Issue
Load / Store                     3           1
Integer Multiply                 4           1
Integer Divide                  36          36
Double/Single FP Multiply        5           2
Double/Single FP Add             3           1
Double/Single FP Divide         38          38


Division Recurrence

Division with left shifts

$s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$ with $s^{(0)} = z$ and $s^{(k)} = 2^k s$

In each step, forming $2s^{(j-1)}$ is the shift and adding $-q_{k-j}(2^k d)$ is the subtract.

(There is no corresponding right-shift algorithm)

Integer division is characterized by $z = d \times q + s$

$2^{-2k} z = (2^{-k} d) \times (2^{-k} q) + 2^{-2k} s$
$z_{frac} = d_{frac} \times q_{frac} + 2^{-k} s_{frac}$

Divide fractions like integers; adjust the remainder

No-overflow condition for fractions is: $z_{frac} < d_{frac}$


Division Recurrence Steps

Initialization

Iterations
• One-digit arithmetic left shift of s(j) to produce rs(j)
• Determination of the quotient digit qj+1 by the quotient-digit selection function (the index of q could be different)
• Generation of the divisor multiple d × qj+1
• Subtraction of d qj+1 from rs(j)
• On-the-fly conversion of the quotient (or done in the termination step)

Termination: make sign(s) = sign(d), conversion


Examples of Basic Division

Integer division
==================================
z              0 1 1 1 0 1 0 1
2^4 d          1 0 1 0
==================================
s(0)           0 1 1 1 0 1 0 1
2s(0)        0 1 1 1 0 1 0 1
–q3 2^4 d      1 0 1 0              {q3 = 1}
––––––––––––––––––––––––––––––––––
s(1)           0 1 0 0 1 0 1
2s(1)        0 1 0 0 1 0 1
–q2 2^4 d      0 0 0 0              {q2 = 0}
––––––––––––––––––––––––––––––––––
s(2)           1 0 0 1 0 1
2s(2)        1 0 0 1 0 1
–q1 2^4 d      1 0 1 0              {q1 = 1}
––––––––––––––––––––––––––––––––––
s(3)           1 0 0 0 1
2s(3)        1 0 0 0 1
–q0 2^4 d      1 0 1 0              {q0 = 1}
––––––––––––––––––––––––––––––––––
s(4)           0 1 1 1
s                      0 1 1 1
q              1 0 1 1
==================================

Fractional division
==================================
zfrac        . 0 1 1 1 0 1 0 1
dfrac        . 1 0 1 0
==================================
s(0)         . 0 1 1 1 0 1 0 1
2s(0)      0 . 1 1 1 0 1 0 1
–q–1 d       . 1 0 1 0              {q–1 = 1}
––––––––––––––––––––––––––––––––––
s(1)         . 0 1 0 0 1 0 1
2s(1)      0 . 1 0 0 1 0 1
–q–2 d       . 0 0 0 0              {q–2 = 0}
––––––––––––––––––––––––––––––––––
s(2)         . 1 0 0 1 0 1
2s(2)      1 . 0 0 1 0 1
–q–3 d       . 1 0 1 0              {q–3 = 1}
––––––––––––––––––––––––––––––––––
s(3)         . 1 0 0 0 1
2s(3)      1 . 0 0 0 1
–q–4 d       . 1 0 1 0              {q–4 = 1}
––––––––––––––––––––––––––––––––––
s(4)         . 0 1 1 1
sfrac      0 . 0 0 0 0 0 1 1 1
qfrac        . 1 0 1 1
==================================

Notice the index of q

What is the residual of (0.011)two / (0.1)two?
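The integer example can be reproduced with a direct transcription of the recurrence s(j) = 2s(j–1) – qk–j (2^k d). The sketch below assumes unsigned operands and models the arithmetic only; the function name and the choice of Python exceptions for the two error cases are illustrative.

```python
def shift_subtract_divide(z, d, k):
    """Divide the 2k-bit dividend z by the k-bit divisor d (both unsigned).

    Returns (q, s) with z = d * q + s and 0 <= s < d.
    """
    if d == 0:
        raise ZeroDivisionError("divisor is zero")
    if (z >> k) >= d:
        raise OverflowError("quotient does not fit in k bits")
    s = z
    q = 0
    for _ in range(k):
        s <<= 1                      # shift: 2 s^(j-1)
        trial = s - (d << k)         # trial subtraction of 2^k d
        q <<= 1
        if trial >= 0:               # quotient digit 1: keep the difference
            s = trial
            q |= 1
    return q, s >> k                 # s^(k) = 2^k s


print(shift_subtract_divide(0b01110101, 0b1010, 4))
```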


Main Factors Affecting the Overall Execution Time and Cost

Radix r

Quotient-digit set
Redundant signed digit?

Representation of the residual
CSA?

Quotient-digit selection


Programmed Division

Register usage for programmed division.

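Figure: Rs holds the shifted partial remainder and Rq the shifted partial quotient. After j steps, the pair contains the partial remainder (2k – j bits) followed by the partial quotient (j bits); the next quotient digit is inserted at the right end of Rq. Rd holds the divisor d, which is aligned as 2^k d (followed by k zeros) relative to the register pair. The carry flag receives the bit shifted out of Rs.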


Assembly Language Program for Division

Programmed division using left shifts.

{Using left shifts, divide unsigned 2k-bit dividend,
 z_high|z_low, storing the k-bit quotient and remainder.
 Registers: R0 holds 0       Rc for counter
            Rd for divisor   Rs for z_high & remainder
            Rq for z_low & quotient}

{Load operands into registers Rd, Rs, and Rq}
div:     load    Rd with divisor
         load    Rs with z_high
         load    Rq with z_low

{Check for exceptions}
         branch  d_by_0 if Rd = R0
         branch  d_ovfl if Rs ≥ Rd

{Initialize counter}
         load    k into Rc

{Begin division loop}
d_loop:  shift   Rq left 1       {zero to LSB, MSB to carry}
         rotate  Rs left 1       {carry to LSB, MSB to carry}
         skip    if carry = 1
         branch  no_sub if Rs < Rd
         sub     Rd from Rs
         incr    Rq              {set quotient digit to 1}
no_sub:  decr    Rc              {decrement counter by 1}
         branch  d_loop if Rc ≠ 0

{Store the quotient and remainder}
         store   Rq into quotient
         store   Rs into remainder

d_by_0:  ...
d_ovfl:  ...
d_done:  ...



Time Complexity of Programmed Division

Assume k-bit words

k iterations of the main loop

6 or 8 instructions per iteration, depending on the quotient bit

Thus, 6k + 3 to 8k + 3 machine instructions, ignoring operand loads and result store

k = 32 implies 220+ instructions on average

This is too slow for many modern applications!

Microprogrammed division would be somewhat better


Restoring Hardware Dividers

Shift/subtract sequential restoring divider.

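Figure: The partial remainder s(j) (initial value z) and the quotient q are held in shift registers, and the divisor d in a k-bit register. A k-bit adder with carry-in 1 forms the trial difference between the upper part of the shifted partial remainder and d. The quotient digit selector derives qk–j from the adder's carry-out and the MSB of 2s(j–1); the trial difference is loaded into the partial-remainder register when qk–j = 1.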


Indirect Signed Division

In division with signed operands, q and s are defined by

z = d × q + s     sign(s) = sign(z)     |s| < |d|

Examples of division with signed operands

z = 5 d = 3 ⇒ q = 1 s = 2

z = 5 d = –3 ⇒ q = –1 s = 2 (not q = –2, s = –1)

z = –5 d = 3 ⇒ q = –1 s = –2

z = –5 d = –3 ⇒ q = 1 s = –2

Magnitudes of q and s are unaffected by input signs
Signs of q and s are derivable from signs of z and d

Will discuss direct signed division later
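The indirect method (divide the magnitudes, then fix the signs) is sketched below. Python's built-in `divmod` rounds the quotient toward negative infinity, so for z = 5 and d = –3 it returns exactly the excluded pair q = –2, s = –1; the helper `signed_divide`, a name chosen for this example, follows the convention sign(s) = sign(z) instead.

```python
def signed_divide(z, d):
    """Signed division with sign(s) = sign(z) and |s| < |d|."""
    q = abs(z) // abs(d)             # unsigned division of the magnitudes
    s = abs(z) % abs(d)
    if (z < 0) != (d < 0):           # quotient is negative if signs differ
        q = -q
    if z < 0:                        # remainder takes the sign of the dividend
        s = -s
    return q, s


for z, d in [(5, 3), (5, -3), (-5, 3), (-5, -3)]:
    print(z, d, signed_divide(z, d), divmod(z, d))
```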


Example of Restoring Unsigned Division

=================================================
z               0 1 1 1 0 1 0 1
2^4 d         0 1 0 1 0
–2^4 d        1 0 1 1 0
=================================================
s(0)          0 0 1 1 1 0 1 0 1
2s(0)         0 1 1 1 0 1 0 1
+(–2^4 d)     1 0 1 1 0
–––––––––––––––––––––––––––––––––––––––––––––––––
s(1)          0 0 1 0 0 1 0 1      Positive, so set q3 = 1
2s(1)         0 1 0 0 1 0 1
+(–2^4 d)     1 0 1 1 0
–––––––––––––––––––––––––––––––––––––––––––––––––
s(2)          1 1 1 1 1 0 1        Negative, so set q2 = 0
s(2) = 2s(1)  0 1 0 0 1 0 1        and restore
2s(2)         1 0 0 1 0 1
+(–2^4 d)     1 0 1 1 0
–––––––––––––––––––––––––––––––––––––––––––––––––
s(3)          0 1 0 0 0 1          Positive, so set q1 = 1
2s(3)         1 0 0 0 1
+(–2^4 d)     1 0 1 1 0
–––––––––––––––––––––––––––––––––––––––––––––––––
s(4)          0 0 1 1 1            Positive, so set q0 = 1
s               0 1 1 1
q               1 0 1 1
=================================================

No overflow, because (0111)two < (1010)two


Nonrestoring and Signed Division

The cycle time in restoring division must be long enough to allow:

Shifting of the registers
Propagation of signals through the adder
Determination and storage of the next quotient digit
Storage of the trial difference, if required

Nonrestoring division to the rescue!

Assume qk–j = 1 and subtract
Store the result as the new PR

(the partial remainder can become incorrect, hence the name “nonrestoring”)


Justification for Nonrestoring Division

Why is it acceptable to store an incorrect value in the partial-remainder register?

Shifted partial remainder at start of the cycle is u

Suppose subtraction yields the negative result u – 2^k d

Option 1: Restore the partial remainder to the correct value u, shift left, and subtract to get 2u – 2^k d

Option 2: Keep the incorrect partial remainder u – 2^k d, shift left, and add to get 2(u – 2^k d) + 2^k d = 2u – 2^k d
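
The equivalence of the two options can be checked with a small software model. The following is a minimal sketch, not a hardware description: the function names `restoring_divide` and `nonrestoring_divide` are hypothetical, operands are unsigned, and Python integers stand in for the registers.

```python
def restoring_divide(z, d, k):
    """Unsigned restoring division of a 2k-bit z by a k-bit d (illustrative model)."""
    assert 0 < d < (1 << k) and 0 <= z < (d << k), "overflow: need z < 2^k d"
    s, q = z, 0
    for _ in range(k):
        trial = 2 * s - (d << k)        # shift left, then subtract 2^k d
        if trial >= 0:
            s, q = trial, 2 * q + 1     # keep the trial difference
        else:
            s, q = 2 * s, 2 * q         # Option 1: restore (keep shifted value only)
    return q, s >> k                    # s(k) = 2^k * remainder


def nonrestoring_divide(z, d, k):
    """Same division, keeping a negative partial remainder and adding next time."""
    assert 0 < d < (1 << k) and 0 <= z < (d << k), "overflow: need z < 2^k d"
    s, q = z, 0
    for _ in range(k):
        if s >= 0:
            s = 2 * s - (d << k)        # previous result was valid: subtract
        else:
            s = 2 * s + (d << k)        # Option 2: shift the wrong value, then add
        q = 2 * q + (1 if s >= 0 else 0)
    if s < 0:                           # final correction of a negative remainder
        s += d << k
    return q, s >> k
```

For the running example, both `restoring_divide(117, 10, 4)` and `nonrestoring_divide(117, 10, 4)` give quotient 11 and remainder 7.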


Example of Nonrestoring Unsigned Division

=======================
z           0 1 1 1 0 1 0 1
2^4 d     0 1 0 1 0
–2^4 d    1 0 1 1 0
=======================
s(0)      0 0 1 1 1 0 1 0 1
2s(0)     0 1 1 1 0 1 0 1      Positive,
+(–2^4 d) 1 0 1 1 0            so subtract
–––––––––––––––––––––––
s(1)      0 0 1 0 0 1 0 1
2s(1)     0 1 0 0 1 0 1        Positive, so set q3 = 1
+(–2^4 d) 1 0 1 1 0            and subtract
–––––––––––––––––––––––
s(2)      1 1 1 1 1 0 1
2s(2)     1 1 1 1 0 1          Negative, so set q2 = 0
+2^4 d    0 1 0 1 0            and add
–––––––––––––––––––––––
s(3)      0 1 0 0 0 1
2s(3)     1 0 0 0 1            Positive, so set q1 = 1
+(–2^4 d) 1 0 1 1 0            and subtract
–––––––––––––––––––––––
s(4)      0 0 1 1 1            Positive, so set q0 = 1
s           0 1 1 1
q           1 0 1 1
=======================

No overflow: (0111)two < (1010)two

Applying “if sign(s) = sign(d) then q_(k–j) = 1 else q_(k–j) = –1”, we get the digits 1 1 –1 1, which equal 8 + 4 – 2 + 1 = 11 = (1011)two


Graphical Depiction of Nonrestoring Division

[Figure: Partial remainder variations for (a) restoring and (b) nonrestoring division. Vertical axis: partial remainder, from –100 to 300. Horizontal axis: s(0), s(1), s(2), s(3), s(4), with s(4) = 16s.]

Example: (0 1 1 1 0 1 0 1)two / (1 0 1 0)two = (117)ten / (10)ten

(a) Restoring: 117, ×2 = 234, –160 = 74, ×2 = 148, –160 = –12 (restored to 148), ×2 = 296, –160 = 136, ×2 = 272, –160 = 112

(b) Nonrestoring: 117, ×2 = 234, –160 = 74, ×2 = 148, –160 = –12, ×2 = –24, +160 = 136, ×2 = 272, –160 = 112


Nonrestoring Division with Signed Operands

Restoring division
q_(k–j) = 0 means no subtraction (or subtraction of 0)
q_(k–j) = 1 means subtraction of d

Nonrestoring division
We always subtract or add
It is as if quotient digits are selected from the set {1, −1}:
1 corresponds to subtraction
−1 corresponds to addition

Our goal is to end up with a remainder that matches the sign of the dividend

This idea of trying to match the sign of s with the sign of z leads to a direct signed division algorithm

if sign(s) = sign(d) then q_(k–j) = 1 else q_(k–j) = −1

Example: q = . . . 0 0 0 1 . . . = . . . 1 −1 −1 −1 . . .


Quotient Conversion and Final Correction

Partial remainder variation and selected quotient digits during nonrestoring division with d > 0

[Figure: partial remainder starting at z and moving between –d and d; each step doubles the value (×2) and then adds or subtracts d, giving the digit sequence below.]

Quotient with digits −1 and 1:      −1 1 −1 −1 1 1
Replace −1s with 0s:                 0 1 0 0 1 1
Shift left, complement MSB, and set LSB to 1 to get the 2’s-complement quotient:   1 1 0 0 1 1 1

Check: −32 + 16 – 8 – 4 + 2 + 1 = −25 = −64 + 32 + 4 + 2 + 1

Final correction step if sign(s) ≠ sign(z):
Add d to, or subtract d from, s; subtract 1 from, or add 1 to, q

1 1 0 1 0 0 0
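
The conversion rule (replace −1s with 0s, shift left, complement the MSB, set the LSB to 1) can be exercised in a few lines. This is an illustrative sketch; the helper names `bsd_to_twos_complement` and `twos_complement_value` are hypothetical and not from the slides.

```python
def bsd_to_twos_complement(digits):
    """digits: list of +1/-1 quotient digits, MSB first.
    Returns the 2's-complement bits (one bit longer), MSB first."""
    p = [1 if q == 1 else 0 for q in digits]   # replace -1s with 0s
    bits = p + [1]                             # shift left and set LSB to 1
    bits[0] ^= 1                               # complement the MSB
    return bits


def twos_complement_value(bits):
    """Interpret a list of bits (MSB first) as a 2's-complement integer."""
    n = len(bits)
    value = int("".join(map(str, bits)), 2)
    return value - (1 << n) if bits[0] else value
```

With the digits −1 1 −1 −1 1 1 from the slide, the function returns 1 1 0 0 1 1 1, whose 2's-complement value is −25.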


Example of Nonrestoring Signed Division

========================
z           0 0 1 0 0 0 0 1
2^4 d     1 1 0 0 1
–2^4 d    0 0 1 1 1
========================
s(0)      0 0 0 1 0 0 0 0 1
2s(0)     0 0 1 0 0 0 0 1      sign(s(0)) ≠ sign(d),
+2^4 d    1 1 0 0 1            so set q3 = −1 and add
––––––––––––––––––––––––
s(1)      1 1 1 0 1 0 0 1
2s(1)     1 1 0 1 0 0 1        sign(s(1)) = sign(d),
+(–2^4 d) 0 0 1 1 1            so set q2 = 1 and subtract
––––––––––––––––––––––––
s(2)      0 0 0 0 1 0 1
2s(2)     0 0 0 1 0 1          sign(s(2)) ≠ sign(d),
+2^4 d    1 1 0 0 1            so set q1 = −1 and add
––––––––––––––––––––––––
s(3)      1 1 0 1 1 1
2s(3)     1 0 1 1 1            sign(s(3)) = sign(d),
+(–2^4 d) 0 0 1 1 1            so set q0 = 1 and subtract
––––––––––––––––––––––––
s(4)      1 1 1 1 0            sign(s(4)) ≠ sign(z),
+(–2^4 d) 0 0 1 1 1            so perform corrective subtraction
––––––––––––––––––––––––
s(4)      0 0 1 0 1
s           0 1 0 1
q         −1 1 −1 1
========================

p = 0 1 0 1        Shift, compl MSB
    1 1 0 1 1      Add 1 to correct
    1 1 1 0 0      Check: 33/(−7) = −4, with remainder 5
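
The signed procedure can be modeled in software as follows. This is a sketch under stated assumptions: the name `nonrestoring_signed_divide` is hypothetical, a zero partial remainder is treated as positive, and the extra step for a final remainder of magnitude |d| (which arises when the division is exact) is an addition of this sketch rather than something shown on the slides.

```python
def nonrestoring_signed_divide(z, d, k):
    """Signed nonrestoring division; returns (q, s) with z = d*q + s,
    sign(s) = sign(z), |s| < |d|. Assumes |z| < 2^(k-1) |d| and d != 0."""
    s, q = z, 0
    for _ in range(k):
        if (s >= 0) == (d >= 0):        # sign(s) = sign(d): digit 1, subtract
            s, q = 2 * s - (d << k), 2 * q + 1
        else:                           # signs differ: digit -1, add
            s, q = 2 * s + (d << k), 2 * q - 1
    # Final correction when the remainder's sign differs from the dividend's
    if s != 0 and (s >= 0) != (z >= 0):
        if (s >= 0) == (d >= 0):
            s, q = s - (d << k), q + 1
        else:
            s, q = s + (d << k), q - 1
    # Sketch-specific step: a remainder equal in magnitude to d means exact division
    if abs(s) == abs(d) << k:
        q += 1 if (s > 0) == (d > 0) else -1
        s = 0
    return q, s >> k                    # s(k) = 2^k * remainder
```

Calling `nonrestoring_signed_divide(33, -7, 4)` reproduces the example: the uncorrected digits are −1 1 −1 1 (value −5), and the corrective subtraction yields q = −4 and s = 5.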


On-The-Fly Conversion

Source: Ercegovac and Lang, “Digital Arithmetic”, p. 257


Nonrestoring Hardware Divider

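Shift-subtract sequential nonrestoring divider.

[Figure: Labeled blocks: quotient register, partial remainder register, divisor register, complement unit controlled by add/sub, and a k-bit adder with cin and cout. The quotient digit q_(k–j) is formed from the divisor sign and the complement of the partial remainder sign (MSB of 2s(j–1)).]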


Division by Constants

Software and hardware aspects: As was the case for multiplications by constants, optimizing compilers may replace some divisions by shifts/adds/subs; likewise, in custom VLSI circuits, hardware dividers may be replaced by simpler adders

Method 1: Find the reciprocal of the constant and multiply (particularly efficient if several numbers must be divided by the same divisor)

Method 2: Use the property that for each odd integer d, there exists an odd integer m such that d × m = 2^n – 1; hence, d = (2^n – 1)/m and

$$\frac{z}{d} = \frac{zm}{2^n - 1} = \frac{zm}{2^n\,(1 - 2^{-n})} = \frac{zm}{2^n}\,(1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n}) \cdots$$

Forming zm is a multiplication by a constant; each factor of the form (1 + 2^(–in)) is a shift-add

Number of shift-adds required is proportional to log k


Example: Division by a Constant

Example: Dividing the number z by 5, assuming 24 bits of precision. We have d = 5, m = 3, n = 4; 5 × 3 = 2^4 – 1

$$\frac{z}{5} = \frac{3z}{2^4 - 1} = \frac{3z}{16\,(1 - 2^{-4})} = \frac{3z}{16}\,(1 + 2^{-4})(1 + 2^{-8})(1 + 2^{-16}) \cdots$$

Instruction sequence for division by 5 (5 shifts, 4 adds)

q ← z + z shift-left 1       {3z computed}
q ← q + q shift-right 4      {3z(1 + 2^–4) computed}
q ← q + q shift-right 8      {3z(1 + 2^–4)(1 + 2^–8) computed}
q ← q + q shift-right 16     {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16) computed}
q ← q shift-right 4          {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16)/16 computed}
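
The instruction sequence can be tried directly on integers. This is a minimal sketch with a hypothetical function name, `divide_by_5`; it assumes a nonnegative z of at most 24 bits. Because every right shift truncates and the product of factors is cut off after three terms, the result is a slight underestimate, so with plain integers it can come out one below ⌊z/5⌋. Carrying guard bits or adding a rounding term would tighten it, which is beyond what the slide shows.

```python
def divide_by_5(z):
    """Approximate z // 5 with 5 shifts and 4 adds (z: nonnegative, up to 24 bits)."""
    q = z + (z << 1)        # 3z
    q = q + (q >> 4)        # 3z (1 + 2^-4)
    q = q + (q >> 8)        # 3z (1 + 2^-4)(1 + 2^-8)
    q = q + (q >> 16)       # 3z (1 + 2^-4)(1 + 2^-8)(1 + 2^-16)
    return q >> 4           # divide by 16
```

For example, `divide_by_5(16777215)` returns 3355442, one below the exact quotient 3355443.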


Preview of Fast Dividers

As with multiplication, there are but two ways to speed up division:
a. Reducing the number of operands (divide in a higher radix)
b. Adding them faster (keep partial remainder in carry-save form)

[Figure: Dot-notation diagrams of (a) k × k integer multiplication, with multiplicand a, multiplier x, partial products x_i a 2^i, and product p; and (b) 2k / k integer division, with dividend z, divisor d, quotient q, subtracted terms –q_i d 2^i, and remainder s.]

Both (a) multiplication and (b) division can be considered as multioperand addition problems.

There is one complication that makes division inherently more difficult: The terms to be subtracted from (added to) the dividend are not known a priori but become known as quotient digits are computed; quotient digits in turn depend on partial remainders


14 High-Radix Dividers

Chapter Goals
Study techniques that allow us to obtain more than one quotient bit in each cycle (two bits in radix 4, three in radix 8, . . .)

Chapter Highlights
Radix > 2 ⇒ quotient digit selection harder
Remedy: redundant quotient representation
Carry-save addition reduces cycle time
Implementation methods and tradeoffs


Basics of High-Radix Division

Division with left shifts

s(j) = r s(j–1) – q_(k–j) (r^k d)     with s(0) = z and s(k) = r^k s

(forming r s(j–1) is the shift; forming the whole right-hand side is the subtract step)

[Figure: Radix-4 division in dot notation: dividend z, the subtracted terms –(q3 q2)two 4^1 d and –(q1 q0)two 4^0 d, and remainder s.]


Examples of High-Radix Division

Radix-4 integer division
======================
z         0 1 2 3 1 1 2 3
4^4 d     1 2 0 3
======================
s(0)      0 1 2 3 1 1 2 3
4s(0)     0 1 2 3 1 1 2 3
–q3 4^4 d 0 1 2 0 3          {q3 = 1}
–––––––––––––––––––––––
s(1)      0 0 2 2 1 2 3
4s(1)     0 0 2 2 1 2 3
–q2 4^4 d 0 0 0 0 0          {q2 = 0}
–––––––––––––––––––––––
s(2)      0 2 2 1 2 3
4s(2)     0 2 2 1 2 3
–q1 4^4 d 0 1 2 0 3          {q1 = 1}
–––––––––––––––––––––––
s(3)      1 0 0 3 3
4s(3)     1 0 0 3 3
–q0 4^4 d 0 3 0 1 2          {q0 = 2}
–––––––––––––––––––––––
s(4)      1 0 2 1
s         1 0 2 1
q         1 0 1 2
======================

Radix-10 fractional division
=================
zfrac     . 7 0 0 3
dfrac     . 9 9
=================
s(0)      . 7 0 0 3
10s(0)    7 . 0 0 3
–q–1 d    6 . 9 3            {q–1 = 7}
––––––––––––––––––
s(1)      . 0 7 3
10s(1)    0 . 7 3
–q–2 d    0 . 0 0            {q–2 = 0}
––––––––––––––––––
s(2)      . 7 3
sfrac     . 0 0 7 3
qfrac     . 7 0
=================
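
The radix-r recurrence s(j) = r s(j–1) – q_(k–j) r^k d can be traced with a short model. In this sketch the names `to_int`, `to_digits`, and `high_radix_divide` are hypothetical, and the quotient digit is chosen by an exact full-width comparison, which sidesteps the digit selection difficulty that a real divider must solve.

```python
def to_int(digits, r):
    """Value of a radix-r digit list given MSB first."""
    value = 0
    for x in digits:
        value = value * r + x
    return value


def to_digits(value, r, n):
    """n radix-r digits of value, MSB first."""
    digits = []
    for _ in range(n):
        value, x = divmod(value, r)
        digits.append(x)
    return digits[::-1]


def high_radix_divide(z_digits, d_digits, r):
    """Divide a 2k-digit dividend by a k-digit divisor, one radix-r digit per step."""
    k = len(d_digits)
    z, d = to_int(z_digits, r), to_int(d_digits, r)
    scaled_d = d * r ** k                    # r^k d
    assert 0 < d and z < scaled_d, "overflow: need z < r^k d"
    s, q_digits = z, []
    for _ in range(k):
        shifted = r * s                      # shift
        q = shifted // scaled_d              # idealized digit selection (exact)
        q_digits.append(q)
        s = shifted - q * scaled_d           # subtract
    return q_digits, to_digits(s // r ** k, r, k)
```

Running it on the radix-4 example, `high_radix_divide([0, 1, 2, 3, 1, 1, 2, 3], [1, 2, 0, 3], 4)` returns the quotient digits 1 0 1 2 and remainder digits 1 0 2 1.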


Difficulty of Quotient Digit Selection

What is the first quotient digit in the following radix-10 division?

         _______________
2 0 4 3 | 1 2 2 5 7 9 6 8

The problem with the pencil-and-paper division algorithm is that there is no room for error in choosing the next quotient digit

In the worst case, all k digits of the divisor and k + 1 digits in the partial remainder are needed to make a correct choice

12 / 2 = 6
122 / 20 = 6
1225 / 204 = 6
12257 / 2043 = 5

Suppose we used the redundant signed digit set [–9, 9] in radix 10

Then, we could choose 6 as the next quotient digit, knowing that we can recover from an incorrect choice by using negative digits: 5 9 = 6 –1 (that is, 59 = 6 × 10 – 1)


Radix-2 SRT Division (1/3)

The new partial remainder, s(j), as a function of the shifted old partial remainder, 2s(j–1), in radix-2 nonrestoring division.

Algorithm in Ch 13.4

s(j) = 2s(j–1) – q_(–j) d     with s(0) = z, s(k) = 2^k s, q_(–j) ∈ {−1, 1}

[Figure: plot of s(j) (vertical axis, –d to d) versus 2s(j–1) (horizontal axis, –2d to 2d), with one line segment for q_(–j) = –1 and one for q_(–j) = 1.]


Diagrams for Quotient Selection

Robertson’s Diagram
Axes: the shifted residual 2s(j–1) and the next residual s(j)
It shows the possible choices of q that keep the next residual bounded.

P-D Diagram
Axes: shifted residual (partial remainder) vs. divisor


Allowing 0 as a quotient digit in nonrestoring division: q_(–j) = 0 for –d ≤ 2s(j–1) < d

s(j) = 2s(j–1) – q_(–j) d     with s(0) = z, s(k) = 2^k s, q_(–j) ∈ {−1, 0, 1}

[Figure: plot of s(j) (–d to d) versus 2s(j–1) (–2d to 2d), with line segments for q_(–j) = –1, 0, and 1.]

q_(–j) = 0 requires shifting only, which was faster than shift-and-subtract
But how can you tell if –d ≤ 2s(j–1) < d?


The relationship between new and old partial remainders in radix-2 SRT division.

[Figure: plot of s(j) versus 2s(j–1) with line segments for q_(–j) = –1, 0, and 1; the comparison constants –1/2 and 1/2 are marked on both axes, along with ±1, ±d, and ±2d.]

Comparison with constants −½ and ½ is quite simple
2s ≥ +½ means 2s = (0.1xxxxxxxx)2’s-compl
2s < −½ means 2s = (1.0xxxxxxxx)2’s-compl

if 2s(j–1) < –½
  then q_(–j) = –1
  else if 2s(j–1) ≥ ½
    then q_(–j) = 1
    else q_(–j) = 0
  endif
endif


Radix-2 SRT Division with Variable Shifts

s(0) is adjusted to be in [–½, ½). We use the comparison constants −½ and ½ for quotient digit selection

For 2s ≥ +½ or 2s = (0.1xxxxxxxx)2’s-compl choose q_(–j) = 1
For 2s < −½ or 2s = (1.0xxxxxxxx)2’s-compl choose q_(–j) = −1

Choose q_(–j) = 0 in other cases, that is, for:
0 ≤ 2s < +½ or 2s = (0.0xxxxxxxx)2’s-compl
−½ ≤ 2s < 0 or 2s = (1.1xxxxxxxx)2’s-compl

Observation: What happens when the magnitude of 2s is fairly small?

2s = (0.00001xxxx)2’s-compl: Choosing q_(–j) = 0 would lead to the same condition in the next step; generate 5 quotient digits 0 0 0 0 1

2s = (1.1110xxxxx)2’s-compl: Generate 4 quotient digits 0 0 0 −1

Use a leading 0s or leading 1s detection circuit to determine how many quotient digits can be produced at once
Statistically, the average skipping distance will be 2.67 bits


Example Unsigned Radix-2 SRT Division

========================
z           . 0 1 0 0 0 1 0 1
d         0 . 1 0 1 0
–d        1 . 0 1 1 0
========================
s(0)      0 . 0 1 0 0 0 1 0 1    In [−½, ½), so okay
2s(0)     0 . 1 0 0 0 1 0 1      ≥ ½, so set q−1 = 1
+(−d)     1 . 0 1 1 0            and subtract
––––––––––––––––––––––––
s(1)      1 . 1 1 1 0 1 0 1
2s(1)     1 . 1 1 0 1 0 1        In [−½, ½), so set q−2 = 0
––––––––––––––––––––––––
s(2) = 2s(1)  1 . 1 1 0 1 0 1
2s(2)     1 . 1 0 1 0 1          In [−½, ½), so set q−3 = 0
––––––––––––––––––––––––
s(3) = 2s(2)  1 . 1 0 1 0 1
2s(3)     1 . 0 1 0 1            < −½, so set q−4 = −1
+d        0 . 1 0 1 0            and add
––––––––––––––––––––––––
s(4)      1 . 1 1 1 1            Negative,
+d        0 . 1 0 1 0            so add to correct
––––––––––––––––––––––––
s(4)      0 . 1 0 0 1
s         0 . 0 0 0 0 1 0 0 1
q         0 . 1 0 0 −1           Uncorrected BSD quotient
q         0 . 0 1 1 0            Convert and subtract ulp
========================

Conversion of the BSD quotient: 0.1000 – 0.0001 = 0.0111
Subtracting ulp for the correction: 0.0111 – 0.0001 = 0.0110
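
The same example can be replayed with exact rational arithmetic. The sketch below uses a hypothetical function name, `srt_radix2`, and assumes an unsigned example with 1/2 ≤ d < 1 and 0 ≤ z < 1/2; Python's `Fraction` type stands in for the fixed-point registers.

```python
from fractions import Fraction


def srt_radix2(z, d, k):
    """Radix-2 SRT division with digit set {-1, 0, 1} and thresholds -1/2, 1/2.
    Returns (digits, corrected quotient, remainder)."""
    half = Fraction(1, 2)
    s, digits = z, []
    for _ in range(k):
        two_s = 2 * s
        if two_s >= half:
            q = 1
        elif two_s < -half:
            q = -1
        else:
            q = 0                      # shift only
        digits.append(q)
        s = two_s - q * d
    quotient = sum(Fraction(q, 2 ** (i + 1)) for i, q in enumerate(digits))
    if s < 0:                          # final correction: add d, subtract ulp
        s += d
        quotient -= Fraction(1, 2 ** k)
    return digits, quotient, s / 2 ** k
```

With z = 69/256 = (.01000101)two and d = 5/8 = (.1010)two, the digits are 1 0 0 −1, the corrected quotient is 3/8 = (.0110)two, and the remainder is 9/256 = (.00001001)two, so that z = q × d + s holds exactly.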


Using Carry-Save Adders

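Constant thresholds used for quotient digit selection in radix-2 division with q_(k–j) in {–1, 0, 1}.

[Figure: plot of s(j) versus 2s(j–1) with line segments for q_(–j) = –1, 0, and 1. The selection thresholds are the constants –1/2 and 0: choose –1 below –1/2, choose 0 between –1/2 and 0, choose 1 from 0 upward. The –1/0 and 0/+1 overlap regions are marked.]

You can choose 0 or 1 in the overlap region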


Quotient Digit Selection Based on Truncated PR

Sum part of 2s(j–1): u = (u1 u0 . u–1 u–2 . . .)2’s-compl
Carry part of 2s(j–1): v = (v1 v0 . v–1 v–2 . . .)2’s-compl

Approximation to the partial remainder:

t = u[–2,1] + v[–2,1]     {Add the 4 MSBs of u and v}

t := u[–2,1] + v[–2,1]
if t < –½
  then q_(–j) = –1
  else if t ≥ 0
    then q_(–j) = 1
    else q_(–j) = 0
  endif
endif

[Figure: the constant-threshold plot (thresholds –1/2 and 0, with the –1/0 and 0/+1 overlap regions), repeated from the previous slide]


Error in t

The 4-bit number t = (t1 t0 . t–1 t–2)2’s-compl can be compared to the constants –1/2 and 0 based on only the three bit values t1, t0, and t–1.
Regardless of sign, the amount discarded by truncation can be as large as ½ (when t–2 is 1 and the true carry-in to position –2 is 1). The selection still falls within the overlap regions:

If t < –1/2, the true value of 2s(j–1) is guaranteed to be less than 0.

If t < 0, we are guaranteed to have 2s(j–1) < ½ ≤ d.


Divider with Partial Remainder in Carry-Save Form

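[Figure: Block diagram of a radix-2 divider with partial remainder in carry-save form. Labeled blocks: sum register u and carry register v holding 2s (shifted left each cycle); divisor register d; a 4-bit adder feeding the “Select q_(–j)” block, whose outputs are Non0 (enable) and Sign (select); a mux supplying 0, d, or d′ (with +ulp for 2’s complement); and a k-bit carry-save adder producing the new carry and sum.]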


Why We Cannot Use Carry-Save PR with SRT Division

Overlap regions in radix-2 SRT division.

[Figure: plot of s(j) versus 2s(j–1) with line segments for q_(–j) = –1, 0, and 1; the constants ±1/2 and ±1 are marked, and each overlap region has width 1 – d.]

The overlap can become arbitrarily small as d approaches 1.


Choosing the Quotient Digits

A p-d plot for radix-2 division with d ∈ [1/2,1), partial remainder in [–d, d), and quotient digits in [–1, 1].

[Figure: p-d plot. Horizontal axis: d, from .100 to 1. Vertical axis: p, from –10.0 to 10.0. The lines p = 2d, d, –d, and –2d bound the “Choose 1”, “Choose 0”, and “Choose −1” regions (with their max and min boundaries and the two overlap regions); the areas p ≥ 2d and p < –2d are infeasible. The worst-case error margin in comparison is marked.]

Use the p-d plot to understand the q selection and derive the needed precision (number of bits to look at).


Design of the Quotient Digit Selection Logic

[Figure: a 4-bit adder sums the leading bits of the shifted sum and shifted carry; combinational logic then produces Non0 and Sign.]

Shifted sum = (u1 u0 . u−1 u−2 . . .)2’s-compl

Shifted carry = (v1 v0 . v−1 v−2 . . .)2’s-compl

Approx shifted PR = (t1 t0 . t−1 t−2)2’s-compl

Non0 = t1′ ∨ t0′ ∨ t–1′ = (t1 t0 t−1)′
Sign = t1 (t0′ ∨ t−1′)
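
As a sanity check, the two logic expressions can be compared against the arithmetic rule (choose −1 for t < −½, 1 for t ≥ 0, and 0 otherwise) over all sixteen values of t. This is an illustrative sketch; the function names are hypothetical, and reading Sign = 1 as “the digit is negative” is an assumption drawn from the Non0 (enable) and Sign (select) labels.

```python
def digit_from_bits(t1, t0, tm1):
    """Quotient digit from the gate-level expressions for Non0 and Sign."""
    non0 = int(not (t1 and t0 and tm1))       # Non0 = (t1 t0 t-1)'
    sign = int(t1 and not (t0 and tm1))       # Sign = t1 (t0' OR t-1')
    if not non0:
        return 0
    return -1 if sign else 1


def digit_from_compare(t_quarters):
    """Same decision by comparison; t is given in units of 1/4, from -8 to 7."""
    if t_quarters < -2:                       # t < -1/2
        return -1
    if t_quarters >= 0:
        return 1
    return 0
```

Looping `t` over `range(-8, 8)` and extracting the bits t1, t0, t–1 as `(t >> 3) & 1`, `(t >> 2) & 1`, `(t >> 1) & 1` shows that the two functions agree for every t.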


Radix-4 SRT Division

New versus shifted old partial remainder in radix-4 division with q_(–j) in [–3, 3].

Radix-4 fractional division with left shifts and q_(–j) ∈ [–3, 3]

s(j) = 4s(j–1) – q_(–j) d     with s(0) = z and s(k) = 4^k s

Two difficulties:
How do you choose from among the 7 possible values for q_(−j)?
If the choice is +3 or −3, how do you form 3d?

[Figure: plot of s(j) (–d to d) versus 4s(j–1) (–4d to 4d), with line segments for q_(–j) = –3, –2, –1, 0, +1, +2, +3.]


Building the p-d Plot for Radix-4 Division

A p-d plot for radix-4 SRT division with quotient digit set [–3, 3].

[Figure: p-d plot. Horizontal axis: d, from .100 to .111. Vertical axis: p, from 00.0 to 11.1. The lines p = d, 2d, 3d, and 4d bound the “Choose 0”, “Choose 1”, “Choose 2”, and “Choose 3” regions (with their max and min boundaries and overlap regions); the area p ≥ 4d is infeasible. Two uncertainty regions are marked.]

Uncertainty region: because of truncation.

The choice between q = 3 and q = 2 depends not only on p but also on one bit of d, d_(–2).


Restricting the Quotient Digit Set in Radix 4

Fig. 14.13 New versus shifted old partial remainder in radix-4 division with q_(–j) in [–2, 2].

[Figure: plot of s(j) versus 4s(j–1); s(j) is confined to [–2d/3, 2d/3) and 4s(j–1) to [–8d/3, 8d/3), within the original ranges ±d and ±4d.]

Radix-4 fractional division with left shifts and q_(–j) ∈ [–2, 2]

s(j) = 4s(j–1) – q_(–j) d     with s(0) = z and s(k) = 4^k s

For this restriction to be feasible, we must have: s ∈ [−hd, hd) for some h < 1, and 4hd – 2d ≤ hd
This yields h ≤ 2/3 (choose h = 2/3 to minimize the restriction)


Building the p-d Plot with Restricted Radix-4 Digit Set

A p-d plot for radix-4 SRT division with quotient digit set [–2, 2].

[Figure: p-d plot. Horizontal axis: d, from .100 to .111. Vertical axis: p, from 00.0 to 11.1. The lines p = d/3, 2d/3, 4d/3, 5d/3, and 8d/3 bound the “Choose 0”, “Choose 1”, and “Choose 2” regions (with their max and min boundaries and overlap regions); the area p ≥ 8d/3 is infeasible. One selection boundary is marked “Depends on d”.]


General High-Radix Dividers

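[Figure: Block diagram of a general high-radix divider. Labeled blocks: sum register u and carry register v holding the shifted partial remainder; divisor register d; an adder feeding the “Select q_(–j)” block; a multiple generation/selection unit supplying |q_(–j)| d or its complement; and a CSA tree producing the new carry and sum.]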

Process to derive the details:

Radix r

Digit set [–α, α] for q–j

Number of bits of p (v and u) and d to be inspected

Quotient digit selection unit (table or logic)

Multiple generation/selection scheme

Conversion of redundant q to 2’s complement


15 Variations in Dividers

Chapter Goals
Discuss practical aspects of designing high-radix division schemes and cover other types of fast hardware dividers

Chapter Highlights
Building and using p-d plots in practice
Prescaling simplifies q digit selection
Parallel hardware (array) dividers
Shared hardware in multipliers/dividers
Square-rooting not special case of division


Quotient Digit Selection Revisited

Radix-r division with quotient digit set [–α, α], α < r – 1
Restrict the partial remainder range, say to [–hd, hd)
From the solid rectangle in Fig. 15.1, we get rhd – αd ≤ hd or h ≤ α/(r – 1)
To minimize the range restriction, we choose h = α/(r – 1)

The relationship between new and shifted old partial remainders in radix-r division with quotient digits in [–α, +α].

[Figure: plot of s(j) (between –hd and hd, within ±d) versus r s(j–1) (between –rhd and rhd, within ±rd), with line segments for the digits from –r+1 through –α, . . ., –1, 0, 1, . . ., α, to r–1.]


Why Using Truncated p and d Values Is Acceptable

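A part of p-d plot showing the overlap region for choosing the quotient digit value β or β+1 in radix-r division with quotient digit set [–α, α].

[Figure: fragment of a p-d plot starting at d min, with the lines (h + β + 1)d, (h + β)d, (–h + β + 1)d, and (–h + β)d. The “Choose β + 1” and “Choose β” regions overlap between (–h + β + 1)d and (h + β)d. Two uncertainty rectangles, A and B, are labeled “4 bits of p, 3 bits of d” and “3 bits of p, 4 bits of d”.]

Note: h = α / (r – 1)

Standard p: xx.xxxx
Carry-save p: xx.xxxxx and xx.xxxxx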


Table Entries in the Quotient Digit Selection Logic

We want to make the uncertainty rectangle as large as possible, to minimize the number of bits in p and d needed for choosing the quotient digits.

[Figure: grid of uncertainty rectangles laid over the overlap region of a p-d plot, starting at the origin. Each cell is marked with its table entry: β, β+1, or either one where the cell lies entirely in the overlap. The marked cells form a staircaselike selection boundary.]

Note: h = α/(r – 1)


Using p-d Plots in Practice

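Establishing upper bounds on the dimensions of uncertainty rectangles.

[Figure: overlap region between the lines (h + α − 1)d and (−h + α)d, starting at d min. The “Choose α” region lies above and the “Choose α − 1” region below. Δp is the height of the overlap at d min, from (−h + α) d min to (h + α − 1) d min; Δd is its horizontal extent, from d min to d min + Δd.]

Smallest Δd occurs for the overlap region of α and α – 1

$$\Delta d = d^{\min}\,\frac{2h - 1}{-h + \alpha} \qquad\qquad \Delta p = d^{\min}\,(2h - 1)$$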


Example: Lower Bounds on Precision

[Figure: Fig. 15.4, the overlap region of α and α − 1 with the dimensions Δd and Δp, repeated from the previous slide]

$$\Delta d = d^{\min}\,\frac{2h - 1}{-h + \alpha} \qquad\qquad \Delta p = d^{\min}\,(2h - 1)$$

For r = 4, divisor range [0.5, 1), digit set [–2, 2], we have α = 2, d min = 1/2, h = α/(r – 1) = 2/3

$$\Delta d = (1/2)\,\frac{4/3 - 1}{-2/3 + 2} = 1/8 \qquad\qquad \Delta p = (1/2)(4/3 - 1) = 1/6$$

Because 1/8 = 2^–3 and 2^–3 ≤ 1/6 < 2^–2, we must inspect at least 3 bits of d (2, given its leading 1) and 3 bits of p
These are lower bounds (not truncated bits) and may prove inadequate
In fact, 3 bits of p and 4 (3) bits of d are required
With p in carry-save form, 4 bits of each component must be inspected

Upper Bounds for Precision

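Theorem: Once lower bounds on precision are determined based on Δd and Δp, one more bit of precision in each direction is always adequate

[Figure: overlap region between the lines (a − 1 + h)d and (a − h)d, starting at d min, with the “Choose a” and “Choose a − 1” regions, vertical grid lines spaced w apart, the dimensions Δd and Δp, the vertical distances u and v, and the points A and B.]

Proof: Let w be the spacing of vertical grid lines
w ≤ Δd/2 ⇒ v ≤ Δp/2 ⇒ u ≥ Δp/2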


Some Implementation Details

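The asymmetry of quotient digit selection process.

[Figure: p-d plot from d min to d max showing the “Choose β + 1” / “Choose β” boundary for positive p and the “Choose −β + 1” / “Choose −β” boundary for negative p, with uncertainty rectangles A and B.]

Example of p-d plot allowing larger uncertainty rectangles, if the 4 cases marked with asterisks are handled as exceptions.

[Figure: grid of uncertainty rectangles marked β, β+1, or either one, with four cells marked by asterisks.]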


The Pentium chip division bug

[Figure: p-d plot for radix r = 4, q_(–j) in [–2, 2], d in [1/2, 1), p in [–8/3, 8/3]. Horizontal axis: d, from 1.000 to 1.100 and from 0.100 to 1.000. Vertical axis: p, from 10.10 to 01.10. The lines ±d/3, ±2d/3, ±4d/3, and ±5d/3 are drawn, and cells near the boundary between digits 1 and 2 are marked 1, 2, or “1,2”.]


Division with Prescaling

Restricting the divisor to the shaded area simplifies quotient digit selection.

[Figure: p-d plot from d min to d max with the “Choose β + 1” / “Choose β” and “Choose −β + 1” / “Choose −β” boundaries; the area near d max is shaded.]

Overlap regions of a p-d plot are wider toward the high end of the divisor range
If we can restrict the magnitude of the divisor to an interval close to d max (say 1 – ε < d < 1 + δ, when d max = 1), quotient digit selection may become simpler
Thus, we perform the division (zm)/(dm) for a suitably chosen scale factor m (m > 1)
Prescaling (multiplying z and d by m) should be done without real multiplications


Modular Dividers and Reducers

Given dividend z and divisor d, with d > 0, a modular divider computes

q = ⎣z / d⎦ and s = z mod d = ⟨z⟩_d

The quotient q is, by definition, an integer but the inputs z and d do not have to be integers; the modular remainder is never negative

Example:

⎣–3.76 / 1.23⎦ = –4 and ⟨–3.76⟩_1.23 = 1.16

The quotient and remainder of ordinary division are −3 and −0.07

A modular reducer computes only the modular remainder and is in many cases simpler than a full-blown divider

⟨z⟩_d = ⟨zH 2^k + zL⟩_d = ⟨zH (2^k – 1) + zH + zL⟩_d
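
Both ideas are easy to try out. In the sketch below the function names are hypothetical, and the reducer handles only the special modulus d = 2^k – 1 for nonnegative integers, a case chosen here for illustration because the term zH (2^k – 1) in the identity above then vanishes and only zH + zL remains.

```python
from fractions import Fraction
from math import floor


def modular_divide(z, d):
    """q = floor(z / d) and s = z mod d, for rational z and d > 0."""
    q = floor(z / d)
    return q, z - q * d


def reduce_mod_2k_minus_1(z, k):
    """z mod (2^k - 1) for an integer z >= 0, by repeatedly adding zH and zL."""
    m = (1 << k) - 1
    while z > m:
        z = (z >> k) + (z & m)     # zH + zL
    return 0 if z == m else z
```

For instance, `modular_divide(Fraction(-376, 100), Fraction(123, 100))` returns −4 and 29/25 (that is, 1.16), matching the example.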


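Array Dividers

Restoring array divider composed of controlled subtractor cells.

[Figure: three rows of controlled subtractor cells (each containing a full subtractor, FS, and a mux). The divisor bits d–1, d–2, d–3 and dividend bits z–1 through z–6 enter from the top and sides; each row produces one quotient bit (q–1, q–2, q–3); the remainder bits s–4, s–5, s–6 emerge at the bottom.]

Dividend   z = .z–1 z–2 z–3 z–4 z–5 z–6
Divisor    d = .d–1 d–2 d–3
Quotient   q = .q–1 q–2 q–3
Remainder  s = .0 0 0 s–4 s–5 s–6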


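Nonrestoring Array Divider

Nonrestoring array divider built of controlled add/subtract cells.

[Figure: four rows of controlled add/subtract cells (each containing an XOR gate and a full adder, FA). The divisor bits d0 through d–3 and dividend bits z0 through z–6 enter from the top and sides; each row produces one quotient bit (q0, q–1, q–2, q–3); the remainder bits s–3 through s–6 emerge at the bottom. The critical path is marked.]

Dividend   z = z0 .z–1 z–2 z–3 z–4 z–5 z–6
Divisor    d = d0 .d–1 d–2 d–3
Quotient   q = q0 .q–1 q–2 q–3
Remainder  s = 0 .0 0 s–3 s–4 s–5 s–6

Similarity to array multiplier is deceiving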


Speedup Methods for Array Dividers

Idea: Pass the partial remainder downward in carry-save form to speed up the operation of each row

However, we still need to know the carry/borrow-out from each row
Solution: Insert a carry-lookahead circuit between successive rows
Not very cost-effective; thus not used in practice

[Figure: Fig. 15.8, the nonrestoring array divider with its critical path marked, repeated from the previous slide]


Combined Multiply/Divide Units

[Figure: side-by-side block diagrams of the sequential multiplier (Fig. 9.4: multiplier x, multiplicand a, mux, k-bit adder, shifted partial product) and the nonrestoring divider (Fig. 13.10: quotient, partial remainder, divisor, complement unit, k-bit adder).]

Similarity of blocks in multipliers and dividers (only shift direction is different)


Single Unit for Sequential Multiplication and Division

The control unit proceeds through necessary steps for multiplication or division (including using the appropriate shift direction)

Sequential radix-2 multiply/divide unit.

[Figure: Labeled blocks: a register for multiplier x or quotient q; a register for partial product p or partial remainder s; a register for multiplicand a or divisor d; mux; k-bit adder (cin, cout); shift control; and multiply/divide control, which selects between x_j and the MSB of p(j+1) for multiplication and q_(k–j), the MSB of 2s(j–1), and the divisor sign for division.]

The slight speed penalty owing to a more complex control unit is insignificant


Single Unit for Array Multiplication and Division

Each cell within the array can act as a modified adder or modified subtractor based on control input values

I/O specification of a universal circuit that can act as an array multiplier or array divider.

[Figure: a single block with inputs “Multiplicand or divisor”, “Multiplier”, “Additive input or dividend”, and the Mul/Div control, and outputs “Product or remainder” and “Quotient”.]

In some designs, squaring and square-rooting functions are also included within the same array


16 Division by Convergence

Chapter Goals
Show how by using multiplication as the basic operation in each division step, the number of iterations can be reduced

Chapter Highlights
Digit-recurrence as convergence method
Convergence by Newton-Raphson iteration
Computing the reciprocal of a number
Hardware implementation and fine tuning


General Convergence Methods

Two-variable form:
u(i+1) = f(u(i), v(i))
v(i+1) = g(u(i), v(i))

Three-variable form:
u(i+1) = f(u(i), v(i), w(i))
v(i+1) = g(u(i), v(i), w(i))
w(i+1) = h(u(i), v(i), w(i))

Guide the iteration such that one of the values converges to a constant (usually 0 or 1)

The other value then converges to the desired function

The complexity of this method depends on two factors:

a. Ease of evaluating f and g (and h)
b. Rate of convergence (number of iterations needed)


Division by Repeated Multiplications

Motivation: Suppose add takes 1 clock and multiply 3 clocks
64-bit divide takes 64 clocks in radix 2, 32 in radix 4

Division via multiplications is faster if 10 or fewer multiplications are needed

Idea:

$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$

The denominator is forced to 1, so the numerator converges to q

Remainder often not needed, but can be obtained by another multiplication if desired: s = z – qd

To turn the identity into a division algorithm, we face three questions:

1. How to select the multipliers x(i)?
2. How many iterations (pairs of multiplications)?
3. How to implement in hardware?


Formulation as a Convergence Computation

Idea:

$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$

The denominator is forced to 1, so the numerator converges to q

d(i+1) = d(i) x(i)     Set d(0) = d; make d(m) converge to 1
z(i+1) = z(i) x(i)     Set z(0) = z; obtain z/d = q ≅ z(m)

Question 1: How to select the multipliers x(i)?     x(i) = 2 – d(i)

This choice transforms the recurrence equations into:

d(i+1) = d(i) (2 − d(i))     Set d(0) = d; iterate until d(m) ≅ 1
z(i+1) = z(i) (2 − d(i))     Set z(0) = z; obtain z/d = q ≅ z(m)

Fits the general form
u(i+1) = f(u(i), v(i))
v(i+1) = g(u(i), v(i))


Determining the Rate of Convergence

d(i+1) = d(i) x(i)     Set d(0) = d; make d(m) converge to 1
z(i+1) = z(i) x(i)     Set z(0) = z; obtain z/d = q ≅ z(m)

Question 2: How quickly does d(i) converge to 1?

We can relate the error in step i + 1 to the error in step i:

d(i+1) = d(i) (2 − d(i)) = 1 – (1 – d(i))^2

1 – d(i+1) = (1 – d(i))^2

For 1 – d(i) ≤ ε, we get 1 – d(i+1) ≤ ε^2: Quadratic convergence

In general, for k-bit operands, we need

2m – 1 multiplications and m 2’s complementations

where m = ⎡log2 k⎤
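
The recurrences can be run with exact rationals to observe the quadratic convergence. This is a sketch with a hypothetical function name, `divide_by_repeated_multiplication`; it assumes 1/2 ≤ d < 1 and performs m = ⎡log2 k⎤ iterations without modeling finite word widths or truncation.

```python
from fractions import Fraction
from math import ceil, log2


def divide_by_repeated_multiplication(z, d, k):
    """Return (z(m), d(m)) after m = ceil(log2 k) iterations, for 1/2 <= d < 1."""
    m = ceil(log2(k))
    for _ in range(m):
        x = 2 - d          # 2's complementation
        d = d * x          # denominator is driven toward 1
        z = z * x          # numerator converges to q = z/d
    return z, d
```

For z = 69/256, d = 5/8, and k = 16, four iterations leave 1 – d(m) = (3/8)^16, which is below 2^–16, and z(m) falls short of the true quotient by the same relative amount.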


Quadratic Convergence

Table: Quadratic convergence in computing z/d by repeated multiplications, where 1/2 ≤ d = 1 – y < 1

–––––––––––––––––––––––––––––––––––––––––––––––––––––––
i   d(i) = d(i–1) x(i–1), with d(0) = d                    x(i) = 2 – d(i)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
0   1 – y    = (.1xxx xxxx xxxx xxxx)two ≥ 1/2             1 + y
1   1 – y^2  = (.11xx xxxx xxxx xxxx)two ≥ 3/4             1 + y^2
2   1 – y^4  = (.1111 xxxx xxxx xxxx)two ≥ 15/16           1 + y^4
3   1 – y^8  = (.1111 1111 xxxx xxxx)two ≥ 255/256         1 + y^8
4   1 – y^16 = (.1111 1111 1111 1111)two = 1 – ulp
–––––––––––––––––––––––––––––––––––––––––––––––––––––––

Each iteration doubles the number of guaranteed leading 1s (convergence to 1 is from below)

Beginning with a single 1 (d ≥ ½), after log2 k iterations we get as close to 1 as is possible in a fractional representation


Graphical Depiction of Convergence to q

[Figure: plot against iteration number i (0 to 6) of d(i), rising from d toward 1 – ulp, and z(i), rising from z toward q – ε.]

Question 3 (implementation in hardware) to be discussed later


Division by Reciprocation

The Newton-Raphson method can be used for finding a root of f(x) = 0

Start with an initial estimate x(0) for the root

Iteratively refine the estimate via the recurrence

x(i+1) = x(i) – f(x(i)) / f′(x(i))

Justification:

tan α(i) = f′(x(i)) = f(x(i)) / (x(i) – x(i+1))

Convergence to a root of f(x) = 0 in the Newton-Raphson method.

[Figure: curve f(x) with its root; the tangent at x(i), which makes the angle α(i) with the x axis, meets the axis at x(i+1), and the next tangent leads to x(i+2).]


Computing 1/d by Convergence

1/d is the root of f(x) = 1/x – d

f′(x) = –1/x^2

Substitute in the Newton-Raphson recurrence x(i+1) = x(i) – f(x(i)) / f′(x(i)) to get:

x(i+1) = x(i) (2 − x(i) d)

One iteration = Two multiplications + One 2’s complementation

Error analysis: Let δ(i) = 1/d – x(i) be the error at the ith iteration

δ(i+1) = 1/d – x(i+1) = 1/d – x(i) (2 – x(i) d) = d (1/d – x(i))^2 = d (δ(i))^2

Because d < 1, we have δ(i+1) < (δ(i))^2

[Figure: plot of f(x) = 1/x – d, which crosses the x axis at x = 1/d and approaches –d for large x.]
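
The recurrence x(i+1) = x(i)(2 − x(i) d) is short enough to try directly. The following is a floating-point sketch with a hypothetical function name, `newton_reciprocal`; the default starting value 1.5 and the iteration count are illustrative choices for d in [1/2, 1), not values prescribed here.

```python
def newton_reciprocal(d, x0=1.5, iterations=6):
    """Approximate 1/d for 1/2 <= d < 1 by Newton-Raphson iteration."""
    x = x0
    for _ in range(iterations):
        x = x * (2 - x * d)     # two multiplications and one complementation
    return x
```

Since the scaled error d·δ(i) is squared in every step and starts at no more than 1/2 in magnitude for x(0) = 1.5, six iterations bring it below 2^–64, which is beyond double precision.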


Choosing the Initial Approximation to 1/d

With x(0) in the range 0 < x(0) < 2/d, convergence is guaranteed

Justification: |δ(0)| = |x(0) – 1/d| < 1/d

|δ(1)| = |x(1) – 1/d| = d (δ(0))^2 = (d |δ(0)|) |δ(0)| < |δ(0)|

For d in [1/2, 1):

Simple choice x(0) = 1.5
Max error = 0.5 < 1/d

Better approx. x(0) = 4(√3 – 1) – 2d = 2.9282 – 2d
Max error ≅ 0.1

[Figure: plot of 1/x for x between 0 and 1, with the values 1 and 2 marked on the vertical axis.]
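
The two quoted error figures can be confirmed numerically. This sketch scans d over [1/2, 1) on a uniform grid; the helper name `max_abs_error` and the sample count are arbitrary choices made for this example.

```python
from math import sqrt


def max_abs_error(approx, samples=10_000):
    """Largest |1/d - approx(d)| over a uniform grid of d in [1/2, 1)."""
    ds = [0.5 + 0.5 * i / samples for i in range(samples)]
    return max(abs(1 / d - approx(d)) for d in ds)


simple_error = max_abs_error(lambda d: 1.5)
linear_error = max_abs_error(lambda d: 4 * (sqrt(3) - 1) - 2 * d)
```

The constant estimate is worst at d = 1/2, where the error is 0.5, while the linear estimate is worst near d = 1/√2, where the error is just under 0.1.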


Speedup of Convergence Division

Division can be performed via 2⎡log2 k⎤ – 1 multiplications

This is not yet very impressive: 64-bit numbers, 3-ns multiplier ⇒ 33-ns division

Three types of speedup are possible:

Fewer multiplications (reduce m)
Narrower multiplications (reduce the width of some x(i)s)
Faster multiplications

Division by repeated multiplications:

$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$

Division by reciprocation: Compute y = 1/d, then do the multiplication yz


Initial Approximation via Table Lookup

Convergence is slow in the beginning: it takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits

d x(0) x(1) x(2) = (0.1111 1111 . . . )two

(labels in the figure: “Approx to 1/d” and “Better approx”)

Read this value, x(0+), directly from a table, thereby reducing 6 multiplications to 2

A 2^w × w lookup table is necessary and sufficient for w bits of convergence after 2 multiplications

Example with 4-bit lookup: d = 0.1011 xxxx . . . (11/16 ≤ d < 12/16)
Inverses of the two extremes are 16/11 ≅ 1.0111 and 16/12 ≅ 1.0101
So, 1.0110 is a good estimate for 1/d
1.0110 × 0.1011 = (11/8) × (11/16) = 121/128 = 0.1111001
1.0110 × 0.1100 = (11/8) × (3/4) = 33/32 = 1.000010


Visualizing the Convergence with Table Lookup

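Convergence in division by repeated multiplications with initial table lookup.

[Figure: plot against iterations of d(i), approaching 1 – ulp, and z(i), approaching q – ε. The first marked point is reached after the table lookup and 1st pair of multiplications, replacing several iterations; the next is reached after the 2nd pair of multiplications.]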


Convergence Does Not Have to Be from Below

[Figure: plot against iterations of d(i), approaching 1 ± ulp from either side, and z(i), approaching q ± ε.]


Using Truncated Multiplicative Factors

Fig. 16.4 One step in convergence division with truncated multiplicative factors.

[Figure: between iterations i and i + 1, the precise iteration takes d x(0) x(1) . . . x(i) to d x(0) x(1) . . . x(i) x(i+1) (point A), while the approximate iteration uses the truncated factor (x(i+1))T (point B); the two results differ by less than 2^−a, and both lie just below 1.]

Problem 16.9a
A truncated denominator d(i), with a identical leading bits and b extra bits (b ≤ a), leads to a new denominator d(i+1) with a + b identical leading bits

Example (64-bit multiplication)
Initial step: Table of size 256 × 8 = 2K bits
Middle steps: Multiplication pairs, with 9-, 17-, and 33-bit multipliers
Final step: Full 64 × 64 multiplication


Hardware Implementation

Repeated multiplications: Each pair of ops involves the same multiplier

d(i+1) = d(i) (2 − d(i))     Set d(0) = d; iterate until d(m) ≅ 1
z(i+1) = z(i) (2 − d(i))     Set z(0) = z; obtain z/d = q ≅ z(m)

Two multiplications fully overlapped in a 2-stage pipelined multiplier.

[Figure: timing diagram of the 2-stage pipeline. The products d(i) x(i) and z(i) x(i) enter in successive cycles, yielding d(i+1) and then z(i+1); x(i+1) is formed from d(i+1) by 2’s complementation; d(i+1) x(i+1) and z(i+1) x(i+1) follow, yielding d(i+2).]


Implementing Division with Reciprocation

Reciprocation: Multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel

x(i+1) = x(i) (2 − x(i) d)

Options for speedup via a better initial approximation

Consult a larger table
Resort to a bipartite or multipartite table (see Chapter 24)
Use table lookup, followed with interpolation
Compute the approximation via multioperand addition

Unless several multiplications by the same multiplier are needed, division by repeated multiplications is more efficient

However, given a fast method for reciprocation (see Section 24.6), using a reciprocation unit with a standard multiplier is often preferred


Analysis of Lookup Table Size

Table: Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications

–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Address    d = 0.1 xxxx xxxx     x(0+) = 1. xxxx xxxx
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
55         0011 0111             1010 0101
64         0100 0000             1001 1001
–––––––––––––––––––––––––––––––––––––––––––––––––––––––

Example: Table entry at address 55 (311/512 ≤ d < 312/512)

For 8 bits of convergence, the table entry f must satisfy

(311/512)(1 + .f) ≥ 1 – 2^–8     (312/512)(1 + .f) ≤ 1 + 2^–8

199/311 ≤ .f ≤ 101/156 or 163.81 ≤ 256 × .f ≤ 165.74

Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
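
The same inequalities can be evaluated for any table address. In this sketch the function name `valid_table_entries` is hypothetical; it assumes, as in the example above, that address a with w = 8 covers (2^w + a)/2^(w+1) ≤ d < (2^w + a + 1)/2^(w+1), and it returns the integers 2^w × .f that satisfy both bounds.

```python
from fractions import Fraction
from math import ceil, floor


def valid_table_entries(address, w=8):
    """Integer values of 2^w * .f giving w bits of convergence for this address."""
    lo = Fraction((1 << w) + address, 1 << (w + 1))        # smallest d covered
    hi = Fraction((1 << w) + address + 1, 1 << (w + 1))    # upper limit of d
    eps = Fraction(1, 1 << w)
    f_min = (1 - eps) / lo - 1      # from lo * (1 + .f) >= 1 - 2^-w
    f_max = (1 + eps) / hi - 1      # from hi * (1 + .f) <= 1 + 2^-w
    return list(range(ceil(f_min * 2 ** w), floor(f_max * 2 ** w) + 1))
```

For address 55 the function returns 164 and 165, the two choices named above; for address 64 it returns 152 and 153, and the table uses 153 = (1001 1001)two.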


A General Result for Table Size

Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from the table is of the form (1.xxx . . . xxx)two, with w bits after the radix point

Proof strategy for sufficiency: Represent the table entry 1.f as the integer v = 2^w × .f and derive upper / lower bound expressions for it. Then, show that at least one integer exists between v_lb and v_ub

Proof strategy for necessity: Show that derived conditions cannot be met if the table is of size 2^(w–1) (no matter how wide) or if it is of width w – 1 (no matter how large)

Excluded cases, w < 5: Practically uninteresting (allow smaller table)

General radix r: Same analysis method, and results, apply
