1Computer Arithmetic 1, Dept. of EE, Fu Jen Catholic University, Taiwan
Computer Arithmetic Design
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Web: http://vlsi.ee.fju.edu.tw/teacher/kjlin/kjlin.htm
Dept. of EE, FJU, Taiwan
Room: SF 727B
SW & HW
SW = Algorithm + Data Structure + Programming techniques
HW = Algorithm + Architecture + Design Method
• Algorithm: Computing, Communication
• Architecture: Pipeline, Systolic array, Low power, Interface, …
• Design Method: Full custom, Cell based, FPGA, System level
Course Objectives
• Learn computer algorithms to do arithmetic operations.
• Learn hardware designs for computer arithmetic.
After completing the course:
• Students are able to implement computer arithmetic hardware designs using HDL.
• Students are able to read research papers about computer arithmetic.
Textbook
• Textbook:
Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press
• Reference books:
Ercegovac and Lang, “Digital Arithmetic,” MKP.
Stine, “Digital Computer Arithmetic Datapath Design Using Verilog HDL,” CAP
Syllabus
• Number representation
• Two-operand Addition
• Multi-operand Addition
• Multiplication
• Division
• Square Root
• Papers reading and presentation
Grading
• Mid Exam (30%)
• Papers reading and presentation (30%)
• Homework (some problems need HDL programming) (30%)
• Attendance and Others (10%)
Number Representation
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Dept. of EE, FJU, Taiwan
Room: SF 727B
Most slides are revisions of PowerPoint files obtained from the textbook website.
Numbers and Arithmetic
Chapter Goals
• Define scope and provide motivation
• Set the framework for the rest of the book
• Review positional fixed-point numbers
Chapter Highlights
• What goes on inside your calculator?
• Ways of encoding numbers in k bits
• Radices and digit sets: conventional, exotic
• Conversion from one system to another
What is Computer Arithmetic?
Pentium Division Bug (1994-95): Pentium’s radix-4 SRT algorithm occasionally gave incorrect quotient
First noted in 1994 by T. Nicely who computed sums of reciprocals of twin primes:
1/5 + 1/7 + 1/11 + 1/13 + . . . + 1/p + 1/(p + 2) + . . .
Worst-case example of division error in Pentium:
$c = \dfrac{4\,195\,835}{3\,145\,727}$
Correct quotient: 1.333 820 44...
circa 1994 Pentium double FLP value: 1.333 739 06... (accurate to only 14 bits, worse than single!)
The Scope of Computer Arithmetic.
Hardware (our focus in this book)
• Design of efficient digital circuits for primitive and other arithmetic operations such as +, –, ×, ÷, √, log, sin, cos
• Issues: Algorithms; Error analysis; Speed/cost trade-offs; Hardware implementation; Testing, verification
Software
• Numerical methods for solving systems of linear equations, partial differential equations, etc.
• Issues: Algorithms; Error analysis; Computational complexity; Programming; Testing, verification
General-purpose
• Flexible data paths
• Fast primitive operations like +, –, ×, ÷, √
• Benchmarking
Special-purpose
• Tailored to applications like: Digital filtering, Image processing, Radar tracking
A Motivating Example
Using a calculator with √, $x^2$, and $x^y$ functions, compute:
u = √√ … √ 2 = 1.000 677 131 (“1024th root of 2”)
$v = 2^{1/1024}$ = 1.000 677 131
Save u and v; if you can’t save, recompute values when needed
$x = (((u^2)^2)\ldots)^2$ = 1.999 999 963
$x' = u^{1024}$ = 1.999 999 973
$y = (((v^2)^2)\ldots)^2$ = 1.999 999 983
$y' = v^{1024}$ = 1.999 999 994
Perhaps v and u are not really the same value
w = v – u = $1 \times 10^{-11}$ (nonzero due to hidden digits)
(u – 1) × 1000 = 0.677 130 680 [Hidden ... (0) 68]
(v – 1) × 1000 = 0.677 130 690 [Hidden ... (0) 69]
Finite Precision Can Lead to Disaster
Example: Failure of Patriot Missile (1991 Feb. 25)
Source: http://www.math.psu.edu/dna/455.f96/disasters.html
An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile.
The Scud struck an American Army barracks, killing 28.
Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot)
Problem specifics:
• Time in tenths of second as measured by the system’s internal clock was multiplied by 1/10 to get the time in seconds
• Internal registers were 24 bits wide
• 1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)
• Error ≈ 0.1100 1100 × $2^{-23}$ ≈ 9.5 × $10^{-8}$
• Error in 100-hr operation period ≈ 9.5 × $10^{-8}$ × 100 × 60 × 60 × 10 = 0.34 s
• Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
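The error figures above can be reproduced with exact rational arithmetic. The following is a minimal sketch, assuming the chopped constant keeps 23 bits after the binary point (as in the bit string shown on the slide); the names `chop`, `FRACTION_BITS`, and `error_per_tick` are illustrative and not taken from the GAO report.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumption: bits kept after the binary point, as in the slide's bit string

def chop(value, bits):
    """Truncate a nonnegative rational to `bits` fractional bits (no rounding)."""
    scale = 1 << bits
    return Fraction(value.numerator * scale // value.denominator, scale)

tenth = Fraction(1, 10)
error_per_tick = tenth - chop(tenth, FRACTION_BITS)  # error made on every 0.1 s tick
ticks = 100 * 60 * 60 * 10                           # tenths of a second in 100 hours
clock_drift_s = float(error_per_tick * ticks)        # accumulated timing error
miss_distance_m = clock_drift_s * 1676               # Scud speed quoted on the slide

print(f"error per tick = {float(error_per_tick):.3e}")
print(f"drift after 100 h = {clock_drift_s:.4f} s")
print(f"distance = {miss_distance_m:.0f} m")
```

Using the unrounded drift gives a distance slightly above the slide's 570 m, which was computed from the rounded value 0.34 s.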
Numbers and Their Encodings
Some 4-bit number representation formats:
• Unsigned integer
• Signed integer (sign bit ±)
• Fixed point, 3+1 (radix point before the last bit)
• Signed fraction (sign bit ±)
• 2's-compl fraction
• Floating point (fields e, s): exponent in {−2, −1, 0, 1}, significand in {0, 1, 2, 3}
• Logarithmic (field log x): base-2 logarithm
Encoding Numbers in 4 Bits
[Figure: number line from −16 to 16 marking the values representable in each 4-bit number format]
Number format:
• Unsigned integers
• Signed-magnitude (±)
• 3 + 1 fixed-point, xxx.x
• Signed fraction, ±.xxx
• 2’s-compl. fraction, x.xxx
• 2 + 2 floating-point, $s \times 2^e$, e in [−2, 1], s in [0, 3]
• 2 + 2 logarithmic (log x = xx.xx)
Fixed-Radix Positional Number Systems
$(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$
One can generalize to:
• Arbitrary radix (not necessarily integer, positive, constant)
• Arbitrary digit set, usually {–α, –α+1, . . . , β–1, β} = [–α, β]
Example 1.1. Balanced ternary number system: Radix r = 3, digit set = [–1, 1]
Example 1.2. Negative-radix number systems: Radix –r, r ≥ 2, digit set = [0, r – 1]
The special case with radix –2 and digit set [0, 1] is known as the negabinary number system.
Can it represent all integers?
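One way to explore such generalized systems is to generate digits by repeated division. The sketch below assumes a nonredundant digit set [lo, hi] with exactly |radix| values that contains 0; the function names `to_digits` and `from_digits` and the `max_digits` guard are illustrative assumptions, not part of the slides. For negabinary it produces a representation for every integer tried, positive or negative, without a separate sign.

```python
def to_digits(n, radix, lo, hi, max_digits=64):
    """Digits (most significant first) of integer n for the given radix and digit set [lo, hi].

    Assumes a nonredundant digit set: hi - lo + 1 == |radix| and lo <= 0 <= hi.
    The radix may be negative (e.g. -2 for negabinary).
    """
    base = abs(radix)
    assert hi - lo + 1 == base and lo <= 0 <= hi
    digits = []
    while n != 0:
        if len(digits) >= max_digits:
            raise ValueError("n is not representable with this radix and digit set")
        d = n % base            # residue in [0, base - 1]
        if d > hi:
            d -= base           # shift into [lo, -1]
        digits.append(d)
        n = (n - d) // radix    # exact division, since n - d is a multiple of the radix
    return digits[::-1] or [0]

def from_digits(digits, radix):
    """Evaluate a most-significant-first digit list with Horner's rule."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

print(to_digits(5, 3, -1, 1))    # balanced ternary
print(to_digits(-3, -2, 0, 1))   # negabinary
```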
More Examples of Number Systems
Example 1.3. Digit set [–4, 5] for r = 10: (3 –1 5)ten represents 295 = 300 – 10 + 5
Example 1.4. Digit set [–7, 7] for r = 10: (3 –1 5)ten = (3 0 –5)ten = (1 –7 0 –5)ten
Example 1.7. Quater-imaginary number system: radix r = 2j, digit set [0, 3]
Number Radix Conversion
u = w . v (whole part w, fractional part v)
= $(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r$ Old
= $(X_{K-1} X_{K-2} \ldots X_1 X_0 \,.\, X_{-1} X_{-2} \ldots X_{-L})_R$ New
Radix conversion, using arithmetic in the old radix r: convenient when converting from r = 10
Radix conversion, using arithmetic in the new radix R: convenient when converting to R = 10
Example: (31)eight = (25)ten, i.e., 31 Oct. = 25 Dec., Halloween = Xmas
Radix Conversion: Old-Radix Arithmetic
Converting whole part w: (105)ten = (?)five
Repeatedly divide by five:
Quotient    Remainder
105
21          0
4           1
0           4
Therefore, (105)ten = (410)five
Converting fractional part v: (105.486)ten = (410.?)five
Repeatedly multiply by five:
Whole Part  Fraction
            .486
2           .430
2           .150
0           .750
3           .750
3           .750
Therefore, (105.486)ten ≅ (410.22033)five
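The two procedures above (repeated division for the whole part, repeated multiplication for the fraction) can be sketched as follows. The function names are illustrative, and `Fraction` is used only so that the decimal input .486 is handled exactly; the arithmetic is carried out in the old radix (here, ordinary decimal integers).

```python
from fractions import Fraction

def whole_to_radix(w, radix):
    """Repeatedly divide by the new radix; remainders are the digits, least significant first."""
    digits = []
    while w > 0:
        w, remainder = divmod(w, radix)
        digits.append(remainder)
    return digits[::-1] or [0]

def fraction_to_radix(v, radix, ndigits):
    """Repeatedly multiply by the new radix; whole parts are the digits, most significant first."""
    digits = []
    for _ in range(ndigits):
        v *= radix
        whole = v.numerator // v.denominator
        digits.append(whole)
        v -= whole
    return digits

print(whole_to_radix(105, 5))                         # digits of (105)ten in radix 5
print(fraction_to_radix(Fraction(486, 1000), 5, 5))   # first five fractional digits
```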
Radix Conversion: New-Radix Arithmetic
Converting whole part w: (22033)five = (?)ten, using Horner’s rule or formula:
((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3
Intermediate values, innermost parentheses first: 10, 12, 60, 303, 1518
Converting fractional part v: (410.22033)five = (105.?)ten
(0.22033)five × $5^5$ = (22033)five = (1518)ten
1518 / $5^5$ = 1518 / 3125 = 0.48576
Therefore, (410.22033)five = (105.48576)ten
Horner’s Rule for Fractions
Converting fractional part v: (0.22033)five = (?)ten, using Horner’s rule or formula:
(((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5
Intermediate values, innermost parentheses first: 0.6, 3.6, 0.72, 2.144, 2.4288, 0.48576
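Both uses of Horner's rule, for whole parts and for fractions, can be sketched in a few lines. The function names are illustrative; `Fraction` keeps the fractional result exact so that it can be compared with 1518/3125.

```python
from fractions import Fraction

def horner_whole(digits, radix):
    """Evaluate most-significant-first whole digits: (((d0 * r) + d1) * r + d2) ..."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

def horner_fraction(digits, radix):
    """Evaluate fractional digits .d1 d2 ... dn, starting from the least significant digit."""
    value = Fraction(0)
    for d in reversed(digits):
        value = (value + d) / radix
    return value

whole = horner_whole([2, 2, 0, 3, 3], 5)        # (22033)five as an integer
fraction = horner_fraction([2, 2, 0, 3, 3], 5)  # (0.22033)five as an exact fraction
print(whole, float(fraction))
```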
Classes of Number Representations
• Signed number
• Redundant number system
• Residue number system
• Real number
2 Representing Signed Numbers
Chapter Goals
• Learn different encodings of the sign info
• Discuss implications for arithmetic design
Chapter Highlights
• Using sign bit, biasing, complementation
• Properties of 2’s-complement numbers
• Signed vs unsigned arithmetic
• Signed numbers, positions, or digits
Four-bit signed-magnitude number representation system for integers
[Figure: circle of the 16 bit patterns (representation) with their signed values (signed magnitude). Patterns 0000 through 0111 represent +0 through +7; patterns 1000 through 1111 represent –0 through –7. Stepping through the patterns increments the positive values and decrements the negative values.]
Four-bit biased integer number representation system with a bias of 8
[Figure: circle of the 16 bit patterns (representation) with their signed values (biased by 8). Patterns 0000 through 1111 represent –8 through +7; stepping through the patterns increments the value on both the negative and the positive side.]
Arithmetic with Biased Numbers
Addition/subtraction of biased numbers:
x + y + bias = (x + bias) + (y + bias) – bias
x – y + bias = (x + bias) – (y + bias) + bias
A power-of-2 (or $2^a - 1$) bias simplifies addition/subtraction
Comparison of biased numbers:
• Compare like ordinary unsigned numbers
• Find true difference by ordinary subtraction
We seldom perform arbitrary arithmetic on biased numbers
Main application: Exponent field of floating-point numbers
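The identities above can be checked on the four-bit, bias-8 system. This is a small sketch with illustrative names; it also shows one reason a power-of-2 bias is convenient: in k bits, subtracting or adding the bias $2^{k-1}$ modulo $2^k$ amounts to flipping the most significant bit.

```python
K = 4
BIAS = 1 << (K - 1)      # bias of 8 for the four-bit example
MASK = (1 << K) - 1

def encode(x):
    """Biased code of x, for x in [-8, 7]."""
    return x + BIAS

def decode(code):
    return code - BIAS

def biased_add(cx, cy):
    """(x + bias) + (y + bias) - bias, with the bias correction done by flipping the MSB."""
    return ((cx + cy) & MASK) ^ BIAS

def biased_sub(cx, cy):
    """(x + bias) - (y + bias) + bias, again correcting with an MSB flip."""
    return ((cx - cy) & MASK) ^ BIAS

print(decode(biased_add(encode(3), encode(-5))))   # 3 + (-5)
print(encode(-3) < encode(2))                      # ordering matches unsigned comparison
```

Results are only meaningful when the true sum or difference stays within [-8, 7]; overflow detection is left out of this sketch.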
Example and Two Special Cases
Example -- complement system for fixed-point numbers:
• Complementation constant M = 12.000
• Fixed-point number range [–6.000, +5.999]
• Represent –3.258 as 12.000 – 3.258 = 8.742
Auxiliary operations for complement representations:
• Complementation or change of sign (computing M – x)
• Computations of residues mod M
Thus, M must be selected to simplify these operations
Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits:
• Radix complement: $M = r^k$
• Digit complement: $M = r^k - ulp$ (aka diminished radix compl)
ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers
Two’s-Complement Numbers
[Figure: circle of the 16 four-bit patterns (unsigned representations) with their signed values (2’s complement). Patterns 0000 through 0111 represent +0 through +7; patterns 1000 through 1111 represent –8 through –1.]
Two’s complement = radix complement system for r = 2
$M = 2^k$
$2^k - x = [(2^k - ulp) - x] + ulp = x^{compl} + ulp$
Range of representable numbers with k whole bits: from $-2^{k-1}$ to $2^{k-1} - ulp$
ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers
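The identity $2^k - x = x^{compl} + ulp$ is the familiar "invert all bits, then add 1" rule for integers (ulp = 1). A minimal sketch with illustrative function names, using k-bit patterns stored as Python integers:

```python
def twos_complement_negate(pattern, k):
    """Return the k-bit pattern of -x given the k-bit pattern of x."""
    mask = (1 << k) - 1
    digit_complement = pattern ^ mask       # (2^k - ulp) - x: flip every bit
    return (digit_complement + 1) & mask    # add ulp, then reduce mod 2^k

def decode_twos_complement(pattern, k):
    """Signed value of a k-bit two's-complement pattern."""
    return pattern - (1 << k) if pattern >> (k - 1) else pattern

neg3 = twos_complement_negate(0b0011, 4)
print(format(neg3, "04b"), decode_twos_complement(neg3, 4))
```

Note that the most negative value (1000 for k = 4) maps to itself, since +8 is outside the representable range.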
One’s-Complement Number Representation
One’s complement = digit complement (diminished radix complement) system for r = 2
$M = 2^k - ulp$
$(2^k - ulp) - x = x^{compl}$
Range of representable numbers with k whole bits: from $-2^{k-1} + ulp$ to $2^{k-1} - ulp$
[Figure: circle of the 16 four-bit patterns (unsigned representations) with their signed values (1’s complement). Patterns 0000 through 0111 represent +0 through +7; patterns 1000 through 1111 represent –7 through –0.]
Range/Precision Extension for 2’s- and 1’s-Complement Numbers
Range/precision extension for 2’s-complement numbers:
$\ldots\; x_{k-1}\; x_{k-1}\; x_{k-1}\;\; x_{k-1}\; x_{k-2} \ldots x_1\; x_0 \,.\, x_{-1}\; x_{-2} \ldots x_{-l}\;\; 0\; 0\; 0 \ldots$
Range/precision extension for 1’s-complement numbers:
$\ldots\; x_{k-1}\; x_{k-1}\; x_{k-1}\;\; x_{k-1}\; x_{k-2} \ldots x_1\; x_0 \,.\, x_{-1}\; x_{-2} \ldots x_{-l}\;\; x_{k-1}\; x_{k-1}\; x_{k-1} \ldots$
In both cases, the copies of $x_{k-1}$ on the left are the sign extension, the original $x_{k-1}$ is the sign bit, $x_{-l}$ is the LSD, and the digits appended on the right are the extension.
Mod $2^k$ vs Mod $2^k - 1$
Mod-$2^k$ operation needed in 2’s-complement arithmetic is trivial: simply drop the carry-out (subtract $2^k$ if result is $2^k$ or greater)
Mod-($2^k$ – ulp) operation needed in 1’s-complement arithmetic is done via end-around carry:
(x + y) – ($2^k$ – ulp): connect cout to cin
Since the dropped carry is worth $2^k$ units and the inserted carry is worth ulp, the combined effect is to reduce the magnitude by $2^k$ – ulp.
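The two reductions can be sketched side by side for k-bit integer patterns (ulp = 1). The function names are illustrative; the end-around carry is modeled by adding the dropped carry back into the least significant position.

```python
def add_twos_complement(x, y, k):
    """k-bit two's-complement addition: drop the carry-out (reduce mod 2^k)."""
    return (x + y) & ((1 << k) - 1)

def add_ones_complement(x, y, k):
    """k-bit one's-complement addition: end-around carry (reduce mod 2^k - 1)."""
    mask = (1 << k) - 1
    total = x + y
    if total > mask:                 # carry-out of weight 2^k was produced
        total = (total & mask) + 1   # drop it and reinsert it with weight ulp
    return total

# +5 plus -3 in four bits: -3 is 1101 in two's complement and 1100 in one's complement
print(format(add_twos_complement(0b0101, 0b1101, 4), "04b"))
print(format(add_ones_complement(0b0101, 0b1100, 4), "04b"))
```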
Why 2’s-Complement Is the Universal Choice
[Figure: Adder/subtractor architecture for 2’s-complement numbers. A mux controlled by the add/sub signal selects y or y′ (controlled complementation) as the second adder input; the same signal (0 for addition, 1 for subtraction) drives c_in. The adder produces s = x ± y and c_out. The mux can be replaced with k XOR gates.]
Interpreting a 2’s-complement number as having a negatively weighted most-significant digit.
x = (1 0 1 0 0 1 1 0)two’s-compl
Weights: $-2^7,\; 2^6,\; 2^5,\; 2^4,\; 2^3,\; 2^2,\; 2^1,\; 2^0$
Value: –128 + 32 + 4 + 2 = –90
Check:
x = (1 0 1 0 0 1 1 0)two’s-compl
–x = (0 1 0 1 1 0 1 0)two
Weights: $2^7,\; 2^6,\; 2^5,\; 2^4,\; 2^3,\; 2^2,\; 2^1,\; 2^0$
Value: 64 + 16 + 8 + 2 = 90
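The negatively weighted interpretation translates directly into code. A minimal sketch, assuming the number is given as a bit string with the most significant bit first; the function name is illustrative.

```python
def twos_complement_value(bits):
    """Value of a two's-complement bit string, giving the leading bit weight -2^(k-1)."""
    k = len(bits)
    total = 0
    for position, bit in enumerate(bits):
        weight = 1 << (k - 1 - position)
        if position == 0:
            weight = -weight          # only the most significant bit is negatively weighted
        total += weight * int(bit)
    return total

print(twos_complement_value("10100110"))
print(twos_complement_value("01011010"))
```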
Redundant Number Systems
Chapter Goals
• Explore the advantages and drawbacks of using more than r digit values in radix r
Chapter Highlights
• Redundancy eliminates long carry chains
• Redundancy takes many forms: trade-offs
• Conversions between redundant and nonredundant representations
• Redundancy used for end values too?
Coping with the Carry Problem
Ways of dealing with the carry propagation problem:
1. Limit propagation to within a small number of bits (Chapters 3-4)
2. Detect end of propagation; don’t wait for worst case (Chapter 5)
3. Speed up propagation via lookahead etc. (Chapters 6-7)
4. Ideal: Eliminate carry propagation altogether! (Chapter 3)
Use Redundant Number System (1/2)
     5  7  8  2  4  9
+    6  2  9  3  8  9    Operand digits in [0, 9]
––––––––––––––––––––––––––––––––––
    11  9 17  5 12 18    Position sums in [0, 18]
• The digit values 10 through 18 are redundant.
• A position sum of 10 or more would normally generate a carry; here it is kept as a single digit, which works because no position sum exceeds 18.
But how can we extend this beyond a single addition? Subsequent additions will cause problems.
Use Redundant Number System (2/2)
    18 18 18 18 18
+    0  0  0  0  1
Is there still a carry propagation problem?
The sum of digits for each position is in [0, 36]; each can be decomposed into an interim sum in [0, 16] and a transfer digit in [0, 2], i.e., a carry.
     8  8  8  8  9
  1  1  1  1
  1  9  9  9  9  9
Example: Addition of Redundant Numbers
Position sum decomposition: [0, 36] = 10 × [0, 2] + [0, 16]
Absorption of transfer digit: [0, 16] + [0, 2] = [0, 18]
     11  9 17 10 12 18    Operand digits in [0, 18]
+     6 12  9 10  8 18
     17 21 26 20 20 36    Position sums in [0, 36]
      7 11 16  0 10 16    Interim sums in [0, 16]
   1  1  1  2  1  2       Transfer digits in [0, 2]
   1  8 12 18  1 12 16    Sum digits in [0, 18]
Carry-Free Addition Schemes
[Figure: three schemes producing sum digits $s_{i+1}, s_i, s_{i-1}$ from the operand digits $(x_{i+1}, y_{i+1})$, $(x_i, y_i)$, $(x_{i-1}, y_{i-1})$. Labels: operand digits at position i, interim sum at position i, transfer digit $t_i$ into position i.]
(a) Ideal single-stage carry-free. (Impossible for positional system with fixed digit set)
(b) Two-stage carry-free.
(c) Single-stage with lookahead.
Redundancy Index
So, redundancy helps us achieve carry-free addition
But how much redundancy is actually needed? Is [0, 11] enough for r = 10?
     11 10  7 11  3  8    Operand digits in [0, 11]
+     7  2  9 10  9  8
     18 12 16 21 12 16    Position sums in [0, 22]
      8  2  6  1  2  6    Interim sums in [0, 9]
   1  1  1  2  1  1       Transfer digits in [0, 2]
   1  9  3  8  2  3  6    Sum digits in [0, 11]
Redundancy index ρ = α + β + 1 – r
For example, 0 + 11 + 1 – 10 = 2
Digit Sets and Digit-Set Conversions
Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10
    11  9 17 10 12 18    18 = 10 (carry 1) + 8
    11  9 17 10 13  8    13 = 10 (carry 1) + 3
    11  9 17 11  3  8    11 = 10 (carry 1) + 1
    11  9 18  1  3  8    18 = 10 (carry 1) + 8
    11 10  8  1  3  8    10 = 10 (carry 1) + 0
    12  0  8  1  3  8    12 = 10 (carry 1) + 2
  1  2  0  8  1  3  8    Answer; all digits in [0, 9]
Note: Conversion from redundant to nonredundant representation always involves carry propagation
Thus, the process is sequential and slow
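The right-to-left procedure of Example 3.1 can be sketched as a simple loop. The function name is illustrative; the loop makes the sequential dependence visible, since each step needs the carry produced by the previous (less significant) position.

```python
def to_conventional_digits(digits, radix=10):
    """Convert most-significant-first digits from a redundant set such as [0, 18] to [0, radix - 1]."""
    result = []
    carry = 0
    for d in reversed(digits):                    # process from the least significant digit
        carry, digit = divmod(d + carry, radix)   # e.g. 18 = 10 (carry 1) + 8
        result.append(digit)
    while carry:                                  # leftover carry becomes new leading digit(s)
        carry, digit = divmod(carry, radix)
        result.append(digit)
    return result[::-1]

print(to_conventional_digits([11, 9, 17, 10, 12, 18]))
```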
Generalized Signed-Digit Numbers
Radix r, digit set [–α, β]
Requirement: α + β + 1 ≥ r
Redundancy index: ρ = α + β + 1 – r
[Figure: taxonomy of radix-r positional number systems, reproduced here as an outline]
Radix-r positional
  ρ = 0: Non-redundant
    α = 0: Conventional
    α ≥ 1: Non-redundant signed-digit
  ρ ≥ 1: Generalized signed-digit (GSD)
    ρ = 1: Minimal GSD
      α = β (even r): Symmetric minimal GSD
        r = 2: BSD or BSB
      α ≠ β: Asymmetric minimal GSD
        α = 0: Stored-carry (SC)
          r = 2: BSC
        α = 1 (r ≠ 2): Non-binary SB
    ρ ≥ 2: Non-minimal GSD
      α = β: Symmetric non-minimal GSD
        α < r: Ordinary signed-digit (OSD)
          α = ⌊r/2⌋ + 1: Minimally redundant OSD
          α = r – 1: Maximally redundant OSD
      α ≠ β: Asymmetric non-minimal GSD
        α = 0: Unsigned-digit redundant (UDR)
        α = 1, β = r: SCB
          r = 2: BSCB
Binary Signed Digit (BSD)
x_i          1    –1    0    –1    0     BSD representation of +6
⟨s, v⟩       01   11    00   11    00    Sign and value encoding
2’s-compl    01   11    00   11    00    2-bit 2’s-complement
⟨n, p⟩       01   10    00   10    00    Negative & positive flags
⟨n, z, p⟩    001  100   010  100   010   1-out-of-3 encoding
Carry-Free Addition Algorithms
Carry-free addition of GSD numbers:
1. Compute the position sums $p_i = x_i + y_i$
2. Divide $p_i$ into a transfer $t_{i+1}$ and interim sum $w_i = p_i - r t_{i+1}$
3. Add incoming transfers to get the sum digits $s_i = w_i + t_i$
[Figure: two-stage carry-free structure; operand digits $x_i, y_i$ produce interim sum $w_i$ and transfer $t_{i+1}$, and $w_i$ combines with the incoming transfer $t_i$ to give $s_i$.]
If the transfer digits $t_i$ are in [–λ, μ], we must have:
$-\alpha + \lambda \le p_i - r t_{i+1} \le \beta - \mu$
The middle term is the interim sum. The left bound is the smallest interim sum if a transfer of –λ is to be absorbable; the right bound is the largest interim sum if a transfer of μ is to be absorbable.
These constraints lead to:
$\lambda \ge \alpha / (r - 1)$
$\mu \ge \beta / (r - 1)$
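The three steps can be sketched for a general digit set [–α, β]. This is an illustrative sketch, not the textbook's hardware formulation: it picks the smallest λ and μ allowed by the bounds above and searches for a transfer that puts the interim sum in [–α + λ, β – μ]; when no such transfer exists, the digit set is not redundant enough for this two-stage scheme. The names `gsd_add` and `gsd_value` are assumptions.

```python
def gsd_value(digits, r):
    """Integer value of a most-significant-first digit list in radix r."""
    value = 0
    for d in digits:
        value = value * r + d
    return value

def gsd_add(x, y, r, alpha, beta):
    """Carry-free addition of equal-length GSD numbers with digits in [-alpha, beta]."""
    lam = -(-alpha // (r - 1))            # smallest integer lambda >= alpha / (r - 1)
    mu = -(-beta // (r - 1))              # smallest integer mu >= beta / (r - 1)
    w_min, w_max = -alpha + lam, beta - mu
    interim, transfers = [], [0]          # transfers[i] is the transfer into position i
    for xi, yi in zip(reversed(x), reversed(y)):
        p = xi + yi                       # step 1: position sum
        for t in range(-lam, mu + 1):     # step 2: choose transfer and interim sum
            w = p - r * t
            if w_min <= w <= w_max:
                break
        else:
            raise ValueError("digit set does not support two-stage carry-free addition")
        interim.append(w)
        transfers.append(t)
    # step 3: each sum digit depends only on its own interim sum and one incoming transfer
    sums = [w + t for w, t in zip(interim, transfers)] + [transfers[-1]]
    return sums[::-1]

a = [11, 9, 17, 10, 12, 18]
b = [6, 12, 9, 10, 8, 18]
print(gsd_add(a, b, r=10, alpha=0, beta=18))
```

The digits produced may differ from a hand-worked decomposition, because some position sums (such as 20) can be split in more than one valid way; the represented value is the same.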
Is Carry-Free Addition Always Applicable?
No: It requires one of the following two conditions [Parh 90]:
a. r > 2, ρ ≥ 3
b. r > 2, ρ = 2, α ≠ 1, β ≠ 1 (e.g., not [−1, 10] in radix 10)
In other words, it is inapplicable for:
• r = 2 (perhaps most useful case)
• ρ = 1 (e.g., carry-save)
• ρ = 2 with α = 1 or β = 1 (e.g., carry/borrow-save)
BSD is not two-stage carry-free -1 -10 -1-1 -2-1
-1
Use Carry-Estimate
A position sum –1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of –1 for incoming transfer in [–1, 0]. This guarantees that $t_i \ne w_i$ and thus $-1 \le s_i \le 1$.
Position i               5     4     3     2     1     0
x_i in [–1, 1]                 1    –1     0    –1     0
y_i in [–1, 1]                 0    –1    –1     0     1
p_i in [–2, 2]                 1    –2    –1    –1     1
e_i                     high  low   low   low   high  high
w_i in [–1, 1]                 1     0     1    –1    –1
t_i in [–1, 1]           0    –1    –1     0     1     0
s_i in [–1, 1]           0     0    –1     1     0    –1
e_i in {low: [–1, 0], high: [0, 1]} is the estimate of the transfer $t_i$ coming into position i.
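A sketch of BSD addition with carry estimates follows. The rule used here for the estimate is an assumption that is not spelled out on the slide: the transfer out of a position is taken to be "low" if either operand digit there is negative and "high" otherwise, which bounds the outgoing transfer without looking at any other position. The function name `bsd_add` is illustrative.

```python
def bsd_add(x, y):
    """Add two equal-length BSD numbers (digits in {-1, 0, 1}, most significant first)."""
    xs, ys = x[::-1], y[::-1]                   # work least-significant-first
    n = len(xs)
    w = [0] * n                                 # interim sums
    t = [0] * (n + 1)                           # t[i] is the transfer into position i
    for i in range(n):
        p = xs[i] + ys[i]
        # Assumed estimate: incoming transfer is in [-1, 0] if a digit one position
        # to the right is negative, otherwise in [0, 1]
        incoming_low = i > 0 and (xs[i - 1] < 0 or ys[i - 1] < 0)
        if p == 2:
            w[i], t[i + 1] = 0, 1
        elif p == -2:
            w[i], t[i + 1] = 0, -1
        elif p == 1:
            w[i], t[i + 1] = (1, 0) if incoming_low else (-1, 1)
        elif p == -1:
            w[i], t[i + 1] = (1, -1) if incoming_low else (-1, 0)
    s = [w[i] + t[i] for i in range(n)] + [t[n]]
    return s[::-1]

print(bsd_add([1, -1, 0, -1, 0], [0, -1, -1, 0, 1]))
```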
Residue Number Systems
Chapter Goals
• Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations
Chapter Highlights
• Moduli, range, arithmetic operations
• Many sets of moduli possible: tradeoffs
• Conversions between RNS and binary
• The Chinese remainder theorem
• Why are RNS applications limited?
RNS Representations and Arithmetic
Chinese puzzle, 1500 years ago:
What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively?
Residues uniquely identify the number, hence they constitute a representation
Pairwise relatively prime moduli: $m_{k-1} > \ldots > m_1 > m_0$
The residue $x_i$ of x wrt the ith modulus $m_i$ (similar to a digit): $x_i = x \bmod m_i = \langle x \rangle_{m_i}$
RNS representation contains a list of k residues or digits: x = (2 | 3 | 2)RNS(7|5|3)
Default RNS for this chapter: RNS(8 | 7 | 5 | 3)
RNS Dynamic Range
Product M of the k pairwise relatively prime moduli is the dynamic range
$M = m_{k-1} \times \ldots \times m_1 \times m_0$
For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840
Negative numbers: Complement relative to M
$\langle -x \rangle_{m_i} = \langle M - x \rangle_{m_i}$
21 = (5 | 0 | 1 | 0)RNS
–21 = (8 – 5 | 0 | 5 – 1 | 0)RNS = (3 | 0 | 4 | 0)RNS
Here are some example numbers in our default RNS(8 | 7 | 5 | 3):
(0 | 0 | 0 | 0)RNS Represents 0 or 840 or . . .
(1 | 1 | 1 | 1)RNS Represents 1 or 841 or . . .
(2 | 2 | 2 | 2)RNS Represents 2 or 842 or . . .
. . .
(0 | 1 | 4 | 1)RNS Represents 64 or 904 or . . .
(2 | 0 | 0 | 2)RNS Represents –70 or 770 or . . .
(7 | 6 | 4 | 2)RNS Represents –1 or 839 or . . .
We can take the range of RNS(8|7|5|3) to be [−420, 419] or any other set of 840 consecutive integers
RNS as Weighted Representation
For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are:
105 120 336 280
Example: (1 | 2 | 4 | 0)RNS represents the number
$\langle 105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0 \rangle_{840} = \langle 1689 \rangle_{840} = 9$
For RNS(7 | 5 | 3), the weights of the 3 positions are:
15 21 70
Example -- Chinese puzzle: (2 | 3 | 2)RNS(7|5|3) represents the number
$\langle 15 \times 2 + 21 \times 3 + 70 \times 2 \rangle_{105} = \langle 233 \rangle_{105} = 23$
We will see later how the weights can be determined for a given RNS
RNS Encoding and Arithmetic Operations
Binary-coded format for RNS(8 | 7 | 5 | 3): the mod-8, mod-7, mod-5, and mod-3 residue fields occupy 3, 3, 3, and 2 bits.
Arithmetic in RNS(8 | 7 | 5 | 3):
(5 | 5 | 0 | 2)RNS Represents x = +5
(7 | 6 | 4 | 2)RNS Represents y = –1
(4 | 4 | 4 | 1)RNS x + y : ⟨5 + 7⟩8 = 4, ⟨5 + 6⟩7 = 4, etc.
(6 | 6 | 1 | 0)RNS x – y : ⟨5 – 7⟩8 = 6, ⟨5 – 6⟩7 = 6, etc. (alternatively, find –y and add to x)
(3 | 2 | 0 | 1)RNS x × y : ⟨5 × 7⟩8 = 3, ⟨5 × 6⟩7 = 2, etc.
[Figure: Operand 1 and Operand 2, each split into mod-8, mod-7, mod-5, and mod-3 fields, feed independent Mod-8, Mod-7, Mod-5, and Mod-3 units that produce the corresponding fields of the Result.]
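Because each residue position is processed independently, RNS addition, subtraction, and multiplication reduce to small modular operations. A minimal sketch for the default RNS(8 | 7 | 5 | 3), with illustrative function names:

```python
import operator

MODULI = (8, 7, 5, 3)

def to_rns(x):
    """Residues of x; Python's % already returns the complement form for negative x."""
    return tuple(x % m for m in MODULI)

def rns_op(a, b, op):
    """Apply op position by position, each position reduced by its own modulus."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, MODULI))

x, y = to_rns(5), to_rns(-1)
print(rns_op(x, y, operator.add))   # x + y
print(rns_op(x, y, operator.sub))   # x - y
print(rns_op(x, y, operator.mul))   # x * y
```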
Choosing the RNS Moduli
Target range for our RNS: Decimal values [0, 100 000]
Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence
Pick $m_0 = 2$, $m_1 = 3$, $m_2 = 5$, etc. After adding $m_5 = 13$:
RNS(13 | 11 | 7 | 5 | 3 | 2) M = 30 030 Inadequate
RNS(17 | 13 | 11 | 7 | 5 | 3 | 2) M = 510 510 Too large
RNS(17 | 13 | 11 | 7 | 3 | 2) M = 102 102 Just right!
5 + 4 + 4 + 3 + 2 + 1 = 19 bits
Fine tuning: Combine pairs of moduli 2 & 13 (26) and 3 & 7 (21)
RNS(26 | 21 | 17 | 11) M = 102 102
An Improved Strategy
Target range for our RNS: Decimal values [0, 100 000]
Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime
RNS($2^2$ | 3) M = 12
RNS($3^2$ | $2^3$ | 7 | 5) M = 2520
RNS(11 | $3^2$ | $2^3$ | 7 | 5) M = 27 720
RNS(13 | 11 | $3^2$ | $2^3$ | 7 | 5) M = 360 360
(remove one 3, combine 3 & 5)
RNS(15 | 13 | 11 | $2^3$ | 7) M = 120 120
4 + 4 + 4 + 3 + 3 = 18 bits
Fine tuning: Maximize the size of the even modulus within the 4-bit limit
RNS($2^4$ | 13 | 11 | $3^2$ | 7 | 5) M = 720 720 Too large
We can now remove 5 or 7; not an improvement in this example
Low-Cost RNS Moduli
Target range for our RNS: Decimal values [0, 100 000]
Strategy 3: To simplify the modular reduction (mod $m_i$) operations, choose only moduli of the forms $2^a$ or $2^a - 1$, aka “low-cost moduli”
RNS($2^{a_{k-1}}$ | $2^{a_{k-2}} - 1$ | . . . | $2^{a_1} - 1$ | $2^{a_0} - 1$)
We can have only one even modulus
$2^{a_i} - 1$ and $2^{a_j} - 1$ are relatively prime iff $a_i$ and $a_j$ are relatively prime
RNS($2^3$ | $2^3 - 1$ | $2^2 - 1$) basis: 3, 2 M = 168
RNS($2^4$ | $2^4 - 1$ | $2^3 - 1$) basis: 4, 3 M = 1680
RNS($2^5$ | $2^5 - 1$ | $2^3 - 1$ | $2^2 - 1$) basis: 5, 3, 2 M = 20 832
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$) basis: 5, 4, 3 M = 104 160
Comparison:
RNS(15 | 13 | 11 | $2^3$ | 7) 18 bits M = 120 120
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$) 17 bits M = 104 160
Reduction mod $2^k$ and mod $2^k - 1$ is easy.
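Why these moduli are "low-cost" can be sketched in a few lines: reduction mod $2^a$ keeps the low a bits, and reduction mod $2^a - 1$ adds a-bit chunks with end-around carry, because $2^a \equiv 1 \pmod{2^a - 1}$. The function names below are illustrative.

```python
from math import gcd

def mod_pow2(x, a):
    """x mod 2^a: keep the a least significant bits."""
    return x & ((1 << a) - 1)

def mod_pow2_minus_1(x, a):
    """x mod (2^a - 1) for x >= 0: fold the high part onto the low part until it fits."""
    m = (1 << a) - 1
    while x > m:
        x = (x & m) + (x >> a)   # 2^a is congruent to 1, so high bits can be added in
    return 0 if x == m else x    # the all-ones pattern is a second encoding of 0

print(mod_pow2(164, 3), mod_pow2_minus_1(164, 3))
# Moduli 2^a - 1 and 2^b - 1 share a factor exactly when a and b do
print(gcd(2**5 - 1, 2**4 - 1), gcd(2**4 - 1, 2**2 - 1))
```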
Encoding and Decoding of Numbers
Conversion from binary/decimal to RNS
Table 4.1 Residues of the first 10 powers of 2
–––––––––––––––––––––––––––––
i    2^i    ⟨2^i⟩7    ⟨2^i⟩5    ⟨2^i⟩3
–––––––––––––––––––––––––––––
0    1      1         1         1
1    2      2         2         2
2    4      4         4         1
3    8      1         3         2
4    16     2         1         1
5    32     4         2         2
6    64     1         4         1
7    128    2         3         2
8    256    4         1         1
9    512    1         2         2
–––––––––––––––––––––––––––––
Example 4.1: Represent the number y = (1010 0100)two = (164)ten in RNS(8 | 7 | 5 | 3)
The mod-8 residue is easy to find:
$x_3 = \langle y \rangle_8$ = (100)two = 4
We have $y = 2^7 + 2^5 + 2^2$; thus
$x_2 = \langle y \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = 3$
$x_1 = \langle y \rangle_5 = \langle 3 + 2 + 4 \rangle_5 = 4$
$x_0 = \langle y \rangle_3 = \langle 2 + 2 + 1 \rangle_3 = 2$
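The table-lookup conversion of Example 4.1 can be sketched as follows. The names `POWER_RESIDUES` and `binary_to_rns` are illustrative; the table is regenerated with Python's three-argument `pow` rather than typed in, and the mod-8 residue is read directly from the three low bits.

```python
ODD_MODULI = (7, 5, 3)
NUM_BITS = 10

# POWER_RESIDUES[i] holds the residues of 2^i modulo 7, 5, and 3 (the rows of Table 4.1)
POWER_RESIDUES = [tuple(pow(2, i, m) for m in ODD_MODULI) for i in range(NUM_BITS)]

def binary_to_rns(y):
    """Convert 0 <= y < 2^NUM_BITS to RNS(8 | 7 | 5 | 3) by adding table entries for the 1 bits."""
    sums = [0, 0, 0]
    for i in range(NUM_BITS):
        if (y >> i) & 1:
            for j, residue in enumerate(POWER_RESIDUES[i]):
                sums[j] += residue
    reduced = tuple(s % m for s, m in zip(sums, ODD_MODULI))
    return (y & 0b111,) + reduced      # mod-8 residue is just the three low bits

print(binary_to_rns(0b10100100))
```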
Conversion from RNS to Binary/Decimal
Theorem 4.1 (The Chinese remainder theorem)
$x = (x_{k-1} | \ldots | x_2 | x_1 | x_0)_{RNS} = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M$
where $M_i = M / m_i$ and $\alpha_i = \langle M_i^{-1} \rangle_{m_i}$ (multiplicative inverse of $M_i$ wrt $m_i$)
Implementing CRT-based RNS-to-binary conversion:
$x = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M = \left\langle \sum_i f_i(x_i) \right\rangle_M$
We can use a table to store the $f_i$ values -- $\sum_i m_i$ entries
Table 4.2 Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3)
––––––––––––––––––––––––––––––
i    m_i    x_i    ⟨M_i ⟨α_i x_i⟩m_i⟩M
––––––––––––––––––––––––––––––
3    8      0      0
            1      105
            2      210
            3      315
            .      .
            .      .
            .      .
Intuitive Justification for CRT
Puzzle: What number has the remainders of 2, 3, and 2 when divided by the numbers 7, 5, and 3, respectively?
x = (2 | 3 | 2)RNS(7|5|3) = (?)ten
(1 | 0 | 0)RNS(7|5|3) = multiple of 15 that is 1 mod 7 = 15
(0 | 1 | 0)RNS(7|5|3) = multiple of 21 that is 1 mod 5 = 21
(0 | 0 | 1)RNS(7|5|3) = multiple of 35 that is 1 mod 3 = 70
(2 | 3 | 2)RNS(7|5|3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2)
= 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1)
= 2 × 15 + 3 × 21 + 2 × 70 = 30 + 63 + 140
= 233 = 23 mod 105
Therefore, x = (23)ten
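The position weights used in this justification follow directly from the theorem: the weight for modulus $m_i$ is $M_i \times \langle M_i^{-1} \rangle_{m_i}$. A minimal sketch, with illustrative function names; it assumes Python 3.8 or later for `math.prod` and for `pow(base, -1, modulus)`.

```python
from math import prod

def crt_weights(moduli):
    """Weight of each RNS position: M_i times the inverse of M_i modulo m_i."""
    M = prod(moduli)
    weights = []
    for m in moduli:
        M_i = M // m
        alpha = pow(M_i, -1, m)          # multiplicative inverse of M_i with respect to m
        weights.append(M_i * alpha % M)
    return weights

def rns_to_int(residues, moduli):
    """Decode residues (listed in the same order as moduli) to the value in [0, M)."""
    M = prod(moduli)
    return sum(w * x for w, x in zip(crt_weights(moduli), residues)) % M

print(crt_weights((7, 5, 3)), rns_to_int((2, 3, 2), (7, 5, 3)))
print(crt_weights((8, 7, 5, 3)), rns_to_int((1, 2, 4, 0), (8, 7, 5, 3)))
```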
Difficult RNS Arithmetic Operations
• Sign test
• Magnitude comparison
• Division
Possible approaches:
• Could convert back and forth to/from binary.
• Another approach: convert to a mixed radix system, as numbers in a mixed radix system are comparable.
Difficult RNS Arithmetic Operations
Example: Of the following RNS(8 | 7 | 5 | 3) numbers:
• Which, if any, are negative?
• Which is the largest?
• Which is the smallest?
Assume a range of [–420, 419]
a = (0 | 1 | 3 | 2)RNS
b = (0 | 1 | 4 | 1)RNS
c = (0 | 6 | 2 | 1)RNS
d = (2 | 0 | 0 | 2)RNS
e = (5 | 0 | 1 | 0)RNS
f = (7 | 6 | 4 | 2)RNS
Answers:
d < c < f < a < e < b
–70 < –8 < –1 < 8 < 21 < 64
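One way to obtain these answers is the mixed-radix route: the mixed-radix digits are positional, so magnitudes become comparable. The sketch below is an illustration under stated assumptions, not a hardware algorithm from the slides: the function names are made up, `pow(m, -1, n)` needs Python 3.8 or later, and the sign is decided by comparing the decoded value with M/2 for the assumed range [–420, 419].

```python
MODULI = (8, 7, 5, 3)   # listed from m_3 down to m_0

def rns_to_mixed_radix(residues, moduli=MODULI):
    """Mixed-radix digits (z_3, z_2, z_1, z_0) with x = z_0 + z_1*m_0 + z_2*m_0*m_1 + ..."""
    res = list(residues)[::-1]           # index 0 corresponds to m_0
    mods = list(moduli)[::-1]
    digits = []
    for i, m in enumerate(mods):
        z = res[i]
        digits.append(z)
        for j in range(i + 1, len(mods)):
            # subtract the digit just found, then divide by m_i (multiply by its inverse)
            res[j] = (res[j] - z) * pow(m, -1, mods[j]) % mods[j]
    return digits[::-1]

def signed_value(residues, moduli=MODULI):
    """Value in [-M/2, M/2 - 1], evaluated from the mixed-radix digits."""
    mods = list(moduli)[::-1]
    digits = rns_to_mixed_radix(residues, moduli)[::-1]
    value, weight = 0, 1
    for z, m in zip(digits, mods):
        value += z * weight
        weight *= m
    return value - weight if value >= weight // 2 else value

numbers = {"a": (0, 1, 3, 2), "b": (0, 1, 4, 1), "c": (0, 6, 2, 1),
           "d": (2, 0, 0, 2), "e": (5, 0, 1, 0), "f": (7, 6, 4, 2)}
ordering = sorted(numbers, key=lambda name: signed_value(numbers[name]))
print(ordering, [signed_value(numbers[name]) for name in ordering])
```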
General RNS Division
General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions
Scheme proposed in 1994 PhD thesis of Ching-Yu Hung (UCSB):
Use an algorithm that has built-in tolerance to imprecision, and apply the approximate CRT decoding to choose quotient digits
Example –– SRT algorithm (s is the partial remainder):
s < 0: quotient digit = –1
s ≅ 0: quotient digit = 0
s > 0: quotient digit = 1
The BSD quotient can be converted to RNS on the fly
Limits of Fast Arithmetic in RNS
Known results from number theory:
Theorem 4.2: The ith prime $p_i$ is asymptotically i ln i
Theorem 4.3: The number of primes in [1, n] is asymptotically n / ln n
Theorem 4.4: The product of all primes in [1, n] is asymptotically $e^n$
Implications to speed of arithmetic in RNS:
Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits
That is, with fast log-time adders, addition needs O(log log k) time
Hardware Implementation for RNS Representations
[Figure: Operand 1 and Operand 2, each split into mod-8, mod-7, mod-5, and mod-3 fields of 3, 3, 3, and 2 bits, feed independent Mod-8, Mod-7, Mod-5, and Mod-3 units that produce the corresponding fields of the Result.]
1Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Addition/Subtraction
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Dept. of EE, FJU, Taiwan
Room: SF 727B
Most slides originate from the textbook author’s PowerPoint presentation files.
II Addition / Subtraction
Review addition schemes and various speedup methods
• Addition is a key op (in itself, and as a building block)
• Subtraction = negation + addition
• Carry propagation speedup: lookahead, skip, select, …
• Two-operand versus multioperand addition
Topics in This Part
Chapter 5 Basic Addition and Counting
Chapter 6 Carry-Lookahead Adders
Chapter 7 Variations in Fast Adders
Chapter 8 Multioperand Addition
Basic Addition and Counting
Chapter Goals
• Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders
Chapter Highlights
• Full adders are versatile building blocks
• Longest carry chain on average: $\log_2 k$ bits
• Fast asynchronous adders are simple
• Counting is relatively easy to speed up
HA and FA Adders
Half-adder (HA): Truth table and block diagram
Inputs    Outputs
x  y      c  s
----------------
0  0      0  0
0  1      0  1
1  0      0  1
1  1      1  0
[Block diagram: HA with inputs x, y and outputs c, s]
Full-adder (FA): Truth table and block diagram
Inputs        Outputs
x  y  c_in    c_out  s
----------------------
0  0  0       0      0
0  0  1       0      1
0  1  0       0      1
0  1  1       1      0
1  0  0       0      1
1  0  1       1      0
1  1  0       1      0
1  1  1       1      1
[Block diagram: FA with inputs x, y, c_in and outputs c_out, s]
Half-Adder Implementations
[Figure: three half-adder circuits with inputs x, y and outputs c, s]
(a) AND/XOR half-adder.
(b) NOR-gate half-adder.
(c) NAND-gate half-adder with complemented carry.
Some Full-Adder Details
Logic equations for a full-adder:
s = x ⊕ y ⊕ cin (odd parity function)
  = x y cin ∨ x′ y′ cin ∨ x′ y cin′ ∨ x y′ cin′
cout = x y ∨ x cin ∨ y cin (majority function)
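These equations can be checked exhaustively against ordinary addition. A small sketch with illustrative function names; bits are modeled as the integers 0 and 1.

```python
from itertools import product

def full_adder(x, y, cin):
    """Return (cout, s) using the parity and majority forms."""
    s = x ^ y ^ cin                            # odd parity function
    cout = (x & y) | (x & cin) | (y & cin)     # majority function
    return cout, s

def sum_sop(x, y, cin):
    """Sum bit written as the four-term sum of products."""
    nx, ny, nc = 1 - x, 1 - y, 1 - cin
    return (x & y & cin) | (nx & ny & cin) | (nx & y & nc) | (x & ny & nc)

# Rows of the full-adder truth table: (x, y, cin, cout, s)
truth_table = [(x, y, cin) + full_adder(x, y, cin)
               for x, y, cin in product((0, 1), repeat=3)]
for row in truth_table:
    print(*row)
```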
Full-Adder Implementations
[Figure: three full-adder circuits with inputs x, y, cin and outputs cout, s]
(a) Built of half-adders.
(b) Built as an AND-OR circuit.
(c) Suitable for CMOS realization (mux-based).
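Bit Serial Adder and Ripple Adder
[Figure (a): Bit-serial adder. Operands x and y are shifted, one bit pair $x_i, y_i$ per clock, into a single FA; the carry $c_{i+1}$ is stored in a carry flip-flop and fed back as $c_i$; the sum bits $s_i$ are shifted into the result register.]
[Figure (b): Ripple-carry adder. A chain of FAs for bit positions 0 through 31, with inputs $x_i, y_i$, outputs $s_i$, and carries rippling from $c_0$ = cin to $c_{32}$ = cout.]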
9Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Critical Path Through a Ripple-Carry Adder
[Figure: critical path in a k-bit ripple-carry adder, running from the x0, y0 inputs through the carry chain to the sum output s_{k–1}.]
T_ripple-add = T_FA(x,y→cout) + (k – 2) × T_FA(cin→cout) + T_FA(cin→s)
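The ripple-carry structure can be sketched in software as a loop in which each full adder waits for the carry of the previous position. The sketch below is only a behavioral model in Python (the names `full_adder` and `ripple_carry_add` are assumptions made for illustration); it does not model gate delays.

```python
def full_adder(x, y, cin):
    """One-bit full adder: returns (cout, s)."""
    s = x ^ y ^ cin
    cout = (x & y) | (x & cin) | (y & cin)
    return cout, s

def ripple_carry_add(x, y, k, cin=0):
    """Add two k-bit unsigned integers bit by bit; returns (sum, carry_out)."""
    carry = cin
    total = 0
    for i in range(k):                      # position 0 is the least significant bit
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        carry, si = full_adder(xi, yi, carry)   # carry ripples on to position i + 1
        total |= si << i
    return total, carry
```

The loop-carried dependence on `carry` is the software counterpart of the critical path: position i cannot finish before position i – 1 has produced its carry.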
10Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Conditions and Exceptions
overflow_2’s-compl = x_{k–1} y_{k–1} s_{k–1}′ ∨ x_{k–1}′ y_{k–1}′ s_{k–1}
overflow_2’s-compl = c_k ⊕ c_{k–1} = c_k c_{k–1}′ ∨ c_k′ c_{k–1}
[Figure: k-bit ripple-carry adder with additional logic producing the Overflow, Negative, and Zero condition outputs.]
Overflow occurs when two numbers of like sign are added and a result of the opposite sign is produced.
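The two overflow expressions can be checked against each other in a few lines of Python. This is a sketch under stated assumptions: operands are passed as k-bit 2's-complement bit patterns held in non-negative integers, and the function name `add_twos_complement` is hypothetical.

```python
def add_twos_complement(x, y, k):
    """Add two k-bit 2's-complement bit patterns; returns (sum pattern, overflow)."""
    mask = (1 << k) - 1
    x &= mask
    y &= mask
    low_mask = mask >> 1                       # the k - 1 bits below the sign bit
    c_km1 = ((x & low_mask) + (y & low_mask)) >> (k - 1)   # carry into sign position
    full = x + y
    c_k = full >> k                            # carry out of sign position
    s = full & mask
    overflow = c_k ^ c_km1                     # carry-based test
    # Sign-based test: like-signed operands, result of the opposite sign.
    xs, ys, ss = x >> (k - 1), y >> (k - 1), s >> (k - 1)
    overflow_by_sign = (xs & ys & (1 ^ ss)) | ((1 ^ xs) & (1 ^ ys) & ss)
    assert overflow == overflow_by_sign        # both formulas must agree
    return s, overflow
```

For example, with k = 4 the call `add_twos_complement(7, 1, 4)` adds +7 and +1, which does not fit in four bits, so the overflow flag is set.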
11Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Binary Adders as Versatile Building Blocks (1/2)
Fig. 5.6 Four-bit binary adder used to realize the logic function f = w + xyz and its complement.
[Figure: four-bit adder with constant and variable operand bits chosen so that the carries are c0 = 0, c1 = xy, c2 = xyz, c3 = c4 = w ∨ xyz, and the bit-3 sum output is (w ∨ xyz)′.]
Set one input to 0: cout = AND of other inputs
Set one input to 1: cout = OR of other inputs
Set one input to 0 and another to 1: s = NOT of third input
12Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Binary Adders as Versatile Building Blocks (2/2)
[Full-adder truth table and block diagram, repeated from the HA and FA Adders slide.]
13Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of Carry Propagation
Bit positions   15 14 13 12   11 10  9  8    7  6  5  4    3  2  1  0
                 1  0  1  1    0  1  1  0    0  1  1  0    1  1  1  0
         cout    0  1  0  1    1  0  0  1    1  1  0  0    0  0  1  1    cin
Carry chains and their lengths (from left to right): 4, 6, 3, 2
14Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Using Probability to Analyze Carry Propagation
Given binary numbers with random bits, for each position i we have
Probability of carry generation = ¼ (both 1s)
Probability of carry annihilation = ¼ (both 0s)
Probability of carry propagation = ½ (different)
Probability that carry generated at position i propagates through position j – 1 and stops at position j (j > i):
2^{–(j–1–i)} × 1/2 = 2^{–(j–i)}
Expected length of the carry chain that starts at position i:
\sum_{j=i+1}^{k-1} (j-i)\,2^{-(j-i)} + (k-i)\,2^{-(k-1-i)}
  = \sum_{l=1}^{k-1-i} l\,2^{-l} + (k-i)\,2^{-(k-1-i)}
  = 2 - (k-i+1)\,2^{-(k-1-i)} + (k-i)\,2^{-(k-1-i)}
  = 2 - 2^{-(k-i-1)}
Because the carry definitely stops at position k, the term for k is not multiplied by ½.
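The expected chain length can be evaluated exactly with rational arithmetic and compared with the closed form 2 – 2^{–(k–i–1)}. The Python sketch below uses hypothetical function names and follows the probabilities stated above (a chain stops at position j < k with probability 2^{–(j–i)}, and reaches position k with probability 2^{–(k–1–i)}).

```python
from fractions import Fraction

def expected_chain_length(k, i):
    """Expected length of a carry chain generated at position i of a k-bit adder."""
    half = Fraction(1, 2)
    # Chains that stop at position j < k: length j - i, probability 2^-(j-i).
    total = sum((j - i) * half ** (j - i) for j in range(i + 1, k))
    # Chains that reach position k: length k - i, probability 2^-(k-1-i).
    total += (k - i) * half ** (k - 1 - i)
    return total

def closed_form(k, i):
    """Closed-form value 2 - 2^-(k-i-1)."""
    return 2 - Fraction(1, 2) ** (k - i - 1)

# The sum and the closed form agree for every starting position.
for k in range(2, 20):
    for i in range(k - 1):
        assert expected_chain_length(k, i) == closed_form(k, i)
```

The result shows that a chain starting anywhere in the word has expected length below 2, regardless of the word width k.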
15Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry Completion Detection
[Figure: carry-completion detection logic for bit position i, using two-rail carry signals (b_i, c_i); the per-position done signals d_{i+1} from all bit positions are combined into alldone.]
Dual rail coding:
b_i  c_i
0    0    Carry not yet known
0    1    Carry known to be 1
1    0    Carry known to be 0
16Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Self-Timed Adder
17Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Self-Timed Adder with Parallel Carry Completion Sensing
18Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Addition of a Constant: Counters
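[Figure: up (down) counter. A mux selects between Data in (Load) and the incrementer (decrementer) output x + 1 (x − 1) as the next value of the count register; control inputs are Count/Initialize, Reset, Clear, Enable, and Clock; the register output x is Data out, and the incrementer's c_out signals counter overflow.]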
19Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Implementing a Simple Up Counter
[Figure: four-bit asynchronous up counter built only of negative-edge-triggered T flip-flops, with an Increment input and count outputs Q0 to Q3.]
[Figure: ripple-carry incrementer for use in an up counter, with inputs x_0 … x_{k−1}, carry-in 1, carries c_1 … c_k, and outputs s_0 … s_{k−1}.]
20Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Manchester Carry Chains and Adders
Sum digit in radix r: s_i = (x_i + y_i + c_i) mod r
Special case of radix 2: s_i = x_i ⊕ y_i ⊕ c_i
Computing the carries c_i is thus our central problem. For this, the actual operand digits are not important. What matters is whether in a given position a carry is generated, propagated, or annihilated (absorbed).
For binary addition: g_i = x_i y_i, p_i = x_i ⊕ y_i, a_i = x_i′ y_i′ = (x_i ∨ y_i)′
It is also helpful to define a transfer signal: t_i = g_i ∨ p_i = a_i′ = x_i ∨ y_i
Using these signals, the carry recurrence is written as
c_{i+1} = g_i ∨ c_i p_i = g_i ∨ c_i g_i ∨ c_i p_i = g_i ∨ c_i t_i
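The equivalence of the propagate and transfer forms of the recurrence can be illustrated with a short Python sketch. The function name `carries` and its `use_transfer` flag are assumptions introduced here for illustration only.

```python
def carries(x, y, k, c0=0, use_transfer=False):
    """Return [c_0, ..., c_k] for two k-bit operands via the carry recurrence."""
    c = [c0]
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g = xi & yi          # generate
        p = xi ^ yi          # propagate
        t = xi | yi          # transfer, t = g OR p
        # c_{i+1} = g_i OR c_i p_i, or equivalently g_i OR c_i t_i
        c.append(g | (c[i] & (t if use_transfer else p)))
    return c

# Both forms give the same carries, and c_k matches integer addition.
for x in range(16):
    for y in range(16):
        for c0 in (0, 1):
            with_p = carries(x, y, 4, c0)
            assert with_p == carries(x, y, 4, c0, use_transfer=True)
            assert with_p[4] == (x + y + c0) >> 4
```

Using t_i in place of p_i is attractive in hardware because an OR gate is typically simpler than an XOR gate; p_i is still needed for the sum bits.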
21Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Manchester Carry Network
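[Figure: Manchester carry network. (a) Conceptual representation: switches controlled by p_i, g_i, and a_i connect c_{i+1} to c_i, to logic 1, or to logic 0, respectively. (b) Possible CMOS realization, operating on the complemented carries c′_i and c′_{i+1}, with Clock, p_i, and g_i inputs between VDD and VSS.]
The worst-case delay of a Manchester carry chain has three components:
1. Latency of forming the switch control signals
2. Set-up time for switches
3. Signal propagation delay through k switches
g_i = x_i y_i, p_i = x_i ⊕ y_i
c_{i+1} = g_i ∨ c_i p_i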
22Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry Network is the Essence of a Fast Adder
The main part of an adder is the carry network. The rest is just a set of gates to produce the g and p signals and the sum bits.
[Figure: general structure of an adder. Each position i forms g_i and p_i from x_i and y_i; the carry network takes all (g_i, p_i) and c_0 and produces c_1 … c_k; each sum bit s_i is formed from p_i and c_i.]
g_i  p_i  Carry is:
0    0    annihilated or killed
0    1    propagated
1    0    generated
1    1    (impossible)
g_i = x_i y_i, p_i = x_i ⊕ y_i
Carry network options: ripple; skip; lookahead; parallel-prefix
23Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry Propagation Network of a Ripple-Carry Adder
[Figure: carry propagation network of a k-bit ripple-carry adder, with inputs (g_i, p_i) for positions 0 to k−1 and carries c_0 … c_k.]
The carry recurrence: c_{i+1} = g_i ∨ p_i c_i
Latency of k-bit adder is roughly 2k gate delays:
• 1 gate delay for production of p and g signals, plus
• 2(k – 1) gate delays for carry propagation, plus
• 1 XOR gate delay for generation of the sum bits
24Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Lookahead Adders
Chapter Goals
Understand the carry-lookahead method and its many variations used in the design of fast adders
Chapter Highlights
• Single- and multilevel carry lookahead
• Various designs for log-time adders
• Relating the carry determination problem to parallel prefix computation
• Implementing fast adders in VLSI
25Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Unrolling the Carry Recurrence
Recall the generate, propagate, annihilate (absorb), and transfer signals:
Signal   Radix r                          Binary
g_i      is 1 iff x_i + y_i ≥ r           x_i y_i
p_i      is 1 iff x_i + y_i = r – 1       x_i ⊕ y_i
a_i      is 1 iff x_i + y_i < r – 1       x_i′ y_i′ = (x_i ∨ y_i)′
t_i      is 1 iff x_i + y_i ≥ r – 1       x_i ∨ y_i
s_i      (x_i + y_i + c_i) mod r          x_i ⊕ y_i ⊕ c_i
The carry recurrence can be unrolled to obtain each carry signal directly from inputs, rather than through propagation:
c_i = g_{i–1} ∨ c_{i–1} p_{i–1}
    = g_{i–1} ∨ (g_{i–2} ∨ c_{i–2} p_{i–2}) p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ c_{i–2} p_{i–2} p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ g_{i–3} p_{i–2} p_{i–1} ∨ c_{i–3} p_{i–3} p_{i–2} p_{i–1}
    = g_{i–1} ∨ g_{i–2} p_{i–1} ∨ g_{i–3} p_{i–2} p_{i–1} ∨ g_{i–4} p_{i–3} p_{i–2} p_{i–1} ∨ c_{i–4} p_{i–4} p_{i–3} p_{i–2} p_{i–1}
    = …
where each p_j can be replaced with t_j.
26Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Four-Bit Carry-Lookahead Adder (1/2)
Complexity reduced by deriving the carry-out indirectly: c4 = g3 ∨ c3 p3
[Figure: four-bit lookahead carry network with inputs g0 to g3, p0 to p3, and c0, and outputs c1 to c4.]
Full carry lookahead is quite practical for a 4-bit adder:
c1 = g0 ∨ c0 p0
c2 = g1 ∨ g0 p1 ∨ c0 p0 p1
c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2
c4 = g3 ∨ g2 p3 ∨ g1 p2 p3 ∨ g0 p1 p2 p3 ∨ c0 p0 p1 p2 p3
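The four unrolled equations translate directly into code in which every carry depends only on the g, p, and c0 inputs. The following Python sketch (with the hypothetical names `cla4_carries` and `cla4_add`) checks them exhaustively against integer addition.

```python
def cla4_carries(g, p, c0):
    """Carries c1..c4 of a 4-bit block from the unrolled lookahead equations.

    g and p are lists [g0, g1, g2, g3] and [p0, p1, p2, p3] of 0/1 values.
    """
    g0, g1, g2, g3 = g
    p0, p1, p2, p3 = p
    c1 = g0 | (c0 & p0)
    c2 = g1 | (g0 & p1) | (c0 & p0 & p1)
    c3 = g2 | (g1 & p2) | (g0 & p1 & p2) | (c0 & p0 & p1 & p2)
    c4 = (g3 | (g2 & p3) | (g1 & p2 & p3) | (g0 & p1 & p2 & p3)
          | (c0 & p0 & p1 & p2 & p3))
    return c1, c2, c3, c4

def cla4_add(x, y, c0=0):
    """4-bit addition using lookahead carries; returns (sum, carry_out)."""
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    g = [a & b for a, b in zip(xb, yb)]
    p = [a ^ b for a, b in zip(xb, yb)]
    c = (c0,) + cla4_carries(g, p, c0)
    s = sum((p[i] ^ c[i]) << i for i in range(4))   # s_i = p_i XOR c_i
    return s, c[4]

for x in range(16):
    for y in range(16):
        for c0 in (0, 1):
            s, cout = cla4_add(x, y, c0)
            assert s + 16 * cout == x + y + c0
```

Note that no carry expression refers to another carry, which is what removes the ripple dependence at the price of wider gates.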
27Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Four-Bit Carry-Lookahead Adder (2/2)
Source: Ercegovac and Lang, “Digital Arithmetic,” MKP
28Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry Lookahead Beyond 4 Bits
Consider a 32-bit adder:
c1 = g0 ∨ c0 p0
c2 = g1 ∨ g0 p1 ∨ c0 p0 p1
c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2
. . .
c31 = g30 ∨ g29 p30 ∨ g28 p29 p30 ∨ g27 p28 p29 p30 ∨ . . . ∨ c0 p0 p1 p2 p3 ... p29 p30
The last term of c31 needs a 32-input AND, and combining all of its terms needs a 32-input OR. High fan-ins necessitate tree-structured circuits.
For wide words, full carry lookahead is impractical.
29Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Two Schemes to Manage the Complexity
High-radix addition (i.e., radix 2^h)
Increases the latency for generating g and p signals and sum digits, but simplifies the carry network (optimal radix?)
Multilevel lookahead
Example: 16-bit addition
• Radix-16 (four digits)
• Two-level carry lookahead (four 4-bit blocks)
Either way, the carries c4, c8, and c12 are determined first
c16 (cout)   c15 c14 c13   c12 (?)   c11 c10 c9   c8 (?)   c7 c6 c5   c4 (?)   c3 c2 c1   c0 (cin)
30Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
One-Level Carry-Lookahead Adder
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.72.
31Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Block Generate and Propagate signals
Block generate and propagate signals:
g[i,i+3] = g_{i+3} ∨ g_{i+2} p_{i+3} ∨ g_{i+1} p_{i+2} p_{i+3} ∨ g_i p_{i+1} p_{i+2} p_{i+3}
p[i,i+3] = p_i p_{i+1} p_{i+2} p_{i+3}
[Figure: 4-bit lookahead carry generator with inputs c_i and (g_i, p_i) … (g_{i+3}, p_{i+3}), and outputs c_{i+1}, c_{i+2}, c_{i+3}, g[i,i+3], p[i,i+3].]
Note: g[i,i+3] and p[i,i+3] are unrelated to c_i.
c_k = g[0,k–1] ∨ c_0 p[0,k–1]
c_{i+4} = g[i,i+3] ∨ c_i p[i,i+3]
32Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
4-bit Lookahead Carry Generator
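[Figure: gate-level 4-bit lookahead carry generator with inputs c_i, g_i to g_{i+3}, and p_i to p_{i+3}; one part produces the intermediate carries c_{i+1}, c_{i+2}, c_{i+3}, and the other produces the block signals g[i,i+3] and p[i,i+3].]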
33Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
A Two-Level Carry-Lookahead Adder (64 bits)
[Figure: two-level carry-lookahead adder (64 bits). Within each 16-bit carry-lookahead adder, a 4-bit lookahead carry generator takes (g, p) for blocks [0,3], [4,7], [8,11], [12,15] and produces c4, c8, c12. A second-level 4-bit lookahead carry generator takes (g, p) for blocks [0,15], [16,31], [32,47], [48,63] and produces c16, c32, c48 together with g[0,63] and p[0,63].]
c4, c8, and c12 are the c_{i+1}, c_{i+2}, and c_{i+3}, respectively, of the previous slide.
c_k = g[0,k–1] ∨ c_0 p[0,k–1]
34Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Latency of a 16-bit 2-Level Carry-Lookahead Adder (1/2)
(Level 1) g and p for individual bit positions: 1 gate level
(Level 1) g and p signals for 4-bit blocks, i.e., g[0,3], p[0,3], …, g[12,15], p[12,15]: 2 gate levels
(Level 2) Block carry-in signals c4, c8, and c12, and g[0,15], p[0,15]: 2 gate levels
(Level 1) Internal carries within 4-bit blocks c1, c2, c3, c5, …; (Level 2) C15 if required: 2 gate levels
(Level 1) Sum bits (XOR): 2 gate levels ???
35Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Latency of a 16-bit 2-Level Carry-Lookahead Adder (2/2)
Total latency for the 16-bit adder is 9 gate levels
Each additional lookahead level adds 4 gate levels of latency (yellow block in last slide)
Latency for k-bit CLA adder: 4 log4 k + 1 gate levels
36Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Combining of g and p signals
Combining of g and p signals of two (contiguous or overlapping) blocks B′ and B″ of arbitrary widths into the g and p signals for block B.
[Figure: block B′ spans positions i0 to i1 and block B″ spans positions j0 to j1; a ¢ cell takes (g′, p′) and (g″, p″) and outputs (g, p) for the combined block B.]
(g, p) = (g′, p′) ¢ (g″, p″)
g = g″ ∨ g′p″
p = p′p″
37Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Formulating the Prefix Computation Problem
The problem of carry determination can be formulated as:
Given (g0, p0) (g1, p1) . . . (g_{k–2}, p_{k–2}) (g_{k–1}, p_{k–1})
Find (g[0,0], p[0,0]) (g[0,1], p[0,1]) . . . (g[0,k–2], p[0,k–2]) (g[0,k–1], p[0,k–1]), which correspond to the carries c1 c2 . . . c_{k–1} c_k
Carry-in can be viewed as an extra (−1) position: (g_{–1}, p_{–1}) = (cin, 0)
The desired pairs are found by evaluating all prefixes of
(g0, p0) ¢ (g1, p1) ¢ . . . ¢ (g_{k–2}, p_{k–2}) ¢ (g_{k–1}, p_{k–1})
The carry operator ¢ is associative, but not commutative:
[(g1, p1) ¢ (g2, p2)] ¢ (g3, p3) = (g1, p1) ¢ [(g2, p2) ¢ (g3, p3)]
Prefix sums analogy:
Given x0 x1 x2 . . . x_{k–1}
Find x0, x0+x1, x0+x1+x2, . . . , x0+x1+...+x_{k–1}
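The prefix formulation can be tried out with Python's `itertools.accumulate`, which evaluates all prefixes of a sequence under a given binary operator. In this sketch, `carry_op` implements ¢ with g = g″ ∨ g′p″ and p = p′p″, and the carry-in is placed at an extra position −1 as (cin, 0); the function names are assumptions made for illustration, and the scan here is sequential rather than parallel.

```python
from itertools import accumulate, product

def carry_op(low, high):
    """(g, p) = (g', p') ¢ (g'', p''), with `low` the less significant block."""
    g1, p1 = low
    g2, p2 = high
    return (g2 | (g1 & p2), p1 & p2)

def prefix_carries(x, y, k, cin=0):
    """Return [c_0, c_1, ..., c_k] by evaluating all prefixes under ¢."""
    pairs = [(cin, 0)]                       # extra position -1 carries cin
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        pairs.append((xi & yi, xi ^ yi))     # (g_i, p_i)
    return [g for g, _ in accumulate(pairs, carry_op)]

# Associativity holds for every triple of (g, p) pairs ...
all_pairs = list(product((0, 1), repeat=2))
for a, b, c in product(all_pairs, repeat=3):
    assert carry_op(carry_op(a, b), c) == carry_op(a, carry_op(b, c))

# ... but the operator is not commutative.
assert carry_op((1, 0), (0, 0)) != carry_op((0, 0), (1, 0))
```

Associativity is what allows the prefixes to be regrouped into the tree-structured networks of the following slides.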
38Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Prefix-Based Carry Network
[Figure: four-input prefix sums network (built of + cells, with the scan order marked) next to the analogous four-bit carry lookahead network (built of ¢ cells). The carry network maps (g0, p0), (g1, p1), (g2, p2), (g3, p3) to (g[0,0], p[0,0]) = (c1, --), (g[0,1], p[0,1]) = (c2, --), (g[0,2], p[0,2]) = (c3, --), (g[0,3], p[0,3]) = (c4, --); each ¢ cell combines (g′, p′) and (g″, p″) into (g, p).]
39Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Parallel Prefix Sums Network Built of Two k/2-Input Networks and k/2 Adders(Ladner-Fischer)
Delay recurrence: D(k) = D(k/2) + 1 = log2 k
Cost recurrence: C(k) = 2C(k/2) + k/2 = (k/2) log2 k
Incurs large fanout
[Figure: recursive dividing. One k/2-input prefix sums network handles x_0 … x_{k/2–1} and another handles x_{k/2} … x_{k–1}; k/2 adders then add s_{k/2–1} to the outputs of the upper network to form s_{k/2} … s_{k–1}.]
40Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
a is t in the textbook
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.81
41Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Eliminate Large Fanout
Increase the number of levels
Increase the number of cells
42Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
The Brent-Kung Recursive Construction
Delay recurrence: D(k) = D(k/2) + 2 = 2 log2 k – 1 (–2 really)
Cost recurrence: C(k) = C(k/2) + k – 1 = 2k – 2 – log2 k
Parallel prefix sums network built of one k/2-input network and k – 1 adders.
[Figure: inputs x_{k–1} … x_0 are first combined in adjacent pairs, the k/2 pairwise sums feed a k/2-input prefix sums network, and a final row of adders produces the remaining outputs s_{k–1} … s_0.]
43Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Brent-Kung Carry Network (8-Bit Adder)
[Figure: Brent-Kung carry network for an 8-bit adder, built of ¢ cells. Inputs are the pairs for [7,7] … [0,0]; intermediate signals include [0,1], [2,3], [4,5], [6,7], [0,3], and [4,7]; outputs are [0,7], [0,6], …, [0,0]. An inset shows one cell forming (g[0,1], p[0,1]) from (g[1,1], p[1,1]) and (g[0,0], p[0,0]).]
44Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.83
45Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Brent-Kung Carry Network (16-Bit Adder)
[Figure: 16-input Brent-Kung network with inputs x0 to x15, outputs s0 to s15, and levels numbered 1 to 6.]
Reason for latency being 2 log2 k – 2
46Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Kogge-Stone Carry Network (16-Bit Adder)
[Figure: 16-input Kogge-Stone network with inputs x0 to x15 and outputs s0 to s15.]
log2 k levels (minimum possible)
Cost formula:
C(k) = (k – 1) + (k – 2) + (k – 4) + . . . + (k – k/2)
     = k log2 k – k + 1
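The level and cell counts of the Kogge-Stone network can be reproduced with a small behavioral model. The Python sketch below is an illustration under assumptions (function names are hypothetical, carry-in is fixed at 0, and each loop iteration stands for one level of ¢ cells); it is not a gate-level description.

```python
def carry_op(low, high):
    """(g, p) = (g', p') ¢ (g'', p''): g = g'' OR g'p'', p = p'p''."""
    g1, p1 = low
    g2, p2 = high
    return (g2 | (g1 & p2), p1 & p2)

def kogge_stone(pairs):
    """Return (prefixes, cell count, level count) for a list of (g, p) pairs."""
    cur = list(pairs)
    k = len(cur)
    cells = levels = 0
    d = 1
    while d < k:
        nxt = list(cur)
        for i in range(d, k):                # k - d cells at this level
            nxt[i] = carry_op(cur[i - d], cur[i])
            cells += 1
        cur, levels, d = nxt, levels + 1, 2 * d
    return cur, cells, levels

def ks_add(x, y, k=16):
    """k-bit addition with carry-in 0; returns (sum, carry_out, cells, levels)."""
    bits = [((x >> i) & 1, (y >> i) & 1) for i in range(k)]
    pairs = [(xi & yi, xi ^ yi) for xi, yi in bits]
    prefixes, cells, levels = kogge_stone(pairs)
    carries = [0] + [g for g, _ in prefixes]     # c_0 = 0, c_{i+1} = g[0,i]
    s = sum((pairs[i][1] ^ carries[i]) << i for i in range(k))
    return s, carries[k], cells, levels
```

For k = 16 the model uses 15 + 14 + 12 + 8 = 49 cells in 4 levels, matching k log2 k – k + 1 and log2 k.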
47Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.84
48Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Speed-Cost Tradeoffs in Carry Networks
Method           Delay           Cost
Ladner-Fischer   log2 k          (k/2) log2 k
Kogge-Stone      log2 k          k log2 k – k + 1
Brent-Kung       2 log2 k – 2    2k – 2 – log2 k
Improving the Ladner/Fischer design
[Figure: Ladner-Fischer network built of two k/2-input prefix sums networks, with some outputs of the lower network marked.]
These outputs can be produced one time unit later without increasing the overall latency
This strategy saves enough to make the overall cost linear (best possible)
49Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Hybrid B-K/K-S Carry Network (16-Bit Adder)
[Figure: three 16-input networks with inputs x0 to x15 and outputs s0 to s15: Brent-Kung, Kogge-Stone, and a hybrid with Brent-Kung stages at the top and bottom and Kogge-Stone stages in the middle.]
Brent-Kung: 6 levels, 26 cells
Kogge-Stone: 4 levels, 49 cells
Hybrid: 5 levels, 32 cells
50Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Four-Bit Manchester Carry Chains (Transistor Level)
[Figure: four-bit Manchester carry chains at the transistor level, clocked by PH2, with inputs g0 to g3 and p0 to p3: (a) chain producing the block signals g[0,3] and p[0,3]; (b) chain that also produces g[0,1], p[0,1], g[0,2], and p[0,2].]
51Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Variations in Fast Adders
Chapter Goals
Study alternatives to the carry-lookahead method for designing fast adders
Chapter Highlights
• Many methods besides CLA are available (both competing and complementary)
• Best design is technology-dependent (often hybrid rather than pure)
• Knowledge of timing allows optimizations
52Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Simple Carry-Skip Adders
[Figure: (a) 16-bit ripple-carry adder drawn as four 4-bit blocks of ripple-carry stages, with carries c0, c4, c8, c12, c16; (b) simple carry-skip adder, in which skip logic (2 gates) driven by the block propagate signals p[0,3], p[4,7], p[8,11], p[12,15] lets the carry bypass a block.]
53Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Skip Adder Using MUX
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.66.
54Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Another View of Carry-Skip Addition
Street/freeway analogy for carry-skip adder.
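[Figure: a 4-bit block with inputs (g_{4j}, p_{4j}) … (g_{4j+3}, p_{4j+3}) and carries c_{4j} … c_{4j+4}; the ripple path through the block is the "one-way street" and the skip path from c_{4j} to c_{4j+4} is the "freeway".]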
55Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Skip Adder with Fixed Block Size
Block width b; k/b blocks to form a k-bit adder (assume b divides k)
T_fixed-skip-add = (b – 1) [in block 0] + 0.5 [OR gate] + (k/b – 2) [skips] + (b – 1) [in last block]
                 ≅ 2b + k/b – 3.5 stages
dT/db = 2 – k/b^2 = 0 ⇒ b_opt = √(k/2)
T_opt = 2√(2k) – 3.5
Example: k = 32, b_opt = 4, T_opt = 12.5 stages (contrast with 32 stages for a ripple-carry adder)
1 stage = 2 gate levels
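The optimum block width can also be found numerically by evaluating the delay expression for every block width that divides k. The Python sketch below does this for the k = 32 example; the names `t_fixed_skip` and `best_b` are illustrative assumptions.

```python
import math

def t_fixed_skip(k, b):
    """Worst-case delay, in stages, of a k-bit carry-skip adder with block width b."""
    # ripple in block 0 + OR gate + skips over middle blocks + ripple in last block
    return (b - 1) + 0.5 + (k / b - 2) + (b - 1)

k = 32
divisors = [b for b in range(1, k + 1) if k % b == 0]
best_b = min(divisors, key=lambda b: t_fixed_skip(k, b))

# The numeric search agrees with b_opt = sqrt(k/2) and T_opt = 2*sqrt(2k) - 3.5.
assert best_b == round(math.sqrt(k / 2))
assert math.isclose(t_fixed_skip(k, best_b), 2 * math.sqrt(2 * k) - 3.5)
```

For word widths where √(k/2) is not an integer divisor of k, the search over divisors gives the best realizable fixed width, which may differ slightly from the continuous optimum.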
56Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Worst Case Delay
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.67-68.
57Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Worst case in block 0 (c0 = 0): 1111 + 0001
Worst case in last block (c12 = 1): 0111 + 0000
58Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Skip Adder with Variable-Width Blocks (1/2)
[Figure: carry-skip adder with variable block widths b_{t–1}, b_{t–2}, …, b_1, b_0, showing the ripple and skip segments of carry paths (1), (2), and (3).]
Carry path (2) goes through one fewer skip than (1), so block t-2 can be one bit wider than block t-1 without increasing the total delay.
Carry path (3) goes through one fewer skip than (1), so block 1 can be one bit wider than block 0 without increasing the total delay.
59Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Skip Adder with Variable-Width Blocks (2/2)
The total number of bits in the t blocks is k:
2[b + (b + 1) + . . . + (b + t/2 – 1)] = t(b + t/4 – 1/2) = k
b = k/t – t/4 + 1/2
T_var-skip-add = 2(b – 1) + 0.5 + t – 2 = 2k/t + t/2 – 2.5
dT/dt = –2k/t^2 + 1/2 = 0 ⇒ t_opt = 2√k
T_opt = 2√k – 2.5 (a factor of √2 smaller than for fixed-block)
Substituting t_opt into the expression for b gives b = 1/2, so let b = 1
60Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Multilevel Carry-Skip Adders
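[Figure: multilevel carry-skip adders. Level-1 skip logic (S1) bypasses individual blocks between c_in and c_out, and level-2 skip logic (S2) bypasses a group of level-1 blocks; three variants of the arrangement are shown.]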
61Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Single-Level Carry-Skip Adder (Example 7.1)
Assumptions: Each of the following takes one unit of time: generation of g_i and p_i, generation of level-i skip signal from level-(i–1) skip signals, ripple, skip, and formation of sum bit once the incoming carry is known
Build the widest possible one-level carry-skip adder with total delay of 8
[Figure: one-level carry-skip adder with blocks b6 … b0 between c_out and c_in, skip logic S1 on the inner blocks, and signal arrival times marked at the block boundaries.]
Stage b0 takes 2 time units: one for generating gp and the other for generating carry.
Stage b1 cannot be more than 3 bits, because its output is available at time 3, so it can take one time unit for generating gp and two for propagation across 2 bits.
At the right end, block width is limited by the output timing requirement.
62Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Generalization of Example 7.1 for total time T (even or odd):
T even: 1 2 3 . . . T/2 T/2 . . . 4 3 1
T odd:  1 2 3 . . . (T + 1)/2 . . . 4 3 1
Thus, for any T, the total width is ⌊(T + 1)^2/4⌋ – 2
Stage b4 cannot be more than 3 bits, because its input becomes available at time 5 and the total adder delay is to be 8 units.
Max adder width = 18 (1 + 2 + 3 + 4 + 4 + 3 + 1)
At the left end, block width is limited by input timing.
63Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Two-Level Carry-Skip Adder (1/2)
Given the delay pair {β, α} for a level-2 block in Fig. 7.7a, the number of level-1 blocks that can be accommodated is γ = min(β–1, α)
Example 7.2
[Figure: (a) single-level carry-skip adder with T_assimilate = α; (b) single-level carry-skip adder with T_produce = β.]
Width of the ith level-1 block in the level-2 block characterized by {β, α} is b_i = min(β – γ + i + 1, α – i); the total block width is then ∑_{i=0}^{γ–1} b_i
64Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Two-Level Carry-Skip Adder (2/2)
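Max adder width = 30 (4 + 8 + 8 + 6 + 3 + 1)
[Figure: (a) two-level carry-skip adder with six level-2 blocks, labeled A to F, whose {T_produce, T_assimilate} pairs are {8, 1}, {7, 2}, {6, 3}, {5, 4}, {4, 5}, {3, 8}, with level-2 skip logic S2; (b) the resulting level-1 block widths within each level-2 block, with timing from t = 0 at c_in to t = 8 at c_out.]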
65Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Skip Adder Optimization Scheme
[Figure: block of b full-adder units with level-h skip logic, annotated with the timing quantities I(b), A(b), G(b), E_h(b), and S_h(b).]
66Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Select Adders
C_select-add(k) = 3 C_add(k/2) + k/2 + 1
T_select-add(k) = T_add(k/2) + 1
[Figure: carry-select adder for k-bit numbers built from three k/2-bit adders. One k/2-bit adder adds the low k/2 bits with carry-in c_in; two k/2-bit adders add the high k/2 bits with carry-ins 0 and 1; a mux controlled by c_{k/2} selects the (k/2 + 1)-bit result that supplies c_out and the high k/2 sum bits.]
67Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Two-level Carry-Select Adder Built of k/4-bit adders
[Figure: two-level carry-select adder. The low k/4 bits use one k/4-bit adder with carry-in c_in; the middle k/4 bits and the two halves of the high k/2 bits each use a pair of k/4-bit adders with carry-ins 0 and 1; muxes controlled by c_{k/4} and c_{k/2} select the results, producing c_out, the high k/2 bits, the middle k/4 bits, and the low k/4 bits.]
68Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Conditional Adder
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.86
69Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry Select Adder
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.87
70Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Conditional Sum Adder
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.87
71Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
16-Bit Conditional Sum Adder
The same as Fig. 7.20 in textbook
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.89
72Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Conditional-Sum Adder
Multilevel carry-select idea carried out to the extreme (to 1-bit blocks).
C(k) ≅ 2C(k/2) + k + 2 ≅ k (log2 k + 2) + k C(1)
T(k) = T(k/2) + 1 = log2 k + T(1)
where C(1) and T(1) are the cost and delay of the following circuit for deriving the sum and carry bits with a carry-in of 0 and 1
[Figure: circuit that takes x_i and y_i and produces s_i and c_{i+1} both for c_i = 0 and for c_i = 1.]
k + 2 is an upper bound on number of single-bit 2-to-1 multiplexers needed for combining two k/2-bit adders into a k-bit adder
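The recursive structure behind these recurrences can be modeled in a few lines. The Python sketch below is a behavioral illustration with hypothetical names: every block returns its result for both possible carry-ins, and two half-width blocks are merged by selection, which stands in for the multiplexers. Bit lists are least-significant-bit first.

```python
def conditional_sum(x_bits, y_bits):
    """Return ((sum bits, carry) for cin = 0, (sum bits, carry) for cin = 1)."""
    k = len(x_bits)
    if k == 1:
        x, y = x_bits[0], y_bits[0]
        # 1-bit block: sum and carry for carry-in 0 and for carry-in 1.
        return ([x ^ y], x & y), ([1 ^ x ^ y], x | y)
    h = k // 2
    lo0, lo1 = conditional_sum(x_bits[:h], y_bits[:h])
    hi0, hi1 = conditional_sum(x_bits[h:], y_bits[h:])

    def merge(lo):
        s_lo, c_lo = lo
        s_hi, c_hi = hi1 if c_lo else hi0    # mux: select on the low block's carry
        return (s_lo + s_hi, c_hi)

    return merge(lo0), merge(lo1)

def cs_add(x, y, k, cin=0):
    """Add two k-bit unsigned integers; returns (sum, carry_out)."""
    xb = [(x >> i) & 1 for i in range(k)]
    yb = [(y >> i) & 1 for i in range(k)]
    r0, r1 = conditional_sum(xb, yb)
    bits, cout = r1 if cin else r0
    return sum(b << i for i, b in enumerate(bits)), cout
```

The recursion depth is log2 k for power-of-two widths, mirroring T(k) = T(k/2) + 1; in hardware all 1-bit blocks operate in parallel, which this sequential model does not capture.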
73Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
A Hybrid Carry-Lookahead/Carry-Select Adder
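The most popular hybrid addition scheme:
[Figure: a lookahead carry generator takes c_in and the block (g, p) signals and produces the block carries; each carry-select block computes its results for carry-in 0 and 1, and a mux per block, controlled by the corresponding lookahead carry, selects the correct one; the generator also supplies c_out.]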
74Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Summary
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.114.
75Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
A Hybrid Ripple-Carry/Carry-Lookahead Design
Any Two Addition Schemes Can Be Combined
Other possibilities: hybrid carry-select/ripple-carry, hybrid ripple-carry/carry-select, . . .
[Figure: hybrid ripple-carry/carry-lookahead design in which 16-bit carry-lookahead adders (with carry-out) are chained through c16, c32, and c48; each contains a 4-bit lookahead carry generator fed by (g, p) for blocks [0,3], [4,7], [8,11], [12,15] and producing c4, c8, c12.]
76Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Optimizations in Fast Adders
What looks best at the block diagram or gate level may not be best when a circuit-level design is generated (effects of wire length, signal loading, . . . )
Modern practice: Optimization at the transistor level
Variable-block carry-lookahead adder
Optimizations for average or peak power consumption
Timing-based optimizations (next slide)
77Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Multioperand Addition
Chapter Goals
Learn methods for speeding up the addition of several numbers (needed for multiplication or inner-product)
Chapter Highlights
• Running total kept in redundant form
• Current total + Next number → New total
• Deferred carry assimilation
• Wallace/Dadda trees and parallel counters
78Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Some Applications of Multioperand Addition
[Figure: dot-notation diagrams of a 4 × 4 multiplication (multiplicand a, multiplier x, partial products x_0 a 2^0 to x_3 a 2^3, product p) and of a seven-operand addition (operands p^(0) to p^(6), sum s).]
Multioperand addition problems for multiplication or inner-product computation in dot notation.
79Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Serial Implementation with One Adder
T_serial-multi-add = O(n log(k + log n)) = O(n log k + n log log n)
Therefore, addition time grows superlinearly with n when k is fixed and logarithmically with k for a given n
[Figure: serial multioperand addition with one adder: the k-bit operand x^(i) is added to the (k + log2 n)-bit partial sum register holding ∑_{j=0}^{i–1} x^(j).]
80Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Pipelined Adder
Source: Ercegovac and Lang, “Digital Arithmetic”, pp.166.
81Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Parallel Implementation as Tree of Adders
Adding 7 numbers in a binary tree of adders.
[Figure: binary tree of six adders summing seven k-bit numbers; intermediate results are k+1 and k+2 bits wide and the final result is k+3 bits wide. The tree has ⌈log2 n⌉ adder levels and n – 1 adders.]
T_tree-fast-multi-add = O(log k + log(k + 1) + . . . + log(k + ⌈log2 n⌉ – 1))
                      = O(log n log k + log n log log n)
T_tree-ripple-multi-add = O(k + log n) [Justified on the next slide]
82Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Elaboration on Tree of Ripple-Carry Adders
Ttree-ripple-multi-add = O(k + log n)
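[Figure: the binary tree of adders from the previous slide.]
Fig. 8.5 Ripple-carry adders at levels i and i + 1 in the tree of adders used for multi-operand addition.
[Figure: the HA/FA cells at level i produce their outputs at times t, t+1, t+2, …; the cells at level i + 1 can start as soon as these bits arrive and produce their outputs at times t+1, t+2, t+3, …, so each tree level adds only a constant delay.]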
The absolute best latency that we can hope for is O(log k + log n)
There are kn data bits to process and using any set of computation elements with constant fan-in, this requires O(log(kn)) time
We will see shortly that carry-save adders achieve this optimum time
83Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Save Adders
[Figure: a row of full adders connected as a ripple-carry (carry-propagate) adder with c_in and c_out, and the same row with the carry links cut to form a carry-save adder. A carry-save adder (CSA) is also called a (3; 2)-counter or 3-to-2 reduction circuit.]
[Figure: specifying full- and half-adder blocks, with their inputs and outputs, in dot notation.]
84Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of CSA
Also considered as reduction by column [3:2].
[p:q] counter: takes p bits of the same weight and produces q bits of adjacent weights.
Reduction by row: (3:2) counter, reducing 3 rows to 2
85Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Use Dot Notation
[Figure: a carry-propagate adder (with c_in and c_out) and a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit, drawn in dot notation.]
86Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Multioperand Addition Using Carry-Save Adders
Tree of carry-save adders reducing seven numbers to two.
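[Figure: five CSAs arranged in a tree, followed by a carry-propagate adder.]
T_carry-save-multi-add = O(tree height + T_CPA) = O(log n + log k)
C_carry-save-multi-add = (n – 2) C_CSA + C_CPA
Serial carry-save addition using a single CSA.
[Figure: the input and the contents of the sum and carry registers feed a CSA whose outputs are written back to the two registers; a CPA converts the final sum and carry words to the output.]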
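Carry-save reduction is easy to model on whole integers, because a CSA is just a bitwise full adder whose carry word is shifted left by one position. The Python sketch below (function names are assumptions, operands are assumed non-negative) reduces n operands to two and then performs a single carry-propagate addition. It applies the CSAs one after another rather than in a balanced tree, so it illustrates the cost of n – 2 CSAs, not the logarithmic tree height.

```python
def carry_save_add(a, b, c):
    """Bitwise (3; 2) reduction: three numbers in, sum word and carry word out."""
    s = a ^ b ^ c                                  # per-position sum bits
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-position carries, weight doubled
    return s, carry

def multi_operand_add(operands):
    """Reduce the operands to two with CSAs, then do one carry-propagate addition."""
    ops = list(operands)
    csa_count = 0
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(carry_save_add(a, b, c))        # 3 operands become 2
        csa_count += 1
    return sum(ops), csa_count                     # final CPA step

total, used = multi_operand_add([5, 9, 12, 7, 3, 1, 14])
assert total == 5 + 9 + 12 + 7 + 3 + 1 + 14
assert used == 7 - 2
```

Each CSA step preserves the invariant a + b + c = s + carry, which is why carry propagation can be deferred to the very end.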
87Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Reduction by a CSA Tree
Addition of seven 6-bit numbers in dot notation.
[Figure: dot-notation reduction using 12 FAs, 6 FAs, 6 FAs, and 4 FAs + 1 HA at successive levels, followed by a 7-bit adder.]
Total cost = 7-bit adder + 28 FAs + 1 HA
Representing a seven-operand addition in tabular form:
Bit position    8  7  6  5  4  3  2  1  0
                         7  7  7  7  7  7   6×2 = 12 FAs
                      2  5  5  5  5  5  3   6 FAs
                      3  4  4  4  4  4  1   6 FAs
                   1  2  3  3  3  3  2  1   4 FAs + 1 HA
                   2  2  2  2  2  1  2  1   7-bit adder
                -- Carry-propagate adder --
                1  1  1  1  1  1  1  1  1
A full-adder compacts 3 dots into 2 (compression ratio of 1.5)
A half-adder rearranges 2 dots (no compression, but still useful)
88Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Width of Adders in a CSA Tree
Adding seven k-bit numbers and the CSA/CPA widths required.
Due to the gradual retirement (dropping out) of some of the result bits, CSA widths do not vary much as we go down the tree levels
[Figure: tree of five k-bit CSAs followed by a k-bit CPA. The inputs occupy bit positions [0, k–1]; intermediate sum and carry words occupy [0, k–1], [1, k], [1, k–1], [1, k+1], and [2, k+1]; the result occupies positions 0 through k+2.]
The index pair [i, j] means that bit positions from i up to j are involved.
Bit k+1 does not involve addition
89Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Wallace and Dadda Trees
h(n) = 1 + h(⌈2n/3⌉)
n(h) = ⌊3n(h – 1)/2⌋
2 × 1.5^{h–1} < n(h) ≤ 2 × 1.5^h
[Figure: an h-level CSA tree reducing n inputs to 2 outputs.]
Table 8.1 The maximum number n(h) of inputs for an h-level CSA tree
––––––––––––––––––––––––––––––––––––
 h   n(h)     h   n(h)     h   n(h)
––––––––––––––––––––––––––––––––––––
 0     2      7    28     14    474
 1     3      8    42     15    711
 2     4      9    63     16   1066
 3     6     10    94     17   1599
 4     9     11   141     18   2398
 5    13     12   211     19   3597
 6    19     13   316     20   5395
––––––––––––––––––––––––––––––––––––
n(h): Maximum number of inputs for h levels
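The entries of Table 8.1 follow from the recurrence n(h) = ⌊3n(h – 1)/2⌋ with n(0) = 2, and the minimum tree height for n operands is the smallest h with n(h) ≥ n. A short Python sketch (with the hypothetical names `max_inputs` and `min_levels`) reproduces both.

```python
def max_inputs(h):
    """n(h): maximum number of operands an h-level CSA tree can reduce to two."""
    n = 2                      # n(0) = 2: two operands need no CSA level
    for _ in range(h):
        n = 3 * n // 2         # n(h) = floor(3 n(h-1) / 2)
    return n

def min_levels(n):
    """h(n): minimum number of CSA levels needed to reduce n operands to two."""
    h = 0
    while max_inputs(h) < n:
        h += 1
    return h

# First column of Table 8.1, and the bounds 2*1.5^(h-1) < n(h) <= 2*1.5^h.
assert [max_inputs(h) for h in range(7)] == [2, 3, 4, 6, 9, 13, 19]
for h in range(21):
    assert 2 * 1.5 ** (h - 1) < max_inputs(h) <= 2 * 1.5 ** h
```

For instance, `min_levels(7)` returns 4, consistent with the four CSA levels used when adding seven numbers.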
90Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Wallace and Dadda Reduction Trees
Wallace tree: Reduce the number of operands at the earliest possible opportunity
Dadda tree: Postpone the reduction to the extent possible without causing added delay
Addition of seven 6-bit numbers using Wallace strategy: 12 FAs, 6 FAs, 6 FAs, 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA
Adding seven 6-bit numbers using Dadda’s strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA
h   n(h)
2    4
3    6
4    9
5   13
6   19
91Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
A Small Optimization in Reduction Trees
Adding seven 6-bit numbers using Dadda’s strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA
Adding seven 6-bit numbers using Dadda’s strategy, taking advantage of the final adder’s carry-in: 6 FAs, 11 FAs, 6 FAs + 1 HA, 3 FAs + 2 HA, then a 7-bit adder.
Total cost = 7-bit adder + 26 FAs + 3 HA
92Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Parallel Counters
A 10-input parallel counter also known as a (10; 4)-counter.
[Figure: 10-input parallel counter built of full adders, half adders, and a 3-bit ripple-carry adder, with the bit weights of the intermediate signals marked.]
1-bit full-adder = (3; 2)-counter
Circuit reducing 7 bits to their 3-bit sum = (7; 3)-counter
Circuit reducing n bits to their ⌈log2(n + 1)⌉-bit sum = (n; ⌈log2(n + 1)⌉)-counter
93Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Implementation of [4:2] Counter
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 145.
94Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Implementation of [5:2] Counter
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 146.
95Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Implementation of [7:2] Counter
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 146.
96Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
Generalized Parallel Counters
Dot notation for a (5, 5; 4)-counter and the use of such counters for reducing five numbers to two numbers.
Multicolumn reduction
(2, 3; 3)-counter
Unequal columns
Gen. parallel counter = Parallel compressor
97Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan
A General Strategy for Column Compression
n + ψ1 + ψ2 + ψ3 + . . . ≤ 3 + 2ψ1 + 4ψ2 + 8ψ3 + . . .
n – 3 ≤ ψ1 + 3ψ2 + 7ψ3 + . . .
[Figure: one circuit slice of an (n; 2)-counter in column i, with n inputs, transfers ψ1, ψ2, ψ3 sent to columns i + 1, i + 2, i + 3, and transfers ψ1, ψ2, ψ3 received from columns i – 1, i – 2, i – 3]
Example: Design a bit-slice of an (11; 2)-counter
Solution: Let’s limit transfers to two stages. Then, 8 ≤ ψ1 + 3ψ2
Possible choices include ψ1 = 5, ψ2 = 1 or ψ1 = ψ2 = 2
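The inequality n – 3 ≤ ψ1 + 3ψ2 + 7ψ3 + . . . is easy to explore numerically. The sketch below is only an illustration; satisfies and tight_two_stage are hypothetical helper names. It checks a candidate set of transfers and lists the two-stage choices that meet the bound with equality.

```python
def satisfies(n, psis):
    """True if n - 3 <= psi1 + 3*psi2 + 7*psi3 + ... for psis = (psi1, psi2, ...)."""
    capacity = sum(((1 << (i + 1)) - 1) * psi for i, psi in enumerate(psis))
    return n - 3 <= capacity


def tight_two_stage(n):
    """All (psi1, psi2) with psi1 + 3*psi2 exactly equal to n - 3."""
    need = n - 3
    return [(need - 3 * psi2, psi2) for psi2 in range(need // 3 + 1)]


if __name__ == "__main__":
    print(tight_two_stage(11))
    print(satisfies(11, (5, 1)), satisfies(11, (2, 2)), satisfies(11, (4, 1)))
```

For the (11; 2)-counter this lists ψ1 = 8, ψ2 = 0 alongside the two choices given in the example.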
1Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Dept. of EE, FJU, Taiwan
Room: SF 727B
Most slides originate from the textbook author’s PowerPoint presentation files.
2Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
III Multiplication
Chapter 9 Basic Multiplication Schemes
Chapter 10 High-Radix Multipliers
Chapter 11 Tree and Array Multipliers
Chapter 12 Variations in Multipliers
Topics in This Part
Review multiplication schemes and various speedup methods
• Multiplication is heavily used (in arithmetic & array indexing)
• Division = reciprocation + multiplication
• Multiplication speedup: high-radix, tree, . . .
• Bit-serial, modular, and array multipliers
3Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
9 Basic Multiplication Schemes
Chapter Goals
Study shift/add or bit-at-a-time multipliers and set the stage for faster methods and variations to be covered in Chapters 10-12
Chapter Highlights
Multiplication = multioperand addition
Hardware, firmware, software algorithms
Multiplying 2’s-complement numbers
The special case of one constant operand
4Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Shift/Add Multiplication Algorithms
Notation for our discussion of multiplication algorithms:
a   Multiplicand       a_(k–1) a_(k–2) . . . a_1 a_0
x   Multiplier         x_(k–1) x_(k–2) . . . x_1 x_0
p   Product (a × x)    p_(2k–1) p_(2k–2) . . . p_3 p_2 p_1 p_0
Initially, we assume unsigned operands
Multiplication of two 4-bit unsigned binary numbers in dot notation.
[Figure: dot-notation diagram showing the multiplicand a, the multiplier x, the partial products bit-matrix with rows x0 a 2^0, x1 a 2^1, x2 a 2^2, x3 a 2^3, and the product p]
5Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication Recurrence
Multiplication with right shifts: top-to-bottom accumulation (preferred)
p(j+1) = (p(j) + x_j a 2^k) 2^(–1)     with p(0) = 0 and p(k) = p = ax + p(0) 2^(–k)
          |––––––add––––––|
         |–––––––shift right–––––––|
Multiplication with left shifts: bottom-to-top accumulation
p(j+1) = 2p(j) + x_(k–j–1) a     with p(0) = 0 and p(k) = p = ax + p(0) 2^k
         |shift|
         |–––––––add–––––––|
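Both recurrences translate directly into short loops. The following Python sketch assumes unsigned k-bit operands and p(0) = 0; the function names are illustrative only.

```python
def multiply_right_shift(a, x, k):
    """p(j+1) = (p(j) + x_j * a * 2^k) / 2, scanning the multiplier from the LSB."""
    p = 0
    for j in range(k):
        x_j = (x >> j) & 1
        p = (p + x_j * (a << k)) >> 1     # add, then shift right by one
    return p


def multiply_left_shift(a, x, k):
    """p(j+1) = 2 p(j) + x_(k-j-1) * a, scanning the multiplier from the MSB."""
    p = 0
    for j in range(k):
        bit = (x >> (k - j - 1)) & 1
        p = 2 * p + bit * a               # shift left by one, then add
    return p


if __name__ == "__main__":
    print(multiply_right_shift(0b1010, 0b1011, 4))
    print(multiply_left_shift(0b1010, 0b1011, 4))
```

In the right-shift form the addition always lands in the upper k bits of the double-width partial product, so a k-bit adder suffices, whereas the left-shift form needs a 2k-bit adder.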
6Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Examples of Basic Multiplication

Right-shift algorithm
=========================
a           1 0 1 0
x           1 0 1 1
=========================
p(0)        0 0 0 0
+x0a        1 0 1 0
–––––––––––––––––––––––––
2p(1)     0 1 0 1 0
p(1)        0 1 0 1 0
+x1a        1 0 1 0
–––––––––––––––––––––––––
2p(2)     0 1 1 1 1 0
p(2)        0 1 1 1 1 0
+x2a        0 0 0 0
–––––––––––––––––––––––––
2p(3)     0 0 1 1 1 1 0
p(3)        0 0 1 1 1 1 0
+x3a        1 0 1 0
–––––––––––––––––––––––––
2p(4)     0 1 1 0 1 1 1 0
p(4)        0 1 1 0 1 1 1 0
=========================

Left-shift algorithm
=========================
a                   1 0 1 0
x                   1 0 1 1
=========================
p(0)                0 0 0 0
2p(0)             0 0 0 0 0
+x3a                1 0 1 0
–––––––––––––––––––––––––
p(1)              0 1 0 1 0
2p(1)           0 1 0 1 0 0
+x2a                0 0 0 0
–––––––––––––––––––––––––
p(2)            0 1 0 1 0 0
2p(2)         0 1 0 1 0 0 0
+x1a                1 0 1 0
–––––––––––––––––––––––––
p(3)          0 1 1 0 0 1 0
2p(3)       0 1 1 0 0 1 0 0
+x0a                1 0 1 0
–––––––––––––––––––––––––
p(4)        0 1 1 0 1 1 1 0
=========================
7Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Programmed Using Right-Shift Algorithm
{Using right shifts, multiply unsigned m_cand and m_ier,
 storing the resultant 2k-bit product in p_high and p_low.
 Registers: R0 holds 0       Rc for counter
            Ra for m_cand    Rx for m_ier
            Rp for p_high    Rq for p_low}

{Load operands into registers Ra and Rx}
mult:     load    Ra with m_cand
          load    Rx with m_ier
{Initialize partial product and counter}
          copy    R0 into Rp
          copy    R0 into Rq
          load    k into Rc
{Begin multiplication loop}
m_loop:   shift   Rx right 1        {LSB moves to carry flag}
          branch  no_add if carry = 0
          add     Ra to Rp          {carry flag is set to cout}
no_add:   rotate  Rp right 1        {carry to MSB, LSB to carry}
          rotate  Rq right 1        {carry to MSB, LSB to carry}
          decr    Rc                {decrement counter by 1}
          branch  m_loop if Rc ≠ 0
{Store the product}
          store   Rp into p_high
          store   Rq into p_low
m_done:   ...

[Figure: register usage. R0 holds 0, Rc is the counter, Ra holds the multiplicand, Rx the multiplier, Rp the high half of the product, Rq the low half]
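To see how the carry flag links the add and the two rotates, the program can be mimicked in Python. This is a rough sketch under the assumption of k-bit registers and a single carry flag; programmed_multiply and the local variable names are illustrative, not part of the slides.

```python
def programmed_multiply(m_cand, m_ier, k):
    """Register-level emulation of the right-shift program; returns (p_high, p_low)."""
    mask = (1 << k) - 1
    ra, rx = m_cand & mask, m_ier & mask
    rp = rq = 0
    for _ in range(k):                      # Rc counts k iterations
        carry = rx & 1                      # shift Rx right 1: LSB moves to carry flag
        rx >>= 1
        if carry:                           # skipped when the branch to no_add is taken
            total = rp + ra                 # add Ra to Rp
            rp = total & mask
            carry = total >> k              # carry flag is set to cout
        lsb = rp & 1                        # rotate Rp right 1 through the carry flag
        rp = (carry << (k - 1)) | (rp >> 1)
        carry = lsb
        rq = (carry << (k - 1)) | (rq >> 1) # rotate Rq right 1 (its old LSB is discarded)
    return rp, rq


if __name__ == "__main__":
    high, low = programmed_multiply(0b1010, 0b1011, 4)
    print(format(high, "04b"), format(low, "04b"))
```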
8Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Time Complexity of Programmed Multiplication
Assume k-bit words
k iterations of the main loop
6-7 instructions per iteration, depending on the multiplier bit
Thus, 6k + 3 to 7k + 3 machine instructions, ignoring operand loads and result store
k = 32 implies 200+ instructions on average
This is too slow for many modern applications!
Microprogrammed multiply would be somewhat better
9Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Sequential Multiplication with Right Shifts
[Figure: hardware realization. Labeled elements: shifting multiplier register x, multiplicand register a, mux selecting 0 or a under control of x_j, k-bit adder with carry-out, shifting double-width partial product register p(j)]
Clock?
Control path?
10Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Sequential Multiplication with Left Shifts
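[Figure: hardware realization. Labeled elements: shifting multiplier register x, multiplicand register a, mux selecting 0 or a under control of x_(k–j–1), 2k-bit adder with carry-out, shifting double-width partial product register p(j)]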
11Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication of Signed Numbers
============================
a             1 0 1 1 0
x             0 1 0 1 1
============================
p(0)          0 0 0 0 0
+x0a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(1)       1 1 0 1 1 0
p(1)          1 1 0 1 1 0
+x1a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(2)       1 1 0 0 0 1 0
p(2)          1 1 0 0 0 1 0
+x2a          0 0 0 0 0
––––––––––––––––––––––––––––
2p(3)       1 1 1 0 0 0 1 0
p(3)          1 1 1 0 0 0 1 0
+x3a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(4)       1 1 0 0 1 0 0 1 0
p(4)          1 1 0 0 1 0 0 1 0
+x4a          0 0 0 0 0
––––––––––––––––––––––––––––
2p(5)       1 1 1 0 0 1 0 0 1 0
p(5)          1 1 1 0 0 1 0 0 1 0
============================
Negative multiplicand, positive multiplier:
No change, other than looking out for proper sign extension
12Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication with a Negative Multiplier
============================
a             1 0 1 1 0
x             1 0 1 0 1
============================
p(0)          0 0 0 0 0
+x0a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(1)       1 1 0 1 1 0
p(1)          1 1 0 1 1 0
+x1a          0 0 0 0 0
––––––––––––––––––––––––––––
2p(2)       1 1 1 0 1 1 0
p(2)          1 1 1 0 1 1 0
+x2a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(3)       1 1 0 0 1 1 1 0
p(3)          1 1 0 0 1 1 1 0
+x3a          0 0 0 0 0
––––––––––––––––––––––––––––
2p(4)       1 1 1 0 0 1 1 1 0
p(4)          1 1 1 0 0 1 1 1 0
+(−x4a)       0 1 0 1 0
––––––––––––––––––––––––––––
2p(5)       0 0 0 1 1 0 1 1 1 0
p(5)          0 0 0 1 1 0 1 1 1 0
============================
Negative multiplicand, negative multiplier:
In last step (the sign bit), subtract rather than add
(10101)two = –1 × 2^4 + 2^2 + 2^0
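The two signed examples can be reproduced with a small variation of the right-shift recurrence. The sketch below assumes k-bit 2's-complement bit patterns as inputs and relies on Python's arithmetic right shift for sign extension; to_signed and multiply_twos_complement are illustrative names.

```python
def to_signed(bits, k):
    """Interpret a k-bit pattern as a 2's-complement integer."""
    return bits - (1 << k) if (bits >> (k - 1)) & 1 else bits


def multiply_twos_complement(a_bits, x_bits, k):
    """Right-shift multiplication; the sign bit of x carries weight -2^(k-1)."""
    a = to_signed(a_bits, k)
    p = 0
    for j in range(k):
        x_j = (x_bits >> j) & 1
        term = x_j * a * (1 << k)
        if j == k - 1:
            term = -term                # last step: subtract rather than add
        p = (p + term) >> 1             # arithmetic shift keeps the sign
    return p


if __name__ == "__main__":
    print(multiply_twos_complement(0b10110, 0b01011, 5))   # (-10) x (+11)
    print(multiply_twos_complement(0b10110, 0b10101, 5))   # (-10) x (-11)
```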
13Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Booth’s Recoding
–––––––––––––––––––––––––––––––––––––––––––––––––
x_i   x_(i–1)   y_i   Explanation
–––––––––––––––––––––––––––––––––––––––––––––––––
 0      0        0    No string of 1s in sight
 0      1        1    End of string of 1s in x
 1      0       −1    Beginning of string of 1s in x
 1      1        0    Continuation of string of 1s in x
–––––––––––––––––––––––––––––––––––––––––––––––––
Example
      1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0   Operand x
(1)  −1  0  1  0  0 −1  1  0 −1  1 −1  1  0  0 −1  0   Recoded version y
Justification
2^j + 2^(j–1) + . . . + 2^(i+1) + 2^i = 2^(j+1) – 2^i
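The recoding table amounts to y_i = x_(i–1) – x_i with x_(–1) = 0, which the following sketch applies to the 16-bit example. The function name booth_recode is illustrative; digits are returned least-significant first.

```python
def booth_recode(x, k):
    """Radix-2 Booth digits y_0 .. y_(k-1) of a k-bit pattern, each in {-1, 0, 1}."""
    digits = []
    prev = 0                         # x_(-1) = 0
    for i in range(k):
        x_i = (x >> i) & 1
        digits.append(prev - x_i)    # y_i = x_(i-1) - x_i
        prev = x_i
    return digits


if __name__ == "__main__":
    x = 0b1001110110101110
    y = booth_recode(x, 16)
    print(list(reversed(y)))         # most-significant digit first
    # The recoded digits represent the 2's-complement value of x.
    print(sum(d << i for i, d in enumerate(y)), x - (1 << 16))
```

The digits alone give the 2's-complement value of x; for an unsigned operand, the extra leading digit shown as (1) in the example adds x_(k–1) 2^k.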
14Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Example Multiplication with Booth’s Recoding
============================
a             1 0 1 1 0
x             1 0 1 0 1        Multiplier
y            −1 1 −1 1 −1      Booth-recoded
============================
p(0)          0 0 0 0 0
+y0a          0 1 0 1 0
––––––––––––––––––––––––––––
2p(1)       0 0 1 0 1 0
p(1)          0 0 1 0 1 0
+y1a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(2)       1 1 1 0 1 1 0
p(2)          1 1 1 0 1 1 0
+y2a          0 1 0 1 0
––––––––––––––––––––––––––––
2p(3)       0 0 0 1 1 1 1 0
p(3)          0 0 0 1 1 1 1 0
+y3a          1 0 1 1 0
––––––––––––––––––––––––––––
2p(4)       1 1 1 0 0 1 1 1 0
p(4)          1 1 1 0 0 1 1 1 0
+y4a          0 1 0 1 0
––––––––––––––––––––––––––––
2p(5)       0 0 0 1 1 0 1 1 1 0
p(5)          0 0 0 1 1 0 1 1 1 0
============================
2’s complement of 10110 is 01010
15Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication by Constants
Explicit, e.g. y := 12 ∗ x + 1
Implicit, e.g. A[i, j] := A[i, j] + B[i, j]
Address of A[i, j] = base + n ∗ i + j
[Figure: an m × n array with rows 0 to m – 1 and columns 0 to n – 1, highlighting row i and column j]
Software aspects:
Optimizing compilers replace multiplications by shifts/adds/subs
Produce efficient code using as few registers as possible
Find the best code by a time/space-efficient algorithm
Hardware aspects:
Synthesize special-purpose units such as filters
y[t] = a0x[t] + a1x[t – 1] + a2x[t – 2] + b1y[t – 1] + b2y[t – 2]
16Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication Using Binary Expansion
Example: Multiply R1 by the constant 113 = (1 1 1 0 0 0 1)two
R2 ← R1 shift-left 1
R3 ← R2 + R1
R6 ← R3 shift-left 1
R7 ← R6 + R1
R112 ← R7 shift-left 4
R113 ← R112 + R1
Ri: Register that contains i times (R1)
This notation is for clarity; only one register other than R1 is needed
Shorter sequence using shift-and-add instructions
R3 ← R1 shift-left 1 + R1
R7 ← R3 shift-left 1 + R1
R113 ← R7 shift-left 4 + R1
17Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication via Recoding
Example: Multiply R1 by 113 = (1 1 1 0 0 0 1)two = (1 0 0 −1 0 0 0 1)two
R8 ← R1 shift-left 3
R7 ← R8 – R1
R112 ← R7 shift-left 4
R113 ← R112 + R1
Shorter sequence using shift-and-add/subtract instructions
R7 ← R1 shift-left 3 – R1
R113 ← R7 shift-left 4 + R1
6 shift or add (3 shift-and-add) instructions needed without recoding
18Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplication via Factorization
Example: Multiply R1 by 119 = 7 × 17 = (8 – 1) × (16 + 1)
R8 ← R1 shift-left 3
R7 ← R8 – R1
R112 ← R7 shift-left 4
R119 ← R112 + R7
Shorter sequence using shift-and-add/subtract instructions
R7 ← R1 shift-left 3 – R1
R119 ← R7 shift-left 4 + R7
119 = (1 1 1 0 1 1 1)two = (1 0 0 0 −1 0 0 −1)two
More instructions may be needed without factorization
Requires a scratch register for holding the 7 multiple
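The three strategies for constant multiplication (binary expansion, recoding, factorization) can be compared side by side. In the sketch below each Python statement stands for one shift-and-add or shift-and-subtract instruction; the function names are illustrative.

```python
def times_113_binary(r1):
    """113 = (1110001)two: three shift-and-add instructions."""
    r3 = (r1 << 1) + r1
    r7 = (r3 << 1) + r1
    return (r7 << 4) + r1


def times_113_recoded(r1):
    """113 = 128 - 16 + 1: one shift-and-subtract, one shift-and-add."""
    r7 = (r1 << 3) - r1
    return (r7 << 4) + r1


def times_119_factored(r1):
    """119 = (8 - 1) x (16 + 1): the 7 multiple is kept in a scratch register."""
    r7 = (r1 << 3) - r1
    return (r7 << 4) + r7


if __name__ == "__main__":
    for r1 in (1, 5, 1000):
        print(times_113_binary(r1), times_113_recoded(r1), times_119_factored(r1))
```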
19Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
High-Radix Multipliers
Chapter Goals
Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, . . .)
Chapter Highlights
High radix gives rise to “difficult” multiples
Recoding (change of digit-set) as remedy
Carry-save addition reduces cycle time
Implementation and optimization methods
20Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-4 Multiplication in Dot Notation
Number of cycles is halved, but now the “difficult” multiple 3a must be dealt with
[Figure: the radix-2 dot-notation diagram next to its radix-4 counterpart, in which the two partial products are (x1 x0)two a 4^0 and (x3 x2)two a 4^1]
Radix-4, or two-bit-at-a-time, multiplication in dot notation
21Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A Possible Design for a Radix-4 Multiplier
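[Figure: the multiplier register, shifted 2 bits per cycle, supplies x_(i+1) x_i to a four-input mux (00, 01, 10, 11) that selects 0, a, 2a, or 3a and sends it to the adder; 3a is precomputed via shift-and-add (3a = 2a + a)]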
22Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Example Radix-4 Multiplication Using 3a
================================
a                    0 1 1 0
3a               0 1 0 0 1 0
x                    1 1 1 0
================================
p(0)                 0 0 0 0
+(x1x0)two a     0 0 1 1 0 0
––––––––––––––––––––––––––––––––
4p(1)            0 0 1 1 0 0
p(1)                 0 0 1 1 0 0
+(x3x2)two a     0 1 0 0 1 0
––––––––––––––––––––––––––––––––
4p(2)            0 1 0 1 0 1 0 0
p(2)                 0 1 0 1 0 1 0 0
================================
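A radix-4 version of the right-shift recurrence consumes two multiplier bits per cycle and selects among the multiples 0, a, 2a and the precomputed 3a. The sketch below assumes unsigned operands and an even k; multiply_radix4 is an illustrative name.

```python
def multiply_radix4(a, x, k):
    """p(j+1) = (p(j) + (x_(i+1) x_i)two * a * 2^k) / 4, two multiplier bits per cycle."""
    multiples = [0, a, 2 * a, 2 * a + a]     # 3a precomputed via shift-and-add
    p = 0
    for i in range(0, k, 2):
        digit = (x >> i) & 0b11              # current radix-4 multiplier digit
        p = (p + (multiples[digit] << k)) >> 2
    return p


if __name__ == "__main__":
    print(multiply_radix4(0b0110, 0b1110, 4))   # 6 x 14
```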
23Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A Second Design for a Radix-4 Multiplier
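Replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and –a.
x_(i+1)  x_i   c     Mux control   Set carry
-------  ---  ---    -----------   ---------
   0      0    0        0 0            0
   0      0    1        0 1            0
   0      1    0        0 1            0
   0      1    1        1 0            0
   1      0    0        1 0            0
   1      0    1        1 1            1
   1      1    0        1 1            1
   1      1    1        0 0            1
Mux inputs: 00 selects 0, 01 selects a, 10 selects 2a, 11 selects –a
Mux control = (x_(i+1) ⊕ x_i c, x_i ⊕ c)
Carry FF: set if x_(i+1) = x_i = 1 or if x_(i+1) = c = 1, that is, carry = x_(i+1)(x_i ∨ c)
[Figure: the multiplier register with 2-bit shifts supplies x_(i+1) x_i, which are combined with the carry c (+c, mod 4) to control the mux feeding the adder]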
24Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-4 Booth’s Recoding
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
x_(i+1)  x_i  x_(i–1)    y_(i+1)  y_i    z_(i/2)   Explanation
(context)                (recoded radix-2 digits)  (radix-4 digit)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
   0      0     0           0      0        0      No string of 1s in sight
   0      0     1           0      1        1      End of string of 1s
   0      1     0           0      1        1      Isolated 1
   0      1     1           1      0        2      End of string of 1s
   1      0     0          −1      0       −2      Beginning of string of 1s
   1      0     1          −1      1       −1      End a string, begin new one
   1      1     0           0     −1       −1      Beginning of string of 1s
   1      1     1           0      0        0      Continuation of string of 1s
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Example
      1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0   Operand x
(1)  −1  0  1  0  0 −1  1  0 −1  1 −1  1  0  0 −1  0   Recoded version y
(1)     −2     2    −1     2    −1    −1     0    −2   Radix-4 version z
Only shifting and complementation required
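Each radix-4 digit in the table equals z_(i/2) = –2x_(i+1) + x_i + x_(i–1). The sketch below applies this to the example operand; booth_radix4 is an illustrative name and digits are returned least-significant first.

```python
def booth_radix4(x, k):
    """Radix-4 Booth digits in {-2, -1, 0, 1, 2} of a k-bit pattern (k even)."""
    digits = []
    prev = 0                                 # x_(-1) = 0
    for i in range(0, k, 2):
        lo = (x >> i) & 1
        hi = (x >> (i + 1)) & 1
        digits.append(-2 * hi + lo + prev)   # z_(i/2) = -2 x_(i+1) + x_i + x_(i-1)
        prev = hi
    return digits


if __name__ == "__main__":
    x = 0b1001110110101110
    z = booth_radix4(x, 16)
    print(list(reversed(z)))                 # most-significant digit first
    print(sum(d * 4 ** j for j, d in enumerate(z)), x - (1 << 16))
```

Every multiple z a with z in {–2, –1, 0, 1, 2} is obtained from a by shifting and complementation, which is why the difficult multiple 3a disappears.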
25Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Example Multiplication via Modified Booth’s Recoding
================================
a                    0 1 1 0
x                    1 0 1 0
z                     −1  −2        Radix-4
================================
p(0)             0 0 0 0 0 0
+z0a             1 1 0 1 0 0
––––––––––––––––––––––––––––––––
4p(1)            1 1 0 1 0 0
p(1)             1 1 1 1 0 1 0 0
+z1a             1 1 1 0 1 0
––––––––––––––––––––––––––––––––
4p(2)            1 1 0 1 1 1 0 0
p(2)                 1 1 0 1 1 1 0 0
================================
26Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiple Generation with Radix-4 Booth’s Recoding
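[Figure: recoding logic reads x_(i+1), x_i, x_(i–1) from the multiplier register (2-bit shift, x_(–1) initialized to 0) and produces the signals neg, two, and non0. A mux (Select = two, Enable = non0) picks a or 2a from the multiplicand, giving the (k + 1)-bit value 0, a, or 2a at the adder input; neg is the add/subtract control, so the multiple applied is z_(i/2) a]
Could have named this signal one/two
Sign extension, not 0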
27Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Using Carry-Save Adders
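[Figure: two muxes controlled by x_(i+1) and x_i select 0 or 2a and 0 or a from the multiplier bits; a CSA combines them with the old cumulative partial product, and an adder produces the new cumulative partial product]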
28Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Keeping the Partial Product in Carry-Save Form
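[Figure: radix-2 multiplier with the partial product kept in carry-save form. Labeled elements: multiplier, multiplicand, mux selecting 0 or the multiplicand (next multiple), k-bit CSA, sum and carry registers of the partial product, shift, k-bit adder; signals: old PP, CS sum, new PP]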
29Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Carry-Save Multiplier with Radix-4 Booth’s Recoding (1/2)
[Figure: a Booth recoder and selector driven by x_(i+1), x_i, x_(i–1) forms z_(i/2) a from a, with an extra “dot”; a CSA adds it to the old cumulative partial product; a 2-bit adder with FF sends bits to the lower half of the partial product; an adder forms the new cumulative partial product]
30Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
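Carry-Save Multiplier with Radix-4 Booth’s Recoding (2/2)
[Figure: recoding logic reads x_(i+2), x_(i+1), x_i, x_(i–1) and produces neg, two, and non0; a mux (Select, Enable) yields the (k + 1)-bit value 0, a, or 2a; a selective-complement stage yields the (k + 2)-bit value 0, a, 2a, or a complemented multiple, together with an extra “dot” for column i]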
31Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Another Design for Radix-4 Multiplication
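[Figure: two muxes controlled by x_(i+1) and x_i select 0 or 2a and 0 or a; two CSAs combine them with the old cumulative partial product to form the new cumulative partial product; a 2-bit adder with FF sends bits to the lower half of the partial product; an adder completes the sum]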
32Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
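Radix-8 and Radix-16 Multipliers
[Figure: radix-16 multiplier. Four muxes controlled by x_(i+3), x_(i+2), x_(i+1), x_i select 0 or 8a, 0 or 4a, 0 or 2a, 0 or a; a tree of four CSAs adds them to the sum and carry of the partial product (upper half); a 4-bit adder with FF sends 4 bits to the lower half of the partial product; 4-bit right shift per cycle]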
33Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A Spectrum of Multiplier Design Choices
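[Figure: three design points on a spectrum. Basic binary: next multiple, adder, partial product. High-radix or partial tree: several multiples, small CSA tree, adder, partial product. Full tree: all multiples, full CSA tree, adder. Moving toward the full tree speeds up; moving toward basic binary economizes]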
34Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
VLSI Complexity Issues
A radix-2^b multiplier requires:
bk two-input AND gates to form the partial products bit-matrix
O(bk) area for the CSA tree
At least Θ(k) area for the final carry-propagate adder
Total area: A = O(bk)
Latency: T = O((k/b) log b + log k)
Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints:
AT grows at least as fast as k^(3/2)
AT^2 is at least proportional to k^2
The preceding radix-2^b implementations are suboptimal, because:
AT = O(k^2 log b + bk log k)
AT^2 = O((k^3/b) log^2 b)
35Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Comparing High- and Low-Radix Multipliers
Intermediate designs do not yield better AT or AT^2 values;
The multipliers remain asymptotically suboptimal for any b

          Low-Cost      High Speed          AT- or AT^2-
          b = O(1)      b = O(k)            Optimal
AT        O(k^2)        O(k^2 log k)        O(k^(3/2))
AT^2      O(k^3)        O(k^2 log^2 k)      O(k^2)

AT = O(k^2 log b + bk log k)     AT^2 = O((k^3/b) log^2 b)
By the AT measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers
Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers
High-radix multiplier latency can be reduced from O((k/b) log b + log k) to O(k/b + log k) through more effective pipelining (Chapter 11)
36Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Tree and Array Multipliers
Chapter Goals
Study the design of multipliers for highest possible performance (speed, throughput)
Chapter Highlights
Tree multiplier = reduction tree + redundant-to-binary converter
Avoiding full sign extension in multiplying signed numbers
Array multiplier = one-sided reduction tree + ripple-carry adder
37Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Full-Tree Multipliers
[Figure: general structure of a full-tree multiplier. Multiple-forming circuits generate multiples of a from the multiplier; a partial-products reduction tree (multi-operand addition tree) produces a redundant result; a redundant-to-binary converter yields the higher-order product bits. Some lower-order product bits are generated directly]
38Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Full-Tree versus Partial-Tree Multiplier
[Figure: full-tree multiplier (all partial products, large log-depth tree of carry-save adders, adder, product) versus partial-tree multiplier (several partial products, small log-depth tree of carry-save adders, adder, product)]
39Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Variations in Full-Tree Multiplier Design
Designs are distinguished by variations in three elements:
1. Multiple-forming circuits
2. Partial products reduction tree
3. Redundant-to-binary converter
40Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of Variations in CSA Tree Design
Two different binary 4 × 4 tree multipliers.

Wallace Tree (5 FAs + 3 HAs + 4-Bit Adder)
  1 2 3 4 3 2 1
  FA FA FA HA
  ----------------------
  1 3 2 3 2 1 1
  FA HA FA HA
  ----------------------
  2 2 2 2 1 1 1
  4-Bit Adder
  ----------------------
  1 1 1 1 1 1 1 1

Dadda Tree (4 FAs + 2 HAs + 6-Bit Adder)
  1 2 3 4 3 2 1
  FA FA
  ----------------------
  1 3 2 2 3 2 1
  FA HA HA FA
  ----------------------
  2 2 2 2 1 2 1
  6-Bit Adder
  ----------------------
  1 1 1 1 1 1 1 1

Latency!!
41Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A 7X7 Tree Multiplier
[Figure: 7 × 7 tree multiplier. Seven 7-bit partial products occupying bit positions [0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11], [6, 12] are reduced by four 7-bit CSAs and one 10-bit CSA, then summed by a 10-bit CPA]
The index pair [i, j] means that bit positions from i up to j are involved.
42Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Balanced-Delay Tree for 11 Inputs
[Figure: balanced-delay tree of FAs for one bit-slice with 11 inputs, showing the level-1 through level-4 carries and the outputs]
11 + ψ1 = 2ψ1 + 3
Therefore, ψ1 = 8 carries are needed
43Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Binary Tree of 4-to-2 Reduction Modules
Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI
[Figure: binary tree of 4-to-2 reduction modules (four, then two, then one), with a detail of one module built from two CSAs]
4-to-2 reduction module implemented with two levels of (3; 2)-counters
44Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Tree Multipliers for Signed Numbers
From Fig. 8.18 Sign extension in multioperand addition.
---------- Extended positions ----------   Sign   Magnitude positions ---------
xk–1 xk–1 xk–1 xk–1 xk–1                   xk–1   xk–2 xk–3 xk–4 . . .
yk–1 yk–1 yk–1 yk–1 yk–1                   yk–1   yk–2 yk–3 yk–4 . . .
zk–1 zk–1 zk–1 zk–1 zk–1                   zk–1   zk–2 zk–3 zk–4 . . .
[Figure: partial products with sign bits α, β, γ and their sign extensions feeding a row of six FAs]
Five redundant copies removed
Sign extensions Signs
The difference in multiplication is the shifting sign positions
Fig. 11.7 Sharing of full adders to reduce the CSA width in a signed tree multiplier.
45Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Using the Negative-Weight Property of the Sign Bit
Sign extension is a way of converting negatively weighted bits (negabits) to positively weighted bits (posibits) to facilitate reduction, but there are other methods of accomplishing the same without introducing a lot of extra bits
Baugh and Wooley have contributed two such methods
[Figure: partial-product bit matrices for multiplying a4 a3 a2 a1 a0 by x4 x3 x2 x1 x0 into p9 . . . p0, in four panels: a. Unsigned, b. 2's-complement, c. Baugh-Wooley, d. Modified B-W]
46Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Fig. 11.8
The Baugh-Wooley Method and Its Modified Form
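Baugh-Wooley: –a4x0 = a4(1 – x0) – a4 = a4x0′ – a4
The –a4 term is replaced by a4 in the same column and –a4 in the next column
Modified form: –a4x0 = (1 – a4x0) – 1 = (a4x0)′ – 1
The –1 term is replaced by 1 in the same column and –1 in the next column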
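The modified form can be verified exhaustively for small word widths. The sketch below assumes the usual reading of the modified matrix: the cross terms involving exactly one sign bit are complemented, and the accumulated –1 terms collapse into constant 1s in columns k and 2k – 1 (columns 5 and 9 for k = 5). The function name modified_baugh_wooley is illustrative.

```python
def modified_baugh_wooley(a_bits, x_bits, k):
    """2k-bit 2's-complement product pattern built only from positively weighted bits."""
    total = 0
    for i in range(k):
        for j in range(k):
            bit = ((a_bits >> i) & 1) & ((x_bits >> j) & 1)
            if (i == k - 1) != (j == k - 1):
                bit ^= 1                      # (a_i x_j)' replaces the negative term
            total += bit << (i + j)
    total += (1 << k) + (1 << (2 * k - 1))    # constants left over from the -1 terms
    return total & ((1 << (2 * k)) - 1)       # keep 2k bits


def to_signed(bits, k):
    return bits - (1 << k) if (bits >> (k - 1)) & 1 else bits


if __name__ == "__main__":
    p = modified_baugh_wooley(0b10110, 0b10101, 5)
    print(format(p, "010b"), to_signed(p, 10))
```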
47Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Alternate Views of the Baugh-Wooley Methods
+   0   0   –a4x3   –a4x2   –a4x1   –a4x0
+   0   0   –a3x4   –a2x4   –a1x4   –a0x4
--------------------------------------------
–   0   0   a4x3   a4x2   a4x1   a4x0
–   0   0   a3x4   a2x4   a1x4   a0x4
--------------------------------------------
+   1   1   a4x3   a4x2   a4x1   a4x0
+   1   1   a3x4   a2x4   a1x4   a0x4
                                    1
                                    1
--------------------------------------------
+   a4   a4   a4x3   a4x2   a4x1   a4x0
+   x4   x4   a3x4   a2x4   a1x4   a0x4
a4x4
--------------------------------------------
a4   1   x4
[The overbars that mark complemented terms in the last two groups did not survive in this transcript]
48Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Partial-Tree Multipliers
Fig. 11.9 General structure of a partial-tree multiplier.
[Figure: h inputs feed a CSA tree whose sum and carry outputs hold the upper part of the cumulative partial product (stored-carry); an h-bit adder with FF produces the lower part of the cumulative partial product; an adder completes the product]
High-radix versus partial-tree multipliers: The difference is quantitative, not qualitative
For small h, say ≤ 8 bits, we view the multiplier of Fig. 11.9 as high-radix
When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier
Better design through pipelining to be covered in Section 11.6
49Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Truncated Multipliers
Removing the dots at the right does not lead to much loss of precision.
                 ulp
   . o o o o o o o o                      k-by-k fractional
 × . o o o o o o o o                      multiplication
-------------------------------------
   .   o o o o o o o|o
   .     o o o o o o|o o
   .       o o o o o|o o o
   .         o o o o|o o o o
   .           o o o|o o o o o
   .             o o|o o o o o o
   .               o|o o o o o o o
   .                |o o o o o o o o
-------------------------------------
   . o o o o o o o o|o o o o o o o o
Max error = 8/2 + 7/4 + 6/8 + 5/16 + 4/32 + 3/64 + 2/128 + 1/256 = 7.004 ulp
Mean error = 1.751 ulp
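The quoted error figures follow from counting the dropped dots column by column: the first dropped column holds k dots worth 1/2 ulp each, the next holds k – 1 dots worth 1/4 ulp each, and so on. The sketch below reproduces both numbers; the mean assumes that operand bits are independent and equally likely to be 0 or 1, so each dot is 1 with probability 1/4. That assumption is made for this illustration and is not stated on the slide.

```python
from fractions import Fraction


def truncation_error(k):
    """Return (max error, mean error) in ulp for a k-by-k truncated multiplier."""
    max_err = sum(Fraction(k - i, 2 ** (i + 1)) for i in range(k))
    mean_err = max_err / 4          # each dropped dot is 1 with probability 1/4
    return max_err, mean_err


if __name__ == "__main__":
    max_err, mean_err = truncation_error(8)
    print(float(max_err), float(mean_err))
```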
50Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Truncated Multipliers with Error Compensation
Constant and variable error compensation for truncated multipliers.
We can introduce additional “dots” on the left-hand side to compensate for the removal of dots from the right-hand side
Constant compensation Variable compensation
. o o o o o o o| . o o o o o o o|
. o o o o o o| . o o o o o o|
. o o o o o| . o o o o o|
. o o o o| . o o o o|
. o o o| . o o o|
. 1 o o| . o o|
. o| . x-1o|
. | . y-1 |
Constant compensation: Max error = +4 ulp, Max error ≅ −3 ulp, Mean error = ? ulp
Variable compensation: Max error = +? ulp, Max error ≅ −? ulp, Mean error = ? ulp
51Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Array Multipliers
A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.
[Figures: (1) the partial products x0 a, x1 a, x2 a, x3 a, x4 a enter a chain of four CSAs, the first with a 0 input, followed by a ripple-carry adder producing the product bits; (2) the corresponding 5 × 5 array of cells with inputs a_i x_j, 0 inputs along the top row, and outputs p0 to p9]
Details of a 5×5 array multiplier using FA blocks.
[3:2] Adder, i.e. a full adder
52Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Signed (2’s-complement) Array Multiplier
[Figure: modifications to the 5 × 5 array multiplier for 2’s-complement inputs using the Baugh-Wooley method, or to shorten the critical path. Cells take inputs a_i x_j with complemented terms in the sign row and column, plus the extra inputs a4, x4 (and their complements) and 1; outputs p0 to p9]
53Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Array Multiplier Built of Modified Full-Adder Cells
Design of a 5 × 5 array multiplier with two additive inputs and full-adder blocks that include AND gates.
[Figure: 5 × 5 array of FA cells with built-in AND gates; a4 . . . a0 enter along one side and x0 . . . x4 along the other, producing p9 . . . p5 and p4 . . . p0]
54Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Array Multiplier without a Final Carry-Propagate Adder
[Figure: levels i and i + 1 of the array extended with muxes that conditionally select the upper product bits; labels B_i and B_(i+1)]
All remaining bits of the final product produced only 2 gate levels after p_(k–1)
See next slide
55Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Extend Bits in Less-Significant Part in a Conditional Adder
The circuit on the right can be regarded as a conditional adder, just like the circuit on the left. Source: Ercegovac and Lang, “Digital Arithmetic”, pp. 86-87
56Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Pipelined Tree and Array Multipliers
[Figure, left: h inputs, CSA tree, sum and carry registers holding the upper part of the cumulative partial product (stored-carry), h-bit adder with FF for the lower part of the cumulative partial product, adder]
General structure of a partial-tree multiplier.
[Figure, right: the same structure with a pipelined (h + 2)-input CSA tree, with latches between the CSA levels]
Efficiently pipelined partial-tree multiplier.
57Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Pipelined Array Multipliers
With latches after every FA level, the maximum throughput is achieved.
Latches may be inserted after every h FA levels for an intermediate design.
Pipelined 5×5 array multiplier using latched FA blocks. The small shaded boxes are latches.
[Figure: 5 × 5 array of latched FA cells with AND gates; inputs a4..a0 and x4..x0, product bits p9..p0. Example: 3-stage pipeline.]
58Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Variations in Multipliers
Chapter Goals
Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.)
Chapter Highlights
Building a multiplier from smaller units
Performing multiply-add as one operation
Bit-serial and (semi)systolic multipliers
Using a multiplier for squaring is wasteful
59Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Divide-and-Conquer Designs
Building a wide multiplier from narrower ones
Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers.
[Figure: operands a = (aH, aL) and x = (xH, xL), each half b bits wide; the four partial products aL xL, aL xH, aH xL, and aH xH; and the rearranged partial products in 2b-by-2b multiplication (labels: b bits, 2b bits, 3b bits).]
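The decomposition above can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: the function name `multiply_2b` is hypothetical, and Python's built-in `*` stands in for each b × b multiplier.

```python
def multiply_2b(a: int, x: int, b: int) -> int:
    """Form a 2b-by-2b product from four b-by-b products (illustrative helper)."""
    mask = (1 << b) - 1
    a_lo, a_hi = a & mask, a >> b      # a = a_hi * 2^b + a_lo
    x_lo, x_hi = x & mask, x >> b      # x = x_hi * 2^b + x_lo

    # Four b x b partial products, each at most 2b bits wide.
    p_ll = a_lo * x_lo
    p_lh = a_lo * x_hi
    p_hl = a_hi * x_lo
    p_hh = a_hi * x_hi

    # Weights follow directly from the two-part split of a and x.
    return (p_hh << (2 * b)) + ((p_hl + p_lh) << b) + p_ll


print(hex(multiply_2b(0xAB, 0xCD, 4)))
```

The low b bits of `p_ll` pass straight to the output, which is why the remaining addition in the slide's rearranged form is narrower than the full 4b-bit product.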
60Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
General Structure of a Recursive Multiplier
2b × 2b: use (3; 2)-counters
3b × 3b: use (5; 2)-counters
4b × 4b: use (7; 2)-counters
Using b × b multipliers to synthesize 2b × 2b, 3b × 3b, and 4b × 4b multipliers.
[Figure: nested b × b, 2b × 2b, 3b × 3b, and 4b × 4b partial-product arrangements.]
61Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
An 8 × 8 Multiplier Using 4 × 4 Multipliers
[Figure: four 4 × 4 Multiply blocks form aH xH, aH xL, aL xH, and aL xL from a[4,7], a[0,3], x[4,7], and x[0,3]; Add blocks combine their 8-bit and 12-bit results into p[12,15], p[8,11], p[4,7], and p[0,3].]
62Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Additive Multiply Modules
Additive multiply module with 2 × 4 multiplier (ax) plus 4-bit and 2-bit additive inputs (y and z).
p = ax + y + z
[Figure: (a) block diagram, built around a 4-bit adder with carry-in; (b) dot notation.]
b × c AMM: b-bit and c-bit multiplicative inputs, b-bit and c-bit additive inputs, (b + c)-bit output.
The output always fits in b + c bits because
$(2^b - 1) \times (2^c - 1) + (2^b - 1) + (2^c - 1) = 2^{b+c} - 1$
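The width claim for an additive multiply module can be checked numerically. The sketch below uses a hypothetical helper, `amm_max_output`, that evaluates p = ax + y + z with every input at its maximum value.

```python
def amm_max_output(b: int, c: int) -> int:
    """Largest value of p = a*x + y + z for a b x c AMM (illustrative helper)."""
    max_b = (1 << b) - 1   # largest b-bit value (used for a and one additive input)
    max_c = (1 << c) - 1   # largest c-bit value (used for x and the other additive input)
    return max_b * max_c + max_b + max_c


# For the 2 x 4 module in the slide, the maximum is 2^6 - 1 = 63.
print(amm_max_output(2, 4))
```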
63Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiplier Built of AMMs
An 8 × 8 multiplier built of 4 × 2 AMMs. Inputs marked with an asterisk carry 0s.
[Figure: two columns of 4 × 2 AMMs, one column multiplying a[0,3] and the other a[4,7] by x[0,1], x[2,3], x[4,5], and x[6,7]; product outputs p[0,1] through p[12,15]. Legend: 2 bits, 4 bits.]
Understanding an 8 × 8 multiplier built of 4 × 2 AMMs using dot notation.
64Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Bit-Serial Multipliers
Bit-serial adder (LSB first)
[Figure: one FA and one FF; inputs x0, x1, x2, … and y0, y1, y2, …; outputs s0, s1, s2, …]
Bit-serial multiplier
[Figure: inputs a0, a1, a2, … and x0, x1, x2, …; outputs p0, p1, p2, …]
Systolic arrays: synchronous arrays of processing elements that are interconnected by only short, local wires, thus allowing very high clock rates.
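A bit-serial adder of the kind drawn above (one full adder plus one flip-flop holding the carry) can be modeled as follows. This is a behavioral sketch only; the function name and the list-of-bits interface are assumptions made for illustration.

```python
def bit_serial_add(x_bits, y_bits):
    """Add two equal-length bit streams, LSB first; returns (sum bits, final carry)."""
    carry = 0                      # contents of the carry flip-flop
    sum_bits = []
    for xb, yb in zip(x_bits, y_bits):
        total = xb + yb + carry    # full adder: two operand bits plus stored carry
        sum_bits.append(total & 1) # sum output for this clock cycle
        carry = total >> 1         # carry saved for the next cycle
    return sum_bits, carry


# 11 + 6 = 17, with bits listed LSB first.
print(bit_serial_add([1, 1, 0, 1], [0, 1, 1, 0]))
```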
65Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Semisystolic Serial-Parallel Multiplier
[Figure: four FA cells with sum and carry feedback; multiplicand a3 a2 a1 a0 (parallel in), multiplier x0 x1 x2 x3 (serial in, LSB first), product (serial out).]
Semi-systolic circuit for 4 × 4 multiplication in 8 clock cycles.
This is called “semisystolic” because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions.
66Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Systolic Retiming as a Design Tool
Example of retiming by delaying the inputs to CL and advancing the outputs from CL by d units.
[Figure: a cut separates subcircuits CL and CR. Original delays on the cut edges: e, f, g, h. Adjusted delays: e + d and f + d on the inputs to CL, with the delays g and h on the outputs of CL adjusted by d.]
A semisystolic circuit can be converted to a systolic circuit via retiming, which involves advancing and retarding signals by means of delay removal and delay insertion in such a way that the relative timings of various parts are unaffected.
67Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A First Attempt at Retiming
A retimed version of our semi-systolic multiplier.
[Figure: the semisystolic multiplier (four FA cells; multiplicand a3 a2 a1 a0 parallel in; multiplier x0 x1 x2 x3 serial in, LSB first; product serial out) shown before and after retiming along Cut 1, Cut 2, and Cut 3.]
68Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Deriving a Fully Systolic Multiplier
A retimed version of our semi-systolic multiplier.
[Figure: two versions of the four-FA serial-parallel multiplier (multiplicand a3 a2 a1 a0 parallel in; multiplier x0 x1 x2 x3 serial in, LSB first; product serial out), showing the step from the retimed circuit to the fully systolic one.]
69Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
A Direct Design for a Bit-Serial Multiplier
Fig. 12.13 Bit-serial multiplier design in dot notation.
(a) Structure of the bit-matrix
(b) Reduction after each input bit
[Figure: at step i, the terms involving the new bits a_i and x_i are combined with the part already accumulated into three numbers; the low-order product bits are already output. The result 2p(i) is shifted right to obtain p(i).]
Building block for a latency-free bit-serial multiplier.
[Figure: cell containing a (5; 3)-counter and a Mux, with signals a_i, x_i, s_in/s_out, c_in/c_out, t_in/t_out, and p_i.]
The cellular structure of the bit-serial multiplier based on the cell in Fig. 12.11.
70Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Modular Multipliers
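Modulo-$(2^b - 1)$ carry-save adder.
[Figure: row of FA cells.]
Design of a 4 × 4 modulo-15 multiplier.
[Figure: partial-product bits of weight 16 and above are moved down to the low-order positions ("Divide by 16"); the resulting four 4-bit operands are reduced by two mod-15 CSAs followed by a mod-15 CPA.]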
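The modulo-15 design relies on the fact that $2^b \bmod (2^b - 1) = 1$, so a bit of weight $2^b$ can be re-entered at weight 1. The sketch below models this behavior in software; the function names are hypothetical, and the loop adds partial products one at a time rather than through a CSA tree as in the figure.

```python
def add_mod(x: int, y: int, b: int) -> int:
    """Add modulo 2^b - 1 by feeding the carry-out back into the LSB position."""
    m = (1 << b) - 1
    total = x + y
    total = (total & m) + (total >> b)   # carry of weight 2^b re-enters with weight 1
    return 0 if total == m else total    # all-ones is the second encoding of 0


def mul_mod(a: int, x: int, b: int) -> int:
    """Multiply modulo 2^b - 1 using rotated partial products."""
    m = (1 << b) - 1
    acc = 0
    for i in range(b):
        if (x >> i) & 1:
            # a * 2^i mod (2^b - 1) is a left rotation of a by i positions.
            rotated = ((a << i) | (a >> (b - i))) & m
            acc = add_mod(acc, rotated, b)
    return acc


print(mul_mod(13, 11, 4), (13 * 11) % 15)
```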
71Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Other Examples of Modular Multiplication
One way to design a 4 × 4 modulo-13 multiplier.
[Figure: partial-product bits of weight 16 and above are folded back into the low-order positions using 16 mod 13 = 3.]
72Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Squaring
Design of a 5-bit squarer.
[Figure: the partial-product matrix obtained by multiplying x = x4 x3 x2 x1 x0 by itself (product bits p9..p0), followed by the simplified matrix after merging equal terms.]
73Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Constant Multiplier
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 224
74Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan
Multiple Constant Multiplier
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 225
1Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Dept. of EE, FJU, Taiwan
Room: SF 727B
Most slides are revisions of PowerPoint files obtained from the textbook website.
2Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division
Topics in This Part
Chapter 13 Basic Division Schemes
Chapter 14 High-Radix Dividers
Chapter 15 Variations in Dividers
Chapter 16 Division by Convergence
Review division schemes and various speedup methods
• Hardest basic operation (fortunately, also the rarest)
• Division speedup methods: high-radix, array, . . .
• Combined multiplication/division hardware
• Digit-recurrence vs convergence division schemes
3Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
13 Basic Division Schemes
Chapter Goals
Study shift/subtract or bit-at-a-time dividers and set the stage for faster methods and variations to be covered in Chapters 14-16
Chapter Highlights
Shift/subtract divide vs shift/add multiply
Hardware, firmware, software algorithms
Dividing 2’s-complement numbers
The special case of a constant divisor
4Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Shift/Subtract Division Algorithms
Notation for our discussion of division algorithms:
z   Dividend                       $z_{2k-1} z_{2k-2} \ldots z_3 z_2 z_1 z_0$
d   Divisor                        $d_{k-1} d_{k-2} \ldots d_1 d_0$
q   Quotient                       $q_{k-1} q_{k-2} \ldots q_1 q_0$
s   Remainder, $z - (d \times q)$  $s_{k-1} s_{k-2} \ldots s_1 s_0$
Initially, we assume unsigned operands.
Division of an 8-bit number by a 4-bit number in dot notation.
[Figure: dividend z; subtracted bit-matrix with rows $-q_3 d\,2^3$, $-q_2 d\,2^2$, $-q_1 d\,2^1$, $-q_0 d\,2^0$; remainder s; quotient q and divisor d.]
5Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division versus Multiplication (1/2)
Division is more complex than multiplication:
Need for quotient digit selection or estimation.
Overflow possibility: the quotient of a 2k-bit number divided by a k-bit number may be wider than k bits. To avoid overflow, the high-order k bits of z must be strictly less than d.
6Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division versus Multiplication (2/2)
Pentium III latencies

Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
7Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division Recurrence
Division with left shifts:
$s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$, with $s^{(0)} = z$ and $s^{(k)} = 2^k s$
Here $2s^{(j-1)}$ is the shift step and $q_{k-j}(2^k d)$ is the subtract step.
(There is no corresponding right-shift algorithm.)
Integer division is characterized by $z = d \times q + s$. Multiplying both sides by $2^{-2k}$ gives the fractional form:
$2^{-2k} z = (2^{-k} d) \times (2^{-k} q) + 2^{-2k} s$
$z_{frac} = d_{frac} \times q_{frac} + 2^{-k} s_{frac}$
Divide fractions like integers; adjust the remainder.
No-overflow condition for fractions is: $z_{frac} < d_{frac}$
[Figure: alignment of 2z and $2^k d$, each shown as two k-bit halves, with the lower half of $2^k d$ equal to 0.]
8Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division Recurrence Steps
Initialization
Iterations
• One-digit arithmetic left shift of $s^{(j)}$ to produce $r s^{(j)}$
• Determination of the quotient digit $q_{j+1}$ by the quotient-digit selection function (the index of q could be different)
• Generation of the divisor multiple $d \times q_{j+1}$
• Subtraction of $d q_{j+1}$ from $r s^{(j)}$
• On-the-fly conversion of the quotient (or done in the termination step)
Termination: make sign(s) = sign(d), conversion
9Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Examples of Basic Division

Integer division
======================================
z             0 1 1 1 0 1 0 1
2^4 d         1 0 1 0
======================================
s(0)          0 1 1 1 0 1 0 1
2s(0)       0 1 1 1 0 1 0 1
–q3 2^4 d     1 0 1 0              {q3 = 1}
--------------------------------------
s(1)          0 1 0 0 1 0 1
2s(1)       0 1 0 0 1 0 1
–q2 2^4 d     0 0 0 0              {q2 = 0}
--------------------------------------
s(2)          1 0 0 1 0 1
2s(2)       1 0 0 1 0 1
–q1 2^4 d     1 0 1 0              {q1 = 1}
--------------------------------------
s(3)          1 0 0 0 1
2s(3)       1 0 0 0 1
–q0 2^4 d     1 0 1 0              {q0 = 1}
--------------------------------------
s(4)          0 1 1 1
s             0 1 1 1
q             1 0 1 1
======================================

Fractional division
======================================
z_frac        . 0 1 1 1 0 1 0 1
d_frac        . 1 0 1 0
======================================
s(0)          . 0 1 1 1 0 1 0 1
2s(0)       0 . 1 1 1 0 1 0 1
–q–1 d        . 1 0 1 0            {q–1 = 1}
--------------------------------------
s(1)          . 0 1 0 0 1 0 1
2s(1)       0 . 1 0 0 1 0 1
–q–2 d        . 0 0 0 0            {q–2 = 0}
--------------------------------------
s(2)          . 1 0 0 1 0 1
2s(2)       1 . 0 0 1 0 1
–q–3 d        . 1 0 1 0            {q–3 = 1}
--------------------------------------
s(3)          . 1 0 0 0 1
2s(3)       1 . 0 0 0 1
–q–4 d        . 1 0 1 0            {q–4 = 1}
--------------------------------------
s(4)          . 0 1 1 1
s_frac      0 . 0 0 0 0 0 1 1 1
q_frac        . 1 0 1 1
======================================

Notice the index of q.
What is the residual of (0.011)two / (0.1)two?
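The integer example can be reproduced with a direct software model of the recurrence $s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$. This is a minimal sketch, assuming unsigned operands and the no-overflow condition; the function name is illustrative.

```python
def shift_subtract_divide(z: int, d: int, k: int):
    """Divide a 2k-bit unsigned z by a k-bit unsigned d; returns (q, s)."""
    assert 0 < d < (1 << k), "divisor must be a nonzero k-bit number"
    assert 0 <= z and (z >> k) < d, "overflow: high half of z must be less than d"
    aligned_d = d << k             # 2^k * d
    s = z                          # s(0) = z
    q = 0
    for _ in range(k):
        s <<= 1                    # shift: 2 * s(j-1)
        q_bit = 1 if s >= aligned_d else 0
        s -= q_bit * aligned_d     # subtract q_{k-j} * 2^k * d
        q = (q << 1) | q_bit
    return q, s >> k               # s(k) = 2^k * s


# The slide's example: (0111 0101)two / (1010)two, i.e. 117 / 10.
print(shift_subtract_divide(117, 10, 4))
```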
10Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Main Factors Affecting the Overall Execution Time and Cost
Radix r
Quotient-digit set: redundant signed digit?
Representation of the residual: CSA?
Quotient-digit selection
11Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Programmed Division
Register usage for programmed division.
[Figure: registers Rs and Rq, with the carry flag to their left, hold the shifted partial remainder (2k – j bits) and the shifted partial quotient (j bits); the next quotient digit is inserted at the right end of Rq. Register Rd holds the divisor d, aligned as $2^k d$ (low half 0 0 . . . 0).]
12Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Assembly Language Program for Division
Programmed division using left shifts.
{Using left shifts, divide unsigned 2k-bit dividend,
 z_high|z_low, storing the k-bit quotient and remainder.
 Registers: R0 holds 0       Rc for counter
            Rd for divisor   Rs for z_high & remainder
            Rq for z_low & quotient}
{Load operands into registers Rd, Rs, and Rq}
  div:    load   Rd with divisor
          load   Rs with z_high
          load   Rq with z_low
{Check for exceptions}
          branch d_by_0 if Rd = R0
          branch d_ovfl if Rs ≥ Rd
{Initialize counter}
          load   k into Rc
{Begin division loop}
  d_loop: shift  Rq left 1       {zero to LSB, MSB to carry}
          rotate Rs left 1       {carry to LSB, MSB to carry}
          skip   if carry = 1
          branch no_sub if Rs < Rd
          sub    Rd from Rs
          incr   Rq              {set quotient digit to 1}
  no_sub: decr   Rc              {decrement counter by 1}
          branch d_loop if Rc ≠ 0
{Store the quotient and remainder}
          store  Rq into quotient
          store  Rs into remainder
  d_by_0: ...
  d_ovfl: ...
  d_done: ...
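To see how the carry flag links the two registers, the program can be emulated at the register level. This is a hedged sketch: the Python function and its exceptions are illustrative stand-ins for the hypothetical machine, and it treats Rs = Rd as overflow, in line with the no-overflow condition stated earlier (high half of z strictly less than d).

```python
def programmed_divide(z_high: int, z_low: int, d: int, k: int):
    """Emulate the shift/rotate/subtract loop with k-bit registers; returns (q, s)."""
    mask = (1 << k) - 1
    rd, rs, rq = d, z_high, z_low
    if rd == 0:
        raise ZeroDivisionError("d_by_0")
    if rs >= rd:                   # assumption: equality is also an overflow
        raise OverflowError("d_ovfl")
    for _ in range(k):
        carry = rq >> (k - 1)               # shift Rq left: MSB to carry, zero to LSB
        rq = (rq << 1) & mask
        msb = rs >> (k - 1)                 # rotate Rs left: carry to LSB, MSB to carry
        rs = ((rs << 1) & mask) | carry
        carry = msb
        if carry == 1 or rs >= rd:          # "skip if carry = 1" forces the subtraction
            rs = (rs - rd) & mask
            rq += 1                         # set quotient digit to 1
    return rq, rs


# 117 / 10 with k = 4: z_high = 0111, z_low = 0101.
print(programmed_divide(0b0111, 0b0101, 0b1010, 4))
```

When the carry is 1, the true shifted remainder is $2^k$ plus the register contents, so the subtraction must be performed even though the k-bit comparison alone might say otherwise.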
13Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Time Complexity of Programmed Division
Assume k-bit words
k iterations of the main loop
6 or 8 instructions per iteration, depending on the quotient bit
Thus, 6k + 3 to 8k + 3 machine instructions, ignoring operand loads and result store
k = 32 implies 220+ instructions on average
This is too slow for many modern applications!
Microprogrammed division would be somewhat better
14Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Restoring Hardware Dividers
Shift/subtract sequential restoring divider.
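[Figure: registers for the quotient q, the partial remainder s (initial value z; Shift and Load controls), and the divisor d; a k-bit adder with carry-in 1 forms the trial difference, and a quotient digit selector uses the adder's carry-out and the MSB of 2s(j–1) to produce q(k–j); a Mux (inputs 0 and 1) is also shown.]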
15Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Indirect Signed Division
In division with signed operands, q and s are defined by
z = d × q + s    sign(s) = sign(z)    |s| < |d|
Examples of division with signed operands
z = 5    d = 3    ⇒  q = 1    s = 2
z = 5    d = –3   ⇒  q = –1   s = 2
z = –5   d = 3    ⇒  q = –1   s = –2   (not q = –2, s = 1)
z = –5   d = –3   ⇒  q = 1    s = –2
Magnitudes of q and s are unaffected by input signs.
Signs of q and s are derivable from signs of z and d.
Will discuss direct signed division later.
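The indirect method (divide magnitudes, then attach signs) can be sketched as below. The function name is illustrative. Note that Python's own `divmod` uses a different convention, where the remainder takes the sign of the divisor, so it cannot be used directly on signed operands here.

```python
def signed_divide(z: int, d: int):
    """Indirect signed division: sign(s) = sign(z) and |s| < |d|."""
    q_mag, s_mag = divmod(abs(z), abs(d))   # unsigned division on magnitudes
    q = q_mag if (z < 0) == (d < 0) else -q_mag
    s = -s_mag if z < 0 else s_mag          # remainder follows the dividend's sign
    return q, s


for z, d in [(5, 3), (5, -3), (-5, 3), (-5, -3)]:
    print(z, d, signed_divide(z, d), divmod(z, d))
```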
16Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of Restoring Unsigned Division
================================================
z               0 1 1 1 0 1 0 1
2^4 d         0 1 0 1 0
–2^4 d        1 0 1 1 0
================================================
s(0)          0 0 1 1 1 0 1 0 1
2s(0)         0 1 1 1 0 1 0 1
+(–2^4 d)     1 0 1 1 0
------------------------------------------------
s(1)          0 0 1 0 0 1 0 1      Positive, so set q3 = 1
2s(1)         0 1 0 0 1 0 1
+(–2^4 d)     1 0 1 1 0
------------------------------------------------
s(2)          1 1 1 1 1 0 1        Negative, so set q2 = 0
s(2) = 2s(1)  0 1 0 0 1 0 1        and restore
2s(2)         1 0 0 1 0 1
+(–2^4 d)     1 0 1 1 0
------------------------------------------------
s(3)          0 1 0 0 0 1          Positive, so set q1 = 1
2s(3)         1 0 0 0 1
+(–2^4 d)     1 0 1 1 0
------------------------------------------------
s(4)          0 0 1 1 1            Positive, so set q0 = 1
s               0 1 1 1
q               1 0 1 1
================================================
No overflow, because (0111)two < (1010)two
17Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Nonrestoring and Signed Division
The cycle time in restoring division must be long enough to allow:
• Shifting the registers
• Allowing signals to propagate through the adder
• Determining and storing the next quotient digit
• Storing the trial difference, if required
Nonrestoring division to the rescue!
Assume qk–j = 1 and subtract.
Store the result as the new PR (the partial remainder can become incorrect, hence the name “nonrestoring”).
18Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Justification for Nonrestoring Division
Why is it acceptable to store an incorrect value in the partial-remainder register?
Shifted partial remainder at start of the cycle is u.
Suppose subtraction yields the negative result $u - 2^k d$.
Option 1: Restore the partial remainder to the correct value u, shift left, and subtract to get $2u - 2^k d$.
Option 2: Keep the incorrect partial remainder $u - 2^k d$, shift left, and add to get $2(u - 2^k d) + 2^k d = 2u - 2^k d$.
19Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of Nonrestoring Unsigned Division
================================================
z               0 1 1 1 0 1 0 1
2^4 d         0 1 0 1 0
–2^4 d        1 0 1 1 0
================================================
s(0)          0 0 1 1 1 0 1 0 1
2s(0)         0 1 1 1 0 1 0 1      Positive,
+(–2^4 d)     1 0 1 1 0            so subtract
------------------------------------------------
s(1)          0 0 1 0 0 1 0 1
2s(1)         0 1 0 0 1 0 1        Positive, so set q3 = 1
+(–2^4 d)     1 0 1 1 0            and subtract
------------------------------------------------
s(2)          1 1 1 1 1 0 1
2s(2)         1 1 1 1 0 1          Negative, so set q2 = 0
+2^4 d        0 1 0 1 0            and add
------------------------------------------------
s(3)          0 1 0 0 0 1
2s(3)         1 0 0 0 1            Positive, so set q1 = 1
+(–2^4 d)     1 0 1 1 0            and subtract
------------------------------------------------
s(4)          0 0 1 1 1            Positive, so set q0 = 1
s               0 1 1 1
q               1 0 1 1
================================================
No overflow: (0111)two < (1010)two
Applying “if sign(s) = sign(d) then qk–j = 1 else qk–j = −1”, we get the digits 1 1 −1 1, which equal (1011)two, since 8 + 4 − 2 + 1 = 11.
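A behavioral sketch of unsigned nonrestoring division, using the ±1 digit view from the last line above. The function name is illustrative, and the final correction step (add the divisor back once if the last remainder is negative) is the standard one rather than something spelled out on this slide.

```python
def nonrestoring_divide(z: int, d: int, k: int):
    """Unsigned nonrestoring division; returns (q, s, digits) with digits in {1, -1}."""
    assert 0 < d < (1 << k) and 0 <= z and (z >> k) < d
    aligned_d = d << k                  # 2^k * d
    s = z
    digits = []
    for _ in range(k):
        if s >= 0:                      # positive: subtract, digit 1
            digits.append(1)
            s = 2 * s - aligned_d
        else:                           # negative: add, digit -1
            digits.append(-1)
            s = 2 * s + aligned_d
    q = 0
    for digit in digits:                # evaluate the signed-digit quotient
        q = 2 * q + digit
    if s < 0:                           # final correction
        s += aligned_d
        q -= 1
    return q, s >> k, digits


print(nonrestoring_divide(117, 10, 4))
```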
20Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Graphical Depiction of Nonrestoring Division
[Figure: partial remainder versus iteration for the example (0 1 1 1 0 1 0 1)two / (1 0 1 0)two, that is, (117)ten / (10)ten, with 2^4 d = 160. Vertical axis: partial remainder, from –100 to 300.]
(a) Restoring: s(0) = 117; ×2 gives 234, –160 gives s(1) = 74; ×2 gives 148, –160 gives –12, which is restored to 148; ×2 gives 296, –160 gives s(3) = 136; ×2 gives 272, –160 gives s(4) = 16s = 112.
(b) Nonrestoring: s(0) = 117; ×2 gives 234, –160 gives s(1) = 74; ×2 gives 148, –160 gives s(2) = –12; ×2 gives –24, +160 gives s(3) = 136; ×2 gives 272, –160 gives s(4) = 16s = 112.
21Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Nonrestoring Division with Signed Operands
Restoring division
qk–j = 0 means no subtraction (or subtraction of 0)
qk–j = 1 means subtraction of d
Nonrestoring division
We always subtract or add. It is as if quotient digits are selected from the set {1, −1}:
1 corresponds to subtraction
−1 corresponds to addition
Our goal is to end up with a remainder that matches the sign of the dividend.
This idea of trying to match the sign of s with the sign of z leads to a direct signed division algorithm:
if sign(s) = sign(d) then qk–j = 1 else qk–j = −1
Example: q = . . . 0 0 0 1 . . . is produced as . . . 1 −1 −1 −1 . . .
22Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Quotient Conversion and Final Correction
Partial remainder variation and selected quotient digits during nonrestoring division with d > 0.
[Figure: starting from z, the partial remainder is doubled (×2) and then d is added or subtracted at each step, staying between −d and d.]
Quotient with digits −1 and 1:    −1  1 −1 −1  1  1
Replace −1s with 0s:               0  1  0  0  1  1
Shift left, complement MSB, and set LSB to 1 to get the 2’s-complement quotient:    1 1 0 0 1 1 1
Check: −32 + 16 – 8 – 4 + 2 + 1 = −25 = −64 + 32 + 4 + 2 + 1
Final correction step if sign(s) ≠ sign(z): add d to, or subtract d from, s; subtract 1 from, or add 1 to, q.
Adding 1 to the quotient 1 1 0 0 1 1 1 gives 1 1 0 1 0 0 0.
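The conversion rule works because a k-digit quotient with digits $q_i \in \{-1, 1\}$ and its 0/1 image p satisfy $q = 2p + 1 - 2^k$. The sketch below applies the three steps literally; the function name and the return convention are assumptions for illustration.

```python
def signed_digits_to_twos_complement(digits):
    """Convert MSB-first digits in {-1, 1} to a (k+1)-bit 2's-complement word.

    Returns (word, value), where value is the word read as a signed number.
    """
    k = len(digits)
    p = 0
    for digit in digits:                 # step 1: replace -1s with 0s
        p = (p << 1) | (1 if digit == 1 else 0)
    word = (p << 1) | 1                  # step 2: shift left and set LSB to 1
    word ^= 1 << k                       # step 3: complement the MSB of the k+1 bits
    value = word - (1 << (k + 1)) if word >> k else word
    return word, value


word, value = signed_digits_to_twos_complement([-1, 1, -1, -1, 1, 1])
print(format(word, "07b"), value)
```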
23Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Example of Nonrestoring Signed Division
====================================================
z               0 0 1 0 0 0 0 1
2^4 d         1 1 0 0 1
–2^4 d        0 0 1 1 1
====================================================
s(0)          0 0 0 1 0 0 0 0 1
2s(0)         0 0 1 0 0 0 0 1      sign(s(0)) ≠ sign(d),
+2^4 d        1 1 0 0 1            so set q3 = −1 and add
----------------------------------------------------
s(1)          1 1 1 0 1 0 0 1
2s(1)         1 1 0 1 0 0 1        sign(s(1)) = sign(d),
+(–2^4 d)     0 0 1 1 1            so set q2 = 1 and subtract
----------------------------------------------------
s(2)          0 0 0 0 1 0 1
2s(2)         0 0 0 1 0 1          sign(s(2)) ≠ sign(d),
+2^4 d        1 1 0 0 1            so set q1 = −1 and add
----------------------------------------------------
s(3)          1 1 0 1 1 1
2s(3)         1 0 1 1 1            sign(s(3)) = sign(d),
+(–2^4 d)     0 0 1 1 1            so set q0 = 1 and subtract
----------------------------------------------------
s(4)          1 1 1 1 0            sign(s(4)) ≠ sign(z),
+(–2^4 d)     0 0 1 1 1            so perform corrective subtraction
----------------------------------------------------
s(4)          0 0 1 0 1
s               0 1 0 1
q              −1 1 −1 1
====================================================
p = 0 1 0 1        Shift, compl MSB
    1 1 0 1 1      Add 1 to correct
    1 1 0 0        Check: 33/(−7) = −4
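The signed example can be modeled as follows. This is a sketch under stated assumptions: the function name is illustrative, Python integers stand in for the sign-extended registers, a zero partial remainder is treated as positive, and the correction also covers the boundary case where the final remainder equals the divisor in magnitude (a case the slide does not discuss).

```python
def nonrestoring_signed_divide(z: int, d: int, k: int):
    """Direct signed nonrestoring division; returns (q, s) with sign(s) = sign(z)."""
    aligned_d = d << k                            # 2^k * d, keeps the sign of d
    assert d != 0 and abs(z) < abs(aligned_d)
    s, q = z, 0
    for _ in range(k):
        digit = 1 if (s < 0) == (aligned_d < 0) else -1
        s = 2 * s - digit * aligned_d             # subtract for digit 1, add for -1
        q = 2 * q + digit
    wrong_sign = s != 0 and (s < 0) != (z < 0)
    if wrong_sign or abs(s) == abs(aligned_d):    # final correction
        if (s < 0) == (aligned_d < 0):
            s -= aligned_d
            q += 1
        else:
            s += aligned_d
            q -= 1
    return q, s >> k                              # s(k) = 2^k * s


# The slide's example: 33 / (-7) gives q = -4 and s = 5.
print(nonrestoring_signed_divide(33, -7, 4))
```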
24Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
On-The-Fly Conversion
Source: Ercegovac and Lang, “Digital Arithmetic”, p. 257
25Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Nonrestoring Hardware Divider
Shift-subtract sequential nonrestoring divider.
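[Figure: quotient and partial remainder registers (k bits each); divisor register feeding a complement stage controlled by add/sub; k-bit adder with c_in and c_out. Labeled signals: q(k–j), MSB of 2s(j–1), divisor sign, complement of partial remainder sign.]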
26Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Division by Constants
Software and hardware aspects: As was the case for multiplications by constants, optimizing compilers may replace some divisions by shifts/adds/subs; likewise, in custom VLSI circuits, hardware dividers may be replaced by simpler adders.
Method 1: Find the reciprocal of the constant and multiply (particularly efficient if several numbers must be divided by the same divisor).
Method 2: Use the property that for each odd integer d, there exists an odd integer m such that $d \times m = 2^n - 1$; hence, $d = (2^n - 1)/m$ and
$$\frac{z}{d} = \frac{zm}{2^n - 1} = \frac{zm}{2^n (1 - 2^{-n})} = \frac{zm}{2^n}\,(1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n}) \cdots$$
The factor zm is a multiplication by a constant; each factor of the form $(1 + 2^{-jn})$ is one shift-add.
Number of shift-adds required is proportional to log k.
27Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Example: Division by a Constant
Example: Dividing the number z by 5, assuming 24 bits of precision. We have d = 5, m = 3, n = 4; $5 \times 3 = 2^4 - 1$.
$$\frac{z}{5} = \frac{3z}{2^4 - 1} = \frac{3z}{2^4 (1 - 2^{-4})} = \frac{3z}{16}\,(1 + 2^{-4})(1 + 2^{-8})(1 + 2^{-16}) \cdots$$
Instruction sequence for division by 5 (5 shifts, 4 adds):
q ← z + z shift-left 1       {3z computed}
q ← q + q shift-right 4      {3z(1 + 2^–4) computed}
q ← q + q shift-right 8      {3z(1 + 2^–4)(1 + 2^–8) computed}
q ← q + q shift-right 16     {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16) computed}
q ← q shift-right 4          {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16)/16 computed}
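The instruction sequence translates directly into integer shifts and adds. One caveat, which is an observation about this integer model rather than something stated on the slide: each right shift truncates, so the raw result can fall one below the exact quotient (for z = 5 the sequence yields 0). The sketch below therefore adds a single compare-and-increment at the end; that last step and the function name are assumptions.

```python
def divide_by_5(z: int) -> int:
    """Divide a 24-bit unsigned integer by 5 using shifts and adds."""
    assert 0 <= z < (1 << 24)
    q = z + (z << 1)      # 3z
    q = q + (q >> 4)      # 3z(1 + 2^-4)
    q = q + (q >> 8)      # 3z(1 + 2^-4)(1 + 2^-8)
    q = q + (q >> 16)     # 3z(1 + 2^-4)(1 + 2^-8)(1 + 2^-16)
    q = q >> 4            # divide by 16
    # Assumed fix-up: truncation may leave q one short of floor(z / 5).
    if z - 5 * q >= 5:
        q += 1
    return q


print(divide_by_5(1000), divide_by_5(5), divide_by_5((1 << 24) - 1))
```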
28Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Preview of Fast Dividers
As with multiplication, there are but two ways to speed up division:
a. Reducing the number of operands (divide in a higher radix)
b. Adding them faster (keep partial remainder in carry-save form)
[Figure: (a) k × k integer multiplication and (b) 2k / k integer division, both in dot notation.]
Both (a) multiplication and (b) division can be considered as multioperand addition problems.
There is one complication that makes division inherently more difficult: The terms to be subtracted from (added to) the dividend are not known a priori but become known as quotient digits are computed; quotient digits in turn depend on partial remainders.
29Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
14 High-Radix Dividers
Chapter Goals
Study techniques that allow us to obtain more than one quotient bit in each cycle (two bits in radix 4, three in radix 8, . . .)
Chapter Highlights
Radix > 2 ⇒ quotient digit selection harder
Remedy: redundant quotient representation
Carry-save addition reduces cycle time
Implementation methods and tradeoffs
30Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Basics of High-Radix Division
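Division with left shifts:
$s^{(j)} = r s^{(j-1)} - q_{k-j}(r^k d)$, with $s^{(0)} = z$ and $s^{(k)} = r^k s$
Here $r s^{(j-1)}$ is the shift step and $q_{k-j}(r^k d)$ is the subtract step.
Radix-4 division in dot notation
[Figure: dividend z; subtracted rows $-(q_3 q_2)_{two}\, d\, 4^1$ and $-(q_1 q_0)_{two}\, d\, 4^0$; remainder s; quotient q and divisor d. A second sketch shows the alignment of rz and $q_{k-j} r^k d$, each as two k-digit halves.]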
31Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Examples of High-Radix Division

Radix-4 integer division
======================================
z             0 1 2 3 1 1 2 3
4^4 d         1 2 0 3
======================================
s(0)          0 1 2 3 1 1 2 3
4s(0)       0 1 2 3 1 1 2 3
–q3 4^4 d   0 1 2 0 3              {q3 = 1}
--------------------------------------
s(1)          0 0 2 2 1 2 3
4s(1)       0 0 2 2 1 2 3
–q2 4^4 d   0 0 0 0 0              {q2 = 0}
--------------------------------------
s(2)          0 2 2 1 2 3
4s(2)       0 2 2 1 2 3
–q1 4^4 d   0 1 2 0 3              {q1 = 1}
--------------------------------------
s(3)          1 0 0 3 3
4s(3)       1 0 0 3 3
–q0 4^4 d   0 3 0 1 2              {q0 = 2}
--------------------------------------
s(4)          1 0 2 1
s             1 0 2 1
q             1 0 1 2
======================================

Radix-10 fractional division
======================================
z_frac        . 7 0 0 3
d_frac        . 9 9
======================================
s(0)          . 7 0 0 3
10s(0)      7 . 0 0 3
–q–1 d      6 . 9 3                {q–1 = 7}
--------------------------------------
s(1)          . 0 7 3
10s(1)      0 . 7 3
–q–2 d      0 . 0 0                {q–2 = 0}
--------------------------------------
s(2)          . 7 3
s_frac        . 0 0 7 3
q_frac        . 7 0
======================================
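Both examples follow the same recurrence, $s^{(j)} = r s^{(j-1)} - q_{k-j}(r^k d)$, and can be reproduced with the sketch below. The digit is chosen here by an exact integer division, which is an idealization: picking that digit cheaply is exactly the difficulty in real hardware. The function name is illustrative.

```python
def high_radix_divide(z: int, d: int, k: int, r: int):
    """Radix-r digit-recurrence division; returns (quotient digits MSD first, s)."""
    aligned_d = d * r**k                 # r^k * d
    assert d > 0 and 0 <= z < aligned_d, "no-overflow condition"
    s = z
    digits = []
    for _ in range(k):
        s *= r                           # shift left by one radix-r digit
        digit = s // aligned_d           # idealized quotient digit selection
        s -= digit * aligned_d           # subtract q_{k-j} * r^k * d
        digits.append(digit)
    return digits, s // r**k


# Radix-4 example: (0123 1123)four = 7003 and (1203)four = 99.
print(high_radix_divide(7003, 99, 4, 4))
# Radix-10 example: .7003 / .99, scaled to integers 7003 and 99 with k = 2.
print(high_radix_divide(7003, 99, 2, 10))
```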
32Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Difficulty of Quotient Digit Selection
What is the first quotient digit in the following radix-10 division?
2 0 4 3 ) 1 2 2 5 7 9 6 8
The problem with the pencil-and-paper division algorithm is that there is no room for error in choosing the next quotient digit.
In the worst case, all k digits of the divisor and k + 1 digits in the partial remainder are needed to make a correct choice:
12 / 2 = 6
122 / 20 = 6
1225 / 204 = 6
12257 / 2043 = 5
Suppose we used the redundant signed digit set [–9, 9] in radix 10.
Then, we could choose 6 as the next quotient digit, knowing that we can recover from an incorrect choice by using negative digits: the digit pair 5 9 has the same value as 6 −1.
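The four estimates can be regenerated by truncating both operands to their leading digits. The helper below is purely illustrative; it takes the operands as decimal strings so that the leading digits are easy to slice.

```python
def leading_digit_estimates(dividend: str, divisor: str):
    """Estimate the first quotient digit from n divisor digits and n + 1 dividend digits."""
    estimates = []
    for n in range(1, len(divisor) + 1):
        d_part = int(divisor[:n])            # leading n digits of the divisor
        z_part = int(dividend[:n + 1])       # leading n + 1 digits of the dividend
        estimates.append(z_part // d_part)
    return estimates


# Only the estimate that uses all four divisor digits gives the correct digit, 5.
print(leading_digit_estimates("12257968", "2043"))
```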
33Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-2 SRT Division (1/3)
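The new partial remainder, $s^{(j)}$, as a function of the shifted old partial remainder, $2s^{(j-1)}$, in radix-2 nonrestoring division (algorithm in Ch. 13.4).
[Figure: plot of $s^{(j)}$ (vertical axis, from −d to d) against $2s^{(j-1)}$ (horizontal axis, from −2d to 2d), with one line segment for $q_{-j} = -1$ and one for $q_{-j} = 1$.]
$s^{(j)} = 2s^{(j-1)} - q_{-j} d$, with $s^{(0)} = z$, $s^{(k)} = 2^k s$, and $q_{-j} \in \{-1, 1\}$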
34Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Diagrams for Quotient Selection
Robertson’s Diagram
Axes: the shifted residual $2s^{(j-1)}$ and the next residual $s^{(j)}$.
It shows the possible choices of q that keep the next residual bounded.
P-D Diagram
Shifted residual (partial remainder) vs. divisor.
35Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-2 SRT Division (2/3)
Allowing 0 as a quotient digit in nonrestoring division: $q_{-j} = 0$ for $-d \le 2s^{(j-1)} < d$.
[Figure: plot of $s^{(j)}$ (from −d to d) against $2s^{(j-1)}$ (from −2d to 2d), with line segments for $q_{-j} = -1$, $q_{-j} = 0$, and $q_{-j} = 1$.]
$s^{(j)} = 2s^{(j-1)} - q_{-j} d$, with $s^{(0)} = z$, $s^{(k)} = 2^k s$, and $q_{-j} \in \{-1, 0, 1\}$
$q_{-j} = 0$ requires shifting only, which was faster than shift-and-subtract.
But how can you tell if $-d \le 2s^{(j-1)} < d$?
36Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-2 SRT Division (3/3)
The relationship between new and old partial remainders in radix-2 SRT division.
[Figure: plot of s(j) against 2s(j–1) with segments for q–j = –1, 0, 1; the comparison constants –1/2 and 1/2 are marked on the horizontal axis, and both axes are also marked at –1 and 1.]
Comparison with constants −½ and ½ is quite simple:
2s ≥ +½ means 2s = (0.1xxxxxxxx)2’s-compl
2s < −½ means 2s = (1.0xxxxxxxx)2’s-compl

if 2s(j–1) < −½
  then q–j = −1
  else if 2s(j–1) ≥ ½
         then q–j = 1
         else q–j = 0
       endif
endif
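The selection rule can be exercised with exact rational arithmetic. This is a behavioral sketch, assuming a fractional divisor d in [1/2, 1) and a dividend z in [−1/2, 1/2) with z < d; the function name and the use of `Fraction` are illustrative choices, and the final step (add d back and subtract one ulp from q if the last remainder is negative) is the usual correction for an unsigned result.

```python
from fractions import Fraction


def srt_radix2_divide(z: Fraction, d: Fraction, k: int):
    """Radix-2 SRT division with digits in {-1, 0, 1}; returns (digits, q, s)."""
    half = Fraction(1, 2)
    assert half <= d < 1 and -half <= z < half
    s = z
    q = Fraction(0)
    digits = []
    for j in range(1, k + 1):
        shifted = 2 * s
        if shifted < -half:
            digit = -1
        elif shifted >= half:
            digit = 1
        else:
            digit = 0                     # shift only
        s = shifted - digit * d
        q += Fraction(digit, 2**j)
        digits.append(digit)
    if s < 0:                             # final correction
        s += d
        q -= Fraction(1, 2**k)
    return digits, q, s / 2**k            # s(k) = 2^k * s


digits, q, s = srt_radix2_divide(Fraction(69, 256), Fraction(5, 8), 4)
print(digits, q, s)
```

For z = (.01000101)two and d = (.1010)two this produces the digits 1 0 0 −1 and the corrected quotient (.0110)two.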
37Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-2 SRT Division with Variable Shifts
s(0) is adjusted to be in [−½, ½).
We use the comparison constants −½ and ½ for quotient digit selection:
For 2s ≥ +½, or 2s = (0.1xxxxxxxx)2’s-compl, choose q–j = 1
For 2s < −½, or 2s = (1.0xxxxxxxx)2’s-compl, choose q–j = −1
Choose q–j = 0 in other cases, that is, for:
0 ≤ 2s < +½, or 2s = (0.0xxxxxxxx)2’s-compl
−½ ≤ 2s < 0, or 2s = (1.1xxxxxxxx)2’s-compl
Observation: What happens when the magnitude of 2s is fairly small?
2s = (0.00001xxxx)2’s-compl: choosing q–j = 0 would lead to the same condition in the next step; generate 5 quotient digits 0 0 0 0 1
2s = (1.1110xxxxx)2’s-compl: generate 4 quotient digits 0 0 0 −1
Use a leading 0s or leading 1s detection circuit to determine how many quotient digits can be spewed out at once.
Statistically, the average skipping distance will be 2.67 bits.
38Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Example Unsigned Radix-2 SRT Division
====================================================
z               . 0 1 0 0 0 1 0 1
d             0 . 1 0 1 0
–d            1 . 0 1 1 0
====================================================
s(0)          0 . 0 1 0 0 0 1 0 1    In [−½, ½), so okay
2s(0)         0 . 1 0 0 0 1 0 1      ≥ ½, so set q–1 = 1
+(–d)         1 . 0 1 1 0            and subtract
----------------------------------------------------
s(1)          1 . 1 1 1 0 1 0 1
2s(1)         1 . 1 1 0 1 0 1        In [−½, ½), so set q–2 = 0
----------------------------------------------------
s(2) = 2s(1)  1 . 1 1 0 1 0 1
2s(2)         1 . 1 0 1 0 1          In [−½, ½), so set q–3 = 0
----------------------------------------------------
s(3) = 2s(2)  1 . 1 0 1 0 1
2s(3)         1 . 0 1 0 1            < −½, so set q–4 = −1
+d            0 . 1 0 1 0            and add
----------------------------------------------------
s(4)          1 . 1 1 1 1            Negative,
+d            0 . 1 0 1 0            so add to correct
----------------------------------------------------
s(4)          0 . 1 0 0 1
s             0 . 0 0 0 0 1 0 0 1
q             0 . 1 0 0 −1           Uncorrected BSD quotient
q             0 . 0 1 1 0            Convert and subtract ulp
====================================================
Conversion: 0.1000 − 0.0001 = 0.0111; subtracting ulp: 0.0111 − 0.0001 = 0.0110
39Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Using Carry-Save Adders
Constant thresholds used for quotient digit selection in radix-2 division with qk–j in {–1, 0, 1} .
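[Figure: plot of s(j) (from –d to d) against 2s(j–1) (from –2d to 2d) with segments for q–j = –1, 0, 1. With the thresholds –1/2 and 0 on the horizontal axis: choose –1 below –1/2, choose 0 from –1/2 up to 0, choose 1 from 0 upward. The –1/0 and 0/+1 overlap regions are marked.]
You can choose either digit in an overlap region (for example, 0 or 1 in the 0/+1 overlap).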
40Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Quotient Digit Selection Based on Truncated PR
Sum part of 2s(j–1): u = (u1u0 . u–1u–2 . . .)2’s-compl
Carry part of 2s(j–1): v = (v1v0 . v–1v–2 . . .)2’s-compl
Approximation to the partial remainder:

t := u[–2,1] + v[–2,1]      {Add the 4 MSBs of u and v}
if t < –½
  then q–j = –1
  else if t ≥ 0
         then q–j = 1
         else q–j = 0
       endif
endif
41Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Error in t
The 4-bit number t = (t1t0 . t–1t–2)2’s-compl can be compared to the constants –½ and 0 based on only the three bit values t1, t0, and t–1.
Regardless of sign, dropping t–2 discards at most ½, which occurs when t–2 is 1 and the true carry-in to position –2 is 1.
The selected digit still falls in an overlap region:
If t < –½, the true value of 2s(j–1) is guaranteed to be less than 0.
If t < 0, we are guaranteed to have 2s(j–1) < ½ ≤ d.
42Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Divider with Partial Remainder in Carry-Save Form
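[Figure: carry register v and sum register u hold 2s in carry-save form (shifted left each cycle); their 4 MSBs go to a small adder and the “Select q–j” block, which outputs Non0 (enable) and Sign (select) to a Mux that supplies 0, d, or d’ (with +ulp for 2’s complement); a k-bit carry-save adder combines this value with the carry and sum.]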
43Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Why We Cannot Use Carry-Save PR with SRT Division
Overlap regions in radix-2 SRT division.
[Figure: plot of s(j) against 2s(j–1) with segments for q–j = –1, 0, 1 and the constants –1, –1/2, 1/2, and 1 marked; each overlap region has width 1 – d.]
The overlap can become arbitrarily small as d approaches 1.
44Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Choosing the Quotient Digits
A p-d plot for radix-2 division with d ∈ [1/2,1), partial remainder in [–d, d), and quotient digits in [–1, 1].
[Figure: p-d plot with d from .100 to 1. on the horizontal axis and p from −10.0 to 10.0 on the vertical axis. The lines p = −2d, −d, 0, d, and 2d are drawn, and the lines labeled −1 min, −1 max, 0 min, 0 max, 1 min, and 1 max bound the regions “Choose −1”, “Choose 0”, and “Choose 1” and the two overlap regions between them. The regions p ≥ 2d and p < −2d are infeasible. The worst-case error margin in comparison is marked.]
Use p-d plot to understand the q selection and derive the needed precision (number of bits to look at).
45Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Design of the Quotient Digit Selection Logic
Shifted sum = (u1u0 . u−1u−2 . . .)2’s-compl
Shifted carry = (v1v0 . v−1v−2 . . .)2’s-compl
A 4-bit adder forms the approximate shifted PR = (t1t0 . t−1t−2)2’s-compl, and combinational logic derives:
Non0 = t1′ ∨ t0′ ∨ t–1′ = (t1 t0 t−1)′
Sign = t1 (t0′ ∨ t−1′)
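The two logic equations can be checked exhaustively against the threshold rule (choose −1 for t < −½, 1 for t ≥ 0, and 0 otherwise). The sketch below enumerates all 16 values of t; the function names are illustrative, and t is represented in units of 1/4 to keep the arithmetic in integers.

```python
def select_logic(t1: int, t0: int, tm1: int):
    """Return (Non0, Sign) from the three most significant bits of t."""
    non0 = int(not (t1 and t0 and tm1))        # (t1 t0 t-1)'
    sign = int(t1 and not (t0 and tm1))        # t1 (t0' or t-1')
    return non0, sign


def select_by_threshold(t_quarters: int) -> int:
    """Quotient digit from the value of t, given in units of 1/4."""
    if t_quarters < -2:          # t < -1/2
        return -1
    if t_quarters >= 0:          # t >= 0
        return 1
    return 0


for pattern in range(16):
    t1, t0, tm1, tm2 = (pattern >> 3) & 1, (pattern >> 2) & 1, (pattern >> 1) & 1, pattern & 1
    t_quarters = -8 * t1 + 4 * t0 + 2 * tm1 + tm2     # 2's-complement value times 4
    non0, sign = select_logic(t1, t0, tm1)
    digit = 0 if not non0 else (-1 if sign else 1)
    print(f"{t1}{t0}.{tm1}{tm2}", digit, select_by_threshold(t_quarters))
```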
46Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan
Radix-4 SRT Division
New versus shifted old partial remainder in radix-4 division with q–j in [–3, 3].
Radix-4 fractional division with left shifts and q–j ∈ [–3, 3]
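s(j) = 4s(j–1) – q–j d, with s(0) = z and s(k) = 4^k s
(4s(j–1) is the shift step; – q–j d is the subtract step)
Two difficulties:
How do you choose from among the 7 possible values for q−j?
If the choice is +3 or −3, how do you form 3d?
[Figure: s(j) versus 4s(j–1) over [–4d, 4d), with s(j) between –d and d and one diagonal segment for each digit –3, –2, –1, 0, +1, +2, +3.]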
Building the p-d Plot for Radix-4 Division
A p-d plot for radix-4 SRT division with quotient digit set [–3, 3].
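[Figure: upper half of the p-d plot, with d from .100 to .111 on the horizontal axis and p from 00.0 to 11.1 on the vertical axis. The lines p = d, 2d, 3d, and 4d delimit the regions "Choose 0", "Choose 1", "Choose 2", and "Choose 3" (each with its min and max boundary), with overlap regions between neighboring choices and two uncertainty regions marked. Infeasible region: p cannot be ≥ 4d.]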
Uncertainty region: arises because of truncation.
The choice between q = 3 and q = 2 depends not only on p but also on one bit of d, namely d–2.
Restricting the Quotient Digit Set in Radix 4
Fig. 14.13 New versus shifted old partial remainder in radix-4 division with q–j in [–2, 2].
[Figure: s(j) versus 4s(j–1) over [–4d, 4d), with one diagonal segment for each digit –3 to +3; the restricted ranges ±2d/3 for s(j) and ±8d/3 for 4s(j–1) are marked.]
Radix-4 fractional division with left shifts and q–j ∈ [–2, 2]
s(j) = 4s(j–1) – q–j d, with s(0) = z and s(k) = 4^k s
(4s(j–1) is the shift step; – q–j d is the subtract step)
For this restriction to be feasible, we must have:
s ∈ [−hd, hd) for some h < 1, and 4hd – 2d ≤ hd
This yields h ≤ 2/3 (choose h = 2/3 to minimize the restriction)
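The range restriction can be exercised with exact rational arithmetic. The sketch below is only an illustration: it picks each digit by rounding p/d to the nearest integer and clamping to [–2, 2], which is an assumed idealized selection rule (real hardware inspects a few bits of p and d), and it checks the closed bound |s| ≤ 2d/3 for simplicity.

```python
from fractions import Fraction


def radix4_srt_divide(z, d, k):
    """Radix-4 division with digits in [-2, 2] (idealized digit selection).

    Returns the list of quotient digits q-1..q-k and the final s(k).
    """
    h = Fraction(2, 3)
    if not (Fraction(1, 2) <= d < 1) or abs(z) > h * d:
        raise ValueError("need 1/2 <= d < 1 and |z| <= 2d/3")
    s = z
    digits = []
    for _ in range(k):
        p = 4 * s                              # shift
        q = max(-2, min(2, round(p / d)))      # assumed selection rule
        s = p - q * d                          # subtract
        if abs(s) > h * d:
            raise AssertionError("partial remainder left [-2d/3, 2d/3]")
        digits.append(q)
    return digits, s
```

For example, with z = 1/5 and d = 3/4 the identity z = d · Σ q–j 4^–j + s(k) 4^–k holds exactly after any number of steps, which follows directly from the recurrence.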
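Building the p-d Plot with Restricted Radix-4 Digit Set
A p-d plot for radix-4 SRT division with quotient digit set [–2, 2].
[Figure: upper half of the p-d plot, with d from .100 to .111 on the horizontal axis and p from 00.0 to 11.1 on the vertical axis. The lines p = d/3, 2d/3, 4d/3, 5d/3, and 8d/3 delimit the regions "Choose 0", "Choose 1", and "Choose 2" (each with its min and max boundary), with overlap regions between neighboring choices; in some cells the choice depends on d. Infeasible region: p cannot be ≥ 8d/3.]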
General High-Radix Dividers
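[Figure: block diagram of a general high-radix divider. The partial remainder 2s is kept as sum u and carry v and shifted left; a "Select q–j" block chooses the quotient digit; a multiple generation/selection block supplies |q–j| d or its complement; a CSA tree forms the new carry and sum, and a k-bit adder combines them.]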
Process to derive the details:
Radix r
Digit set [–α, α] for q–j
Number of bits of p (v and u) and d to be inspected
Quotient digit selection unit (table or logic)
Multiple generation/selection scheme
Conversion of redundant q to 2’s complement
15 Variations in Dividers
Chapter Goals
Discuss practical aspects of designing high-radix division schemes and cover other types of fast hardware dividers
Chapter Highlights
Building and using p-d plots in practice
Prescaling simplifies q digit selection
Parallel hardware (array) dividers
Shared hardware in multipliers/dividers
Square-rooting not special case of division
Quotient Digit Selection Revisited
Radix-r division with quotient digit set [–α, α], α < r – 1
Restrict the partial remainder range, say to [–hd, hd)
From the solid rectangle in Fig. 15.1, we get rhd – αd ≤ hd or h ≤ α/(r – 1)
To minimize the range restriction, we choose h = α/(r – 1)
The relationship between new and shifted old partial remainders in radix-r division with quotient digits in [–α, +α].
[Figure: s(j) versus r s(j–1) over [–rd, rd), with diagonal segments for the digits –r+1, . . . , –α, . . . , –1, 0, 1, . . . , α, . . . , r–1; the ranges ±hd for s(j) and ±rhd for r s(j–1) form the solid rectangle, and ±d, ±αd are marked on the horizontal axis.]
Why Using Truncated p and d Values Is Acceptable
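A part of p-d plot showing the overlap region for choosing the quotient digit value β or β+1 in radix-r division with quotient digit set [–α, α].
[Figure: the overlap region lies between the lines (–h + β + 1)d and (h + β)d, inside the band from (–h + β)d to (h + β + 1)d, starting at d min; "Choose β + 1" applies above it and "Choose β" below it. Uncertainty rectangles A and B are shown for two precisions: 4 bits of p with 3 bits of d, and 3 bits of p with 4 bits of d.]
Note: h = α / (r – 1)
Standard p: xx.xxxx
Carry-save p: xx.xxxxx and xx.xxxxx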
Table Entries in the Quotient Digit Selection Logic
We want to make the uncertainty rectangle as large as possible, to minimize the number of bits in p and d needed for choosing the quotient digits.
[Figure: p-d plot starting at the origin, overlaid with a grid of uncertainty rectangles between the lines (–h + β)d, (h + β)d, (–h + β + 1)d, and (h + β + 1)d. Cells are labelled β, β+1, or "β+1 or β", and the labels trace a staircase-like selection boundary.]
Note: h = α/(r – 1)
Using p-d Plots in Practice
Establishing upper bounds on the dimensions of uncertainty rectangles.
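[Figure: the overlap region between "Choose α − 1" (upper boundary (h + α − 1)d) and "Choose α" (lower boundary (−h + α)d). At d min the region has height Δp, from (−h + α) d min to (h + α − 1) d min; Δd is the horizontal distance from d min to d min + Δd, where the line (−h + α)d reaches the height (h + α − 1) d min.]
Smallest Δd occurs for the overlap region of α and α – 1
Δd = d min (2h – 1) / (–h + α)
Δp = d min (2h – 1)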
Example: Lower Bounds on Precision
Fig. 15.4
[Figure: the overlap region between "Choose α − 1" (upper boundary (h + α − 1)d) and "Choose α" (lower boundary (−h + α)d), with Δp measured at d min and Δd measured from d min to d min + Δd.]
Δd = d min (2h – 1) / (–h + α)
Δp = d min (2h – 1)
For r = 4, divisor range [0.5, 1), digit set [–2, 2], we have α = 2, d min = 1/2, h = α/(r – 1) = 2/3
Δd = (1/2)(4/3 – 1) / (–2/3 + 2) = 1/8
Δp = (1/2)(4/3 – 1) = 1/6
Because 1/8 = 2^–3 and 2^–3 ≤ 1/6 < 2^–2, we must inspect at least 3 bits of d (2, given its leading 1) and 3 bits of p
These are lower bounds (not truncated bits) and may prove inadequate
In fact, 3 bits of p and 4 (3) bits of d are required
With p in carry-save form, 4 bits of each component must be inspected
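The arithmetic in this example is easy to reproduce. The following sketch evaluates h, Δd, and Δp exactly and finds the smallest number of fractional bits n with 2^–n ≤ Δ; the function names are illustrative, and the result is only the lower bound discussed on the slide, not the final precision.

```python
from fractions import Fraction


def uncertainty_bounds(r, alpha, d_min):
    """Return (h, delta_d, delta_p) for radix r and digit set [-alpha, alpha]."""
    h = Fraction(alpha, r - 1)
    delta_d = d_min * (2 * h - 1) / (alpha - h)
    delta_p = d_min * (2 * h - 1)
    return h, delta_d, delta_p


def min_fraction_bits(delta):
    """Smallest n such that 2**-n <= delta (a lower bound on precision)."""
    n = 0
    while Fraction(1, 2 ** n) > delta:
        n += 1
    return n


h, delta_d, delta_p = uncertainty_bounds(4, 2, Fraction(1, 2))
bits_d = min_fraction_bits(delta_d)   # fractional bits of d
bits_p = min_fraction_bits(delta_p)   # fractional bits of p
```

With r = 4 and α = 2 this gives h = 2/3, Δd = 1/8, Δp = 1/6, and 3 fractional bits in each direction, matching the slide.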
Upper Bounds for Precision
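Theorem: Once lower bounds on precision are determined based on Δd and Δp, one more bit of precision in each direction is always adequate
[Figure: the overlap region between "Choose α − 1" (upper boundary (α − 1 + h)d) and "Choose α" (lower boundary (α − h)d), starting at d min, with Δd and Δp marked; vertical grid lines spaced w apart, uncertainty rectangles A and B, and the vertical distances u and v.]
Proof: Let w be the spacing of vertical grid lines
w ≤ Δd/2 ⇒ v ≤ Δp/2 ⇒ u ≥ Δp/2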
Some Implementation Details
The asymmetry of quotient digit selection process.
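[Figure, left: p-d plot between d min and d max showing "Choose β + 1" and "Choose β" for positive p and "Choose −β + 1" and "Choose −β" for negative p, with uncertainty rectangles A and B. Figure, right: grid of uncertainty rectangles labelled β, β+1, or "β+1 or β", with four cells marked by asterisks.]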
Example of p-d plot allowing larger uncertainty rectangles, if the 4 cases marked with asterisks are handled as exceptions.
The Pentium chip division bug
Radix r = 4
q–j in [–2, 2]
d in [1/2, 1)
p in [–8/3, 8/3]
[Figure: p-d plot for these parameters, with the divisor d on the horizontal axis and p from 10.10 to 01.10 on the vertical axis. The boundary lines ±d/3, ±2d/3, ±4d/3, and ±5d/3 are drawn, and the cells near the upper boundaries are labelled with the admissible digits (1, 2, or "1,2").]
Division with Prescaling
Restricting the divisor to the shaded area simplifies quotient digit selection.
[Figure: p-d plot between d min and d max with regions "Choose β + 1" and "Choose β" for positive p and "Choose −β + 1" and "Choose −β" for negative p; a shaded band near d max marks the restricted divisor range.]
Overlap regions of a p-d plot are wider toward the high end of the divisor range.
If we can restrict the magnitude of the divisor to an interval close to dmax (say 1 – ε < d < 1 + δ, when dmax = 1), quotient digit selection may become simpler.
Thus, we perform the division (zm)/(dm) for a suitably chosen scale factor m (m > 1).
Prescaling (multiplying z and d by m) should be done without real multiplications.
Modular Dividers and Reducers
Given dividend z and divisor d, with d ≥ 0, a modular divider computes
q = ⎣z / d⎦ and s = z mod d = ⟨z⟩d
The quotient q is, by definition, an integer but the inputs z and d do not have to be integers; the modular remainder is always positive
Example:
⎣–3.76 / 1.23⎦ = –4 and ⟨–3.76⟩1.23 = 1.16
The quotient and remainder of ordinary division are −3 and −0.07
A modular reducer computes only the modular remainder and is in many cases simpler than a full-blown divider
⟨z⟩d = ⟨zH 2^k + zL⟩d = ⟨zH (2^k – 1) + zH + zL⟩d
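The numerical example can be reproduced with exact rationals. This is a small sketch, with illustrative function names, contrasting floor-based (modular) division with truncating (ordinary) division.

```python
from fractions import Fraction
from math import floor


def modular_divide(z, d):
    """Return (q, s) with q = floor(z / d) and s = z - q*d, so 0 <= s < d."""
    q = floor(z / d)
    return q, z - q * d


def ordinary_divide(z, d):
    """Return (q, s) with q truncated toward zero; s takes the sign of z."""
    q = int(z / d)          # int() on a Fraction truncates toward zero
    return q, z - q * d


z, d = Fraction(-376, 100), Fraction(123, 100)
q_mod, s_mod = modular_divide(z, d)      # -4 and 1.16
q_ord, s_ord = ordinary_divide(z, d)     # -3 and -0.07
```

Both results satisfy z = q·d + s; they differ only in which side of zero the remainder is allowed to fall.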
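Array Dividers
Restoring array divider composed of controlled subtractor cells.
Dividend z = .z–1 z–2 z–3 z–4 z–5 z–6
Divisor d = .d–1 d–2 d–3
Quotient q = .q–1 q–2 q–3
Remainder s = .0 0 0 s–4 s–5 s–6
[Figure: three rows of controlled subtractor cells (each a full subtractor, FS, with a 1/0 selector). The divisor bits d–1 d–2 d–3 and dividend bits z–1 . . . z–4 enter at the top, z–5 and z–6 enter at the lower rows, each row produces one quotient bit q–1, q–2, q–3, and the remainder bits s–4 s–5 s–6 emerge at the bottom.]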
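The behavior of the array, row by row, can be modeled in a few lines. The sketch below is a behavioral model only (it does not model individual cells or borrow propagation); it assumes the operands are given as integer numerators, z as a 2k-bit fraction and d as a k-bit fraction, with z < d so that the quotient is a proper fraction.

```python
def restoring_array_divide(z, d, k=3):
    """Behavioral model of a k-row restoring array divider.

    z is the integer numerator of the dividend z / 2**(2k),
    d is the integer numerator of the divisor d / 2**k.
    Returns (q, s): quotient numerator over 2**k and remainder
    numerator over 2**(2k), so that z == q * d + s and 0 <= s < d.
    """
    if not (0 < d < 2 ** k) or not (0 <= z < d * 2 ** k):
        raise ValueError("need 0 < d < 2**k and 0 <= z < d * 2**k")
    s = z
    q = 0
    for j in range(1, k + 1):
        trial = s - (d << (k - j))   # row j: subtract d aligned with weight 2**-j
        if trial >= 0:               # no borrow-out: keep the difference
            s = trial
            q = (q << 1) | 1
        else:                        # borrow-out: restore (pass s through)
            q = q << 1
    return q, s
```

For z = .011101 and d = .101, the rows produce q = .101 and s = .000100, i.e. 29 = 5 × 5 + 4 in integer terms.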
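Nonrestoring Array Divider
Nonrestoring array divider built of controlled add/subtract cells.
Similarity to array multiplier is deceiving
Dividend z = z0 . z–1 z–2 z–3 z–4 z–5 z–6
Divisor d = d0 . d–1 d–2 d–3
Quotient q = q0 . q–1 q–2 q–3
Remainder s = 0 . 0 0 s–3 s–4 s–5 s–6
[Figure: four rows of controlled add/subtract cells (each a full adder, FA, with an XOR gate). The divisor bits d0 d–1 d–2 d–3 and dividend bits z0 . . . z–3 enter at the top, z–4, z–5, z–6 enter at the lower rows, each row produces one quotient bit q0 to q–3, and the remainder bits s–3 . . . s–6 emerge at the bottom; the critical path is marked.]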
Speedup Methods for Array Dividers
Idea: Pass the partial remainder downward in carry-save form to speed up the operation of each row
However, we still need to know the carry/borrow-out from each row
Solution: Insert a carry-lookahead circuit between successive rows
Not very cost-effective; thus not used in practice
[Figure (Fig. 15.8): nonrestoring array divider built of controlled add/subtract cells, with its critical path marked.]
Combined Multiply/Divide Units
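[Figure: side-by-side block diagrams of a sequential multiplier (Fig. 9.4) and a sequential divider (Fig. 13.10). The multiplier has registers for the multiplier x, the double-width partial product p, and the multiplicand a, with a mux selecting 0 or a under control of xj, a k-bit adder, and shifts. The divider has registers for the quotient, the partial remainder, and the divisor, with a complement block, a k-bit adder with add/sub control, and quotient-bit logic driven by the divisor sign and the MSB of 2s.]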
Similarity of blocks in multipliers and dividers (only shift direction is different)
Single Unit for Sequential Multiplication and Division
The control unit proceeds through necessary steps for multiplication or division (including using the appropriate shift direction)
Sequential radix-2 multiply/divide unit.
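[Figure: one register holds the multiplier x or quotient q, one holds the partial product p or partial remainder s, and one holds the multiplicand a or divisor d. A mux and a k-bit adder (with carry-in and carry-out) are shared. A multiply/divide control block, driven by the Mul/Div select, xj, the MSB of p(j+1), the MSB of 2s(j–1), and the divisor sign, produces the shift control, enable, select, and q k–j signals.]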
The slight speed penalty owing to a more complex control unit is insignificant
Single Unit for Array Multiplication and Division
Each cell within the array can act as a modified adder or modified subtractor based on control input values
I/O specification of a universal circuit that can act as an array multiplier or array divider.
In some designs, squaring and square-rooting functions are also included within the same array
[Figure: the universal array shown as a block with the following ports: multiplicand or divisor; multiplier; additive input or dividend; Mul/Div control; product or remainder; quotient.]
16 Division by Convergence
Chapter Goals
Show how by using multiplication as the basic operation in each division step, the number of iterations can be reduced
Chapter Highlights
Digit-recurrence as convergence method
Convergence by Newton-Raphson iteration
Computing the reciprocal of a number
Hardware implementation and fine tuning
General Convergence Methods
Two-variable form:
u (i+1) = f(u (i), v (i))
v (i+1) = g(u (i), v (i))
Three-variable form:
u (i+1) = f(u (i), v (i), w (i))
v (i+1) = g(u (i), v (i), w (i))
w (i+1) = h(u (i), v (i), w (i))
Guide the iteration such that one of the values converges to a constant (usually 0 or 1)
The other value then converges to the desired function
The complexity of this method depends on two factors:
a. Ease of evaluating f and g (and h)
b. Rate of convergence (number of iterations needed)
Division by Repeated Multiplications
Motivation: Suppose add takes 1 clock and multiply 3 clocks
64-bit divide takes 64 clocks in radix 2, 32 in radix 4
Division via multiplications is faster if 10 or fewer multiplications are needed
Idea: q = z/d = (z x(0) x(1) . . . x(m–1)) / (d x(0) x(1) . . . x(m–1))
The denominator is forced to 1; the numerator then converges to q
To turn the identity into a division algorithm, we face three questions:
1. How to select the multipliers x(i)?
2. How many iterations (pairs of multiplications)?
3. How to implement in hardware?
Remainder often not needed, but can be obtained by another multiplication if desired: s = z – qd
Formulation as a Convergence Computation
Idea: q = z/d = (z x(0) x(1) . . . x(m–1)) / (d x(0) x(1) . . . x(m–1))
The denominator is forced to 1; the numerator then converges to q
d (i+1) = d (i) x (i)    Set d (0) = d; make d (m) converge to 1
z (i+1) = z (i) x (i)    Set z (0) = z; obtain z/d = q ≅ z (m)
Question 1: How to select the multipliers x (i)? x (i) = 2 – d (i)
This choice transforms the recurrence equations into:
d (i+1) = d (i) (2 − d (i))    Set d (0) = d; iterate until d (m) ≅ 1
z (i+1) = z (i) (2 − d (i))    Set z (0) = z; obtain z/d = q ≅ z (m)
Fits the general form
u (i+1) = f(u (i), v (i))
v (i+1) = g(u (i), v (i))
Determining the Rate of Convergence
d (i+1) = d (i) x (i)    Set d (0) = d; make d (m) converge to 1
z (i+1) = z (i) x (i)    Set z (0) = z; obtain z/d = q ≅ z (m)
Question 2: How quickly does d (i) converge to 1?
We can relate the error in step i + 1 to the error in step i:
d (i+1) = d (i) (2 − d (i)) = 1 – (1 – d (i))^2
1 – d (i+1) = (1 – d (i))^2
For 1 – d (i) ≤ ε, we get 1 – d (i+1) ≤ ε^2: Quadratic convergence
In general, for k-bit operands, we need
2m – 1 multiplications and m 2’s complementations
where m = ⎡log2 k⎤
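The recurrences and the operation count can be sketched with exact rationals. This is an idealized model under stated assumptions: no truncation or rounding is modeled, the function name is illustrative, and the last update of d is skipped because only z (m) is needed, which is what gives 2m – 1 multiplications.

```python
from fractions import Fraction


def divide_by_repeated_multiplication(z, d, k):
    """Approximate z/d for 1/2 <= d < 1 using m = ceil(log2 k) iterations.

    Returns (z_m, multiplications). Exact rational arithmetic is used,
    so hardware truncation effects are not modeled.
    """
    if not (Fraction(1, 2) <= d < 1):
        raise ValueError("need 1/2 <= d < 1")
    m = (k - 1).bit_length()         # equals ceil(log2 k) for k >= 1
    d_i, z_i = d, z
    multiplications = 0
    for i in range(m):
        x_i = 2 - d_i                # one 2's complementation
        z_i = z_i * x_i              # z(i+1) = z(i) x(i)
        multiplications += 1
        if i < m - 1:                # d(m) itself is never needed
            d_i = d_i * x_i          # d(i+1) = d(i) x(i)
            multiplications += 1
    return z_i, multiplications
```

Since z (m) = q · d (m) and 1 – d (m) = (1 – d)^(2^m), the result approaches q from below, with relative error at most 2^–k when d ≥ 1/2.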
Quadratic Convergence
Table: Quadratic convergence in computing z/d by repeated multiplications, where 1/2 ≤ d = 1 – y < 1
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
i    d (i) = d (i–1) x (i–1), with d (0) = d              x (i) = 2 – d (i)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
0    1 – y    = (.1xxx xxxx xxxx xxxx)two ≥ 1/2            1 + y
1    1 – y^2  = (.11xx xxxx xxxx xxxx)two ≥ 3/4            1 + y^2
2    1 – y^4  = (.1111 xxxx xxxx xxxx)two ≥ 15/16          1 + y^4
3    1 – y^8  = (.1111 1111 xxxx xxxx)two ≥ 255/256        1 + y^8
4    1 – y^16 = (.1111 1111 1111 1111)two = 1 – ulp
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Each iteration doubles the number of guaranteed leading 1s (convergence to 1 is from below)
Beginning with a single 1 (d ≥ ½), after log2 k iterations we get as close to 1 as is possible in a fractional representation
Graphical Depiction of Convergence to q
Question 3 (implementation in hardware) to be discussed later
[Figure: d (i) and z (i) plotted against iteration i = 0 to 6. Starting from d, d (i) rises to 1 – ulp; starting from z, z (i) converges to q – ε.]
Division by Reciprocation
Convergence to a root of f(x) = 0 in the Newton-Raphson method.
The Newton-Raphson method can be used for finding a root of f (x) = 0
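[Figure: curve f(x) crossing the x axis at the root; the tangent at x(i), which makes angle α(i) with the x axis, meets the axis at x(i+1), and the next step gives x(i+2).]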
Start with an initial estimate x(0) for the root
Iteratively refine the estimate via the recurrence
x(i+1) = x(i) – f (x(i)) / f ′(x(i))
Justification:
tan α(i) = f ′(x(i))= f (x(i)) / (x(i) – x(i+1))
Computing 1/d by Convergence
1/d is the root of f (x) = 1/x – d
f ′(x) = –1/x^2
Substitute in the Newton-Raphson recurrence x(i+1) = x(i) – f (x(i)) / f ′(x(i)) to get:
x (i+1) = x (i) (2 − x (i)d)
One iteration = Two multiplications + One 2’s complementation
Error analysis: Let δ (i) = 1/d – x(i) be the error at the ith iteration
δ (i+1) = 1/d – x (i+1) = 1/d – x (i) (2 – x (i) d) = d (1/d – x (i))^2 = d (δ (i))^2
Because d < 1, we have δ (i+1) < (δ (i))^2
[Figure: plot of f(x) = 1/x – d, which crosses the x axis at x = 1/d and approaches –d for large x.]
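The recurrence and its error relation can be verified step by step. The sketch below uses exact rationals (so no rounding is modeled); the function name and the choice of returning the whole error history are illustrative.

```python
from fractions import Fraction


def reciprocal_newton_raphson(d, x0, iterations):
    """Iterate x(i+1) = x(i) (2 - x(i) d) and record delta(i) = 1/d - x(i)."""
    x = x0
    errors = [1 / d - x]
    for _ in range(iterations):
        x = x * (2 - x * d)      # two multiplications, one 2's complementation
        errors.append(1 / d - x)
    return x, errors


d = Fraction(3, 4)
x, errors = reciprocal_newton_raphson(d, Fraction(3, 2), 4)
# Each error equals d times the square of the previous one.
relation_holds = all(errors[i + 1] == d * errors[i] ** 2 for i in range(4))
```

Because δ (i+1) = d (δ (i))^2 is nonnegative, every estimate after the first lies at or below 1/d, whatever the sign of the initial error.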
Choosing the Initial Approximation to 1/d
With x(0) in the range 0 < x(0) < 2/d, convergence is guaranteed
Justification: |δ(0)| = |x(0) – 1/d| < 1/d
|δ(1)| = |x(1) – 1/d| = d (δ(0))^2 = (d |δ(0)|) |δ(0)| < |δ(0)|
[Figure: plot of 1/x for x between 0 and 2.]
For d in [1/2, 1):
Simple choice x(0) = 1.5
Max error = 0.5 < 1/d
Better approx. x(0) = 4(√3 – 1) – 2d = 2.9282 – 2d
Max error ≅ 0.1
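The two maximum-error figures can be checked numerically. The sketch below simply samples d over [1/2, 1) in floating point; the sampling density and function names are arbitrary choices for illustration.

```python
import math


def constant_guess(d):
    """Simple choice x(0) = 1.5, independent of d."""
    return 1.5


def linear_guess(d):
    """Better approximation x(0) = 4(sqrt(3) - 1) - 2d."""
    return 4 * (math.sqrt(3) - 1) - 2 * d


def max_abs_error(guess, samples=10001):
    """Largest |1/d - guess(d)| over evenly spaced d in [1/2, 1)."""
    worst = 0.0
    for n in range(samples):
        d = 0.5 + 0.5 * n / samples
        worst = max(worst, abs(1.0 / d - guess(d)))
    return worst
```

The constant guess is worst at the ends of the interval, while the linear guess is worst near the middle (around d = 1/√2), where its error magnitude is just under 0.1.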
Speedup of Convergence Division
Division can be performed via 2⎡log2k⎤ – 1 multiplications
This is not yet very impressive
64-bit numbers, 3-ns multiplier ⇒ 33-ns division
Three types of speedup are possible:
Fewer multiplications (reduce m)
Narrower multiplications (reduce the width of some x(i)s)
Faster multiplications
Repeated multiplications: q = z/d = (z x(0) x(1) . . . x(m–1)) / (d x(0) x(1) . . . x(m–1))
Reciprocation: Compute y = 1/d, then do the multiplication yz
Initial Approximation via Table Lookup
Convergence is slow in the beginning: it takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits
d x(0) x(1) x(2) = (0.1111 1111 . . . )two
(x(0) is an approximation to 1/d; the product x(0) x(1) x(2) is a better approximation)
Read this value, x(0+), directly from a table, thereby reducing 6 multiplications to 2
A 2^w × w lookup table is necessary and sufficient for w bits of convergence after 2 multiplications
Example with 4-bit lookup: d = 0.1011 xxxx . . . (11/16 ≤ d < 12/16)
Inverses of the two extremes are 16/11 ≅ 1.0111 and 16/12 ≅ 1.0101
So, 1.0110 is a good estimate for 1/d
1.0110 × 0.1011 = (11/8) × (11/16) = 121/128 = 0.1111001
1.0110 × 0.1100 = (11/8) × (3/4) = 33/32 = 1.000010
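The 4-bit example can be reproduced exactly. In this sketch the helper names are illustrative; it multiplies the table estimate by the two ends of the divisor interval and reports how far each product is from 1.

```python
from fractions import Fraction


def from_binary(bits, fraction_bits):
    """Interpret an integer bit pattern as a fixed-point value."""
    return Fraction(bits, 2 ** fraction_bits)


def products_at_extremes(estimate, d_low, d_high):
    """Return d*estimate at both ends of the divisor interval."""
    return d_low * estimate, d_high * estimate


estimate = from_binary(0b10110, 4)     # 1.0110 = 11/8
d_low = from_binary(0b1011, 4)         # 0.1011 = 11/16
d_high = from_binary(0b1100, 4)        # 0.1100 = 12/16
low_product, high_product = products_at_extremes(estimate, d_low, d_high)
worst_distance = max(abs(1 - low_product), abs(1 - high_product))
```

Both products, 121/128 and 33/32, lie within 2^–4 of 1, which is the sense in which one table lookup gives about 4 bits of convergence here.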
Visualizing the Convergence with Table Lookup
Convergence in division by repeated multiplications with initial table lookup.
[Figure: d and z plotted against iterations. After the table lookup and the 1st pair of multiplications, which replace several iterations, d is already close to 1; after the 2nd pair of multiplications d reaches 1 – ulp and z reaches q – ε.]
Convergence Does Not Have to Be from Below
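[Figure: d and z plotted against iterations; d converges to 1 ± ulp and z converges to q ± ε, approaching the limit from either side.]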
Using Truncated Multiplicative Factors
Fig. 16.4 One step in convergence division with truncated multiplicative factors.
[Figure: one step from iteration i to iteration i + 1, starting at d x (0) x (1) . . . x (i). The precise iteration reaches d x (0) x (1) . . . x (i) x (i+1) (point A); the approximate iteration uses the truncated factor (x (i+1))T and reaches d x (0) x (1) . . . x (i) (x (i+1))T (point B). Both lie near 1, and a bound of < 2^–a is marked.]
Example (64-bit multiplication)
Initial step: Table of size 256 × 8 = 2K bits
Middle steps: Multiplication pairs, with 9-, 17-, and 33-bit multipliers
Final step: Full 64 × 64 multiplication
Problem 16.9a
A truncated denominator d (i), with a identical leading bits and b extra bits (b ≤ a), leads to a new denominator d (i+1) with a + b identical leading bits
Hardware Implementation
Repeated multiplications: Each pair of ops involves the same multiplier
d (i+1) = d (i) (2 − d (i))    Set d (0) = d; iterate until d (m) ≅ 1
z (i+1) = z (i) (2 − d (i))    Set z (0) = z; obtain z/d = q ≅ z (m)
Two multiplications fully overlapped in a 2-stage pipelined multiplier.
[Figure: pipeline timing. The products d (i) x (i) and z (i) x (i) are issued back to back and yield d (i+1) and z (i+1); x (i+1) is obtained from d (i+1) by 2's complementation; the next pair d (i+1) x (i+1) and z (i+1) x (i+1) follows, yielding d (i+2).]
Implementing Division with Reciprocation
Reciprocation: Multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel
x (i+1) = x (i) (2 − x (i)d)
Options for speedup via a better initial approximation
Consult a larger table
Resort to a bipartite or multipartite table (see Chapter 24)
Use table lookup, followed with interpolation
Compute the approximation via multioperand addition
Unless several multiplications by the same multiplier are needed, division by repeated multiplications is more efficient
However, given a fast method for reciprocation (see Section 24.6), using a reciprocation unit with a standard multiplier is often preferred
Analysis of Lookup Table Size
Table: Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Address    d = 0.1 xxxx xxxx    x (0+) = 1. xxxx xxxx
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
55         0011 0111            1010 0101
64         0100 0000            1001 1001
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Example: Table entry at address 55 (311/512 ≤ d < 312/512)
For 8 bits of convergence, the table entry f must satisfy
(311/512)(1 + .f) ≥ 1 – 2^–8
(312/512)(1 + .f) ≤ 1 + 2^–8
199/311 ≤ .f ≤ 101/156 or 163.81 ≤ 256 × .f ≤ 165.74
Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
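The two inequalities can be solved for any table address. The sketch below assumes the address holds the w bits of d after the mandatory leading 1, as in the table above; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil, floor


def valid_table_entries(address, w=8):
    """List the integers v = 2**w * .f that give w bits of convergence.

    The divisor interval is (2**w + address) / 2**(w+1) <= d <
    (2**w + address + 1) / 2**(w+1), and the entry 1.f must satisfy
    d_low (1 + .f) >= 1 - 2**-w and d_high (1 + .f) <= 1 + 2**-w.
    """
    d_low = Fraction(2 ** w + address, 2 ** (w + 1))
    d_high = Fraction(2 ** w + address + 1, 2 ** (w + 1))
    eps = Fraction(1, 2 ** w)
    f_low = (1 - eps) / d_low - 1
    f_high = (1 + eps) / d_high - 1
    return list(range(ceil(f_low * 2 ** w), floor(f_high * 2 ** w) + 1))
```

For address 55 this yields 164 and 165, the two choices on the slide; the tabulated entry for address 64, (1001 1001)two = 153, is likewise among the admissible values.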
A General Result for Table Size
Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from table is of the form (1.xxx . . . xxx)two, with w bits after the radix point
Proof strategy for sufficiency: Represent the table entry 1.f as the integer v = 2^w × .f and derive upper / lower bound expressions for it. Then, show that at least one integer exists between vlb and vub
Proof strategy for necessity: Show that derived conditions cannot be met if the table is of size 2^(k–1) (no matter how wide) or if it is of width k – 1 (no matter how large)
Excluded cases, w < 5: Practically uninteresting (allow smaller table)
General radix r : Same analysis method, and results, apply