Post on 11-Jan-2016
Datapath Designs
CK Cheng
CSE Department
UC, San Diego
Prefix Adder – Well-known and Well-developed?
• Classic prefix networks: Sklansky, Kogge-Stone, Brent-Kung, Ladner-Fischer, Han-Carlson, Knowles etc.
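As a concrete illustration of one of the classic networks named above, here is a minimal bit-level sketch of a Kogge-Stone adder. The function name `kogge_stone_add` and the `width` parameter are illustrative choices, not taken from the slides. The model only captures the logic of the prefix operator; it says nothing about the fanout, wiring, or timing costs.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers modulo 2**width with a Kogge-Stone prefix network."""
    mask = (1 << width) - 1
    g = a & b & mask            # per-bit generate
    p = (a ^ b) & mask          # per-bit propagate
    p0 = p                      # original propagate bits, needed for the sum
    dist = 1
    while dist < width:         # log2(width) prefix levels
        # Prefix operator: (g, p) o (g', p') = (g | (p & g'), p & p')
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1
    carries = (g << 1) & mask   # carry into bit i = group generate of bits i-1..0
    return (p0 ^ carries) & mask


print(kogge_stone_add(100, 57))   # same result as (100 + 57) % 256
```

Each pass of the loop corresponds to one logic level of the network, so an 8-bit adder uses three levels.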
Prefix Adder – New Aspects, New Method
• Realistic design considerations: Timing, Power and Area.
• Integer Linear Programming for prefix adder:
  – Logical effort timing model (gate cap. + wire cap.)
  – Activity-statistic power model
  – Non-uniform signal arrival/required times
[Figure: structural measures (logic levels, max fanouts, max wire tracks) alongside design objectives (timing, power, area)]
Prefix Adder – Optimum Prefix adders
• Uniform signal arrival/required times
[Figure panels: Sklansky adder; Kogge-Stone adder; fastest depth-4 optimal prefix adder; fastest depth-3 optimal prefix adder]
Prefix Adder – Optimum Prefix adders
• Uniform signal arrival/required times
[Figure: timing versus power plot. Legend: Depth = 3, Depth = 4, Depth = 5, Brent-Kung, Kogge-Stone, Sklansky]
Prefix Adder – Optimum Prefix adders
• Non-uniform signal arrival/required times
[Figure panels: increasing signal arrival times; decreasing signal arrival times; convex signal arrival times]
Division – Iteration effort
• Pencil-and-paper method ($A = QB + 2^{-n}R$ and $R < B$): 1-bit partial quotient per iteration, n iterations
• Example: A = 0.1001, B = 0.1010; Q = A / B = 0.1101
• $Q_i$: partial quotient; $R_i$: partial remainder; $R_{i+1} = R_i - B Q_i$
• Partial quotients: Q1 = 0.1, Q2 = 0.01, Q3 = 0.000, Q4 = 0.0001
[Figure: long-division layout of the example, showing the partial remainders R0 = A through R4]
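The one-bit-per-iteration scheme can be sketched as a shift-and-subtract loop. This is a minimal sketch, assuming the operands are n-bit fractions held as integers (A = a / 2^n, B = b / 2^n) with A < B; the function name `pencil_and_paper_divide` is a hypothetical label, not something from the slides.

```python
def pencil_and_paper_divide(a: int, b: int, n: int):
    """Divide A = a / 2**n by B = b / 2**n, producing one quotient bit per iteration.

    Returns integers (q, r) with Q = q / 2**n and R = r / 2**n such that
    A = Q*B + 2**-n * R and R < B.
    """
    if not 0 <= a < b:
        raise ValueError("this sketch assumes 0 <= A < B so that Q < 1")
    q, r = 0, a
    for _ in range(n):
        r <<= 1            # shift the partial remainder left by one bit
        q <<= 1
        if r >= b:         # quotient bit is 1: subtract the divisor
            r -= b
            q |= 1
    return q, r


q, r = pencil_and_paper_divide(0b1001, 0b1010, 4)
print(f"Q = 0.{q:04b}, R = 0.{r:04b}")
```

For the operands as transcribed (A = 0.1001, B = 0.1010), this sketch yields Q = 0.1110 and R = 0.0100, which satisfies A = QB + 2^-4 R with R < B. The value Q = 0.1101 in the transcript does not satisfy that relation for these operands, so one of the transcribed values may be off.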
Division – Memory effort
• Lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
• SRT method: a lookup table stores m-bit partial quotients, selected by m bits of the partial remainder and m bits of the divisor.
  Table size: $2^{2m} \times m$ bits
• The SRT method is limited by the memory wall.
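To see why table size becomes the limiting factor, the size formula can be evaluated for a few values of m. This is a small sketch; the helper name `srt_table_bits` and the chosen values of m are illustrative assumptions.

```python
def srt_table_bits(m: int) -> int:
    """Bits in a table addressed by 2m bits (m remainder + m divisor) with m-bit entries."""
    return (1 << (2 * m)) * m


for m in (2, 4, 8, 12, 16):
    print(f"m = {m:2d}: {srt_table_bits(m):,} bits")
```

Each extra quotient bit per iteration multiplies the number of table entries by four, so the table grows far faster than the number of iterations shrinks.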
Division – Arithmetic effort
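• Partial quotient is calculated by arithmetic functions.
• Prescaling: $\frac{A}{B} = \frac{EA}{EB}$ with $BE \approx 1$, so $Q_i = \frac{R_i}{B} \approx R_i E$
• Taylor expansion: $E = \frac{1}{B} \approx \frac{1}{B_h} - \frac{B_l}{B_h^2} + \frac{B_l^2}{B_h^3}$, $Q_i = R_i E$
• Series expansion: $B = 1 + X$, $E = \frac{1}{B} = 1 - X + X^2 - X^3 + \cdots = (1 - X)(1 + X^2)(1 + X^4)\cdots$, $Q_i = R_i E$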
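The two reciprocal approximations can be checked numerically with exact rational arithmetic. The sketch below assumes the usual reading of these methods: the divisor is split as B = B_h + B_l (high and low parts) for the Taylor form, and written as B = 1 + X for the series form. The function names and the sample divisor are illustrative, not from the slides.

```python
from fractions import Fraction


def taylor_reciprocal(b_h: Fraction, b_l: Fraction) -> Fraction:
    """Second-order Taylor estimate of 1 / (b_h + b_l), expanded around b_h."""
    return 1 / b_h - b_l / b_h**2 + b_l**2 / b_h**3


def series_reciprocal(b: Fraction, factors: int) -> Fraction:
    """Estimate 1 / b as (1 - X)(1 + X^2)(1 + X^4)... with X = b - 1."""
    x = b - 1
    e = 1 - x
    power = x * x
    for _ in range(factors - 1):
        e *= 1 + power      # multiply in the next factor (1 + X^(2^k))
        power *= power      # square to get the next power of X
    return e


b = Fraction(0b11110001, 256)                            # 0.1111,0001, close to 1
b_h, b_l = Fraction(0b1111, 16), Fraction(0b0001, 256)   # high and low parts of b

print(float(b * taylor_reciprocal(b_h, b_l)))   # how close B*E is to 1 (Taylor)
print(float(b * series_reciprocal(b, 3)))       # how close B*E is to 1 (series)
```

With k factors the series form gives B·E = 1 − X^(2^k) exactly, so the error shrinks quadratically with each extra factor, provided X is small. That is why a divisor close to 1 makes a good starting point.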
Division – Solution space
• Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.
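[Figure: division solution space along three axes: iteration effort (pencil-and-paper), memory effort (SRT, bounded by the memory wall), and arithmetic effort (prescaling, Taylor expansion, series expansion). Regions are labeled "Low area" and "Low latency"; "Our target" is marked.]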
Division – PST algorithm
• Use the power of series expansion, which needs a good starting point: $B = 1 + X$, $E = \frac{1}{B} = \frac{1}{1 + X} = (1 - X)(1 + X^2)\cdots$
• Prescaling provides a scaled divisor close to 1: $\frac{A}{B} = \frac{EA}{EB}$, $BE \approx 1$
• 0-order Taylor expansion iterates to reach the final quotient: $E \approx 1 - X = 2 - B$, $Q_i = R_i E$
Division – PST algorithm
E0 = Table(B(m)) ≈ 1/B
A1 = A × E0; B1 = B × E0
E1 = (2 − B1) ≈ INV(B1(2m))
Qi = Ri-1 × E1 (with R0 = A1)
Ri = Ri-1 − Qi × B1
Q = Q + Qi
A = 0.1011,0110; B = 0.1100,1011
B(m) = 0.1100; E0 = 1.0011
A1 = A × E0 = 0.1101,1000,0010; B1 = B × E0 = 0.1111,0001,0001
E1 = INV(B1(2m)) = 1.0000,1110
Q1 = A1 × E1 = 0.1110,0011; R1 = A1 − Q1 × B1 = 0.0000,0010,0101,1110,1101
Q2 = R1 × E1 = 0.1001,1111; R2 = R1 − Q2 × B1 = 0.0000,0001,1111,1011,0001 (Q2 and R2 are shown scaled by 2^6; Q2 enters the sum as 0.0000,0010,0111,11)
Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11
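The worked example can be replayed with exact fractions. This is a sketch under several assumptions that the slides do not state explicitly. The table entry E0 is taken as 1 / (B(m) + 2^-m) truncated to m fractional bits, which reproduces E0 = 1.0011. INV is read as a bitwise inversion of B1 truncated to 2m bits, which equals 2 − B1(2m) − 2^-2m. The first partial quotient is truncated to 2m bits, and each later one is aligned 6 bits further right, as in the final sum. The names `trunc`, `pst_divide` and `shift` are hypothetical.

```python
from fractions import Fraction
from math import floor


def trunc(x: Fraction, bits: int) -> Fraction:
    """Truncate x to the given number of fractional bits."""
    return Fraction(floor(x * 2**bits), 2**bits)


def pst_divide(a: Fraction, b: Fraction, m: int = 4, shift: int = 6, iterations: int = 2):
    """Prescale, then iterate Qi = R(i-1) * E1 and Ri = R(i-1) - Qi * B1."""
    b_m = trunc(b, m)                                    # top m bits of the divisor
    e0 = trunc(1 / (b_m + Fraction(1, 2**m)), m)         # assumed table rule
    a1, b1 = a * e0, b * e0                              # prescaled operands, B1 close to 1
    e1 = 2 - trunc(b1, 2 * m) - Fraction(1, 2**(2 * m))  # bitwise INV of B1(2m)
    q, r = Fraction(0), a1
    for i in range(iterations):
        q_i = trunc(r * e1, 2 * m + i * shift)           # partial quotient
        r -= q_i * b1                                    # partial remainder update
        q += q_i
    return q, r, (e0, a1, b1, e1)


a = Fraction(0b10110110, 256)   # A = 0.1011,0110
b = Fraction(0b11001011, 256)   # B = 0.1100,1011
q, r, (e0, a1, b1, e1) = pst_divide(a, b)
print(f"Q = 0.{int(q * 2**14):014b}")
print(float(q), float(a / b))
```

Under these assumptions, the intermediate values E0, A1, B1, E1, Q1, R1 and Q2 match the figures listed in the example, and the accumulated quotient is a slight underestimate of A / B.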
Division – FPGA Implementation
• PST algorithm is suitable for high-performance division unit design in FPGAs

Design              | Fmax (Period)         | ALUTs | Memory Bits | DSP Blocks | Power (Dynamic + Static) | Throughput
IP Core (no DSP)    | 50.16 MHz (19.935 ns) | 1203  | 84          | 0          | 381 mW (52 mW + 329 mW)  | 50.16 Mdiv/s
PST (DSP)           | 72.8 MHz (13.737 ns)  | 213   | 768         | 28         | 350 mW (23 mW + 327 mW)  | 24.3 Mdiv/s
PST (no DSP)        | 73.20 MHz (13.661 ns) | 1437  | 768         | 0          | 378 mW (50 mW + 328 mW)  | 24.4 Mdiv/s
PST-pipelined (DSP) | 74.15 MHz (13.486 ns) | 261   | 768         | 40         | 344 mW (17 mW + 327 mW)  | 74.15 Mdiv/s
PSTp (no DSP)       | 76.05 MHz (13.150 ns) | 1940  | 768         | 0          | 359 mW (31 mW + 328 mW)  | 76.05 Mdiv/s
32-bit division with 5-cycle latency