Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis
Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein. Department of Computer Science, Carnegie Mellon University.
1
Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis
Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein
Department of Computer Science
Carnegie Mellon University
2
QDI: Orphans problem
• Early propagation:
– “A” arrives early => Z transitions
– Stale values on the other signals
• Incorrect behavior: inputs acknowledged before being received
[Figure: dual-rail circuit with rails A0/A1, B0/B1, C0/C1, D0/D1, X0/X1, Y0/Y1, Z0/Z1]
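The early-propagation effect on this slide can be illustrated with a small sketch. It assumes a common dual-rail AND template (Z1 = A1 AND B1, Z0 = A0 OR B0) and a (rail0, rail1) encoding with (0, 0) as the spacer. Both the template and the helper name `dual_rail_and` are illustrative assumptions, not taken from the slides.

```python
# Assumed encoding: (rail0, rail1); (0, 0) is the spacer (no data yet),
# (1, 0) encodes logic 0 and (0, 1) encodes logic 1.
SPACER = (0, 0)
ZERO = (1, 0)
ONE = (0, 1)


def dual_rail_and(a, b):
    """Hypothetical dual-rail AND built from ordinary gates."""
    a0, a1 = a
    b0, b1 = b
    z0 = a0 | b0  # the output is 0 as soon as either input is 0
    z1 = a1 & b1  # the output is 1 only when both inputs are 1
    return (z0, z1)


# A = 0 has arrived while B is still at the spacer: Z already becomes valid.
early = dual_rail_and(ZERO, SPACER)
print(early)  # (1, 0): a valid logic 0 produced without waiting for B

# A = 1 alone cannot decide the result, so the gate keeps waiting for B.
print(dual_rail_and(ONE, SPACER))  # (0, 0)
```

In the first call the output is acknowledged downstream even though B has not yet arrived, which is the situation the slide calls incorrect behavior.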
3
NCL-X solution
Add completion detection
[Figure: the dual-rail circuit (rails A0/A1 through Z0/Z1, gates N1, N2, N3) with added completion-detection signals DoneA and DoneC]
4
QDI Gate Delays
QDI implementations always assume the worst: equal probability for any gate delay
5
Motivation
• Quasi-Delay Insensitive (QDI) circuits:
– One timing constraint
– Naturally tolerate parametric variation, but…
• Have large area overheads
– Added completion detection for correctness
[Chart: area breakdown for add_bk_32, lsr16, C880; y-axis 0% to 100%; legend: gates, cd]
6
Parametric Variation and Gate Delays
Goal: pay only what is necessary
ITRS’05: 35% parametric variation by 2020
7
Goal: Optimizing Sync→Async Flow
• Use timing information to reduce size of completion detection
• Use mixed gates to further reduce area
– w/ early propagation
– w/o early propagation
[Chart: area breakdown for NCL-X and Direct; y-axis 0% to 100%; legend: gates, cd]
[Chart: area breakdown for NCL-X, Direct, Exact; y-axis 0% to 100%; legend: gates, strict, cd; labels: regular gates, strict gates]
8
Contributions
Three new relative-timing area optimizations:
• Direct method:
– Timing analysis + simple CD elimination
• Greedy method: fast but not optimal
– Uses strict gates, but may increase area
• Exact method: optimal, but slow
– Solves an mILP problem
[Chart: area ratio, y-axis 0.00 to 1.00: Direct 0.83, Greedy 0.55, Exact 0.43]
9
Outline
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
10
Basics
• QDI circuits:
– Unbounded but finite delays on gates and wires
– One timing assumption: isochronic fork
• Timed circuits:
– Delays on gates and wires: bounded time intervals
– Given input arrival times: compute propagation intervals for each gate and wire
11
Timing Computation
• Conservative assumption: any input change can trigger an output change
[Figure: example circuit (inputs A, B, C, D; signals X, Y, Z; gates N1, N2, N3) annotated with delay intervals and propagation intervals; all primary inputs arrive at (0,0); the interval labeled GlobalPI reads (2.0,5.6)]
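The conservative rule on this slide can be sketched in a few lines. This is a minimal sketch, assuming intervals are (earliest, latest) pairs and that wire delays are folded into the gate delay. The function name `regular_gate_interval` and the second set of numbers are illustrative, not taken from the slides.

```python
def regular_gate_interval(input_intervals, gate_delay):
    """Propagation interval of a regular gate.

    Conservative assumption: any single input change may switch the output,
    so the earliest output follows the earliest input and the latest output
    follows the latest input.
    """
    d_min, d_max = gate_delay
    earliest = min(lo for lo, _ in input_intervals) + d_min
    latest = max(hi for _, hi in input_intervals) + d_max
    return (earliest, latest)


# Two primary inputs arriving at (0, 0) into a gate with delay (1.5, 1.9).
first = regular_gate_interval([(0.0, 0.0), (0.0, 0.0)], (1.5, 1.9))
print(first)

# Hypothetical second-level gate fed by an early and a late signal.
second = regular_gate_interval([(1.5, 1.9), (3.0, 4.0)], (0.5, 0.7))
print(second)  # roughly (2.0, 4.7): a wide interval caused by the early input
```

The wide second interval is the cost of early propagation: the output may fire long before the slowest input has settled.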
12
Direct Optimization Method
• A gate needs completion detection iff it may not be stable when the outputs are produced
[Figure: the example circuit annotated with propagation intervals, including (1.5,1.9) and (2.0,5.6), and a completion-detection signal Done]
Under any input change, gate quiescent when output produced
1.9 < 2.0
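The comparison on this slide (1.9 < 2.0) can be written as a one-line test. This is a sketch, assuming each interval is an (earliest, latest) pair and that GlobalPI is the interval in which the outputs are produced. The function name `needs_completion_detection` is hypothetical.

```python
def needs_completion_detection(gate_interval, global_pi):
    """Direct method test: a gate can skip completion detection only if it is
    guaranteed to be quiescent before the earliest possible output."""
    _, gate_latest = gate_interval
    output_earliest, _ = global_pi
    return not (gate_latest < output_earliest)


GLOBAL_PI = (2.0, 5.6)

# Settles by 1.9, before the outputs can appear at 2.0: no detection needed.
print(needs_completion_detection((1.5, 1.9), GLOBAL_PI))  # False

# May still be switching at 4.0, after outputs can appear: detection needed.
print(needs_completion_detection((3.0, 4.0), GLOBAL_PI))  # True
```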
13
Strict Gates
• All inputs must arrive before producing an output
• Eliminate early propagation effect
- Extremely expensive
+ Decrease length of propagation interval
[Figure: strict gate implementation; labels A, B, C]
14
Timing Computation with Strict Gates
• Entire completion detection: single OR gate
[Figure: the example circuit with strict gates, annotated with propagation intervals; the intervals shown as (2.0,5.6) with regular gates read (5.0,6.8) here, and Done is annotated (1.4,1.9)]
• This circuit: area not reduced
• Goal: smart insertion of strict gates
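The effect of a strict gate on its propagation interval can be sketched as follows. The model is an assumption: because a strict gate waits for all of its inputs, its earliest output is taken to follow the latest of the inputs' earliest arrival times. The function names and numbers are illustrative.

```python
def strict_gate_interval(input_intervals, gate_delay):
    """Assumed model of a strict gate: no output until every input has arrived."""
    d_min, d_max = gate_delay
    earliest = max(lo for lo, _ in input_intervals) + d_min
    latest = max(hi for _, hi in input_intervals) + d_max
    return (earliest, latest)


def regular_gate_interval(input_intervals, gate_delay):
    """Regular gate for comparison: any single input change may switch the output."""
    d_min, d_max = gate_delay
    earliest = min(lo for lo, _ in input_intervals) + d_min
    latest = max(hi for _, hi in input_intervals) + d_max
    return (earliest, latest)


inputs = [(1.5, 1.9), (3.0, 4.0)]  # hypothetical early and late input
delay = (0.5, 0.7)                 # hypothetical gate delay

regular = regular_gate_interval(inputs, delay)
strict = strict_gate_interval(inputs, delay)
print(regular, strict)

# The strict interval starts later and is narrower; its upper bound is unchanged.
print(strict[1] - strict[0] < regular[1] - regular[0])  # True
```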
15
Outline
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
16
Greedy Optimization (1)
• Strict gates: area implications
– GlobalPI may be narrower and delayed
– Fewer gates non-quiescent
– Smaller completion detection
• Greedy optimization framework:
– Flip gates in the circuit from normal to strict
– Select most promising candidate
– Continue until no improvements possible
17
Greedy Optimization (2)
Algorithm:
1. For each gate G_i in the circuit:
a. Flip G_i from regular to strict
b. Perform timing analysis, compute GlobalPI_i
c. Flip G_i back to regular
2. Select G_k with the narrowest GlobalPI_k
3. If GlobalPI_k is narrower than the previous best:
a. Flip G_k to strict permanently
b. Continue (goto 1)
Else: finish
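The loop above can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not in the slides: the circuit is a dict listed in topological order, wire delays are ignored, primary inputs arrive at (0, 0), GlobalPI spans all primary outputs, and a strict gate waits for its latest input. All names and delay values are hypothetical.

```python
def analyze(circuit, outputs, strict):
    """Timing analysis: return GlobalPI as (earliest, latest) over the outputs."""
    interval = {}
    for name, (fanin, (d_min, d_max)) in circuit.items():
        # Names not computed yet are treated as primary inputs arriving at (0, 0).
        ins = [interval.get(src, (0.0, 0.0)) for src in fanin]
        pick = max if name in strict else min  # strict gates wait for all inputs
        interval[name] = (pick(lo for lo, _ in ins) + d_min,
                          max(hi for _, hi in ins) + d_max)
    return (min(interval[o][0] for o in outputs),
            max(interval[o][1] for o in outputs))


def width(pi):
    return pi[1] - pi[0]


def greedy(circuit, outputs):
    """Flip one gate to strict per round, while GlobalPI keeps getting narrower."""
    strict = set()
    best = width(analyze(circuit, outputs, strict))
    while True:
        candidates = [g for g in circuit if g not in strict]
        if not candidates:
            return strict
        # Step 1: try each gate as strict in turn (the flip is only temporary).
        trial = {g: width(analyze(circuit, outputs, strict | {g}))
                 for g in candidates}
        # Step 2: pick the gate that yields the narrowest GlobalPI.
        k = min(trial, key=trial.get)
        # Step 3: keep the flip only if it improves on the previous best.
        if trial[k] < best:
            strict.add(k)
            best = trial[k]
        else:
            return strict


# Hypothetical three-gate circuit: gate -> (fan-in names, (min delay, max delay)).
circuit = {
    "N1": (["A", "B"], (1.5, 1.9)),
    "N2": (["C", "D"], (1.0, 1.2)),
    "N3": (["N1", "N2"], (0.5, 0.7)),
}
chosen = greedy(circuit, outputs=["N3"])
print(chosen)  # {'N3'}
```

As on the slide, the sketch optimizes the width of GlobalPI rather than area, so a real flow would still have to compute the resulting area afterwards.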
18
Greedy Optimization (3)
• Algorithm does not optimize for area directly
• Instead: may reduce the completion detection by narrowing the output interval
• Results are promising, but some individual benchmarks may end up with a larger area
19
Outline
• Timing analysis & Direct Method
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
20
Exact Optimization Method
• Mixed Integer Linear Programming (mILP)
• Transform circuit graph into an optimization problem:
– Introduce variables for each gate, wire and primary input/output
– Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files
– Decision variables (GS): should a gate be strict?
21
mILP formulation
• Minimize: $TotalArea = GateArea + CDArea$
• $GateArea = \sum_i \left( GS_i \cdot SArea_i + (1 - GS_i) \cdot NArea_i \right)$
• $CDArea = SCD \cdot Or2Area + (SCD - 1) \cdot CArea$
– SCD: # gates that need completion detection
• NeedsCD: does a gate need CD?
– $NeedsCD = 0$ if $PI_M < GlobalPI_m$ or successor is strict; otherwise 1
• Rest of the model implements timing computation
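The objective can be evaluated for one fixed assignment of strict gates with a short sketch. This only evaluates the area formula; it does not build or solve the mILP. The dictionary keys, the function name `total_area`, all numeric values, and the choice to return zero completion-detection area when SCD = 0 are assumptions made for illustration.

```python
def total_area(gates, scd, or2_area, c_area):
    """Evaluate TotalArea = GateArea + CDArea for a fixed strict/regular choice.

    gates: list of dicts with keys 'strict' (GS), 's_area' (SArea), 'n_area' (NArea)
    scd:   number of gates that need completion detection (SCD)
    """
    gate_area = sum(g["s_area"] if g["strict"] else g["n_area"] for g in gates)
    # Assumption: with no gate needing detection, the CD network costs nothing.
    cd_area = scd * or2_area + (scd - 1) * c_area if scd > 0 else 0.0
    return gate_area + cd_area


# Hypothetical library areas: one strict gate, two regular gates.
gates = [
    {"strict": True, "s_area": 6.0, "n_area": 2.0},
    {"strict": False, "s_area": 6.0, "n_area": 2.0},
    {"strict": False, "s_area": 6.0, "n_area": 2.0},
]
area = total_area(gates, scd=2, or2_area=1.0, c_area=3.0)
print(area)  # 15.0 = (6 + 2 + 2) + (2 * 1 + 1 * 3)
```

The trade-off the solver explores is visible here: making a gate strict raises GateArea but can lower SCD, and with it CDArea.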
22
Improving the mILP Model
• Basic mILP model: too slow even for small circuits (hours for a dozen gates)
• Leverage problem knowledge into model improvements:
– Branching order: gates closer to the output are more likely to become strict => inspected first
– Single input gates: never strict
– Provide initial solution (result of greedy opt)
• Can solve problems with hundreds of gates in minutes
23
Related Work: Optimizations
• Cortadella et al:
– logical function decompositions
– can achieve substantial area savings
– can be the starting point for our methods
• Zhou et al: consider strict gates in optimization, but no timing information
• Sokolov et al: two timing optimizations
– Alternate levels: unrealistic assumptions for gate delays
– Longest path: applicable only for small circuits
24
Experimental Setup
• Tool flow:
– Synthesis & tech-mapping with Synopsys Design Compiler
– Perl scripts for dual-rail implementations
– Optimization tool reads structural Verilog and timing back-annotations
– End result: optimized circuits (Verilog)
• Experiments:
– Arithmetic and ISCAS’89 benchmarks
– Pre-layout runs in 0.18 µm technology
25
Area: Ratio vs. NCL-X method
[Chart: area ratio vs. NCL-X per benchmark; y-axis 0 to 1.2; series: direct, greedy, ilp]
• Greedy: 2.83x NCL-X area for le32
• mILP does not finish in less than 1 hour: partial results
• Direct: 0.83x
• Greedy: 0.55x
• mILP: 0.43x
26
Area breakdown
[Chart: area breakdown for add_bk_32, Lsr16, C880; y-axis 0 to 1; bars: Direct, Greedy, ILP, NCLX; legend: regular, strict, cd]
• 8/168 strict
• 4.7% before → 40% after
• Less than half the area of NCL-X
27
Parametric Variation: BK adder
[Chart: ratio vs. NCL-X area (y-axis, 0 to 1) against parametric variation (x-axis, 0% to 35%); series: Direct, Greedy, Exact]
28
Conclusions
• Paper introduced:
– A method to translate synchronous circuits into optimized asynchronous circuits
– Three new relative-timing optimizations for improving area
• Direct: extremely simple
• Greedy: fast, good results
• Exact: optimal, may be extremely slow
– Analyzed the impact of parametric variation on these circuits
29
Backup slides
30
Outline
• Background
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
31
Introduction
• Future deep sub-micron technologies:
– Large parametric variations (ITRS’05 predicts 35% by 2020)
– Asynchronous design a natural fit
– Asynchronous handshaking: widespread
• Acceptance for asynchronous circuits is predicated on quality CAD tools:
– “Pure” async: from scratch
– Sync to async translation
32
Synchronous to Asynchronous Translation
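Synchronous circuit: Z = (A·B)·(C+D)
[Figure: synchronous circuit with inputs A, B, C, D, signals X, Y, Z, and gates N1, N2, N3]
Template-based replacement of each sync gate
Dual-rail circuit
[Figure: dual-rail circuit with rails A0/A1, B0/B1, C0/C1, D0/D1, X0/X1, Y0/Y1, Z0/Z1 and gates N1, N2, N3]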
33
Related Work
• Numerous approaches for translating synchronous circuits into asynchronous
• Dealing with the orphans problem:
– Kondratiev et al: NCL-X (discussed below)
– Brej: anti-tokens
• Allows for early propagation
• Completion detection in background
• Even larger area overheads
34
ILP optimization for 32-bit BK adder
[Chart: % error (y-axis, 0% to 65%) against time in seconds (x-axis); series: % Crt Sol, % Best Estimation]
• CrtSol: current best integer solution
• Best Estimation: best guess of how far the optimum is; when 0, the optimum has been found
35
Outline
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
36
Area breakdown
[Chart: area breakdown for add_bk_32, Lsr16, C880; y-axis 0 to 1; bars: Direct, Greedy, ILP, NCLX; legend: regular, strict, cd]
• 8/168 strict
• 4.7% before → 40% after
• Less than half the area of NCL-X
37
mILP Run Time
Bench     #Inps  #Outs  #Gates  #Vars  #Constr  Runtime
Eq32         64      1      37    731     1158  0.23s
Decode32      5     32      49   1239     2068  12.2s
C432         36      7      80   2391     4223  27m46s
Lsl16        32     16      81   1819     3534  10m24s
Lsr16        32     16      81   2315     4080  19m15s
Absval32     32     32      92   2420     4149  6m7s
C880         60     26     168   4385     7724  39m25s
C1908        33     25     190   3263     5300  20m23s
Bk32         64     32     285   4923     8293  78s
Clf32        64     32     309   5195     8737  71s