area optimizations for dual-rail circuits using relative-timing analysis

1

Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis

Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein

Department of Computer Science, Carnegie Mellon University

2

QDI: Orphans problem

• Early propagation:
– “A” arrives early => Z transitions
– Stale values on the other signals

• Incorrect behavior: inputs acknowledged before being received

[Figure: dual-rail circuit with signal rails A0/A1, B0/B1, C0/C1, D0/D1, X0/X1, Y0/Y1, and Z0/Z1]

3

NCL-X solution

• Add completion detection

[Figure: the dual-rail circuit (gates N1, N2, N3; rails A0/A1 through Z0/Z1) with added completion detection; labels DoneA and DoneC]

4

QDI Gate Delays

QDI implementations always assume the worst: equal probability for any gate delay

5

Motivation

• Quasi-Delay Insensitive (QDI) circuits:
– One timing constraint
– Naturally tolerate parametric variation, but…

• Have large area overheads
– Added completion detection for correctness

[Figure: area breakdown (0% to 100%) into gates and completion detection (cd) for add_bk_32, lsr16, and C880]

6

Parametric Variation and Gate Delays

Goal: pay only what is necessary

ITRS’05: 35% parametric variation by 2020

7

Goal: Optimizing Sync→Async Flow

• Use timing information to reduce size of completion detection

• Use mixed gates to further reduce area
– w/ early propagation
– w/o early propagation

[Figure: two stacked-bar charts of area (0% to 100%). Left: NCL-X vs. Direct, split into gates and cd. Right: NCL-X vs. Direct vs. Exact, split into gates (regular gates), strict (strict gates), and cd]

8

Contributions

Three new relative-timing area optimizations:

• Direct method:
– Timing analysis + simple CD elimination

• Greedy method: fast but not optimal
– Uses strict gates, but may increase area

• Exact method: optimal, but slow
– Solves an mILP problem

[Figure: bar chart of area ratio vs. NCL-X (axis 0.00 to 1.00): Direct 0.83, Greedy 0.55, Exact 0.43]

9

Outline

• Timing analysis & Direct Optimization

• Greedy optimization method

• Exact optimization method

• Results

• Conclusions

10

Basics

• QDI circuits:

– Unbounded but finite delays on gates and wires

– One timing assumption: isochronic fork

• Timed circuits:

– Delays on gates and wires: bounded time intervals

– Given input arrival times: compute propagation intervals for each gate and wire

11

Timing Computation

• Conservative assumption: any input change can trigger an output change

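[Figure: example circuit with inputs A, B, C, D, gates N1, N2, N3, internal signals X, Y, and output Z. All primary inputs arrive at (0,0). Annotated intervals include (0.5,0.7), (0.6,0.8), (1.0,1.2), (1.1,1.2), (1.5,1.9), (3.0,4.0), (3.5,4.1), and (3.6,4.9); the output propagation interval is GlobalPI = (2.0,5.6)]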
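
A minimal sketch of the interval computation for one regular gate, under the conservative assumption stated above. The function name `regular_gate_interval` is hypothetical, and pairing the intervals (1.5,1.9) and (3.6,4.9) with a gate delay of (0.5,0.7) is an assumption chosen because it reproduces the GlobalPI of (2.0,5.6) shown on the slide.

```python
def regular_gate_interval(input_intervals, gate_delay):
    """Propagation interval at the output of a regular gate (sketch).

    Conservative assumption: any single input change may trigger the
    output, so the earliest output follows the earliest input and the
    latest output follows the latest input.
    """
    d_min, d_max = gate_delay
    earliest = min(lo for lo, _ in input_intervals) + d_min
    latest = max(hi for _, hi in input_intervals) + d_max
    return (earliest, latest)


# Assumed pairing of slide annotations: two gate inputs and one gate delay.
out = regular_gate_interval([(1.5, 1.9), (3.6, 4.9)], (0.5, 0.7))
print(tuple(round(x, 2) for x in out))  # (2.0, 5.6)
```

Applying the same rule gate by gate in topological order, starting from the primary-input arrival times, gives an interval for every gate and finally for the output.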

12

Direct Optimization Method

• A gate needs completion detection iff it may not be stable when the outputs are produced

[Figure: the example circuit (inputs A, B, C, D; gates N1, N2, N3; output Z) with its propagation intervals, output interval (2.0,5.6), and a completion-detection signal Done]

Under any input change, gate quiescent when output produced

1.9 < 2.0
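
The comparison 1.9 < 2.0 can be written as a one-line test per gate. This is a sketch of the rule as stated on the slide, not the authors' tool; the function name and the gate labels `N_early` and `N_late` are hypothetical.

```python
def needs_completion_detection(gate_interval, global_pi):
    """True if the gate may still be switching when the output appears.

    A gate is safe (no CD needed) when its latest possible transition
    is strictly earlier than the earliest possible output transition.
    """
    _, gate_latest = gate_interval
    output_earliest, _ = global_pi
    return not (gate_latest < output_earliest)


global_pi = (2.0, 5.6)
gates = {"N_early": (1.5, 1.9), "N_late": (3.6, 4.9)}  # hypothetical labels
cd_gates = [name for name, interval in gates.items()
            if needs_completion_detection(interval, global_pi)]
print(cd_gates)  # ['N_late']
```

Only the gate whose interval can overlap the output interval keeps its completion detection; the other one is quiescent by the time the output is produced.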

13

Strict Gates

• All inputs must arrive before producing an output

• Eliminate early propagation effect

- Extremely expensive
+ Decrease length of propagation interval

[Figure: strict gate implementation; labels A, B, and C]

14

Timing Computation with Strict Gates

• Entire completion detection: single OR gate

[Figure: the example circuit with the output gate made strict. The strict gate has delay (1.4,1.9); the output propagation interval becomes (5.0,6.8); completion signal Done]

• This circuit: area not reduced

• Goal: smart insertion of strict gates
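
To make the effect of a strict gate on the output interval concrete, the sketch below compares the two gate types on the same inputs. It assumes that a strict gate waits for its latest input (so the lower bound uses the maximum of the input lower bounds) and that the strict delay is the (1.4,1.9) annotation from the slide; both function names are hypothetical.

```python
def regular_gate_interval(input_intervals, gate_delay):
    """Regular gate: the earliest input may already trigger the output."""
    return (min(lo for lo, _ in input_intervals) + gate_delay[0],
            max(hi for _, hi in input_intervals) + gate_delay[1])


def strict_gate_interval(input_intervals, gate_delay):
    """Strict gate: all inputs must arrive before the output changes."""
    return (max(lo for lo, _ in input_intervals) + gate_delay[0],
            max(hi for _, hi in input_intervals) + gate_delay[1])


inputs = [(1.5, 1.9), (3.6, 4.9)]
regular = regular_gate_interval(inputs, (0.5, 0.7))
strict = strict_gate_interval(inputs, (1.4, 1.9))

print(tuple(round(x, 2) for x in regular))  # (2.0, 5.6)
print(tuple(round(x, 2) for x in strict))   # (5.0, 6.8)
print(round(regular[1] - regular[0], 2))    # 3.6
print(round(strict[1] - strict[0], 2))      # 1.8
```

Under these assumptions the strict version produces a later but narrower output interval, which is why fewer upstream gates can still be switching when the output appears.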

15

Outline

• Timing analysis & Direct Optimization

• Greedy optimization method

• Exact optimization method

• Results

• Conclusions

16

Greedy Optimization (1)

• Strict gates: area implications
– GlobalPI may be narrower and delayed
– Fewer gates non-quiescent
– Smaller completion detection

• Greedy optimization framework:
– Flip gates in the circuit from normal to strict
– Select most promising candidate
– Continue until no improvements possible

17


Greedy Optimization (2)

Algorithm:

1. For each regular gate G_i in the circuit:

a. Flip G_i from regular to strict

b. Perform timing analysis, compute GlobalPI_i

c. Flip G_i back to regular

2. Select G_k with the narrowest GlobalPI_k

3. If GlobalPI_k is narrower than the previous best:

a. Flip G_k to strict permanently

b. Continue (goto 1)

Else: finish
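
The steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the circuit representation, the names `analyze` and `greedy_strict_selection`, the single-output assumption, and all delay values for N1 and N2 are hypothetical; only the N3 delays reuse annotations from the slides.

```python
PRIMARY_INPUTS = ("A", "B", "C", "D")

# (name, fan-in, regular delay, strict delay), in topological order.
# Hypothetical circuit shaped like Z = (A.B).(C+D); the last gate drives Z.
CIRCUIT = [
    ("N1", ("A", "B"), (1.5, 1.9), (2.4, 3.1)),
    ("N2", ("C", "D"), (3.6, 4.9), (4.5, 6.1)),
    ("N3", ("N1", "N2"), (0.5, 0.7), (1.4, 1.9)),
]


def analyze(circuit, strict):
    """Timing analysis: return GlobalPI, the interval of the output gate."""
    arrival = {pi: (0.0, 0.0) for pi in PRIMARY_INPUTS}
    for name, fanin, reg_delay, strict_delay in circuit:
        los = [arrival[f][0] for f in fanin]
        his = [arrival[f][1] for f in fanin]
        if name in strict:  # strict gate waits for its latest input
            arrival[name] = (max(los) + strict_delay[0], max(his) + strict_delay[1])
        else:               # regular gate may fire on its earliest input
            arrival[name] = (min(los) + reg_delay[0], max(his) + reg_delay[1])
    return arrival[circuit[-1][0]]


def width(interval):
    return interval[1] - interval[0]


def greedy_strict_selection(circuit):
    strict = set()
    best = width(analyze(circuit, strict))
    while True:
        # Step 1: tentatively flip each regular gate and record GlobalPI width.
        trials = [(width(analyze(circuit, strict | {name})), name)
                  for name, *_ in circuit if name not in strict]
        if not trials:
            break
        # Step 2: pick the candidate with the narrowest GlobalPI.
        candidate_width, candidate = min(trials)
        # Step 3: keep it only if it improves on the previous best.
        if candidate_width < best:
            strict.add(candidate)
            best = candidate_width
        else:
            break
    return strict, best


strict_gates, best_width = greedy_strict_selection(CIRCUIT)
print(sorted(strict_gates), round(best_width, 2))  # ['N3'] 1.8
```

Each iteration runs one timing analysis per remaining regular gate, so the cost grows with the square of the gate count in the worst case. The loop compares interval widths only and never looks at area.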

18


Greedy Optimization (3)

• Algorithm does not optimize for area directly

• Instead: may reduce the completion detection by narrowing the output interval

• Results promising, but individual benchmarks may result in larger area

19

Outline

• Timing analysis & Direct Method

• Greedy optimization method

• Exact optimization method

• Results

• Conclusions

20

Exact Optimization Method

• Mixed Integer Linear Programming (mILP)

• Transform circuit graph into an optimization problem:

– Introduce variables for each gate, wire and primary input/output

– Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files

– Decision variables (GS): should a gate be strict?

21

mILP formulation

• Minimize: $TotalArea = GateArea + CDArea$

• $GateArea = \sum_i \left( GS_i \cdot SArea_i + (1 - GS_i) \cdot NArea_i \right)$

• $CDArea = SCD \cdot Or2Area + (SCD - 1) \cdot CArea$
– $SCD$: # gates that need completion detection

• $NeedsCD$: does a gate need CD?
– $NeedsCD = 0$ if $PI_M < GlobalPI_m$ or successor is strict; otherwise 1

• Rest of the model implements timing computation
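
As a rough illustration of the objective, the sketch below evaluates TotalArea for a fixed choice of strict gates instead of solving the mILP. It reads PI_M as the upper bound of a gate's interval and GlobalPI_m as the lower bound of the output interval, which is an interpretation consistent with the 1.9 < 2.0 check of the direct method. All area values, the gate names, the single `successor_is_strict` flag per gate, and the guard for SCD = 0 are assumptions.

```python
def needs_cd(gate_hi, global_lo, successor_is_strict):
    """NeedsCD = 0 if the gate settles before the output or feeds a strict gate."""
    return 0 if (gate_hi < global_lo or successor_is_strict) else 1


def total_area(gates, or2_area, c_area):
    """Evaluate TotalArea = GateArea + CDArea for one assignment of GS."""
    gate_area = sum(g["gs"] * g["sarea"] + (1 - g["gs"]) * g["narea"]
                    for g in gates.values())
    scd = sum(g["needs_cd"] for g in gates.values())
    # Assumed guard: with no gate needing CD, no CD hardware is counted.
    cd_area = scd * or2_area + (scd - 1) * c_area if scd > 0 else 0.0
    return gate_area + cd_area


def make_gate(gs, gate_hi, global_lo, successor_is_strict):
    # Hypothetical areas: regular gate 2.0, strict gate 6.0 (arbitrary units).
    return {"gs": gs, "narea": 2.0, "sarea": 6.0,
            "needs_cd": needs_cd(gate_hi, global_lo, successor_is_strict)}


# All gates regular, GlobalPI = (2.0, 5.6).
all_regular = {
    "N1": make_gate(0, 1.9, 2.0, False),
    "N2": make_gate(0, 4.9, 2.0, False),
    "N3": make_gate(0, 5.6, 2.0, False),
}
# Output gate N3 strict, GlobalPI = (5.0, 6.8).
n3_strict = {
    "N1": make_gate(0, 1.9, 5.0, True),
    "N2": make_gate(0, 4.9, 5.0, True),
    "N3": make_gate(1, 6.8, 5.0, False),
}

print(total_area(all_regular, or2_area=1.0, c_area=2.0))  # 10.0
print(total_area(n3_strict, or2_area=1.0, c_area=2.0))    # 11.0
```

With these made-up areas the strict assignment shrinks completion detection to a single OR gate but still costs more in total, which mirrors the earlier remark that strict gates must be inserted selectively. An mILP solver searches over the GS variables for the assignment that minimizes this quantity.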

22

Improving the mILP Model

• Basic mILP model: too slow even for small circuits (hours for a dozen gates)

• Leverage problem knowledge into model improvements:
– Branching order: gates closer to the output are more likely to become strict => inspected first
– Single input gates: never strict
– Provide initial solution (result of greedy opt)

• Can solve problems with hundreds of gates in minutes

23

Related Work: Optimizations

• Cortadella et al:
– logical function decompositions
– can achieve substantial area savings
– can be the starting point for our methods

• Zhou et al: consider strict gates in optimization, but no timing information

• Sokolov et al: two timing optimizations
– Alternate levels: unrealistic assumptions for gate delays
– Longest path: applicable only for small circuits

24

Experimental Setup

• Tool flow:
– Synthesis & tech-mapping with Synopsys Design Compiler
– Perl scripts for dual-rail implementations
– Optimization tool reads structural Verilog and timing back-annotations
– End result: optimized circuits (Verilog)

• Experiments:
– Arithmetic and ISCAS’89 benchmarks
– Pre-layout runs in 0.18µm technology

25

Area: Ratio vs. NCL-X method

[Figure: per-benchmark bar chart of area ratio vs. NCL-X (axis 0 to 1.2) for the direct, greedy, and ilp methods]

Greedy: 2.83x NCL-X area for le32

mILP does not finish in less than 1 hour: partial results

Direct: 0.83x, Greedy: 0.55x, mILP: 0.43x

26

Area breakdown

[Figure: stacked-bar area breakdown (axis 0 to 1) into regular gates, strict gates, and cd for add_bk_32, Lsr16, and C880, each with bars for Direct, Greedy, ILP, and NCLX]

8/168 strict

4.7% before → 40% after

More than 2x smaller than NCL-X

27

Parametric Variation: BK adder

[Figure: ratio vs. NCL-X area (axis 0 to 1) as a function of parametric variation (0% to 35%) for the Direct, Greedy, and Exact methods]

28

Conclusions

• Paper introduced:
– A method to translate synchronous circuits into optimized asynchronous circuits
– Three new relative-timing optimizations for improving area
  • Direct: extremely simple
  • Greedy: fast, good results
  • Exact: optimal, may be extremely slow
– An analysis of the impact of parametric variation on these circuits

29

Backup slides

30

Outline

• Background




• Results

• Conclusions

31

Introduction

• Future deep sub-micron technologies:
– Large parametric variations (ITRS’05 predicts 35% by 2020)
– Asynchronous design a natural fit
– Asynchronous handshaking: widespread

• Acceptance for asynchronous circuits is predicated on quality CAD tools:
– “Pure” async: from scratch
– Sync to async translation

32

Synchronous to Asynchronous Translation

[Figure: synchronous circuit Z = (A·B)·(C+D) with inputs A, B, C, D, gates N1, N2, N3, and internal signals X, Y]

Template-based replacement of each sync gate

[Figure: resulting dual-rail circuit with rails A0/A1 through Z0/Z1 and gates N1, N2, N3]

33

Related Work

• Numerous approaches for translating synchronous circuits into asynchronous

• Dealing with the orphans problem:

– Kondratiev et al: NCL-X (discussed below)

– Brej: anti-tokens

• Allows for early propagation

• Completion detection in background

• Even larger area overheads

34

ILP optimization for 32-bit BK adder

[Figure: % error (axis 0% to 65%) vs. Time (s), with series “% Crt Sol” and “% Best Estimation”]

CrtSol: current best integer solution

Best Estimation: best guess of how far the optimum is. When 0, optimum found

35

Outline

• Timing analysis & Direct Optimization



• Results

• Conclusions

37

mILP Run Time

Bench     #Inps  #Outs  #Gates  #Vars  #Constr  Runtime
Eq32      64     1      37      731    1158     0.23s
Decode32  5      32     49      1239   2068     12.2s
C432      36     7      80      2391   4223     27m46s
Lsl16     32     16     81      1819   3534     10m24s
Lsr16     32     16     81      2315   4080     19m15s
Absval32  32     32     92      2420   4149     6m7s
C880      60     26     168     4385   7724     39m25s
C1908     33     25     190     3263   5300     20m23s
Bk32      64     32     285     4923   8293     78s
Clf32     64     32     309     5195   8737     71s
