
NanoNet’07, Catania, 17/09/2007

Asynchronous Links, for NanoNets?

Alex Yakovlev

University of Newcastle, UK


Motivation-1

At very deep submicron (VDSM), gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits

Timing issue is a problem, particularly for global wires

[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1, M2), global interconnect with repeaters and global interconnect without repeaters. Source: ITRS, 2003]

Multiple clock domains are a reality; the problem is the interface between them

ITRS’05 predicted: 4x (8x) increase in global asynchronous signalling by 2012 (2020)


Motivation-2

Variability and uncertainty

– Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for interconnects and repeaters

• e.g., M4 and M5 resistance/um massively differ, leading to mistracking (C. Visweswariah, SLIP’06)

• e.g., 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00)

– Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))


A Network on Chip

Synchronization required

Arbitration required

Multiple Clocks

Async Links


Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)

For an onboard airborne computer-control system which tolerated up to two faults. Self-timed ring was a GALS system with self-checking and self-repair at the hardware level

Individually clocked subsystems

Self-timed adapters forming a ring


Communication Channel Adapter

Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)

AR, AS – acknowledgements

RR, RS – spare (for self-repair) lines

Much higher reliability than a bus and other forms of redundancy

MCC was developed in TTL-Schottky gate arrays, approx. 2K gates.


Outline

Token-based view of communication

Basics of asynchronous signalling

Self-timed data encoding

Pipelining

How to hide acknowledgements

Serial vs Parallel links

Arbiters and routers

Async2sync interface

CAD issues


Data exchange: token-based view

Question 1: when can Rx look at the incoming data?
Data validity issue – Forming a well-defined token

Question 2: when can Tx send new data?
Acknowledgement issue – Separation between tokens

[Diagram: source → Tx → Rx → dest, with Data passing from Tx to Rx]

These are fundamental issues of flow control at the physical and link levels

The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.



Tokens and spaces with global clocking

In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses


clk


Tokens and spaces

Without global clocking: Q1 can be resolved differently from Q2

E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing

[Diagram: source → Tx → Rx → dest; Data is bundled with D_valid; Tx and Rx are timed by Clk_tx and Clk_rx]


Tokens and spaces

[Diagram: source → Tx → Rx → dest; Data is bundled with D_valid; ack signals are returned in the opposite direction at each stage]


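Petri net model

[Diagram: Petri net over source, Tx, Rx, dest in which Data Valid is timed by Tx delay and Rx delay only]

One way delay, but may be unsafe!

[Diagram: Petri net over source, Tx, Rx, dest in which Data Valid is followed by ack (Tx delay or ack, Rx delay or ack)]

Always safe but with a round trip delay!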


Asynchronous handshake signalling

Valid data tokens and safe spaces between them can be created by different means of signalling and encoding

Level-based -> Return-To-Zero (RTZ) or 4-phase protocol

Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol

Pulse-based, e.g. GasP

Phase-difference-based

Data encoding: bundled data (BD), Delay-insensitive (DI)
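As a rough illustration of the first two signalling styles listed above, the sketch below generates the req/ack event trace for a few tokens under each protocol. The function names are hypothetical and wire delays are ignored; the only point is that 4-phase (RTZ) costs four transitions per token while 2-phase (NRZ) costs two.

```python
def four_phase(n_tokens):
    """Event trace (signal, level) for n tokens under RTZ / 4-phase signalling."""
    trace = []
    for _ in range(n_tokens):
        # req+ (data valid), ack+ (data taken), then both return to zero
        trace += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return trace


def two_phase(n_tokens):
    """Event trace (signal, level) for n tokens under NRZ / 2-phase signalling."""
    trace = []
    req = ack = 0
    for _ in range(n_tokens):
        req ^= 1                    # any transition on req announces a token
        trace.append(("req", req))
        ack ^= 1                    # any transition on ack acknowledges it
        trace.append(("ack", ack))
    return trace


if __name__ == "__main__":
    n = 3
    for name, proto in (("4-phase (RTZ)", four_phase), ("2-phase (NRZ)", two_phase)):
        trace = proto(n)
        print(name, "transitions per token:", len(trace) // n)
```

The saving of 2-phase in wire transitions is usually paid for in logic, since the receiver has to react to both edge directions.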


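Handshake Signalling Protocols

Level Signalling (RTZ or 4-phase)

[Timing diagram: req and ack; one cycle is req+, ack+, req-, ack-]

Transition Signalling (NRZ or 2-phase)

[Timing diagram: req and ack; one cycle is one transition on req followed by one transition on ack]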


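Handshake Signalling Protocols

Pulse Signalling

[Timing diagram: req and ack pulses; one cycle is one req pulse followed by one ack pulse]

Single-track Signalling (GasP)

[Timing diagram: req and ack share one wire (req + ack); one cycle is one req followed by one ack on that wire]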


GasP signalling

Pull up from pred (req)

Pull down here (ack)

Pull up from here (req)

Pull down from succ (ack)

Pulse length control loops

Source: R. Ho et al, Async’04


Data encoding

Bundled data

– Code is positional binary, token is determined by Req+ signal; Req+ arrives with a safe set-up delay from data

Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)

– 1-of-2 (dual-rail per bit) – systematic code; encoding and decoding are straightforward

– m-of-n (n>2) – not systematic, i.e. incur encoding and decoding costs; optimal when m=n/2

– One-hot, 1-of-n (n>2) – completion detection is easy; not practical for n>4

– Systematic, such as Berger – incur complex completion detection
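To make the dual-rail case concrete, here is a minimal sketch of RTZ dual-rail encoding with a NULL spacer between codewords. It assumes the usual convention that the Data.1 rail high means logical 1 and Data.0 high means logical 0; the helper names are illustrative only.

```python
NULL = (0, 0)  # spacer: both rails low


def dual_rail_encode(bit):
    """Return (Data.0, Data.1) for one bit; exactly one rail is high."""
    return (1, 0) if bit == 0 else (0, 1)


def rtz_stream(bits):
    """Codewords separated by NULL spacers, as RTZ signalling requires."""
    stream = []
    for b in bits:
        stream.append(dual_rail_encode(b))
        stream.append(NULL)
    return stream


def dual_rail_decode(stream):
    """Recover the bits; a token is present whenever exactly one rail is high."""
    bits = []
    for d0, d1 in stream:
        if (d0, d1) == NULL:
            continue                      # spacer separates tokens
        if d0 and d1:
            raise ValueError("illegal codeword: both rails high")
        bits.append(1 if d1 else 0)
    return bits


if __name__ == "__main__":
    print(rtz_stream([1, 0]))
```

Because validity is carried by the codeword itself, the receiver needs no separate request wire: completion detection per bit is just the OR of the two rails.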


Bundled Data

[Timing diagrams: req, ack and Data over one cycle, for RTZ and for NRZ signalling]


DI encoded data (Dual-Rail)

[Timing diagrams: Data.0, Data.1 and ack. RTZ: each cycle carries Logical 0 or Logical 1, with a NULL (spacer) between codewords. NRZ: each cycle carries Logical 0 or Logical 1 with no spacer]

The NRZ coding leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below


DI codes (1-of-n and m-of-n)

1-of-4:

– 0001=>00, 0010=>01, 0100=>10, 1000=>11

2-of-4:

– 1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.)

3-of-6:

– 111000, 110100, …, 000111 – total 20 combinations (can encode 4 bits + 4 control tokens)

2-of-7:

– 1100000, 1010000, …, 0000011 – total 21 combinations (4 bits + 5 control tokens)
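The codeword counts above are binomial coefficients, C(n, m). A small sketch (the function name is hypothetical) enumerates the codewords and derives how many whole data bits fit and how many codewords are left over for control tokens:

```python
from itertools import combinations
from math import comb


def m_of_n_codewords(m, n):
    """All n-bit strings with exactly m ones, leftmost ones first."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


if __name__ == "__main__":
    for m, n in ((1, 4), (2, 4), (3, 6), (2, 7)):
        words = m_of_n_codewords(m, n)
        assert len(words) == comb(n, m)
        data_bits = len(words).bit_length() - 1   # floor(log2(count))
        spare = len(words) - 2 ** data_bits       # codewords left for control
        print(f"{m}-of-{n}: {len(words)} codewords, "
              f"{data_bits} data bits + {spare} spare")
```

For 3-of-6 this gives 20 codewords (4 bits + 4 spare) and for 2-of-7 it gives 21 (4 bits + 5 spare), matching the figures on the slide.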


DI codes completion detection and decoding

1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)

Decode 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)

For m-of-n codes, CD and decoding are non-trivial

From J.Bainbridge et al, ASYNC’03
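The 1-of-4 equations above can be checked directly. The sketch below models the four wires as a tuple (d0, d1, d2, d3), where wire d_k high carries the value k (so 0001 in the slide's notation is d0 high); the function names are illustrative.

```python
def encode_one_of_four(value):
    """Value 0..3 -> (d0, d1, d2, d3) with only wire d_value high."""
    return tuple(1 if i == value else 0 for i in range(4))


def completion_detect(d):
    """CD = d0 + d1 + d2 + d3 (a 4-input OR)."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def one_of_four_to_dual_rail(d):
    """Four 2-input ORs produce two dual-rail bits, each as (q.0, q.1)."""
    d0, d1, d2, d3 = d
    q0 = (d0 | d2, d1 | d3)   # low-order bit:  q0.0, q0.1
    q1 = (d0 | d1, d2 | d3)   # high-order bit: q1.0, q1.1
    return q1, q0


if __name__ == "__main__":
    for value in range(4):
        q1, q0 = one_of_four_to_dual_rail(encode_one_of_four(value))
        print(value, "-> q1 =", q1, " q0 =", q0)
```

With the all-zero spacer on the input, CD is 0 and both dual-rail outputs are also at their spacer (0, 0), so the decoder preserves the RTZ discipline.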


Incomplete DI codes

Incomplete 2-of-7:

Composed of 1-of-3 and 1-of-4

From J.Bainbridge et al ASYNC’03


Phase difference based encoding (C. D’Alessandro et al. ASYNC’06,’07)

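[Timing diagram: signals ref, t_1, t_0 and data, with spacers sp0 and sp1 between symbols; the data value (0 or 1) is determined by whether t_1 arrives before t_0 or t_0 before t_1]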

The proposed system encodes a bit of data in the phase relationship between two signals generated from a reference

This would ensure that any transient fault appearing on one of the reference signals will be ignored if it is not mirrored by a corresponding transition on the other line

Similarity with multi-wire communication


Phase encoding: multiple rail

No group of wires has the same delay

All wires toggle when an item of data is sent

Increased number of states available (n wires = n! states), hence more bits/symbol

Table illustrates examples of phase encoding compared to the respective m-of-n counterpart

Type of link     States   Bits/symbol   Extra states   Transitions/symbol   Symbols/packet   Transitions/packet
Phase enc. (4)   24       4             8              4                    32               128
1-of-4           4        2             0              2                    64               128
Phase enc. (6)   720      9             208            6                    15               90
3-of-6           20       4             4              6                    32               192
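The table rows follow from simple counting, which the sketch below reproduces. One assumption is not stated on the slide: the packet size of 128 bits, inferred here because it is the value that makes every row consistent. The function name is hypothetical.

```python
from math import comb, factorial

PACKET_BITS = 128  # assumption: inferred from the table, not stated on the slide


def link_row(states, transitions_per_symbol, packet_bits=PACKET_BITS):
    """Return (states, bits/symbol, extra states, transitions/symbol,
    symbols/packet, transitions/packet) for one link type."""
    bits = states.bit_length() - 1        # floor(log2(states))
    extra = states - 2 ** bits            # states not needed for data
    symbols = -(-packet_bits // bits)     # ceiling division
    return (states, bits, extra, transitions_per_symbol,
            symbols, symbols * transitions_per_symbol)


rows = {
    "Phase enc. (4)": link_row(factorial(4), 4),  # n wires, all toggle once
    "1-of-4": link_row(comb(4, 1), 2),            # RTZ: one wire up, then down
    "Phase enc. (6)": link_row(factorial(6), 6),
    "3-of-6": link_row(comb(6, 3), 6),            # RTZ: three wires up, then down
}

if __name__ == "__main__":
    for name, row in rows.items():
        print(f"{name:15s}", row)
```

The comparison that matters is the last column: six-wire phase encoding moves a packet in 90 transitions against 192 for 3-of-6 on the same number of wires.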


Phase encoding Repeater

[Diagram: repeater between sender and receiver with inputs i1–i3, outputs o1–o3 and a go signal; phase detectors (mutexes) resolve the pairwise orderings 1<2, 2<1, 1<3, 3<1, 2<3, 3<2]


Pipelines

Dual-rail pipeline

From J.Bainbridge & S. Furber IEEE Micro, 2002


The problem of Acking

Question 2 “when can Tx send new data?” has two aspects:

– Safety (not to overflow the channel, or when Tx and Rx have much variation in delay)

– Performance (to maximize throughput and reduce latency)

Can we hide ack (round trip) delay?


From R.Ho et al. ASYNC’04

To maintain throughput more pipeline stages are required but that costs too much latency and power

First minimize latency along a long wire (not specific to asynchronous) and then maximize throughput (using “wagging tail buffer” approach)


From R.Ho et al. ASYNC’04

Use of wagging buffer approach

Alternate between top and bottom

control


“Wagging tail buffer” approach

[Diagram: one data channel with two control channels, top (reqtop, acktop) and bottom (reqbot, ackbot)]

Top and bot control channels work at ½ frequency of data channel


Serial Link vs Parallel Link (from R. Dobkin)

Why Serial Link?

– Less interconnect area

– Less routing congestion

– Less coupling

– Less power (depends on range)

The relative improvement grows with technology scaling. The example refers to:

– Single gate delay serial link

– Fully-shielded parallel link with 8 gate delay clock cycle

– Equal bit-rate

– Word width N=8

[Plots: link length (mm) versus technology node (180, 130, 90, 65, 30, 15 nm), one for power and one for area, each divided into a region where the parallel link dissipates less power or requires less area and a region where the serial link does]


Serialization model

Tx Rx

Acking at the bit level

… …


Serialization model

Tx Rx

Acking at the word level


Serialization model

Tx Rx

Acking at the word level (with more concurrency)


Serial Link – Top Structure (R.Dobkin, Async’07)

Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)

Acknowledge per word instead of per bit

Synchronizers used at the level of the ack signals

Wave-pipelining over channel

Differential encoding (DS-DE, IEEE1355-95)

Reported throughput: 67 Gbps for 65nm process (viz. one bit per 15ps – expected FO4 inverter delay), based on simulations


Encoding – Two Phase NRZ LEDR

Two Phase Non-Return-to-Zero Level Encoded Dual Rail – “delta” encoding (one transition per bit)

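[Waveforms: uncoded data (B), state bit (S) and phase bit (P) for the bit sequence 0 0 1 1 0 0 0 0 1 0]

S(i) = B(i) for all i

P(i) = B(i) for one parity of i (odd or even) and the complement of B(i) for the other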
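A minimal LEDR sketch follows, using the bit sequence from the slide. It assumes the state rail carries the data bit unchanged and that the phase rail equals the data bit on even i and its complement on odd i; which parity is complemented, and the idle starting state of the wires, are assumptions made for illustration. The property being demonstrated, exactly one rail toggling per bit, holds under either choice.

```python
def ledr_encode(bits):
    """Return one (S, P) rail pair per bit.

    S(i) = B(i); P(i) = B(i) XOR (i mod 2). The choice of which parity
    complements P is an assumption of this sketch.
    """
    return [(b, b ^ (i & 1)) for i, b in enumerate(bits)]


def ledr_decode(pairs):
    """The data bit is read straight off the state rail."""
    return [s for s, _ in pairs]


def rail_toggles(pairs, start=(0, 1)):
    """Number of rails that change at each bit, from an assumed idle state."""
    prev = start
    counts = []
    for cur in pairs:
        counts.append((prev[0] != cur[0]) + (prev[1] != cur[1]))
        prev = cur
    return counts


if __name__ == "__main__":
    bits = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]
    pairs = ledr_encode(bits)
    print(pairs)
    print(rail_toggles(pairs))
```

Since S XOR P alternates with every bit, the receiver can recover a strobe by XOR-ing the two rails, which is why the scheme is also described as data-strobe.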


Transmitter – Fast SR Approach (from R. Dobkin)


Receiver Splitter (from R. Dobkin)


Self Timed Networks

Router requires priority arbitration

– Arbitration necessary at every router merge

– Potential delay at every node on the path

BUT

– Asynchronous merge/arbitration time is average not worst case

Adapters to locally clocked cells require synchronization

Synchronization necessary when clocks are unknown

– Occurs when receiving data (data valid), and when sending (acknowledge)

BUT

– Time can be long (2 cycles?)

– Must assume worst case time (maybe)


Router priority

Virtual channels implement scheduling algorithm

Contention for link resolved by priority circuits

Merge Split

Link

Flow Control


Asynchronous Arbiters

Multiway arbiters (e.g. for Xbar switches):

– Cascaded mesh (latency ~ N)

– Cascaded Tree (latency ~ logN)

– Token-Ring (busy ring and lazy ring) (latency ~ from 1 to N)

Priority arbiters (e.g. for Routers with different QoS):

– Static priority (topological order)

– Dynamic priority (request arrives with priority code)

– Ordered (time-priority) - multiway arbiter, followed by a FIFO buffer


Static Priority Arbiter

[Circuit diagram: requests R1–R3 and grants G1–G3; MUTEX elements and C-elements form a lock register controlled by Lock; a priority module takes r1–r3 and produces s1–s3]


Why Synchronizer?

Here one clock cycle is used for the metastability to resolve.

[Diagrams: a single DFF (DATA, CLK, Q) whose output Q goes metastable; a two-DFF synchronizer in which a second DFF on the same CLK samples the output of the first]

Two DFF Synchronizer


CAD support: Async design flow


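Synthesis of Asynchronous link interfaces

[Diagram: VME bus controller placed between the Bus (DSr, DSw, DTACK) and the Device (LDS, LDTACK), controlling the data transceiver through D; timing diagram of the read cycle for DSr, LDS, LDTACK, D, DTACK]

[Diagram: signal transition graph (STG) of the controller for the read and write cycles, over the transitions of DSr, DSw, LDS, LDTACK, D and DTACK]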


Complete State Coding (CSC)

[Diagram: state graph of the read cycle (transitions of DSr, LDS, LDTACK, D, DTACK) with the inserted state signal transitions csc+ and csc-]

Boolean equations:

LDS = D + csc

DTACK = D

D = LDTACK · csc

csc = DSr · (csc + ¬LDTACK)

[Diagram: logic asynchronous circuit with signals DSr, LDTACK, LDS, D, DTACK and csc, obtained by synthesis from the STG]


Conclusions on Async Links

At nm level links will be more asynchronous, perhaps first mesochronous to avoid global clock skew

Delay-insensitive codes can be used to tolerate interwire-delay variability

Phase-encoding can be used for higher power-bit efficiency and SEU tolerance

Acking will be mainly used for flow control (word level) and its overhead can be ‘hidden’ by using the “wagging buffer” technique

Serial Links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches

Synthesis tools can be used to build clock-free interfaces between different links

Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers


And finally …


ASYNC’08 and NOCs’08 …plus SLIP’08

Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend)

async.org.uk/async2008

async.org.uk/nocs2008

Submission deadlines:

– Async’08: Abstract – Oct. 8, Full paper – Oct. 15

– NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19


Extras

More slides if I have time!


Chain Network Components

From J.Bainbridge & S. Furber IEEE Micro, 2002




Reliability and latency

Asynchronous arbiters fail only if time is bounded

– Latency depends on fixed gates plus MUTEX lock time

– for 2 channels, + ln(N-1) for more

– This is likely to be small compared with flow control latency

Synchronizers fail at (fairly) predictable rates but these rates may get worse

– Latency can be 35τ now for good reliability


The synchronizer

Clock and valid can happen very close together

Flip Flop #1 gets caught in metastability

We wait until it is resolved (1 – 2 clock periods)

[Diagram: DATA and VALID arrive from the CLK1 domain; VALID passes through two D flip-flops, #1 and #2, both clocked by CLK2]


MTBF

$$\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w \, f_c \, f_d}$$

For a 0.18 µm process τ is 20 – 50 ps

T_w is similar

Suppose the clock and data frequencies are 2 GHz

t needs to be > 25τ (more than one clock period) to get MTBF > 28 days

– 100 synchronizers + 5τ

– MTBF > 1 year + 2τ

– PVT variations + 5 – 10τ . . .


Event Histogram

[Plot: metastability time; effective input overlap (log scale, 1E-19 to 1E-10) versus Q to Clock time (-1.0E-08 to 0.0E+00 s); annotated regions: 100ps input variation, 10ps noise and jitter, deep meta]

Measurement: convert to log scale; the slope gives τ


Not always simple

[Plot: metastability time; effective input overlap (log scale, 1E-20 to 1E-10) versus Q to Clock time (-1.0E-08 to -3.0E-09 s); annotated regions: 10ps noise and jitter, deep meta]

More than one slope: 350ps, 120ps, 140ps


Synchronization Strategies

Avoid synchronization time (and arbitration time) by

– predicting clocks, stoppable clocks

– dedicating link paths for long periods of time

Minimize time by circuit methods

– Higher power, better τ

– Reducing apparent device variability - wide transistors

– Many parallel synchronizers increase throughput

Reduce average latency by speculation

– Reduce synchronization time, detect errors and roll back


Timing regions can have predictable relationships

Locked

– Two clocks from same source

– Linked by PLL

– One produced by dividing the other

– Some asynchronous systems

– Some GALS

Not locked together but predictable

– Two clocks same frequency, but different oscillators

– As above, same frequency ratio


Don’t synchronise when you don’t need to

If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew

FIFO must never overflow

Next read clock can be predicted and metastability avoided

[Diagram: FIFO between DATA in and DATA out, with signals REQ IN, ACK IN, Write, Data Available, Read done, REQ OUT and ACK OUT]


Conflict Prediction

[Timing diagram: Receiver Clock, Transmitter Clock and Predicted Transmitter Clock]

Synchronization problem known a cycle in advance of the Receiver clock.

We can do this thanks to the periodic nature of the clocks


Problems predicting next cycle

Difficult to predict

– Multiple source clocks

– Input output interfaces

Dynamic jitter and noise

– GALS start up clocks take several cycles to stabilise

– Crosstalk

– Power supply variations introducing noise into both data and clock

– Temperature changes alter relative delays

As a proportion of cycle time, this is likely to increase with smaller geometries


Synchronizer reliability trends

Clock rates increase. 10 GHz gives 100ps for a cycle.

– Both data and clock rates up by n, T_w down by n

Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n

Number of synchronizers goes up by N

– Die reliability down by N

Die-to-die and on-die variability increases to as much as 40%

– 40% more time needed for all synchronizers


An example

Example

– 10 GHz clock and data rate, τ = 10 ps

– 100 synchronizers

– MTBF required 3.8 months (10^7 seconds)

– Time required 41τ, or 4.1 cycles; + 40% = 5.8 cycles

Does this matter?
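The figures in this example can be reproduced from the usual synchronizer failure model, MTBF = e^(t/τ) / (T_w · f_c · f_d), solved for t. Two assumptions are made here that the slide does not spell out: T_w is taken equal to τ (the deck only says it is similar), and N independent synchronizers are taken to divide the MTBF by N. The function name is hypothetical.

```python
from math import log


def settling_time_in_tau(mtbf_s, t_w, f_c, f_d, n_sync=1):
    """Solve MTBF = exp(t/tau) / (n_sync * T_w * f_c * f_d) for t/tau."""
    return log(mtbf_s * n_sync * t_w * f_c * f_d)


TAU = 10e-12        # 10 ps
F_CLK = 10e9        # 10 GHz clock and data rate
MTBF_TARGET = 1e7   # seconds

# Assumption: T_w = tau
t_over_tau = settling_time_in_tau(MTBF_TARGET, t_w=TAU, f_c=F_CLK, f_d=F_CLK,
                                  n_sync=100)
cycles = t_over_tau * TAU * F_CLK   # settling time in clock periods
cycles_with_margin = cycles * 1.4   # 40% allowance for variability

if __name__ == "__main__":
    print(f"t = {t_over_tau:.1f} tau = {cycles:.1f} cycles, "
          f"{cycles_with_margin:.1f} cycles with 40% margin")
```

Because t enters only through a logarithm, even large changes in the MTBF target or the number of synchronizers move the required settling time by just a few τ.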


Power futures

Total synchronizer area/power small, BUT τ very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain

This affects MUTEX circuits as well

[Plot: tau (ps, 0 to 250) versus Vdd (0.5 to 2)]


Power/speed tradeoffs

Increase Vdd when synchronisation required

Make synchronizer transistors wide to reduce variation and, to some extent, τ

Make many synchronizer circuits, and select the consistently fastest one

Avoid reducing synchronizer Vdd when running slow


Speculation

Mostly, the synchronizer does not need 35τ to settle

Only e^-10 (0.005%) need more than 10τ

Why not go ahead anyway, and try again if more time was needed
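A back-of-envelope sketch of the argument, assuming metastability resolves exponentially so that the fraction of events still unresolved after time t is e^(-t/τ). The retry cost model is a hypothetical simplification: a failed speculation is assumed to cost one full conservative wait on top of the early one.

```python
from math import exp


def fraction_unresolved(t_over_tau):
    """Fraction of synchronization events needing more than t (in units of tau)."""
    return exp(-t_over_tau)


def expected_wait(early, full):
    """Mean wait in tau if we proceed after `early` and, on failure,
    pay an additional `full` (hypothetical retry model)."""
    return early + fraction_unresolved(early) * full


if __name__ == "__main__":
    p = fraction_unresolved(10)
    print(f"needing more than 10 tau: {p:.4%}")
    print(f"expected wait: {expected_wait(10, 35):.4f} tau instead of 35 tau")
```

Under these assumptions the average latency stays very close to the early wait, and the cost of speculation lies in the detection and roll-back logic rather than in time.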


Low latency synchronization

Data Available, or Free to write, are produced early

– After one cycle?

If they prove to be in error, synchronization failed

– Only know this after two or more cycles

Read Fail or Write Fail flag is then raised and the action can be repeated.

[Diagram: FIFO between DATA in and DATA out; WRITE side with Write Data, Free to write, Full, Write Fail and Write clock; READ side with Data Available, Not Empty, Read done, Read Fail and Read Clock; a speculative synchronizer on each side]


Comments

Synchronization time will be an issue for future GALS

Latency and throughput can be affected

– Should the flit be large to reduce the effective overhead of time and power?

Some power speed trade off is possible

– Higher power synchronization can buy some performance?

Speculation is complex

– Is it worth it?
