17/09/2007, NanoNet’07, Catania
Asynchronous Links, for NanoNets?
Alex Yakovlev
University of Newcastle, UK
Motivation-1
At very deep submicron (VDSM), gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits
Timing issue is a problem, particularly for global wires
[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters. Source: ITRS, 2003]
Multiple clock domains are a reality; the problem is the interface between them
ITRS’05 predicted: 4x (8x) increase in global asynchronous signalling by 2012 (2020)
Motivation-2
Variability and uncertainty
– Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for interconnects and repeaters
• e.g., M4 and M5 resistance/µm massively differ, leading to mistracking (C. Visweswariah, SLIP’06)
• e.g., 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00)
– Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))
A Network on Chip
[Figure: a network on chip. Labels: Synchronization required; Arbitration required; Multiple clocks; Async links]
Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
For an onboard airborne computer-control system which tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level
Individually clocked subsystems
Self-timed adapters forming a ring
Communication Channel Adapter
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS – acknowledgements
RR, RS – spare (for self-repair) lines
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL-Schottky gate arrays, approx. 2K gates.
Outline
– Token-based view of communication
– Basics of asynchronous signalling
– Self-timed data encoding
– Pipelining
– How to hide acknowledgements
– Serial vs Parallel links
– Arbiters and routers
– Async2sync interface
– CAD issues
Data exchange: token-based view
Question 1: when can Rx look at the incoming data? Data validity issue – forming a well-defined token
Question 2: when can Tx send new data? Acknowledgement issue – separation between tokens
These are fundamental issues of flow control at the physical and link levels
The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.
[Figure: source, Tx, Rx, dest chain with a Data channel between Tx and Rx]
Tokens and spaces with global clocking
In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses
[Figure: source, Tx, Rx, dest chain with a Data channel and a global clk]
Tokens and spaces
Without global clocking: Q1 can be resolved differently from Q2
E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing
[Figure: source, Tx, Rx, dest chain; Data is bundled with a D_valid signal; Tx and Rx have separate clocks Clk_tx and Clk_rx]
[Figure: source, Tx, Rx, dest chain; Data is bundled with a D_valid signal, and ack signals are returned along the chain]
Petri net model
[Figure: two Petri net models of the source, Tx, Rx, dest chain. One is labelled with Tx delay and Rx delay and a Data Valid place only; the other with "Tx delay or ack" and "Rx delay or ack", and has Data Valid and ack places]
With ack: always safe but with a round trip delay!
Without ack: one way delay, but may be unsafe!
Asynchronous handshake signalling
Valid data tokens and safe spaces between them can be created by different means of signalling and encoding
– Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
– Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
– Pulse-based, e.g. GasP
– Phase-difference-based
– Data encoding: bundled data (BD), delay-insensitive (DI)
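Handshake Signalling Protocols
Level Signalling (RTZ or 4-phase)
[Figure: req and ack waveforms; one cycle comprises req+, ack+, req-, ack-]
Transition Signalling (NRZ or 2-phase)
[Figure: req and ack waveforms; one cycle comprises a single transition on req followed by a single transition on ack]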
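The two protocols can be sketched as event sequences. This is a minimal Python sketch, assuming an idealised channel in which every req event is answered by exactly one ack event; the function names are illustrative and not taken from the slides. It shows that level signalling costs four transitions per token and transition signalling two.

```python
def four_phase(n_tokens):
    """Events of n RTZ (4-phase) handshakes as (signal, level) pairs."""
    events = []
    for _ in range(n_tokens):
        # req+ , ack+ , then both return to zero before the next token
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(n_tokens):
    """Events of n NRZ (2-phase) handshakes: one toggle on req, one on ack."""
    events = []
    req = ack = 0
    for _ in range(n_tokens):
        req ^= 1                      # any transition on req means "new token"
        events.append(("req", req))
        ack ^= 1                      # any transition on ack means "token taken"
        events.append(("ack", ack))
    return events


if __name__ == "__main__":
    print(len(four_phase(3)), "transitions for 3 tokens (4-phase)")
    print(len(two_phase(3)), "transitions for 3 tokens (2-phase)")
```

The halved transition count of the 2-phase protocol is why it is attractive for long wires, at the price of logic that must react to both edges.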
Handshake Signalling Protocols
Pulse Signalling
[Figure: req and ack pulse waveforms, one cycle marked]
Single-track Signalling (GasP)
[Figure: a single "req + ack" wire, with the req and ack phases marked within one cycle]
GasP signalling
[Figure: GasP stage on a single-track wire: pull up from pred (req), pull down here (ack), pull up from here (req), pull down from succ (ack); pulse length control loops]
Source: R. Ho et al, Async’04
Data encoding
Bundled data
– Code is positional binary, token is determined by Req+ signal; Req+ arrives with a safe set-up delay from data
Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)
– 1-of-2 (dual-rail per bit): systematic code; encoding and decoding are straightforward
– m-of-n (n>2): not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2
– One-hot, 1-of-n (n>2): completion detection is easy; not practical for n>4
– Systematic, such as Berger: incurs complex completion detection
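The dual-rail case can be sketched in a few lines. This is a minimal Python sketch, assuming the convention that a bit is the pair (rail0, rail1), logical 0 raises rail0, logical 1 raises rail1, and the spacer (NULL) is both rails low; the names are illustrative. It shows why the receiver needs no separate validity signal: the word is complete exactly when every pair holds a codeword.

```python
NULL = (0, 0)  # spacer: both rails low (assumed RTZ convention)


def encode_bit(b):
    """Dual-rail (1-of-2) codeword for one bit, as (rail0, rail1)."""
    return (0, 1) if b else (1, 0)


def encode_word(bits):
    return [encode_bit(b) for b in bits]


def is_complete(word):
    """Completion detection: every bit pair has exactly one rail high."""
    return all((r0 ^ r1) == 1 for r0, r1 in word)


def is_spacer(word):
    """True when the whole word has returned to NULL."""
    return all(pair == NULL for pair in word)


def decode_word(word):
    if not is_complete(word):
        raise ValueError("word is not yet a complete codeword")
    return [r1 for _, r1 in word]


if __name__ == "__main__":
    w = encode_word([1, 0, 1])
    print(w, is_complete(w), decode_word(w))
```

A word with some pairs still at NULL is neither complete nor a spacer, which is the intermediate state the receiver must wait through.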
Bundled Data
[Figure: req, ack and Data waveforms for bundled data, one cycle marked, shown for RTZ and for NRZ signalling]
DI encoded data (Dual-Rail)
[Figure: ack, Data.0 and Data.1 waveforms. RTZ: logical 1 and logical 0 codewords separated by NULL (spacer) states, one cycle marked. NRZ: successive logical values, each signalled in one cycle by a transition on Data.0 or Data.1]
NRZ: this coding leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below
DI codes (1-of-n and m-of-n)
1-of-4:
– 0001=>00, 0010=>01, 0100=>10, 1000=>11
2-of-4:
– 1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.)
3-of-6:
– 111000, 110100, …, 000111 – total 20 combinations (can encode 4 bits + 4 control tokens)
2-of-7:
– 1100000, 1010000, …, 0000011 – total 21 combinations (4 bits + 5 control tokens)
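The codeword counts above are binomial coefficients and can be checked by enumeration. This is a minimal Python sketch (the helper names are illustrative): it lists all m-of-n codewords and reports how many are left over after encoding 4 data bits.

```python
from itertools import combinations
from math import comb


def m_of_n_codewords(m, n):
    """All n-wire codewords with exactly m wires high, as bit strings."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


def spare_tokens(m, n, data_bits):
    """Codewords left for control tokens after encoding data_bits bits."""
    return comb(n, m) - 2 ** data_bits


if __name__ == "__main__":
    for m, n in [(2, 4), (3, 6), (2, 7)]:
        words = m_of_n_codewords(m, n)
        print(f"{m}-of-{n}: {len(words)} codewords, first {words[0]}, last {words[-1]}")
    print("3-of-6 spare tokens:", spare_tokens(3, 6, 4))
    print("2-of-7 spare tokens:", spare_tokens(2, 7, 4))
```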
DI codes completion detection and decoding
1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)
Decoding 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)
For m-of-n codes CD and decoding are non-trivial
From J.Bainbridge et al, ASYNC’03
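The OR-gate equations above can be exercised directly. This is a minimal Python sketch, assuming wire d0 is high for value 00, d1 for 01, d2 for 10 and d3 for 11, with q0 the low-order and q1 the high-order dual-rail bit; the function names are illustrative.

```python
def one_of_four(value):
    """1-of-4 codeword (d0, d1, d2, d3) for a 2-bit value; wire d<value> is high."""
    return tuple(1 if i == value else 0 for i in range(4))


def completion_detect(d):
    """CD = d0 + d1 + d2 + d3 (4-input OR)."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def to_dual_rail(d):
    """Four 2-input ORs: returns ((q0.0, q0.1), (q1.0, q1.1))."""
    d0, d1, d2, d3 = d
    q0 = (d0 | d2, d1 | d3)   # low-order bit
    q1 = (d0 | d1, d2 | d3)   # high-order bit
    return q0, q1


if __name__ == "__main__":
    for v in range(4):
        d = one_of_four(v)
        print(v, d, completion_detect(d), to_dual_rail(d))
```

The spacer (0, 0, 0, 0) maps to the dual-rail spacer on both bits, so the decoder preserves the RTZ phases.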
Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J.Bainbridge et al ASYNC’03
Phase difference based encoding (C. D’Alessandro et al. ASYNC’06,’07)
[Figure: waveforms of ref, t_0, t_1 and data, with spacers sp0 and sp1 between data values; the data value is determined by the order of arrival (t_1 before t_0, or t_0 before t_1)]
The proposed system encodes a bit of data in the phase relationship between two signals generated using a reference
This ensures that any transient fault appearing on one of the reference signals is ignored if it is not mirrored by a corresponding transition on the other line
Similarity with multi-wire communication
Phase encoding: multiple rail
– No group of wires has the same delay
– All wires toggle when an item of data is sent
– Increased number of states available (n wires = n! states), hence more bits/symbol
– Table illustrates examples of phase encoding compared to the respective m-of-n counterpart

Type of link     States   Bits/symbol   Extra states   Transitions/symbol   Symbols/packet   Transitions/packet
Phase enc. (4)   24       4             8              4                    32               128
1-of-4           4        2             0              2                    64               128
Phase enc. (6)   720      9             208            6                    15               90
3-of-6           20       4             4              6                    32               192
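The table rows follow from simple arithmetic. This is a minimal Python sketch that reproduces them, assuming a 128-bit packet (an inference from the symbols-per-packet column, not stated on the slide) and the transitions-per-symbol figures given in the table; the function name is illustrative.

```python
from math import ceil, factorial

PACKET_BITS = 128  # assumption: inferred from the table, not stated in the source


def link_row(states, transitions_per_symbol, packet_bits=PACKET_BITS):
    """Derive the table columns for a link with the given number of states."""
    bits = states.bit_length() - 1            # floor(log2(states))
    extra = states - 2 ** bits                # states not needed for data
    symbols = ceil(packet_bits / bits)        # symbols to carry one packet
    return {
        "states": states,
        "bits": bits,
        "extra": extra,
        "symbols": symbols,
        "transitions": symbols * transitions_per_symbol,
    }


if __name__ == "__main__":
    rows = {
        "Phase enc. (4)": link_row(factorial(4), 4),
        "1-of-4": link_row(4, 2),
        "Phase enc. (6)": link_row(factorial(6), 6),
        "3-of-6": link_row(20, 6),
    }
    for name, row in rows.items():
        print(name, row)
```

With six wires, phase encoding needs less than half the transitions per packet of 3-of-6, which is the power-per-bit argument made in the table.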
Phase encoding Repeater
[Figure: repeater between sender and receiver with inputs i1, i2, i3, outputs o1, o2, o3 and a go signal; phase detectors (mutexes) resolve each pairwise order: 1<2 or 2<1, 1<3 or 3<1, 2<3 or 3<2]
Pipelines
Dual-rail pipeline
From J.Bainbridge & S. Furber IEEE Micro, 2002
The problem of Acking
Question 2 “when can Tx send new data?” has two aspects:
– Safety (not to overflow the channel, or when Tx and Rx have much variation in delay)
– Performance (to maximize throughput and reduce latency)
Can we hide ack (round trip) delay?
From R.Ho et al. ASYNC’04
To maintain throughput more pipeline stages are required but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous) and then maximize throughput (using “wagging tail buffer” approach)
From R.Ho et al. ASYNC’04
Use of wagging buffer approach
Alternate between top and bottom control
“Wagging tail buffer” approach
[Figure: a data channel with two control channels, reqtop/acktop and reqbot/ackbot]
Top and bot control channels work at ½ frequency of data channel
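The alternation can be illustrated with an abstract model. This is a minimal Python sketch, not the circuit from R. Ho et al.: it assumes tokens on the data channel are simply handed to the top and bottom control channels in turn (names are illustrative), so each control channel handshakes for every second token and therefore runs at half the data rate.

```python
def wag(tokens):
    """Split a token stream alternately between top and bottom control."""
    top, bot = [], []
    for i, tok in enumerate(tokens):
        (top if i % 2 == 0 else bot).append(tok)
    return top, bot


def merge(top, bot):
    """Receiver side: read top and bottom alternately to restore the order."""
    out = []
    for i in range(len(top) + len(bot)):
        src = top if i % 2 == 0 else bot
        out.append(src[i // 2])
    return out


if __name__ == "__main__":
    tokens = list(range(7))
    top, bot = wag(tokens)
    print("top:", top, "bot:", bot, "merged:", merge(top, bot))
```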
Serial Link vs Parallel Link (from R. Dobkin)
Why Serial Link?
– Less interconnect area
– Less routing congestion
– Less coupling
– Less power (depends on range)
The relative improvement grows with technology scaling. The example on the right refers to:
– Single gate delay serial link
– Fully-shielded parallel link with 8 gate delay clock cycle
– Equal bit-rate
– Word width N=8
[Figure: two plots of link length [mm] versus technology node [nm] (180, 130, 90, 65, 30, 15). One plot separates the region where the parallel link dissipates less power from the region where the serial link dissipates less power; the other separates the region where the parallel link requires less area from the region where the serial link requires less area]
Serialization model
[Figure: Tx and Rx connected by a serial link, shown in three variants]
– Acking at the bit level
– Acking at the word level
– Acking at the word level (with more concurrency)
Serial Link – Top Structure (R.Dobkin, Async’07)
– Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
– Acknowledge per word instead of per bit
– Synchronizers used at the level of the ack signals
– Wave-pipelining over channel
– Differential encoding (DS-DE, IEEE1355-95)
– Reported throughput: 67 Gbps for a 65nm process (viz. one bit per 15ps, the expected FO4 inverter delay), based on simulations
Encoding – Two Phase NRZ LEDR
Two Phase Non-Return-to-Zero Level Encoded Dual Rail – “delta” encoding (one transition per bit)
[Figure: waveforms of the uncoded bit stream (B), state bit (S) and phase bit (P) for the sequence 0 0 1 1 0 0 0 0 1 0]
$$P(i) = \begin{cases} \overline{B(i)}, & i \text{ odd} \\ B(i), & i \text{ even} \end{cases} \qquad S(i) = B(i), \ \forall i$$
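The one-transition-per-bit property can be checked with a short sketch. This is a minimal Python sketch, assuming the state rail S carries the data bit and the phase rail P carries the data bit inverted on odd-indexed bits (the choice of which parity is inverted is an assumption; the opposite choice works equally well). The function names are illustrative.

```python
def ledr_encode(bits):
    """Return the (S, P) rail values for each bit index i (0-based)."""
    pairs = []
    for i, b in enumerate(bits):
        s = b                 # state rail carries the data value
        p = b ^ (i % 2)       # phase rail: inverted on odd i (assumed convention)
        pairs.append((s, p))
    return pairs


def transitions(pairs):
    """Number of rails that change between consecutive symbols."""
    return [(a[0] ^ b[0]) + (a[1] ^ b[1]) for a, b in zip(pairs, pairs[1:])]


def ledr_decode(pairs):
    """The data is read directly from the state rail."""
    return [s for s, _ in pairs]


if __name__ == "__main__":
    bits = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # the sequence shown on the slide
    pairs = ledr_encode(bits)
    print(pairs)
    print(transitions(pairs))
```

Because S xor P alternates with every bit, the receiver can recover both the data (from S) and the bit boundaries (from the parity change) without a clock.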
Transmitter – Fast SR Approach (from R. Dobkin)
Receiver Splitter (from R. Dobkin)
Self Timed Networks
Router requires priority arbitration
– Arbitration necessary at every router merge
– Potential delay at every node on the path
BUT
– Asynchronous merge/arbitration time is average not worst case
Adapters to locally clocked cells require synchronization
Synchronization necessary when clocks are unknown
– Occurs when receiving data (data valid), and when sending (acknowledge)
BUT
– Time can be long (2 cycles?)
– Must assume worst case time (maybe)
Router priority
– Virtual channels implement scheduling algorithm
– Contention for link resolved by priority circuits
[Figure: Merge, Link, Split and Flow Control blocks]
Asynchronous Arbiters
Multiway arbiters (e.g. for Xbar switches):
– Cascaded mesh (latency ~ N)
– Cascaded tree (latency ~ log N)
– Token-ring (busy ring and lazy ring) (latency ~ from 1 to N)
Priority arbiters (e.g. for routers with different QoS):
– Static priority (topological order)
– Dynamic priority (request arrives with priority code)
– Ordered (time-priority): multiway arbiter, followed by a FIFO buffer
Static Priority Arbiter
[Figure: static priority arbiter built from MUTEX elements and C-elements, with a lock register (Lock signal) and a priority module; signals r1–r3, s1–s3, R1–R3, G1–G3]
Why Synchronizer?
[Figure: a single DFF sampling DATA with CLK, with waveforms of DATA, CLK and Q showing metastability at Q]
Two DFF Synchronizer
[Figure: two DFFs in series clocked by CLK, with metastability marked between them]
Here one clock cycle is used for the metastability to resolve.
CAD support: Async design flow
Synthesis of Asynchronous link interfaces
[Figure: VME Bus Controller between the Bus (DSr, DSw, DTACK), the Device (LDS, LDTACK) and the Data Transceiver (D); timing diagram of the Read Cycle with signals DSr, LDS, LDTACK, D, DTACK]
[Figure: signal transition graph (STG) of the controller. Read-cycle transitions: DSr+, LDS+, LDTACK+, D+, DTACK+, DSr-, D-, DTACK-, LDS-, LDTACK-. Write-cycle transitions: DSw+, D+, LDS+, LDTACK+, D-, DTACK+, DSw-]
Complete State Coding (CSC)
[Figure: STG of the read cycle with the inserted state signal transitions csc+ and csc-]
Boolean equations:
LDS = D + csc
DTACK = D
D = LDTACK · csc
csc = DSr · (csc + ¬LDTACK)
[Figure: logic synthesis from the STG to an asynchronous circuit with signals DSr, LDTACK, csc, LDS, D, DTACK]
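The synthesised read-cycle controller can be exercised with a small event simulation. This is a minimal Python sketch under two loudly stated assumptions: the next-state equations are taken to be LDS = D + csc, DTACK = D, D = LDTACK·csc, csc = DSr·(csc + ¬LDTACK), and the environment is a hypothetical model in which the bus raises DSr whenever DTACK is low and the device makes LDTACK follow LDS. Concurrency is resolved by a fixed signal order, so the trace is one of several valid interleavings.

```python
SIGNALS = ["DSr", "csc", "LDS", "LDTACK", "D", "DTACK"]


def next_values(s):
    """Next-state functions of circuit and (hypothetical) environment."""
    return {
        "DSr": int(not s["DTACK"]),                        # bus model (assumption)
        "LDTACK": s["LDS"],                                # device model (assumption)
        "LDS": s["D"] | s["csc"],
        "DTACK": s["D"],
        "D": s["LDTACK"] & s["csc"],
        "csc": s["DSr"] & (s["csc"] | (1 - s["LDTACK"])),
    }


def simulate(n_events):
    """Fire one excited signal at a time, starting from the all-zero state."""
    state = {name: 0 for name in SIGNALS}
    trace = []
    for _ in range(n_events):
        nxt = next_values(state)
        excited = [n for n in SIGNALS if nxt[n] != state[n]]
        if not excited:
            break
        name = excited[0]            # fixed order picks one interleaving
        state[name] = nxt[name]
        trace.append(name + ("+" if state[name] else "-"))
    return trace


if __name__ == "__main__":
    print(" ".join(simulate(12)))
```

In this model csc+ fires between DSr+ and LDS+, and csc- between DSr- and D-, which is what distinguishes the two states that would otherwise share the same signal values.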
Conclusions on Async Links
– At nm level links will be more asynchronous, perhaps first mesochronous, to avoid global clock skew
– Delay-insensitive codes can be used to tolerate inter-wire delay variability
– Phase-encoding can be used for higher power-bit efficiency and SEU tolerance
– Acking will be mainly used for flow control (word level) and its overhead can be ‘hidden’ by using the “wagging buffer” technique
– Serial Links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
– Synthesis tools can be used to build clock-free interfaces between different links
– Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers
And finally …
ASYNC’08 and NOCs’08 …plus SLIP’08
Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend)
async.org.uk/async2008
async.org.uk/nocs2008
Submission deadlines:
– Async’08: Abstract – Oct. 8, Full paper – Oct. 15
– NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19
Extras
More slides if I have time!
Chain Network Components
From J.Bainbridge & S. Furber IEEE Micro, 2002
Reliability and latency
Asynchronous arbiters fail only if time is bounded
– Latency depends on fixed gates plus MUTEX lock time
– for 2 channels, + ln(N-1) for more
– This is likely to be small compared with flow control latency
Synchronizers fail at (fairly) predictable rates but these rates may get worse
– Latency can be 35τ now for good reliability
The synchronizer
– Clock and valid can happen very close together
– Flip Flop #1 gets caught in metastability
– We wait until it is resolved (1–2 clock periods)
[Figure: two D flip-flops (#1, #2) in series clocked by CLK2, synchronizing VALID, which accompanies DATA from the CLK1 domain]
MTBF
$$\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w \cdot f_c \cdot f_d}$$
– For a 0.18 µm process τ is 20–50 ps; T_w is similar
– Suppose the clock and data frequencies are 2 GHz
– t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
– 100 synchronizers: + 5τ
– MTBF > 1 year: + 2τ
– PVT variations: + 5–10τ . . .
Event Histogram
[Figure: “Metastability Time” histogram, effective input overlap (log scale, 1E-19 to 1E-10) versus Q to Clock time (-1.0E-08 to 0); regions marked: 100ps input variation, 10ps noise and jitter, deep meta]
Measurement: convert to log scale, the slope gives τ
Not always simple
[Figure: “Metastability Time” histogram, effective input overlap (log scale, 1E-20 to 1E-10) versus Q to Clock time (-1.0E-08 to -3.0E-09); regions marked: 10ps noise and jitter, deep meta]
More than one slope: 350ps, 120ps, 140ps
Synchronization Strategies
Avoid synchronization time (and arbitration time) by
– predicting clocks, stoppable clocks
– dedicating link paths for long periods of time
Minimize time by circuit methods
– Higher power, better τ
– Reducing apparent device variability: wide transistors
– Many parallel synchronizers increase throughput
Reduce average latency by speculation
– Reduce synchronization time, detect errors and roll back
Timing regions can have predictable relationships
Locked
– Two clocks from same source
– Linked by PLL
– One produced by dividing the other
– Some asynchronous systems
– Some GALS
Not locked together but predictable
– Two clocks same frequency, but different oscillators
– As above, same frequency ratio
Don’t synchronise when you don’t need to
– If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
– FIFO must never overflow
– Next read clock can be predicted and metastability avoided
[Figure: asynchronous FIFO with DATA in and DATA out; signals REQ IN, ACK IN, REQ OUT, ACK OUT, Write, Data Available, Read done]
Conflict Prediction
[Figure: waveforms of the Receiver Clock, Transmitter Clock and Predicted Transmitter Clock]
Synchronization problem known a cycle in advance of the Receiver clock.
We can do this thanks to the periodic nature of the clocks
Problems predicting next cycle
Difficult to predict
– Multiple source clocks
– Input output interfaces
Dynamic jitter and noise
– GALS start up clocks take several cycles to stabilise
– Crosstalk
– Power supply variations introducing noise into both data and clock
– Temperature changes alter relative delays
As a proportion of cycle time, this is likely to increase with smaller geometries
Synchronizer reliability trends
Clock rates increase. 10 GHz gives 100ps for a cycle.
– Both data and clock rates up by n down by n
Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n
Number of synchronizers goes up by N
– Die reliability down by N
Die-to-die and on-die variability increases to as much as 40%
– 40% more time needed for all synchronizers
An example
– 10 GHz clock and data rate, τ = 10 ps
– 100 synchronizers
– MTBF required 3.8 months (10^7 seconds)
– Time required 41τ, or 4.1 cycles; + 40% = 5.8 cycles
Does this matter?
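The figures in this example can be reproduced from the usual synchronizer formula. This is a minimal Python sketch, assuming MTBF = e^(t/τ) / (n · T_w · f_c · f_d) for n synchronizers and taking T_w equal to τ (both are assumptions made for illustration; the function name is hypothetical).

```python
import math


def required_settling_taus(mtbf_s, n_sync, t_w, f_c, f_d):
    """Settling time t, in units of tau, for n_sync synchronizers to reach mtbf_s.

    Solves mtbf_s = exp(t / tau) / (n_sync * t_w * f_c * f_d) for t / tau.
    """
    return math.log(mtbf_s * n_sync * t_w * f_c * f_d)


TAU = 10e-12          # 10 ps
F_CLK = 10e9          # 10 GHz clock
F_DATA = 10e9         # 10 GHz data rate

taus = required_settling_taus(1e7, 100, t_w=TAU, f_c=F_CLK, f_d=F_DATA)
cycles = taus * TAU * F_CLK           # settling time in clock cycles
cycles_with_margin = cycles * 1.4     # 40% allowance for variability

if __name__ == "__main__":
    print(f"t = {taus:.1f} tau = {cycles:.1f} cycles, "
          f"{cycles_with_margin:.1f} cycles with 40% margin")
```

Note how weakly the result depends on the targets: multiplying the required MTBF or the number of synchronizers by 100 adds only about 4.6τ.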
Power futures
– Total synchronizer area/power small, BUT
– τ is very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain
– This affects MUTEX circuits as well
[Figure: tau (ps, 0 to 250) versus Vdd (0.5 to 2)]
Power/speed tradeoffs
– Increase Vdd when synchronisation required
– Make synchronizer transistors wide to reduce variation and, to some extent, τ
– Make many synchronizer circuits, and select the consistently fastest one
– Avoid reducing synchronizer Vdd when running slow
Speculation
– Mostly, the synchronizer does not need 35τ to settle
– Only e^-10 (0.005%) need more than 10τ
– Why not go ahead anyway, and try again if more time was needed?
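The payoff of speculation can be estimated with a one-line model. This is a minimal Python sketch, assuming the probability that settling takes longer than kτ is e^(-k), and a hypothetical cost model in which a failed speculation simply pays the full conservative wait afterwards; the function names are illustrative.

```python
import math


def fraction_unsettled(k_taus):
    """Fraction of synchronizations still unresolved after k_taus time constants."""
    return math.exp(-k_taus)


def expected_wait_taus(speculative_taus, conservative_taus):
    """Average wait (in tau) when proceeding early and retrying on failure.

    Hypothetical cost model: a failed speculation pays the conservative wait on top.
    """
    p_fail = fraction_unsettled(speculative_taus)
    return speculative_taus + p_fail * conservative_taus


if __name__ == "__main__":
    print(f"{fraction_unsettled(10) * 100:.4f}% need more than 10 tau")
    print(f"average wait: {expected_wait_taus(10, 35):.4f} tau instead of 35 tau")
```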
Low latency synchronization
Data Available, or Free to write are produced early
– After one cycle?
If they prove to be in error, synchronization failed
– Only know this after two or more cycles
Read Fail or Write Fail flag is then raised and the action can be repeated.
[Figure: FIFO between the write clock domain (WRITE, Write Data, Free to write, Write Fail, Full) and the read clock domain (READ, Data Available, Read done, Read Fail, Not Empty), with a speculative synchronizer on each side]
Comments
Synchronization time will be an issue for future GALS
Latency and throughput can be affected
– Should the flit be large to reduce the effective overhead of time and power?
Some power speed trade off is possible
– Higher power synchronization can buy some performance?
Speculation is complex
– Is it worth it?