
Page 1: Low-Latency FIFO’s Using Token Rings

Low-Latency FIFO’s Using Token Rings

Tiberiu Chelcea Steven M. Nowick

Columbia University, New York, USA

Page 2: Low-Latency FIFO’s Using Token Rings

Introduction (1)

Contributions

• Two novel FIFO designs:

– Circular buffer of identical cells

– Distributed control

– Common buses

– Token passing: 2 tokens control I/O behavior

– No data movement

• Very low latency in an empty FIFO

• Still maintain high throughput

Page 3: Low-Latency FIFO’s Using Token Rings

Introduction

Two FIFO Protocols:

• Basic: simple, non-overlapped write/read to a cell

• Optimized: overlapped write/read to a cell

– more concurrency per cell

– various low-level optimizations:

• “early drive” of receiver’s data bus

• single-wire signaling, etc.

3 implementations of basic, 1 of optimized

HSpice simulations

Page 4: Low-Latency FIFO’s Using Token Rings

Related Work (1)

Related Work

• Most FIFO’s targeted to high-throughput:

– poor latency

– data movement

• One solution: modify structure to obtain lower latency [Brunvand95]

– types: folded, tree, square

– drawbacks:

• data still moved

• latency proportional to # of stages

• complex critical paths

Page 5: Low-Latency FIFO’s Using Token Rings

Related Work (2)

Low-Latency FIFO’s

Commonly implemented as circular buffers

– no data movement

1. Centralized Control [Sutherland89, Yakovlev95]

Limitations:

– complex centralized counters for head/tail positions

– overhead: delay/area (including arbiters!)

2. Distributed Control [Yakovlev89, Kishinevsky93]

Limitations:

– no overlapped put/get to same cell (unlike ours)

– significant latencies (e.g. 3-stage delay)

Two closer approaches presented later on…

Page 6: Low-Latency FIFO’s Using Token Rings

Summary (1)

Overview of the Talk

• Basic FIFO:

– Basic Protocol

– Implementation

• Optimized FIFO:

– Optimized Protocol

– Implementation

– Related Work

• Results

• Conclusions

Page 7: Low-Latency FIFO’s Using Token Rings

Basic Protocol (1)

FIFO Interface

• Interfaces to two environments:

– sender communicates on put port

– receiver communicates on get port

• FIFO allows concurrent puts (writes) and gets (reads)

[Figure: FIFO block with a put port and a get port]

Page 8: Low-Latency FIFO’s Using Token Rings

Basic Protocol (2)

FIFO Architecture

• FIFO = replicated cells + starter (circular buffer)

• put/get ports each consists of a common bus (data+control)

• Two tokens in FIFO: put token and get token– “Starter” cell places tokens in circulation

– no data movement

• When full, every cell contains data (capacity N)
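The architecture above can be sketched in software as a ring of registers plus two token positions. This is only an illustrative model, not the authors' circuit: the class name `TokenRingFifo`, the `valid` flags and the method names are assumptions made for the example. It shows the two properties the slide lists: the tokens move from cell to cell while the data stays where it was written, and a full FIFO has data in all N cells.

```python
class TokenRingFifo:
    """Ring of cells; two token indices pick the cell to write and the cell to read."""

    def __init__(self, capacity):
        self.data = [None] * capacity     # one register per cell
        self.valid = [False] * capacity   # True while a cell holds unread data
        self.put_token = 0                # cell currently holding the put token
        self.get_token = 0                # cell currently holding the get token

    def put(self, item):
        """Write into the put-token cell; return False if that cell is still full."""
        i = self.put_token
        if self.valid[i]:
            return False                  # FIFO full: the put stays pending
        self.data[i] = item
        self.valid[i] = True
        self.put_token = (i + 1) % len(self.data)   # the token moves, the data does not
        return True

    def get(self):
        """Read from the get-token cell; raise LookupError if that cell is empty."""
        i = self.get_token
        if not self.valid[i]:
            raise LookupError("FIFO empty")
        self.valid[i] = False
        self.get_token = (i + 1) % len(self.data)
        return self.data[i]
```

In hardware the put and get sides run concurrently and a blocked request simply waits; the sequential model above only captures the ordering and capacity behavior.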

[Figure: ring of four cells and a Starter cell, connected to the common put and get buses]

Page 9: Low-Latency FIFO’s Using Token Rings

Basic Protocol (3)

FIFO Simulation 1: Start

[Figure: ring of cells and Starter at start-up; labels: P, G, put_req, valid; annotations: “Put token requested”, “Get token requested”]

Page 10: Low-Latency FIFO’s Using Token Rings

Basic Protocol (4)

FIFO Simulation 2: Steady-State Operation

[Figure: ring of cells and Starter with the put token (P) and get token (G) at successive cells; labels: valid, put_req, get_req]

Page 11: Low-Latency FIFO’s Using Token Rings

Basic Protocol (4)

FIFO Simulation 2: Steady-State Operation

[Figure: ring of cells and Starter; labels: P, G, valid, put_req]

Page 12: Low-Latency FIFO’s Using Token Rings

Basic Protocol (4)

FIFO Simulation 3: Full

[Figure: ring of cells and Starter with cells marked valid; labels: P, G, get_req, “put_req: pending”; annotations: “Put token requested”, “Put request acknowledged”, “Put token not passed: next cell not ready”]

Page 13: Low-Latency FIFO’s Using Token Rings

Basic Protocol (6)

Basic Cell Protocol

[Figure: cell with ports put, get, left and right, and internal actions do_put and do_get]

Pseudo-code Program:

forever {
  ObtainPutToken
  EnqueueData
  PassPutToken
  ObtainGetToken
  DequeueData
  PassGetToken
}

CSP Program:

FIFO = *[[ right; [put put?x]; [left left]; right; [get get!x]; [left left]; ]]
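The six-action cell loop (obtain put token, enqueue, pass put token, obtain get token, dequeue, pass get token) can be tried out with a small step-based simulation. This is a sketch under stated assumptions, not the authors' model: `BasicCell`, `make_ring`, `run` and the `inbox` queue are hypothetical names, each handshake is collapsed into a single step, and a token offered to a busy neighbor is modeled as waiting in that neighbor's inbox.

```python
from collections import deque

class BasicCell:
    """One cell; `step` is its position (0..5) in the six-action loop."""
    def __init__(self):
        self.inbox = deque()   # tokens waiting on this cell's `right` port
        self.step = 0
        self.x = None          # the cell's data register

def make_ring(n):
    """Build n cells; the first cell is handed both tokens (the Starter's role)."""
    cells = [BasicCell() for _ in range(n)]
    cells[0].inbox.extend(["put token", "get token"])
    return cells

def run(cells, items, gets_wanted):
    """Step every cell until none can advance; return (dequeued, items not yet put)."""
    pending, out = deque(items), []
    progress = True
    while progress:
        progress = False
        for i, c in enumerate(cells):
            if c.step in (0, 3) and c.inbox:               # ObtainPutToken / ObtainGetToken
                c.inbox.popleft()
            elif c.step == 1 and pending:                  # EnqueueData
                c.x = pending.popleft()
            elif c.step == 4 and len(out) < gets_wanted:   # DequeueData
                out.append(c.x)
            elif c.step in (2, 5):                         # PassPutToken / PassGetToken
                cells[(i + 1) % len(cells)].inbox.append("token")
            else:
                continue                                   # blocked: wait for token or request
            c.step = (c.step + 1) % 6
            progress = True
    return out, list(pending)
```

For example, `run(make_ring(4), range(10), 0)` stops after four items have been enqueued: every cell then holds data, and the put token waits behind the cell that is still holding the get token.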

Page 14: Low-Latency FIFO’s Using Token Rings

Basic Protocol (7)

Cell’s Handshake Behavior

• Port Activity:

– put & get: passive

– right: active

– left: passive

• Channel Implementation:

– 4-phase handshaking

– bundled data: put and get

– validity scheme [Peeters96]:

• get: “middle data validity” (ack+ req-)

• put: early data validity (req+ ack+)

Page 15: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (1)

Basic Cell Implementation #1: Tangram

“Starter Cell”: can also be implemented using Tangram

Tangram program:

proc cell (put?T & get!T & right & left)
begin
  x: T
  forever do
    right; put?x; left;
    right; get!x; left;
  od
end

[Figure: handshake circuit for the cell, with components labeled ;, #, MUX and REG and ports get, put, left and right]

Page 16: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (2)

Basic Cell Implementation #2: Petrify

[Figure: signal transition graph for the basic cell, with rising and falling transitions on the req/ack signals of the right, put, left and get channels]

Page 17: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (3)

Basic Cell Implementation #3: Burst-Mode

Decomposed into several communicating BM machines:

• Put/Get Controllers: handle put/get ports

• Left Controller: passes tokens to left

• Token Distributor: controls token flow to the three controllers

[Figure: cell block diagram: Put Controller, Get Controller, Left Controller, Token Distributor and REG; external ports put, get, left and right; internal channels ptok, gtok and pass]

Page 18: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (4)

Put Controller

• Synchronizes handshakes on put and ptok channels:

– put: environmental request

– ptok: put token is in cell

• If cell has token (ptok_r+): cell does put operation

• If no token, no put: put_req+/- = partial input burst => ignored

put_req+ ptok_r+/put_ack+ ptok_a+

put_req- ptok_r-/put_ack- ptok_a-
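The two-transition specification above can be mimicked with a few lines of Python. This is a deliberately simplified sketch: the class `PutController` and its `update` method are hypothetical, signal levels stand in for transitions, and a synthesized burst-mode machine would also carry internal state. It illustrates the point made on the slide: the outputs change only when a complete input burst has arrived, so a put_req+/put_req- pulse without the token is ignored.

```python
class PutController:
    """Burst-mode sketch: outputs change only after a complete input burst."""

    def __init__(self):
        self.put_req = 0   # input from the sender's put port
        self.ptok_r = 0    # input: the put token is in this cell
        self.put_ack = 0   # output to the sender
        self.ptok_a = 0    # output to the token distributor

    def update(self, put_req=None, ptok_r=None):
        """Apply new input levels (None = unchanged) and return (put_ack, ptok_a)."""
        if put_req is not None:
            self.put_req = put_req
        if ptok_r is not None:
            self.ptok_r = ptok_r
        if self.put_req and self.ptok_r:
            self.put_ack = self.ptok_a = 1        # burst put_req+ ptok_r+ complete
        elif not self.put_req and not self.ptok_r:
            self.put_ack = self.ptok_a = 0        # burst put_req- ptok_r- complete
        return self.put_ack, self.ptok_a          # partial burst: outputs hold
```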

[Figure: cell block diagram with the Put Controller, its put port and the ptok channel]

Page 19: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (7)

Token Distributor

• Receives tokens from right channel

• Distributes tokens to Put and Get Controllers, respectively

• Passes tokens to Left Controller

[Figure: cell block diagram with the Token Distributor and its right, ptok, gtok and pass channels]

[Figure: burst-mode specification of the Token Distributor and its handshake fragments on the right, pass, ptok and gtok channels. Transition labels (input burst/output burst): right_ack+/right_req-, right_ack-/ptok_r+, ptok_a+/ptok_r-, ptok_a-/pass_r+, pass_a+/pass_r-, pass_a-/right_req+, right_ack-/gtok_r+, gtok_a+/gtok_r-, gtok_a-/pass_r+]

Page 20: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (8)

Token Distributor: Burst-Mode Implementation

• Synthesized with the MINIMALIST CAD Package [Fuhrer, Nowick et al., 99]

• Optimized for speed

[Figure: gate-level circuit of the Token Distributor; signals: pass_a, gtok_a, ptok_a, ra, pass_r, nrr, ptok_r, gtok_r, y2, y1, y0]

Page 21: Low-Latency FIFO’s Using Token Rings

Summary (2)

Overview of the Talk

• Basic FIFO:

– Basic Protocol

– Implementation

• Optimized FIFO:

– Optimized Protocol

– Implementation

– Related Work

• Results

• Conclusions

Page 22: Low-Latency FIFO’s Using Token Rings

Optimized Protocol (1)

Problems with Basic Protocol

No “Program-Level Parallelism”:

• no overlapped write/read to same cell

• large latency

• poor throughput

• two tokens “multiplexed” onto single channels

Limited Low-Level Optimizations:

• “late enable” of get data bus

• handshake overheads

• limited fine-grained concurrency

Page 23: Low-Latency FIFO’s Using Token Rings

Basic Protocol: Sequential Program

Actions strictly sequential

Latency (3 actions):

– EnqueueData

– PassPutToken

– ObtainGetToken [DequeueData]

Throughput (3 actions):

– ObtainPutToken

– EnqueueData

– PassPutToken

[Figure: sequential chain of actions in one cell: ObtainPutToken, EnqueueData, PassPutToken, ObtainGetToken, DequeueData, PassGetToken; tokens arrive from the right cell and are passed to the left cell]

Page 24: Low-Latency FIFO’s Using Token Rings

Optimized Protocol (2)

Optimized Protocol: Concurrent Program

Token passing: off critical paths

Latency: 1 action

Throughput: 2 actions

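Further low-level optimizations:

– effectively improve throughput to 1 action

[Figure: concurrent action graph for one cell: ObtainPutToken followed by EnqueueData and PassPutToken side by side, ObtainGetToken followed by DequeueData and PassGetToken side by side; tokens arrive from the right cell and are passed to the left cell]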

Page 25: Low-Latency FIFO’s Using Token Rings

Optimized Protocol (3)

Architectural Modifications

• Tokens passed on two separate channels

• One cell can hold both tokens simultaneously:

– allows overlapped writes and reads

• get token may be briefly ahead of the put token !

• No explicit “Starter” cell

[Figure: ring of four cells with separate Put and Get token channels]

Page 26: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (1)

Optimized Cell Architecture

• ObtainPutToken: receives put token

• ObtainGetToken: receives get token

• PutController: handles communication on put channel

• GetController: handles communication on get channel

• DataValid: indicates the validity of REG contents

[Figure: optimized cell block diagram: ObtainPutToken, PutController, ObtainGetToken, GetController, DataValid and REG, with signals we1, we, re1, re and the put and get ports]

Page 27: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (2)

Optimized Cell Implementation

• OPT/OGT: Burst-mode machines

• DV: uses relative timing (synthesized using Petrify)

• PC/GC: asymmetric C-elements

• Optimizations:

– “early data out” enabling

– single-wire token passing

[Figure: optimized cell circuit: OPT, OGT, PC, GC, DV, REG and asymmetric C-elements; signals we1, re1, we, re, put_req, put_data, put_ack, get_req, get_ack, get_data]

Page 28: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (3)

Enqueuing Data

• put token received on we1 = single wire

• we+ (when request & token) triggers:

– latching data

– start passing put token

– resetting OPT

ObtainPutToken burst-mode specification (input burst/output burst):

we1+/

we1-/ptok+

we+/ptok-

we-/

[Figure: optimized cell circuit, showing the ptok signal]

Page 29: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (5)

Data Valid

Asymmetric protocol:

– data valid: in active phase of put (we+)

– data invalid: in RZ phase of get (re-)

– avoids overwrite by next put

[Figure: signal transition graph for DataValid with transitions we+, valid+, we-, re+, re-, valid-]
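The asymmetric rule, set on the rising edge of the write enable and clear on the falling edge of the read enable, can be illustrated with a small edge-triggered model. This is a sketch only: the class `DataValid` here and its `set_we`/`set_re` methods are hypothetical names for the example, and the real DV block is a relative-timing circuit rather than software.

```python
class DataValid:
    """Tracks whether the cell register holds unread data."""

    def __init__(self):
        self.we = 0
        self.re = 0
        self.valid = False

    def set_we(self, level):
        if level and not self.we:      # we+ : active phase of the put
            self.valid = True
        self.we = level

    def set_re(self, level):
        if self.re and not level:      # re- : return-to-zero phase of the get
            self.valid = False
        self.re = level
```

Because `valid` falls only on re-, the register stays marked as occupied for the whole read, which is what keeps the next put from overwriting data that is still being driven onto the get bus.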

[Figure: optimized cell circuit]

Page 30: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (5)

Early Enable: Get Data Bus

Early Enable = get token in cell

Late Enable = get token + get request

• Extra slack to meet bundling constraints

[Figure: optimized cell circuit]

Page 31: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (6)

Timing Constraints 1. Pulse-Width Requirements

[Figure: optimized cell circuit]

• 2 pulse width constraints

• re and we: race between:

– state change

– environment path

• easily met

• DV synthesized using Petrify (“slowenv” option)

Page 32: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (7)

Timing Constraints 2. Bundling Constraint

[Figure: optimized cell circuit]

• Get operation: get_ack must indicate valid data

• Bundling constraint: get_data faster than get_ack+

• Moderate size FIFO’s: easy to meet

• Very large FIFO’s: padded delays on control

• “Early drive” of get_data alleviates the problem: extra slack

Page 33: Low-Latency FIFO’s Using Token Rings

Related Work (3)

Related Work: Close Approaches

• Two Designs [Yi95, Chu86]:

– use: circular arrays, common data buses, token passing

• “Word-Slice FIFO” [Yi95]:

– worse throughput for get than ours (10 gates vs. 6)

– tighter bundling constraints: uses “late read enable”

• FIFO for Packet Networks [Chu86]:

– worse throughput for put than ours (6 block delays vs. 4)

– tighter bundling constraints: uses “late read enable”

Page 34: Low-Latency FIFO’s Using Token Rings

Summary (3)

Overview of the Talk

• Basic FIFO:

– Basic protocol

– Implementation

• Optimized FIFO:

– Optimized protocol

– Implementation

– Related Work

• Results

• Conclusions

Page 35: Low-Latency FIFO’s Using Token Rings

Results (1)

Results

• HSpice simulations: 0.6 µm HP CMOS, 3.3 V, 27 °C

• Word size: 8 bits

• Buses modeled carefully:

– wire lengths, load

– attached capacitance

• Various experiments:

– FIFO capacity (4- vs. 16-place)

– environmental latency (slow vs. fast)

Page 36: Low-Latency FIFO’s Using Token Rings

Results (1)

Results: Latency

            Basic                                       Optimized
(ns)        Tangram   Petrify         Burst-Mode
                      (centralized)   (distributed)
FIFO 4-S    13.76     12.54           7.94              1.73
FIFO 4-F    13.75     12.54           7.81              1.73
FIFO 16-S   14.32     13.01           8.52              2.30
FIFO 16-F   14.13     12.90           8.41              2.29

S = slow environment, F = fast environment
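To read the latency table as ratios, the figures can be dropped into a short script. The numbers are copied from the table above; the dictionary `LATENCY_NS` and the helper `speedup` are names made up for this example, and the computed ratios are a reader's convenience rather than results reported on the slides.

```python
# Latency in ns, copied from the table; keys are the table's row and column labels.
LATENCY_NS = {
    "FIFO 4-S":  {"Tangram": 13.76, "Petrify": 12.54, "Burst-Mode": 7.94, "Optimized": 1.73},
    "FIFO 4-F":  {"Tangram": 13.75, "Petrify": 12.54, "Burst-Mode": 7.81, "Optimized": 1.73},
    "FIFO 16-S": {"Tangram": 14.32, "Petrify": 13.01, "Burst-Mode": 8.52, "Optimized": 2.30},
    "FIFO 16-F": {"Tangram": 14.13, "Petrify": 12.90, "Burst-Mode": 8.41, "Optimized": 2.29},
}

def speedup(config, baseline):
    """Ratio of a basic implementation's latency to the optimized FIFO's latency."""
    row = LATENCY_NS[config]
    return row[baseline] / row["Optimized"]

if __name__ == "__main__":
    for config in LATENCY_NS:
        ratios = {b: round(speedup(config, b), 2) for b in ("Tangram", "Petrify", "Burst-Mode")}
        print(config, ratios)
```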

Page 37: Low-Latency FIFO’s Using Token Rings

Results (2)

Results: Throughput

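              Basic                                       Optimized
(MegaOps/s)   Tangram   Petrify         Burst-Mode
                        (centralized)   (distributed)
FIFO 4-S      185       200             200               404
FIFO 4-F      185       200             204               423
FIFO 16-S     175       190             191               335
FIFO 16-F     179       195             192               359

Each FIFO has one put row and one get row. The rows above are those whose FIFO label survives; the remaining four rows, with labels missing:

              161       208             172               427
              162       216             175               454
              161       196             164               348
              162       202             167               367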

Page 38: Low-Latency FIFO’s Using Token Rings

Conclusions (1)

Conclusions

• Presented novel FIFO designs

• Two protocols: basic, optimized

– circular buffers

– common buses

– token passing

• Very low latency achieved by protocol manipulation

• Maintain high throughput

• Potential for low power: no data movement

Page 39: Low-Latency FIFO’s Using Token Rings

Basic Protocol (5)

FIFO Behavior: Empty

[Figure: ring of cells and Starter in the empty state, with Put and Get buses and the P and G tokens]

Page 40: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (5)

Get Controller

• Triggered by (i) a Get request and (ii) a Get token

• Synchronizes handshaking on Get and Gtok channels

• If no token, only get_req+ can arrive = partial input burst

• If token (gtok_r+), then get_req+ completes the input burst

get_req+ gtok_r+/get_ack+ gtok_a+

get_req- gtok_r-/get_ack- gtok_a-

[Figure: cell block diagram with the Get Controller, its get port and the gtok channel]

Page 41: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (6)

Left Controller

• Waits for a request for tokens and their availability

• Completes handshaking on both Left and Pass channels

left_req+ pass_r+/left_ack+ pass_a+

left_req- pass_r-/left_ack- pass_a-

[Figure: cell block diagram with the Left Controller, its left port and the pass channel]

Page 42: Low-Latency FIFO’s Using Token Rings

Introduction (2)

Overview of Approach

• The FIFO interfaces two environments

• Circular structure of identical cells

• Cells connected to common data and control buses

• Two tokens dictate the I/O behavior– put token selects the input cell

– get token selects the output cell

• Once enqueued, data is not moved until dequeuing. Thus the potential for low latency

Page 43: Low-Latency FIFO’s Using Token Rings

Introduction

• Distributed control

• Circular buffer of identical cells

• Common buses: all cells communicate on them

• Token passing determines the I/O behavior

• FIFO allows concurrent reads/writes

• When full, every cell contains data (capacity N)

Page 44: Low-Latency FIFO’s Using Token Rings

Basic Protocol Implementation (4)

Put Controller

• If token (ptok_r+), cell does the put operation

• If no token, no put:

– put_req+/put_req- partial input bursts => ignored

– burst-mode implementation handles this behavior

put_req+ ptok_r+/put_ack+ ptok_a+

put_req- ptok_r-/put_ack- ptok_a-

[Figure: cell block diagram with the Put Controller, its put port and the ptok channel]

Page 45: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (4)

Dequeuing Data

• get token received on re1: single wire

– no 4-phase handshaking

ObtainGetToken burst-mode specification (input burst/output burst):

re1+/

re1-/gtok+

re+/

re-/gtok-

[Figure: optimized cell circuit]

Page 46: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (4)

Dequeuing Data

• When there is a get request, generate re to:

– start passing the get token

– ack the receiver

– start resetting OGT

ObtainGetToken burst-mode specification (input burst/output burst):

re1+/

re1-/gtok+

re+/

re-/gtok-

[Figure: optimized cell circuit]

Page 47: Low-Latency FIFO’s Using Token Rings

Optimized Protocol Implementation (3)

Enqueuing Data

• When there is a put request, generate we to:

– latch data

– start passing put token

– reset OPT

ObtainPutToken burst-mode specification (input burst/output burst):

we1+/

we1-/ptok+

we+/ptok-

we-/

[Figure: optimized cell circuit, showing the ptok signal]