Argo – An area-efficient time-division-multiplexed network-on-chip for real-time multicore platforms
Jens Sparsø


05/07/2016

Jens Sparsø / RTN 2016 Keynote / DTU Compute, Technical University of Denmark

Introduction – NoC for real-time
• Hard real-time systems need worst-case execution time (WCET) guarantees:
  – Execution time of tasks
    • Time-predictable processor cores
  – Guaranteed latency and bandwidth of task-to-task communication
    • Time-predictable NoC
• Time-predictable NoC
  – End-to-end virtual circuits
    • No interference among traffic flows
    • Analysis of individual traffic flows one-by-one
  – Solution:
    • Time-division multiplexing (TDM) and static scheduling
    • Rate control and non-blocking routers + network calculus


Introduction – why another NoC
Q: Why yet another NoC after 10-15 years of NoC research?
Q: Why yet another TDM-based NoC?
  – TU/e: Æthereal, Aelite, dAelite
  – KTH: Nostrum
  – TU Vienna: TTNoC
A: Many NoC designs, both BE and GS, have very large implementations.
  – Buffers and flow control
  – Size of router + NI often approaches the size of a processor core.
[As I see it] a result of:
  – Focus on providing solutions
  – Focus on layering and encapsulation


Introduction – Argo
• TDM scheduling requires a common global notion of time.
• Chip technology calls for a Globally-Asynchronous Locally-Synchronous (GALS) or mesochronous timing architecture.
• ARGO combines TDM and GALS/mesochronous timing
  – Individually clocked processors (GALS)
  – Mesochronous network interfaces
  – Asynchronous routers
• ARGO avoids all buffering and (run-time) flow control. The TDM mechanism transfers data end-to-end, i.e., SPM to SPM (not just NI to NI as in other NoCs).
  – A novel NI microarchitecture with a very small HW implementation


Acknowledgements
• Presentation is based on joint work with:
  – Associate Professor Martin Schoeberl
  – Wolfgang Puffitsch (Post Doc. 2013-2016)
  – Evangelia Kasapaki (PhD student 2011-2015)
  – Rasmus Bo Sørensen (PhD student 2013-2016)
• Work has been partly funded by:
  – European Union’s 7th Framework Programme: Time-predictable Multi-Core Architecture for Embedded Systems (T-CREST) under grant agreement no. 288008. [2011-14]
  – “Hard Real-Time Embedded Multiprocessor Platform - RTEMP” funded by the Danish Research Council for Technology and Production Sciences under contract no. 12-127600. [2013-2016]


Outline
1. Introduction
2. Background
  – The T-CREST multi-core platform
  – Message passing, TDM and static scheduling
3. Architecture and implementation of Argo (globally synchronous)
  – Router
  – NI microarchitecture
  – Results (area)
4. Timing organization of Argo
  – Individually clocked processor cores
  – Mesochronous NIs
  – Network of asynchronous routers
5. Analysis of (clock and reset) skew tolerance
6. Generating schedules
7. Conclusion


Message passing using DMAs + virtual circuits
[Figure: three processor nodes, each with CPU, I$, D$, SPM, DMA and NI, connected through routers (R). Labels: Send(…), Receive(…).]


The T-CREST multicore platform
• All components are designed to be time-predictable
• PATMOS processors
  – Dual-issue RISC
  – Special caches (method, stack, …)
  – Private scratchpad memories (SPM)
• Argo NoC supporting message passing
  – TDM and static scheduling
  – Supports GALS timing organization
• Memory tree NoC
  – All processors towards one memory
  – TDM and static scheduling
[Figure: platform overview. Label: Proc. node.]


Communicating tasks, communicating processors, and scheduling of packets.
[Figure: task graph with tasks t1–t6 and channels c0–c10; core communication graph with the tasks mapped onto four processors (P) that communicate through the NoC over channels c2–c8.]


Communicating tasks, communicating processors, and routing of packets in the NoC
[Figure: task graph (t1–t6, c0–c10); core communication graph with four processors (P) and channels c23, c4, c5, c6 and c78; a TDM period in which packets of c23 and c4 are scheduled on the NoC.]


Synchronous router for source-routed TDM-based NoC
[Figure: router with an HPU on each input link (N, E, …, L) feeding a crossbar (X bar) that drives the output links (N, E, …, L).]
Packet:
  Header flit:  Address[15:0], Route[15:0]
  Payload flit: Data[31:0]
  Payload flit: Data[31:0]
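The packet format on this slide can be illustrated with a small sketch. The field widths (Route[15:0], Address[15:0], Data[31:0]) are taken from the slide; placing the route in the upper half of the header flit, and the helper names `make_packet` and `parse_packet`, are assumptions made only for illustration and are not taken from the Argo sources.

```python
from typing import List, Tuple

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def make_packet(route: int, address: int, payload: List[int]) -> List[int]:
    """Return a packet as a list of 32-bit flits: one header, then payload."""
    if not (0 <= route <= MASK16 and 0 <= address <= MASK16):
        raise ValueError("route and address must each fit in 16 bits")
    # Assumed layout: Route in bits 31..16, Address in bits 15..0.
    header = (route << 16) | address
    return [header] + [word & MASK32 for word in payload]


def parse_packet(flits: List[int]) -> Tuple[int, int, List[int]]:
    """Split a packet back into (route, address, payload words)."""
    header, payload = flits[0], flits[1:]
    return (header >> 16) & MASK16, header & MASK16, list(payload)


# A 3-flit packet: one header flit and two payload flits.
packet = make_packet(route=0x00E4, address=0x0100,
                     payload=[0xDEADBEEF, 0x12345678])
```

Because the route travels in the header, a router only has to inspect the first flit of a packet to set up its crossbar; the payload flits follow the same path.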


[Figure: six processors (P1–P6), each attached to a router (X); a packet moves through the network.]
Global schedule: 1) Departure time 2) Route


TDM schedule + granularity
• 3-flit packet and 3-stage pipelined router (TDM slot is 3 clock cycles)
  [Figure: TDM period = 6 slots = 18 clock cycles; slots 1–6 each carry one packet (H D D), shown for channels c23 and c4.]
• Variable-length packets and arbitrary pipeline depth of router (TDM slot is 1 clock cycle)
  [Figure: TDM period = 18 clock cycles; clock cycles 1–18 carry packets of varying length (H D D D, H D D, H D), shown for channels c23 and c4.]
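The arithmetic behind the coarse-grained case can be sketched as follows. The numbers (3-flit packets, 3 clock cycles per slot, 6 slots per period) come from the slide; the assumption that one flit is transferred per clock cycle is implied by those numbers, and the function names and the example channel owning two slots are hypothetical.

```python
from fractions import Fraction

FLITS_PER_PACKET = 3   # from the slide: header + 2 payload flits (H D D)
CYCLES_PER_SLOT = 3    # from the slide: TDM slot is 3 clock cycles
SLOTS_PER_PERIOD = 6   # from the slide: TDM period = 6 slots


def period_cycles(slots: int = SLOTS_PER_PERIOD,
                  cycles_per_slot: int = CYCLES_PER_SLOT) -> int:
    """Length of the TDM period in clock cycles."""
    return slots * cycles_per_slot


def payload_words_per_cycle(slots_owned: int) -> Fraction:
    """Payload words per clock cycle for a channel owning `slots_owned` slots."""
    payload_per_slot = FLITS_PER_PACKET - 1  # the header flit carries no data
    return Fraction(slots_owned * payload_per_slot, period_cycles())


# Hypothetical channel that owns 2 of the 6 slots in every period.
share = payload_words_per_cycle(2)
```

With fixed 3-flit packets one third of every slot is spent on the header, which is the overhead that the variable-length scheme on the same slide reduces.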


Traditional design: DMAs in processor nodes


• Data is copied (across CDC) from SPM to VC-buffer in NI = latency
• Physical buffer per virtual circuit = area
• Flow control: VC-buffer vs. DMA


• NoC transfers data:
  – from VC-buffer in sender NI
  – to VC-buffer in receiver NI
• Credit-based flow control = area


• Data is copied (across CDC) from NI to SPM = latency
• Physical buffer per virtual circuit
• Flow control: buffer vs. DMA


New design: DMAs in network interfaces
[Figure: the sending NI and the receiving NI are each labelled TDM.]


[Figure: NI microarchitecture.
  TRANSMIT: TDM counter; slot table (DMA index, route); DMA table (word count, read ptr., write ptr.) with +1, +1 and –1 updates; read from SPM; outgoing packets with header (route, dest. addr.) and payload (data), sent as flits to the router.
  RECEIVE: incoming packets with header (route, dest. addr.) and payload (data), arriving as flits from the router; address and data written to the SPM.
  Other labels: To Processor (Data memory bus), MUX, DEMUX, master (M) and slave (S) ports.]
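The transmit path suggested by the labels in this figure can be sketched in software. This is a reading of the figure under stated assumptions, not the actual hardware: the class and function names are hypothetical, the two-payload-words-per-slot default follows the 3-flit packet mentioned earlier in the talk, and the pointer and word-count updates mirror the +1, +1 and –1 annotations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DmaEntry:
    word_count: int  # words still to be sent
    read_ptr: int    # next word to read in the local SPM
    write_ptr: int   # next word to write in the remote SPM


@dataclass
class SlotEntry:
    dma_index: Optional[int]  # None marks a slot in which this NI is idle
    route: int


Packet = Tuple[Tuple[int, int], List[int]]  # ((route, dest. addr.), payload)


def transmit_slot(tdm_counter: int, slot_table: List[SlotEntry],
                  dma_table: List[DmaEntry], spm: List[int],
                  payload_words: int = 2) -> Optional[Packet]:
    """Build the packet (if any) for the current slot and update the DMA entry."""
    slot = slot_table[tdm_counter % len(slot_table)]
    if slot.dma_index is None:
        return None
    dma = dma_table[slot.dma_index]
    if dma.word_count == 0:
        return None  # nothing pending on this virtual circuit
    n = min(payload_words, dma.word_count)
    header = (slot.route, dma.write_ptr)
    payload = spm[dma.read_ptr:dma.read_ptr + n]
    dma.read_ptr += n     # "+1" per word in the figure
    dma.write_ptr += n    # "+1" per word in the figure
    dma.word_count -= n   # "-1" per word in the figure
    return header, payload


# Hypothetical setup: a 2-slot schedule and one 3-word transfer.
spm = list(range(100, 110))
slot_table = [SlotEntry(dma_index=0, route=0xE4), SlotEntry(None, 0)]
dma_table = [DmaEntry(word_count=3, read_ptr=2, write_ptr=40)]
sent = [transmit_slot(t, slot_table, dma_table, spm) for t in range(5)]
```

Note that the data is read straight from the SPM in the slot in which it is sent, so the sketch needs no per-circuit buffer and no flow control between DMA and NI, which is the point the talk makes about the new design.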


NI synthesized for Altera EP2C70 FPGA. TDM period = 3 clock cycles.


ASIC results for a 16-node bi-torus NoC
• 4 x 4 bi-torus (16 NIs and 16 routers)
• All-to-all schedule. TDM period = 23 slots = 69 clock cycles
• Total of 16 x 15 = 240 virtual circuits
• 65 nm CMOS
• 1.5 x 1.5 mm² tile size
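The figures on this slide can be checked with a few lines of arithmetic. The node count, slot count and cycles per period are from the slide; the per-circuit bandwidth estimate additionally assumes one slot per virtual circuit per period, two 32-bit payload words per slot, and a hypothetical 100 MHz clock, none of which are stated on the slide.

```python
NODES = 16
SLOTS_PER_PERIOD = 23
CYCLES_PER_SLOT = 3

# Every node has a virtual circuit to every other node.
virtual_circuits = NODES * (NODES - 1)
period_cycles = SLOTS_PER_PERIOD * CYCLES_PER_SLOT
# In an all-to-all schedule each NI injects one packet per destination.
packets_per_ni_per_period = NODES - 1


def vc_bandwidth_bits_per_s(clock_hz: float,
                            payload_bits_per_slot: int = 64,
                            slots_per_vc: int = 1) -> float:
    """Estimated payload bandwidth of one virtual circuit (assumptions above)."""
    return clock_hz * payload_bits_per_slot * slots_per_vc / period_cycles


# Hypothetical clock frequency, chosen only to make the example concrete.
example_bandwidth = vc_bandwidth_bits_per_s(100e6)
```

Each NI sends 15 packets in a 23-slot period, so the schedule length is set by link contention in the network rather than by the number of destinations alone.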



Results summary
• Relative area:

            Argo   Other TDM NoC
  Router    1      1 (synchronous), 2 (mesochronous)
  NI        1      2-4

• Other routers and NIs: ? / easily 10+


Timing organization? - mesochronous (same oscillator, some skew)
[Figure: cores, network interfaces, routers and links.]


Mesochronous router - synchronous router + bi-synchronous FIFOs
[Figure: synchronous router (HPUs and X bar) with bi-synchronous FIFOs on the links.]
Area = 1-2 x router!


Timing organization? - Globally-Asynchronous Locally-Synchronous
[Figure: cores, network interfaces, routers and links.]


1) Asynchronous routers and links
• Link: Request, Acknowledge, Data


2) Mesochronous network interfaces
• Mesochronous =
  – A single oscillator
  – Bounded skew (possibly varying)


3) Independently clocked IP cores
• Different cores have different “natural” speeds
• Frequency scaling
• Voltage scaling
[Figure: two cores.]


Clocking strategy - mesochronous (same oscillator, some skew)
[Figure: individually clocked processor cores; mesochronous NIs (labelled TDM); asynchronous links and routers.]


Timing Organization of ARGO
[Figure: six nodes, each with CPU, $, SPM and NI, attached to a network of routers (R). Labels: individually clocked processors, mesochronous NIs, asynchronous routers.]


Asynchronous router for source-routed TDM-based NoC
[Figure: router with HPUs and crossbar (X bar) between the links, with Fork and Join components; detail of a handshake latch with controller (CTL), latch enable (en, Lt), Req/Ack handshake signals and Data (Phit).]
Note: A token is a flit:
  - a word in a packet
  - a void


Understanding the NoC
[Figure: router with HPUs and crossbar (X bar).]


[Figure: three IPs, each with an NI, attached to the router network. Labels: asynchronous, mesochronous, independently clocked IPs.]


Resetting the NoC
[Figure: three IPs, each with an NI, attached to the router network. Labels: asynchronous, mesochronous, independently clocked IPs.]


Mesochronous-Asynchronous Interface
• Mesochronous NIs
  – Producer and consumer operate at TDM clk: T_clk
  – Skew in reset and clock results in phase shift (±δ) among TDM schedules in NIs. Possibly more than 1 cycle!
• Timing assumption:
  – Routers are faster than NIs: T_Handshake < T_clk
  – Producer ignores ack, consumer ignores req: no synchronization, no metastability
[Figure: NI1 (producer; clk, rst) and NI2 (consumer; clk’, rst’) connected by req/ack handshake signals; the producer’s ack and the consumer’s req are not connected (NC).]


[Figure: TDM slot sequences (1 2 3) of NI1 and NI2, showing clock skew and TDM skew between them.]


Tolerating skew – How much?
• Producer and consumer drive FIFO below its max. speed
• Speed of a FIFO depends on how full it is
• Skew tolerance = F(Structure, Lr, Lf, T_clk)
[Figure: max speed of FIFO (i.e., cycle time) as a function of FIFO stages per token/phit. Labels: Lf, Lr, Clk, Ideal = reset state, skew (more full), skew (more empty).]


Performance Analysis of Concurrent Systems
• System model
  – Timed marked graph (a subclass of Petri nets)
  – Time Separation of Events (TSE)
• Two classes of algorithms:
  – Steady state average TSE
  – Worst case TSE (possibly an initial phase followed by oscillations)
• We have used the worst-case TSE algorithm of Hulgaard et al.:
  – H. Hulgaard, S. M. Burns, T. Amon, and G. Borriello. An algorithm for exact bounds on the time separation of events in concurrent systems. IEEE Transactions on Computers, 44(11), Nov. 1995.
• Model 1: Detailed model of handshake latch implementation.
• Model 2: Coarser model of handshake latch implementation:
  – Analysis of complete 2 x 2 NoC (same results)
  – Study of possible oscillations of max TSE.


Example of coarse model
• Ring with 4 handshake latches and 2 tokens
• Time separation between successive tokens written into latch a: 4, 8, 4, 8, … average is 6
[Figure: circuit (ring of latches a, b, c, d) and the corresponding timed marked graph. Forward latency: Lf = 1. Reverse latency: Lr = 3.]
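The 4, 8, 4, 8 pattern can be reproduced with a small simulation of a timed marked graph. This is a sketch under assumptions rather than the Hulgaard et al. algorithm used in the talk: each latch is modelled as a node that fires once it has a token from its predecessor (delay Lf) and a bubble from its successor (delay Lr), and the initial marking, with the two data tokens on the adjacent edges d→a and a→b, is assumed because the figure itself is not recoverable.

```python
from functools import lru_cache

LF, LR = 1, 3  # forward and reverse latency, from the slide

# Edges as (source, target, delay, initial tokens). Assumed initial marking:
# data tokens on d->a and a->b, bubbles on the reverse edges c->b and d->c.
EDGES = [
    ("d", "a", LF, 1), ("a", "b", LF, 1), ("b", "c", LF, 0), ("c", "d", LF, 0),
    ("b", "a", LR, 0), ("c", "b", LR, 1), ("d", "c", LR, 1), ("a", "d", LR, 0),
]


@lru_cache(maxsize=None)
def fire_time(node: str, k: int) -> int:
    """Time of the k-th firing (k >= 1) of `node`."""
    t = 0
    for src, dst, delay, tokens in EDGES:
        if dst != node:
            continue
        j = k - tokens  # which firing of src produces the token needed here
        if j >= 1:
            t = max(t, fire_time(src, j) + delay)
        # j < 1: the token belongs to the initial marking, available at time 0
    return t


times_a = [fire_time("a", k) for k in range(1, 7)]
separations = [later - earlier for earlier, later in zip(times_a, times_a[1:])]
average = sum(separations[:4]) / 4
```

The average of 6 agrees with the slowest cycle in the graph: the reverse ring has a total delay of 4 x Lr = 12 and holds 2 bubbles, giving 12 / 2 = 6 time units per token. With the tokens spread evenly around the ring instead, the same model gives a constant separation of 6, so the oscillation depends on the initial marking.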


Coarse-Grained Model of 2x2 NoC
• Pipeline stage = node
• Edges annotated with Lf and Lr
• Case 1: Skew among neighbour nodes
  – d1
  – d2 = d3 = 0
• Case 2: Skew among diagonal nodes
  – d2
  – d1 = d3 = 0


Skew – Period Results

• Skew among neighboring nodes (case 1) gives the worst-case skew tolerance


Generating schedules
• Metaheuristic scheduler
  – Input: Core communication graph with bandwidth requirements
  – Output: Schedule + (clock) frequency
• Scheduler works on normalized BW requirements
  – Lowest BW requirement assigned 1 slot
  – Highest BW requirement assigned n slots (n can be large)
  – Schedules can be compressed by over-assigning BW to channels with small BW requirements. For a range of benchmarks we found that compressing to 100-slot TDM periods is possible with negligible effects (i.e., need to increase clock frequency)
• NoC is effectively drained of packets between schedule periods (except for pipeline depth in shortest NI-to-NI path).
  – Perspectives for fast reconfiguration (support for mode changes)
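The normalization step described on this slide can be sketched as follows. This is only the slot-count calculation, not the metaheuristic scheduler itself; the function name `assign_slots`, the `unit` parameter used to model compression, and the example bandwidth figures are hypothetical.

```python
import math
from typing import Dict, Optional


def assign_slots(bandwidth: Dict[str, float],
                 unit: Optional[float] = None) -> Dict[str, int]:
    """Map each channel's bandwidth requirement to a number of TDM slots.

    With the default unit, the lowest requirement gets exactly 1 slot.
    A larger unit compresses the schedule: small channels still get
    1 slot, so they are over-assigned bandwidth.
    """
    if unit is None:
        unit = min(bandwidth.values())
    return {ch: max(1, math.ceil(bw / unit)) for ch, bw in bandwidth.items()}


# Hypothetical requirements in arbitrary bandwidth units.
requirements = {"c2": 1.0, "c4": 3.0, "c6": 50.0}
normalized = assign_slots(requirements)
compressed = assign_slots(requirements, unit=5.0)
```

In the compressed case the channel with the smallest requirement receives five times the bandwidth it asked for, while the total number of slots to place drops from 54 to 12. The slot counts are per channel; the length of the final TDM period also depends on routing, since channels on disjoint links can use the same slot.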


Conclusions
• The Argo NoC combines TDM and GALS
  – Asynchronous routers
  – Mesochronous NIs
  – Independently clocked processor cores
• Significant time-elasticity provided by network of asynchronous routers.
• Small and efficient implementation
  – Avoids run-time synchronization, arbitration and buffering
  – End-to-end, SPM-to-SPM transfer of data controlled by TDM schedule.


References
• Network interface:
  – J. Sparsø, E. Kasapaki, and M. Schoeberl, “An area-efficient network interface for a TDM-based network-on-chip,” in Proc. Design, Automation and Test in Europe (DATE), 2013, pp. 1044–1047.
  – R. B. Sørensen, L. Pezzarossa, and J. Sparsø, “An area-efficient TDM NoC supporting reconfiguration for mode changes,” in Proceedings of the International Symposium on Networks-on-Chip (NOCS). IEEE, 2016.
• Asynchronous router and timing analysis:
  – E. Kasapaki and J. Sparsø, “Argo: A Time-Elastic Time-Division-Multiplexed NoC using Asynchronous Routers,” in Proc. IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC). IEEE Computer Society Press, 2014, pp. 45–52.
  – E. Kasapaki and J. Sparsø, “The Argo NoC: Combining TDM and GALS,” in Proc. European Conference on Circuit Theory and Design (ECCTD), 2015, pp. 1–4.
• Scheduler:
  – R. B. Sørensen, J. Sparsø, M. R. Pedersen, and J. Højgaard, “A Metaheuristic Scheduler for Time Division Multiplexed Networks-on-Chip,” in Proc. IEEE/IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems (SEUS), 2014, pp. 309–316.


References (cont.)
• Journal articles with a broader perspective:
  – E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. T. Müller, K. Goossens, and J. Sparsø, “Argo: A Real-Time Network-on-Chip Architecture with an Efficient GALS Implementation,” IEEE Transactions on VLSI Systems, vol. 24, no. 2, pp. 479–492, 2015.
  – M. Schoeberl, S. Abbaspour, B. Akesson, N. Audsley, R. Capasso, J. Garside, K. Goossens, S. Goossens, S. Hansen, R. Heckmann, S. Hepp, B. Huber, A. Jordan, E. Kasapaki, J. Knoop, Y. Li, D. Prokesch, W. Puffitsch, P. Puschner, A. Rocha, C. Silva, J. Sparsø, and A. Tocchi, “T-CREST: Time-predictable multi-core architecture for embedded systems,” Journal of Systems Architecture, vol. 61, no. 9, pp. 449–471, 2015.
• Open source:
  – Argo hardware: https://github.com/t-crest/argo
  – Scheduler: https://github.com/t-crest/poseidon
• Starting point for more information about the T-CREST platform:
  – http://patmos.compute.dtu.dk/


END