jieming yin * , pingqiang zhou + , sachin s. sapatnekar * and antonia zhai *

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC

for Heterogeneous Multicore Systems

Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*

* University of Minnesota, Twin Cities, USA+ ShanghaiTech University, China

28th IEEE International Parallel & Distributed Processing Symposium

2ShanghaiTech

Heterogeneous Multicore System

GPUCPU CPU GPU GPU GPU

L2 L2 MEM MEM

Interconnection Network

3

On-chip Traffic Characteristics

CPU

GPU

Traffic Pattern Switching Mechanism

ErraticRandomLatency-sensitive

StreamingDedicatedThroughput-intensive

Packet Switching

Circuit Switching

NoCs must handle different traffic differently

ShanghaiTech

4ShanghaiTech

Src node

Intm. node1

Intm. node2

Intm. node3

Dest node

Src node

Intm. node1

Intm. node2

Intm. node3

Dest node

data link traversal

router pipeline

Network delay

setup

ack

Network delay

Setup delay

data

Packet-switched Circuit-switched

link traversal

router pipeline

Packet Switching vs. Circuit SwitchingPerformance Perspective

5

Packet Switching vs. Circuit Switching

Packet-switched

Circuit-switchedCircuit-switched NoC: potentially energy efficient for certain traffic pattern

Allocation & Arbitration



Energy Perspective

ShanghaiTech

6

Packet Switching Flexible, Scalable Latency, Energy

Circuit Switching

Latency, Energy

Setup, Maintenance

Regular Erratic

Fixe

d

Frequency

Des

tinati

on

Rand

om Packet Switching

CircuitSwitching

Packet Switching

Packet Switching

Packet Switching or Circuit Switching

NoC with both packet and circuit switching?

ShanghaiTech

7

Multi-plane vs. Single-plane

CSPS

PS+CS

Multi-plane: Independent packet-switched (PS) and circuit-switched (CS) planes

Single-plane:Packet and circuit switching sharing the same communication fabric

Increasing hardware requirement Low resource utilization

How can Packet and Circuit Switching share the same fabric?

ShanghaiTech

8

SDM

A

B

C

D

4 bits

2 bits1 bits1 bits

Space-Division Multiplexing

A

B

C

D

A

B

C

D

(Space-divisionMultiplexing)

PS+CS

Physically divide a channel into sub-channels

• K. Lusala et al., IJRC 2012• S. Secchi et al., DSD 2008• A. K. Lusala, ReCoSoC 2011• M. Modarressi et al., DATE 2009SDM suffers from packet serialization problem

ShanghaiTech

9

A

B

C

D

0D

1C

2B

3B

4A

5A

6A

7A

time

A B C D

8 bits

TDM

Time-Division Multiplexing

A

B

C

D

(Time-divisionMultiplexing)

PS+CS

We propose TDM-based hybrid-switched NoC !

ShanghaiTech

10

Outline

• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion

ShanghaiTech

11

Output 1

BWRC VA SA ST

Packet-switched Pipeline

HPST

Circuit-switched Pipeline

Routing Logic

Crossbar

Input 1

Packet-switched

Circuit-switched

Slot Table

VC AllocatorSW Allocator

Output nInput n

Packet-switched

Circuit-switched

Slot Table

Hybrid-switched Router

ShanghaiTech

12

R0 R1 R2

R3R5 R4

Circuit-switched Path SetupR0 R1 R2 R3

t0t1t2t3t4t5t6t7

CS

CS

CS

CS

t0

• Set up the path before transmission• Setup messages are sent through the packet-switched network• Acknowledge the source upon successful setupKeep time-slot assignment in Slot Tables

ShanghaiTech

0000

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

1 out_4

1 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

1 out_4

1 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

0 out_4

0 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

setup 1(succeed)

in_1 → out_4slot_id = 2

duration = 2

setup 2(fail)


duration = 1

teardown 1


duration = 2

① ②

③ ④

v out v out v out v out

v out v outv out v out

Slot Table Configuration Walkthrough

13

ShanghaiTech

14

Slot Table Size

Smaller slot table• Less energy overhead• Smaller packet waiting time• Coarser-grain multiplexing

Larger slot table• More energy overhead• Longer packet waiting time• Finer-grain multiplexing

Initial (reset)

more request

more request

(reset)

Slot table

V.S.

Slot table size should be adjusted dynamically

active

inactive

ShanghaiTech

15

Circuit-Switched Path ExclusivenessSlot Table

s0s1s2s3s4s5s6s7

11011011

v out

out_3out_3(PS)

out_2out_2(PS)

out_1out_1

Crossbar

SW Allocator

• Crossbar must be configured before a circuit-switched flit’s arrival. Time slot is wasted if circuit-switched flit is not presented.

configuration signals

Exclusively occupied by

circuit-switched paths

ShanghaiTech

16ShanghaiTech

Time-slot Stealing

SW Allocator

Crossbar

v out

Dec

oderLine Address

valid

Slot Table

VC Allocator

configurationsignals

CS flit enable

From upstream routerEnable path reuse between packet- and circuit-switched data paths

17

Routing decision is made based on the utilization of slot tables in neighbor routers

Hybrid-switched Network• Path Setup– Endpoint Selection: Frequent communication pairs– Route Selection: Adaptive Routing

• Switching Decision– Referring to packet slack*

*J. Yin et al., ISLPED 2012

ShanghaiTech

18

CPU Core/GPU SM/L2 Cache/

MC

RR

Full System Evaluation Platform

G G

MEM

C L2 C L2

G G G G

M L2 C L2

MEM

MEM

MEM

C L2

G G G G

G M

C L2

G G

C M

C L2

G G

M G

C L2

• Benchmarks– CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise– GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto

ShanghaiTech

19

ammp applu art equake gafort mgrid swim wupwise AVG0.8

0.9

1

1.1

1.2Packet Hybrid+Steal

CPU

Spe

edup

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.21.26

Packet Hybrid+Steal

GPU

Spe

edup

Performance Evaluation↑ 0.3%

CPU

GPU

↑ 4.1%

GPU performance is improved

CPU performance impact is negligible

ShanghaiTech

20


0.9

1

1.1

1.2

Packet Hybrid+Steal

Net

wor

k En

ergy

Network Energy Evaluation

6.3% saving

ShanghaiTech

21

Overall – Basic Hybrid-switched NoC

AVG0.8

0.9

1

1.1

1.2

PacketHybrid+Steal

CPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal

GPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal

Net

wor

k En

ergy

CPU Speedup GPU Speedup Network Energy

0.3% CPU performance

improvement

4.1%GPU performance

improvement

6.3%Network energy

reduction

Can we do better?

ShanghaiTech

22

Outline

• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion

ShanghaiTech

23

Opportunity: Low Path Utilization

Circuit-switched paths are under

utilized

• Large number of overlapped circuit-switched paths• Circuit-switched paths are not fully utilized • Waste of on-chip resource (slot-tables)

Overlapped paths

ShanghaiTech

24ShanghaiTech

Circuit-switched Path

Hitchhiker-sharing Sources

Optimization: Path Sharing

Circuit-switched Path

Vicinity-sharing Destinations

Hitchhiker-sharing

Vicinity-sharing

Enable path reuse among circuit-switched data paths

25


0.9

1

1.1

1.21.26 1.25

Packet Hybrid+Steal Hybrid+Steal+Share

GPU

Spe

edup


0.9

1

1.1

1.2Packet Hybrid+Steal Hybrid+Steal+Share

CPU

Spe

edup

Performance Evaluation↑ 0.3% ↑ 0.2%

CPU

GPU

↑ 4.1% ↑ 3.7%

ShanghaiTech

26


0.9

1

1.1

1.2

Packet Hybrid+Steal Hybrid+Steal+Share

Net

wor

k En

ergy


Can we do EVEN better?

6.3% saving

9.0% saving

ShanghaiTech

27

Percentage of flits that are circuit-switched

Opportunity: Lower Buffer Pressure

Packet-switched

Circuit-switched

GPU benchmark

Circuit-switched flits percent (%)

Blackscholes 55.7

Hotspot 29.1

Lib 34.4

Lps 55.0

Nn 38.9

Pathfinder 49.1

Sto 18.5Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.

ShanghaiTech

28

Circuit switching some of the packets alleviates buffer pressure, facilitates more aggressive power gating.

Input 1

Packet-switched

Circuit-switched

Slot Table

Optimization: Aggressive Power-gating

Reduce dynamic and leakage power dissipation

active

inactive

ShanghaiTech

29


0.9

1

1.1

1.2

Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG

GPU

Spe

edup


0.9

1

1.1

1.2


CPU

Spe

edup

Performance Evaluation↑ 0.3% ↑ 0.2%

CPU

GPU

↑ 4.1% ↑ 3.7% ↑ 2.6%

↓ 1.6%

ShanghaiTech

30


0.800000000000001

0.900000000000001

1

1.1

1.2


Net

wor

k En

ergy


Energy saving is significant

6.3% saving

9.0% saving

17.1% saving

ShanghaiTech

31

Overall

CPU Speedup GPU Speedup Network Energy

1.6% CPU performance

degradation

2.6%GPU performance

improvement

17.1%Network energy

reduction

AVG0.700000000000001

0.800000000000001

0.900000000000001

1

1.1

1.2

PacketHybrid+Steal+Share+PG

Net

wor

k En

ergy

AVG0.8

0.9

1

1.1

1.2


GPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2


CPU

Spe

edup

ShanghaiTech

32

Conclusion

TDM-based Hybrid-switched Network TDM is an efficient way to enable on-chip resource sharing Hybrid-switched NoC handles different traffic differently

Performance Energy efficiency Scalability (in paper)

ShanghaiTech

jieming yin * , pingqiang zhou + , sachin s. sapatnekar * and antonia zhai *

Documents