jieming yin * , pingqiang zhou + , sachin s. sapatnekar * and antonia zhai *

32
Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems Jieming Yin * , Pingqiang Zhou + , Sachin S. Sapatnekar * and Antonia Zhai * * University of Minnesota, Twin Cities, USA + ShanghaiTech University, China 28 th IEEE International Parallel & Distributed Processing Symposium

Upload: zudora

Post on 11-Jan-2016

38 views

Category:

Documents


1 download

DESCRIPTION

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin * , Pingqiang Zhou + , Sachin S. Sapatnekar * and Antonia Zhai *. * University of Minnesota, Twin Cities, USA + ShanghaiTech University, China. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC

for Heterogeneous Multicore Systems

Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*

* University of Minnesota, Twin Cities, USA+ ShanghaiTech University, China

28th IEEE International Parallel & Distributed Processing Symposium

Page 2: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

2ShanghaiTech

Heterogeneous Multicore System

GPUCPU CPU GPU GPU GPU

L2 L2 MEM MEM

Interconnection Network

Page 3: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

3

On-chip Traffic Characteristics

CPU

GPU

Traffic Pattern Switching Mechanism

ErraticRandomLatency-sensitive

StreamingDedicatedThroughput-intensive

Packet Switching

Circuit Switching

NoCs must handle different traffic differently

ShanghaiTech

Page 4: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

4ShanghaiTech

Src node

Intm. node1

Intm. node2

Intm. node3

Dest node

Src node

Intm. node1

Intm. node2

Intm. node3

Dest node

data link traversal

router pipeline

Network delay

setup

ack

Network delay

Setup delay

data

Packet-switched Circuit-switched

link traversal

router pipeline

Packet Switching vs. Circuit SwitchingPerformance Perspective

Page 5: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

5

Packet Switching vs. Circuit Switching

Packet-switched

Circuit-switchedCircuit-switched NoC: potentially energy efficient for certain traffic pattern

Allocation & Arbitration

Allocation & Arbitration

Allocation & Arbitration

Energy Perspective

ShanghaiTech

Page 6: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

6

Packet Switching Flexible, Scalable Latency, Energy

Circuit Switching

Latency, Energy

Setup, Maintenance

Regular Erratic

Fixe

d

Frequency

Des

tinati

on

Rand

om Packet Switching

CircuitSwitching

Packet Switching

Packet Switching

Packet Switching or Circuit Switching

NoC with both packet and circuit switching?

ShanghaiTech

Page 7: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

7

Multi-plane vs. Single-plane

CSPS

PS+CS

Multi-plane: Independent packet-switched (PS) and circuit-switched (CS) planes

Single-plane:Packet and circuit switching sharing the same communication fabric

Increasing hardware requirement Low resource utilization

How can Packet and Circuit Switching share the same fabric?

ShanghaiTech

Page 8: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

8

SDM

A

B

C

D

4 bits

2 bits1 bits1 bits

Space-Division Multiplexing

A

B

C

D

A

B

C

D

(Space-divisionMultiplexing)

PS+CS

Physically divide a channel into sub-channels

• K. Lusala et al., IJRC 2012• S. Secchi et al., DSD 2008• A. K. Lusala, ReCoSoC 2011• M. Modarressi et al., DATE 2009SDM suffers from packet serialization problem

ShanghaiTech

Page 9: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

9

A

B

C

D

0D

1C

2B

3B

4A

5A

6A

7A

time

A B C D

8 bits

TDM

Time-Division Multiplexing

A

B

C

D

(Time-divisionMultiplexing)

PS+CS

We propose TDM-based hybrid-switched NoC !

ShanghaiTech

Page 10: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

10

Outline

• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion

ShanghaiTech

Page 11: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

11

Output 1

BWRC VA SA ST

Packet-switched Pipeline

HPST

Circuit-switched Pipeline

Routing Logic

Crossbar

Input 1

Packet-switched

Circuit-switched

Slot Table

VC AllocatorSW Allocator

Output nInput n

Packet-switched

Circuit-switched

Slot Table

Hybrid-switched Router

ShanghaiTech

Page 12: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

12

R0 R1 R2

R3R5 R4

Circuit-switched Path SetupR0 R1 R2 R3

t0t1t2t3t4t5t6t7

CS

CS

CS

CS

t0

• Set up the path before transmission• Setup messages are sent through the packet-switched network• Acknowledge the source upon successful setupKeep time-slot assignment in Slot Tables

ShanghaiTech

Page 13: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

0000

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

1 out_4

1 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

1 out_4

1 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

0

0

0 out_4

0 out_4

in_1

0

0

0

0

in_2

s0

s1

s2

s3

setup 1(succeed)

in_1 → out_4slot_id = 2

duration = 2

setup 2(fail)

in_1 → out_3slot_id = 3

duration = 1

teardown 1

in_1 → out_4slot_id = 2

duration = 2

① ②

③ ④

v out v out v out v out

v out v outv out v out

Slot Table Configuration Walkthrough

13

ShanghaiTech

Page 14: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

14

Slot Table Size

Smaller slot table• Less energy overhead• Smaller packet waiting time• Coarser-grain multiplexing

Larger slot table• More energy overhead• Longer packet waiting time• Finer-grain multiplexing

Initial (reset)

more request

more request

(reset)

Slot table

V.S.

Slot table size should be adjusted dynamically

active

inactive

ShanghaiTech

Page 15: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

15

Circuit-Switched Path ExclusivenessSlot Table

s0s1s2s3s4s5s6s7

11011011

v out

out_3out_3(PS)

out_2out_2(PS)

out_1out_1

Crossbar

SW Allocator

• Crossbar must be configured before a circuit-switched flit’s arrival. Time slot is wasted if circuit-switched flit is not presented.

configuration signals

Exclusively occupied by

circuit-switched paths

ShanghaiTech

Page 16: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

16ShanghaiTech

Time-slot Stealing

SW Allocator

Crossbar

v out

Dec

oderLine Address

valid

Slot Table

VC Allocator

configurationsignals

CS flit enable

From upstream routerEnable path reuse between packet- and circuit-switched data paths

Page 17: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

17

Routing decision is made based on the utilization of slot tables in neighbor routers

Hybrid-switched Network• Path Setup– Endpoint Selection: Frequent communication pairs– Route Selection: Adaptive Routing

• Switching Decision– Referring to packet slack*

*J. Yin et al., ISLPED 2012

ShanghaiTech

Page 18: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

18

CPU Core/GPU SM/L2 Cache/

MC

RR

Full System Evaluation Platform

G G

MEM

C L2 C L2

G G G G

M L2 C L2

MEM

MEM

MEM

C L2

G G G G

G M

C L2

G G

C M

C L2

G G

M G

C L2

• Benchmarks– CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise– GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto

ShanghaiTech

Page 19: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

19

ammp applu art equake gafort mgrid swim wupwise AVG0.8

0.9

1

1.1

1.2Packet Hybrid+Steal

CPU

Spe

edup

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.21.26

Packet Hybrid+Steal

GPU

Spe

edup

Performance Evaluation↑ 0.3%

CPU

GPU

↑ 4.1%

GPU performance is improved

CPU performance impact is negligible

ShanghaiTech

Page 20: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

20

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal

Net

wor

k En

ergy

Network Energy Evaluation

6.3% saving

ShanghaiTech

Page 21: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

21

Overall – Basic Hybrid-switched NoC

AVG0.8

0.9

1

1.1

1.2

PacketHybrid+Steal

CPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal

GPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal

Net

wor

k En

ergy

CPU Speedup GPU Speedup Network Energy

0.3% CPU performance

improvement

4.1%GPU performance

improvement

6.3%Network energy

reduction

Can we do better?

ShanghaiTech

Page 22: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

22

Outline

• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion

ShanghaiTech

Page 23: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

23

Opportunity: Low Path Utilization

Circuit-switched paths are under

utilized

• Large number of overlapped circuit-switched paths• Circuit-switched paths are not fully utilized • Waste of on-chip resource (slot-tables)

Overlapped paths

ShanghaiTech

Page 24: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

24ShanghaiTech

Circuit-switched Path

Hitchhiker-sharing Sources

Optimization: Path Sharing

Circuit-switched Path

Vicinity-sharing Destinations

Hitchhiker-sharing

Vicinity-sharing

Enable path reuse among circuit-switched data paths

Page 25: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

25

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.21.26 1.25

Packet Hybrid+Steal Hybrid+Steal+Share

GPU

Spe

edup

ammp applu art equake gafort mgrid swim wupwise AVG0.8

0.9

1

1.1

1.2Packet Hybrid+Steal Hybrid+Steal+Share

CPU

Spe

edup

Performance Evaluation↑ 0.3% ↑ 0.2%

CPU

GPU

↑ 4.1% ↑ 3.7%

ShanghaiTech

Page 26: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

26

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+Steal Hybrid+Steal+Share

Net

wor

k En

ergy

Network Energy Evaluation

Can we do EVEN better?

6.3% saving

9.0% saving

ShanghaiTech

Page 27: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

27

Percentage of flits that are circuit-switched

Opportunity: Lower Buffer Pressure

Packet-switched

Circuit-switched

GPU benchmark

Circuit-switched flits percent (%)

Blackscholes 55.7

Hotspot 29.1

Lib 34.4

Lps 55.0

Nn 38.9

Pathfinder 49.1

Sto 18.5Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.

ShanghaiTech

Page 28: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

28

Circuit switching some of the packets alleviates buffer pressure, facilitates more aggressive power gating.

Input 1

Packet-switched

Circuit-switched

Slot Table

Optimization: Aggressive Power-gating

Reduce dynamic and leakage power dissipation

active

inactive

ShanghaiTech

Page 29: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

29

Black Hotspot LIB LPS NN PF STO AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG

GPU

Spe

edup

ammp applu art equake gafort mgrid swim wupwise AVG0.8

0.9

1

1.1

1.2

Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG

CPU

Spe

edup

Performance Evaluation↑ 0.3% ↑ 0.2%

CPU

GPU

↑ 4.1% ↑ 3.7% ↑ 2.6%

↓ 1.6%

ShanghaiTech

Page 30: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

30

Black Hotspot LIB LPS NN PF STO AVG0.700000000000001

0.800000000000001

0.900000000000001

1

1.1

1.2

Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG

Net

wor

k En

ergy

Network Energy Evaluation

Energy saving is significant

6.3% saving

9.0% saving

17.1% saving

ShanghaiTech

Page 31: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

31

Overall

CPU Speedup GPU Speedup Network Energy

1.6% CPU performance

degradation

2.6%GPU performance

improvement

17.1%Network energy

reduction

AVG0.700000000000001

0.800000000000001

0.900000000000001

1

1.1

1.2

PacketHybrid+Steal+Share+PG

Net

wor

k En

ergy

AVG0.8

0.9

1

1.1

1.2

PacketHybrid+Steal+Share+PG

GPU

Spe

edup

AVG0.8

0.9

1

1.1

1.2

PacketHybrid+Steal+Share+PG

CPU

Spe

edup

ShanghaiTech

Page 32: Jieming  Yin * ,  Pingqiang  Zhou + ,  Sachin  S.  Sapatnekar *  and Antonia  Zhai *

32

Conclusion

TDM-based Hybrid-switched Network TDM is an efficient way to enable on-chip resource sharing Hybrid-switched NoC handles different traffic differently

Performance Energy efficiency Scalability (in paper)

ShanghaiTech