jieming yin * , pingqiang zhou + , sachin s. sapatnekar * and antonia zhai *
DESCRIPTION
Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin * , Pingqiang Zhou + , Sachin S. Sapatnekar * and Antonia Zhai *. * University of Minnesota, Twin Cities, USA + ShanghaiTech University, China. - PowerPoint PPT PresentationTRANSCRIPT
Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC
for Heterogeneous Multicore Systems
Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*
* University of Minnesota, Twin Cities, USA+ ShanghaiTech University, China
28th IEEE International Parallel & Distributed Processing Symposium
2ShanghaiTech
Heterogeneous Multicore System
GPUCPU CPU GPU GPU GPU
L2 L2 MEM MEM
Interconnection Network
3
On-chip Traffic Characteristics
CPU
GPU
Traffic Pattern Switching Mechanism
ErraticRandomLatency-sensitive
StreamingDedicatedThroughput-intensive
Packet Switching
Circuit Switching
NoCs must handle different traffic differently
ShanghaiTech
4ShanghaiTech
Src node
Intm. node1
Intm. node2
Intm. node3
Dest node
Src node
Intm. node1
Intm. node2
Intm. node3
Dest node
data link traversal
router pipeline
Network delay
setup
ack
Network delay
Setup delay
data
Packet-switched Circuit-switched
link traversal
router pipeline
Packet Switching vs. Circuit SwitchingPerformance Perspective
5
Packet Switching vs. Circuit Switching
Packet-switched
Circuit-switchedCircuit-switched NoC: potentially energy efficient for certain traffic pattern
Allocation & Arbitration
Allocation & Arbitration
Allocation & Arbitration
Energy Perspective
ShanghaiTech
6
Packet Switching Flexible, Scalable Latency, Energy
Circuit Switching
Latency, Energy
Setup, Maintenance
Regular Erratic
Fixe
d
Frequency
Des
tinati
on
Rand
om Packet Switching
CircuitSwitching
Packet Switching
Packet Switching
Packet Switching or Circuit Switching
NoC with both packet and circuit switching?
ShanghaiTech
7
Multi-plane vs. Single-plane
CSPS
PS+CS
Multi-plane: Independent packet-switched (PS) and circuit-switched (CS) planes
Single-plane:Packet and circuit switching sharing the same communication fabric
Increasing hardware requirement Low resource utilization
How can Packet and Circuit Switching share the same fabric?
ShanghaiTech
8
SDM
A
B
C
D
4 bits
2 bits1 bits1 bits
Space-Division Multiplexing
A
B
C
D
A
B
C
D
(Space-divisionMultiplexing)
PS+CS
Physically divide a channel into sub-channels
• K. Lusala et al., IJRC 2012• S. Secchi et al., DSD 2008• A. K. Lusala, ReCoSoC 2011• M. Modarressi et al., DATE 2009SDM suffers from packet serialization problem
ShanghaiTech
9
A
B
C
D
0D
1C
2B
3B
4A
5A
6A
7A
time
A B C D
8 bits
TDM
Time-Division Multiplexing
A
B
C
D
(Time-divisionMultiplexing)
PS+CS
We propose TDM-based hybrid-switched NoC !
ShanghaiTech
10
Outline
• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion
ShanghaiTech
11
Output 1
BWRC VA SA ST
Packet-switched Pipeline
HPST
Circuit-switched Pipeline
Routing Logic
Crossbar
Input 1
Packet-switched
Circuit-switched
Slot Table
VC AllocatorSW Allocator
Output nInput n
Packet-switched
Circuit-switched
Slot Table
Hybrid-switched Router
ShanghaiTech
12
R0 R1 R2
R3R5 R4
Circuit-switched Path SetupR0 R1 R2 R3
t0t1t2t3t4t5t6t7
CS
CS
CS
CS
t0
• Set up the path before transmission• Setup messages are sent through the packet-switched network• Acknowledge the source upon successful setupKeep time-slot assignment in Slot Tables
ShanghaiTech
0000
in_1
0
0
0
0
in_2
s0
s1
s2
s3
0
0
1 out_4
1 out_4
in_1
0
0
0
0
in_2
s0
s1
s2
s3
0
0
1 out_4
1 out_4
in_1
0
0
0
0
in_2
s0
s1
s2
s3
0
0
0 out_4
0 out_4
in_1
0
0
0
0
in_2
s0
s1
s2
s3
setup 1(succeed)
in_1 → out_4slot_id = 2
duration = 2
setup 2(fail)
in_1 → out_3slot_id = 3
duration = 1
teardown 1
in_1 → out_4slot_id = 2
duration = 2
① ②
③ ④
v out v out v out v out
v out v outv out v out
Slot Table Configuration Walkthrough
13
ShanghaiTech
14
Slot Table Size
Smaller slot table• Less energy overhead• Smaller packet waiting time• Coarser-grain multiplexing
Larger slot table• More energy overhead• Longer packet waiting time• Finer-grain multiplexing
Initial (reset)
more request
more request
(reset)
Slot table
V.S.
Slot table size should be adjusted dynamically
active
inactive
ShanghaiTech
15
Circuit-Switched Path ExclusivenessSlot Table
s0s1s2s3s4s5s6s7
11011011
v out
out_3out_3(PS)
out_2out_2(PS)
out_1out_1
Crossbar
SW Allocator
• Crossbar must be configured before a circuit-switched flit’s arrival. Time slot is wasted if circuit-switched flit is not presented.
configuration signals
Exclusively occupied by
circuit-switched paths
ShanghaiTech
16ShanghaiTech
Time-slot Stealing
SW Allocator
Crossbar
v out
Dec
oderLine Address
valid
Slot Table
VC Allocator
configurationsignals
CS flit enable
From upstream routerEnable path reuse between packet- and circuit-switched data paths
17
Routing decision is made based on the utilization of slot tables in neighbor routers
Hybrid-switched Network• Path Setup– Endpoint Selection: Frequent communication pairs– Route Selection: Adaptive Routing
• Switching Decision– Referring to packet slack*
*J. Yin et al., ISLPED 2012
ShanghaiTech
18
CPU Core/GPU SM/L2 Cache/
MC
RR
Full System Evaluation Platform
G G
MEM
C L2 C L2
G G G G
M L2 C L2
MEM
MEM
MEM
C L2
G G G G
G M
C L2
G G
C M
C L2
G G
M G
C L2
• Benchmarks– CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise– GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto
ShanghaiTech
19
ammp applu art equake gafort mgrid swim wupwise AVG0.8
0.9
1
1.1
1.2Packet Hybrid+Steal
CPU
Spe
edup
Black Hotspot LIB LPS NN PF STO AVG0.8
0.9
1
1.1
1.21.26
Packet Hybrid+Steal
GPU
Spe
edup
Performance Evaluation↑ 0.3%
CPU
GPU
↑ 4.1%
GPU performance is improved
CPU performance impact is negligible
ShanghaiTech
20
Black Hotspot LIB LPS NN PF STO AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+Steal
Net
wor
k En
ergy
Network Energy Evaluation
6.3% saving
ShanghaiTech
21
Overall – Basic Hybrid-switched NoC
AVG0.8
0.9
1
1.1
1.2
PacketHybrid+Steal
CPU
Spe
edup
AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+Steal
GPU
Spe
edup
AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+Steal
Net
wor
k En
ergy
CPU Speedup GPU Speedup Network Energy
0.3% CPU performance
improvement
4.1%GPU performance
improvement
6.3%Network energy
reduction
Can we do better?
ShanghaiTech
22
Outline
• Introduction• Design TDM-based Hybrid-switching NoC• Optimizations for Hybrid Switching• Conclusion
ShanghaiTech
23
Opportunity: Low Path Utilization
Circuit-switched paths are under
utilized
• Large number of overlapped circuit-switched paths• Circuit-switched paths are not fully utilized • Waste of on-chip resource (slot-tables)
Overlapped paths
ShanghaiTech
24ShanghaiTech
Circuit-switched Path
Hitchhiker-sharing Sources
Optimization: Path Sharing
Circuit-switched Path
Vicinity-sharing Destinations
Hitchhiker-sharing
Vicinity-sharing
Enable path reuse among circuit-switched data paths
25
Black Hotspot LIB LPS NN PF STO AVG0.8
0.9
1
1.1
1.21.26 1.25
Packet Hybrid+Steal Hybrid+Steal+Share
GPU
Spe
edup
ammp applu art equake gafort mgrid swim wupwise AVG0.8
0.9
1
1.1
1.2Packet Hybrid+Steal Hybrid+Steal+Share
CPU
Spe
edup
Performance Evaluation↑ 0.3% ↑ 0.2%
CPU
GPU
↑ 4.1% ↑ 3.7%
ShanghaiTech
26
Black Hotspot LIB LPS NN PF STO AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+Steal Hybrid+Steal+Share
Net
wor
k En
ergy
Network Energy Evaluation
Can we do EVEN better?
6.3% saving
9.0% saving
ShanghaiTech
27
Percentage of flits that are circuit-switched
Opportunity: Lower Buffer Pressure
Packet-switched
Circuit-switched
GPU benchmark
Circuit-switched flits percent (%)
Blackscholes 55.7
Hotspot 29.1
Lib 34.4
Lps 55.0
Nn 38.9
Pathfinder 49.1
Sto 18.5Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.
ShanghaiTech
28
Circuit switching some of the packets alleviates buffer pressure, facilitates more aggressive power gating.
Input 1
Packet-switched
Circuit-switched
Slot Table
Optimization: Aggressive Power-gating
Reduce dynamic and leakage power dissipation
active
inactive
ShanghaiTech
29
Black Hotspot LIB LPS NN PF STO AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG
GPU
Spe
edup
ammp applu art equake gafort mgrid swim wupwise AVG0.8
0.9
1
1.1
1.2
Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG
CPU
Spe
edup
Performance Evaluation↑ 0.3% ↑ 0.2%
CPU
GPU
↑ 4.1% ↑ 3.7% ↑ 2.6%
↓ 1.6%
ShanghaiTech
30
Black Hotspot LIB LPS NN PF STO AVG0.700000000000001
0.800000000000001
0.900000000000001
1
1.1
1.2
Packet Hybrid+StealHybrid+Steal+Share Hybrid+Steal+Share+PG
Net
wor
k En
ergy
Network Energy Evaluation
Energy saving is significant
6.3% saving
9.0% saving
17.1% saving
ShanghaiTech
31
Overall
CPU Speedup GPU Speedup Network Energy
1.6% CPU performance
degradation
2.6%GPU performance
improvement
17.1%Network energy
reduction
AVG0.700000000000001
0.800000000000001
0.900000000000001
1
1.1
1.2
PacketHybrid+Steal+Share+PG
Net
wor
k En
ergy
AVG0.8
0.9
1
1.1
1.2
PacketHybrid+Steal+Share+PG
GPU
Spe
edup
AVG0.8
0.9
1
1.1
1.2
PacketHybrid+Steal+Share+PG
CPU
Spe
edup
ShanghaiTech
32
Conclusion
TDM-based Hybrid-switched Network TDM is an efficient way to enable on-chip resource sharing Hybrid-switched NoC handles different traffic differently
Performance Energy efficiency Scalability (in paper)
ShanghaiTech