Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC
for Heterogeneous Multicore Systems
Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*
* University of Minnesota, Twin Cities, USA
+ ShanghaiTech University, China
28th IEEE International Parallel & Distributed Processing Symposium
ShanghaiTech
Heterogeneous Multicore System
[Figure: CPU and GPU cores, L2 caches, and memory (MEM) connected by an interconnection network]
On-chip Traffic Characteristics
| | Traffic Pattern | Switching Mechanism |
|---|---|---|
| CPU | Erratic, random, latency-sensitive | Packet Switching |
| GPU | Streaming, dedicated, throughput-intensive | Circuit Switching |
NoCs must handle different traffic differently
Packet Switching vs. Circuit Switching: Performance Perspective
[Figure: timing diagrams of a transfer from a source node through three intermediate nodes to a destination node. Packet-switched: the data incurs router pipeline and link traversal delay, which add up to the network delay. Circuit-switched: a setup message and its ack add a setup delay before the data incurs the network delay.]
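The trade-off in the timing diagrams can be made concrete with a small latency model. The Python sketch below is only an illustration: the cycle counts, the function names, and the assumption that setup and ack each cross the packet-switched network once are all hypothetical and do not come from the slides.

```python
def packet_latency(hops, router_cycles, link_cycles):
    """Network delay of one message when every hop pays the router pipeline and a link traversal."""
    return hops * (router_cycles + link_cycles)


def circuit_latency(hops, cs_router_cycles, link_cycles, setup_cycles, messages):
    """Average delay per message when one path setup is amortized over `messages` transfers."""
    network_delay = hops * (cs_router_cycles + link_cycles)
    return network_delay + setup_cycles / messages


HOPS = 4  # source -> three intermediate nodes -> destination, as in the diagram

# Assumed (hypothetical) costs: 4-cycle packet-switched router,
# 1-cycle circuit-switched router, 1-cycle link.
ps = packet_latency(HOPS, router_cycles=4, link_cycles=1)

# Assumed: the setup message and the ack each cross the packet-switched network once.
setup = 2 * ps

cs_single = circuit_latency(HOPS, 1, 1, setup, messages=1)
cs_reused = circuit_latency(HOPS, 1, 1, setup, messages=100)

print(ps, cs_single, cs_reused)
```

With these assumed numbers, a single transfer is slower over a freshly set up circuit than over the packet-switched network, while a path reused for 100 transfers is faster.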
Packet Switching vs. Circuit Switching: Energy Perspective
[Figure: a packet-switched path with three "Allocation & Arbitration" stages, compared with a circuit-switched path]
Circuit-switched NoC: potentially energy efficient for certain traffic patterns
Packet Switching or Circuit Switching
| | Strengths | Weaknesses |
|---|---|---|
| Packet Switching | Flexible, Scalable | Latency, Energy |
| Circuit Switching | Latency, Energy | Setup, Maintenance |
[Figure: 2x2 grid with axes Frequency (Regular, Erratic) and Destination (Fixed, Random); one quadrant is labeled Circuit Switching and the other three Packet Switching]
NoC with both packet and circuit switching?
Multi-plane vs. Single-plane
[Figure: separate CS and PS planes compared with a single PS+CS plane]
Multi-plane: Independent packet-switched (PS) and circuit-switched (CS) planes
• Increasing hardware requirement
• Low resource utilization
Single-plane: Packet and circuit switching sharing the same communication fabric
How can packet and circuit switching share the same fabric?
Space-Division Multiplexing (SDM)
[Figure: a PS+CS channel split among flows A, B, C, and D, which get sub-channels of 4 bits, 2 bits, 1 bit, and 1 bit]
Physically divide a channel into sub-channels
• K. Lusala et al., IJRC 2012
• S. Secchi et al., DSD 2008
• A. K. Lusala, ReCoSoC 2011
• M. Modarressi et al., DATE 2009
SDM suffers from a packet serialization problem
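The serialization problem can be illustrated with a short Python calculation. It uses the sub-channel widths from the figure and assumes, hypothetically, that a flit is as wide as the full 8-bit channel; `serialization_cycles` is an illustrative helper, not something named in the talk.

```python
import math

SDM_WIDTHS = {"A": 4, "B": 2, "C": 1, "D": 1}  # sub-channel widths in bits, from the figure
CHANNEL_BITS = sum(SDM_WIDTHS.values())        # the sub-channels together form the full channel
FLIT_BITS = CHANNEL_BITS                       # assumed: one flit fills the full channel width


def serialization_cycles(flit_bits, sub_channel_bits):
    """Cycles needed to push one flit through a narrower sub-channel."""
    return math.ceil(flit_bits / sub_channel_bits)


cycles = {flow: serialization_cycles(FLIT_BITS, width) for flow, width in SDM_WIDTHS.items()}
print(cycles)
```

Under this assumption a flit that would cross the undivided channel in one cycle needs 2 cycles on A's sub-channel and 8 cycles on the 1-bit sub-channels of C and D.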
Time-Division Multiplexing (TDM)
[Figure: a PS+CS channel shared in time among flows A, B, C, and D; the channel is 8 bits wide and the time axis is divided into slots 0 to 7]
| Slot | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Flow | D | C | B | B | A | A | A | A |
We propose a TDM-based hybrid-switched NoC!
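The slot assignment from the TDM figure can be written down as a small Python sketch. The schedule itself is taken from the slide; the helper names and the idea of measuring the wait until a flow's next slot are illustrative assumptions.

```python
SCHEDULE = ["D", "C", "B", "B", "A", "A", "A", "A"]  # slot index -> flow, from the figure


def owner(cycle, schedule=SCHEDULE):
    """Flow that owns the channel in a given cycle; the schedule repeats."""
    return schedule[cycle % len(schedule)]


def bandwidth_share(flow, schedule=SCHEDULE):
    """Fraction of slots assigned to a flow."""
    return schedule.count(flow) / len(schedule)


def wait_cycles(flow, cycle, schedule=SCHEDULE):
    """Cycles a flit of `flow` arriving at `cycle` waits for its next slot."""
    n = len(schedule)
    for wait in range(n):
        if schedule[(cycle + wait) % n] == flow:
            return wait
    raise ValueError(f"flow {flow!r} has no slot in the schedule")


print(owner(9), bandwidth_share("A"), wait_cycles("B", 4))
```

Each flow uses the whole channel width during its own slots, so its share of bandwidth is set by how many slots it holds rather than by a narrower sub-channel; the cost is the wait for the next owned slot.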
Outline
• Introduction
• Design TDM-based Hybrid-switching NoC
• Optimizations for Hybrid Switching
• Conclusion
Hybrid-switched Router
[Figure: router with inputs 1 to n and outputs 1 to n. Each input has a packet-switched path, a circuit-switched path, and a slot table; the other blocks are the routing logic, VC allocator, SW allocator, and crossbar.]
Packet-switched Pipeline: BWRC VA SA ST
Circuit-switched Pipeline: HPST
Circuit-switched Path Setup
[Figure: routers R0 to R5; a circuit-switched (CS) path crosses R0, R1, R2, and R3, with a CS slot marked at each of these routers on a time axis t0 to t7]
• Set up the path before transmission
• Setup messages are sent through the packet-switched network
• Acknowledge the source upon successful setup
Keep time-slot assignment in Slot Tables
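One way to picture the slot-table bookkeeping during path setup is sketched below in Python. It assumes, which the slide does not state, that a flit advances one hop per time slot, so the router at hop i reserves slot (start + i) mod 8. `slots_along_path` and `try_setup` are hypothetical names, and each router is modeled as a single list of reserved flags rather than a full per-port table.

```python
NUM_SLOTS = 8  # the time axis in the figure runs from t0 to t7


def slots_along_path(path, start_slot, num_slots=NUM_SLOTS):
    """Slot each router must reserve, assuming one hop per time slot."""
    return {router: (start_slot + hop) % num_slots for hop, router in enumerate(path)}


def try_setup(slot_tables, path, start_slot):
    """Reserve the whole path if every needed slot is free; otherwise change nothing."""
    plan = slots_along_path(path, start_slot)
    if any(slot_tables[router][slot] for router, slot in plan.items()):
        return False  # setup fails, so the source receives no acknowledgement
    for router, slot in plan.items():
        slot_tables[router][slot] = True
    return True


path = ["R0", "R1", "R2", "R3"]
slot_tables = {router: [False] * NUM_SLOTS for router in path}

first = try_setup(slot_tables, path, start_slot=0)           # reserves t0..t3 along the path
second = try_setup(slot_tables, ["R1", "R2"], start_slot=1)  # collides at R1 and R2
third = try_setup(slot_tables, ["R1", "R2"], start_slot=2)   # uses free slots

print(first, second, third)
```

In the real design the setup request travels through the packet-switched network and the reservation happens hop by hop; the all-or-nothing check here only stands in for the final success or failure seen by the source.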
Slot Table Configuration Walkthrough
Each input port (in_1, in_2) has a slot table with entries s0 to s3; each entry holds a valid bit (v) and an output port (out). The table of in_2 stays all zero throughout.
① Initial state: all entries of in_1 have v = 0.
② setup 1 (succeed): in_1 → out_4, slot_id = 2, duration = 2
③ setup 2 (fail): in_1 → out_3, slot_id = 3, duration = 1
④ teardown 1: in_1 → out_4, slot_id = 2, duration = 2
| in_1 (v out) | ① | ② | ③ | ④ |
|---|---|---|---|---|
| s0 | 0 | 0 | 0 | 0 |
| s1 | 0 | 0 | 0 | 0 |
| s2 | 0 | 1 out_4 | 1 out_4 | 0 out_4 |
| s3 | 0 | 1 out_4 | 1 out_4 | 0 out_4 |
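The walkthrough can be replayed with a minimal Python model of the slot table. This is a sketch, not the router's actual logic: the class and method names are hypothetical, the check that two inputs never drive the same output in one slot is an added assumption, and so is the wrap-around of `duration` past the last slot.

```python
from dataclasses import dataclass
from typing import Optional

NUM_SLOTS = 4  # s0..s3, as in the walkthrough


@dataclass
class Entry:
    valid: bool = False
    out: Optional[str] = None


class SlotTable:
    def __init__(self, inputs, num_slots=NUM_SLOTS):
        self.num_slots = num_slots
        self.table = {port: [Entry() for _ in range(num_slots)] for port in inputs}

    def _span(self, slot_id, duration):
        # Assumed: a reservation wraps around the end of the table.
        return [(slot_id + i) % self.num_slots for i in range(duration)]

    def setup(self, in_port, out_port, slot_id, duration):
        slots = self._span(slot_id, duration)
        for s in slots:
            if self.table[in_port][s].valid:
                return False  # this input already holds a reservation in slot s
            # Assumed extra check: no other input may target the same output in slot s.
            if any(row[s].valid and row[s].out == out_port for row in self.table.values()):
                return False
        for s in slots:
            self.table[in_port][s] = Entry(True, out_port)
        return True

    def teardown(self, in_port, out_port, slot_id, duration):
        for s in self._span(slot_id, duration):
            entry = self.table[in_port][s]
            if entry.valid and entry.out == out_port:
                entry.valid = False  # only the valid bit is cleared; out keeps its old value

    def valid_bits(self, in_port):
        return [int(entry.valid) for entry in self.table[in_port]]


table = SlotTable(["in_1", "in_2"])
setup_1 = table.setup("in_1", "out_4", slot_id=2, duration=2)
after_setup_1 = table.valid_bits("in_1")
setup_2 = table.setup("in_1", "out_3", slot_id=3, duration=1)
table.teardown("in_1", "out_4", slot_id=2, duration=2)
after_teardown = table.valid_bits("in_1")
```

Setup 2 is rejected because slot s3 of in_1 is still reserved by setup 1; after the teardown the valid bits return to zero while the stale `out_4` value stays in the entries, matching state ④.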
Slot Table Size
Smaller slot table
• Less energy overhead
• Shorter packet waiting time
• Coarser-grain multiplexing
vs.
Larger slot table
• More energy overhead
• Longer packet waiting time
• Finer-grain multiplexing
[Figure: a slot table whose active part grows from its initial (reset) size as more requests arrive and returns to the initial size on reset; legend: active, inactive]
Slot table size should be adjusted dynamically
Circuit-Switched Path Exclusiveness
[Figure: slot table with entries s0 to s7 (columns v and out, valid bits 11011011) sending configuration signals to the crossbar, next to the SW allocator; ports are labeled out_1, out_2, out_2 (PS), out_3, and out_3 (PS); reserved slots are exclusively occupied by circuit-switched paths]
• Crossbar must be configured before a circuit-switched flit's arrival.
• Time slot is wasted if a circuit-switched flit is not present.
Time-slot Stealing
[Figure: slot table (columns v and out) with decoder, line address, and valid signal; CS flit enable signal from the upstream router; VC allocator and SW allocator; configuration signals to the crossbar]
Enable path reuse between packet- and circuit-switched data paths
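A minimal Python sketch of the stealing decision follows. It assumes a simple policy: a reserved slot goes to the circuit-switched flit when the upstream router signals one, and otherwise a waiting packet-switched flit may use it. The function name, the return labels, and the example `cs_flit_enable` pattern are hypothetical; the valid bits reuse the 11011011 pattern from the path exclusiveness slide.

```python
from collections import Counter


def arbitrate(slot_valid, cs_flit_enable, ps_request, stealing=True):
    """Decide who uses the crossbar in one time slot."""
    if slot_valid and cs_flit_enable:
        return "CS"          # reserved slot, circuit-switched flit is arriving
    if slot_valid and not stealing:
        return "IDLE"        # reserved slot is wasted without stealing
    if ps_request:
        return "PS_STOLEN" if slot_valid else "PS"
    return "IDLE"


valid = [1, 1, 0, 1, 1, 0, 1, 1]           # slot table valid bits for s0..s7
cs_flit_enable = [1, 0, 0, 1, 0, 0, 0, 1]  # assumed: CS flits show up in only some reserved slots

with_stealing = Counter(
    arbitrate(v, cs, ps_request=True) for v, cs in zip(valid, cs_flit_enable)
)
without_stealing = Counter(
    arbitrate(v, cs, ps_request=True, stealing=False) for v, cs in zip(valid, cs_flit_enable)
)

print(with_stealing)
print(without_stealing)
```

In this example three reserved slots carry no circuit-switched flit; they stay idle without stealing and are used by packet-switched flits with it.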
Hybrid-switched Network
• Path Setup
  – Endpoint Selection: Frequent communication pairs
  – Route Selection: Adaptive Routing (routing decision is made based on the utilization of slot tables in neighbor routers)
• Switching Decision
  – Referring to packet slack*
*J. Yin et al., ISLPED 2012
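The two path setup policies can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: the threshold for a "frequent" pair, the node and direction names, and the rule of picking the least utilized neighbor are all hypothetical, and the slack-based switching decision is not modeled.

```python
from collections import Counter


def frequent_pairs(messages, threshold):
    """Return (source, destination) pairs seen at least `threshold` times."""
    counts = Counter(messages)
    return [pair for pair, n in counts.items() if n >= threshold]


def utilization(slot_table):
    """Fraction of reserved entries in a slot table given as a list of valid bits."""
    return sum(slot_table) / len(slot_table)


def pick_next_hop(candidates, neighbor_tables):
    """Among candidate neighbors, pick the one whose slot table is least utilized."""
    return min(candidates, key=lambda n: utilization(neighbor_tables[n]))


# Hypothetical message trace: (source, destination) per message.
messages = [("G0", "M1")] * 5 + [("C0", "L2_0")] + [("G1", "M1")] * 3
pairs = frequent_pairs(messages, threshold=3)

# Hypothetical slot table valid bits of two neighbor routers.
neighbor_tables = {"east": [1, 1, 0, 1], "north": [0, 1, 0, 0]}
next_hop = pick_next_hop(["east", "north"], neighbor_tables)

print(pairs, next_hop)
```

Preferring the neighbor with more free slots raises the chance that the setup request finds a usable slot further along the path.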
Full System Evaluation Platform
[Figure: tiled layout of the evaluated chip. Legend: a router (R) attached to a CPU Core / GPU SM / L2 Cache / MC. Tiles are labeled C, G, L2, and M, with MEM blocks attached.]
• Benchmarks
  – CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise
  – GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto
Performance Evaluation
[Figure: CPU speedup (ammp, applu, art, equake, gafort, mgrid, swim, wupwise, AVG) and GPU speedup (Black, Hotspot, LIB, LPS, NN, PF, STO, AVG) for Packet and Hybrid+Steal; y-axis 0.8 to 1.2, with one GPU bar labeled 1.26]
• CPU: ↑ 0.3%. CPU performance impact is negligible
• GPU: ↑ 4.1%. GPU performance is improved
Network Energy Evaluation
[Figure: network energy per GPU benchmark (Black, Hotspot, LIB, LPS, NN, PF, STO, AVG) for Packet and Hybrid+Steal; y-axis 0.8 to 1.2]
6.3% saving
Overall: Basic Hybrid-switched NoC
[Figure: average CPU speedup, GPU speedup, and network energy for Packet and Hybrid+Steal; y-axis 0.8 to 1.2]
• 0.3% CPU performance improvement
• 4.1% GPU performance improvement
• 6.3% network energy reduction
Can we do better?
Outline
• Introduction
• Design TDM-based Hybrid-switching NoC
• Optimizations for Hybrid Switching
• Conclusion
Opportunity: Low Path Utilization
[Figure: overlapped circuit-switched paths]
• Large number of overlapped circuit-switched paths
• Circuit-switched paths are not fully utilized
• Waste of on-chip resources (slot tables)
Circuit-switched paths are underutilized
Optimization: Path Sharing
[Figure, Hitchhiker-sharing: a circuit-switched path with hitchhiker-sharing sources]
[Figure, Vicinity-sharing: a circuit-switched path with vicinity-sharing destinations]
Enable path reuse among circuit-switched data paths
Performance Evaluation
[Figure: CPU and GPU speedup per benchmark for Packet, Hybrid+Steal, and Hybrid+Steal+Share; y-axis 0.8 to 1.2, with two GPU bars labeled 1.26 and 1.25]
| | Hybrid+Steal | Hybrid+Steal+Share |
|---|---|---|
| CPU | ↑ 0.3% | ↑ 0.2% |
| GPU | ↑ 4.1% | ↑ 3.7% |
Network Energy Evaluation
[Figure: network energy per GPU benchmark (Black, Hotspot, LIB, LPS, NN, PF, STO, AVG) for Packet, Hybrid+Steal, and Hybrid+Steal+Share; y-axis 0.8 to 1.2]
• Hybrid+Steal: 6.3% saving
• Hybrid+Steal+Share: 9.0% saving
Can we do EVEN better?
Opportunity: Lower Buffer Pressure
Percentage of flits that are circuit-switched
| GPU benchmark | Circuit-switched flits (%) |
|---|---|
| Blackscholes | 55.7 |
| Hotspot | 29.1 |
| Lib | 34.4 |
| Lps | 55.0 |
| Nn | 38.9 |
| Pathfinder | 49.1 |
| Sto | 18.5 |
Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.
Optimization: Aggressive Power-gating
Circuit-switching some of the packets alleviates buffer pressure and facilitates more aggressive power gating.
[Figure: input port 1 with its packet-switched path, circuit-switched path, and slot table; legend: active, inactive]
Reduce dynamic and leakage power dissipation
Performance Evaluation
[Figure: CPU and GPU speedup per benchmark for Packet, Hybrid+Steal, Hybrid+Steal+Share, and Hybrid+Steal+Share+PG; y-axis 0.8 to 1.2]
| | Hybrid+Steal | Hybrid+Steal+Share | Hybrid+Steal+Share+PG |
|---|---|---|---|
| CPU | ↑ 0.3% | ↑ 0.2% | ↓ 1.6% |
| GPU | ↑ 4.1% | ↑ 3.7% | ↑ 2.6% |
Network Energy Evaluation
[Figure: network energy per GPU benchmark (Black, Hotspot, LIB, LPS, NN, PF, STO, AVG) for Packet, Hybrid+Steal, Hybrid+Steal+Share, and Hybrid+Steal+Share+PG; y-axis 0.7 to 1.2]
• Hybrid+Steal: 6.3% saving
• Hybrid+Steal+Share: 9.0% saving
• Hybrid+Steal+Share+PG: 17.1% saving
Energy saving is significant
Overall
[Figure: average CPU speedup, GPU speedup, and network energy for Packet and Hybrid+Steal+Share+PG]
• 1.6% CPU performance degradation
• 2.6% GPU performance improvement
• 17.1% network energy reduction
Conclusion
TDM-based Hybrid-switched Network
• TDM is an efficient way to enable on-chip resource sharing
• Hybrid-switched NoC handles different traffic differently
• Performance
• Energy efficiency
• Scalability (in paper)