Scalable Interconnects for Reconfigurable Spatial Architectures
Yaqi Zhang, Alexander Rucker, Matthew Vilim, Raghu Prabhakar, William Hwang, Kunle Olukotun
Electrical Engineering, Stanford University
ISCA ’19: The 46th International Symposium on Computer Architecture, Phoenix, AZ
ISCA 2019 Scalable Interconnects for Reconfigurable Spatial Architectures 1/27
Spatial Accelerators
• Energy efficient
• High-throughput
• Low-latency
Examples:
• Plasticine (ISCA '17)
• Compressed-sparse CNN accelerator (ISCA '17)
• Stream-dataflow accelerator (ISCA '17)
Accelerator Characteristics
• High compute density
• High on-chip memory bandwidth
• Distributed compute and memory resources
• Streaming interface between compute and memory
• Statically mapped and scheduled compute graph
[Figure: array of compute (C), on-chip scratchpad (M), and off-chip DRAM (D) blocks.]
Key Challenges
On-chip networks play a critical role in:
• Energy efficiency (↓ data movement)
• Flexibility
• Scalability
• Compute utilization
Communication Patterns
| Architecture | Communication frequency | Communication granularity | Limited by |
| --- | --- | --- | --- |
| Processor | Infrequent | Packet | Latency |
| Spatial accelerator | Frequent | Fine-grained | Throughput |

[Figure: Multi-Processor. Compute (C) and memory (M) units connected by a memory bus, annotated "Parallelism".]
[Figure: Spatial Accelerator. Alternating rows of compute (C) and memory (M) units, annotated "Parallelism" and "Pipelining".]
[Figure: Compute → Network → On-chip Memory → Network → Compute]
Outline
Motivation
Network Design Space
Compilation Flow
Evaluation
Static Network
[Figure: grid of physical blocks (PB) connected by switches (S). Legend: R = Router, S = Switch, PB = Physical Block.]
Pros:
• Guaranteed bandwidth
Cons:
• Low link utilization
• P&R failures
Dynamic Network
[Figure: grid of physical blocks (PB) connected by routers (R). Legend: R = Router, S = Switch, PB = Physical Block.]
Pros:
• Link sharing
Cons:
• Limited bandwidth
• Deadlock
Hybrid Network: Static and Dynamic
[Figure: grid of physical blocks (PB), each connected to both a router (R) and a switch (S). Legend: R = Router, S = Switch, PB = Physical Block.]
Pros:
• Link sharing
• More bandwidth
• Guaranteed P&R
Cons:
• More area
• More static power
Spatial: A DSL for Reconfigurable Accelerators

    // Tiled inner product
    val vecA,vecB = DRAM[T](N)
    Reduce(Reg[T])( N ){ i =>
      val tileA,tileB = SRAM[T](tilesize)
      tileA load vecA(i::i+tilesize)
      tileB load vecB(i::i+tilesize)
      Reduce(Reg[T])(tilesize) { j =>
        tileA(j) * tileB(j)
      } { _ + _ }
    }

• Annotate data size N
• Calculate loop iterations
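For readers without the Spatial toolchain, the loop nest on this slide can be mirrored in plain Python. This is only a sketch: it assumes the outer `Reduce` advances `i` in steps of `tilesize` (the slide abbreviates the range), and the function names `tiled_inner_product` and `outer_iterations` are illustrative, not part of Spatial.

```python
def tiled_inner_product(vec_a, vec_b, tilesize):
    """Tiled dot product mirroring the two nested Reduce loops on the slide."""
    assert len(vec_a) == len(vec_b)
    total = 0                                   # outer accumulator (Reg[T])
    for i in range(0, len(vec_a), tilesize):    # outer loop over tiles (assumed step)
        tile_a = vec_a[i:i + tilesize]          # tileA load vecA(i::i+tilesize)
        tile_b = vec_b[i:i + tilesize]          # tileB load vecB(i::i+tilesize)
        partial = 0                             # inner accumulator (Reg[T])
        for j in range(len(tile_a)):            # inner loop over one tile
            partial += tile_a[j] * tile_b[j]
        total += partial                        # combine step { _ + _ }
    return total


def outer_iterations(n, tilesize):
    """Number of tiles the outer loop visits once the data size N is annotated."""
    return -(-n // tilesize)                    # ceiling division


if __name__ == "__main__":
    print(tiled_inner_product([1, 2, 3, 4], [1, 1, 1, 1], tilesize=2))
    print(outer_iterations(1024, 16))
```

Knowing N ahead of time is what lets the trip count of each loop be computed statically instead of being discovered at run time.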
Compilation flow: Spatial → Accelerator Compiler → Mapping → Characterization → Simulation
Accelerator Compiler
• Allocate compute and memory Virtual Blocks (VBs)
• Infer activation counts for logical links
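The slides do not spell out how activation counts are inferred. One plausible reading, sketched below as an assumption, is that a logical link fires once per iteration of every loop that encloses it, so its count is the product of those trip counts. The link names and the `links` layout are hypothetical.

```python
from math import prod


def infer_link_activations(links, trip_counts):
    """Activation count per link = product of trip counts of its enclosing loops.

    links:       {link name: [names of enclosing loops]}  (hypothetical layout)
    trip_counts: {loop name: iterations}
    """
    return {
        name: prod(trip_counts[loop] for loop in nest)
        for name, nest in links.items()
    }


if __name__ == "__main__":
    n, tilesize = 64, 16
    trip_counts = {"outer": n // tilesize, "inner": tilesize}
    links = {
        "dram_to_tileA": ["outer"],            # one tile load per outer iteration
        "tileA_to_mul": ["outer", "inner"],    # one element per inner iteration
        "result_out": [],                      # fires once for the whole program
    }
    print(infer_link_activations(links, trip_counts))
```

Links with high counts are the bandwidth-critical ones that later stages of the flow need to prioritize.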
[Figure: the tiled inner product code is compiled (⇒) into a graph of virtual blocks A, B, C, and D connected by logical links.]
Mapping
• Partition VB graph to meet hardware constraints
• Place and route the VB graph onto the network
• Allocate VCs for the dynamic network
[Figure: the VB graph (A, B, C, D) is partitioned (⇒) so that A becomes A-1 and A-2.]
[Figure: the partitioned graph (A-1, A-2, B, C, D) is placed and routed (⇒) onto a hybrid network of physical blocks (PB), routers (R), and switches (S).]
Placement and Routing
• Start with random placement
• Route all links, in order of activation count
  • Build most efficient broadcast tree
  • Guarantee static network placement, if possible
  • Else, map the link to the dynamic network
• Re-place VBs with the highest routing cost
  • Dynamic network congestion
  • Average route length
  • Maximum route length
• Repeat routing
Summary
Iteratively reduce routing cost
Map bandwidth-critical links onto the static network
[Figure: example VB graph with blocks A, B, C, and D, shown on the array before and after re-placement.]
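The iterative loop described on this slide can be sketched in a few dozen lines of Python. This is a simplified illustration, not the authors' implementation: it assumes a 2D grid, dimension-ordered (X then Y) routes, a fixed number of static links per grid edge, and a made-up `dynamic_penalty` standing in for dynamic-network congestion. Broadcast trees are omitted.

```python
import random


def xy_path(src, dst):
    """Grid edges visited by a dimension-ordered (X, then Y) route."""
    (x, y), (tx, ty) = src, dst
    edges = []
    while x != tx:
        nx = x + (1 if tx > x else -1)
        edges.append(((x, y), (nx, y)))
        x = nx
    while y != ty:
        ny = y + (1 if ty > y else -1)
        edges.append(((x, y), (x, ny)))
        y = ny
    return edges


def route(placement, links, static_capacity, dynamic_penalty):
    """Route links in descending activation count; return (total, per-VB cost)."""
    used = {}                                   # static links taken per grid edge
    vb_cost = {vb: 0 for vb in placement}
    total = 0
    for src, dst, _count in sorted(links, key=lambda link: -link[2]):
        path = xy_path(placement[src], placement[dst])
        if all(used.get(e, 0) < static_capacity for e in path):
            for e in path:                      # static route fits: reserve it
                used[e] = used.get(e, 0) + 1
            cost = len(path)
        else:                                   # else fall back to dynamic network
            cost = len(path) * dynamic_penalty
        total += cost
        vb_cost[src] += cost
        vb_cost[dst] += cost
    return total, vb_cost


def place_and_route(vbs, links, width, height, static_capacity=1,
                    dynamic_penalty=4, iterations=200, seed=0):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    placement = dict(zip(vbs, rng.sample(slots, len(vbs))))  # random start
    best, vb_cost = route(placement, links, static_capacity, dynamic_penalty)
    for _ in range(iterations):
        worst = max(vbs, key=lambda vb: vb_cost[vb])         # highest routing cost
        target = rng.choice(slots)
        trial = dict(placement)
        occupant = next((vb for vb, s in trial.items() if s == target), None)
        if occupant is not None:
            trial[occupant] = trial[worst]                   # swap the two VBs
        trial[worst] = target
        cost, trial_cost = route(trial, links, static_capacity, dynamic_penalty)
        if cost < best:                                      # keep improvements only
            placement, best, vb_cost = trial, cost, trial_cost
    return placement, best


if __name__ == "__main__":
    links = [("A", "B", 100), ("A", "C", 100), ("B", "D", 10), ("C", "D", 10)]
    print(place_and_route(["A", "B", "C", "D"], links, width=2, height=2))
```

Because links are routed in order of activation count, the busiest links claim static capacity first, which matches the slide's summary of mapping bandwidth-critical links onto the static network.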
Area and Energy Characterization
• Synthesize switch and router RTL at 28 nm, 1 GHz
• Power simulation with PrimeTime
• Decompose power into:
  • Inactive (per-cycle)
  • Active (per-bit)
[Figure: Power (W) versus Activation (%) for a switch and a router, each split into inactive and active components.]
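One way to obtain the inactive and active components from a power-versus-activation sweep is a straight-line fit, where the intercept is the inactive power and the slope is the extra power at full activation. The sketch below assumes that linear model and uses made-up sample points; `bits_per_cycle` is a hypothetical link width, not a value from the slides.

```python
def fit_power_model(activation, power_w):
    """Least-squares line: power = p_inactive + p_active_full * activation.

    activation is a fraction in [0, 1]; returns (p_inactive, p_active_full).
    """
    n = len(activation)
    mean_a = sum(activation) / n
    mean_p = sum(power_w) / n
    var = sum((a - mean_a) ** 2 for a in activation)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(activation, power_w))
    slope = cov / var
    return mean_p - slope * mean_a, slope


def energy_per_bit(p_active_full_w, bits_per_cycle, freq_hz=1e9):
    """Active energy per transmitted bit at the given clock frequency."""
    return p_active_full_w / (bits_per_cycle * freq_hz)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the talk.
    p_inactive, p_active = fit_power_model([0.0, 0.5, 1.0], [0.05, 0.10, 0.15])
    print(p_inactive, p_active, energy_per_bit(p_active, bits_per_cycle=512))
```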
Simulation
• Integrate simulator with DRAMSim and BookSim
• Track transmitted data in switches and routers
• Estimate per-app power with activity traces:

$$E_{\text{net}} = \sum_{\text{allocated}} P_{\text{inactive}}\, T_{\text{sim}} + E_{\text{flit}} \cdot \#\text{flit}$$
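The energy estimate can be evaluated directly from an activity trace. The sketch below assumes both terms are accumulated per allocated switch or router, and the record fields (`p_inactive_w`, `e_flit_j`, `flits`) are hypothetical names chosen for illustration.

```python
def network_energy(allocated, t_sim_s):
    """Sum P_inactive * T_sim + E_flit * #flit over every allocated element.

    allocated: list of dicts with keys p_inactive_w, e_flit_j, flits
    t_sim_s:   simulated run time in seconds
    """
    return sum(
        elem["p_inactive_w"] * t_sim_s + elem["e_flit_j"] * elem["flits"]
        for elem in allocated
    )


if __name__ == "__main__":
    # Illustrative values only.
    trace = [
        {"p_inactive_w": 0.05, "e_flit_j": 1e-10, "flits": 1000},  # a switch
        {"p_inactive_w": 0.01, "e_flit_j": 2e-10, "flits": 500},   # a router
    ]
    print(network_energy(trace, t_sim_s=1e-3))
```

Only allocated elements contribute, so unused switches and routers add nothing to the estimate under this model.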
Area and Energy Characterization
[Figure: area (mm²) charts.]
Takeaway
Switches take less energy to transmit data than routers
Bandwidth scales more efficiently on the static network
Benchmarks

| Category | Application |
| --- | --- |
| Linear Algebra | Dot Product, Outer Product, Black Scholes, GEMM |
| Database | TPC-H Query 6 |
| Clustering | k-Means Clustering |
| Inference | Lattice Regression, LSTM (RNN), GRU (RNN), LeNet (CNN) |
| Training | Gaussian Discriminant Analysis, Logistic Regression, Stochastic Gradient Descent |
Benchmark Resource Usage
Evaluated Design Space
• Different network configurations
  • Static: flow control, bandwidth
  • Dynamic: VC count, flit width
  • Hybrid
• Different applications
• Different architectures
  • Pipelined (high throughput)
  • Scheduled (low throughput)
Evaluated Metrics
• Performance (Perf)
• Area efficiency (1/Area)
• Performance per area (Perf/Area)
• Power efficiency (1/Power)
• Energy efficiency (Perf/Watt)
Reported values are the geomean across all applications, normalized to the worst network configuration.
Area breakdown: Compute 51.0%, On-Chip Memory 32.0%, Network 17.0%
Power breakdown: Compute 26.0%, On-Chip Memory 34.8%, Network 15.6%, DRAM 23.6%
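The reporting convention above (geomean across applications, normalized to the worst configuration) can be sketched as follows. This is one interpretation, stated as an assumption: the geomean is taken per configuration first, then every configuration is divided by the smallest geomean. The configuration names and numbers are made up.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_geomean(metric_by_config):
    """Map {config: [per-app metric, higher is better]} to geomeans / worst geomean."""
    means = {cfg: geomean(vals) for cfg, vals in metric_by_config.items()}
    worst = min(means.values())
    return {cfg: mean / worst for cfg, mean in means.items()}


if __name__ == "__main__":
    # Illustrative per-application performance numbers only.
    perf = {"dynamic": [1.0, 4.0], "static": [4.0, 16.0]}
    print(normalized_geomean(perf))
```

With this normalization the worst configuration always reports 1.0, so every other value reads as a multiplicative improvement over it.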
Hybrid Network VCs and Flit Width
[Figure: radar plots of Perf, Perf / Area, and Perf / Watt for VC count (vc2, vc4) and flit width (flit128, flit256, flit512).]
Dynamic network flit width and VC count can be decreased with no performance loss.
Static vs. Dynamic vs. Hybrid
[Figure: normalized performance per benchmark (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for the Dynamic, Hybrid (2.25x), and Static (3x) networks.]
The dynamic network performs poorly on compute-bound applications due to insufficient bandwidth.
Most Efficient Network Configurations
[Figure: normalized data movement per benchmark (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for the Hybrid (2.25x) and Static (3x) networks.]
The hybrid network reduces data movement by using a dynamic network as an escape path.
Most Efficient Network Configurations: Pipelined Architecture
[Figure: radar plot of Perf, Perf / Area, and Perf / Watt for the Hybrid and Static networks; labeled values 7.0, 6.9, and 2.3.]
A hybrid network improves energy efficiency by 1.8x with performance similar to a static network.
Performance varies up to 7x between the best and worst network configurations.
Conclusion
• Network performance correlates strongly with bandwidth for spatial accelerators
• Bandwidth scales more efficiently on a static network
• A hybrid (large static, small dynamic) network:
  • Eliminates place and route failure
  • Improves perf/watt
Thank You!
Static Network: Flow Control
Figure: Src-to-Dst path illustrating end-to-end flow control and per-hop flow control (labels: Back Pressure, Ack).
Static Network: Bandwidth
Figure: grid of physical blocks connected through switches. Legend: R = Router, S = Switch, PB = Physical Block.
We vary the number of links between switches.
Dynamic Network
Figure: router diagram (labels: Router, Flit-width).
We vary the number of Virtual Channels (VCs) and flit width.
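As a rough illustration of why flit width matters, the number of flits needed to carry one payload is the payload size divided by the flit width, rounded up. The helper name `flits_per_packet` and the 512-bit payload are assumptions made for this sketch, and header overhead is ignored.

```python
def flits_per_packet(payload_bits, flit_width_bits):
    """Number of flits needed to carry one payload, ignoring any header
    overhead (an assumption of this sketch)."""
    if payload_bits < 0 or flit_width_bits <= 0:
        raise ValueError("payload must be >= 0 and flit width must be > 0")
    # Ceiling division without floating point.
    return -(-payload_bits // flit_width_bits)

if __name__ == "__main__":
    # Hypothetical 512-bit vector payload sent over links of varying width.
    for width in (512, 256, 128):
        print(width, flits_per_packet(512, width))
```

Halving the flit width doubles the number of flits per payload, which under these assumptions doubles the cycles a link is busy serializing it.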
Static Network Bandwidth
Figure: Perf, 1 / Area, and 1 / Power for static network bandwidths x1, x2, and x3 (axis labels: 2.0, 3.4, 4.3), beside a grid of physical blocks (PB), routers (R), and switches (S).
3x static network bandwidth
Bandwidth strongly impacts accelerator performance.
Static Network Flow Control
Credit-Based vs. Per-Hop
Figure: Perf, 1 / Area, and 1 / Power for credit and per-hop flow control (axis labels: 3.3, 1.2, 2.1), with a Src-to-Dst diagram of end-to-end flow control and per-hop flow control (labels: Back Pressure, Ack).
Credit-based flow control has 3x lower performance.
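One possible intuition for a gap like this is that an end-to-end credit scheme stalls the sender whenever it holds fewer credits than the round trip needs. The toy model below is a sketch under stated assumptions, not the simulator behind the slide: `simulate_credit_flow` is a hypothetical name, the sender injects at most one token per cycle, the receiver consumes tokens on arrival, and acks take the same latency as data.

```python
from collections import deque

def simulate_credit_flow(credits, one_way_latency, cycles):
    """Count tokens delivered within `cycles` cycles under end-to-end
    credit-based flow control (toy model, one_way_latency >= 1)."""
    in_flight = deque()  # arrival cycles of tokens heading to the receiver
    acks = deque()       # arrival cycles of acks heading back to the sender
    delivered = 0
    for now in range(cycles):
        # Each ack that has arrived returns one credit to the sender.
        while acks and acks[0] <= now:
            acks.popleft()
            credits += 1
        # Each token that has arrived is consumed and acknowledged.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
            delivered += 1
            acks.append(now + one_way_latency)
        # The sender injects one token per cycle if it holds a credit.
        if credits > 0:
            credits -= 1
            in_flight.append(now + one_way_latency)
    return delivered

if __name__ == "__main__":
    for c in (1, 2, 4):
        print(c, simulate_credit_flow(c, one_way_latency=2, cycles=12))
```

In this model the sender sustains one token per cycle only when its credits cover the full round trip (twice the one-way latency); with fewer credits it idles while acks are in flight. Per-hop back pressure only has to cover a single-hop round trip, which is one reason it can need less buffering for the same throughput.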
Accelerator Model
• Pool of compute and memory resources
• Compute:
  • SIMD pipeline, or
  • Vector processor with a small instruction window
Figure: compute physical block (labels: SIMD Lanes, Input Buffers, Pipelined, Scheduled, Stages, Function Unit), memory physical block (Scratchpad Bank), and an array of Compute, Memory, and DRAM PBs.
Statically Routed Dynamic Network
• Streaming protocol requires in-order transmission
• Can’t use adaptive or oblivious routing
• Can’t drop packets
• Routes are looked up in a table at runtime
• Route to multiple outputs for efficient broadcast links
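The table lookup and multi-output routing above can be sketched as follows. This is a minimal illustration, not the hardware design from the talk: `ROUTE_TABLE`, the flow ids, and the port names are hypothetical, and the table stands in for configuration written at compile time.

```python
# Hypothetical per-router table: flow id -> output ports.
# A flow with several ports is copied to each one (broadcast).
ROUTE_TABLE = {
    0: ["east"],
    1: ["north", "east"],
}

def forward(flow_id, payload, table=ROUTE_TABLE):
    """Return the (port, payload) pairs a router would emit for one packet.

    The route depends only on the flow id, so every packet of a flow
    takes the same path.
    """
    try:
        ports = table[flow_id]
    except KeyError:
        raise KeyError(f"no static route configured for flow {flow_id}") from None
    return [(port, payload) for port in ports]

if __name__ == "__main__":
    print(forward(0, "a"))
    print(forward(1, "b"))
```

Because the path is fixed per flow, packets of a flow cannot overtake one another as long as each hop queues them first-in, first-out, which is what lets a streaming protocol rely on in-order delivery.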
Performance Scaling
Figure: normalized performance versus number of PBs (32, 64, 128) for BlackScholes, TPCHQ6, GEMM, and SGD, on the pipelined architecture (y-axis 0 to 40) and the scheduled architecture (y-axis 0 to 15). Legend: D-x0, S-x3, S-x2, S-x1, H-x3, H-x2, H-x1.
Key Design Challenges
Off-chip Memory Bandwidth
Compute Throughput
On-chip Memory Bandwidth
On-chip Network Bandwidth