complexity-effective memory access scheduling for many-core accelerator architectures
DESCRIPTION
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures. George L. Yuan, Ali Bakhoda and Tor M. Aamodt, Electrical and Computer Engineering, University of British Columbia. December 14th, 2009 (MICRO 2009). The Trend: DRAM Access Locality in Many-Core.
TRANSCRIPT
![Page 1: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/1.jpg)
George L. Yuan, Ali Bakhoda and Tor M. Aamodt
Electrical and Computer Engineering, University of British Columbia
December 14th, 2009 (MICRO 2009)
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures
![Page 2: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/2.jpg)
The Trend: DRAM Access Locality in Many-Core
Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller
[Figure: DRAM access locality (from good to bad) versus number of cores (8, 16, 32, 64), comparing pre-interconnect and post-interconnect access locality]
![Page 3: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/3.jpg)
Today’s Solution: Out-of-Order Scheduling
[Figure: DRAM with opened row A, later switching to opened row B; a request queue, ordered from oldest to youngest, holds requests to Row A and Row B]
Queue size needs to increase as the number of cores increases
Requires fully-associative logic
Circuit issues:
o Cycle time
o Area
o Power
OoO OK for Single Core, OK for Multi-Core, but for Many-Core..?
![Page 4: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/4.jpg)
Related Work
Rixner, Dally, et al.
o First-Ready First-Come First-Serve (FRFCFS)
Patents by Intel, Nvidia, etc.
Mutlu & Moscibroda
o Stall-time Fair Memory
o Parallelism-Aware Batch Scheduling
No prior work for memory access scheduling for 10,000+ threads
![Page 5: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/5.jpg)
Our Contributions
Show request stream interleaving in interconnect
First paper that considers problem of DRAM scheduling for tens of thousands of threads
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications
![Page 6: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/6.jpg)
Outline
Introduction
Background on DRAM
The Request Interleaving Problem
Hold-Grant Interconnect Arbitration
Experimental Results
Conclusion
![Page 7: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/7.jpg)
Example of many-core accelerator? GPUs
High FLOP capacity for high resolution graphics
Nvidia’s GTX285: 30 8-wide multiprocessors
10,000’s of concurrent threads
Demand on memory system extremely high
![Page 8: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/8.jpg)
Background: DRAM
[Figure: a memory controller connected to DRAM banks, each with a row decoder, memory array, row buffer, and column decoder]
Row Access: Activate a row of a DRAM bank and load it into the row buffer (slow)
Column Access: Read and write data in the row buffer (fast)
Precharge: Write row buffer data back into the row (slow)
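The three operations above can be sketched as a tiny bank model. This is only an illustration: the `Bank` class and the cycle costs are hypothetical (the activate and precharge costs borrow the tRCD and tRP values from the methodology slide, and the column cost of 1 is an assumption), and real DRAM timing has many more constraints.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative costs in cycles (assumptions, not a timing model).
ACTIVATE_COST = 12   # row access: load a row into the row buffer (slow)
PRECHARGE_COST = 13  # precharge: write the row buffer back into the row (slow)
COLUMN_COST = 1      # column access: read/write inside the row buffer (fast)


@dataclass
class Bank:
    """Hypothetical single DRAM bank that tracks which row is open."""
    open_row: Optional[int] = None

    def access(self, row: int) -> int:
        """Return the cycles spent serving one access to `row`."""
        cycles = 0
        if self.open_row != row:
            if self.open_row is not None:
                cycles += PRECHARGE_COST  # close the currently open row
            cycles += ACTIVATE_COST       # open the requested row
            self.open_row = row
        cycles += COLUMN_COST             # the access itself hits the row buffer
        return cycles


bank = Bank()
costs: List[int] = [bank.access(r) for r in (5, 5, 5, 9)]
print(costs)  # [13, 1, 1, 26]
```

Repeated accesses to the open row pay only the column cost, while the final access to a different row pays for a precharge and an activate first.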
![Page 9: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/9.jpg)
Background: DRAM Row Access Locality
Definition: Number of accesses to a row between row switches
tRC = row cycle time
tRP = row precharge time
tRCD = row activate time
[Figure: bank timing diagram showing column accesses to row A, then precharge of row A (tRP) and activation of row B (tRCD), marked as a “row switch”, followed by accesses to row B; tRC is also marked]
Row access locality → Achievable DRAM Bandwidth → Performance
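As a rough sketch of the definition above, row access locality for one bank can be estimated from the sequence of rows it is asked to access. The function name and the choice to average over all streaks are assumptions made for illustration; the slides give only the one-line definition.

```python
from typing import Sequence


def row_access_locality(rows: Sequence[str]) -> float:
    """Average number of accesses to a row between row switches (one bank)."""
    if not rows:
        return 0.0
    streaks = 1  # the first access starts the first streak
    for prev, cur in zip(rows, rows[1:]):
        if cur != prev:
            streaks += 1  # a row switch starts a new streak
    return len(rows) / streaks


grouped = ["A", "A", "A", "A", "B", "B", "B", "B"]
interleaved = ["A", "B", "A", "B", "A", "B", "A", "B"]
print(row_access_locality(grouped))      # 4.0
print(row_access_locality(interleaved))  # 1.0
```

The same eight requests give a locality of 4.0 when grouped by row and 1.0 when interleaved, which is the effect the interconnect has on the request stream.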
![Page 10: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/10.jpg)
The Request Interleaving Problem
![Page 11: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/11.jpg)
FR-FCFS vs FIFO
FR-FCFS vs FIFO: Almost 2x Speedup
[Figure: IPC (0 to 200) of FIFO and FR-FCFS for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
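To make the difference between the two schedulers concrete, here is a minimal single-bank sketch. It assumes a simplified FR-FCFS that serves the oldest queued request hitting the open row and otherwise the oldest request; the function names, the single-bank setup, and the way the queue refills are all illustrative assumptions.

```python
from collections import deque
from typing import List, Optional, Sequence


def frfcfs_order(requests: Sequence[str], queue_size: int) -> List[str]:
    """Service order under a simplified FR-FCFS with a bounded queue."""
    pending = deque(requests)
    queue: List[str] = []
    open_row: Optional[str] = None
    served: List[str] = []
    while pending or queue:
        # Refill the scheduler queue in arrival order.
        while pending and len(queue) < queue_size:
            queue.append(pending.popleft())
        # First-ready: oldest request to the open row; otherwise the oldest.
        idx = next((i for i, row in enumerate(queue) if row == open_row), 0)
        open_row = queue.pop(idx)
        served.append(open_row)
    return served


def row_switches(order: Sequence[str]) -> int:
    """Count how often consecutive accesses go to different rows."""
    return sum(1 for a, b in zip(order, order[1:]) if a != b)


arrivals = ["A", "B", "A", "B", "A", "B", "A", "B"]
fifo = list(arrivals)                       # FIFO serves in arrival order
frfcfs = frfcfs_order(arrivals, queue_size=4)
print(row_switches(fifo), row_switches(frfcfs))  # 7 1
```

In this sketch a smaller queue recovers less locality, since fewer same-row requests are visible to the scheduler at once.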
![Page 12: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/12.jpg)
Alternative Solution: Banked FIFO for Bank-level Parallelism
[Figure: a single FIFO compared with a banked FIFO that keeps a separate FIFO per DRAM bank (banks 0 to 3)]
~23% speedup over FIFO
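A banked FIFO can be sketched as one in-order queue per DRAM bank, so a request waiting on a busy bank does not block requests to other banks. The `Request` encoding, the `busy` set, and the lowest-numbered-bank-first selection are assumptions for illustration only.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple

Request = Tuple[int, str]  # (bank id, row name); a hypothetical encoding


class BankedFifo:
    """One in-order queue per DRAM bank."""

    def __init__(self, num_banks: int) -> None:
        self.queues: Dict[int, Deque[Request]] = {
            bank: deque() for bank in range(num_banks)
        }

    def push(self, req: Request) -> None:
        self.queues[req[0]].append(req)

    def pop_ready(self, busy: Set[int]) -> Optional[Request]:
        """Issue the head of the lowest-numbered bank that is not busy."""
        for bank, queue in self.queues.items():
            if queue and bank not in busy:
                return queue.popleft()
        return None


def single_fifo_pop_ready(fifo: Deque[Request], busy: Set[int]) -> Optional[Request]:
    """A plain FIFO can only issue its head, so a busy bank blocks everything."""
    if fifo and fifo[0][0] not in busy:
        return fifo.popleft()
    return None


reqs = [(0, "A"), (0, "B"), (1, "X")]

plain: Deque[Request] = deque(reqs)
banked = BankedFifo(num_banks=4)
for r in reqs:
    banked.push(r)

first_plain = single_fifo_pop_ready(plain, busy=set())
first_banked = banked.pop_ready(busy=set())
# Bank 0 is now busy switching rows.
second_plain = single_fifo_pop_ready(plain, busy={0})   # None: head is blocked
second_banked = banked.pop_ready(busy={0})              # (1, "X")
```

Within each bank the order is still strictly first-come first-serve, so no associative search over the queue is needed.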
![Page 13: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/13.jpg)
Our Solution
Hold grant interconnection arbitration policies
“Hold Grant” (HG): Previously granted input has highest priority
“Row-Matching Hold Grant” (RMHG): Previously granted input has highest priority if requested row matches previously requested row
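The two policies can be compared with a small arbiter sketch for a single router output port. Everything beyond the two priority rules is an assumption: one request is granted per step, round-robin is used whenever the hold rule does not apply, and the policy names are passed as plain strings.

```python
from collections import deque
from typing import Deque, List, Optional


def arbitrate(inputs: List[Deque[str]], policy: str) -> List[str]:
    """Drain all input queues through one output; return the row order seen."""
    out: List[str] = []
    last_input: Optional[int] = None
    last_row: Optional[str] = None
    rr_next = 0  # input that round-robin favours next
    n = len(inputs)
    while any(inputs):
        choice: Optional[int] = None
        if last_input is not None and inputs[last_input]:
            if policy == "HG":
                # Hold Grant: previously granted input keeps priority.
                choice = last_input
            elif policy == "RMHG" and inputs[last_input][0] == last_row:
                # Row-Matching Hold Grant: hold only if the row matches.
                choice = last_input
        if choice is None:
            # Fall back to round-robin among inputs with a pending request.
            for k in range(n):
                cand = (rr_next + k) % n
                if inputs[cand]:
                    choice = cand
                    break
        assert choice is not None  # guaranteed because some input is non-empty
        last_row = inputs[choice].popleft()
        last_input = choice
        rr_next = (choice + 1) % n
        out.append(last_row)
    return out


def make_inputs() -> List[Deque[str]]:
    """Two cores, each sending a stream with good row locality."""
    return [deque(["A", "A", "B", "B"]), deque(["X", "X", "Y", "Y"])]


print(arbitrate(make_inputs(), "RR"))    # ['A', 'X', 'A', 'X', 'B', 'Y', 'B', 'Y']
print(arbitrate(make_inputs(), "HG"))    # ['A', 'A', 'B', 'B', 'X', 'X', 'Y', 'Y']
print(arbitrate(make_inputs(), "RMHG"))  # ['A', 'A', 'X', 'X', 'B', 'B', 'Y', 'Y']
```

Round-robin interleaves the two streams, while both hold-grant variants keep same-row requests adjacent. RMHG releases the grant as soon as the row changes, which lets the other input in sooner than plain HG does.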
![Page 14: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/14.jpg)
Interconnect Arbitration Policy: Round-Robin
[Figure: a router with N, S, E, and W ports arbitrating requests to rows A, B, C, X, and Y destined for Memory Controller 0 and Memory Controller 1]
![Page 15: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/15.jpg)
Interconnect Arbitration Policy: HG
[Figure: a router with N, S, E, and W ports arbitrating requests to rows A, B, C, X, and Y destined for Memory Controller 0 and Memory Controller 1]
![Page 16: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/16.jpg)
Interconnect Arbitration Policy: RMHG
[Figure: a router with N, S, E, and W ports arbitrating requests to rows A, B, C, X, and Y destined for Memory Controller 0 and Memory Controller 1]
![Page 17: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/17.jpg)
Complexity Comparison

| Scheme | Complexity |
| --- | --- |
| FRFCFS | 3584 bits compared |
| BFIFO+HG (XBAR) | 224 bits stored and compared |
| BFIFO+RMHG (XBAR) | 608 bits stored, 320 bits compared |
| BFIFO+HMHG4 (XBAR) | 320 bits stored, 320 bits compared |

For 32-entry queues: 15x reduction in bit comparisons, reduction from 32-way associative to direct mapped
![Page 18: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/18.jpg)
Methodology: Microarchitecture Parameters

| Parameter | Value |
| --- | --- |
| Shader cores | 28 |
| Threads per shader core | 1024 |
| Maximum supported in-flight requests per shader core | 64 |
| Number of DRAM controllers | 8 |
| DRAM controller scheduler | FIFO, Banked FIFO, First-Ready First-Come First-Serve (FRFCFS) |
| GDDR3 memory timing | tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8 |
| Topologies swept | Crossbar, Mesh, Ring |
| Queue sizes swept | 8, 16, 32, 64 |
| Number of virtual channels swept | 1, 2, 4 |
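A few quantities follow directly from the parameters above. The arithmetic below is a sketch: the variable names are made up, and the per-controller figure assumes in-flight requests spread evenly across controllers, which real workloads need not do.

```python
SHADER_CORES = 28
THREADS_PER_CORE = 1024
IN_FLIGHT_PER_CORE = 64
DRAM_CONTROLLERS = 8
T_RP, T_RC, T_RAS = 13, 34, 21  # GDDR3 timing values from the table

# Total hardware threads: the "tens of thousands of threads" regime.
total_threads = SHADER_CORES * THREADS_PER_CORE        # 28672

# Upper bound on memory requests in flight across the whole chip.
max_in_flight = SHADER_CORES * IN_FLIGHT_PER_CORE      # 1792

# Per-controller share, assuming an even spread (an assumption).
per_controller = max_in_flight / DRAM_CONTROLLERS      # 224.0

# The listed timings agree with the usual relation tRC = tRAS + tRP.
timings_consistent = (T_RC == T_RAS + T_RP)            # True
```

Under that even-spread assumption, each controller could face several times more outstanding requests than even the largest swept queue size of 64 can hold.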
![Page 19: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/19.jpg)
Methodology: Simulator
GPGPU-Sim: A massively multithreaded architecture performance simulator (www.gpgpu-sim.org)
Supports NVIDIA’s Compute Unified Device Architecture (CUDA) framework
Simulates Parallel Thread Execution (PTX) instructions
![Page 20: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/20.jpg)
Results – IPC Normalized to FR-FCFS
Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
BFIFO: 14% speedup over regular FIFO
BFIFO+HG: 18% speedup over BFIFO, achieving 91% of FRFCFS performance
[Figure: IPC normalized to FR-FCFS (0% to 100%) for FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4, and FR-FCFS across benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
![Page 21: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/21.jpg)
Row Streak Breakers
[Figure: a memory controller queue, ordered from oldest to youngest, feeding DRAM and holding requests from Core 1 and Core 2 to rows A, B, and C; labels mark a “row streak”, the row streak breakers, and a stranded request]
![Page 22: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/22.jpg)
Row Streak Breakers
[Figure: row streak breaker classification (0% to 100%), split into Same Core and Different Core and annotated “good” and “bad”, for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp]
B = banked FIFO; H = banked FIFO + Hold Grant
Arithmetic mean reduction: 73%
Harmonic mean reduction: 96%
![Page 23: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/23.jpg)
Conclusion
Show request stream interleaving in interconnect
o Effect gets worse as number of cores increases
First paper that considers problem of DRAM scheduling for tens of thousands of threads
o No prior work on memory scheduling for many-core
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
o Should allow for faster clock speeds, power/area savings
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications
![Page 24: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/24.jpg)
Future Work
Improve upon our memory scheduler design
Evaluate performance of graphics applications
Design a hold-grant scheme that works in conjunction with multiple virtual channel deadlock avoidance schemes for torus networks
Synthesize, layout, and use SPICE to determine actual power/area overheads, cycle time
![Page 25: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/25.jpg)
Thank you
![Page 26: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/26.jpg)
Methodology: Microarchitecture Parameters
![Page 27: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/27.jpg)
FR-FCFS vs FIFO
Need out-of-order scheduling inside DRAM controller to improve row access locality of requests to DRAM chips
FIFO vs FRFCFS: 46.8% Slowdown
[Figure: speedup (0% to 100%) for XBAR, MESH, and RING across benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
George Yuan (Supervisor: Dr. Tor Aamodt), University of British Columbia
![Page 28: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/28.jpg)
Varying Topology
Ring networks require multiple virtual channels for deadlock avoidance
Multiple virtual channels = path diversity
Path diversity => requests arrive out of order = interleaving
![Page 29: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/29.jpg)
Multiple Virtual Channels: Dynamic Virtual Channel Allocation
[Figure: routers with virtual channels VC0 and VC1 between a source and a destination, with congestion on the path; requests to Row A, Row A, Row B, and Row X are in flight]
![Page 30: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/30.jpg)
Multiple Virtual Channels: Static Virtual Channel Allocation
[Figure: routers with virtual channels VC0 and VC1 between a source and a destination, with congestion on the path; requests to Row A, Row A, Row B, and Row X are in flight]
![Page 31: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/31.jpg)
SVCA vs DVCA
Harmonic mean IPC for different virtual channel configurations
SVCA achieves up to 18.5% speedup over DVCA
![Page 32: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/32.jpg)
Benchmarks

| Abr. | Benchmark |
| --- | --- |
| FWT | Fast Walsh Transform |
| LIB | LIBOR Monte Carlo |
| MUM | MUMmerGPU |
| NEU | Neural Network Digit Recognition |
| NN | Nearest Neighbor |
| RAY | Ray Tracing |
| RED | Reduction |
| WP | Weather Prediction |
![Page 33: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/33.jpg)
Sensitivity Analysis
Varying DRAM Controller Queue Size
Varying Topology
![Page 34: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/34.jpg)
More Results
Memory Latency: 33.9% reduction for HG and 35.3% reduction for HMHG4 compared to BFIFO
DRAM Efficiency: 15.1% improvement for HG and HMHG4 over BFIFO
![Page 35: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/35.jpg)
Row Access Locality Reduction After Interconnect
44% for Crossbar, 48% for Mesh, 52% for Ring
![Page 36: Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures](https://reader035.vdocument.in/reader035/viewer/2022062500/56815b01550346895dc8b2f6/html5/thumbnails/36.jpg)
DRAM Parameters