1 efficient data access in future memory hierarchies rajeev balasubramonian school of computing...
TRANSCRIPT
![Page 1: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/1.jpg)
1
Efficient Data Access in FutureMemory Hierarchies
Rajeev Balasubramonian
School of ComputingResearch Buffet, Fall 2010
![Page 2: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/2.jpg)
2
Acks
• Terrific students in the Utah Arch group
• Collaborators at HP, Intel, IBM
• Prof. Al Davis, who re-introduced us to memory systems
![Page 3: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/3.jpg)
3
Current Trends
• Continued device scaling
• Multi-core processors
• The power wall
• Pin limitations
• Problematic interconnects
• Need for high reliability
![Page 4: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/4.jpg)
4
Anatomy of Future High-Perf Processors
TheCore
Hi-Perf
TheCore
Lo-Perf
• Designs well understood• Combo of hi- and lo-perf• Risky Ph.D.!!
![Page 5: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/5.jpg)
5
Anatomy of Future High-Perf Processors
Core
• Large shared L3• Partitioned into many banks• Assume one bank per core
L2/L3CacheBank
![Page 6: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/6.jpg)
6
Anatomy of Future High-Perf Processors
C
• Many cores!• Large distributed L3 cache
$ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
![Page 7: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/7.jpg)
7
Anatomy of Future High-Perf Processors
C
• On-chip network• Includes routers and long wires• Used for cache coherence• Used for off-chip requests/responses
$ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
![Page 8: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/8.jpg)
8
Anatomy of Future High-Perf Processors
C
• Memory controller handles off-chip requests to memory
$ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
C $ C $ C $ C $
Memory Controller
DIMM DIMM DIMM
![Page 9: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/9.jpg)
9
CPU
Anatomy of Future High-Perf Processors
DIMM DIMM DIMM
CPU
DIMM DIMM DIMM
CPU
DIMM DIMM DIMM
• Multi-socket motherboard• Lots of cores, lots of memory, all connected
![Page 10: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/10.jpg)
10
CPU
Anatomy of Future High-Perf Processors
DIMM DIMM DIMM
CPU
DIMM DIMM DIMM
CPU
DIMM DIMM DIMM
• DRAM backed up by slower, higher capacity emerging non-volatile memories• Eventually backed up by disk… maybe
NonVolatilePCM
NonVolatilePCM
DISK DISK
![Page 11: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/11.jpg)
11
Life of a Cache Miss
CoreL1
Miss in L1on-chipnetwork
$
Look up L2/L3 bank
On- andoff-chipnetwork
MC
Wait in MC queue
DIMM
Access DRAM
NonVolatilePCM
Access PCM
DISK
Access disk
![Page 12: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/12.jpg)
12
Research Topics
CoreL1
Miss in L1on-chipnetwork
$
Look up L2/L3 bank
On- andoff-chipnetwork
MC
Wait in MC queue
DIMM
Access DRAM
NonVolatilePCM
Access PCM
DISK
Access disk
Not very hot topics!
![Page 13: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/13.jpg)
13
Research Topics
CoreL1
Miss in L1on-chipnetwork
$
Look up L2/L3 bank
On- andoff-chipnetwork
MC
Wait in MC queue
DIMM
Access DRAM
NonVolatilePCM
Access PCM
DISK
Access disk
2
45
6 1
3
![Page 14: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/14.jpg)
14
Problems with DRAM
• DRAM main memory contributes 1/3rd of total energy in datacenters
• Long latencies; high bandwidth needs
• Error resilience is very expensive
• DRAM is a commodity and chips have to be compliant with standards
Initial designs instituted in the 1990s Innovations are evolutionary Traditional focus on density
![Page 15: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/15.jpg)
15
Time for a Revolutionary Change?
• Energy is far more critical today
• Cost-per-bit perhaps not as relevant today
• Memory reliability is increasing in importance
• Multi core access streams have poor locality
• Queuing delays are starting to dominate
• Potential migration to new memory technologies and interconnects
![Page 16: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/16.jpg)
• Incandescent light bulb• Low purchase cost• High operating cost• Commodity
• Energy-efficient light bulb• Higher purchase cost• Much lower operating cost• Value-addition
16
It’s worth a small increase in capital costs to gain large reductions in operating costs
$3.00 13W
$0.30 60W
And not 10X, just 15-20%!
Key Idea
![Page 17: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/17.jpg)
17
DRAM Basics
…
Memory bus or channel
Rank
DRAMchip ordeviceBank
Array1/8th of therow buffer
One word ofdata output
DIMM
On-chip Memory
Controller
![Page 18: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/18.jpg)
18
RAS
CAS
Cache Line
DRAM Chip DRAM Chip DRAM Chip DRAM Chip
Row Buffer
One bank shown in each chip
DRAM Operation
![Page 19: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/19.jpg)
19
New Design Philosophy
• Eliminate overfetch; activate a single chip and a single small array much lower energy, slightly higher cost
• Provide higher parallelism
• Add features for error detection
[ Appears in ISCA’10 paper, Udipi et al. ]
![Page 20: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/20.jpg)
20
Single Subarray Access (SSA)
MEMORY CONTROLLER
8 8
ADDR/CMD BUS
64 Bytes
Bank
Subarray
Bitlines
Row buffer
Global Interconnect to I/O
ONE DRAM CHIP
DIMM
8 8 8 8 8 88DATA BUS
![Page 21: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/21.jpg)
21
Address
Cache Line
DRAM ChipSubarray
DRAM ChipSubarray
DRAM ChipSubarray
DRAM ChipSubarraySubarray Subarray Subarray Subarray
Sleep Mode(or other parallelaccesses)
Subarray Subarray Subarray SubarraySubarray Subarray Subarray Subarray
SSA Operation
![Page 22: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/22.jpg)
22
Consequences
• Why is this good? Minimal activation energy for a line More circuits can be placed in low-power sleep Can perform multiple operations in parallel
• Why is this bad? Higher area and cost (roughly 5 – 10%) Longer data transfer time Not compatible with today’s standards No opportunity for row buffer hits
![Page 23: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/23.jpg)
23
Narrow Vs. Wide Buses?
• What provides higher utilization? 1 wide bus or 8 narrow buses?
• Must worry about load imbalance and long data transfer time in latter
• Must worry about many bubbles in the former because of dependences
[ Ongoing work, Chatterjee et al. ]
Bank Access DT
Bank Access DT
One 64-bit wide bus
Back-to-backrequests to the
same bank
64 idle bits
Bank Access DT
Bank Access DT
8 idle bits
Eight 8-bit wide buses
![Page 24: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/24.jpg)
24
Methodology
• Tested with simulators (Simics) and multi-threaded benchmark suites
• 8-core simulations
• DRAM energy and latency from Micron datasheets
![Page 25: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/25.jpg)
25
blackscholes
canneal
fluidanimate
streamcluste
rx264 cg
0.00
0.50
1.00
1.50
2.00
2.50
Baseline Open Row
Baseline Close Row
SBA
SSA
Rela
tive
DRA
M E
nerg
y Co
nsum
ption
Moving to close page policy – 73% energy increase on average Compared to open page, 3X reduction with SBA, 6.4X with SSA
Results – Energy
![Page 26: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/26.jpg)
26
BASELINE (OPEN PAGE,
FR-FCFS)
BASELINE (CLOSED
ROW, FCFS)
SBA SSA0%
20%
40%
60%
80%
100%Termination Resistors
Global In-terconnect
Bitlines
Decoder + Wordline + Senseamps
Results – Energy – Breakdown
![Page 27: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/27.jpg)
27
blackscholes
bodytrack
cannealferre
t
fluidanimate
freqmine
streamcluste
rvips
x264str
eam cg is
Average0.00
100.00200.00300.00400.00500.00600.00700.00800.00
Baseline Open Page
Baseline Close Page
SBA
SSA
Cycl
es
• Serialization/Queuing delay balance in SSA - 30% decrease (6/12) or 40% increase (6/12)
Results – Performance
![Page 28: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/28.jpg)
28
BASELINE (OPEN PAGE,
FR-FCFS)
BASELINE (CLOSED ROW,
FCFS)
SBA SSA0%
20%
40%
60%
80%
100%Data Transfer
DRAM Core Access
Rank Switching delay (ODT)
Command/Addr Transfer
Queuing Delay
Results – Performance – Breakdown
![Page 29: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/29.jpg)
29
Error Resilience in DRAM
• Important to not only withstand a single error, but also entire chip failure – referred to as chipkill
• DRAM chips do not include error correction features -- error tolerance must be built on top
• Example: 8-bit ECC code for a 64-bit word; for chipkill correctness, each of the 72 bits must be read from a separate DRAM chip significant overfetch!
0 1 2 3 68 69 70 71….
72-bit word on every bus transfer
![Page 30: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/30.jpg)
30
Two-Tiered Solution
• Add a small (8-bit) checksum for every cache line
• Maintain one extra DRAM chip for parity across 8 DRAM chips
• When the checksum flags an error, use the parity to re-construct the corrupted cache line
• Writes will require updates to parity as well
0 1 7 Parity….
![Page 31: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/31.jpg)
31
Research Topics
CoreL1
Miss in L1on-chipnetwork
$
Look up L2/L3 bank
On- andoff-chipnetwork
MC
Wait in MC queue
DIMM
Access DRAM
NonVolatilePCM
Access PCM
DISK
Access disk
2
45
6 1
3
![Page 32: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/32.jpg)
32
Topic 2 – On-chip Networks
• Conventional wisdom: buses are not scalable; need routed packet-switched networks
• But, routed networks require bulky energy-hungry routers
• Results: Buses can be made scalable by having a hierarchy
of buses and Bloom filters to stifle broadcasts Low-swing buses can yield low energy, simpler
coherence, and scalable operation
[ Appears in HPCA’10 paper, Udipi et al. ]
![Page 33: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/33.jpg)
33
Topic 3: Data Placement in Caches, Memory
• In a large distributed NUCA cache, or in a large distributed NUMA memory, data placement must be controlled with heuristics that are aware of:
capacity pressure in the cache bank distance between CPU and cache bank and DIMM queuing delays at memory controller potential for row buffer conflicts at each DIMM
[ Appears in HPCA’09, Awasthi et al. and in PACT’10, Awasthi et al. (Best paper!) ]
![Page 34: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/34.jpg)
34
Topic 4: Silicon Photonic Interconnects
• Silicon photonics can provide abundant bandwidth and makes sense for off-chip communication
• Can help multi-cores overcome the bandwidth problem
• Problems: DRAM design that best interfaces with silicon photonics, protocols that allow scalable operation
[ On-going work, Udipi et al. ]
![Page 35: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/35.jpg)
35
Topic 5: Memory Controller Design
• Problem: abysmal row buffer hit rates, quality of service
• Solutions:
Co-locate hot cache lines in the same page
Predict row buffer usage and “prefetch” row closure
QoS policies that leverage multiple memory “knobs”
[ Appears in ASPLOS’10 paper, Sudan et al. and on-going work, Awasthi et al., Sudan et al. ]
![Page 36: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/36.jpg)
36
Topic 6: Non Volatile Memories
• Emerging memories (PCM): can provide higher densities at smaller feature sizes are based on resistance, not charge (hence non-volatile) can serve as replacements to DRAM/Flash/disk
• Problem: when a cell is programmed to a given resistance, the resistance tends to drift over time may require efficient refresh or error correction
[ On-going work, Awasthi et al. ]
![Page 37: 1 Efficient Data Access in Future Memory Hierarchies Rajeev Balasubramonian School of Computing Research Buffet, Fall 2010](https://reader036.vdocument.in/reader036/viewer/2022062516/56649e4c5503460f94b42157/html5/thumbnails/37.jpg)
37
Title
• Bullet