Towards Scalable and Energy-Efficient Memory System Architectures
Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan,
Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor
School of Computing, University of Utah
Convergence of Technology Trends
• Energy
• Reliability
• New Memory Technologies
• BW, Capacity, and Locality for Multi-Cores
Overhaul of main memory architecture!
High Level Approach
• Explore changes to memory chip microarchitecture – must cause minimal disruption to density
• Explore changes to interfaces and standards – major change appears inevitable!
• Explore system and memory controller innovations – most attractive, but order-of-magnitude improvement unlikely
Design solutions that are technology-agnostic
Projects
Memory Chip
• Reduce overfetch
• Support reliability
• Handle PCM drift
• Promote read/write parallelism
Memory Interface
• Interface with photonics
• Organize channel for high capacity
Memory Controller
• Maximize use of row buffer
• Schedule for low latency and energy
• Exploit mini-ranks
(Diagram: CPU with memory controller (MC) connected to DIMMs.)
Talk Outline
Mature work:
• SSA architecture – Single Subarray Access (ISCA’10)
• Support for reliability (ISCA’10)
• Interface with photonics (ISCA’11)
• Micro-pages – data placement for row-buffer efficiency (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• Managing resistance drift in PCM cells (NVMW’11)

Preliminary work:
• Handling read/write parallelism
• Enabling high capacity
• Handling DMA scheduling
• Exploiting rank subsetting for performance and thermals
Minimizing Overfetch with Single Subarray Access
Ani Udipi
Problem 1 - DRAM Chip Energy
• On every DRAM access, multiple arrays in multiple chips are activated
• This was useful when there was good locality in access streams – open-page policy
• It helped keep density high and reduce cost-per-bit
• With multi-thread, multi-core, and multi-socket systems, there is much more randomness – “mixing” of access streams by the time they are seen by the memory controller
Rethinking DRAM Organization
• Limited use for designs based on locality
• As much as 8 KB is read in order to service a 64-byte cache line request
• Termed “overfetch” – substantially increases energy consumption
• Need a new architecture that:
– Eliminates overfetch
– Increases parallelism
– Increases opportunity for power-down
– Allows efficient reliability
Proposed Solution – SSA Architecture
(Diagram: the memory controller connects over an address/command bus and eight 8-bit data buses to a DIMM; within one DRAM chip, banks are composed of subarrays with bitlines and row buffers, a global interconnect carries data to I/O, and a full 64-byte line is fetched from a single subarray.)
SSA Basics
• Entire DRAM chip divided into small “subarrays”
• Width of each subarray is exactly one cache line
• Fetch entire cache line from a single subarray in a single DRAM chip – SSA
• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low
• Close page policy and “posted-RAS”
• Data bus to processor essentially split into 8 narrow buses
SSA Architecture Impact
• Energy reduction
– Dynamic – fewer bitlines activated
– Static – smaller activation footprint, more and longer spells of inactivity, better power-down
• Latency impact
– Limited pins per cache line – serialization latency
– Higher bank-level parallelism – shorter queuing delays
• Area increase
– More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)
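The serialization trade-off can be sanity-checked with simple arithmetic; a minimal sketch, assuming the 8-pins-per-chip widths shown in the diagram (names are ours, not from the paper):

```python
# Serialization beats per 64-byte cache line (bus widths assumed from the diagram).
CACHE_LINE_BITS = 64 * 8

def transfer_beats(bus_bits):
    """Clock edges needed to move one cache line over a bus of this width."""
    return CACHE_LINE_BITS // bus_bits

conventional = transfer_beats(8 * 8)  # full rank: 8 chips x 8 pins -> 8 beats
ssa = transfer_beats(8)               # SSA: one chip's 8 pins -> 64 beats
```

The 8× longer burst is the serialization latency the bullet refers to; the shorter queuing delays from higher bank-level parallelism must offset it.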
Area Impact
• Smaller arrays – more peripheral overhead
• More wiring overhead in the on-chip interconnect between arrays and pin pads
• We did a best-effort area impact calculation using a modified version of CACTI 6.5 – an analytical model, with its limitations
• More feedback in this specific regard would be awesome!
• More info on exactly where in the hierarchy overfetch stops would be great too
Support for Chipkill Reliability
Ani Udipi
Problem 2 – DRAM Reliability
• Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip
• One example of existing systems:
– Consider a baseline 64-bit word plus 8-bit ECC
– Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!
– Reading 72 chips – significant overfetch!
• Chipkill is even more of a concern for SSA, since the entire cache line comes from a single chip
Proposed Solution
Approach similar to RAID-5
(Diagram: RAID-5-like layout across the nine DRAM devices of a DIMM – cache lines L0–L63, each with a local checksum C, and global parity blocks P0–P7 rotated across the devices. L – cache line, C – local checksum, P – global parity.)
Chipkill design
• Two-tier error protection
• Tier-1 protection – self-contained error detection
– 8-bit checksum per cache line – 1.625% storage overhead
– Every cache line read is now slightly longer
• Tier-2 protection – global error correction
– RAID-like striped parity across 8+1 chips
– 12.5% storage overhead
• Error-free access (common case)
– 1-chip reads
– 2-chip writes – leads to some bank contention – 12% IPC degradation
• Erroneous access
– 9-chip operation
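The tier-2 recovery path can be sketched as RAID-5-style bytewise XOR; a minimal model that ignores the tier-1 checksums (function names are ours):

```python
from functools import reduce

def parity(lines):
    """Bytewise XOR across the cache lines striped over the data chips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*lines))

def reconstruct(survivors, parity_line):
    """Rebuild the line on a failed chip from the 7 survivors plus parity."""
    return parity(survivors + [parity_line])

lines = [bytes([i] * 64) for i in range(8)]   # 8 cache lines, one per chip
p = parity(lines)                             # global parity block
failed = 3                                    # suppose chip 3 fails
rebuilt = reconstruct(lines[:failed] + lines[failed + 1:], p)
assert rebuilt == lines[failed]
```

This is why an erroneous access touches all 9 chips, while the common-case read touches only 1.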
Questions
• What are the common failure modes in DRAM? In PCM?
• Do entire chips fail?
• Do parts of chips fail?
– Which parts? Bitlines? Wordlines? Capacitors?
– Entire arrays?
– Entire banks?
– I/O?
• Should all these failures be handled the same way?
Designing Photonic Interfaces
Ani Udipi
Problem 3 – Memory interconnect
• Electrical interconnects are not scaling well – where can photonics make an impact, on both energy and performance?
• Various levels in the DRAM interconnect:
– Memory cell to sense-amp – addressed by SSA
– Row buffer to I/O – currently electrical (on-chip)
– I/O pins to processor – currently electrical (off-chip)
• Photonic interconnects
– Large static power component – laser/ring tuning
– Much lower dynamic component – relatively unaffected by distance
• Electrical interconnects
– Relatively small static component
– Large dynamic component
• Cannot overprovision photonic bandwidth; use it only where necessary
Consideration 1 – How much photonics on a die?
(Plot: electrical energy vs. photonic energy as the amount of photonics on the die varies.)
Consideration 2 - Increasing Capacity
• 3D stacking is imminent
• There will definitely be several dies on the channel
– Each die has photonic components that constantly burn static power
– Need to minimize this!
• TSVs are available within a stack – the best of both worlds
– Large bandwidth
– Low static energy
– Need to exploit this!
Proposed Design
(Diagram: a processor with its memory controller connected by a waveguide to a DIMM – a stack of DRAM chips on top of a photonic interface die with a stack controller.)
Proposed Design – Interface Die
• Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies
– Use photonics where there is heavy utilization – the shared bus between processor and interface die, i.e., the off-chip interconnect
– Helps break the pin barrier for efficient I/O; substantially improves socket-edge BW
– On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs
Advantages of the proposed system
• Reduction in energy consumption
– Fewer photonic resources (rings, couplers, trimming), without loss in performance
• Industry considerations
– Does not affect the design of commodity memory dies
– The same memory die can be used with both photonic and electrical systems
– The same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, memristors
Problem 4 – Communication Protocol
• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface:
– Heterogeneous memory modules, each with its own maintenance requirements, further complicate scheduling
– Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing)
– Heavy pressure on the address/command bus – several commands to micro-manage every DRAM operation
– Several independent banks – large amounts of state must be maintained to schedule requests efficiently
– Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction
Proposed Solution – Packet-based interface
• Release most of the tight control the memory controller holds today
• Move mundane tasks to the memory modules themselves (on the interface die) – make them more autonomous:
– maintenance operations (refresh, scrub, etc.)
– routine operations (DRAM precharge, NVM wear handling)
– timing control (DRAM alone has almost 20 different timing constraints to be respected)
– coding and any other special requirements
• The only information the memory module needs is the address and a read/write identification, with time slots reserved a priori for data return
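A minimal sketch of the request packet such an interface implies – the field names are assumed for illustration, not taken from any standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemPacket:
    """All the controller sends: address, direction, and a reserved return slot."""
    addr: int
    is_write: bool
    return_slot: int    # time slot reserved a priori for the data return

# Timing, refresh, precharge, and wear handling all stay behind the interface die.
pkt = MemPacket(addr=0x1F40, is_write=False, return_slot=12)
```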
Advantages
• Better interoperability, plug and play
– As long as the interface die has the necessary information, everything is interchangeable
• Better support for heterogeneous systems
– Allows easier data movement between, for example, DRAM and NVM on the same channel
• Reduces memory controller complexity
• Allows innovation and value addition in the memory, without being constrained by processor-side support
• Reduces bit-transport energy on the address/command bus
Data Placement with Micro-Pages to Boost Row Buffer Utility
Kshitij Sudan
DRAM Access Inefficiencies
• Overfetch due to large row buffers
– 8 KB read into the row buffer for a 64-byte cache line
– Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
– Increasingly randomized memory access stream
– Row-buffer hit rates bound to go down
• Open-page policy and FR-FCFS request scheduling
– Memory controller schedules requests to open row buffers first
Goal: improve row-buffer hit rates for chip multiprocessors
Key Observation
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
(Plot: cache block access pattern within OS pages.)
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks.
Basic Idea
(Diagram: the hottest 1 KB micro-pages of 4 KB OS pages are migrated into a reserved DRAM region; the coldest micro-pages remain in place in DRAM memory.)
Hardware Implementation (HAM)
(Diagram: baseline vs. Hardware-Assisted Migration (HAM) – a CPU memory request to physical address X consults a mapping table of old-to-new addresses and is redirected to new address Y inside a 4 MB reserved region of the 4 GB main memory.)
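A minimal sketch of the HAM lookup, assuming the 1 KB micro-page granularity and 4 MB reserved region shown in the figure (class and method names are ours):

```python
# Last 4 MB of a 4 GB main memory serves as the reserved region (per the figure).
RESERVED_BASE = 4 * 2**30 - 4 * 2**20

class MappingTable:
    def __init__(self):
        self.remap = {}                  # old micro-page address -> new address

    def migrate(self, old_addr, slot):
        """Place a hot 1 KB micro-page into the reserved DRAM region."""
        new_addr = RESERVED_BASE + slot * 1024
        self.remap[old_addr] = new_addr
        return new_addr

    def translate(self, addr):
        """Every CPU request consults the table; unmigrated addresses pass through."""
        return self.remap.get(addr, addr)

table = MappingTable()
y = table.migrate(0x1000, slot=0)
```

Because the redirection happens in hardware, no TLB shoot-down is needed, which is where HAM's advantage over OS-driven migration comes from.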
Results – 5M-cycle epoch; ROPS, HAM, and ORACLE schemes
(Plot: percent change in performance for each scheme.)
Apart from average 9% performance gains, our schemes also save DRAM energy at the same time!
Conclusions
• On average, for applications with room for improvement, and with our best-performing scheme:
– Average performance ↑ 9% (max. 18%)
– Average memory energy consumption ↓ 18% (max. 62%)
– Average row-buffer utilization ↑ 38%
• Hardware-assisted migration offers better returns due to lower overheads from TLB shoot-downs and misses
Data Placement AcrossMultiple Memory Controllers
Kshitij Sudan
DRAM NUMA Latency
(Diagram: four sockets, each with four cores and an on-chip memory controller (MC) driving a memory channel to DRAM DIMMs; the sockets are connected by the QPI interconnect.)
Problem Summary
• Pin limitations → increasing queuing delay
– Almost 8× increase in queuing delays from one core/one thread to 16 cores/16 threads
• Multi-cores → increasing row-buffer interference
– Increasingly randomized memory access stream
• Longer on- and off-chip wire delays → increasing NUMA factor
– NUMA factor already at 1.5× today
Goal: improve application performance by reducing queuing delays and NUMA latency
Policies to Manage Data Placement Among MCs
• Adaptive First Touch
– Assign new virtual pages to a DRAM (physical) page belonging to the MC j that minimizes a cost function:
cost_j = α × load_j + β × rowhits_j + λ × distance_j
• Dynamic Page Migration
– Programs change phases → imbalance in MC load
– Migrate pages between MCs at runtime, choosing the destination k by:
cost_k = Λ × distance_k + Γ × rowhits_k
• Integrating heterogeneous memory technologies:
cost_j = α × load_j + β × rowhits_j + λ × distance_j + Ƭ × LatencyDimmCluster_j + µ × Usage_j
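The Adaptive First Touch selection can be sketched as follows. The weights are illustrative (not the paper's values), and we make β negative so that a higher expected row-hit count lowers the cost:

```python
# Illustrative weights: load and distance penalize, row hits reward (beta < 0).
ALPHA, BETA, LAM = 1.0, -0.5, 0.2

def cost(load, rowhits, distance):
    """cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j (per the slide)."""
    return ALPHA * load + BETA * rowhits + LAM * distance

def pick_mc(mcs):
    """Assign a new page to the memory controller with the lowest cost."""
    return min(range(len(mcs)), key=lambda j: cost(*mcs[j]))

#      (load, rowhits, distance) observed at each MC
mcs = [(10, 4, 1), (3, 8, 2), (9, 2, 0)]
best = pick_mc(mcs)   # the lightly loaded, row-hit-friendly MC wins
```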
Summary
• Multiple on-chip MCs will be common in future CMPs
– Multiple cores sharing one MC; MCs controlling different types of memories
– Intelligent data mapping needed
• Adaptive First Touch (AFT) policy
– Increases performance by 6.5% in a homogeneous hierarchy and by 1.6% in a DRAM–PCM hierarchy
• Dynamic page migration, an improvement on AFT
– Further improvement over AFT – 8.9% over baseline in the homogeneous case, and 4.4% in the best-performing DRAM–PCM hierarchy
Managing Resistance Drift in PCM Cells
Manu Awasthi
Quick Summary
• Multi-level cells in PCM appear imminent
• A number of proposals exist to handle hard errors and lifetime issues of PCM devices
• Resistance drift is a less-explored phenomenon
– It will become increasingly significant as the number of levels per cell increases – the primary cause of “soft errors”
– Naïve techniques based on DRAM-like refresh will be extremely costly in both latency and energy
– Holistic solutions to counter drift need to be explored
What is Resistance Drift?
(Diagram: cell resistance vs. time, with four level bands 11, 10, 01, 00 running from crystalline (lowest resistance) to amorphous (highest). A cell programmed at time T0 (point A) drifts upward, and by time Tn (point B) its resistance has crossed into the next band – an error.)
Resistance Drift Data
Cell type – drift time at room temperature (s):
• Median 11 cell – 10^499
• Worst-case 11 cell – 10^15
• Median 10 cell – 10^24
• Worst-case 10 cell – 5.94
• Median 01 cell – 10^8
• Worst-case 01 cell – 1.81
(Plot: resistance distributions of the 11, 10, 01, and 00 states.)
Resistance Drift - Issues
• The programmed resistance drifts according to a power-law equation:
R_drift(t) = R0 × t^α
• R0 and α usually follow a Gaussian distribution
• The time to drift (error) depends on:
– the programmed resistance (R0), and
– the drift coefficient (α)
– and is highly unpredictable!
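The power law can be inverted to estimate when a cell crosses a band boundary; all numbers below are illustrative, not measured device values:

```python
# Solve R0 * t**alpha = r_boundary for t, from R_drift(t) = R0 * t**alpha.
def time_to_error(r0, alpha, r_boundary):
    """Seconds until the drifted resistance reaches the next band's boundary."""
    return (r_boundary / r0) ** (1.0 / alpha)

# A cell programmed higher in its band (high R0) with a larger drift
# coefficient (high alpha) reaches the boundary far sooner.
typical = time_to_error(r0=1e4, alpha=0.02, r_boundary=1e5)   # astronomically long
worst = time_to_error(r0=5e4, alpha=0.10, r_boundary=1e5)     # about 17 minutes
```

This gap between median and worst-case cells is why a scrub rate set by the worst case is so wasteful.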
Resistance Drift - How it happens
(Diagram: number of cells vs. resistance across the 11, 10, 01, 00 bands. A median-case cell, with typical R0 and typical α, drifts from R0 to Rt but stays within its band; a worst-case cell, with high R0 and high α, drifts across the band boundary – an error.)
The scrub rate will be dictated by the worst-case R0 and worst-case α.
A naïve refresh/scrub will be extremely costly!
Architectural Solutions – Headroom
• Assumes support for Light Array Reads for Drift Detection (LARDDs) and ECC-N
• Headroom-h scheme – a scrub is triggered if N−h errors are detected
– Decreases the probability of errors slipping through
– Increases the frequency of full scrubs and hence decreases lifetime
– Gradual Headroom scheme: start with a large LARDD frequency, and increase the frequency as errors increase
(Flowchart: after N cycles, read the line and check for errors; if errors < N−h, wait for the next interval, otherwise scrub the line.)
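The flowchart above can be sketched as a loop; N = 8 and h = 2 are assumed values for illustration:

```python
# Headroom-h trigger: a cheap LARDD pass counts errors, and a full ECC scrub
# fires only once the headroom is exhausted (errors reach N - h).
def lardd_step(errors_found, n, h, scrubs):
    """One periodic light read; returns the error count carried forward."""
    if errors_found >= n - h:
        scrubs.append(errors_found)   # full scrub rewrites the line
        return 0                      # error count resets after scrub
    return errors_found               # below threshold: just wait another interval

scrubs = []
errs = 0
for observed in [1, 2, 3, 5, 6, 7]:   # drift errors seen at successive LARDDs
    errs = lardd_step(observed, n=8, h=2, scrubs=scrubs)
```

With N = 8 and h = 2, the scrub fires only when 6 or more errors are seen, so most LARDD intervals cost no full scrub.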
Reducing Overheads with a Circuit-Level Solution
• Invoking ECC on every LARDD increases energy consumption
• A parity-like error-detection circuit is used to signal the need for a full-fledged ECC error detection
– The number of drift-prone states in each line is counted when the line is written into memory (a single bit records odd/even)
– At every LARDD, the parity is verified
• Reduces the need for an ECC read-compare at every LARDD cycle
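A sketch of that parity check, assuming the intermediate levels (10 and 01) are the drift-prone states and that a drift carries the cell out of that set:

```python
# One stored bit per line: parity of the count of drift-prone states at write time.
DRIFT_PRONE = {'10', '01'}   # assumed: the intermediate MLC levels

def drift_parity(cells):
    """Odd/even count of drift-prone states in a line of 2-bit MLC states."""
    return sum(c in DRIFT_PRONE for c in cells) % 2

line = ['11', '10', '01', '00', '10']   # states as written into memory
stored = drift_parity(line)             # the single bit kept alongside the line

# A cell drifting out of the drift-prone set changes the count by one,
# flipping the parity; only then is the expensive ECC read-compare invoked.
drifted = list(line)
drifted[2] = '00'                       # the '01' cell drifts into the '00' band
```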
More Solutions
• Precise writes
– More write iterations to program the state closer to the mean, reducing the chance of drift
– Increases energy consumption and write time, and decreases lifetime!
• Non-uniform guardbanding
– The resistance range is currently divided equally between all n states
– Expand the resistance range for drift-prone states at the expense of non-drift-prone ones
Results
(Plot: errors vs. LARDD interval in seconds.)
Conclusions
• Resistance drift will worsen with MLC scaling
• Naïve solutions based on ECC support are costly for PCM
– Increased write energy, decreased lifetimes
• Holistic solutions need to be explored to counter drift at the device, architectural, and system levels
– 39% reduction in energy, 4× fewer errors, 10^2× increase in lifetime
Handling Read/Write Parallelism
Nil Chatterjee
The Problem
• Writes are not on the critical path for program execution, but they can slow down reads through resource contention
• In future chipkill-correct systems, each data write will necessitate an update of the ECC codes, and the impact of writes will be more evident
• In PCM, the problem is exacerbated by significantly longer write times
• Abstracting the writes away improves read latency by 48% in non-ECC DRAM systems
Impact of Writes on Reads
• Write draining affects read latencies by
– increasing the queuing delay, and
– reducing the read stream’s row-buffer locality
Bank Contention from Writes
• Reads are not scheduled in the middle of the WQ drain, because that would require multiple bus turnarounds, incurring tWRT and tOST delays
• Underutilization of data bus bandwidth during WQ draining leads to performance loss
• However, opportunities to schedule read accesses to idle banks might exist in this interval
Example
Solution: Increasing R/W Overlap
• During a WQ drain cycle, schedule partial reads to idle banks
– Following a column read command, the data is fetched from the sense amplifiers into a small (64-byte) buffer near the I/O pads
– Data is streamed out only after the WQ reaches the low watermark – no turnaround delays
• Immediately following the WQ drain, a flurry of prefetched reads can occupy the data bus
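A sketch of the drain-time policy, assuming the scheduler simply steers partial reads to banks that are not being written (names are ours):

```python
# During a WQ drain, reads to idle banks are issued as partial reads whose data
# sits in per-bank buffers; reads to draining banks must wait. No data leaves
# the chip mid-drain, so no extra bus turnarounds are incurred.
def schedule_drain(writes, reads):
    """writes/reads: bank ids; returns (buffered partial reads, deferred reads)."""
    busy = set(writes)
    buffered = [r for r in reads if r not in busy]   # issued to idle banks
    deferred = [r for r in reads if r in busy]       # wait for the drain to end
    return buffered, deferred

buffered, deferred = schedule_drain(writes=[0, 2], reads=[1, 2, 3])
```

Once the WQ hits the low watermark, the buffered data streams out back-to-back, which is the post-drain "flurry" described above.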
Impact
• A small pool of partial-read registers can help increase data bus utilization after writes
• In PCM systems, where writes are very expensive, partial reads can have a higher impact
• The JEDEC standard must be augmented to support a partial-read command
Organizing Channels for High Capacity
Kshitij Sudan
Increasing DRAM Capacity by Re-Architecting the Memory Channel
• Increase DRAM capacity while minimizing power
• Re-architect the CPU-to-DRAM channel
– Study effects of bus width and protocol (serial vs. parallel)
– CMPs might have changed the playing field!
• Organize modules as a binary tree, and move some MC functionality to a “buffer chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst-case latency and improves signal integrity
• The buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case access latency like FB-DIMM
• NUMA-like DRAM access – leverage data mapping
Handling DMA Scheduling
Kshitij Sudan
Handling DMA Scheduling
• Reduce conflicts between CPU-generated DRAM requests and DMA-generated DRAM requests
• Study interference from DMA requests on CPU-generated DRAM requests
– With on-chip MCs, it is unclear how DMA requests compete with CPU requests
• Devise scheduling policies to minimize DMA and CPU access conflicts
• Infer how DMA and CPU requests are arbitrated at the MC
– No CPU manufacturer documentation is available publicly!
Variable Rank Subsetting
Seth Pugsley
Motivation for Rank Subsetting
• Rank subsetting
– Split up a rank and its data channel into multiple, smaller ranks and narrower data channels
• Prior motivations: reduce dynamic energy and overfetch
Rank Size Options
| Rank width | Data buses | Banks | Row buffers | 64B cache line transfer | Parallelism |
|---|---|---|---|---|---|
| 8 chips (standard) | 1×64-bit | 2 | 1×8KB | 8 clock edges | all transfers sequential |
| 4 chips (narrow) | 2×32-bit | 4 | 2×4KB | 16 clock edges | 2 cache lines in parallel |
| 2 chips (narrow) | 4×16-bit | 8 | 4×2KB | 32 clock edges | 4 cache lines in parallel |
| 1 chip (narrow) | 8×8-bit | 16 | 8×1KB | 64 clock edges | 8 cache lines in parallel |
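Every row of the table follows from dividing the standard 8-chip, 64-bit rank. A minimal sketch of the arithmetic, assuming x8 DRAM chips, a 64-byte cache line, and one transfer per clock edge on each bus:

```python
def subset_params(chips_per_rank, line_bytes=64, chip_width_bits=8,
                  full_row_kb=8, full_rank_banks=2):
    """Derive rank-subsetting parameters for a DIMM of 8 x8 chips.
    Each subset gets a proportionally narrower data bus and a
    proportionally smaller row buffer."""
    subsets = 8 // chips_per_rank
    bus_bits = chips_per_rank * chip_width_bits
    # Each clock edge moves bus_bits/8 bytes over one subset's bus
    edges = line_bytes // (bus_bits // 8)
    return {"subsets": subsets,
            "bus_bits": bus_bits,
            "banks": full_rank_banks * subsets,
            "row_buffer_kb": full_row_kb // subsets,
            "edges_per_line": edges,
            "parallel_lines": subsets}
```

For example, `subset_params(4)` reproduces the 4-chip row: two 32-bit buses, 4 banks, 2×4KB row buffers, 16 clock edges per cache line, and 2 lines in flight at once.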
68
![Page 69: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/69.jpg)
Impact on Queuing Delay
(Timeline: each core access occupies its bank for 16 cycles but drives the data bus for only 4 cycles)
Behavior with a single bank: data bus utilization of 25%
69
![Page 70: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/70.jpg)
Impact on Queuing Delay
(Timeline: each core access occupies its bank for 16 cycles but drives the data bus for only 4 cycles)
Behavior with a single bank: data bus utilization of 25%
(Timeline: accesses from Core 0 and Core 1 target different banks, so their 4-cycle data bus transfers interleave)
Behavior with two banks: data bus utilization of 50%
70
![Page 71: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/71.jpg)
Advantages of Rank Subsetting
• More open rows
– Each open row is narrower (hit rates remain acceptable)
• Reduced queuing delay
– More banks available and better data bus utilization
71
![Page 72: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/72.jpg)
Performance for Static Rank Subsetting
72
![Page 73: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/73.jpg)
Variable Rank Subsetting
• Use a different size rank for each memory op
– e.g., a 1-wide transaction on the data bus at the same time as 2-wide and 4-wide transactions
– Scheduling can get pretty hairy
– Many wasted data bus slots
(Timing diagram: data lanes D0–D7 over time; legend: 1-wide, 2-wide, 4-wide, 8-wide, wasted)
73
![Page 74: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/74.jpg)
More Sensible Variable Rank Subsetting
• Can still use a different size rank for each memory op
• Limit rank size to only 2 options
– Software chooses the mode for newly allocated pages
– Scheduling is much easier than in the previous example
(Timing diagram: data lanes D0–D7 over time; legend: 4-wide, 8-wide, wasted)
74
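With only two rank widths, slot scheduling reduces to packing transfers onto the eight data lanes: an 8-wide transfer takes all lanes for one line-time, while a 4-wide transfer takes either half of the lanes for twice as long. A greedy sketch of such a scheduler (hypothetical code; the slides do not specify the actual algorithm):

```python
def schedule(requests, lanes=8, line_slots=1):
    """Greedily pack 8-wide and 4-wide transfers onto 8 data lanes.
    An 8-wide transfer uses all lanes for line_slots slots; a 4-wide
    transfer uses one half of the lanes for twice as long. Returns
    (request_id, first_lane, width, start_slot, duration) tuples."""
    busy_until = [0] * lanes            # next free slot per lane
    placements = []
    for rid, width in requests:
        assert width in (4, 8), "only two rank-size options"
        duration = line_slots * (8 // width)
        # candidate lane groups: the whole bus, or either half
        groups = [0] if width == 8 else [0, 4]
        best = min(groups, key=lambda g: max(busy_until[g:g + width]))
        start = max(busy_until[best:best + width])
        for lane in range(best, best + width):
            busy_until[lane] = start + duration
        placements.append((rid, best, width, start, duration))
    return placements
```

With only two widths that halve each other, two 4-wide transfers tile exactly into one 8-wide slot pair, so the wasted-slot problem of the previous slide largely disappears.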
![Page 75: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/75.jpg)
75
Exploiting Rank Subsetting to Alleviate Thermal Constraints
Manju Shevgoor
(Diagram: CPU → MC → DIMM…; primary impact highlighted)
![Page 76: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/76.jpg)
76
The Problem: DRAM is Getting Hot
• DRAM temperatures can rise up to 95°C
• Refresh rate needs to double once DRAM crosses 85°C
• Thermal emergencies due to elevated temperatures adversely affect performance
• Cooling systems are expensive
(Figures: full-DIMM heat spreader, Zhu et al., ITHERM'08; typical cooling system, Liu et al., HPCA'11)
![Page 77: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/77.jpg)
77
Current Thermal Throttling Techniques
• CPU Throttling: reduces overall activity
• Thermal Shutdown: stops all requests to over-heated chips
• Memory Bandwidth Throttling: lowers channel bandwidth to reduce DRAM activity
• All DRAM chips are affected by these techniques, irrespective of their temperature
• Even cool chips that could otherwise operate at optimal throughput are throttled
![Page 78: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/78.jpg)
78
Refresh Overhead
(Figure: refresh-induced IPC degradation, from Elastic Refresh, Stuecheli et al., MICRO'10)
• As memory chips get denser, this problem only worsens
• Integer workloads can suffer up to 13% IPC degradation because of refresh
• Chips operating in the Extended Temperature Range will suffer larger IPC degradation
![Page 79: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/79.jpg)
79
Temperature Profile along a DIMM
• Proximity to the hot processor results in unequal temperatures
• Position with respect to airflow also impacts temperature
• The temperature difference between the hottest and coolest chips can be 10°C
(Figure: typical temperature profile along an RDIMM, Zhu et al., ITHERM'08; typical cooling system, Liu et al., HPCA'11)
![Page 80: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/80.jpg)
80
Baseline
• All chips are grouped into 1 rank
• Not all chips are 'HOT'
• Not all chips need to be throttled!
(Diagram: baseline rank organization; a DIMM with a buffer and a single rank spanning all chips)
![Page 81: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/81.jpg)
81
Proposed Solution
(Diagram: proposed rank organization; the DIMM is statically split into Rank 1 (coolest) through Rank 4 (warmest) based on temperature)
• Not all ranks are equally hot, so penalize only the hottest ranks
• Control refresh rate at rank granularity
– Only the hottest chips are refreshed every 32ms; the rest can be refreshed every 64ms
![Page 82: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/82.jpg)
82
Fine-Grained DRAM Throttling
• Need a throttling mechanism that can be applied at a finer granularity
• Temperature-Aware Cache Replacement
– Modify LRU to preferentially evict lines belonging to cool ranks
– Will reduce activity only in hot ranks
(Diagram: cache sets ordered MRU to LRU, each line tagged with its home rank R1–R4; eviction decreases activity ONLY in hot ranks)
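A minimal sketch of such a temperature-aware replacement policy (hypothetical structure; real hardware would approximate this with LRU-stack bookkeeping): scan from the LRU end and evict the first line whose home rank is cool, so the resulting writeback and refill traffic avoids hot ranks, falling back to plain LRU when every candidate lives in a hot rank.

```python
def choose_victim(lru_stack, hot_ranks):
    """lru_stack: cache lines of one set, ordered MRU..LRU, each
    tagged with its home DRAM rank. Prefer evicting a line from a
    cool rank so the eviction's memory traffic avoids hot ranks."""
    for line in reversed(lru_stack):        # scan from the LRU end
        if line["rank"] not in hot_ranks:
            return line
    return lru_stack[-1]                    # all hot: plain LRU
```

Because the policy only reorders victims within a set, it changes where refill traffic lands without changing the miss rate much, which is what lets it throttle hot ranks without penalizing cool ones.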
![Page 83: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/83.jpg)
83
Rank-wise Refresh
(Diagram: the DIMM split into Rank 1 (coolest) through Rank 4 (warmest))
• Refresh only as fast as needed
• Only ranks operating in the Extended Temperature Range are refreshed every 32ms
• Ranks operating in the Normal Temperature Range are refreshed every 64ms
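At rank granularity the refresh decision reduces to a per-rank comparison against the 85°C extended-temperature threshold; a sketch using the slide's 32ms/64ms periods (the function names are illustrative):

```python
def refresh_period_ms(rank_temp_c, threshold_c=85.0):
    """Slide's rule: ranks in the extended temperature range
    (above 85C) refresh every 32ms; ranks in the normal range
    keep the standard 64ms period."""
    return 32 if rank_temp_c > threshold_c else 64

def dimm_refresh_schedule(rank_temps):
    """Per-rank refresh periods for one DIMM, coolest to warmest."""
    return [refresh_period_ms(t) for t in rank_temps]
```

On a DIMM where only the warmest ranks cross 85°C, the cooler ranks keep the 64ms period and thus avoid the doubled refresh overhead entirely.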
![Page 84: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/84.jpg)
84
Summary
• Model temperature of chips
• Split DIMM into mini-ranks
• Throttle activity of hot ranks: keeps the chips from reaching high temperatures
• Increase refresh rate of hot ranks only: maintains data integrity of chips once they get hot
• Penalize hot ranks ONLY!!
![Page 85: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/85.jpg)
85
Summary
• Converging technology trends require an overhaul of main memory architectures
• Multi-pronged approach required for significant improvements: memory chip, controller, interface, OS
• Future memory chips must also optimize for energy and reliability, and not just latency and density
• Publications: http://www.cs.utah.edu/arch-research/
![Page 86: Towards Scalable and Energy-Efficient Memory System Architectures](https://reader035.vdocument.in/reader035/viewer/2022081513/56816519550346895dd79696/html5/thumbnails/86.jpg)
86
Acknowledgments
• Collaborators at HP Labs, IBM, Intel
• Funding from NSF, Intel, HP, University of Utah
• Thanks for hosting!