1
Towards Scalable and Energy-Efficient Memory System Architectures
Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan,
Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor
School of Computing, University of Utah
2
Towards Scalable and Energy-Efficient Memory System Architectures
3
Convergence of Technology Trends
Energy
Reliability
New Memory Technologies
BW, Capacity, and Locality for Multi-Cores
Overhaul of main memory architecture!
4
High Level Approach
• Explore changes to memory chip microarchitecture
– Must cause minimal disruption to density
• Explore changes to interfaces and standards
– Major change appears inevitable!
• Explore system and memory controller innovations
– Most attractive, but order-of-magnitude improvement unlikely
Design solutions that are technology-agnostic
5
Projects
Memory Chip
• Reduce overfetch
• Support reliability
• Handle PCM drift
• Promote read/write parallelism
Memory Interface
• Interface with photonics
• Organize channel for high capacity
Memory Controller
• Maximize use of row buffer
• Schedule for low latency and energy
• Exploit mini-ranks
6
Talk Outline
Mature work:
• SSA architecture – Single Subarray Access (ISCA’10)
• Support for reliability (ISCA’10)
• Interface with photonics (ISCA’11)
• Micro-pages – data placement for row buffer efficiency (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• Managing resistance drift in PCM cells (NVMW’11)
Preliminary work:
• Handling read/write parallelism
• Enabling high capacity
• Handling DMA scheduling
• Exploiting rank subsetting for performance and thermals
7
Minimizing Overfetch with Single Subarray Access
Ani Udipi
Problem 1 – DRAM Chip Energy
• On every DRAM access, multiple arrays in multiple chips are activated
• Was useful when there was good locality in access streams
– Open page policy
• Helped keep density high and reduce cost-per-bit
• With multi-thread, multi-core, and multi-socket systems, there is much more randomness
– “Mixing” of access streams by the time they are finally seen by the memory controller
8
Rethinking DRAM Organization
• Limited use for designs based on locality
• As much as 8 KB read in order to service a 64-byte cache line request
• Termed “overfetch”
– Substantially increases energy consumption
• Need a new architecture that
– Eliminates overfetch
– Increases parallelism
– Increases opportunity for power-down
– Allows efficient reliability
9
Proposed Solution – SSA Architecture
10
[Figure: SSA organization – the memory controller drives an addr/cmd bus and a data bus split into eight 8-bit links, one per DRAM chip on the DIMM; within one DRAM chip, subarrays (each with its own bitlines and row buffer) are grouped into banks and connected to the I/O pins by a global interconnect; a 64-byte cache line is delivered entirely by one subarray.]
SSA Basics
• Entire DRAM chip divided into small “subarrays”
• Width of each subarray is exactly one cache line
• Fetch entire cache line from a single subarray in a single DRAM chip – SSA (address mapping sketched below)
• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low
• Close page policy and “posted-RAS”
• Data bus to processor essentially split into 8 narrow buses
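As a rough illustration of the SSA mapping just described – a whole cache line served by one subarray on one chip – here is a minimal Python sketch. The geometry constants and the bit ordering are assumptions for illustration, not the paper's actual layout.

```python
# Minimal sketch of SSA-style address decomposition (illustrative only).
# Assumed geometry: 8 chips per DIMM, 8 banks per chip, 128 subarrays per bank,
# 64-byte cache lines; the field ordering here is a guess, not the paper's.
CACHE_LINE = 64
CHIPS, BANKS, SUBARRAYS, ROWS = 8, 8, 128, 8192

def ssa_decompose(phys_addr):
    line = phys_addr // CACHE_LINE        # cache-line index
    chip = line % CHIPS                   # one chip serves the whole line
    line //= CHIPS
    bank = line % BANKS
    line //= BANKS
    subarray = line % SUBARRAYS
    row = (line // SUBARRAYS) % ROWS      # row inside the subarray
    return chip, bank, subarray, row

print(ssa_decompose(0x12345678))
```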
11
SSA Architecture Impact
• Energy reduction
– Dynamic – fewer bitlines activated
– Static – smaller activation footprint – more and longer spells of inactivity – better power-down
• Latency impact
– Limited pins per cache line – serialization latency
– Higher bank-level parallelism – shorter queuing delays
• Area increase
– More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)
12
Area Impact
• Smaller arrays – more peripheral overhead
• More wiring overhead in the on-chip interconnect between arrays and pin pads
• We did a best-effort area impact calculation using a modified version of CACTI 6.5
– Analytical model, has its limitations
• More feedback in this specific regard would be awesome!
• More info on exactly where in the hierarchy overfetch stops would be great too
13
14
Support for Chipkill Reliability
Ani Udipi
Problem 2 – DRAM Reliability
• Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip
• One example of existing systems:
– Consider a baseline 64-bit word plus 8-bit ECC
– Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!
– Reading 72 chips – significant overfetch!
• Chipkill is even more of a concern for SSA since the entire cache line comes from a single chip
15
Proposed Solution
Approach similar to RAID-5
16
[Figure: DIMM of nine DRAM devices – cache lines L0–L63 are distributed across eight data chips, each stored with a local checksum (C); the ninth chip holds the global parity lines P0–P7. L – Cache Line, C – Local Checksum, P – Global Parity]
Chipkill design
• Two-tier error protection (sketched below)
• Tier-1 protection – self-contained error detection
– 8-bit checksum per cache line – 1.625% storage overhead
– Every cache line read is now slightly longer
• Tier-2 protection – global error correction
– RAID-like striped parity across 8+1 chips
– 12.5% storage overhead
• Error-free access (common case)
– 1-chip reads
– 2-chip writes – leads to some bank contention – 12% IPC degradation
• Erroneous access
– 9-chip operation
17
Questions
• What are the common failure modes in DRAM? In PCM?
• Do entire chips fail?
• Do parts of chips fail?
– Which parts? Bitlines? Wordlines? Capacitors?
– Entire arrays?
– Entire banks?
– I/O?
• Should all these failures be handled the same way?
18
19
Designing Photonic Interfaces
Ani Udipi
Problem 3 – Memory interconnect
• Electrical interconnects are not scaling well
– Where can photonics make an impact, both on energy and performance?
• Various levels in the DRAM interconnect:
– Memory cell to sense-amp – addressed by SSA
– Row buffer to I/O – currently electrical (on-chip)
– I/O pins to processor – currently electrical (off-chip)
• Photonic interconnects
– Large static power component – laser/ring tuning
– Much lower dynamic component – relatively unaffected by distance
• Electrical interconnects
– Relatively small static component
– Large dynamic component
• Cannot overprovision photonic bandwidth; use it only where necessary (see the energy sketch below)
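A back-of-envelope model makes the asymmetry concrete. All coefficients below are illustrative placeholders, not measured values; the point is only that a mostly idle photonic link keeps paying its laser/ring-tuning power, so overprovisioned photonic bandwidth is wasted energy.

```python
# Back-of-envelope link energy model; coefficients are illustrative, not measured.
def link_energy_pj(bits, seconds, static_mw, dyn_pj_per_bit):
    return static_mw * 1e-3 * seconds * 1e12 + dyn_pj_per_bit * bits

bits = 1e9                       # traffic over a 1-second window (low utilization)
photonic   = link_energy_pj(bits, 1.0, static_mw=20.0, dyn_pj_per_bit=0.2)
electrical = link_energy_pj(bits, 1.0, static_mw=1.0,  dyn_pj_per_bit=5.0)
# At low utilization the photonic link's static (laser/ring-tuning) power dominates,
# so extra photonic bandwidth that sits idle is pure overhead.
print(f"photonic: {photonic:.3g} pJ, electrical: {electrical:.3g} pJ")
```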
20
Consideration 1 – How much photonics on a die?
21
[Chart: electrical energy vs. photonic energy for different amounts of photonics on the die.]
Consideration 2 - Increasing Capacity
• 3D stacking is imminent• There will definitely be several dies on the channel
– Each die has photonic components that are constantly burning static power
– Need to minimize this!• TSVs available within a stack; best of both worlds
– Large bandwidth– Low static energy– Need to exploit this!
22
Proposed Design
23
[Figure: processor connected by a photonic waveguide to the DIMM; each stack of DRAM chips sits on a photonic interface die with a stack controller, which communicates with the on-processor memory controller.]
Proposed Design – Interface Die
• Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies
– Use photonics where there is heavy utilization – the shared bus between processor and interface die, i.e. the off-chip interconnect
– Helps break the pin barrier for efficient I/O, substantially improves socket-edge BW
– On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs
24
Advantages of the proposed system
• Reduction in energy consumption
– Fewer photonic resources (rings, couplers, trimming) without loss in performance
• Industry considerations
– Does not affect design of commodity memory dies
– Same memory die can be used with both photonic and electrical systems
– Same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, memristors
25
Problem 4 – Communication Protocol
• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
– Need to handle heterogeneous memory modules, each with its own maintenance requirements, which further complicates scheduling
– Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing)
– Heavy pressure on the address/command bus – several commands to micro-manage every operation of the DRAM
– Several independent banks – need to maintain large amounts of state to schedule requests efficiently
– Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction
26
Proposed Solution – Packet-based interface
• Release most of the tight control the memory controller holds today
• Move mundane tasks to the memory modules themselves (onto the interface die) – make them more autonomous
– maintenance operations (refresh, scrub, etc.)
– routine operations (DRAM precharge, NVM wear handling)
– timing control (DRAM alone has almost 20 different timing constraints to be respected)
– coding and any other special requirements
• The only information the memory module needs is the address and a read/write identification, with time slots reserved a priori for data return (see the packet sketch below)
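A minimal sketch of what such a request packet might carry, per the bullet above. The field widths and packing are assumptions; the point is that address, read/write identification, and a pre-reserved return slot are all the controller sends.

```python
# Sketch of the minimal request packet described above; field widths are assumptions.
from dataclasses import dataclass

@dataclass
class MemPacket:
    address: int        # physical address of the cache line
    is_write: bool      # read/write identification
    return_slot: int    # data-bus time slot reserved a priori by the controller

    def encode(self) -> int:
        # 48-bit address | 1-bit R/W | 16-bit slot (illustrative packing)
        return (self.address << 17) | (int(self.is_write) << 16) | self.return_slot

pkt = MemPacket(address=0x0000_1234_5678, is_write=False, return_slot=42)
print(hex(pkt.encode()))
# Everything else (refresh, precharge, wear handling, timing) is managed by the
# interface die on the memory module, so the controller never has to see it.
```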
27
Advantages
• Better interoperability, plug and play
– As long as the interface die has the necessary information, everything is interchangeable
• Better support for heterogeneous systems
– Allows easier data movement between DRAM and NVM, for example, on the same channel
• Reduces memory controller complexity
• Allows innovation and value addition in the memory, without being constrained by processor-side support
• Reduces bit transport energy on the address/command bus
28
29
Data Placement with Micro-Pages to Boost Row Buffer Utility
Kshitij Sudan
DRAM Access Inefficiencies
• Overfetch due to large row buffers
– 8 KB read into the row buffer for a 64-byte cache line
– Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
– Increasingly randomized memory access stream
– Row-buffer hit rates bound to go down
• Open page policy and FR-FCFS request scheduling
– The memory controller schedules requests to open row buffers first
Goal: Improve row-buffer hit rates for Chip Multi-Processors
30
Key Observation
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
[Figure: cache block access pattern within OS pages – for heavily accessed pages in a given time interval, accesses are usually to a few cache blocks.]
31
Basic Idea
[Figure: the hottest 1 KB micro-pages of 4 KB OS pages are gathered into a reserved DRAM region; the coldest micro-pages remain in their original locations in DRAM.]
32
Hardware Implementation (HAM)
[Figure: baseline vs. Hardware Assisted Migration (HAM) – a CPU memory request to physical address X normally goes straight to the 4 GB main memory; with HAM, a mapping table of old-to-new addresses redirects X to a new address Y inside a 4 MB reserved DRAM region (a remapping sketch follows).]
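A minimal sketch of the HAM lookup path, using the sizes on the slides (1 KB micro-pages, a 4 MB reserved region within a 4 GB memory). The table structure and eviction-free allocation are simplifications for illustration.

```python
# Sketch of hardware-assisted micro-page migration (HAM); sizes follow the slides,
# but the table organization is illustrative only.
MICRO_PAGE = 1024                          # 1 KB micro-pages
RESERVED_BASE = 4 * 2**30 - 4 * 2**20      # last 4 MB of a 4 GB memory is reserved

mapping = {}                               # old micro-page frame -> frame in reserved region
next_free = RESERVED_BASE

def migrate(hot_addr):
    """Remap the hot micro-page containing hot_addr into the reserved region."""
    global next_free
    frame = hot_addr // MICRO_PAGE * MICRO_PAGE
    if frame not in mapping:
        mapping[frame] = next_free
        next_free += MICRO_PAGE            # (no eviction handling in this sketch)
    return mapping[frame]

def translate(phys_addr):
    """Consulted on every request, after normal virtual-to-physical translation."""
    frame = phys_addr // MICRO_PAGE * MICRO_PAGE
    return mapping.get(frame, frame) + (phys_addr % MICRO_PAGE)

migrate(0x1234_5678)
print(hex(translate(0x1234_5678)))         # now lands in the reserved DRAM region
```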
33
Results
[Chart: percent change in performance with a 5M-cycle epoch for the ROPS, HAM, and ORACLE schemes.]
Apart from average 9% performance gains, our schemes also save DRAM energy at the same time!
34
Conclusions
• On average, for applications with room for improvement and with our best performing scheme:
– Average performance ↑ 9% (max. 18%)
– Average memory energy consumption ↓ 18% (max. 62%)
– Average row-buffer utilization ↑ 38%
• Hardware assisted migration offers better returns due to lower TLB shoot-down and miss overheads
35
36
Data Placement Across Multiple Memory Controllers
Kshitij Sudan
DRAM NUMA Latency
[Figure: four sockets connected by QPI links; each socket contains four cores and an on-chip memory controller (MC) driving a memory channel to its local DIMMs. Accesses that cross a socket boundary to a remote MC incur extra NUMA latency.]
37
Problem Summary
• Pin limitations → increasing queuing delay
– Almost 8x increase in queuing delays from a single core/one thread to 16 cores/16 threads
• Multi-cores → increasing row-buffer interference
– Increasingly randomized memory access stream
• Longer on- and off-chip wire delays → increasing NUMA factor
– NUMA factor already at 1.5x today
Goal: Improve application performance by reducing queuing delays and NUMA latency
38
Policies to Manage Data Placement Among MCs
• Adaptive First Touch
– Assign new virtual pages to a DRAM (physical) page belonging to the MC j that minimizes a cost function (see the sketch below):
cost_j = α·load_j + β·rowhits_j + λ·distance_j
• Dynamic Page Migration
– Programs change phases → imbalance in MC load
– Migrate pages between MCs at runtime, choosing the destination k that minimizes:
cost_k = Λ·distance_k + Γ·rowhits_k
• Integrating Heterogeneous Memory Technologies
– Extend the cost function with DIMM-cluster latency and usage terms:
cost_j = α·load_j + β·rowhits_j + λ·distance_j + Ƭ·LatencyDimmCluster_j + µ·Usage_j
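The Adaptive First Touch cost function can be sketched directly. The weights, and the sign convention that makes a higher row-hit rate reduce cost, are illustrative choices rather than the tuned values from the study.

```python
# Sketch of Adaptive First Touch placement: assign a newly touched virtual page to
# the memory controller j minimizing
#     cost_j = alpha * load_j + beta * rowhits_j + lambda * distance_j
# Weight values and signs here are illustrative (beta is negative so that a higher
# row-buffer hit rate lowers the cost).
ALPHA, BETA, LAM = 1.0, -0.5, 2.0

def pick_mc(mcs, core_id):
    """mcs: list of dicts with per-MC queue load, row-hit rate, and hop distance."""
    return min(mcs, key=lambda mc: ALPHA * mc["load"]
                                   + BETA * mc["row_hit_rate"]
                                   + LAM * mc["distance"][core_id])

mcs = [{"load": 12, "row_hit_rate": 0.4, "distance": {0: 1, 1: 3}},
       {"load": 30, "row_hit_rate": 0.7, "distance": {0: 2, 1: 1}}]
print(pick_mc(mcs, core_id=0))   # picks the lightly loaded, nearby controller
```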
39
Summary
• Multiple on-chip MCs will be common in future CMPs
– Multiple cores sharing one MC; MCs controlling different types of memories
– Intelligent data mapping needed
• Adaptive First Touch (AFT) policy
– Increases performance by 6.5% in a homogeneous hierarchy and by 1.6% in a DRAM–PCM hierarchy
• Dynamic page migration, an improvement on AFT
– Further improvement over AFT – 8.9% over baseline in the homogeneous case, and 4.4% in the best performing DRAM–PCM hierarchy
40
41
Managing Resistance Drift in PCM Cells
Manu Awasthi
42
Quick Summary
• Multi-level cells in PCM appear imminent
• A number of proposals exist to handle hard errors and lifetime issues of PCM devices
• Resistance drift is a less explored phenomenon
– Will become increasingly significant as the number of levels per cell increases – primary cause of “soft errors”
– Naïve techniques based on DRAM-like refresh will be extremely costly in both latency and energy
– Need to explore holistic solutions to counter drift
43
What is Resistance Drift?
[Figure: cell resistance vs. time for the four MLC states (11, 10, 01, 00), from crystalline (lowest resistance, 11) to amorphous (highest, 00). A cell programmed at time T0 (point A) drifts upward in resistance; by time Tn (point B) it has crossed into the next state's band – ERROR!]
44
Resistance Drift Data
Cell type – drift time at room temperature (seconds):
• Median 11 cell – 10^499
• Worst-case 11 cell – 10^15
• Median 10 cell – 10^24
• Worst-case 10 cell – 5.94
• Median 01 cell – 10^8
• Worst-case 01 cell – 1.81
(States range from 11, fully crystalline, to 00, fully amorphous.)
45
Resistance Drift - Issues
• The programmed resistance drifts according to a power law: Rdrift(t) = R0 × t^α
• R0 and α usually follow a Gaussian distribution
• Time to drift (error) depends on
– Programmed resistance (R0), and
– Drift coefficient (α)
– and is highly unpredictable!
(a worked example of the power law follows)
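A quick worked example of the power law. The resistance and α values below are made up; they only illustrate why a median cell may take an astronomically long time to drift into the next state while a worst-case cell (high R0, high α) errs within seconds.

```python
# Worked example of the drift power law R_drift(t) = R0 * t**alpha.
# A cell "errors" once its resistance drifts past the next state's boundary,
# so the time-to-error is t = (R_boundary / R0) ** (1 / alpha).
# The numbers below are illustrative, not measured device data.
def time_to_error(r0_ohm, alpha, r_boundary_ohm):
    return (r_boundary_ohm / r0_ohm) ** (1.0 / alpha)

print(time_to_error(r0_ohm=1.0e5, alpha=0.02, r_boundary_ohm=3e5))  # ~7e23 s: huge
print(time_to_error(r0_ohm=2.5e5, alpha=0.10, r_boundary_ohm=3e5))  # ~6 s: tiny
```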
46
Resistance Drift – How it Happens
[Figure: distribution of cells (number of cells vs. resistance) across the four states 11, 10, 01, 00. A median-case cell (typical R0, typical α) drifts from R0 to Rt but stays within its state; a worst-case cell (high R0, high α) drifts across the next state boundary – ERROR!]
• The scrub rate will be dictated by the worst-case R0 and worst-case α
• Naive refresh/scrub will be extremely costly!
47
Architectural Solutions – Headroom
• Assumes support for Light Array Reads for Drift Detection (LARDDs) and ECC-N
• Headroom-h scheme – a scrub is triggered once N−h errors are detected (sketched below)
– Decreases the probability of errors slipping through
– Increases the frequency of full scrubs and hence decreases lifetime
– Gradual Headroom scheme: start with a long LARDD interval, and increase the LARDD frequency as errors accumulate
[Flow chart: every N cycles, read the line and check for errors; if the error count is still below N−h, continue; otherwise scrub the line.]
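A small simulation sketch of the Headroom-h loop from the flow chart above; the device behavior is faked with a random drift process and the interfaces are hypothetical.

```python
import random

# Sketch of the Headroom-h scheme; ECC-N can correct up to N errors per line, and a
# full scrub is triggered early, once N - h errors are seen by a light array read
# for drift detection (LARDD), rather than waiting for all N.
N, H = 8, 2
line_errors = 0                          # simulated count of drifted cells in one line

def lardd():
    """Light array read: cheap check that returns the current error count."""
    global line_errors
    line_errors += random.randint(0, 2)  # drift occasionally corrupts a cell
    return line_errors

def scrub():
    """Full ECC correct + re-write of the line, clearing all drifted cells."""
    global line_errors
    line_errors = 0

for cycle in range(20):                  # periodic LARDD cycles
    if lardd() >= N - H:
        scrub()
# A "gradual headroom" variant shortens the LARDD interval as errors accumulate.
```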
48
Reducing Overheads with a Circuit-Level Solution
• Invoking ECC on every LARDD increases energy consumption
• A parity-like error detection circuit is used to signal the need for a full-fledged ECC error detect (sketched below)
– The number of drift-prone states in each line is counted when the line is written into memory (a single bit records odd/even)
– At every LARDD, the parity is verified
• Reduces the need for an ECC read-compare at every LARDD cycle
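A sketch of that parity filter. Which MLC states count as drift prone (the intermediate levels here) and the 2-bit encoding are assumptions for illustration.

```python
# Sketch of the parity-like drift filter described above. The drift-prone states
# (intermediate levels 01 and 10) and the 2-bit MLC encoding are assumptions.
DRIFT_PRONE = {0b01, 0b10}

def drift_parity(cells):
    """One bit: parity of the number of cells in drift-prone states."""
    return sum(1 for c in cells if c in DRIFT_PRONE) & 1

line = [0b11, 0b01, 0b10, 0b00, 0b01]       # 2-bit MLC values written to a line
stored_parity = drift_parity(line)          # stored alongside the line at write time

line[1] = 0b00                              # later: one cell drifts out of state 01
if drift_parity(line) != stored_parity:     # cheap check at every LARDD
    print("parity mismatch: run the full ECC read-compare")
# Note: an even number of simultaneous drifts escapes this filter until the
# next full ECC check, which is why it only reduces (not replaces) ECC reads.
```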
49
More Solutions
• Precise Writes
– More write iterations to program the state closer to the mean, reducing the chance of drift
– Increases energy consumption and write time, and decreases lifetime!
• Non-Uniform Guardbanding
– Baseline: the resistance range is divided equally among all n states
– Expand the resistance range of drift-prone states at the expense of non-drift-prone ones
50
Results
[Chart: number of errors vs. LARDD interval (seconds).]
51
Conclusions
• Resistance drift will worsen with MLC scaling
• Naïve solutions based on ECC support are costly for PCM
– Increased write energy, decreased lifetimes
• Holistic solutions that counter drift at the device, architectural, and system levels need to be explored
– 39% reduction in energy, 4x fewer errors, 102x increase in lifetime
52
Handling Read/Write Parallelism
Nil Chatterjee
53
The Problem
• Writes are not on the critical path for program execution, but they can slow down reads through resource contention
• In future chipkill-correct systems, each data write will necessitate an update of the ECC codes, and the impact of writes will be more evident
• In PCM, the problem is exacerbated by the significantly longer write times
• Abstracting the writes away improves read latency by 48% in non-ECC DRAM systems
54
Impact of Writes on Reads
• Write draining affects read latencies by
– Increasing the queuing delay
– Reducing the read stream’s row-buffer locality
55
Bank Contention from Writes
• Reads are not scheduled in the middle of a write-queue (WQ) drain because that would require multiple bus turnarounds, incurring tWRT and tOST delays
• The data bus bandwidth is underutilized during WQ draining, leading to performance loss
• However, opportunities to schedule read accesses to idle banks may exist in this interval
56
Example
57
Solution: Increasing R/W Overlap
• During a WQ drain cycle, schedule partial reads to idle banks (sketched below)
– Following a column read command, the data is fetched from the sense amplifiers into a small (64-byte) buffer near the I/O pads
– Data is streamed out only after the WQ reaches the low watermark – no extra turnaround delays
• Immediately following the WQ drain, a flurry of prefetched reads can occupy the data bus
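A scheduling sketch of the partial-read idea, with no DRAM timing modeled; the queue shapes, register count, and bank bookkeeping are all illustrative.

```python
# Scheduling sketch of "partial reads" during a write-queue drain (illustrative only).
from collections import deque

PARTIAL_READ_REGS = 4                      # small 64-byte buffers near the I/O pads

def drain_writes(write_q, read_q, busy_banks):
    staged = []                            # reads latched in partial-read registers
    while write_q:
        bank, _wdata = write_q.popleft()
        busy_banks.add(bank)               # this write occupies its bank
        # Opportunistically issue column reads to banks the drain is not using.
        for req in list(read_q):
            if len(staged) == PARTIAL_READ_REGS:
                break
            if req["bank"] not in busy_banks:
                read_q.remove(req)
                staged.append(req)         # data waits in a register; bus untouched
    # After the drain (one bus turnaround), stream all staged reads back to back.
    return staged

writes = deque([(0, "w0"), (1, "w1")])
reads = [{"bank": 2, "addr": 0x10}, {"bank": 0, "addr": 0x20}]
print(drain_writes(writes, reads, busy_banks=set()))   # only the idle-bank read stages
```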
58
Solution : Increasing R/W overlap.
59
Impact
• A small pool of partial-read registers can help increase data bus utilization after writes
• In a PCM system, where writes are very expensive, partial reads can have an even higher impact
• The JEDEC standard must be augmented to support a partial read command
60
Organizing Channels for High Capacity
Kshitij Sudan
Increasing DRAM Capacity by Re-Architecting the Memory Channel
• Increase DRAM capacity, while minimizing power
• Re-architect the CPU-to-DRAM channel
– Study effects of bus width and protocol (serial vs. parallel)
– CMPs may have changed the playing field!
61
62
Increasing DRAM Capacity by Re-Architecting Memory Channel
• Organize modules as binary tree, and move some MC functionality to “Buffer Chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst case latency, and improves signal integrity
• Buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case access latency like FB-DIMM
• NUMA like DRAM access – leverage data mapping
63
Handling DMA Scheduling
Kshitij Sudan
Handling DMA Scheduling
• Reduce conflicts between CPU-generated DRAM requests and DMA-generated DRAM requests
64
Handling DMA Scheduling
• Study interference from DMA requests on CPU-generated DRAM requests
– With on-chip MCs, it is unclear how DMA requests compete with CPU requests
• Devise scheduling policies to minimize DMA and CPU access conflicts
• Infer how DMA and CPU requests are arbitrated at the MC
– No CPU manufacturer documentation is available publicly!
65
66
Variable Rank Subsetting
Seth Pugsley
Motivation for Rank Subsetting
• Rank Subsetting
– Split up a rank and its data channel into multiple, smaller ranks and data channels
• Prior motivations: reduce dynamic energy and overfetch
67
Rank Size Options
Standard 8 chip-wide rank1x64-bit data bus2 banks1x8KB row buffer64 byte cache line in 8 clock edgesAll transfers sequential
4 chip-wide narrow rank2x32-bit data buses4 banks2x4KB row buffers64 byte cache line in 16 clock edgesCan transfer 2 cache lines in parallel
1 chip-wide narrow rank8x8-bit data buses16 banks8x1KB row buffers64 byte cache line in 64 clock edgesCan transfer 8 cache lines in parallel
2 chip-wide narrow rank4x16-bit data buses8 banks4x2KB row buffers64 byte cache line in 32 clock edgesCan transfer 4 cache lines in parallel
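All four rows above follow one pattern, so they can be derived from the chips-per-rank count. The sketch assumes the slide's baseline (a 64-bit channel, 2 banks and an 8 KB row buffer per full rank, 64-byte cache lines).

```python
# Derive the rank-subsetting configurations above from the chips-per-rank count,
# assuming a 64-bit channel, a 2-bank / 8 KB-row baseline rank, and 64-byte lines.
def rank_subset(chips_per_rank, total_chips=8, base_banks=2, base_row_kb=8):
    subsets = total_chips // chips_per_rank
    bus_bits = 64 // subsets                    # each narrow rank gets a slice of the bus
    clock_edges = (64 * 8) // bus_bits          # edges to move one 64-byte line
    return {"data buses": f"{subsets}x{bus_bits}-bit",
            "banks": base_banks * subsets,
            "row buffers": f"{subsets}x{base_row_kb // subsets}KB",
            "edges per line": clock_edges,
            "lines in parallel": subsets}

for chips in (8, 4, 2, 1):
    print(chips, rank_subset(chips))
```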
68
Impact on Queuing Delay
[Timing diagram: each core access occupies its bank for 16 cycles and is followed by a 4-cycle data-bus (DB) transfer.]
Behavior with a single bank: data bus utilization of 25%
69
Impact on Queuing Delay
[Timing diagram: with a single bank, each 16-cycle access produces one 4-cycle data-bus transfer; with two banks, accesses from Core 0 and Core 1 overlap, fitting two transfers into the same 16 cycles.]
Behavior with a single bank: data bus utilization of 25%
Behavior with two banks: data bus utilization of 50% (see the quick calculation below)
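The utilization numbers follow from the figures shown (a 16-cycle bank access, a 4-cycle data burst), under the idealized assumption that accesses to different banks overlap perfectly:

```python
# Quick check of the utilization numbers above: a 16-cycle bank access and a 4-cycle
# data-bus burst, with accesses to different banks fully overlapped (idealized).
def bus_utilization(banks, access_cyc=16, burst_cyc=4):
    return min(1.0, banks * burst_cyc / access_cyc)

print(bus_utilization(1))   # 0.25 -> the 25% single-bank case
print(bus_utilization(2))   # 0.50 -> the 50% two-bank case
print(bus_utilization(4))   # 1.0  -> four banks would saturate the data bus
```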
70
Advantages of Rank Subsetting
• More open rows
– Each open row is narrower (still OK hit rates)
• Reduced queuing delay
– More banks available and better data bus utilization
71
Performance for Static Rank Subsetting
72
Variable Rank Subsetting
• Use a different rank size for each memory op
– e.g., a 1-wide transaction on the data bus at the same time as 2-wide and 4-wide transactions
– Scheduling can get pretty hairy
– Many wasted data bus slots
[Figure: data-bus lanes D0–D7 over time, packed with 1-wide, 2-wide, 4-wide, and 8-wide transfers and many wasted slots.]
73
More Sensible Variable Rank Subsetting
• Still can use a different rank size for each memory op
• Limit rank size to only 2 options
– Software chooses the mode for newly allocated pages
– Scheduling is much easier than in the previous example
[Figure: data-bus lanes D0–D7 over time, now carrying only 4-wide and 8-wide transfers with few wasted slots.]
74
75
Exploiting Rank Subsetting to Alleviate Thermal Constraints
Manju Shevgoor
76
The Problem – DRAM is Getting Hot
• DRAM temperatures can rise up to 95°C
• The refresh rate needs to double once DRAM crosses 85°C
• Thermal emergencies due to elevated temperatures adversely affect performance
• Cooling systems are expensive
[Figures: full-DIMM heat spreader, Zhu et al., ITHERM’08; typical cooling system, Liu et al., HPCA’11]
77
Current Thermal Throttling Techniques
• CPU throttling – reduces overall activity
• Thermal shutdown – stops all requests to over-heated chips
• Memory bandwidth throttling – lowers channel bandwidth to reduce DRAM activity
• All DRAM chips are affected by these techniques irrespective of their temperature
• Even cool chips, which could otherwise be operating at optimal throughput, are also throttled
78
Refresh Overhead
[Chart: IPC degradation due to refresh – Elastic Refresh, Stuecheli et al., MICRO’10]
• As memory chips get denser, this problem only worsens
• Integer workloads can suffer up to 13% IPC degradation because of refresh
• Chips operating in the Extended Temperature Range will see larger IPC degradation
79
Temperature Profile Along a DIMM
• Proximity to the hot processor results in unequal temperatures
• Position with respect to airflow also impacts temperature
• The temperature difference between the hottest and coolest chips can be 10°C
[Figures: typical temperature profile along an RDIMM, Zhu et al., ITHERM’08; typical cooling system, Liu et al., HPCA’11]
80
Baseline
• All chips are grouped into 1 rank
• Not all chips are ‘HOT’
• Not all chips need to be throttled!
[Figure: baseline rank organization – the buffer and all DRAM chips on the DIMM form a single Rank 1.]
81
Proposed Solution
[Figure: proposed rank organization – the DIMM is statically split into multiple ranks based on temperature, from Rank 1 (coolest) to the warmest rank.]
• Not all ranks are equally hot, so penalize only the hottest ranks
• Control the refresh rate at rank granularity
– Only the hottest chips are refreshed every 32 ms; the rest can be refreshed every 64 ms
82
Fine-Grained DRAM Throttling
• Need a throttling mechanism that can be applied at a finer granularity
• Temperature-aware cache replacement (sketched below)
– Modify LRU to preferentially evict lines belonging to cool ranks
– Will reduce activity only in hot ranks
[Figure: cache ways ordered from MRU to LRU, each line tagged with its rank (R1–R4) – decrease activity ONLY in hot ranks.]
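A sketch of the temperature-aware replacement policy: among the least recently used candidates, pick a victim that maps to a cool rank, so lines from hot ranks stay cached and hot-rank DRAM traffic drops. The data structures and rank lookup are illustrative.

```python
# Sketch of temperature-aware cache replacement (illustrative structures only):
# scanning from the LRU end, prefer a victim whose data maps to a COOL rank, so
# hot-rank lines stay cached and hot-rank DRAM activity is reduced.
HOT_RANKS = {3, 4}                      # ranks currently above the temperature threshold

def pick_victim(lru_ordered_lines, rank_of):
    """lru_ordered_lines: lines of one cache set, most recently used first."""
    for line in reversed(lru_ordered_lines):         # scan from the LRU end
        if rank_of(line) not in HOT_RANKS:
            return line                               # cool-rank victim found
    return lru_ordered_lines[-1]                      # all map to hot ranks: plain LRU

lines = ["a", "b", "c", "d"]                          # MRU ... LRU
rank = {"a": 3, "b": 1, "c": 4, "d": 4}
print(pick_victim(lines, rank_of=lambda l: rank[l]))  # evicts "b" (rank 1 is cool)
```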
83
Rank-wise Refresh
[Figure: the DIMM split into ranks from Rank 1 (coolest) to the warmest; only the warmest rank operates in the Extended Temperature Range, the rest in the Normal Temperature Range.]
• Refresh only as fast as needed
• Only ranks operating in the Extended Temperature Range are refreshed every 32 ms
• Ranks operating in the Normal Temperature Range are refreshed every 64 ms
84
Summary
• Split the DIMM into mini-ranks and model the temperature of the chips
• Throttle the activity of hot ranks – keeps the chips from reaching high temperatures
• Increase the refresh rate of hot ranks only – maintains data integrity of chips once they do get hot
• Penalize hot ranks ONLY!!
85
Summary
• Converging technology trends require an overhaul of main memory architectures
• Multi-pronged approach required for significant improvements: memory chip, controller, interface, OS
• Future memory chips must also optimize for energy and reliability, and not just latency and density
• Publications: http://www.cs.utah.edu/arch-research/
86
Acknowledgments
• Collaborators at HP Labs, IBM, Intel
• Funding from NSF, Intel, HP, University of Utah
• Thanks for hosting!