towards scalable and energy-efficient memory system architectures

Towards Scalable and Energy-Efficient Memory System Architectures

Rajeev Balasubramonian, Al Davis,Ani Udipi, Kshitij Sudan,

Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor

School of ComputingUniversity of Utah

Towards Scalable and Energy-Efficient Memory System Architectures

Convergence of Technology Trends

Energy

Reliability

New MemoryTechnologies

BW, Capacity, and Locality for

Multi-Cores

Overhaul of main memory architecture!

High Level Approach

• Explore changes to memory chip microarchitecture Must cause minimal disruption to density

• Explore changes to interfaces and standards Major change appears inevitable!

• Explore system and memory controller innovations Most attractive, but order-of-magnitude improvement unlikely

Design solutions that are technology-agnostic

Projects

Memory Chip

• Reduce overfetch

• Support reliability

• Handle PCM drift

• Promote read/write parallelism

Memory Interface

• Interface with photonics

• Organize channel for high capacity

Memory Controller

• Maximize use of row buffer

• Schedule for low latency and energy

• Exploit mini-ranks

DIMM…

Talk Outline

Mature work:• SSA architecture – Single Subarray Access (ISCA’10)• Support for reliability (ISCA’10)• Interface with photonics (ISCA’11)• Micro-pages – data placement for row buffer efficiency (ASPLOS’10)• Handling multiple memory controllers (PACT’10)• Managing resistance drift in PCM cells (NVMW’11)

Preliminary work:• Handling read/write parallelism• Enabling high capacity• Handling DMA scheduling• Exploiting rank subsetting for performance and thermals

Minimizing Overfetch with SingleSubarray Access

Ani Udipi

DIMM…

Primary Impact

Problem 1 - DRAM Chip Energy

• On every DRAM access, multiple arrays in multiple chips are activated

• Was useful when there was good locality in access streams– Open page policy

• Helped keep density high and reduce cost-per-bit• With multi-thread, multi-core and multi-socket systems,

there is much more randomness– “Mixing” of access streams when finally seen by the memory

controller

Rethinking DRAM Organization

• Limited use for designs based on locality• As much as 8kbytes read in order to service a 64byte cache

line request• Termed “overfetch”

– Substantially increases energy consumption• Need a new architecture that

– Eliminates overfetch– Increases parallelism– Increases opportunity for power-down– Allows efficient reliability

Proposed Solution – SSA Architecture

MEMORY CONTROLLER

ADDR/CMD BUS

64 Bytes

Subarray

Bitlines

Row buffer

Global Interconnect to I/O

ONE DRAM CHIP

8 8 8 8 8 88DATA BUS

SSA Basics

• Entire DRAM chip divided into small “subarrays”

• Width of each subarray is exactly one cache line

• Fetch entire cache line from a single subarray in a single DRAM chip – SSA

• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low

• Close page policy and “posted-RAS”

• Data bus to processor essentially split into 8 narrow buses

SSA Architecture Impact

• Energy reduction– Dynamic – fewer bitlines activated– Static – smaller activation footprint – more and longer spells of

inactivity – better power down

• Latency impact– Limited pins per cache line – serialization latency– Higher bank-level parallelism – shorter queuing delays

• Area increase– More peripheral circuitry and I/O at finer granularities – area

overhead (< 5%)

Area Impact

• Smaller arrays – more peripheral overhead• More wiring overhead in the on-chip interconnect between

arrays and pin pads• We did a best-effort area impact calculation using a

modified version of CACTI 6.5– Analytical model, has its limitations

• More feedback in this specific regard would be awesome!• More info on exactly where in the hierarchy overfetch stops

would be great too

Support for Chipkill Reliability

Ani Udipi

DIMM…

Primary Impact

Problem 2 – DRAM Reliability

• Many server applications require chipkill-level reliability – failure of an entire DRAM chip

• One example of existing systems– Consider baseline 64-bit word plus 8-bit ECC – Each of these 72 bits must be read out of a different chip,

else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!

– Reading 72 chips - significant overfetch!• Chipkill even more of a concern for SSA since entire cache

line comes from a single chip

Proposed Solution

Approach similar to RAID-5

L0 C L1 C L2 C L3 C L4 C L5 C L6 C L7 C P0 C

L9 C L10 C L11 C L12 C L13 C L14 C L15 C P1 C L8 C..

C L56 C L57 C L58 C L59 C L60 C L61 C L62 C L63 C

DRAM DEVICE

L – Cache Line C – Local Checksum P – Global Parity

Chipkill design

• Two-tier error protection• Tier - 1 protection – self-contained error detection

– 8-bit checksum/cache line – 1.625% storage overhead– Every cache line read is now slightly longer

• Tier -2 protection – global error correction– RAID-like striped parity across 8+1 chips– 12.5% storage overhead

• Error-free access (common case)– 1 chip reads– 2 chip writes – leads to some bank contention – 12% IPC degradation

• Erroneous access– 9 chip operation

Questions

• What are the common failure modes in DRAM? PCM?• Do entire chips fail?• Do parts of chips fail?

– Which parts? Bitlines? Wordlines? Capacitors?– Entire arrays?– Entire banks?– I/O?

• Should all these failures be handled the same way?

Designing Photonic Interfaces

Ani Udipi

DIMM…

Primary Impact

Problem 3 – Memory interconnect

• Electrical interconnects are not scaling well– Where can photonics make an impact, both on energy and

performance?

• Various levels in the DRAM interconnect– Memory cell to sense-amp - addressed by SSA– Row buffer to I/O – currently electrical (on-chip)– I/O pins to processor – currently electrical (off-chip)

• Photonic interconnects– Large static power component – laser/ring tuning– Much lower dynamic component – relatively unaffected by distance

• Electrical interconnects– Relatively small static component– Large dynamic component

• Cannot overprovision photonic bandwidth, use only where necessary

Consideration 1 – How much photonics on a die?

Consideration 2 - Increasing Capacity

• 3D stacking is imminent• There will definitely be several dies on the channel

– Each die has photonic components that are constantly burning static power

– Need to minimize this!• TSVs available within a stack; best of both worlds

– Large bandwidth– Low static energy– Need to exploit this!

Proposed Design

Processor DIMMWaveguide

DRAM chips

Photonic Interface die +

Stack controller

Memory controller

Proposed Design – Interface Die

• Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies

– Use photonics where there is heavy utilization – shared bus between processor and interface die i.e. the off-chip interconnect

– Helps break pin barrier for efficient I/O, substantially improves socket-edge BW

– On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs

Advantages of the proposed system

• Reduction in energy consumption– Fewer photonic resources, without loss in performance– Rings, couplers, trimming

• Industry considerations– Does not affect design of commodity memory dies– Same memory die can be used with both photonic and

electrical systems– Same interface die can be used with different kinds of

memory dies – DRAM, PCM, STT-RAM, Memristors

Problem 4 – Communication Protocol

• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface

– Need to handle heterogeneous memory modules, each with its own maintenance requirements, further complicates scheduling

– Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing)

– Heavy pressure on address/command bus – several commands to micro-manage every operation of the DRAM

– Several independent banks – need to maintain large amounts of state to schedule requests efficiently

– Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction

Proposed Solution – Packet-based interface

• Release most of the tight control memory controller holds today• Move mundane tasks to the memory modules themselves (on

the interface die) - make them more autonomous– maintenance operation (refresh, scrub, etc.)– routine operations (DRAM precharge, NVM wear handling)– timing control (DRAM alone has almost 20 different timing

constraints to be respected)– coding and any other special requirements

• Only information the memory module needs is the address and read/write identification, time slots reserved apriori for data return

Advantages

• Better interoperability, plug and play– As long as the interface die has the necessary information,

everything in interchangeable• Better support for heterogeneous systems

– Allows easier data movement between DRAM and NVM for example, on the same channel

• Reduces memory controller complexity• Allows innovation and value addition in the memory,

without being constrained by processor-side support• Reduces bit transport energy on the address/command bus

Data Placement with Micro-PagesTo Boost Row Buffer Utility

Kshitij Sudan

DIMM…

Primary Impact

DRAM Access Inefficiencies• Over fetch due to large row-buffers

• 8 KB read into row buffer for a 64 byte cache line• Row-buffer utilization for a single request < 1%

• Diminishing locality in multi-cores• Increasingly randomized memory access stream• Row-buffer hit rates bound to go down

• Open page policy and FR-FCFS request scheduling• Memory controller schedules requests to open row-buffers first

GoalImprove row-buffer hit-rates for Chip Multi-Processors

Key ObservationGather all heavily accessed chunks of independent OS

pages and map them to the same DRAM rowCache Block Access Pattern Within OS Pages

For heavily accessed pages in a given time interval,accesses are usually to a few cache blocks

Basic Idea

Hottest micro-pages

1 KB micro-pages

Coldest micro-pages

4 KB OS Pages

DRAM Memory

Reserved DRAM Region

Hardware Implementation (HAM)

PhysicalAddress

New addr . Y

4 GB Main MemoryCPU Memory Request

4 MB ReservedDRAM region

X Page A

Mapping Table

Old Address New Address

BaselineHardware Assisted Migration (HAM)

Results5M cycle EPOCH, ROPS, HAM and ORACLE

Apart from average 9% performance gains, our schemes also save DRAM energy at the same time!

Percent change in

performance

Conclusions• On average, for applications with room for improvement and

with our best performing scheme• Average performance ↑ 9% (max. 18%)• Average memory energy consumption ↓ 18% (max. 62%). • Average row-buffer utilization ↑ 38%

• Hardware assisted migration offers better returns due to fewer overheads of TLB shoot-down and misses

Data Placement AcrossMultiple Memory Controllers

Kshitij Sudan

DIMM…

Primary Impact

DRAM NUMA Latency

Core 1 Core 2

Core 3 Core 4

DIMM DIMM DIMM

Core 1 Core 2

Core 3 Core 4

DIMM DIMM DIMM

Core 1 Core 2

Core 3 Core 4

DIMM DIMM DIMM

Core 1 Core 2

Core 3 Core 4

DIMM DIMM DIMMQPI

MC On-Chip Memory Controller

QPI Interconnect

Memory Channel

DIMM DRAM (DIMMs)

Socket Boundary

Problem Summary

• Pin limitations → increasing queuing delay• Almost 8x increase in queuing delays from single core/one

thread to 16 cores/16 threads

• Multi-cores → increasing row-buffer interference• Increasingly randomized memory access stream

• Longer on- and off-chip wire delays → increasing NUMA factor• NUMA factor already at 1.5x today

GoalImprove application performance by reducing queuing

delays and NUMA latency 38

Policies to Manage Data Placement Among MCs

• Adaptive First Touch• Assign new virtual pages to a DRAM (physical) page belonging to

MC(j) minimizing the a cost function

• Dynamic Page Migration• Programs change phases → Imbalance in MC load• Migrate pages between MCs at runtime

• Integrating Heterogenous Memory Technologies

cost j = α x loadj + β x rowhitsj + λ x distancej

costk = Λ * distancek + Γ * rowhitsk

cost j = α x loadj + β x rowhitsj + λ x distance + Ƭ x LatencyDimmClusterj + µ x Usagej

Summary

• Multiple on-chip MCs will be common in future CMPs• Multiple cores sharing one MC, MCs controlling different

types of memories• Intelligent data mapping needed

• Adaptive First Touch policy (AFT)• Increases performance by 6.5% in homogeneous and by

1.6% in DRAM – PCM hierarchy.

• Dynamic page migration, improvement on AFT• Further improvement over AFT - 8.9% over baseline in

homogeneous, and by 4.4% in best performing DRAM-PCM hierarchy.

Managing Resistance Drift inPCM Cells

Manu Awasthi

DIMM…

Primary Impact

Quick Summary

• Multi level cells in PCM appear imminent• A number of proposals exist to handle hard errors

and lifetime issues of PCM devices• Resistance Drift is a less explored phenomenon

– Will become increasingly significant as number of levels/cell increases – primary cause of “soft errors”

– Naïve techniques based on DRAM-like refresh will be extremely costly for both latency and energy

– Need to explore holistic solutions to counter drift

What is Resistance Drift?

Resistance

Time11 10 01 00

ERROR!!

Crystalline Amorphous

Resistance Drift DataCell Type Drift Time at Room

temperature (secs)

Median 11 cell 10499

Worst 11 Case cell 1015

Median 10 cell 1024

Worst Case 10 cell 5.94

Median 01 cell 108

Worst Case 01 cell 1.81

(11) (00)(10) (01)

Resistance Drift - Issues

• Programmed resistance drifts according to power law equation -

• R0, α usually follow a Gaussian distribution• Time to drift (error) depends on

– Programmed resistance (R0), and – Drift Coefficient (α)– Is highly unpredictable!!

Rdrift(t) = R0 х (t)α

Resistance Drift - How it happens

11 10 01 00

Median case cell• Typical R0

• Typical α

Worst case cell• High R0

• High α

Scrub rate will be dictated by the Worst Case R0 and Worst Case α

Naive refresh/scrub will be extremely costly!

Drift Drift

R0 R0Rt Rt

ERROR!!

Number of Cells

Architectural Solutions - Headroom• Assumes support for Light Array

Reads for Drift Detection (LARDDs) & ECC-N

• Headroom-h scheme – scrub is triggered if N-h errors are detected

† Decreases probability of errors slipping through

– Increases frequency of full scrub and hence decreases life time

– Gradual Headroom scheme : Start with large LARDD frequency, increase frequency as errors increase

Read Line

Check for Errors

Errors < N-h

Scrub Line

After N cycles

Reducing Overheads with Circuit Level Solution

• Invoking ECC on every LARDD increases energy consumption

• Parity – like error detection circuit is used to signal the need for a full fledged ECC error detect – Number of Drift Prone States in each line are counted

when the line is written into memory (single bit represents odd/even)

– At every LARDD, parity is verified• Reduces need for ECC read-compare at every LARDD

48(11) (00)(10) (01)

towards scalable and energy-efficient memory system architectures

memory chip microarchitecture

projects memory chip

dram access

memory controller innovations

ssa basicsentire dram

dram chip energyon

density high

reliability isca10 interface

Documents

scalable web architectures and infrastructure

cloud-based architectures for auto-scalable web geoportals

building scalable, flexible enterprise architectures...

scalable parallel flash firmware for many-core architectures

sitara arm industrial communication -...

designing scalable architectures with mysql proxy

scalable internet architectures · who am i? @postwait on...

highly scalable-architectures

best practices for building scalable visibility...

scalable architectures 101 -...

building scalable, flexible enterprise architectures...

building scalable web architectures

mysql reference architectures for massively scalable … ·...

scalable web architectures: common patterns and approaches

scalable graphics architectures: interface & texture ·...

velocity 2010: scalable internet architectures

ash: a substrate for scalable architectures

dissecting scalable database architectures

scalable parallel computing on clouds : efficient and...

scalable interconnects for reconfigurable spatial...