Optimizing DRAM Based Main Memories Using Intelligent Data Placement
Ph.D. Thesis Proposal
Kshitij Sudan
Thesis Statement
Improving DRAM access latency, power consumption, and capacity by leveraging intelligent data placement.
Overview
(Diagram: CPU with memory controller (MC) connected to DIMMs over a memory interconnect.)
• Memory Interconnect: narrow, buffered channels to increase capacity (proposed work)
• Memory Controller: maximize DRAM row-buffer utility (Micro-Pages, ASPLOS 2010)
• System Re-design: increasing capacity within a fixed power budget (Tiered Memory, under review)
RE-ARCHITECTING MEMORY CHANNELS
Proposed Work
Challenges in Increasing DRAM Capacity
• Slow growth in CPU pin count limits the number of memory channels
• Signal integrity limits capacity per channel
– Use serial, point-to-point links
• Drawbacks of using serial, point-to-point links
– Increased latency due to signal re-conditioning
– Memory controller complexity limits resource use
Increasing DRAM Capacity by Re-Architecting Memory Channel
• Re-architect the CPU-to-DRAM channel
• Many skinny, serial channels vs. few, wide buses
• CMPs might have changed the playing field
• Improved signal integrity due to re-conditioning
• New channel topology to reduce latency
• Study effects of channel frequency
Re-Architecting Memory Channel
Organize modules as a binary tree, and move some MC functionality to a “Buffer Chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst-case latency, improves signal integrity
• Buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case latency like FB-DIMM
• NUMA-like DRAM access – leverage data mapping
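The depth reduction above can be sketched by comparing worst-case module hops for a daisy chain and a balanced binary tree (a sketch with hypothetical module counts, not measurements from the proposal):

```python
import math

def chain_depth(n_modules):
    # Daisy-chained modules (FB-DIMM style): a worst-case request
    # traverses every module, so depth grows as O(n).
    return n_modules

def tree_depth(n_modules):
    # Balanced binary tree of buffer chips: worst-case hops equal the
    # tree height, which grows as O(log n).
    return math.ceil(math.log2(n_modules + 1))

for n in (4, 16, 64):
    print(f"{n} modules: chain depth {chain_depth(n)}, tree depth {tree_depth(n)}")
```

At 64 modules the worst case drops from 64 hops to 7, which is why the tree topology is not limited by worst-case latency the way a deep chain is.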
MICRO-PAGES
Past Work
Increasing Row-Buffer Utility with Data Placement
• Overfetch due to large row-buffers
– 8 KB read into the row buffer for a 64-byte cache line
– Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
– Increasingly randomized memory access stream
– Row-buffer hit rates bound to go down
• Open-page policy and FR-FCFS request scheduling
– Memory controller schedules requests to open row-buffers first
Goal: Improve row-buffer hit rates for CMPs
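The sub-1% utilization figure follows directly from the sizes quoted above (a sketch, assuming a single 64-byte line is consumed per row activation):

```python
# Row-buffer utilization for a single cache-line request, using the
# sizes quoted in the slide: 8 KB row buffer, 64-byte cache line.
ROW_BUFFER_BYTES = 8 * 1024
CACHE_LINE_BYTES = 64

utilization = CACHE_LINE_BYTES / ROW_BUFFER_BYTES
print(f"{utilization:.2%}")  # prints "0.78%"
```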
Key Observation
Post-L2 Cache Block Access Pattern Within OS Pages
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks
Basic Idea
(Figure: 4 KB OS pages in DRAM memory are divided into 1 KB micro-pages; the hottest micro-pages are co-located in a reserved DRAM region, while the coldest micro-pages remain in regular DRAM memory.)
Hardware Implementation (HAM)
(Figure: in the baseline, a CPU memory request with physical address X goes directly to the 4 GB main memory. With Hardware Assisted Migration (HAM), a Mapping Table translates old address X to new address Y, and migrated data is served from a 4 MB reserved DRAM region.)
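A minimal sketch of the HAM indirection, assuming a hypothetical MappingTable interface and the 1 KB micro-page granularity from the talk (the real table lives in the memory controller, not software):

```python
MICRO_PAGE = 1024  # 1 KB micro-page granularity (from the talk)

class MappingTable:
    """Sketch of HAM's old-address -> new-address indirection.
    Interface names are hypothetical."""
    def __init__(self):
        self.remap = {}  # micro-page frame -> frame in the reserved region

    def migrate(self, old_addr, new_frame):
        # Record that the hot micro-page holding old_addr now lives in
        # reserved-region frame new_frame.
        self.remap[old_addr // MICRO_PAGE] = new_frame

    def translate(self, addr):
        frame = addr // MICRO_PAGE
        if frame in self.remap:  # hot micro-page moved to reserved region
            return self.remap[frame] * MICRO_PAGE + addr % MICRO_PAGE
        return addr  # cold data stays at its baseline address

t = MappingTable()
t.migrate(0x42 * MICRO_PAGE, 7)                  # hot micro-page -> frame 7
print(hex(t.translate(0x42 * MICRO_PAGE + 16)))  # prints "0x1c10" (7*1024+16)
print(hex(t.translate(0x9000)))                  # unmapped: prints "0x9000"
```

Because only the micro-page frame bits are rewritten, the offset within the 1 KB micro-page passes through unchanged, so translation is a single table lookup on the request path.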
Conclusions
• On average, for applications with room for improvement, and with our best-performing scheme:
– Average performance ↑ 9% (max. 18%)
– Average memory energy consumption ↓ 18% (max. 62%)
– Average row-buffer utilization ↑ 38%
• Hardware-assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses
TIERED MEMORY
Past Work
Increase DRAM Capacity in a Fixed Power Budget
• DRAM power budget increasing steadily with increases in capacity
– Memory power budget in large systems is already close to 50% of the total power budget
• DRAM low-power modes are hard to use in current systems
– Granularity at which low-power modes operate (a DRAM rank)
– Data placement to increase bandwidth reduces opportunities to place ranks in low-power modes
DRAM Power Mgmt. Challenges
• DRAM supports low-power modes, but they are not easy to exploit:
– Granularity at which memory can be put in a low-power mode is large
– Random distribution of memory accesses across ranks
• Memory interleaving
• Little coordination between memory managers (library, OS, and hypervisor)
– As a result, no rank experiences sufficient idleness to warrant being placed in a low-power mode
Few systems can exploit DRAM low-power modes aggressively
Tiered Memory
• Accesses to 4 KB OS pages show a step curve
• Leverage this to place frequently accessed pages in active-mode DRAM ranks
• Place “cold” pages in low-power-mode ranks
Iso-Power Tiered Memory-I
• A DRAM rank in self-refresh mode consumes ~15% of the power of an idle rank in active mode
– 1 rank in active-idle mode ≈ 6 ranks in self-refresh
• By maintaining most of the memory in a low-power mode, systems can be built with much larger memory capacity in the same power budget
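As a back-of-the-envelope illustration of this trade-off, using only the ~15% idle-power ratio above: the full analytical model also accounts for access rate, service rate, and bandwidth, so the iso-power configurations in the backup slides (e.g., 4 hot + 12 cold ranks) are smaller than this idle-power-only upper bound.

```python
COLD_PCT = 15        # self-refresh idle power, as % of an active-idle rank
BASELINE_RANKS = 8   # all-active baseline from the talk

def max_cold_ranks(n_hot):
    # Largest number of self-refresh ranks whose idle power, plus n_hot
    # active ranks, fits the 8-active-rank baseline budget. Integer math
    # in "percent of one active rank" units avoids float rounding.
    budget_pct = BASELINE_RANKS * 100
    return (budget_pct - n_hot * 100) // COLD_PCT

for n_hot in (2, 3, 4):
    cold = max_cold_ranks(n_hot)
    print(f"{n_hot} hot + {cold} cold = {n_hot + cold} ranks total")
```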
Iso-Power Tiered Memory-II
• 2 tiers of DRAM with heterogeneous power and performance characteristics
– “Hot” tier DRAM is always available; “cold” tier DRAM uses the self-refresh low-power mode when idle
• Place frequently accessed data in the hot tier
– Maintains performance
– Fewer accesses to the cold tier -> reduced power
• Batch references to the cold tier:
– Amortize entry/exit overheads of the low-power mode
– Stay in the low-power mode longer
Intelligent Data Placement
• Counters track hot pages with low overhead
• Every epoch, migrate hot pages from low-power ranks to active ranks
– Requires page-table updates and TLB flushes
– Still low overhead: after the first few epochs, the hot-page set changes little
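The counter-based hot-page selection can be sketched as follows (a toy model: the counter hardware, epoch length, and hot-set size are assumptions, not details from the proposal):

```python
from collections import Counter

HOT_SET_SIZE = 2   # pages promoted to active ranks each epoch (hypothetical)

def hot_pages(trace, k=HOT_SET_SIZE):
    # Toy stand-in for per-page activity counters: pick the k
    # most-accessed page frames seen during the epoch.
    return {page for page, _ in Counter(trace).most_common(k)}

epoch_trace = [1, 1, 1, 7, 7, 3, 1, 7]   # page frames touched this epoch
print(sorted(hot_pages(epoch_trace)))    # prints "[1, 7]"
```

Because the hot set is recomputed only once per epoch and changes little after the first few epochs, the page-table updates and TLB flushes it triggers stay infrequent.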
Servicing cold-tier requests in batches
• Buffer cold-tier requests at the memory controller
• Delay any request by at most t_g, which prevents starvation
• t_g chosen to amortize the overheads of low-power-mode entry/exit
• Requires minimal change to the memory controller
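A minimal sketch of this batching policy, with a hypothetical t_g value and request format (the real logic sits in memory-controller hardware):

```python
T_G = 100  # max queueing delay before a batch must be serviced (hypothetical)

class ColdTierQueue:
    """Sketch of batching cold-tier requests at the memory controller:
    cold ranks exit self-refresh once per batch, not once per request."""
    def __init__(self):
        self.buf = []  # list of (arrival_time, request)

    def enqueue(self, now, req):
        self.buf.append((now, req))

    def drain_if_due(self, now):
        # Service the whole batch once the oldest request reaches age t_g,
        # so no request waits longer than t_g (starvation bound).
        if self.buf and now - self.buf[0][0] >= T_G:
            batch, self.buf = [r for _, r in self.buf], []
            return batch
        return []

q = ColdTierQueue()
q.enqueue(0, "rd A")
q.enqueue(40, "rd B")
print(q.drain_if_due(50))    # prints "[]" (oldest request only 50 old)
print(q.drain_if_due(100))   # prints "['rd A', 'rd B']"
```

Draining on the oldest request's age is what bounds the worst-case delay at t_g while still amortizing one self-refresh exit across the whole batch.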
Attributions
• Re-architecting memory channel: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
• Micro-Pages: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
• Tiered Memory: Karthick Rajamani, Wei Huang, John Carter, Freeman Rawson
Thanks
Questions?
Backup Slides
Other Work
• Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches - Manu Awasthi, Kshitij Sudan, Rajeev Balasubramonian, John Carter, HPCA, February 2009.
• Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality of Service - Kshitij Sudan, Sadagopan Srinivasan, Rajeev Balasubramonian, Ravi Iyer, Under Review.
• A Novel System Architecture for Web-Scale Applications Using Lightweight CPUs and Virtualized I/O - Kshitij Sudan, Saisanthosh Balakrishnan, Sean Lie, Min Xu, Dhiraj Mallick, Rajeev Balasubramonian, Gary Lauterbach, Under Review.
• Data Locality Optimization of Pthread Applications for Non-Uniform Cache Architectures - Gagan S. Sachdev, Kshitij Sudan, Rajeev Balasubramonian, Mary Hall, Under Review.
• Efficient Scrub Mechanisms for Error-Prone Emerging Memories - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Bipin Rajendran, Rajeev Balasubramonian, Viji Srinivasan, To Appear at HPCA-18, Feb 2012.
• Hadoop Jobs Require One-Disk-per-Core, Myth or Fact? - Kshitij Sudan, Min Xu, Sean Lie, Saisanthosh Balakrishnan, Gary Lauterbach, XLDB-5 Lightning Talk, Oct. 2011.
• Handling PCM Resistance Drift with Device, Circuit, Architecture, and System Solutions - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Rajeev Balasubramonian, Bipin Rajendran, Viji Srinivasan, Non-Volatile Memory Workshop, March 2011.
• Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers - Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis, PACT, September 2010.
• Improving Server Performance on Multi-Cores via Selective Off-loading of OS Functionality - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, WIOSCA, June 2010.
• Hardware Prediction of OS Run-Length For Fine-Grained Resource Customization - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, ISPASS-2010, March 2010.
Iso-Power Memory Configurations
(Charts: number of tiered ranks vs. idle power ratio (hot/cold) from 5 to 11, for Nh = 2, 3, or 4 hot ranks, at hot-tier access fractions u = 0.5, 0.7, and 0.9. Example iso-power points: 4 hot + 12 cold ranks gives 2X the baseline capacity; 2 hot + 22 cold ranks gives 3X.)
• 8 active ranks in the baseline
• Model inputs: ratio of idle active and self-refresh power, fraction (u) of memory requests served by hot ranks, service rate, and bandwidth
Analytical model determines iso-power configurations for a given access rate to the active-mode (“hot”) DRAM ranks
Tiered Memory: Iso-Power Memory Architecture to Address Memory Power Wall
• Build tiers out of DRAM ranks
• Aggressively use low-power (LP) modes
• Intelligent data placement to reduce overheads of entry/exit from LP modes
• Buffer requests to ranks in LP mode and service them in batches to amortize entry/exit costs