Optimizing DRAM Based Main Memories Using Intelligent Data Placement
Ph.D. Thesis Proposal
Kshitij Sudan
Thesis Statement
Improving DRAM access latency, power consumption, and capacity by leveraging intelligent data placement.
Overview
(Diagram: CPU with memory controller (MC) connected to DIMMs over a memory interconnect.)
• Memory Interconnect: narrow, buffered channels to increase capacity (proposed work)
• Memory Controller: maximize DRAM row-buffer utility (Micro-Pages, ASPLOS 2010)
• System Re-design: increasing capacity within a fixed power budget (Tiered Memory, under review)
RE-ARCHITECTING MEMORY CHANNELS
Proposed Work
Challenges in Increasing DRAM Capacity
• Slow growth in CPU pin count limits the number of memory channels
• Signal integrity limits capacity per channel
– Use serial, point-to-point links
• Drawbacks of using serial, point-to-point links
– Increased latency due to signal re-conditioning
– Memory controller complexity limits resource use
Increasing DRAM Capacity by Re-Architecting Memory Channel
• Re-architect the CPU-to-DRAM channel
• Many skinny, serial channels vs. few, wide buses
• CMPs might have changed the playing field
• Improved signal integrity due to re-conditioning
• New channel topology to reduce latency
• Study effects of channel frequency
Re-Architecting Memory Channel
Organize modules as a binary tree, and move some MC functionality to a “Buffer Chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst-case latency, improves signal integrity
• Buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case latency like FB-DIMM
• NUMA-like DRAM access – leverage data mapping
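The depth reduction above can be sketched by comparing worst-case module hops for a daisy chain and a balanced binary tree (a sketch with hypothetical module counts, not measurements from the proposal):

```python
import math

def chain_depth(n_modules):
    # Daisy-chained modules (FB-DIMM style): a worst-case request
    # traverses every module, so depth grows as O(n).
    return n_modules

def tree_depth(n_modules):
    # Balanced binary tree of buffer chips: worst-case hops equal the
    # tree height, which grows as O(log n).
    return math.ceil(math.log2(n_modules + 1))

for n in (4, 16, 64):
    print(f"{n} modules: chain depth {chain_depth(n)}, tree depth {tree_depth(n)}")
```

At 64 modules the worst case drops from 64 hops to 7, which is why the tree topology is not limited by worst-case latency the way a deep chain is.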
MICRO-PAGES
Past Work
Increasing Row-Buffer Utility with Data Placement
• Overfetch due to large row-buffers
– 8 KB read into the row buffer for a 64-byte cache line
– Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
– Increasingly randomized memory access stream
– Row-buffer hit rates bound to go down
• Open-page policy and FR-FCFS request scheduling
– Memory controller schedules requests to open row-buffers first
Goal: Improve row-buffer hit rates for CMPs
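The sub-1% utilization figure follows directly from the sizes quoted above (a sketch, assuming a single 64-byte line is consumed per row activation):

```python
# Row-buffer utilization for a single cache-line request, using the
# sizes quoted in the slide: 8 KB row buffer, 64-byte cache line.
ROW_BUFFER_BYTES = 8 * 1024
CACHE_LINE_BYTES = 64

utilization = CACHE_LINE_BYTES / ROW_BUFFER_BYTES
print(f"{utilization:.2%}")  # prints "0.78%"
```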
Key Observation
Post-L2 Cache Block Access Pattern Within OS Pages
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks
Basic Idea
(Figure: 4 KB OS pages in DRAM memory are divided into 1 KB micro-pages; the hottest micro-pages are co-located in a reserved DRAM region, while the coldest micro-pages remain in regular DRAM memory.)
Hardware Implementation (HAM)
(Figure: in the baseline, a CPU memory request with physical address X goes directly to the 4 GB main memory. With Hardware Assisted Migration (HAM), a Mapping Table translates old address X to new address Y, and migrated data is served from a 4 MB reserved DRAM region.)
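A minimal sketch of the HAM indirection, assuming a hypothetical MappingTable interface and the 1 KB micro-page granularity from the talk (the real table lives in the memory controller, not software):

```python
MICRO_PAGE = 1024  # 1 KB micro-page granularity (from the talk)

class MappingTable:
    """Sketch of HAM's old-address -> new-address indirection.
    Interface names are hypothetical."""
    def __init__(self):
        self.remap = {}  # micro-page frame -> frame in the reserved region

    def migrate(self, old_addr, new_frame):
        # Record that the hot micro-page holding old_addr now lives in
        # reserved-region frame new_frame.
        self.remap[old_addr // MICRO_PAGE] = new_frame

    def translate(self, addr):
        frame = addr // MICRO_PAGE
        if frame in self.remap:  # hot micro-page moved to reserved region
            return self.remap[frame] * MICRO_PAGE + addr % MICRO_PAGE
        return addr  # cold data stays at its baseline address

t = MappingTable()
t.migrate(0x42 * MICRO_PAGE, 7)                  # hot micro-page -> frame 7
print(hex(t.translate(0x42 * MICRO_PAGE + 16)))  # prints "0x1c10" (7*1024+16)
print(hex(t.translate(0x9000)))                  # unmapped: prints "0x9000"
```

Because only the micro-page frame bits are rewritten, the offset within the 1 KB micro-page passes through unchanged, so translation is a single table lookup on the request path.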
Conclusions
• On average, for applications with room for improvement, and with our best-performing scheme:
– Average performance ↑ 9% (max. 18%)
– Average memory energy consumption ↓ 18% (max. 62%)
– Average row-buffer utilization ↑ 38%
• Hardware-assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses
TIERED MEMORY
Past Work
Increase DRAM Capacity in a Fixed Power Budget
• DRAM power budget increasing steadily with increases in capacity
– Memory power budget in large systems is already close to 50% of the total power budget
• DRAM low-power modes are hard to use in current systems
– Granularity at which low-power modes operate (a DRAM rank)
– Data placement to increase bandwidth reduces opportunities to place ranks in low-power modes
DRAM Power Mgmt. Challenges
• DRAM supports low-power modes, but they are not easy to exploit:
– Granularity at which memory can be put in a low-power mode is large
– Random distribution of memory accesses across ranks
• Memory interleaving
• Little coordination between memory managers (library, OS, and hypervisor)
– As a result, no rank experiences sufficient idleness to warrant being placed in a low-power mode
Few systems can exploit DRAM low-power modes aggressively
Tiered Memory
• Accesses to 4 KB OS pages show a step curve
• Leverage this to place frequently accessed pages in active-mode DRAM ranks
• Place “cold” pages in low-power-mode ranks
Iso-Power Tiered Memory-I
• A DRAM rank in self-refresh mode consumes ~15% of the power of an idle rank in active mode
– 1 rank in active-idle mode ≈ 6 ranks in self-refresh
• By maintaining most of the memory in a low-power mode, systems can be built with much larger memory capacity in the same power budget
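As a back-of-the-envelope illustration of this trade-off, using only the ~15% idle-power ratio above: the full analytical model also accounts for access rate, service rate, and bandwidth, so the iso-power configurations in the backup slides (e.g., 4 hot + 12 cold ranks) are smaller than this idle-power-only upper bound.

```python
COLD_PCT = 15        # self-refresh idle power, as % of an active-idle rank
BASELINE_RANKS = 8   # all-active baseline from the talk

def max_cold_ranks(n_hot):
    # Largest number of self-refresh ranks whose idle power, plus n_hot
    # active ranks, fits the 8-active-rank baseline budget. Integer math
    # in "percent of one active rank" units avoids float rounding.
    budget_pct = BASELINE_RANKS * 100
    return (budget_pct - n_hot * 100) // COLD_PCT

for n_hot in (2, 3, 4):
    cold = max_cold_ranks(n_hot)
    print(f"{n_hot} hot + {cold} cold = {n_hot + cold} ranks total")
```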
Iso-Power Tiered Memory-II
• 2 tiers of DRAM with heterogeneous power and performance characteristics
– “Hot” tier DRAM is always available; “cold” tier DRAM uses the self-refresh low-power mode when idle
• Place frequently accessed data in the hot tier
– Maintains performance
– Fewer accesses to the cold tier -> reduced power
• Batch references to the cold tier:
– Amortize entry/exit overheads of the low-power mode
– Stay in the low-power mode longer
Intelligent Data Placement
• Counters track hot pages with low overhead
• Every epoch, migrate hot pages from low-power ranks to active ranks
– Requires page-table updates and TLB flushes
– Still low overhead: after the first few epochs, the hot-page set changes little
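The counter-based hot-page selection can be sketched as follows (a toy model: the counter hardware, epoch length, and hot-set size are assumptions, not details from the proposal):

```python
from collections import Counter

HOT_SET_SIZE = 2   # pages promoted to active ranks each epoch (hypothetical)

def hot_pages(trace, k=HOT_SET_SIZE):
    # Toy stand-in for per-page activity counters: pick the k
    # most-accessed page frames seen during the epoch.
    return {page for page, _ in Counter(trace).most_common(k)}

epoch_trace = [1, 1, 1, 7, 7, 3, 1, 7]   # page frames touched this epoch
print(sorted(hot_pages(epoch_trace)))    # prints "[1, 7]"
```

Because the hot set is recomputed only once per epoch and changes little after the first few epochs, the page-table updates and TLB flushes it triggers stay infrequent.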
Servicing cold-tier requests in batches
• Buffer cold-tier requests at the memory controller
• Delay any request by at most t_g, which prevents starvation
• t_g chosen to amortize the overheads of low-power-mode entry/exit
• Requires minimal change to the memory controller
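A minimal sketch of this batching policy, with a hypothetical t_g value and request format (the real logic sits in memory-controller hardware):

```python
T_G = 100  # max queueing delay before a batch must be serviced (hypothetical)

class ColdTierQueue:
    """Sketch of batching cold-tier requests at the memory controller:
    cold ranks exit self-refresh once per batch, not once per request."""
    def __init__(self):
        self.buf = []  # list of (arrival_time, request)

    def enqueue(self, now, req):
        self.buf.append((now, req))

    def drain_if_due(self, now):
        # Service the whole batch once the oldest request reaches age t_g,
        # so no request waits longer than t_g (starvation bound).
        if self.buf and now - self.buf[0][0] >= T_G:
            batch, self.buf = [r for _, r in self.buf], []
            return batch
        return []

q = ColdTierQueue()
q.enqueue(0, "rd A")
q.enqueue(40, "rd B")
print(q.drain_if_due(50))    # prints "[]" (oldest request only 50 old)
print(q.drain_if_due(100))   # prints "['rd A', 'rd B']"
```

Draining on the oldest request's age is what bounds the worst-case delay at t_g while still amortizing one self-refresh exit across the whole batch.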
Attributions
• Re-architecting memory channel: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
• Micro-Pages: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
• Tiered Memory: Karthick Rajamani, Wei Huang, John Carter, Freeman Rawson
Thanks
Questions?
Backup Slides
Other Work
• Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches - Manu Awasthi, Kshitij Sudan, Rajeev Balasubramonian, John Carter, HPCA, February 2009.
• Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality of Service - Kshitij Sudan, Sadagopan Srinivasan, Rajeev Balasubramonian, Ravi Iyer, Under Review.
• A Novel System Architecture for Web-Scale Applications Using Lightweight CPUs and Virtualized I/O - Kshitij Sudan, Saisanthosh Balakrishnan, Sean Lie, Min Xu, Dhiraj Mallick, Rajeev Balasubramonian, Gary Lauterbach, Under Review.
• Data Locality Optimization of Pthread Applications for Non-Uniform Cache Architectures - Gagan S. Sachdev, Kshitij Sudan, Rajeev Balasubramonian, Mary Hall, Under Review.
• Efficient Scrub Mechanisms for Error-Prone Emerging Memories - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Bipin Rajendran, Rajeev Balasubramonian, Viji Srinivasan, To Appear at HPCA-18, Feb 2012.
• Hadoop Jobs Require One-Disk-per-Core, Myth or Fact? - Kshitij Sudan, Min Xu, Sean Lie, Saisanthosh Balakrishnan, Gary Lauterbach, XLDB-5 Lightning Talk, Oct. 2011.
• Handling PCM Resistance Drift with Device, Circuit, Architecture, and System Solutions - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Rajeev Balasubramonian, Bipin Rajendran, Viji Srinivasan, Non-Volatile Memory Workshop, March 2011.
• Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers - Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis, PACT, September 2010.
• Improving Server Performance on Multi-Cores via Selective Off-loading of OS Functionality - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, WIOSCA, June 2010.
• Hardware Prediction of OS Run-Length For Fine-Grained Resource Customization - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, ISPASS-2010, March 2010.
Iso-Power Memory Configurations
(Charts: number of tiered ranks vs. idle power ratio (hot/cold) from 5 to 11, for Nh = 2, 3, or 4 hot ranks, at hot-tier access fractions u = 0.5, 0.7, and 0.9. Example iso-power points: 4 hot + 12 cold ranks gives 2X the baseline capacity; 2 hot + 22 cold ranks gives 3X.)
• 8 active ranks in the baseline
• Model inputs: ratio of idle active and self-refresh power, fraction (u) of memory requests served by hot ranks, service rate, and bandwidth
Analytical model determines iso-power configurations for a given access rate to the active-mode (“hot”) DRAM ranks
Tiered Memory: Iso-Power Memory Architecture to Address Memory Power Wall
• Build tiers out of DRAM ranks
• Aggressively use low-power (LP) modes
• Intelligent data placement to reduce overheads of entry/exit from LP modes
• Buffer requests to ranks in LP mode and service them in batches to amortize entry/exit costs