TRANSCRIPT
www.inf.ed.ac.uk July 16, 2015 Kyungpook National University
Memory System Support for Online Data-Intensive Services
Boris Grot School of Informatics
University of Edinburgh
The Big Data Explosion
[Image: Erik Fitzpatrick]
Memory: the New Efficiency Battleground
Server CPUs getting more efficient:
– Wimpy cores → low energy/op
– Many cores/chip → fewer sockets [SOP]

DRAM:
– Demand for capacity outpacing technology scaling
– Growing contributor to datacenter Total Cost of Ownership (TCO)

[Diagram: many-core chip with 16 cores]
Must innovate in the memory system
DRAM 101
DRAM organized in pages
– Page consists of multiple cache blocks
– DRAM page ≠ OS page

Accessed at block granularity
– Page activated in row buffer: an energy-intensive operation
– Blocks fetched from row buffer
– Row buffer hits are 3x lower energy than activations

Do servers leverage row buffer locality?

[Diagram: DRAM memory with one page activated in the row buffer]
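The activation-vs-hit distinction can be illustrated with a toy energy model. The 3x ratio comes from the slide; the absolute energy values and page size below are invented for illustration:

```python
# Toy DRAM energy model: an access to a closed page pays an activation,
# an access to the open page is a cheap row buffer hit. The 3x ratio is
# from the talk; absolute values and page size are illustrative.
ACTIVATE_ENERGY = 3.0   # opening a page into the row buffer
HIT_ENERGY = 1.0        # fetching a block already in the row buffer

def access_energy(block_addrs, blocks_per_page=16):
    """Energy for a sequence of block accesses with one open row per bank."""
    open_page = None
    total = 0.0
    for addr in block_addrs:
        page = addr // blocks_per_page
        if page != open_page:        # row buffer miss: activate the new page
            total += ACTIVATE_ENERGY
            open_page = page
        total += HIT_ENERGY          # fetch the block from the row buffer
    return total

sequential = list(range(32))         # two full pages, streamed in order
interleaved = [0, 16, 1, 17, 2, 18]  # accesses ping-pong between two pages

print(access_energy(sequential))     # 2 activations + 32 hits = 38.0
print(access_energy(interleaved))    # 6 activations + 6 hits = 24.0
```

Note that six interleaved accesses cost almost as much as thirty-two sequential ones: with no row buffer locality, every access pays the activation.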
Information Stores for Big Data

Pointer-intensive structures store bulk data objects
– Constant-time data object retrievals
– Example structures: hash tables (e.g., web search, object caching), trees (e.g., databases, file systems)
– Example objects: memory-mapped files, SW objects, DB rows

[Diagram: hash table with buckets 0-10 pointing to bulk objects A, B, C]
Retrieving a Bulk Object

[Diagram: the application issues key lookups into hash table buckets 0-10, then reads bulk objects A, B, C from memory]

Keys spread over the memory space
Bulk objects contiguously allocated
Accesses: fine-grained for key lookups & bulk for data objects
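The two access patterns can be sketched as follows; the address layout, bucket count, and object sizes are made up for illustration:

```python
# Sketch of the two access granularities: a fine-grained key lookup into a
# scattered hash table, followed by a bulk read of a contiguously allocated
# object. Sizes and offsets are illustrative, not from the talk.
memory = bytearray(4096)                 # flat "DRAM" address space

# Bulk objects A, B, C, each contiguously allocated
objects = {"A": (0, 1024), "B": (1024, 1024), "C": (2048, 1024)}

def lookup(key):
    bucket = hash(key) % 11              # fine-grained: touch one bucket
    _ = memory[bucket]                   # pointer chase into the table
    base, size = objects[key]            # resolve the object's location
    return bytes(memory[base:base + size])  # bulk: read the whole object

assert len(lookup("A")) == 1024
```

The key lookup touches a single cache block somewhere in the table, while the object read sweeps a contiguous kilobyte, which is exactly the bimodal traffic the next slide quantifies.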
Server Memory Traffic is Bimodal

Bulk accesses account for 60-75% of all memory accesses
– Bulk access: touches ≥ 50% of bytes within a 1KB region

[Chart: breakdown of memory accesses into bulk vs. fine-grained for Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Are bulk accesses leveraged by memory?
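The bulk-access definition above can be turned into a small classifier; the 64-byte block size is an assumption, the 1KB region and 50% threshold are from the slide:

```python
# Classifier for the talk's bulk-access definition: a 1KB region is accessed
# in bulk if >= 50% of its bytes are touched. 64B blocks are assumed.
from collections import defaultdict

REGION, BLOCK = 1024, 64

def classify(byte_addrs):
    """Map each 1KB-aligned region to 'bulk' or 'fine-grained'."""
    touched = defaultdict(set)
    for addr in byte_addrs:
        touched[addr // REGION].add(addr // BLOCK)   # blocks hit per region
    blocks_per_region = REGION // BLOCK              # 16 blocks of 64B
    return {r: ("bulk" if len(b) >= blocks_per_region / 2 else "fine-grained")
            for r, b in touched.items()}

# Region 0 is swept densely (bulk); region 1 is touched once (fine-grained)
accesses = [i * BLOCK for i in range(10)] + [1024]
print(classify(accesses))  # {0: 'bulk', 1: 'fine-grained'}
```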
Bulk Accesses Are Poorly Exploited

Row buffer locality poorly exploited
– Requests from multiple cores interleave
– Limited instruction window size restricts MLP

DRAM page activations chief contributor to energy (~60%)

[Charts: row buffer hit rate of memory accesses, and activation share of memory energy, for Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Need to improve row buffer locality
Prior Work

Memory access scheduling: prioritize row buffer hits
– Effectiveness limited by instruction window size

Spatial prefetching and scheduled writebacks
– High hardware cost
– Limited opportunity: only a fraction of the memory accesses covered

Need a comprehensive mechanism with low cost
Streaming Can Exploit Locality
– Stream contents of the row buffer to the last-level cache (LLC)
– Subsequent accesses become LLC hits

[Diagram: memory request stream; row buffer contents streamed into the LLC, turning later accesses to objects A, B, C into LLC hits]

Challenge: fine-grained accesses cause overfetch
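The benefit and the challenge can both be seen in a small simulation sketch (cache structure and page size are illustrative, not from the talk):

```python
# Sketch of row-buffer streaming: on an LLC miss, copy the whole activated
# page into the LLC so later accesses to the same object hit in cache.
# Overfetch counts streamed blocks that are never touched.
BLOCKS_PER_PAGE = 16

def run(block_addrs, stream=True):
    llc, hits, fetched, used = set(), 0, 0, set()
    for addr in block_addrs:
        if addr in llc:
            hits += 1
        else:
            page = addr // BLOCKS_PER_PAGE
            if stream:                   # stream the whole row buffer to LLC
                blocks = {page * BLOCKS_PER_PAGE + i
                          for i in range(BLOCKS_PER_PAGE)}
            else:                        # fetch only the missing block
                blocks = {addr}
            llc |= blocks
            fetched += len(blocks)
        used.add(addr)
    overfetch = fetched - len(used)      # streamed blocks never touched
    return hits, overfetch

bulk = list(range(16))                   # one whole object in one page
print(run(bulk))      # (15, 0): streaming turns 15 accesses into LLC hits
print(run([0, 40]))   # (0, 30): fine-grained accesses waste 30 blocks
```

Streaming is a clear win for bulk objects and a clear loss for fine-grained lookups, which is why BuMP needs a predictor to decide when to stream.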
BuMP: Bulk Memory Access Prediction and Streaming [MICRO'14]

Prediction: identify bulk accesses
– For both memory reads and writes

Streaming: upon an access to a bulk object
– Read: stream entire object into the LLC
– Write: stream entire object to memory
BuMP: Memory Reads

Memory reads: triggered by LLC misses
– Majority (57-75%) go to pages with coarse-grained data

Prediction: associate coarse-grained access regions with the code operating on them
– Identify functions that operate on coarse-grained data
• Use the program counter (PC) of the first access

Streaming: upon a memory reference
– Check if the PC belongs to a coarse-grained operation
– Trigger bulk fetch

Low cost, as only a few PCs trigger bulk accesses
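The prediction step can be sketched as follows. The table organization, threshold, and PC values are illustrative, not the paper's exact hardware:

```python
# Sketch of BuMP-style read prediction: associate each DRAM page with the
# PC of its first access, count how many blocks of the page get touched,
# and mark PCs whose footprint is coarse-grained (>= 50% of the page) as
# bulk triggers. Sizes and thresholds are illustrative.
from collections import defaultdict

BLOCKS_PER_PAGE = 16
BULK_THRESHOLD = BLOCKS_PER_PAGE // 2    # the talk's ">= 50%" criterion

class HistoryTrackingTable:
    def __init__(self):
        self.first_pc = {}                # page -> PC of first access
        self.footprint = defaultdict(set) # page -> blocks touched so far
        self.bulk_pcs = set()             # PCs predicted to access in bulk

    def record_miss(self, pc, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        self.first_pc.setdefault(page, pc)
        self.footprint[page].add(block)
        if len(self.footprint[page]) >= BULK_THRESHOLD:
            self.bulk_pcs.add(self.first_pc[page])

    def should_stream(self, pc):
        """On a miss: stream the whole page if this PC is a bulk trigger."""
        return pc in self.bulk_pcs

htt = HistoryTrackingTable()
for block in range(8):                    # PC 0x42 sweeps half a page: bulk
    htt.record_miss(pc=0x42, addr=block)
htt.record_miss(pc=0x99, addr=100)        # PC 0x99 touches one block: fine

print(htt.should_stream(0x42))  # True: future misses from 0x42 stream
print(htt.should_stream(0x99))  # False
```

Because a handful of hot functions perform the bulk accesses, the set of trigger PCs stays small, which is the source of the low-cost claim above.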
BuMP: Memory Reads Prediction

[Animation over three slides: on each LLC miss, a History Tracking Table entry associates the accessed page (A, B, C) with the PC of the first access (PC1, PC2)]
BuMP: Memory Reads Streaming

[Diagram: a miss from a PC marked coarse-grained (PC2) triggers streaming of object C from the row buffer into the LLC]

Exploit row buffer locality when profitable
BuMP: Memory Writes

Memory writes: evictions of modified LLC blocks
– Significant share (21-38%) of DRAM traffic
– Majority (62-86%) go to pages with coarse-grained data

Prediction: track modified LLC-resident coarse-grained data
– Extends the tracking table with a modified bit

Streaming: upon writing back an LLC block to memory
– Check if it belongs to a page with coarse-grained data
– Trigger bulk writeback
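The write side can be sketched in the same spirit. The structures below are illustrative, not the paper's exact hardware; the coarse-page set would come from the tracking table described above:

```python
# Sketch of BuMP-style bulk writeback: when a dirty block from a page known
# to hold coarse-grained data is evicted, write back all dirty LLC-resident
# blocks of that page in one transfer (one row activation instead of many).
from collections import defaultdict

BLOCKS_PER_PAGE = 16

class BulkWriteback:
    def __init__(self, coarse_pages):
        self.coarse_pages = coarse_pages   # pages flagged coarse-grained
        self.dirty = defaultdict(set)      # page -> dirty blocks in the LLC
        self.writebacks = []               # (page, blocks) DRAM transfers

    def mark_dirty(self, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        self.dirty[page].add(block)

    def evict(self, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        if page in self.coarse_pages:      # bulk: flush the whole page
            blocks = self.dirty.pop(page, set()) | {block}
        else:                              # fine-grained: one block only
            self.dirty[page].discard(block)
            blocks = {block}
        self.writebacks.append((page, sorted(blocks)))

wb = BulkWriteback(coarse_pages={0})
for b in range(12):
    wb.mark_dirty(b)                       # 12 dirty blocks in page 0
wb.evict(0)                                # one eviction flushes all 12
print(wb.writebacks)                       # one bulk transfer of 12 blocks
```

Grouping the dirty blocks of a page into one writeback is what lets the writes, like the reads, amortize a single row activation.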
Methodology

Server applications [CloudSuite]
– Data serving, Online analytics, Web search, Web serving, Media streaming

Many-core server
– 16-core CMP @ 2.5 GHz
– 16 GB of DRAM

Performance evaluation
– Simics: full-system simulation
– Flexus: cycle-accurate models of CMP & DRAM

Energy consumption
– Custom DRAM energy models based on Micron
Evaluation Highlights

– BuMP reduces row activations by 2x over Base-Open
– Small over-fetch rate of ~12%
– Improves performance by ~11% over Base-Open

[Chart: memory energy, split into row activations vs. row buffer hits & interface, for Base and BuMP across Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Improves memory energy per access by 23%
BuMP: Summary

Servers access memory in two granularities
– Fine: pointer-intensive data structures
– Coarse: bulk data objects

DRAM does not exploit coarse-grained accesses
– Accesses to different objects are interleaved

BuMP improves server energy efficiency
– Identifies bulk accesses & triggers bulk transfers
– Improves memory energy per access by 23%