TRANSCRIPT
www.inf.ed.ac.uk July 16, 2015 Kyungpook National University
Memory System Support for Online Data-Intensive Services
Boris Grot School of Informatics
University of Edinburgh
The Big Data Explosion
[Image: Erik Fitzpatrick]
Memory: the New Efficiency Battleground
Server CPUs getting more efficient:
– Wimpy cores → low energy/op
– Many cores/chip → fewer sockets [SOP]

DRAM:
– Demand for capacity outpacing technology scaling
– Growing contributor to datacenter Total Cost of Ownership (TCO)

[Diagram: many-core chip with 16 cores]
Must innovate in the memory system
DRAM 101
DRAM organized in pages
– Page consists of multiple cache blocks
– DRAM page ≠ OS page

Accessed at block granularity
– Page activated in row buffer: an energy-intensive operation
– Blocks fetched from row buffer
– Row buffer hits are 3x lower energy than activations

Do servers leverage row buffer locality?

[Diagram: DRAM memory with one page activated in the row buffer]
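The activation-vs-hit distinction can be illustrated with a toy energy model. The 3x ratio comes from the slide; the absolute energy values and page size below are invented for illustration:

```python
# Toy DRAM energy model: an access to a closed page pays an activation,
# an access to the open page is a cheap row buffer hit. The 3x ratio is
# from the talk; absolute values and page size are illustrative.
ACTIVATE_ENERGY = 3.0   # opening a page into the row buffer
HIT_ENERGY = 1.0        # fetching a block already in the row buffer

def access_energy(block_addrs, blocks_per_page=16):
    """Energy for a sequence of block accesses with one open row per bank."""
    open_page = None
    total = 0.0
    for addr in block_addrs:
        page = addr // blocks_per_page
        if page != open_page:        # row buffer miss: activate the new page
            total += ACTIVATE_ENERGY
            open_page = page
        total += HIT_ENERGY          # fetch the block from the row buffer
    return total

sequential = list(range(32))         # two full pages, streamed in order
interleaved = [0, 16, 1, 17, 2, 18]  # accesses ping-pong between two pages

print(access_energy(sequential))     # 2 activations + 32 hits = 38.0
print(access_energy(interleaved))    # 6 activations + 6 hits = 24.0
```

Note that six interleaved accesses cost almost as much as thirty-two sequential ones: with no row buffer locality, every access pays the activation.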
Information Stores for Big Data

Pointer-intensive structures store bulk data objects
– Constant-time data object retrievals
– Example structures: hash tables (e.g., web search, object caching), trees (e.g., databases, file systems)
– Example objects: memory-mapped files, SW objects, DB rows

[Diagram: hash table with buckets 0-10 pointing to bulk objects A, B, C]
Retrieving a Bulk Object

[Diagram: the application issues key lookups into hash table buckets 0-10, then reads bulk objects A, B, C from memory]

Keys spread over the memory space
Bulk objects contiguously allocated
Accesses: fine-grained for key lookups & bulk for data objects
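The two access patterns can be sketched as follows; the address layout, bucket count, and object sizes are made up for illustration:

```python
# Sketch of the two access granularities: a fine-grained key lookup into a
# scattered hash table, followed by a bulk read of a contiguously allocated
# object. Sizes and offsets are illustrative, not from the talk.
memory = bytearray(4096)                 # flat "DRAM" address space

# Bulk objects A, B, C, each contiguously allocated
objects = {"A": (0, 1024), "B": (1024, 1024), "C": (2048, 1024)}

def lookup(key):
    bucket = hash(key) % 11              # fine-grained: touch one bucket
    _ = memory[bucket]                   # pointer chase into the table
    base, size = objects[key]            # resolve the object's location
    return bytes(memory[base:base + size])  # bulk: read the whole object

assert len(lookup("A")) == 1024
```

The key lookup touches a single cache block somewhere in the table, while the object read sweeps a contiguous kilobyte, which is exactly the bimodal traffic the next slide quantifies.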
Server Memory Traffic is Bimodal

Bulk accesses account for 60-75% of all memory accesses
– Bulk access: touches ≥ 50% of bytes within a 1KB region

[Chart: breakdown of memory accesses into bulk vs. fine-grained for Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Are bulk accesses leveraged by memory?
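The bulk-access definition above can be turned into a small classifier; the 64-byte block size is an assumption, the 1KB region and 50% threshold are from the slide:

```python
# Classifier for the talk's bulk-access definition: a 1KB region is accessed
# in bulk if >= 50% of its bytes are touched. 64B blocks are assumed.
from collections import defaultdict

REGION, BLOCK = 1024, 64

def classify(byte_addrs):
    """Map each 1KB-aligned region to 'bulk' or 'fine-grained'."""
    touched = defaultdict(set)
    for addr in byte_addrs:
        touched[addr // REGION].add(addr // BLOCK)   # blocks hit per region
    blocks_per_region = REGION // BLOCK              # 16 blocks of 64B
    return {r: ("bulk" if len(b) >= blocks_per_region / 2 else "fine-grained")
            for r, b in touched.items()}

# Region 0 is swept densely (bulk); region 1 is touched once (fine-grained)
accesses = [i * BLOCK for i in range(10)] + [1024]
print(classify(accesses))  # {0: 'bulk', 1: 'fine-grained'}
```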
Bulk Accesses Are Poorly Exploited

Row buffer locality poorly exploited
– Requests from multiple cores interleave
– Limited instruction window size restricts MLP

DRAM page activations chief contributor to energy (~60%)

[Charts: row buffer hit rate of memory accesses, and activation share of memory energy, for Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Need to improve row buffer locality
Prior Work

Memory access scheduling: prioritize row buffer hits
– Effectiveness limited by instruction window size

Spatial prefetching and scheduled writebacks
– High hardware cost
– Limited opportunity: only a fraction of the memory accesses covered

Need a comprehensive mechanism with low cost
Streaming Can Exploit Locality
– Stream contents of the row buffer to the last-level cache (LLC)
– Subsequent accesses become LLC hits

[Diagram: memory request stream; row buffer contents streamed into the LLC, turning later accesses to objects A, B, C into LLC hits]

Challenge: fine-grained accesses cause overfetch
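The benefit and the challenge can both be seen in a small simulation sketch (cache structure and page size are illustrative, not from the talk):

```python
# Sketch of row-buffer streaming: on an LLC miss, copy the whole activated
# page into the LLC so later accesses to the same object hit in cache.
# Overfetch counts streamed blocks that are never touched.
BLOCKS_PER_PAGE = 16

def run(block_addrs, stream=True):
    llc, hits, fetched, used = set(), 0, 0, set()
    for addr in block_addrs:
        if addr in llc:
            hits += 1
        else:
            page = addr // BLOCKS_PER_PAGE
            if stream:                   # stream the whole row buffer to LLC
                blocks = {page * BLOCKS_PER_PAGE + i
                          for i in range(BLOCKS_PER_PAGE)}
            else:                        # fetch only the missing block
                blocks = {addr}
            llc |= blocks
            fetched += len(blocks)
        used.add(addr)
    overfetch = fetched - len(used)      # streamed blocks never touched
    return hits, overfetch

bulk = list(range(16))                   # one whole object in one page
print(run(bulk))      # (15, 0): streaming turns 15 accesses into LLC hits
print(run([0, 40]))   # (0, 30): fine-grained accesses waste 30 blocks
```

Streaming is a clear win for bulk objects and a clear loss for fine-grained lookups, which is why BuMP needs a predictor to decide when to stream.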
BuMP: Bulk Memory Access Prediction and Streaming [MICRO'14]

Prediction: identify bulk accesses
– For both memory reads and writes

Streaming: upon an access to a bulk object
– Read: stream entire object into the LLC
– Write: stream entire object to memory
BuMP: Memory Reads

Memory reads: triggered by LLC misses
– Majority (57-75%) go to pages with coarse-grained data

Prediction: associate coarse-grained access regions with the code operating on them
– Identify functions that operate on coarse-grained data
• Use the program counter (PC) of the first access

Streaming: upon a memory reference
– Check if the PC belongs to a coarse-grained operation
– Trigger bulk fetch

Low cost, as only a few PCs trigger bulk accesses
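The prediction step can be sketched as follows. The table organization, threshold, and PC values are illustrative, not the paper's exact hardware:

```python
# Sketch of BuMP-style read prediction: associate each DRAM page with the
# PC of its first access, count how many blocks of the page get touched,
# and mark PCs whose footprint is coarse-grained (>= 50% of the page) as
# bulk triggers. Sizes and thresholds are illustrative.
from collections import defaultdict

BLOCKS_PER_PAGE = 16
BULK_THRESHOLD = BLOCKS_PER_PAGE // 2    # the talk's ">= 50%" criterion

class HistoryTrackingTable:
    def __init__(self):
        self.first_pc = {}                # page -> PC of first access
        self.footprint = defaultdict(set) # page -> blocks touched so far
        self.bulk_pcs = set()             # PCs predicted to access in bulk

    def record_miss(self, pc, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        self.first_pc.setdefault(page, pc)
        self.footprint[page].add(block)
        if len(self.footprint[page]) >= BULK_THRESHOLD:
            self.bulk_pcs.add(self.first_pc[page])

    def should_stream(self, pc):
        """On a miss: stream the whole page if this PC is a bulk trigger."""
        return pc in self.bulk_pcs

htt = HistoryTrackingTable()
for block in range(8):                    # PC 0x42 sweeps half a page: bulk
    htt.record_miss(pc=0x42, addr=block)
htt.record_miss(pc=0x99, addr=100)        # PC 0x99 touches one block: fine

print(htt.should_stream(0x42))  # True: future misses from 0x42 stream
print(htt.should_stream(0x99))  # False
```

Because a handful of hot functions perform the bulk accesses, the set of trigger PCs stays small, which is the source of the low-cost claim above.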
BuMP: Memory Reads Prediction

[Animation over three slides: on each LLC miss, a History Tracking Table entry associates the accessed page (A, B, C) with the PC of the first access (PC1, PC2)]
BuMP: Memory Reads Streaming

[Diagram: a miss from a PC marked coarse-grained (PC2) triggers streaming of object C from the row buffer into the LLC]

Exploit row buffer locality when profitable
BuMP: Memory Writes

Memory writes: evictions of modified LLC blocks
– Significant share (21-38%) of DRAM traffic
– Majority (62-86%) go to pages with coarse-grained data

Prediction: track modified LLC-resident coarse-grained data
– Extends the tracking table with a modified bit

Streaming: upon writing back an LLC block to memory
– Check if it belongs to a page with coarse-grained data
– Trigger bulk writeback
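The write side can be sketched in the same spirit. The structures below are illustrative, not the paper's exact hardware; the coarse-page set would come from the tracking table described above:

```python
# Sketch of BuMP-style bulk writeback: when a dirty block from a page known
# to hold coarse-grained data is evicted, write back all dirty LLC-resident
# blocks of that page in one transfer (one row activation instead of many).
from collections import defaultdict

BLOCKS_PER_PAGE = 16

class BulkWriteback:
    def __init__(self, coarse_pages):
        self.coarse_pages = coarse_pages   # pages flagged coarse-grained
        self.dirty = defaultdict(set)      # page -> dirty blocks in the LLC
        self.writebacks = []               # (page, blocks) DRAM transfers

    def mark_dirty(self, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        self.dirty[page].add(block)

    def evict(self, addr):
        page, block = divmod(addr, BLOCKS_PER_PAGE)
        if page in self.coarse_pages:      # bulk: flush the whole page
            blocks = self.dirty.pop(page, set()) | {block}
        else:                              # fine-grained: one block only
            self.dirty[page].discard(block)
            blocks = {block}
        self.writebacks.append((page, sorted(blocks)))

wb = BulkWriteback(coarse_pages={0})
for b in range(12):
    wb.mark_dirty(b)                       # 12 dirty blocks in page 0
wb.evict(0)                                # one eviction flushes all 12
print(wb.writebacks)                       # one bulk transfer of 12 blocks
```

Grouping the dirty blocks of a page into one writeback is what lets the writes, like the reads, amortize a single row activation.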
Methodology

Server applications [CloudSuite]
– Data serving, Online analytics, Web search, Web serving, Media streaming

Many-core server
– 16-core CMP @ 2.5 GHz
– 16 GB of DRAM

Performance evaluation
– Simics: full-system simulation
– Flexus: cycle-accurate models of CMP & DRAM

Energy consumption
– Custom DRAM energy models based on Micron
Evaluation Highlights

– BuMP reduces row activations by 2x over Base-Open
– Small over-fetch rate of ~12%
– Improves performance by ~11% over Base-Open

[Chart: memory energy, split into row activations vs. row buffer hits & interface, for Base and BuMP across Data Serving, Media Streaming, Online Analytics, Web Search, Web Serving]

Improves memory energy per access by 23%
BuMP: Summary

Servers access memory in two granularities
– Fine: pointer-intensive data structures
– Coarse: bulk data objects

DRAM does not exploit coarse-grained accesses
– Accesses to different objects are interleaved

BuMP improves server energy efficiency
– Identifies bulk accesses & triggers bulk transfers
– Improves memory energy per access by 23%