Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu

Page 1: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu

Page 2:

Executive Summary
• Problem: Reliable memory hardware increases cost
• Our Goal: Reduce datacenter cost; meet availability target
• Observation: Data-intensive applications' data exhibit a diverse spectrum of tolerance to memory errors
  ‐ Across applications and within an application
  ‐ We characterized 3 modern data-intensive applications
• Our Proposal: Heterogeneous-reliability memory (HRM)
  ‐ Store error-tolerant data in less-reliable lower-cost memory
  ‐ Store error-vulnerable data in more-reliable memory
• Major results:
  ‐ Reduce server hardware cost by 4.7%
  ‐ Achieve single server availability target of 99.90%

Page 3:

Outline
• Motivation
• Characterizing application memory error tolerance
• Key observations
  ‐ Observation 1: Memory error tolerance varies across applications and within an application
  ‐ Observation 2: Data can be recovered by software
• Heterogeneous-Reliability Memory (HRM)
• Evaluation

Page 5:

Server Memory Cost is High
• Server hardware cost dominates datacenter Total Cost of Ownership (TCO) [Barroso '09]
• As server memory capacity grows, memory cost becomes the most important component of server hardware costs [Kozyrakis '10]

128 GB memory cost: ~$140 (per 16 GB) × 8 = ~$1120 *
2 CPUs cost: ~$500 (per CPU) × 2 = ~$1000 *

* Numbers from 2014
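The comparison on this slide is plain arithmetic. A minimal sketch that reproduces it from the slide's approximate 2014 prices follows; the variable names are illustrative, and the computed share covers only the CPU and memory line items quoted here, not the whole server.

```python
# Approximate 2014 prices quoted on the slide.
DIMM_PRICE_USD = 140   # per 16 GB DIMM
DIMM_COUNT = 8         # 8 x 16 GB = 128 GB
CPU_PRICE_USD = 500    # per CPU
CPU_COUNT = 2

memory_cost = DIMM_PRICE_USD * DIMM_COUNT
cpu_cost = CPU_PRICE_USD * CPU_COUNT

# Share of memory within the CPU + memory subtotal only (other
# server components are not priced on the slide).
memory_share = memory_cost / (memory_cost + cpu_cost)

print(f"memory: ${memory_cost}, CPUs: ${cpu_cost}")
print(f"memory share of CPU+memory subtotal: {memory_share:.1%}")
```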

Page 6:

Memory Reliability is Important
• System/app crash
• System/app hang or slowdown
• Silent data corruption or incorrect app output

Page 7:

Existing Error Mitigation Techniques (I)
• Quality assurance tests increase manufacturing cost

[Figure: testing cost as a percentage of memory cost (y-axis, 0 to 12) versus DRAM chip capacity (x-axis, 1 Mb to 4 Gb). Legend: [DocMemory '00]; Predicted as trend.]

Memory testing cost can be a significant fraction of memory cost as memory capacity grows

Page 8:

Existing Error Mitigation Techniques (II)
• Error detection and correction increases system cost

Rows are listed in order of increasing strength.

Technique   Detection   Correction   Added capacity   Added logic
NoECC       N/A         N/A          0.00%            No
Parity      1 bit       N/A          1.56%            Low
SEC-DED     2 bit       1 bit        12.5%            Low
Chipkill    2 chip      1 chip       12.5%            High

Stronger error protection techniques have higher cost
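The "added capacity" figures for Parity and SEC-DED can be checked with a short calculation. The sketch below assumes a 64-bit data word with 1 parity bit or 8 SEC-DED check bits; these widths are common textbook values and are not stated on the slide. It also shows a single parity bit detecting a one-bit flip without being able to locate it. Chipkill is left out because its layout is not described here.

```python
DATA_BITS = 64  # assumption: 64-bit data word per access

# Assumed check bits per data word (not given on the slide).
CHECK_BITS = {"NoECC": 0, "Parity": 1, "SEC-DED": 8}

def added_capacity_percent(check_bits, data_bits=DATA_BITS):
    """Extra storage needed for check bits, as a percentage of the data."""
    return 100.0 * check_bits / data_bits

def even_parity(word):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2

for name, bits in CHECK_BITS.items():
    print(f"{name}: {added_capacity_percent(bits):.2f}% added capacity")

# A single-bit flip changes the parity, so the error is detected,
# but parity alone cannot tell which bit flipped.
word = 0b10110010
stored_parity = even_parity(word)
corrupted = word ^ (1 << 5)
detected = even_parity(corrupted) != stored_parity
print("single-bit error detected:", detected)
```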

Page 9:

Shortcomings of Existing Approaches
• Uniformly improve memory reliability
  ‐ Observation 1: Memory error tolerance varies across applications and within an application
• Rely only on hardware-level techniques
  ‐ Observation 2: Once a memory error is detected, most corrupted data can be recovered by software

Goal: Design a new cost-efficient memory system that flexibly matches memory reliability with application memory error tolerance

Page 11:

Memory Error Outcomes

[Diagram: a memory error corrupts a stored value (x = …010… becomes x = …110…). Possible outcomes:]
• Masked by overwrite (Store: x = …000…) → correct result
• Masked by logic (Load: if (x != 0) …) → correct result
• Consumed by application (Load: return x; or *x;) → incorrect result: system/app crash or incorrect response

Characterization Goal: Quantify application memory error tolerance
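The three outcomes can be illustrated with the slide's own example values. The following is a minimal sketch, not the authors' tooling: the function names are made up for illustration, and the bit flip is applied in-process rather than through a debugger.

```python
def flip_bit(value, bit):
    """Model a single-bit memory error by inverting one bit."""
    return value ^ (1 << bit)

def masked_by_overwrite(x):
    # The program stores a new value before ever loading the old one,
    # so the corrupted contents are never observed.
    x = 0b000
    return x

def masked_by_logic(x):
    # Only "zero vs. non-zero" matters here, so a flip that keeps the
    # value non-zero does not change the result.
    return "taken" if x != 0 else "not taken"

def consumed_by_application(x):
    # The value itself is returned (or dereferenced), so the error
    # propagates to the output.
    return x

clean = 0b010
corrupted = flip_bit(clean, 2)  # 0b010 -> 0b110, as on the slide

for name, fn in [("overwrite", masked_by_overwrite),
                 ("logic", masked_by_logic),
                 ("consumed", consumed_by_application)]:
    same = fn(clean) == fn(corrupted)
    print(f"{name}: {'correct result' if same else 'incorrect result'}")
```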

Page 12:

Characterization Methodology
• 3 modern data-intensive applications

Application        WebSearch   Memcached   GraphLab
Memory footprint   46 GB       35 GB       4 GB

• 3 dominant memory regions
  ‐ Heap – dynamically allocated data
  ‐ Stack – function parameters and local variables
  ‐ Private – private heap managed by user
• Injected a total of 23,718 memory errors using software debuggers (WinDbg and GDB)
• Examined correctness for over 4 billion queries

Page 14:

Observation 1: Memory Error Tolerance Varies Across Applications

[Figure: System/Application Crash. Probability of crash (%) for WebSearch, Memcached, and GraphLab (y-axis 0 to 14). Annotation: >10× difference.]

Showing results for single-bit soft errors. Results for other memory error types can be found in the paper, with similar conclusions.

Page 15:

Observation 1: Memory Error Tolerance Varies Across Applications

[Figure: Incorrect Responses. Number of incorrect responses per billion queries for WebSearch, Memcached, and GraphLab (log-scale y-axis, 1E+1 to 1E+8). Annotation: >10^5× difference.]

Showing results for single-bit soft errors. Results for other memory error types can be found in the paper.

Page 16:

Observation 1: Memory Error Tolerance Varies Across Applications and Within an Application

[Figure: System/Application Crash. Probability of crash (%) for the Private, Heap, and Stack regions (y-axis 0 to 1.2). Annotation: >4× difference.]

Showing results for WebSearch. Results for other workloads can be found in the paper.

Page 17:

Observation 1: Memory Error Tolerance Varies Across Applications and Within an Application

[Figure: Incorrect Responses. Number of incorrect responses per billion queries for the Private, Heap, and Stack regions (log-scale y-axis, 1E+0 to 1E+2). Annotations: "15"; "All averaged at a very low rate".]

Showing results for WebSearch. Results for other workloads can be found in the paper.

Page 19:

Observation 2: Data Can be Recovered by Software Implicitly and Explicitly
• Implicitly recoverable – application intrinsically has a clean copy of the data on disk
• Explicitly recoverable – application can create a copy of the data at a low cost (if it has very low write frequency)

[Figure: WebSearch Recoverability. Implicitly recoverable and explicitly recoverable fractions for Private, Heap, Stack, and Overall. Data labels as extracted: 88%, 59%, 1%, 82%, 63%, 28%, 16%, 56%.]

Page 21:

Exploiting Memory Error Tolerance

[Diagram: memory error vulnerability of App/Data A, App/Data B, and App/Data C, each split into vulnerable data and tolerant data.]

Heterogeneous-Reliability Memory
• Reliable memory (holds vulnerable data)
  ‐ ECC protected
  ‐ Well-tested chips
• Low-cost memory (holds tolerant data)
  ‐ NoECC or Parity
  ‐ Less-tested chips

Page 22:

Par+R: Parity Detection + Software Recovery

[Diagram with two panels.
Implicit Recovery: Page A in memory has an intrinsic copy on disk; after a memory error, Page A is restored from that copy.
Explicit Recovery: Page B in memory is copied to disk; after a memory error, Page B is restored from that copy.
Other labels: Write; Write non-intensive; Write intensive.]
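The Par+R idea (detect with parity, recover in software from a clean copy) can be sketched in a few lines. This is a toy model under stated assumptions, not the authors' implementation: the class `ParPlusRMemory`, its methods, and the dictionary standing in for the disk are all hypothetical, and parity is computed per page rather than per memory word.

```python
class ParPlusRMemory:
    """Toy model: parity-protected pages backed by clean copies on 'disk'."""

    def __init__(self, disk):
        self.disk = disk      # page_id -> bytes (clean copies)
        self.pages = {}       # page_id -> bytearray (in-memory contents)
        self.parity = {}      # page_id -> stored parity bit

    @staticmethod
    def _parity(data):
        return sum(bin(b).count("1") for b in data) % 2

    def load_from_disk(self, page_id):
        # Implicit recovery path: the data already lives on disk.
        self.pages[page_id] = bytearray(self.disk[page_id])
        self.parity[page_id] = self._parity(self.pages[page_id])

    def write(self, page_id, data):
        # Explicit recovery path: the application creates the copy itself.
        # This is cheap only for data that is written infrequently.
        self.pages[page_id] = bytearray(data)
        self.parity[page_id] = self._parity(data)
        self.disk[page_id] = bytes(data)

    def read(self, page_id):
        data = self.pages[page_id]
        if self._parity(data) != self.parity[page_id]:
            # Parity mismatch: error detected, restore the clean copy.
            self.load_from_disk(page_id)
            data = self.pages[page_id]
        return bytes(data)


mem = ParPlusRMemory(disk={"A": b"index"})
mem.load_from_disk("A")          # Page A: clean copy intrinsically on disk
mem.write("B", b"cache")         # Page B: copy created explicitly

mem.pages["A"][0] ^= 0x04        # inject a single-bit error into each page
mem.pages["B"][1] ^= 0x01

print(mem.read("A"), mem.read("B"))
```

A single parity bit only detects an odd number of flipped bits, so an even number of flips in the same page would go unnoticed in this model.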

Page 23:

Heterogeneous-Reliability Memory

Step 1: Characterize and classify application memory error tolerance
[Diagram: App 1 data A/B, App 2 data A/B, and App 3 data A/B are ordered along a spectrum from Vulnerable to Tolerant.]

Step 2: Map application data to the HRM system enabled by SW/HW cooperative solutions
[Diagram: the ordered data is mapped onto memory tiers ranging from Reliable to Unreliable: Reliable memory; Parity memory + software recovery (Par+R); Low-cost memory.]

Page 25:

Evaluated Systems

The three middle columns give the memory mapping for each region.

Group      Configuration     Private (36 GB)   Heap (9 GB)   Stack (60 MB)   Pros and Cons
Baseline   Typical Server    ECC               ECC           ECC             Reliable but expensive
Baseline   Consumer PC       NoECC             NoECC         NoECC           Low-cost but unreliable
HRM        HRM               Par+R             NoECC         NoECC           Parity only
HRM        Less-Tested (L)   NoECC             NoECC         NoECC           Least expensive and least reliable
HRM        HRM/L             ECC               Par+R         NoECC           Low-cost and reliable HRM

Page 26:

Design Parameters

Parameter                                 Value
DRAM/server HW cost [Kozyrakis '10]       30%
NoECC memory cost savings                 11.1%
Parity memory cost savings                9.7%
Less-tested memory cost savings           18% ± 12%
Crash recovery time                       10 mins
Par+R flush threshold                     5 mins
Errors/server/month [Schroeder '09]       2000
Target single server availability         99.90%
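To see what the availability target implies, the crash recovery time can be turned into a monthly crash budget. The sketch below assumes a 30-day month and that crash recovery is the only source of downtime; both are simplifying assumptions, not statements from the slides.

```python
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day month
CRASH_RECOVERY_MIN = 10            # from the design parameters
TARGET_AVAILABILITY = 0.9990       # from the design parameters

def availability(crashes_per_month, recovery_min=CRASH_RECOVERY_MIN):
    """Fraction of the month the server is up, counting only crash downtime."""
    downtime = crashes_per_month * recovery_min
    return 1.0 - downtime / MINUTES_PER_MONTH

def max_crashes_per_month(target=TARGET_AVAILABILITY,
                          recovery_min=CRASH_RECOVERY_MIN):
    """Largest crash rate that still meets the availability target."""
    return (1.0 - target) * MINUTES_PER_MONTH / recovery_min

budget = max_crashes_per_month()
print(f"crash budget: {budget:.2f} crashes/server/month")
print(f"availability at 4 crashes/month: {availability(4):.4%}")
print(f"availability at 5 crashes/month: {availability(5):.4%}")
```

Under these assumptions, a server can afford only a handful of crashes per month, far fewer than the 2000 errors per month listed above, so most errors must be masked, corrected, or recovered.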

Page 27:

Evaluation Metrics
• Cost (both compared with Typical Server)
  ‐ Memory cost savings
  ‐ Server HW cost savings
• Reliability
  ‐ Crashes/server/month
  ‐ Single server availability
  ‐ # incorrect/million queries

Page 28:

Improving Server HW Cost Savings

[Figure: server HW cost savings (%) relative to Typical Server (y-axis 0 to 9). Consumer PC: 3.3; HRM: 2.9; Less-Tested: 8.1; HRM/L: 4.7.]

Reducing the use of memory error mitigation techniques in part of the memory space can save a noticeable amount of server HW cost
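Two of the bars can be reproduced from the design parameters with simple arithmetic. The sketch below scales memory cost savings by the DRAM share of server cost, and assumes that NoECC and less-tested savings compose multiplicatively. That composition rule is an assumption made here; it is consistent with the 3.3 and 8.1 values in the chart but is not stated in the slides. The mixed HRM configurations are not modeled.

```python
DRAM_SHARE = 0.30           # DRAM / server HW cost
NOECC_SAVINGS = 0.111       # memory cost savings from dropping ECC
LESS_TESTED_SAVINGS = 0.18  # memory cost savings from less-tested chips

def combine(*savings):
    """Compose independent savings multiplicatively (assumption)."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining

def server_savings(memory_savings, dram_share=DRAM_SHARE):
    """Server HW savings when only the memory portion gets cheaper."""
    return dram_share * memory_savings

consumer_pc = server_savings(NOECC_SAVINGS)
less_tested = server_savings(combine(NOECC_SAVINGS, LESS_TESTED_SAVINGS))

print(f"Consumer PC: {100 * consumer_pc:.1f}%")
print(f"Less-Tested: {100 * less_tested:.1f}%")
```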

Page 29:

Achieving Target Availability

[Figure: single server availability (%) for the evaluated systems (y-axis 97 to 100). Single server availability target: 99.90%.]

HRM systems are flexible to adjust and can achieve the availability target

Page 30:

Achieving Acceptable Correctness

[Figure: number of incorrect responses per million queries for the evaluated systems (log-scale y-axis, 1 to 1000). Data labels as extracted: 33, 9, 163, 12.]

HRM systems can achieve acceptable correctness

Page 31:

Evaluation Results

[Figure: chart comparing Typical Server, Consumer PC, HRM, Less-Tested (L), and HRM/L. Outer is better; inner is worse.]

Bigger area means better tradeoff

Page 32:

Other Results and Findings in the Paper
• Characterization of applications' reactions to memory errors
  ‐ Finding: Quick-to-crash vs. periodically incorrect behavior
• Characterization of most common types of memory errors including single-bit soft/hard errors, multi-bit hard errors
  ‐ Finding: More severe errors mainly decrease correctness
• Characterization of how errors are masked
  ‐ Finding: Some memory regions are safer than others
• Discussion about heterogeneous reliability design dimensions, techniques, and their benefits and tradeoffs

Page 33:

Conclusion
• Our Goal: Reduce datacenter cost; meet availability target
• Characterized application-level memory error tolerance of 3 modern data-intensive workloads
• Proposed Heterogeneous-Reliability Memory (HRM)
  ‐ Store error-tolerant data in less-reliable lower-cost memory
  ‐ Store error-vulnerable data in more-reliable memory
• Evaluated example HRM systems
  ‐ Reduce server hardware cost by 4.7%
  ‐ Achieve single-server availability target of 99.90%

Page 35:

Why use a software debugger?
• Speed
  ‐ Our workloads are relatively long running
    • WebSearch – 30 minutes
    • Memcached – 10 minutes
    • GraphLab – 10 minutes
  ‐ Our workloads have large memory footprints
    • WebSearch – 46 GB
    • Memcached – 35 GB
    • GraphLab – 4 GB

Page 36:

What are the workload properties?
• WebSearch
  ‐ Repeat a real-world trace of 200,000 queries, with 400 qps
  ‐ Correctness: Top 4 most relevant documents
    • Document id
    • Relevance and popularity
• Memcached
  ‐ 30 GB of Twitter dataset
  ‐ Synthetic client workload, at 5,000 rps
  ‐ 90% read requests and 10% write requests
• GraphLab
  ‐ 11 million Twitter users' following relations, 1.3 GB dataset
  ‐ TunkRank algorithm
  ‐ Correctness: 100 most influential users and their scores

Page 37:

How many errors are injected to each application and each memory region?
• WebSearch – 20,576
• Memcached – 983
• GraphLab – 2,159
• The number of errors injected into each memory region is proportional to its size

Application   Private   Heap    Stack    Total
WebSearch     36 GB     9 GB    60 MB    46 GB
Memcached     N/A       35 GB   132 KB   35 GB
GraphLab      N/A       4 GB    132 KB   4 GB
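The per-application counts add up to the 23,718 total quoted in the methodology, and the proportional rule can be made concrete. This sketch assumes binary units (1 GB = 1024 MB) and normalizes by the sum of the three listed WebSearch regions rather than the rounded 46 GB total; the function name `proportional_split` is illustrative, and the resulting per-region counts are estimates, not figures from the slides.

```python
INJECTED = {"WebSearch": 20_576, "Memcached": 983, "GraphLab": 2_159}
total_injected = sum(INJECTED.values())

MB = 1024 ** 2
GB = 1024 ** 3   # assumption: binary units

WEBSEARCH_REGIONS = {"Private": 36 * GB, "Heap": 9 * GB, "Stack": 60 * MB}

def proportional_split(total_errors, region_sizes):
    """Divide an error budget across regions in proportion to their size."""
    size_sum = sum(region_sizes.values())
    return {region: total_errors * size / size_sum
            for region, size in region_sizes.items()}

split = proportional_split(INJECTED["WebSearch"], WEBSEARCH_REGIONS)

print("total injected:", total_injected)
for region, count in split.items():
    print(f"{region}: about {count:.0f} errors")
```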

Page 38:

Does HRM require HW changes?

[Diagram: a CPU with three memory controllers. Mem Ctrl 0 drives Channel 0, which is populated with ECC DIMMs. Mem Ctrl 1* and Mem Ctrl 2* drive Channel 1* and Channel 2*, which are populated with DIMMs.]

* Memory controller/channel without ECC support

Page 39:

What is the injection/monitoring process?

Start
1. (Re)Start App
2. Inject Errors (Soft/Hard)
3. Run Client Workload
4. App Crash?
   ‐ YES: go back to step 1
   ‐ NO: continue to step 5
5. Compare Result with Expected Output
Repeat
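The loop in this flowchart can be sketched as a small fault-injection campaign. Everything below is a toy stand-in: the "application" is a dictionary lookup, errors are injected in-process rather than with WinDbg or GDB, and all function names are hypothetical. It only illustrates how each trial ends up counted as a crash, an incorrect result, or a correct (masked) result.

```python
import random

def start_app():
    # Step 1: (re)start a toy application that answers table[index].
    return {"index": 2, "table": [10, 20, 30, 40]}

def inject_error(app, rng):
    # Step 2: flip one of the low 4 bits in either the index or a table slot.
    bit = 1 << rng.randrange(4)
    if rng.random() < 0.5:
        app["index"] ^= bit
    else:
        slot = rng.randrange(len(app["table"]))
        app["table"][slot] ^= bit

def run_workload(app):
    # Step 3: an out-of-range index raises IndexError, modeling a crash.
    return app["table"][app["index"]]

def run_campaign(num_trials, seed=0):
    rng = random.Random(seed)
    expected = run_workload(start_app())   # error-free reference output
    counts = {"crash": 0, "incorrect": 0, "correct": 0}
    for _ in range(num_trials):
        app = start_app()
        inject_error(app, rng)
        try:
            result = run_workload(app)
        except IndexError:                 # Step 4: app crashed, restart
            counts["crash"] += 1
            continue
        # Step 5: compare the result with the expected output.
        counts["correct" if result == expected else "incorrect"] += 1
    return counts

counts = run_campaign(1000)
print(counts)
```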

Page 40:

Comparison with previous works?
• Virtualized and flexible ECC [Yoon '10]
  ‐ Requires changes to the MMU in the processor
  ‐ Performance overhead ~10% over NoECC
• Our work: HRM
  ‐ Minimal changes to memory controller to enable different ECC on different channels
  ‐ Low performance overhead
  ‐ Enables the use of less-tested memory

Page 41:

Other Results

Page 42:

Variation within application

Page 43:

Variation within application

Page 44:

Other types of memory errors

Page 45:

Other types of memory errors

Page 46:

Explicit Recovery

Page 47:

Quick to crash vs. periodic incorrect

Page 48:

Safe ratio: masked by overwrite

Page 49:

Potential to tolerate memory errors

Page 50:

Design dimension

Page 51:

Design dimension