Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from the Field

Post on 22-Dec-2015


Justin Meza, Qiang Wu, Sanjeev Kumar, Onur Mutlu

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Overview

Study of DRAM reliability:

▪ on modern devices and workloads

▪ at a large scale in the field

Overview: new reliability trends

▪ Error/failure occurrence: Errors follow a power-law distribution and a large number of errors occur due to sockets/channels
▪ Technology scaling: We find that newer cell fabrication technologies have higher failure rates
▪ Architecture & workload: Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
▪ Modeling errors: We have made publicly available a statistical model for assessing server memory reliability
▪ Page offlining at scale: First large-scale study of page offlining; real-world limitations of technique

Outline

▪ background and motivation
▪ server memory organization
▪ error collection/analysis methodology
▪ memory reliability trends
▪ summary

Background and motivation

DRAM errors are common

▪ examined extensively in prior work
  ▪ charged particles, wear-out
  ▪ variable retention time (next talk)
▪ error correcting codes
  ▪ used to detect and correct errors
  ▪ require additional storage overheads

Our goal

Strengthen understanding of DRAM reliability by studying:

▪ new trends in DRAM errors
  ▪ modern devices and workloads
▪ at a large scale
  ▪ billions of device-days, across 14 months

Our main contributions

▪ identified new DRAM failure trends
▪ developed a model for DRAM errors
▪ evaluated page offlining at scale

Server memory organization

▪ Socket
▪ Memory channels
▪ DIMM slots
▪ DIMM
▪ Chip
▪ Banks
▪ Rows and columns
▪ Cell

User data

ECC metadata: additional 12.5% overhead
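
The 12.5% figure can be checked with a line of arithmetic. The sketch below assumes the common ECC DIMM layout of 8 check bits stored alongside every 64 bits of user data; the slide gives only the percentage, so the 64/8 split and the function name `ecc_overhead` are assumptions made for illustration.

```python
def ecc_overhead(data_bits: int, check_bits: int) -> float:
    """Return ECC storage overhead as a fraction of the user data size."""
    return check_bits / data_bits


# Assumed layout (not stated on the slide): 64 data bits + 8 check bits per word.
overhead = ecc_overhead(data_bits=64, check_bits=8)
print(f"{overhead:.1%}")  # prints "12.5%"
```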

Reliability events

Fault
▪ the underlying cause of an error
▪ DRAM cell unreliably stores charge

Error
▪ the manifestation of a fault
▪ permanent: every time
▪ transient: only some of the time

Error collection/analysis methodology

DRAM error measurement

▪ measured every correctable error
  ▪ across Facebook's fleet
  ▪ for 14 months
  ▪ metadata associated with each error
▪ parallelized Map-Reduce to process
▪ used R for further analysis
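
As a rough illustration of the Map-Reduce step, the sketch below counts correctable errors per server across log shards. The record format (a dict with a `server` field), the shard contents, and the function names are hypothetical; the slides do not describe the actual pipeline, and a real deployment would distribute the map step across machines.

```python
from collections import Counter
from functools import reduce


def map_shard(records):
    """Map step: count correctable errors per server within one log shard."""
    return Counter(record["server"] for record in records)


def reduce_counts(partials):
    """Reduce step: merge per-shard counts into fleet-wide totals."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical error records, one dict per logged correctable error.
shards = [
    [{"server": "s1"}, {"server": "s1"}, {"server": "s2"}],
    [{"server": "s2"}, {"server": "s3"}],
]
totals = reduce_counts(map_shard(shard) for shard in shards)
```

Because each shard is mapped independently, the map step can be parallelized without changing the reduce logic.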

System characteristics

▪ 6 different system configurations
  ▪ Web, Hadoop, Ingest, Database, Cache, Media
  ▪ diverse CPU/memory/storage requirements
▪ modern DRAM devices
  ▪ DDR3 communication protocol
    ▪ (more aggressive clock frequencies)
  ▪ diverse organizations (banks, ranks, ...)
▪ previously unexamined characteristics
  ▪ density, # of chips, transfer width, workload

Memory reliability trends

Error/failure occurrence

Server error rate

[Figure: server error rate; annotated values: 3% and 0.03%]

Memory error distribution

[Figure: memory error distribution]

Decreasing hazard rate

How are errors mapped to memory organization?
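
One way to see why a power-law distribution implies a decreasing hazard rate is to work it out for a Pareto distribution. The slides say only "power-law", so the Pareto form and the symbols $x_m$ (minimum value) and $\alpha$ (shape) are assumptions made for this sketch.

```latex
% Pareto survival function and density, for x >= x_m:
S(x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad
f(x) = -\frac{dS}{dx} = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}

% Hazard rate: density conditioned on having survived to x
h(x) = \frac{f(x)}{S(x)} = \frac{\alpha}{x}
```

Since $h(x) = \alpha / x$ falls as $x$ grows, a server that has already accumulated many errors becomes less likely to stop at its current count, which matches the "decreasing hazard rate" label.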

Sockets/channels: many errors

Not accounted for in prior chip-level studies

At what rate do components fail on servers?

Bank/cell/spurious failures are common

[Figure: axes are # of servers and # of errors]

Denial-of-service–like behavior

What factors contribute to memory failures at scale?

Analytical methodology

▪ measure server characteristics
  ▪ not feasible to examine every server
  ▪ examined all servers with errors (error group)
  ▪ sampled servers without errors (control group)
▪ bucket devices based on characteristics
▪ measure relative failure rate
  ▪ of error group vs. control group
  ▪ within each bucket
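
The bucketing steps above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the normalization to the lowest nonzero bucket, the `sampling_rate` parameter, the record format, and the example data are all assumptions.

```python
from collections import defaultdict


def relative_failure_rates(error_group, control_group, sampling_rate, key):
    """Estimate per-bucket failure rates, normalized to the lowest nonzero bucket.

    error_group:   every server that reported an error.
    control_group: a random sample of servers without errors.
    sampling_rate: fraction of error-free servers in the sample (assumed known).
    key:           the characteristic used for bucketing, e.g. "density".
    """
    failing = defaultdict(int)
    healthy = defaultdict(float)
    for server in error_group:
        failing[server[key]] += 1
    for server in control_group:
        # Each sampled server stands in for 1/sampling_rate error-free servers.
        healthy[server[key]] += 1.0 / sampling_rate

    rates = {}
    for bucket in set(failing) | set(healthy):
        rates[bucket] = failing[bucket] / (failing[bucket] + healthy[bucket])

    nonzero = [rate for rate in rates.values() if rate > 0]
    if not nonzero:
        return rates
    baseline = min(nonzero)
    return {bucket: rate / baseline for bucket, rate in rates.items()}


# Hypothetical servers, bucketed by DRAM chip density.
errors = [{"density": "4Gb"}, {"density": "4Gb"}, {"density": "2Gb"}]
control = [{"density": "4Gb"}, {"density": "2Gb"}, {"density": "2Gb"}]
rates = relative_failure_rates(errors, control, sampling_rate=0.1, key="density")
```

Scaling the control group by the sampling rate keeps the estimate comparable across buckets even though only a fraction of error-free servers is examined.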

Technology scaling

Prior work found inconclusive trends with respect to memory capacity

Examine a characteristic more closely related to cell fabrication technology

Use DRAM chip density to examine technology scaling

(closely related to fabrication technology)

Intuition: quadratic increase in capacity

We find that newer cell fabrication technologies have higher failure rates

Architecture & workload

DIMM architecture

▪ chips per DIMM, transfer width
  ▪ 8 to 48 chips
  ▪ x4, x8 = 4 or 8 bits per cycle
▪ electrical implications

Does DIMM organization affect memory reliability?

No consistent trend across only chips per DIMM

More chips ➔ higher failure rate

More bits per cycle ➔ higher failure rate

Intuition: increased electrical loading

Workload dependence

▪ prior studies: homogeneous workloads
  ▪ web search and scientific
▪ warehouse-scale data centers:
  ▪ web, hadoop, ingest, database, cache, media

What effect do heterogeneous workloads have on reliability?

No consistent trend across CPU/memory utilization

Varies by up to 6.5x

Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability

Modeling errors

A model for server failure

▪ use statistical regression model
  ▪ compare control group vs. error group
  ▪ linear regression in R
  ▪ trained using data from analysis
▪ enable exploratory analysis
  ▪ high perf. vs. low power systems
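
To make the regression idea concrete, here is a minimal single-feature least-squares fit. The slides state that the model was built with linear regression in R and takes several inputs; this Python sketch with one input is a swapped-in simplification, and the function name and data points are hypothetical, not results from the study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted relative failure rate for a server with characteristic x."""
    return slope * x + intercept


# Hypothetical training data: chip density (Gb) vs. relative failure rate.
densities = [1, 2, 4]
relative_rates = [1.0, 1.5, 2.5]
slope, intercept = fit_line(densities, relative_rates)
```

Once fitted, `predict` supports the kind of exploratory "what if" comparison the slide mentions, for example contrasting two candidate system configurations.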

Memory error model

[Figure: inputs (Density, Chips, Age, ...) feed the memory error model, which outputs the relative server failure rate]

Available online

http://www.ece.cmu.edu/~safari/tools/memerr/

We have made publicly available a statistical model for assessing server memory reliability

Page offlining at scale

Prior page offlining work

▪ [Tang+, DSN'06] proposed technique
  ▪ "retire" faulty pages using OS
  ▪ do not allow software to allocate them
▪ [Hwang+, ASPLOS'12] simulated eval.
  ▪ error traces from Google and IBM
  ▪ recommended retirement on first error
    ▪ large number of cell/spurious errors

How effective is page offlining in the wild?
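
The "retirement on first error" policy can be illustrated with a small trace-driven simulation. This is a sketch under simplifying assumptions, not the evaluation from the cited work or from this study: the trace is a hypothetical list of page numbers, one per error, and `unofflinable` is an invented parameter modeling pages the OS could not retire.

```python
def simulate_offlining(error_pages, unofflinable=frozenset()):
    """Count the errors still observed when pages are retired on first error.

    error_pages:  sequence of page numbers, one entry per error, in time order.
    unofflinable: pages the OS fails to retire (assumption for illustration).
    """
    retired = set()
    observed = 0
    for page in error_pages:
        if page in retired:
            continue  # page is no longer allocated, so this error is avoided
        observed += 1
        if page not in unofflinable:
            retired.add(page)
    return observed


# Hypothetical trace: page 1 errs three times, page 2 once, page 3 twice.
trace = [1, 1, 1, 2, 3, 3]
with_offlining = simulate_offlining(trace)
with_failed_attempts = simulate_offlining(trace, unofflinable={1})
reduction = 1 - with_offlining / len(trace)
```

The second call shows why failed offlining attempts matter: a single page that cannot be retired keeps contributing every one of its errors.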

Cluster of 12,276 servers

[Figure: annotated change: -67%]

Prior work: -86% to -94%

6% of page offlining attempts failed due to OS

First large-scale study of page offlining; real-world limitations of technique

More results in paper

▪ Vendors
▪ Age
▪ Processor cores
▪ Correlation analysis
▪ Memory model case study

Summary

▪ Modern systems
▪ Large scale

New reliability trends:

▪ Error/failure occurrence: Errors follow a power-law distribution and a large number of errors occur due to sockets/channels
▪ Technology scaling: We find that newer cell fabrication technologies have higher failure rates
▪ Architecture & workload: Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
▪ Modeling errors: We have made publicly available a statistical model for assessing server memory reliability
▪ Page offlining at scale: First large-scale study of page offlining; real-world limitations of technique

Backup slides

Decreasing hazard rate

Errors    54,326   0     2     10
Density   4Gb      1Gb   2Gb   2Gb

[Figure: errors plotted against density, with entries grouped into density buckets]

Case study

[Figure: model inputs and output]

Do CPUs or density have a higher impact?

Exploratory analysis