Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*, Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation. 10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004. * Also, University of Michigan, Ann Arbor


TRANSCRIPT

Page 1: Cache Scrubbing in Microprocessors: Myth or Necessity?

Practical Experience Report

Shubu Mukherjee

Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*

Fault Aware Computing Technology (FACT) Group

Massachusetts Microprocessor Design Center, Intel Corporation

10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004

* Also, University of Michigan, Ann Arbor

Page 2: Summary

SECDED ECC (single error correction, double error detection)
– commonly used in on-chip caches
– interleaving converts spatial multi-bit errors to multiple single bit errors

Scrubbing
– periodically read cache blocks and correct all single bit errors
– this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors

Our conclusion: given detected error target of 10 year MTTF
– Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Page 3: Origin of Cosmic Rays

Cosmic rays come from deep space

[Figure: protons (p) and neutrons (n) descending toward the Earth's surface]

Page 4: Impact of Neutron Strike on a Si Device

Secondary source of upsets: alpha particles from packaging

Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

[Figure: neutron strike on a transistor device, with + and - charges between source and drain]

Page 5: Strike Changes State of a Single Bit

[Figure: a single stored bit changing state, labeled 0 and 1]

Example Solution
– Error correction codes (ECC) for single bit correction
– Overhead = 7 bits for 64 bits of data
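The 7-bit figure is consistent with the Hamming bound for single-error correction: r check bits are enough for m data bits when 2^r >= m + r + 1. The short Python sketch below just works through that arithmetic. The function names are illustrative and not from the slides, and the extra overall parity bit for double-error detection is the usual extended-Hamming construction, assumed here rather than taken from the talk.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound, single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC check bits plus one overall parity bit for double-error detection (assumed scheme)."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        sec = sec_check_bits(m)
        print(f"{m:3d} data bits: SEC = {sec}, SECDED = {sec + 1}, "
              f"SECDED overhead = {(sec + 1) / m:.1%}")
```

For 64 data bits the bound gives 7 check bits, and the added parity bit gives the 8 bits per 64 quoted for SECDED.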

Page 6: Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error

Example solution
– SECDED ECC (single error correction, double error detection): 8 bits of code per 64 bits of data
– Interleaving for the more general case …

[Figure: two adjacent stored bits changing state]

Page 7: Interleaving bits

Interleaving converts a spatial multi-bit error into multiple single bit errors

[Figure: a row of bits; positions marked X are covered with a single ECC code, positions marked + are covered with a different ECC code]
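One way to picture interleaving is a physical row that holds bits from several codewords in round-robin order, so that neighboring cells belong to different ECC codes. The sketch below is illustrative only: the interleaving degree, the bit positions, and the helper names are assumptions, not values from the slides.

```python
from collections import Counter

INTERLEAVE = 4  # assumed number of codewords sharing one physical row


def codeword_of(physical_bit: int, degree: int = INTERLEAVE) -> int:
    """Map a physical bit position to the codeword that protects it (round-robin)."""
    return physical_bit % degree


def errors_per_codeword(flipped_bits, degree: int = INTERLEAVE) -> Counter:
    """Count how many flipped bits land in each codeword."""
    return Counter(codeword_of(b, degree) for b in flipped_bits)


if __name__ == "__main__":
    burst = [10, 11, 12]  # one strike flipping three adjacent physical bits
    print(errors_per_codeword(burst))            # interleaved: one error per codeword
    print(errors_per_codeword(burst, degree=1))  # no interleaving: three errors in one codeword
```

As long as the burst is no wider than the interleaving degree, each codeword sees at most one flipped bit, which SECDED can correct.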

Page 8: Two Separate Strikes on Different Bits: Temporal Double Bit Errors

SECDED ECC (single error correction, double error detection)
– could detect error, but cannot correct the error, if errors accumulate
– single bit correctable error becomes a double bit detectable error

[Figure: two strikes on the same block, labeled Cycle 100 and Cycle 1,000,000]

Page 9: Solutions for Temporal Double Bit Errors

Natural Effects
– whenever a processor reads a cache block, we can correct the single bit error
– check for errors when cache blocks are replaced from the cache

More Powerful ECC
– SECDED ECC requires 8 bits per 64 bits: 7 bits for single bit correction, 8th bit for double bit detection; overhead = 13%
– ECC with two bit correction requires 12 bits per 64 bits; overhead = 19%

Scrubbing
– Periodically read memory and correct all single bit errors
– Disallows accumulation of temporal double bit errors
– Standard technique in main memories (DRAMs)
– Our calculations (later) will assume the worst case for soft errors: cache blocks don’t get scrubbed naturally

Page 10: Memory Hierarchy of a Processor

Do we need to scrub on-chip caches?
– depends on the size of these caches

[Figure: memory hierarchy: CPU, L1 Cache (kilobytes), L2 Cache (megabytes), Main Memory (gigabytes)]

Page 11: Detected Unrecoverable Error (DUE)

Interval-based
– MTTF = Mean Time to Failure
– E.g., goal = 10 years MTTF for application crash
– Bossen, IRPS 2002

Rate-based
– FIT = Failure in Time = 1 failure in a billion hours
– 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FITs

Hypothetical Example: Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT, total of 210 FIT (as given on the slide; the three listed values sum to 220 FIT)
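The two ways of stating a reliability target convert into each other with one division. The following Python sketch shows the conversion under the slide's own conventions (a 365-day year, 1 FIT = 1 failure per 10^9 hours); the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # the slides use a 365-day year
FIT_HOURS = 1e9             # 1 FIT = 1 failure per 10**9 hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert an interval-based target (MTTF in years) to a rate in FIT."""
    return FIT_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a rate in FIT back to an MTTF in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


if __name__ == "__main__":
    print(f"10-year MTTF = {mttf_years_to_fit(10):,.1f} FIT")
    print(f"210 FIT      = {fit_to_mttf_years(210):,.1f} years MTTF")
```

The rate form is convenient because FIT contributions from separate structures simply add, whereas MTTF values do not.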

Page 12: MTTF calculations: probabilities

1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
Q = # quadwords in cache memory
$P_d[n]$ = probability that a sequence of n strikes causes n - 1 single bit errors, followed by a double bit error on the nth strike

$P_d[1] = 0$

$P_d[2] = 1/Q$

First strike, probability = Q/Q (any quadword)
Second strike, probability = 1/Q (the same quadword again)
$P_d[2] = (Q/Q) \cdot (1/Q) = 1/Q$

Page 13: MTTF calculations: probabilities

$P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$

First strike, probability = Q/Q
Second strike, probability = (Q-1)/Q (a different quadword)
Third strike, probability = 2/Q (one of the two quadwords already struck)

$P_d[3] = \frac{Q}{Q} \cdot \frac{Q-1}{Q} \cdot \frac{2}{Q}$

Page 14: MTTF calculations: probabilities

$P_d[1] = 0$

$P_d[2] = 1/Q$

$P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$

$P_d[4] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{3}{Q}$

…

$P_d[n] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{Q-3}{Q} \cdots \frac{Q-n+2}{Q} \cdot \frac{n-1}{Q}$
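The general term can be checked numerically on a toy cache. The sketch below is a minimal Python rendering of the product above, assuming (as the slides do) that each strike lands on a uniformly random quadword and that any second strike on a quadword counts as a double bit error. The function name and the 8-quadword example are illustrative.

```python
def p_double(n: int, q: int) -> float:
    """P_d[n]: first n-1 strikes hit distinct quadwords, strike n hits one already struck."""
    if n < 2:
        return 0.0
    prob = 1.0
    for i in range(1, n - 1):      # factors (Q-1)/Q ... (Q-n+2)/Q
        prob *= (q - i) / q
    return prob * (n - 1) / q      # nth strike lands on one of the n-1 struck quadwords


if __name__ == "__main__":
    Q = 8  # toy cache of 8 quadwords so the whole distribution is easy to inspect
    dist = [p_double(n, Q) for n in range(1, Q + 2)]
    for n, p in enumerate(dist, start=1):
        print(f"P_d[{n}] = {p:.4f}")
    print("sum =", sum(dist))
```

Because a repeat is certain by strike Q + 1, the terms for n = 1 … Q + 1 should sum to 1, which is a quick sanity check on the formula.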

Page 15: MTTF calculations: Equation

M = mean # of single bit errors to get a double bit error
= expected value of random variable with $P_d[n]$ as the probability distribution function
– M can be easily generated using a computer program
– MTTF (double bit error) = M * MTTF (single bit error)

For a 32 megabyte cache & FIT/bit = 0.001 [Normand 1996, Tosaka 1996]

$\text{MTTF (double bit error)} = M \times \text{MTTF (single bit error)} = 2567 \times \frac{1}{\text{Cache FIT}} = 2567 \times \frac{10^9}{0.001 \times 2^{22} \times 72 \times 24 \times 365} = 970 \text{ years}$

Saleh, et al.’s 1990 closed form equation

$\text{MTTF (double bit error)} = \frac{1}{72 f} \sqrt{\frac{\pi}{2Q}} = 970 \text{ years}$, f = FIT/bit
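The "computer program" is not shown in the slides; the Python sketch below is one possible version, not the authors' code. It sums n * P_d[n] until the remaining tail is negligible, then applies the slide's arithmetic for a 32 megabyte cache. The closed form is coded as (1 / (72 f)) * sqrt(pi / (2Q)) with f converted from FIT to failures per hour; that reading is an assumption, checked only against the 970-year figure. All function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72          # 64 data bits + 8 SECDED bits


def mean_strikes_to_double(q: int, tol: float = 1e-15) -> float:
    """M = sum over n of n * P_d[n], accumulated until the remaining tail is negligible."""
    mean = 0.0
    no_repeat = 1.0             # probability the first n-1 strikes hit distinct quadwords
    n = 2
    while no_repeat > tol and n <= q + 1:
        mean += n * no_repeat * (n - 1) / q
        no_repeat *= (q - (n - 1)) / q
        n += 1
    return mean


def mttf_double_years(cache_bytes: int, fit_per_bit: float) -> float:
    """M * MTTF(single bit error), following the slide's arithmetic."""
    q = cache_bytes // 8                                   # quadwords in the cache
    cache_fit = fit_per_bit * q * BITS_PER_QUADWORD        # single bit error rate in FIT
    mttf_single_years = 1e9 / cache_fit / HOURS_PER_YEAR
    return mean_strikes_to_double(q) * mttf_single_years


def saleh_no_scrub_years(cache_bytes: int, fit_per_bit: float) -> float:
    """Closed form as read here (assumption): [1 / (72 f)] * sqrt(pi / (2Q))."""
    q = cache_bytes // 8
    f_per_hour = fit_per_bit * 1e-9                        # FIT/bit -> failures per bit-hour
    hours = math.sqrt(math.pi / (2 * q)) / (BITS_PER_QUADWORD * f_per_hour)
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    size = 32 * 2**20
    print(f"M           = {mean_strikes_to_double(size // 8):.1f}")
    print(f"summed MTTF = {mttf_double_years(size, 0.001):.1f} years")
    print(f"closed form = {saleh_no_scrub_years(size, 0.001):.1f} years")
```

Both estimates should land within about a year of the 970 years quoted, with M close to 2567.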

Page 16: Temporal Double Bit MTTF variations with cache size

[Figure: MTTF in years (axis from 10 to 10,000) versus FIT/bit (0.001 to 0.01) for 4 MB, 16 MB, 64 MB, and 256 MB caches]

FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
– higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)

Temporal double bit error has very small contribution to DUE rate
– compared to a goal of 10 years DUE MTTF
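The plotted values are not recoverable from the transcript, but the same quantities can be estimated. The Python sketch below tabulates the no-scrubbing MTTF for the four cache sizes using the closed form [1 / (72 f)] * sqrt(pi / (2Q)). Using the closed form for every point is a choice made here for brevity, not necessarily how the original chart was produced, and the function name is illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72


def no_scrub_mttf_years(cache_mb: int, fit_per_bit: float) -> float:
    """Closed-form estimate [1 / (72 f)] * sqrt(pi / (2Q)), converted to years."""
    q = cache_mb * 2**20 // 8                                   # quadwords in the cache
    per_quadword_rate = BITS_PER_QUADWORD * fit_per_bit * 1e-9  # failures per hour
    return math.sqrt(math.pi / (2 * q)) / per_quadword_rate / HOURS_PER_YEAR


if __name__ == "__main__":
    sizes_mb = (4, 16, 64, 256)
    print("FIT/bit " + "".join(f"{s:>9d} MB" for s in sizes_mb))
    for k in range(1, 11):
        f = k / 1000
        row = "".join(f"{no_scrub_mttf_years(s, f):12.0f}" for s in sizes_mb)
        print(f"{f:7.3f} {row}")
```

Under this estimate every entry, including the 256 MB cache at 0.01 FIT/bit, stays above the 10-year goal.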

Page 17: MTTF with Scrubbing

I = scrubbing interval, scrub at the end of each interval I
N = # scrubbing intervals to reach MTTF
= expected value of random variable with probability distribution function $(1 - pf)^N \cdot pf$, where pf = probability of a temporal double bit error at the end of an interval

Assuming 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrub once a year (I = 1 year)

MTTF (double bit error) = N * I = 2281 * 1 = 2281 years

Saleh, et al. 1990 closed form equation

$\frac{2}{Q \cdot I \cdot (f \cdot 72)^2} = 2341 \text{ years}$, f = FIT/bit

[Figure: timeline divided into consecutive scrubbing intervals I]
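The slides do not say how pf was computed. The Python sketch below uses one plausible model, stated here as an assumption: strikes on each quadword follow a Poisson process, quadwords are independent, and an interval fails if any quadword takes two or more strikes before the scrub. It takes N = 1 / pf as the mean of the geometric distribution and compares the result with the closed form. Function names are illustrative, and this model is not expected to reproduce the 2281-year figure exactly.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72


def scrubbed_mttf_years(cache_bytes: int, fit_per_bit: float, interval_years: float) -> float:
    """N * I with N = 1 / pf, under the assumed Poisson / independent-quadword model."""
    q = cache_bytes // 8
    # expected strikes on one quadword during one scrubbing interval
    x = BITS_PER_QUADWORD * fit_per_bit * 1e-9 * interval_years * HOURS_PER_YEAR
    # P(a given quadword takes >= 2 strikes in one interval) = 1 - e^-x - x e^-x
    p_quadword = -math.expm1(-x) - x * math.exp(-x)
    # pf = P(at least one quadword takes >= 2 strikes before the scrub)
    pf = -math.expm1(q * math.log1p(-p_quadword))
    return interval_years / pf


def saleh_scrubbed_years(cache_bytes: int, fit_per_bit: float, interval_years: float) -> float:
    """Closed form from the slide: 2 / [Q * I * (72 f)^2], with f in failures per hour."""
    q = cache_bytes // 8
    rate = BITS_PER_QUADWORD * fit_per_bit * 1e-9
    interval_hours = interval_years * HOURS_PER_YEAR
    return 2 / (q * interval_hours * rate**2) / HOURS_PER_YEAR


if __name__ == "__main__":
    size = 16 * 2**30
    print(f"interval model = {scrubbed_mttf_years(size, 0.001, 1.0):.0f} years")
    print(f"closed form    = {saleh_scrubbed_years(size, 0.001, 1.0):.0f} years")
```

With these assumptions the interval model lands close to the closed-form 2341 years, a few percent above the 2281 years reported from the authors' own calculation.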

Page 18: Impact of Scrubbing on Temporal Double Bit MTTF

FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
– higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)

For 16 gigabytes of cache, scrubbing can help
– compared to a DUE MTTF goal of 10 years

[Figure: MTTF in years (axis from 1 to 1,000,000) versus FIT/bit (0.001 to 0.01) for a 16 Gigabyte Cache; curves: scrub once a day, scrub once a month, scrub once a year, with no scrubbing]
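The four curves can be approximated from the two closed forms quoted in the talk. The Python sketch below tabulates them for a 16 gigabyte cache; the 30-day month, the function names, and the use of closed forms for every point are assumptions made here, not details from the slides.

```python
import math

HOURS_PER_YEAR = 24 * 365
Q = 16 * 2**30 // 8            # quadwords in a 16 gigabyte cache
INTERVAL_HOURS = {"day": 24, "month": 24 * 30, "year": HOURS_PER_YEAR}  # 30-day month assumed


def rate_per_quadword(fit_per_bit: float) -> float:
    """Strike rate on one 72-bit quadword, in failures per hour."""
    return 72 * fit_per_bit * 1e-9


def scrubbed_years(fit_per_bit: float, interval_hours: float) -> float:
    """Closed form with scrubbing: 2 / [Q * I * (72 f)^2]."""
    return 2 / (Q * interval_hours * rate_per_quadword(fit_per_bit) ** 2) / HOURS_PER_YEAR


def unscrubbed_years(fit_per_bit: float) -> float:
    """Closed form without scrubbing: [1 / (72 f)] * sqrt(pi / (2Q))."""
    return math.sqrt(math.pi / (2 * Q)) / rate_per_quadword(fit_per_bit) / HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"{'FIT/bit':>7} {'day':>12} {'month':>12} {'year':>12} {'no scrub':>12}")
    for k in range(1, 11):
        f = k / 1000
        cells = [scrubbed_years(f, h) for h in INTERVAL_HOURS.values()] + [unscrubbed_years(f)]
        print(f"{f:7.3f} " + " ".join(f"{c:12.1f}" for c in cells))
```

At 0.01 FIT/bit the unscrubbed estimate falls below the 10-year goal, while scrubbing once a year already brings it back above, which matches the slide's point that scrubbing can help at this cache size.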

Page 19: Summary

SECDED ECC (single error correction, double error detection)
– commonly used in on-chip caches
– interleaving converts spatial multi-bit errors to multiple single bit errors

Scrubbing
– periodically read cache blocks and correct all single bit errors
– this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors

Our conclusion: given detected error target of 10 year MTTF
– Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Page 20: BACKUPS

Page 21: Raw soft error rate: 0.001 – 0.010 FIT/bit

Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G.A. Woffinden, and S.A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” VLSI Symposium on VLSI Technology Digest of Technical Papers, 1996.

Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.