cs 110 computer architecture lecture 25 · lecture 25: dependability and raid instructor: ......

58
CS 110 Computer Architecture Lecture 25: Dependability and RAID Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University 1 Slides based on UC Berkley's CS61C

Upload: others

Post on 17-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CS110ComputerArchitecture

Lecture25:DependabilityandRAID

Instructor:SörenSchwertfeger

http://shtech.org/courses/ca/

School of Information Science and Technology SIST

ShanghaiTech University

1Slides based on UC Berkley's CS61C

Page 2: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ReviewLastLecture• I/Ogivescomputerstheir5senses• I/Ospeedrangeis100-milliontoone• Pollingvs.Interrupts• DMAtoavoidwastingCPUtimeondatatransfers• Disksforpersistentstorage,replacedbyflash• Networks:computer-to-computerI/O– Protocolsuitesallownetworkingofheterogeneouscomponents.Abstraction!!!

2

Page 3: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ProtocolFamilyConcept

Message Message

TH Message TH Message TH TH

Actual Actual

Physical

Message TH Message THActual ActualLogical

Logical

3

Eachlowerlevelofstack“encapsulates”informationfromlayerabovebyaddingheaderandtrailer.

Page 4: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

MostPopularProtocolforNetworkofNetworks

• TransmissionControlProtocol/InternetProtocol(TCP/IP)

• ThisprotocolfamilyisthebasisoftheInternet,aWAN(wideareanetwork)protocol• IPmakesbestefforttodeliver• Packetscanbelost,corrupted

• TCPguaranteesdelivery• TCP/IPsopopularitisusedevenwhencommunicatinglocally:evenacrosshomogeneousLAN(localareanetwork)

4

Page 5: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Message

TCP/IPpacket,Ethernetpacket,protocols

• Applicationsendsmessage

TCP data

TCP HeaderIP Header

IP DataEH

Ethernet Hdr

Ethernet Hdr• TCPbreaksinto64KiBsegments,adds20Bheader

• IPadds20Bheader,sendstonetwork• IfEthernet,brokeninto1500Bpacketswithheaders,trailers

5

Page 6: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

GreatIdea#6:DependabilityviaRedundancy

• Redundancysothatafailingpiecedoesn’tmakethewholesystemfail

6

1+1=2 1+1=2 1+1=1

1+1=22of3agree

FAIL!

Increasingtransistordensity reducesthecostofredundancy

Page 7: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

GreatIdea#6:DependabilityviaRedundancy

• Appliestoeverythingfromdatacenterstomemory– Redundantdatacenterssothatcanlose1datacenterbutInternetservicestaysonline

– RedundantroutessocanlosenodesbutInternetdoesn’tfail– Redundantdiskssothatcanlose1diskbutnotlosedata(RedundantArraysofIndependentDisks/RAID)

– Redundantmemorybitsofsothatcanlose1bitbutnodata(ErrorCorrectingCode/ECCMemory)

7

Page 8: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Dependability

• Fault:failureofacomponent– Mayormaynotleadtosystemfailure

ServiceaccomplishmentServicedelivered

asspecified

ServiceinterruptionDeviationfromspecifiedservice

FailureRestoration

8

Page 9: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

DependabilityviaRedundancy:Timevs.Space

• SpatialRedundancy– replicateddataorcheckinformationorhardwaretohandlehardandsoft(transient)failures

• TemporalRedundancy– redundancyintime(retry)tohandlesoft(transient)failures

9

Page 10: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

DependabilityMeasures

• Reliability: MeanTimeToFailure(MTTF)• Serviceinterruption: MeanTimeToRepair(MTTR)• Meantimebetweenfailures(MTBF)– MTBF=MTTF+MTTR

• Availability=MTTF/(MTTF+MTTR)• ImprovingAvailability– IncreaseMTTF: Morereliablehardware/software+FaultTolerance

– ReduceMTTR:improvedtoolsandprocessesfordiagnosisandrepair

10

Page 11: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

UnderstandingMTTF

11

ProbabilityofFailure

1

Time

Page 12: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

UnderstandingMTTF

12

ProbabilityofFailure

1

TimeMTTF

1/3 2/3

Page 13: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

AvailabilityMeasures

• Availability=MTTF/(MTTF+MTTR)as%– MTTF,MTBFusuallymeasuredinhours

• Sincehoperarelydown,shorthandis“numberof9sofavailabilityperyear”

• 1nine:90%=>36daysofrepair/year• 2nines:99%=>3.6daysofrepair/year• 3nines:99.9%=>526minutesofrepair/year• 4nines:99.99%=>53minutesofrepair/year• 5nines:99.999%=>5minutesofrepair/year

13

Page 14: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ReliabilityMeasures

• Anotherisaveragenumberoffailuresperyear:AnnualizedFailureRate(AFR)– E.g.,1000diskswith100,000hourMTTF– 365days*24hours=8760hours– (1000disks*8760hrs/year)/100,000=87.6faileddisksperyearonaverage

– 87.6/1000=8.76%annualfailurerate• Google’s2007study*foundthatactualAFRsforindividualdrivesrangedfrom1.7%forfirstyeardrivestoover8.6%forthree-yearolddrives

14

*research.google.com/archive/disk_failures.pdf

Page 15: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

DependabilityDesignPrinciple

• DesignPrinciple:Nosinglepointsoffailure– “Chainisonlyasstrongasitsweakestlink”

• DependabilityCorollaryofAmdahl’sLaw– Doesn’tmatterhowdependableyoumakeoneportionofsystem

– Dependabilitylimitedbypartyoudonotimprove

15

Page 16: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Error Detection/CorrectionCodes• Memorysystemsgenerateerrors(accidentallyflipped-bits)– DRAMs storeverylittlechargeperbit– “Soft”errorsoccuroccasionallywhencellsarestruckbyalphaparticlesorotherenvironmentalupsets

– “Hard”errorscanoccurwhenchipspermanentlyfail– Problemgetsworseasmemoriesgetdenserandlarger

• MemoriesprotectedagainstfailureswithEDC/ECC• Extrabitsareaddedtoeachdata-word– Usedtodetectand/orcorrectfaultsinthememorysystem– Eachdatawordvalue mappedto uniquecodeword– Afaultchangesvalidcodewordto invalidone,whichcanbedetected

16

Page 17: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

BlockCodePrinciples• Hammingdistance=differencein#ofbits• p =011011,q =001111,Ham.distance(p,q)=2• p=011011,q =110001,distance(p,q)=?

• Canthinkofextrabitsascreatingacodewiththedata

• Whatifminimumdistancebetweenmembersofcodeis2andgeta1-biterror? RichardHamming, 1915-98

TuringAwardWinner17

Page 18: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Parity:SimpleError-DetectionCoding• Eachdatavalue,beforeitis

writtentomemoryis“tagged”withanextrabittoforcethestoredwordtohaveevenparity:

• Eachword,asitisreadfrommemoryis“checked”byfindingitsparity(includingtheparitybit).

b7b6b5b4b3b2b1b0

+

b7b6b5b4b3b2b1b0p

+c• MinimumHammingdistanceofparitycodeis2

• Anon-zeroparitycheckindicatesanerroroccurred:– 2errors(ondifferentbits) arenotdetected– noranyevennumberoferrors,justoddnumbersoferrorsaredetected

18

p

Page 19: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ParityExample

• Data01010101• 4ones,evenparitynow• Writetomemory:010101010tokeepparityeven

• Data01010111• 5ones,oddparitynow• Writetomemory:010101111tomakeparityeven

• Readfrommemory010101010

• 4ones=>evenparity,sonoerror

• Readfrommemory110101010

• 5ones=>oddparity,soerror

• Whatiferrorinparitybit?

19

Page 20: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

SupposeWanttoCorrect1Error?

• RichardHammingcameupwithsimpletounderstandmappingtoallowErrorCorrectionatminimumdistanceof3– Singleerrorcorrection,doubleerrordetection

• Called“HammingECC”– Workedweekendsonrelaycomputerwithunreliablecardreader,frustratedwithmanualrestarting

– Gotinterestedinerrorcorrection;published1950– R.W.Hamming,“ErrorDetectingandCorrectingCodes,”TheBellSystemTechnicalJournal,Vol.XXVI,No2(April1950)pp 147-160.

20

Page 21: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Detecting/CorrectingCodeConcept

• Detection:bitpatternfailscodewordcheck• Correction:maptonearestvalidcodeword

Spaceofpossiblebitpatterns(2N)

Sparsepopulationofcodewords(2M <<2N)- withidentifiablesignature

Errorchangesbitpatterntonon-code

21

Page 22: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingDistance:8codewords

22

Page 23: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingDistance2:DetectionDetectSingleBitErrors

23

• No1biterrorgoestoanothervalidcodeword• ½codewords arevalid

InvalidCodewords

Page 24: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingDistance3:CorrectionCorrectSingleBitErrors,DetectDoubleBitErrors

24

•No2biterrorgoestoanothervalidcodeword;1biterrornear• 1/4codewords arevalid

Nearest000

(one1)

Nearest111(one0)

Page 25: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

Administrivia• FinalExam– Tuesday,June21,2016,9:00-11:00– Location:H2109+110– THREEcheatsheets(MT1,MT2,post-MT2)

• Hand-written• Project3willstillcome– Short/easy– but:– Competition:

• Slowest33percentileandbelow:80%• Fastestprogram:100%• Linearscalinginbetween.

– Time:Whatdoyouprefer?1weekonly,ortillendofexamweek?

25

Page 26: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingErrorCorrectionCode

• Useofextraparitybitstoallowthepositionidentificationofasingleerror

1. Markallbitpositionsthatarepowersof2asparitybits(positions1,2,4,8,16, …)– Startnumberingbitsat1atleft(notat0onright)

2.Allotherbitpositionsare databits(positions3,5,6,7,9,10,11,12,13,14,15, …)

3.Eachdatabitiscoveredby2ormoreparitybits

26

Page 27: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECC4. Thepositionof paritybitdetermines sequence

of databitsthatitchecks• Bit1(00012):checksbits(1,3,5,7,9,11,...)– Bitswithleastsignificantbitofaddress=1

• Bit2(00102):checksbits(2,3,6,7,10,11,14,15,…)– Bitswith2nd leastsignificantbitofaddress=1

• Bit4(01002):checksbits(4-7,12-15,20-23,...)– Bitswith3rd leastsignificantbitofaddress=1

• Bit8(10002):checksbits(8-15,24-31,40-47,...)– Bitswith4th leastsignificantbitofaddress=1

27

Page 28: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

GraphicofHammingCode

• http://en.wikipedia.org/wiki/Hamming_code

28

Page 29: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECC5.Set paritybitstocreateevenparity foreachgroup

• Abyteofdata:10011010• Createthe codedword,leavingspacesfortheparitybits:

• __1_001_1010000000000111123456789012

• Calculatetheparitybits29

Page 30: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECC• Position1checksbits1,3,5,7,9,11 (bold):? _1 _0 01 _1 01 0.setposition1toa _:_ _1 _0 01 _1 01 0

• Position2checksbits2,3,6,7,10,11 (bold):0?1_001 _101 0.setposition2toa _:0_ 1 _001 _101 0

• Position4checksbits4,5,6,7,12 (bold):011?001 _1010.setposition4toa _:011 _ 001_1010

• Position8checksbits8,9,10,11,12:0111001?1010.setposition8toa_:0111001_ 1010

30

Page 31: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECC• Position1checksbits1,3,5,7,9,11:? _1 _0 01 _1 01 0.setposition1toa0:0 _1 _0 01 _1 01 0

• Position2checksbits2,3,6,7,10,11:0?1_001 _101 0.setposition2toa1:01 1 _001 _101 0

• Position4checksbits4,5,6,7,12:011?001 _1010.setposition4toa1:0111 001_1010

• Position8checksbits8,9,10,11,12:0111001?1010.setposition8toa0:01110010 1010

31

Page 32: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECC• Finalcodeword:011100101010• Dataword: 10011010

32

Page 33: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECCErrorCheck

• Supposereceive011100101110

0 1 1 1 0 0 1 0 1 1 1 0

33

Page 34: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECCErrorCheck

• Supposereceive011100101110

34

Page 35: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECCErrorCheck

• Supposereceive0111001011100 1 0 1 1 1 √11 01 11 X-Parity2inerror

1001 0 √01110 X-Parity8inerror

• Impliesposition8+2=10isinerror011100101110

35

Page 36: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECCErrorCorrect

• Fliptheincorrectbit…011100101010

36

Page 37: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECCErrorCorrect

• Supposereceive0111001010100 1 0 1 1 1 √11 01 01 √

1001 0 √01010 √

37

Page 38: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingErrorCorrectingCode• Overheadinvolvedinsingleerror-correctioncode• Letp be totalnumberofparitybitsand d numberofdatabitsin p +d bitword

• Ifp errorcorrectionbitsaretopointto errorbit(p +d cases)+indicatethatnoerrorexists(1case),weneed:

2p >=p +d +1,thusp >=log(p +d +1)forlarged,p approacheslog(d)

• 8bitsdata=> d =8,2p =p +8+1=>p =4• 16data=>5parity,32data=>6parity,64data=>7parity

38

Page 39: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingSingle-ErrorCorrection,Double-ErrorDetection(SEC/DED)

• Adding extraparitybitcoveringtheentireword providesdoubleerrordetectionaswellassingleerrorcorrection1 2 34 5678p1 p2 d1 p3 d2 d3 d4p4

• Hammingparitybits H (p1 p2 p3)arecomputed(evenparityasusual)plusthe evenparityovertheentireword, p4:H=0 p4=0,noerrorH≠0p4=1,correctablesingleerror(oddparityif1error=>p4=1)H≠0p4=0, doubleerroroccurred(evenparityif2errors=>p4=0)H=0 p4=1, singleerroroccurredinp4bit,notinrestofword

TypicalmoderncodesinDRAMmemorysystems:64-bitdatablocks(8bytes)with72-bitcodewords(9bytes).

39

Page 40: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingSingleErrorCorrection+DoubleErrorDetection

40

1biterror(one1)Nearest0000

1biterror(one0)Nearest1111

2biterror(two0s,two1s)

HalfwayBetweenBoth

HammingDistance=4

Page 41: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

WhatifMoreThan2-BitErrors?

• Networktransmissions,disks,distributedstoragecommonfailuremodeisburstsofbiterrors,notjustoneortwobiterrors– ContiguoussequenceofB bitsinwhichfirst,lastandanynumberofintermediatebitsareinerror

– Causedbyimpulsenoiseorbyfadinginwireless– Effectisgreaterathigherdatarates

41

Page 42: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CyclicRedundancyCheck

42

Simpleexample:ParityCheckBlock

1001101001101100111100000010110111011100001111001111110000001100

00111011

Data

Check

00000000 0=Check!

1001101001101100111100000000000011011100001111001111110000001100

0011101100101101 Not0=Fail!

Page 43: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CyclicRedundancyCheck• Paritycodesnotpowerfulenoughtodetectlongrunsoferrors(alsoknownasbursterrors)

• BetterAlternative:Reed-SolomonCodes– UsedwidelyinCDs,DVDs,MagneticDisks– RS(255,223)with8-bitsymbols:eachcodeword contains255codewordbytes(223bytesaredataand32bytesareparity)

– Forthiscode:n=255,k=223,s=8,2t=32,t=16– Decodercancorrectanyerrorsinupto16bytesanywhereinthecodeword

43

Page 44: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CyclicRedundancyCheck11010011101100 000 <--- input right padded by 3 bits1011 <--- divisor01100011101100 000 <--- result 1011 <--- divisor00111011101100 000

101100010111101100 000

101100000001101100 000 <--- skip leading zeros

101100000000110100 000

101100000000011000 000

101100000000001110 000

101100000000000101 000

101 1-----------------00000000000000 100 <--- remainder

44

14databits 3checkbits 17bitstotal

3bitCRCusing thepolynomial x3 +x+1(divideby1011togetremainder)

Page 45: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CyclicRedundancyCheck

• Forblockofk bits,transmittergeneratesann-k bitframechecksequence

• Transmitsn bitsexactlydivisiblebysomenumber• Receiverdividesframebythatnumber– Ifnoremainder,assumenoerror– Easytocalculatedivisionforsomebinarynumberswithshiftregister

• Disksdetectandcorrectblocksof512byteswithcalledReedSolomoncodes≈CRC

45

Page 46: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

(InMoreDepth:CodeTypes)• LinearCodes:

CodeisgeneratedbyGandinnull-space ofH• HammingCodes:DesigntheHmatrix

– d =3Þ Columnsnonzero,Distinct– d =4Þ Columnsnonzero,Distinct,Odd-weight

• Reed-solomon codes:– BasedonpolynomialsinGF(2k)(I.e.k-bitsymbols)– Dataascoefficients,codespaceasvaluesofpolynomial:– P(x)=a0+a1x1+…ak-1xk-1– Coded:P(0),P(1),P(2)….,P(n-1)– Canrecoverpolynomialaslongasgetany k ofn– Alternatively:aslongasnomorethann-k codedsymbols

erased,canrecoverdata.• Sidenote:MultiplicationbyconstantinGF(2k)canbe

representedbyk k matrix:a×x– Decomposeunknownvectorintok bits:x=x0+2x1+…+2k-1xk-1– Eachcolumnisresultofmultiplyingaby2i

This image This image

46

Page 47: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

HammingECConyourown• TestiftheseHamming-codewordsarecorrect.Ifoneisincorrect,indicatethecorrectcodeword.Also,indicatewhattheoriginaldatawas.

• 110101100011• 111110001100• 000010001010

47

Page 48: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

EvolutionoftheDiskDrive

48IBMRAMAC305,1956

IBM3390K,1986

AppleSCSI,1986

Page 49: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

CansmallerdisksbeusedtoclosegapinperformancebetweendisksandCPUs?

ArraysofSmallDisks

49

14”10”5.25”3.5”

3.5”

DiskArray:1diskdesign

Conventional:4diskdesigns

LowEnd HighEnd

Page 50: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ReplaceSmallNumberofLargeDiskswithLargeNumberofSmallDisks!(1988Disks)

50

CapacityVolumePowerDataRateI/ORateMTTFCost

IBM3390K20GBytes97cu.ft.3KW15MB/s600I/Os/s250KHrs$250K

IBM3.5"0061320MBytes0.1cu.ft.11W1.5MB/s55I/Os/s50KHrs$2K

x7023GBytes11cu.ft.1KW120MB/s3900IOs/s???Hrs$150K

DiskArrayshavepotentialforlargedataandI/Orates,highMBpercu.ft.,highMBperKW,butwhataboutreliability?

9X3X

8X

6X

Page 51: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

RAID:RedundantArraysof(Inexpensive)Disks

• Filesare"striped"acrossmultipledisks• Redundancyyieldshighdataavailability– Availability:servicestillprovidedtouser,evenifsomecomponentsfailed

• Diskswillstillfail• Contentsreconstructedfromdataredundantlystoredinthearray=>Capacitypenaltytostoreredundantinfo=>Bandwidthpenaltytoupdateredundantinfo

51

Page 52: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

RedundantArraysofInexpensiveDisksRAID1:DiskMirroring/Shadowing

52

• Eachdiskisfullyduplicatedontoits“mirror”Veryhighavailabilitycanbeachieved

•Bandwidthsacrificeonwrite:Logicalwrite=twophysicalwritesReadsmaybeoptimized

•Mostexpensivesolution:100%capacityoverhead

recoverygroup

Page 53: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

RedundantArrayofInexpensiveDisksRAID3:ParityDisk

53

P

100100111100110110010011...

logicalrecord 10100011

11001101

10100011

11001101

Pcontainssumofotherdisksperstripemod2(“parity”)Ifdiskfails,subtractPfromsumofotherdiskstofindmissinginformation

Stripedphysicalrecords

Page 54: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

RedundantArraysofInexpensiveDisksRAID4:HighI/ORateParity

D0 D1 D2 D3 P

D4 D5 D6 PD7

D8 D9 PD10 D11

D12 PD13 D14 D15

PD16 D17 D18 D19

D20 D21 D22 D23 P...

.

.

.

.

.

.

.

.

.

.

.

.DiskColumns

IncreasingLogicalDiskAddress

Stripe

Insidesof5disks

Example:smallreadD0&D5, largewriteD12-D15

54

Page 55: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

InspirationforRAID5• RAID4workswellforsmallreads• Smallwrites(writetoonedisk):– Option1:readotherdatadisks,createnewsumandwritetoParityDisk

– Option2:sincePhasoldsum,compareolddatatonewdata,addthedifferencetoP

• SmallwritesarelimitedbyParityDisk:WritetoD0,D5bothalsowritetoPdisk

55

D0 D1 D2 D3 P

D4 D5 D6 PD7

Page 56: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

RAID5:HighI/ORateInterleavedParity

56

Independentwritespossiblebecauseofinterleavedparity

D0 D1 D2 D3 P

D4 D5 D6 P D7

D8 D9 P D10 D11

D12 P D13 D14 D15

P D16 D17 D18 D19

D20 D21 D22 D23 P...

.

.

.

.

.

.

.

.

.

.

.

.DiskColumns

IncreasingLogicalDiskAddresses

Example:writetoD0,D5usesdisks0,1,3,4

Page 57: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

ProblemsofDiskArrays: SmallWrites

D0 D1 D2 D3 PD0'

+

+

D0' D1 D2 D3 P'

newdata

olddata

oldparity

XOR

XOR

(1.Read) (2.Read)

(3.Write) (4.Write)

RAID-5:SmallWriteAlgorithm

1LogicalWrite=2PhysicalReads+2PhysicalWrites

57

Page 58: CS 110 Computer Architecture Lecture 25 · Lecture 25: Dependability and RAID Instructor: ... Slides based on UC Berkley's CS61C. Review Last Lecture • I/O gives computers their

And,inConclusion,…• GreatIdea:RedundancytoGetDependability– Spatial(extrahardware)andTemporal(retryiferror)

• Reliability:MTTF&AnnualizedFailureRate(AFR)• Availability:%uptime(MTTF-MTTR/MTTF)• Memory– Hammingdistance2:ParityforSingleErrorDetect– Hammingdistance3:SingleErrorCorrectionCode+encodebitpositionoferror

• Treatdiskslikememory,exceptyouknowwhenadiskhasfailed—erasuremakesparityanErrorCorrectingCode

• RAID-2,-3,-4,-5:Interleaveddataandparity

58