
Page 1:

RAID: HIGH PERFORMANCE, RELIABLE SECONDARY STORAGE

P. M. Chen, U. Michigan
E. K. Lee, DEC SRC
G. A. Gibson, CMU
R. H. Katz, U. C. Berkeley
D. A. Patterson, U. C. Berkeley

Page 2:

Highlights

• The seven RAID organizations
• Why RAID-1, RAID-3 and RAID-5 are the most interesting
• The small write problem occurring with RAID-5
  – Possible solutions
• Review of actual implementations

Page 3:

Original Motivation

• Replacing large and expensive mainframe hard drives (IBM 3380) by several cheaper Winchester disk drives

• Will work but introduces a data reliability problem:
  – Assume the MTTF of a single disk drive is 30,000 hours
  – The MTTDL for a set of n drives is then 30,000/n hours
• n = 10 means an MTTDL of 3,000 hours
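A minimal sketch of the arithmetic on this slide; the function name is just illustrative:

```python
def mttdl_without_redundancy(mttf_hours: float, n_drives: int) -> float:
    """With no redundancy, any single failure loses data, so the mean
    time to data loss is the mean time to the first failure: MTTF / n."""
    return mttf_hours / n_drives

# Slide example: 10 drives, each with a 30,000-hour MTTF
print(mttdl_without_redundancy(30_000, 10))   # 3000.0 hours (about 4 months)
```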

Page 4:

Today’s Motivation

• “Cheap” SCSI hard drives are now big enough for most applications

• We use RAID today for
  – Increasing disk throughput by allowing parallel access
  – Eliminating the need to make disk backups
• Disk drives are too big to be backed up in an efficient fashion

Page 5:

RAID 0

• Spread data over multiple disk drives
• Advantages
  – Simple to implement
  – Fast
• Disadvantage
  – Very unreliable
• RAID 0 with n disks has an MTTF equal to 1/n of the MTTF of a single disk

Page 6:

RAID 1

• Mirroring
  – Two copies of each disk block on two separate drives
• Advantages
  – Simple to implement and fault-tolerant
• Disadvantage
  – Requires twice the disk capacity of normal file systems

Page 7:

RAID 2

• Instead of duplicating the data blocks we use an error correction code

• Very bad idea because disk drives either work correctly or do not work at all
  – Only possible errors are omission errors
  – We need an omission correction code
• A parity bit is enough to correct a single omission
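A tiny sketch of why a single parity bit can correct one known omission (an erasure); the values and names are purely illustrative:

```python
from functools import reduce
from operator import xor

data_bits = [1, 0, 1, 1]               # one bit from each data drive
parity = reduce(xor, data_bits)        # bit stored on the parity drive

# Drive 2 fails: we know WHICH bit is missing (an omission), so the
# XOR of the surviving bits with the parity bit recovers it.
surviving = data_bits[:2] + data_bits[3:]
recovered = reduce(xor, surviving) ^ parity
assert recovered == data_bits[2]
```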

Page 8:

RAID 2

Page 9:

RAID 3

• Requires N+1 disk drives
  – N drives contain data
    • 1/N of each data block on each drive
    • Block b[k] is now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
  – Parity drive contains the exclusive or of these N fragments:
    p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
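A short sketch of the fragment-level parity just described, working byte by byte; the helper names are hypothetical:

```python
def split_block(block: bytes, n: int) -> list[bytes]:
    """Partition block b[k] into N equal fragments b[k,1] ... b[k,N]
    (assumes the block size is a multiple of N)."""
    size = len(block) // n
    return [block[i * size:(i + 1) * size] for i in range(n)]

def parity_of(fragments: list[bytes]) -> bytes:
    """p[k] = b[k,1] xor b[k,2] xor ... xor b[k,N], byte by byte."""
    out = bytearray(len(fragments[0]))
    for fragment in fragments:
        for i, byte in enumerate(fragment):
            out[i] ^= byte
    return bytes(out)

fragments = split_block(b"twenty-four byte example", 4)
parity = parity_of(fragments)
# Any single lost fragment can be rebuilt by XORing the parity
# fragment with the surviving fragments.
```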

Page 10:

RAID 3

A stripe consists of a single block

Page 11:

RAID 4

• Requires N+1 disk drives
  – N drives contain data (individual blocks)
  – Parity drive contains the exclusive or of the N blocks in the stripe:
    p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]

Page 12:

RAID 4

A stripe now contains multiple blocks

Page 13:

RAID 5

• The single parity drive of RAID-4 is involved in every write
  – Will limit parallelism
• RAID-5 distributes the parity blocks among the N+1 drives
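A minimal sketch of one way to rotate the parity blocks; round-robin placement is an assumption here, since the slide does not fix a particular layout:

```python
def parity_drive(stripe: int, drives: int) -> int:
    """Drive that holds the parity block of a given stripe when parity
    is rotated round-robin over all N+1 drives."""
    return (drives - 1 - stripe) % drives

# With 5 drives the parity moves 4, 3, 2, 1, 0, 4, ... so no single
# drive sits on the write path of every stripe.
print([parity_drive(s, 5) for s in range(6)])   # [4, 3, 2, 1, 0, 4]
```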

Page 14:

RAID 5

Page 15:

The small write problem

• Specific to RAID 5
• Happens when we want to update a single block
  – Block belongs to a stripe
  – How can we compute the new value of the parity block?

(Figure: a stripe with data blocks b[k], b[k+1], b[k+2], ... and parity block p[k])

Page 16:

First solution

• Read the values of the N-1 other blocks in the stripe
• Recompute
    p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
  – N-1 reads
  – 2 writes (new block and parity block)
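A minimal sketch of this first solution (often called a reconstruct write); the helpers are illustrative and actual disk I/O is omitted:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def new_parity_reconstruct(new_block: bytes, other_blocks: list[bytes]) -> bytes:
    """Recompute p[k] = b[k] xor b[k+1] xor ... xor b[k+N-1] from the
    new block and the N-1 other blocks read back from the stripe.
    Cost: N-1 reads plus 2 writes (data block and parity block)."""
    return reduce(xor_bytes, other_blocks, new_block)
```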

Page 17:

Second solution

• Assume we want to update block b[m]
• Read the old values of b[m] and of the parity block p[k]
• Compute
    new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
  – 2 reads (old values of the block and of the parity block)
  – 2 writes (new block and parity block)
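A matching sketch of this second solution (a read-modify-write), again with illustrative helpers:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def new_parity_rmw(new_block: bytes, old_block: bytes, old_parity: bytes) -> bytes:
    """new p[k] = new b[m] xor old b[m] xor old p[k].
    Cost: 2 reads (old data block, old parity block) plus 2 writes."""
    return xor_bytes(xor_bytes(new_block, old_block), old_parity)
```

For N > 3 this second solution touches fewer disks (4 accesses versus the N+1 of the first solution), which is why it is generally preferred for small writes.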

Page 18:

RAID 6

• Each stripe has two redundant blocks
  – P + Q redundancy (see the sketch below)
• Advantage
  – Much higher reliability
• Disadvantage
  – Costlier updates
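P + Q redundancy is typically built on a Reed-Solomon style code; below is a minimal sketch of one common instantiation, assuming the generator 2 over GF(2^8), a detail the slide does not specify:

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        high_bit = a & 0x80
        a = (a << 1) & 0xFF
        if high_bit:
            a ^= 0x1D
    return product

def p_and_q(data: list[int]) -> tuple[int, int]:
    """P is the plain XOR of the data bytes; Q weights byte i by 2**i in
    GF(2^8) (evaluated in Horner form), so any two erased bytes in the
    stripe can be solved for."""
    p, q = 0, 0
    for d in reversed(data):
        p ^= d
        q = gf_mul(q, 2) ^ d
    return p, q

print(p_and_q([0x11, 0x22, 0x33]))
```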

Page 19:

PERFORMANCE COMPARISON

• Focus on system throughput
• Measure it against system cost expressed in number of disk drives

Page 20:

             Small read    Small write        Large read    Large write
RAID 0       1             1                  1             1
RAID 1       1             1/2                1             1/2
RAID 3       1/G           1/G                (G-1)/G       (G-1)/G
RAID 5       1             max(1/G, 1/4)      1             (G-1)/G
RAID 6       1             max(1/G, 1/6)      1             (G-2)/G

Throughput per dollar (relative to RAID 0; G = parity group size)
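A small sketch that simply evaluates the table's entries for a chosen group size G; the function name is illustrative:

```python
def throughput_per_dollar(level: str, g: int) -> dict[str, float]:
    """Relative throughput per dollar from the table above; g is the
    parity group size G."""
    table = {
        "RAID 0": (1, 1, 1, 1),
        "RAID 1": (1, 1 / 2, 1, 1 / 2),
        "RAID 3": (1 / g, 1 / g, (g - 1) / g, (g - 1) / g),
        "RAID 5": (1, max(1 / g, 1 / 4), 1, (g - 1) / g),
        "RAID 6": (1, max(1 / g, 1 / 6), 1, (g - 2) / g),
    }
    small_r, small_w, large_r, large_w = table[level]
    return {"small read": small_r, "small write": small_w,
            "large read": large_r, "large write": large_w}

print(throughput_per_dollar("RAID 5", 10))   # small write = 1/4 at this G
```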

Page 21:

Discussion

• Performance per dollar of RAID 3 is always less than or equal to that of a RAID 5 system
• For small writes,
  – RAID 3, 5 and 6 are equally cost-effective at small group sizes
  – RAID 5 and 6 are better for large group sizes

Page 22:

RELIABILITY

• Theoretical reliability is very high
  – Especially for RAID 6
• In practice,
  – System crashes can cause parity inconsistencies
  – Uncorrectable bit errors can happen during repair times (about one in 10^14 bits)
  – Correlated disk failures happen!

Page 23:

Impact of parity inconsistencies

• Happen when the system crashes during an update
  – New data were written but the parity block was not updated
• Has little impact on RAID 3 (bad block)
• Significant impact on RAID 5
• Bigger impact on RAID 6
  – Same as simultaneous failures of both the P and Q blocks

Page 24:

Discussion

• System crashes and unrecoverable bit errors have the biggest effect on MTTDL
• P + Q redundant disks protect against correlated disk failures and unrecoverable bit errors
  – Still vulnerable to system crashes
  – Should use NVRAM for write buffers

Page 25:

IMPLEMENTATION CONSIDERATIONS

• Must prevent users from reading corrupted data from a failed disk
  – Mark blocks located on the failed disk invalid
  – Mark reconstructed blocks valid
• To avoid regenerating all parity blocks after a crash
  – Must keep track of parity consistency and store it in stable storage

Page 26:

Discussion

• Maintaining consistent/inconsistent state information for all parity blocks is a problem for software RAID systems
  – Rarely have NVRAM

• If updates are local, keep track in stable storage of a small number of parity blocks that could be inconsistent (see the sketch below)

• Otherwise use group commits
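A hypothetical sketch of that bookkeeping; `stable_store` stands in for NVRAM or a reserved disk area, and its `save` method is assumed for illustration, not a real API:

```python
# Set of parity block addresses that may be inconsistent after a crash.
possibly_inconsistent: set[int] = set()

def before_update(parity_addr: int, stable_store) -> None:
    """Record the parity block as possibly inconsistent BEFORE the write."""
    possibly_inconsistent.add(parity_addr)
    stable_store.save(possibly_inconsistent)   # must be durable first

def after_update(parity_addr: int, stable_store) -> None:
    """Clear the entry once data and parity have both been written."""
    possibly_inconsistent.discard(parity_addr)
    stable_store.save(possibly_inconsistent)

# After a crash, only the parity blocks listed in stable storage need to
# be regenerated, instead of every parity block in the array.
```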

Page 27:

SMALL WRITES REVISITED (I)

• Asynchronous writes can help if future updates overwrite previous ones
• Caching recently read blocks can help if the old data needed to compute the new parity are in the cache
• Caching recently written parity can also help
  – Parity is computed over many logically consecutive blocks

Page 28:

SMALL WRITES REVISITED (II)

• Floating Parity
  – Makes parity updates cheaper by putting the parity in a rotationally nearby unallocated block
  – Requires directories for the locations of nearby unallocated blocks
  – Should be implemented at the controller level

Page 29:

SMALL WRITES REVISITED (III)

• Parity Logging:
  – Defers the cost of the parity update by logging the XOR of the old data and the new data
  – Replay the log file later to update the parity
  – Reduces the update cost to two blocking writes (if we have the old data block in RAM)
  – It works because nearly all storage systems have idle times
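A very rough sketch of the parity-logging idea, with an in-memory stand-in for the log and all real I/O left out; every name is illustrative:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

parity_log: list[tuple[int, bytes]] = []   # (parity block address, XOR delta)

def small_write(parity_addr: int, old_data: bytes, new_data: bytes) -> None:
    """Defer the parity update: log the XOR of old and new data.
    Only two blocking writes remain (new data block + log append),
    assuming the old data block is already in RAM."""
    parity_log.append((parity_addr, xor_bytes(old_data, new_data)))

def replay_parity_log(read_parity, write_parity) -> None:
    """During idle time, fold each logged delta into its parity block."""
    for addr, delta in parity_log:
        write_parity(addr, xor_bytes(read_parity(addr), delta))
    parity_log.clear()
```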

Page 30:

Declustered Parity (I)

• Addresses the issue of high read cost when recovering from a failure
• Looking at the example:
  – A failure of disk 2 generates additional read requests to disks 0, 1 and 3 every time a read request is made for a block that was stored on disk 2

Page 31:

Declustered Parity (II)

Page 32:

Declustered Parity (III)

• With declustered parity:
  – The same disk belongs to different groups
• Looking at the example:
  – Disk 2 is in groups (0, 1, 2, 3), (4, 5, 2, 3) and so on
  – Additional read requests caused by a failure of disk 2 are now spread among all remaining disks

Page 33:

Declustered Parity (IV)

• Extra workload caused by the failure of a disk is now shared by all remaining disks

• Sole disadvantage:
  – A failure of any two disks will now result in data loss
  – In a standard set of RAID arrays, the two failed disks had to be in the same array

Page 34:

Exploiting On-Line Spare Disks

• Distributed Sparing:
  – No dedicated spare disk
  – Each disk has 1/(N+1) of its capacity reserved
• Parity Sparing:
  – Also spreads the spare space but uses it to store additional parity blocks
    • Can split groups into half groups
    • More …

Page 35:

Distributed Sparing

S0, S1 and S2 represent spare blocks

Page 36:

CASE STUDIES

• TickerTAIP
• AutoRAID
  – See presentation

Page 37:

TickerTAIP (I)

• Traditional RAID architectures have
  – A central RAID controller interfacing to the host and processing all I/O requests
  – Disk drives organized in strings
  – One disk controller per disk string (mostly SCSI)

Page 38:

TickerTAIP (II)

• Capabilities of the RAID controller are crucial to the performance of RAID
  – Can become memory-bound

– Presents a single point of failure

– Can become a bottleneck

• Having a spare controller is an expensive proposition

Page 39:

TickerTAIP (III)

• Uses a cooperating set of array controller nodes
• Major benefits are:
  – Fault-tolerance
  – Scalability
  – Smooth incremental growth
  – Flexibility: can mix and match components

Page 40:

TickerTAIP (IV)

(Figure: controller nodes attached to host interconnects)

Page 41:

TickerTAIP (V)

A TickerTAIP array consists of:
• Worker nodes connected with one or more local disks through a bus
• Originator nodes interfacing with host computer clients
• A high-performance small-area network:
  – Mesh-based switching network (Datamesh)
  – PCI backplanes for small networks

Page 42:

TickerTAIP (VI)

• Can combine or separate worker and originator nodes

• Parity calculations are done in a decentralized fashion:
  – Bottleneck is memory bandwidth, not CPU speed
  – Cheaper than having faster paths to a dedicated parity engine

Page 43:

CONCLUSION

• RAID's original purpose was to take advantage of Winchester drives that were smaller and cheaper than conventional disk drives
  – Replace a single drive by an array of smaller drives
• Nobody does that anymore!
• The main purpose of RAID today is to build fault-tolerant file systems that do not need backups