Storage Systems, CSE 598d, Spring 2007. Lecture 5: Redundant Arrays of Inexpensive Disks. Feb 8, 2007
TRANSCRIPT
Storage Systems, CSE 598d, Spring 2007
Lecture 5: Redundant Arrays of Inexpensive Disks
Feb 8, 2007
Why RAID?
• Higher performance
  – Higher I/O rates for short operations, by exploiting parallelism in disk arms
  – Higher bandwidth for larger operations, by exploiting parallelism in transfers
• However, an array of N disks has roughly 1/N the MTTF of a single disk; we need to address this linear decrease in MTTF by introducing redundancy
Data Striping
• Stripe unit: the unit at which data is distributed across the disks (see the mapping sketch below)
• Small stripe units: greater parallelism for each (small) request, but higher overheads
• Large stripe units: less parallelism for small requests, but preferable for large requests
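As a rough illustration of striping (not from the lecture), the sketch below maps a logical block number to a disk and an offset for a hypothetical array; num_disks and stripe_unit_blocks are made-up parameters.

```python
# Minimal sketch: mapping a logical block to (disk, block-within-disk) under
# plain striping. Parameters are hypothetical, for illustration only.

def locate(logical_block: int, num_disks: int, stripe_unit_blocks: int):
    """Return (disk index, block offset on that disk) for a striped array."""
    stripe_unit = logical_block // stripe_unit_blocks      # which striping unit
    offset_in_unit = logical_block % stripe_unit_blocks    # position inside the unit
    disk = stripe_unit % num_disks                         # units rotate round-robin
    unit_on_disk = stripe_unit // num_disks                # units this disk already holds
    return disk, unit_on_disk * stripe_unit_blocks + offset_in_unit

# Example: 4 disks, 2 blocks per striping unit.
# Blocks 0,1 land on disk 0; blocks 2,3 on disk 1; ...; blocks 8,9 wrap back to disk 0.
print([locate(b, 4, 2) for b in range(10)])
```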
Redundancy
• Mechanisms: parity, ECC, Reed-Solomon codes, mirroring
• Distribution schemes:
  – Concentrate redundancy on a few dedicated disks
  – Distribute the parity across the disks, similar to the data
RAID Taxonomy/Levels
• RAID 0 (Non-redundant)
• RAID 1 (Mirrored)
• RAID 2 (ECC)
• RAID 3 (Bit-interleaved parity)
• RAID 4 (Block-interleaved parity)
• RAID 5 (Block-interleaved distributed parity)
• RAID 6 (P+Q redundancy)
• RAID 0+1 (Mirrored arrays)
Non-Redundant (RAID 0)
• Simply stripe the data across the disks without worrying about redundancy
• Advantages:
  – Lowest cost
  – Best write performance; used in some supercomputing environments
• Disadvantages:
  – Cannot tolerate any disk failure: losing a single disk loses data
Mirrored (RAID 1)
• Each disk is mirrored
• Whenever you write to a disk, also write to its mirror
• A read can go to either of the mirrors (e.g., the one with the shorter queueing delay)
  – What if one (or both) copies of a block get corrupted?
• Uses twice the number of disks!
• Often used in database applications where availability and transaction rate are more important than storage efficiency (the cost of the storage system)
ECC (RAID 2)
• Use Hamming codes (see the example below)
  – Parity bits cover distinct, overlapping subsets of the components
• Helps identify and fix errors
• The number of check disks grows logarithmically with the number of data disks, so storage efficiency improves as the array grows
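The slide defers the Hamming-code example to class; the toy sketch below is a stand-in, a (7,4) Hamming code that corrects a single flipped bit. It operates on 4 data bits rather than whole disks, so it only illustrates the coding idea, not a real RAID 2 layout.

```python
# Sketch of a (7,4) Hamming code, the kind of ECC RAID 2 borrows from memory systems.
# Bit positions are 1..7; positions 1, 2, 4 hold parity, the rest hold data.

def hamming_encode(d):                       # d = [d1, d2, d3, d4]
    code = [0] * 8                           # index 0 unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 1 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 4 set
    return code[1:]

def hamming_correct(received):               # 7 bits, at most one flipped
    c = [0] + list(received)
    syndrome = sum(p for p in (1, 2, 4)
                   if sum(c[i] for i in range(1, 8) if i & p) % 2)
    if syndrome:                             # syndrome = position of the bad bit
        c[syndrome] ^= 1
    return c[1:]

word = hamming_encode([1, 0, 1, 1])
corrupted = word[:]
corrupted[4] ^= 1                            # flip one bit (position 5)
assert hamming_correct(corrupted) == word
```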
Bit-Interleaved Parity (RAID 3)
• Unlike memory, one can typically identify which disk has failed
• Simple parity can thus suffice to fix single-disk failures (reconstruction sketched below)
• Data is interleaved bit-wise over the disks, and a single parity disk is provisioned
• Each read request spans all data disks
• Each write request spans all data disks and the parity disk
• Consequently, only 1 outstanding request at a time
• Sometimes referred to as “synchronized disks”
• Suitable for apps that need high data rate but not high I/O rate
  – E.g., some scientific apps with high spatial data locality
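A minimal sketch of the reconstruction idea: since the failed disk is known, its contents are just the XOR of the surviving data disks and the parity disk. The 4-bit "disk contents" here are toy values.

```python
# With a single parity disk, the contents of any one failed disk can be rebuilt
# by XORing the corresponding bits of all surviving disks and the parity disk.
from functools import reduce

data_disks = [0b1011, 0b0110, 0b1101]            # toy bit-interleaved contents
parity = reduce(lambda a, b: a ^ b, data_disks)  # parity disk = XOR of all data

failed = 1                                        # pretend disk 1 has failed
survivors = [d for i, d in enumerate(data_disks) if i != failed]
rebuilt = reduce(lambda a, b: a ^ b, survivors, parity)
assert rebuilt == data_disks[failed]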
Block-Interleaved Parity (RAID 4)
• Data is interleaved in blocks of a certain size (the striping unit)
• Small (< 1 stripe) requests
  – Reads access only 1 disk
  – Writes need 4 disk accesses (read old data, read old parity, write new data, write new parity): the read-modify-write procedure, sketched below
• Large requests can enjoy good parallelism
• Parity disk can become a bottleneck (load imbalance)
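A sketch of the read-modify-write parity update for a small write, using toy one-word blocks; the point is that the new parity depends only on the old data, the new data, and the old parity, so the other data disks need not be touched.

```python
# Small-write read-modify-write in a parity-protected array (toy values).
old_data, old_parity = 0b1010, 0b0111
new_data = 0b1100

# 1. read old data, 2. read old parity, 3. write new data, 4. write new parity
new_parity = old_parity ^ old_data ^ new_data

# Sanity check: the rest of the stripe XORs to old_parity ^ old_data, so a
# full-stripe recomputation would give the same answer.
rest_of_stripe = old_parity ^ old_data
assert new_parity == rest_of_stripe ^ new_data
```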
Block Interleaved Distributed Parity (RAID 5)
• Distributes the parity across the data disks (no separate parity disk)
• Best small read, large read and large write performance of any redundant array
Distribution of data in RAID 5
• Ideally, you want to access each disk once before accessing any disk twice (when traversing blocks sequentially)
• Left-symmetric parity placement achieves this (see the layout sketch below)
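A sketch of one common left-symmetric layout (the exact rotation convention is an assumption, not taken from the lecture): parity rotates one disk to the left per stripe and data starts on the disk just after parity, so a sequential scan touches every disk once before touching any disk twice.

```python
# Left-symmetric RAID-5 layout sketch: parity rotates left by one disk per
# stripe; data within a stripe starts just after the parity disk.

def left_symmetric(block: int, num_disks: int):
    """Map a logical data block to (data disk, stripe) and report the parity disk."""
    stripe = block // (num_disks - 1)                 # num_disks - 1 data blocks per stripe
    idx = block % (num_disks - 1)                     # position within the stripe
    parity_disk = (num_disks - 1 - stripe) % num_disks
    data_disk = (parity_disk + 1 + idx) % num_disks   # data starts just after parity
    return data_disk, stripe, parity_disk

# Example with 5 disks: sequential blocks 0..9 touch disks 0,1,2,3,4,0,1,2,3,4.
print([left_symmetric(b, 5)[0] for b in range(10)])
```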
P+Q Redundancy (RAID 6)
• Plain parity can only handle a single, self-identifying failure
• More disks => higher probability of multiple simultaneous failures
• Idea: use Reed-Solomon (RS) codes for redundancy
• Given a set of “k” input symbols, RS adds “2t” redundant symbols to give a total of “n = k + 2t” symbols
  – 2t is the number of self-identifying errors (erasures) we want to protect against
• The resulting “n”-symbol sequence can:
  – Correct “t” errors (locations unknown)
  – Correct “2t” erasures (error locations are known)
• Disk failures are erasures, so to tolerate 2 disk failures we only need t = 1, i.e., 2 redundant disks (see the P+Q sketch below)
• Performance is similar to RAID-5, except small writes incur 6 disk accesses to update both P and Q
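A toy sketch of P+Q redundancy in the style many RAID-6 implementations use: P is plain XOR parity and Q is a Reed-Solomon-style syndrome over GF(2^8) with generator 2 and the 0x11d polynomial (these specific choices are assumptions, not from the lecture). It works on single bytes rather than blocks and shows that the two syndromes let you recover any two erased data "disks".

```python
# P+Q redundancy sketch over GF(2^8). Disk failures are erasures because the
# failed disks are known, so two syndromes suffice for two failures.

def gf_mul(a, b):                        # multiply in GF(2^8), reduction poly 0x11d
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return p

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)                # a^254 = a^-1 in GF(2^8)

def pq(data):                            # data: one byte per data disk
    P, Q = 0, 0
    for i, d in enumerate(data):
        P ^= d                           # P = XOR of all data
        Q ^= gf_mul(gf_pow(2, i), d)     # Q = sum of g^i * D_i, generator g = 2
    return P, Q

def recover_two(data, P, Q, x, y):       # x, y: indices of the two failed data disks
    dP, dQ = P, Q
    for i, d in enumerate(data):
        if i not in (x, y):              # fold in the surviving disks only
            dP ^= d
            dQ ^= gf_mul(gf_pow(2, i), d)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    Dx = gf_mul(dQ ^ gf_mul(gy, dP), gf_inv(gx ^ gy))
    Dy = Dx ^ dP
    return Dx, Dy

data = [0x12, 0xA7, 0x3C, 0x55]          # toy one-byte "blocks" on four data disks
P, Q = pq(data)
assert recover_two(data, P, Q, 1, 3) == (data[1], data[3])
```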
Mirrored Arrays (RAID 0+1)
• Combination of RAID 0 (striping) and RAID 1 (mirroring)
• Partition the disks into “m” groups, with each disk in a group mirroring the contents of the corresponding disk in the other groups; with m = 2, each disk has a single mirror
Throughput per dollar relative to RAID-0
Assumptions:
• Perfect load balancing
• “Small”: requests for 1 striping unit
• “Large”: requests for 1 full stripe (1 unit from each disk in an error-correcting group)
• G: number of disks in an error-correction group
RAID-5 provides good trade-offs, and its variants are used more commonly in practice
Reliability
• Say the mean-time-between-failures of a single disk is MTBF
• Mean-time-between-failures of a 2-disk array without any redundancy is
  – MTBF/2 (MTBF/N for an N-disk array)
• Say we have 2 disks, where one is the mirror of the other. What is the MTBF of this system?
  – It can be calculated from the probability of one disk failing, and then the second disk also failing during the time it takes to repair the first
– (MTBF/2) * (MTBF/MTTR)
• MTTF of a RAID-5 array is given by
  – MTBF^2 / (N * (G-1) * MTTR), where G is the group size
Reliability (contd.)
• With 100 disks, each with MTBF = 200K hours, MTTR = 1 hour, and a group size of 16, the MTBF of this RAID-5 system is about 3,000 years! (checked in the sketch below)
• However, higher levels of reliability are still needed!!!
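A quick sanity check of the slide's numbers (variable names are made up for illustration):

```python
disk_mtbf = 200_000            # hours, per disk
mttr = 1                       # hours to rebuild onto a replacement disk
n_disks, group = 100, 16

mirror_pair_mtbf = (disk_mtbf / 2) * (disk_mtbf / mttr)        # 2-disk mirror formula
raid5_mtbf = disk_mtbf ** 2 / (n_disks * (group - 1) * mttr)   # RAID-5 formula above

print(raid5_mtbf / (24 * 365))   # ~3,044 years, i.e. roughly the 3,000 years on the slide
```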
Why?
• System crashes and parity inconsistencies
  – Not just disks fail: the system may crash in the middle of updating parity, leading to inconsistencies later on
• Uncorrectable bit errors
  – There may be an error when obtaining the data from a single disk (usually due to incorrect writes) that may not be correctable
• Disk failures are usually correlated
  – Natural calamities, power surges/failures, common support hardware
  – Also, disk reliability characteristics (e.g., inverted bathtub) may themselves be correlated
Implementation Issues
• Avoiding stale data
  – When a disk failure is detected, mark its sectors “invalid”; after the failed disk’s contents are re-created on a new disk, mark its sectors “valid”
• Regenerating parity after a crash
  – Mark parity sectors “inconsistent” before servicing any write
  – When the system comes up, regenerate all “inconsistent” parity sectors
  – Periodically mark “inconsistent” parities as “consistent”; you can do better management based on need
• Improving small-write performance in RAID-5
  – Buffering and caching
    • Buffer writes in NVRAM to coalesce writes, avoid redundant writes, get better sequentiality, and allow better disk scheduling
  – Floating parity
    • Cluster parity into cylinders, each with some spares. When parity needs to be updated, the new parity block can be written on the rotationally-closest unallocated block following the old parity
    • Needs a level of indirection to get to the latest parity
  – Parity logging
    • Keep a log of the differences that need to be applied to the parity (in NVRAM and on a log disk); apply them to produce the new parity later on (sketched below)
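A minimal sketch of the parity-logging idea under simplified assumptions (single-byte "blocks", an in-memory list standing in for the NVRAM/log disk):

```python
# Parity logging sketch: small writes append an XOR "delta" to a log instead of
# updating parity in place; the deltas are applied to parity in a later batch.

parity = {7: 0b0110}                      # on-disk parity, keyed by stripe number
log = []                                  # would live in NVRAM / on a log disk

def small_write(stripe, old_data, new_data):
    """Record the parity change implied by overwriting old_data with new_data."""
    log.append((stripe, old_data ^ new_data))   # delta = old ^ new

def flush_log():
    """Apply the accumulated deltas to parity in one pass (better sequentiality)."""
    while log:
        stripe, delta = log.pop(0)
        parity[stripe] ^= delta

small_write(7, old_data=0b1010, new_data=0b1100)
flush_log()                               # parity[7] is now 0b0110 ^ 0b0110 = 0b0000
```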
• Declustered parity
  – We not only want to balance load in the normal case, but also when there are failures
  – [Figure comparing two parity placements: if Disk 2 fails, the declustered layout balances the reconstruction load across the surviving disks more evenly]
• Online spare disks
  – Allow reconstruction to start immediately (no waiting for a replacement disk), so that the window of vulnerability is small
  – Distributed sparing
    • Rather than keeping separate (idling) spare disks, spread the spare capacity around
  – Parity sparing
    • Use the spare capacity to store additional parity; one can view this as P+Q redundancy
    • Or small writes can update just one of the parities, based on head position, queue length, etc.
• Data striping
  – Trade-off between seek/positioning times, data sequentiality, and transfer parallelism
  – Optimal data-striping unit size is sqrt(P * X * (L-1) * Z / N), where
    • P is the avg. positioning time
    • X is the avg. transfer rate
    • L is the concurrency
    • Z is the request size
    • N is the number of disks
  – A common rule of thumb for RAID-5, when not much is known about the workload, is
    1/2 * avg. positioning time * transfer rate
    (worked example below)
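Plugging hypothetical numbers into the two sizing guidelines above (all parameter values are assumptions, chosen only to illustrate the formulas):

```python
from math import sqrt

P = 0.010        # avg. positioning time, seconds (assumed)
X = 50e6         # avg. transfer rate, bytes/second (assumed)
L = 4            # concurrency, i.e. outstanding requests (assumed)
Z = 256 * 1024   # request size, bytes (assumed)
N = 8            # number of disks (assumed)

optimal_unit = sqrt(P * X * (L - 1) * Z / N)   # formula from the slide
rule_of_thumb = 0.5 * P * X                    # 1/2 * positioning time * transfer rate

print(f"optimal ~ {optimal_unit/1024:.0f} KB, rule of thumb ~ {rule_of_thumb/1024:.0f} KB")
```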