

RAID Chunk Size


Notices

The information in this document is subject to change without notice.

While every effort has been made to ensure that all information in this document is accurate, Xyratex accepts no liability for any errors that may arise.

© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered Office: Langstone Road, Havant, Hampshire, PO9 1SA, England. Registered number 03134912.

No part of this document may be transmitted or copied in any form, or by any means, for any purpose, without the written permission of Xyratex.

Xyratex is a trademark of Xyratex Technology Limited. All other brand and product names are registered marks of their respective proprietors.

Acknowledgements

Jerry Hoetger, Director, Product Management, Xyratex International Ltd.

Nigel Hart, Director CPD Engineering, Xyratex International Ltd.

Stuart Campbell, Lead Firmware Architect and author of the Xyratex RAID firmware, Cork, Ireland, Xyratex International Ltd.

Xyratex CPD Engineering, Xyratex Lake Mary, FL and Xyratex Naperville, IL

Issue 1.0 | January, 2008


Contents

Executive Summary
Abstract
Chunk Size
Striping and Chunks (or Strips)
RAID 0: Striping
RAID 1: Mirroring
RAID 5
RAID 10: A Mix of RAID 0 & RAID 1
Stripe
Adjusting for Performance and Chunk Size
Writing and Reading Chunks and Stripes
Read-Modify-Write for Writing Stripes
Coalesce
RAID Cache Options
Performance Implications of Configuration Options
Large Data Transfer - Sequential Reading and Writing Access Pattern
Small Data Transfer - Random Reading and Writing Access Pattern
ESG Defined Workloads and Associated Access Patterns
Performance of Arrays and Varying Chunk Sizes
Test Results
SAS Drive Tests
SATA Drive Tests
Test Conclusions
A Case for Smaller Chunk (or Strip) Size with Wide Arrays
Glossary


Executive Summary

In general, application workloads benefit from large chunk sizes. The most significant reason for using large chunk sizes today when configuring RAID arrays for high throughput rates and performance is the recent advancement in the linear transfer rates of hard disk drives. The most common drive sizes, 3.5-inch SCSI and Serial ATA, have scaled from 36GB up to 1TB in today's SATA drives. This phenomenal increase in capacity came about by increasing the amount of data transferred while the platter rotates under the head/actuator assembly. RAID systems can now transfer much more data in a single rotation of a platter without the response penalty of a seek operation. Configuration of RAID arrays for high throughput rates and performance needs to consider the increased performance of large chunk (or what some call strip) sizes in RAID arrays.

Other contributing RAID features support larger chunk sizes: write-back cache capabilities, coalescing of stripe writes, segmentation of data transfer requests into stripes, and the deployment of read-modify-write operations all optimize RAID performance with large chunk (and large stripe) sizes.

Some applications use wide arrays with between 12 and 16 drives in an array, and therefore have extra-large stripes. These arrays may benefit from using the smaller chunk size options available in most controllers.

Benchmarks show that commands to RAID arrays with the smaller chunk size (and stripe size) exhibited slower command response times (latency) and lower throughput (IOPS) than the same commands with the larger chunk size and larger stripe size. This effect was noted in almost all workloads and on several different RAID offerings, regardless of drive type (SAS or SATA). Increased linear transfer rate (areal density), RAID firmware write-back cache, and the ability to coalesce write commands contribute to the benefits of larger chunks and larger stripes.

Abstract

Considering the recent advances in disk drive technology, configuring RAID systems for optimum performance presents new challenges. Over the last five years, drive manufacturers have been steadily increasing the capacity of disk drives by increasing the density and linear transfer rates.

Although seek times have not improved in relation to linear transfer rates, to optimize performance it is necessary to transfer as much data as possible while the disk drive heads are in one position. We accomplish this by increasing the chunk size of the disk array. The calculation is based on the number of drives, the RAID level of the array, and the intended use of the array.

For example, a database server would have a different chunk size requirement than one for a general-use server or mail server.

Throughout this paper the core concepts of chunk size and full stripe writes are examined, as well as benchmark test results.


Chunk Size

Striping and Chunks

All RAID levels used in today's storage rely on the concept of striping data across multiple disk drives in an array. The data in a stripe is usually made up of multiple records or segments of information, laid out in accordance with an application running on the host. The particular record layout is of no significance to the RAID processing. Host applications access the logical data records or segments with varying access patterns. Access patterns and configuration options defined as part of the RAID array will impact the performance of the array.

RAID 0: Striping

Figure 1: RAID 0 Array

In a RAID 0 system, data is split up into blocks that are written across all the drives in the array. By using multiple disks (at least two) at the same time, RAID 0 offers superior I/O performance. This performance can be enhanced further by using multiple controllers, ideally one controller per disk.

Advantages

RAID 0 offers great performance for both read and write operations.
There is no overhead caused by parity controls.
All storage capacity can be used; there is no disk overhead. The technology is easy to implement.

Disadvantages

RAID 0 is not fault-tolerant. If one disk fails, all data in the RAID 0 array is lost.
It should not be used on mission-critical systems.

Ideal Use

RAID 0 is ideal for non-critical storage of data that has to be read from or written to a workstation; for example, a workstation using Adobe Photoshop for image processing.
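To make the block-to-drive mapping concrete, here is a minimal Python sketch of how a RAID 0 layout might assign a host logical block to a member drive and an offset, assuming a simple rotating layout; the function name and chunk size are illustrative, not the layout of any particular controller.

```python
# Simplified sketch of RAID 0 address mapping (hypothetical layout,
# not the layout used by any specific controller).

def raid0_map(lba, chunk_size_blocks, num_drives):
    """Map a host logical block address to (drive index, block on drive)."""
    chunk_index = lba // chunk_size_blocks       # which chunk holds this block
    offset_in_chunk = lba % chunk_size_blocks    # position inside that chunk
    drive = chunk_index % num_drives             # chunks rotate across drives
    chunk_on_drive = chunk_index // num_drives   # chunks already placed on that drive
    return drive, chunk_on_drive * chunk_size_blocks + offset_in_chunk

# Example: 64K chunks (128 x 512-byte blocks) across 4 drives.
print(raid0_map(lba=300, chunk_size_blocks=128, num_drives=4))  # -> (2, 44)
```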

RAID 1: Mirroring

Data is stored twice by writing to both the data disk (or set of data disks) and a mirror disk (or set of disks). If a disk fails, the controller uses either the data disk or the mirror disk for data recovery, and continues operation. You need at least two disks for a RAID 1 array.


RAID level 1 is often combined with RAID level 0 to improve performance. This RAID level is sometimes referred to as RAID level 10.

Advantages

RAID 1 offers excellent read speed and a write speed that is comparable to that of a single disk.
In case a disk fails, data does not have to be rebuilt; it just has to be copied to the replacement disk.
RAID 1 is a very simple technology.

Disadvantages

The main disadvantage is that the effective storage capacity is only half of the total disk capacity, because all data gets written twice.
Software RAID 1 solutions do not always allow a hot swap of a failed disk (meaning it cannot be replaced while the server continues to operate). Ideally a hardware controller is used.

Ideal use

RAID 1 is ideal for mission-critical storage, for instance with accounting systems. It is also suitable for small servers in which only two disks will be used.

RAID 5

RAID 5 is the most common secure RAID level. It is similar to RAID 3 except that data is transferred to disks by independent read and write operations (not in parallel). The data chunks that are written are also larger. Instead of a dedicated parity disk, parity information is spread across all the drives. You need at least 3 disks for a RAID 5 array.

A RAID 5 array can withstand a single disk failure without losing data or access to data. Although RAID 5 can be achieved in software, a hardware controller is recommended. Often extra cache memory is used on these controllers to improve the write performance.

Figure 2: RAID 1 Array


Figure 3: RAID 5 Array

Figure 4: RAID 10 Array

Advantages

Read data transactions are very fast, while write data transactions are somewhat slower (due to the parity that has to be calculated).

Disadvantages

Disk failures have an effect on throughput, although this is still acceptable.
Like RAID 3, this is complex technology.

Ideal use

RAID 5 is a good all-round system that combines efficient storage with excellent security and decent performance. It is ideal for file and application servers.

RAID 10: A Mix of RAID 0 & RAID 1

RAID 10 combines the advantages (and disadvantages) of RAID 0 and RAID 1 in a single system. It provides security by mirroring all data on a secondary set of disks (disks 3 and 4 in the figure below) while using striping across each set of disks to speed up data transfers.


Stripe

The aggregation of records or segments is commonly called a stripe and represents a common object in the array that is maintained by the firmware in the controller, based on commands from the host application. For each stripe read or written (full or partial) by the RAID firmware, multiple I/O requests are made to each member drive in the array, depending on the RAID type and the number of drives in the array.

For RAID 0 and 10, no read-modify-write is performed, so there is a single I/O request to a member drive for each host I/O, which can be reduced by coalescing. For RAID 5/50/6, a host read command translates to a single I/O to a single drive, not multiple I/Os to the same drives. For an update (read-modify-write), there are multiple I/Os to the data and parity drive(s). For a full stripe write, there is a single I/O for each drive. These requests are made in parallel. This parallelism of full stripe writes in RAID striping reduces command response time (latency) and increases I/O commands per second (IOPS) for hosts writing to the array sequentially.
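As a rough illustration of the I/O counts just described, the sketch below tallies the member-drive operations generated by a host read, a partial (read-modify-write) write, and a full stripe write on a parity RAID array. It is a simplified model with hypothetical function names, not a description of any vendor's firmware.

```python
# Illustrative count of member-drive I/Os per host command on a parity
# RAID (e.g. RAID 5) array; hypothetical model, not vendor firmware.

def drive_ios(command, data_drives):
    if command == "read":
        return 1                      # one read to one member drive
    if command == "partial_write":    # read-modify-write of one chunk
        return 4                      # read old data, read old parity,
                                      # write new data, write new parity
    if command == "full_stripe_write":
        return data_drives + 1        # one write per data drive plus parity,
                                      # issued in parallel
    raise ValueError(command)

for cmd in ("read", "partial_write", "full_stripe_write"):
    print(cmd, drive_ios(cmd, data_drives=5))   # e.g. a 5+1 RAID 5 array
```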

Adjusting for Performance and Chunk Size

System administrators have options when defining the RAID systems used for storing and retrieving data by applications. Each application has a unique pattern for storing and retrieving the data based on how the application works and how users of the application make requests. This pattern of reading and writing data, along with the options the system administrator specifies, affects the performance of the application.

Chunk size is one of the configuration options; it is the portion of the stripe apportioned to an individual disk drive. Some RAID offerings let the system administrator define the size of the stripe across all of the member drives (and derive the size of the chunk written to individual drives based on the number of member drives). Other RAID offerings have the system administrator configure the chunk or strip size (and derive the larger stripe size based on the number of member drives in the array). In either case the chunk size has performance implications.
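The two configuration styles are simply two directions of the same relationship noted in the glossary (stripe size = chunk size x stripe width). A small sketch of that arithmetic, with hypothetical helper names and example values:

```python
# Two directions of the same relationship: stripe size = chunk size x members.
# Hypothetical helpers for illustration only.

def derive_stripe_size(chunk_kib, member_drives):
    return chunk_kib * member_drives

def derive_chunk_size(stripe_kib, member_drives):
    chunk, remainder = divmod(stripe_kib, member_drives)
    if remainder:
        raise ValueError("stripe size must be a multiple of the member count")
    return chunk

print(derive_stripe_size(64, 6))   # 64K chunks on a 6-drive array -> 384 (KiB stripe)
print(derive_chunk_size(96, 6))    # 96K stripe on a 6-drive array -> 16 (KiB chunk)
```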

Writing and Reading Chunks and Stripes

RAID controller firmware reads and writes stripes to member drives in an array. Writes can be either full stripe or partial stripe write operations (see the Read-Modify-Write section for more information). System administrators stipulate a stripe size (and thereby derive the chunk size), or stipulate the chunk size (and thereby derive the stripe size). Full stripe write operations stripe across all member drives.

Partial stripe operations read and write only the affected portions of the stripe (usually one data drive and one parity drive). Read operations read only the amount of data associated with the host command. The presence of read-ahead capabilities in a RAID controller may increase the amount of data read beyond that which is required to satisfy the host command, in preparation for subsequent read commands.

For RAID levels with parity, parity chunks are maintained on writes. Different RAID firmware handles reads and writes somewhat differently, depending on the vendor and the RAID firmware strategy on full stripe reads and writes.


Writes can be handled in slightly different ways, depending on the vendor RAID strategy. Common in early RAID deployments, vendor RAID strategies stipulated that for writes and updates, the complete stripe was first read from all member drives in an array, locked for update with the appropriate data and parity, verified, then completely striped across all member drives in the array with updated data and parity, at which time the lock for that stripe would be released. Xyratex RAID deploys a read-modify-write strategy where only the portion of the stripe written to and the portion of the parity chunk updated are transmitted to member disk drives as part of the write or update operation. This mitigates the performance implications of the alternative of full stripe writes.

Read-Modify-Write for Writing Stripes1

Xyratex has implemented read-modify-write for writing stripes. Only the affected portion of the stripe (and parity chunks) is written to the array when servicing writes. This mitigates the performance implication of full stripe writes for random workloads, and is one of the reasons system administrators can use larger chunks and stripes for all workloads.2

The write operation is responsible for generating and maintaining parity data. This function is typically referred to as a read-modify-write operation. Consider a stripe composed of four chunks of data and one chunk of parity. Suppose the host wants to change just a small amount of data that takes up the space of only one chunk within the stripe. The RAID firmware cannot simply write that small portion of data and consider the request complete. It also must update the parity data, which is calculated by performing XOR operations on every chunk within the stripe. So parity must be recalculated when one or more chunks change.

1 Dell Power Solutions, "Understanding RAID-5 and I/O Processors", Paul Luse, May 2003.

2 Full stripe writes for sequential write workloads are faster than read-modify-write commands. If the application can fill up a stripe completely, full stripe writes perform faster overall.

Figure 5: Step-by-Step: Read-Modify-Write Operation to a Four-Disk RAID Array


Figure 5 shows a typical read-modify-write operation in which the data that the host is writing to disk is contained within just one strip, in position D5. The read-modify-write operation consists of the following steps:

1. Read new data from host: The host operating system requests that the RAID subsystem write a piece of data to location D5 on disk 2.

2. Read old data from target disk for new data: Reading only the data in the location that is about to be written to eliminates the need to read all the other disks. The number of steps involved in the read-modify-write operation is the same regardless of the number of disks in the array.

3. Read old parity from target stripe for new data: A read operation retrieves the old parity. This function is independent of the number of physical disks in the array.

4. Calculate new parity by performing an XOR operation on the data from steps 1, 2, and 3: The XOR calculation of steps 2 and 3 provides the resultant parity of the stripe, minus the contribution of the data that is about to be overwritten. To determine the new parity for the D5 stripe that contains the new data, an XOR calculation is performed on the new data read from the host in step 1 with the result of the XOR procedure performed on steps 2 and 3.

5. Handle coherency: This process is not detailed in Figure 5 because its implementation varies greatly from vendor to vendor. Ensuring coherency involves monitoring the write operations that occur from the start of step 6 to the end of step 7. For the disk array to be considered coherent, or clean, the subsystem must ensure that the parity data block is always current for the data on the stripe. Because it is not possible to guarantee that the new target data and the new parity will be written to separate disks at exactly the same instant, the RAID subsystem must identify the stripe being processed as inconsistent, or dirty, in RAID vernacular.

6. Write new data to target location: The new data was received from the host in step 1; now the RAID mappings determine on which physical disk, and where on the disk, the data will be written. With Xyratex RAID only the updated portion of the chunk will be written, not the complete chunk.

7. Write new parity: The new parity was calculated in step 4; now the RAID subsystem writes it to disk.

8. Handle coherency: Once the RAID subsystem verifies that steps 6 and 7 have been completed successfully, and the data and parity are both on disk, the stripe is considered coherent.

This optimized method is fully scalable. The number of read, write, and XOR operations is independent of the number of disks in the array. Because the parity disk is involved in every write operation (steps 6 and 7), parity is rotated to a different disk with each stripe. If all the parity were stored on the same disk all the time, that disk could become a performance bottleneck.
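The parity arithmetic behind steps 2 through 4 reduces to a pair of XORs: back out the old data's contribution from the old parity, then fold in the new data. The byte-level sketch below illustrates this on toy values; it is illustrative only, not the controller's implementation.

```python
# Byte-level sketch of the read-modify-write parity update:
# new_parity = old_parity XOR old_data XOR new_data.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_parity(old_parity, old_data, new_data):
    # Back out the old data's contribution, then fold in the new data (step 4).
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

# Toy 4-byte "chunks": parity over three data chunks.
d1, d2, d3 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xAA\xBB\xCC\xDD"
parity = xor_bytes(xor_bytes(d1, d2), d3)

new_d2 = b"\xFF\xFF\xFF\xFF"
updated = rmw_parity(parity, d2, new_d2)

# The shortcut matches recomputing parity over the whole stripe.
assert updated == xor_bytes(xor_bytes(d1, new_d2), d3)
print(updated.hex())
```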


Coalesce

Coalescing is the process of aggregating several command requests together to form a single, larger sequential access on the member drives in an array. Coalescing represents a level of buffering of commands to the member drives of an array. In this way, multiple write commands and their associated data can be aggregated into a stripe. A single large sequential access reduces the number of disk commands and data transfers significantly, thus greatly increasing the amount of host I/O that can be processed in any specific time interval.

Drive-level coalescing by the RAID controller contributes significantly to the overall performance of an array and is very dependent on the access patterns of the application, as well as the algorithms of the RAID offering. Sequential applications with high queue depths leverage the improved performance of write coalescing more than applications with low queue depths and more random access. With a queue depth of one and in write-through mode, only one write command will be issued at a time. The host will wait to receive an acknowledgement that an initial write has completed before issuing subsequent write commands. The RAID firmware will not have an opportunity to coalesce, issuing an individual write operation for each write command received. Using a writeback cache setting allows the controller to acknowledge the write to the host application while the data associated with the command is coalesced into a common stripe.

As the application (and the host) issue multiple outstanding write commands, coalescing can increasingly improve the performance of the array, regardless of the write-through or writeback cache setting. Each data transfer write request potentially requires a corresponding write operation of the stripe, with individual reads and writes to each member drive of the array (read-modify-write), or two I/O operations per drive for each write. Some write operations that span stripes (not aligned with stripe boundaries) might require three. With each extra outstanding write operation per drive (queue depth), the RAID firmware has an increasing chance of grouping such commands together into a single sequential data transfer.

The size of the logical drive will also impact the effect of coalescing on random workloads. Larger logical drives that span the full range of the disk platters of member drives in the array will require longer command response times for coalesced random commands than smaller logical drives that are limited to just several cylinders of each member drive's disk platters. The smaller range of cylinders limits the effect of the randomness of the commands, yielding almost sequential-like application performance.
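To illustrate the idea of coalescing, the sketch below merges queued write requests whose block ranges happen to be contiguous into fewer, larger sequential accesses. The request format and the merging rule are simplifying assumptions, not the RAID firmware's actual algorithm.

```python
# Simplified illustration of write coalescing: merge queued writes whose
# block ranges are contiguous into fewer, larger sequential accesses.
# Hypothetical model only, not the RAID firmware's actual algorithm.

def coalesce(requests):
    """requests: list of (start_block, block_count) tuples already queued."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1][1] += count            # extend the previous run
        else:
            merged.append([start, count])     # start a new run
    return [tuple(r) for r in merged]

queue = [(128, 64), (0, 64), (64, 64), (512, 32)]
print(coalesce(queue))   # [(0, 192), (512, 32)] -> four commands become two
```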

RAID Cache Options

Many RAID offerings today offer the option of battery-backed cache and an associated writeback mode of operation where, for write commands, the host is acknowledged immediately after the data transfer is received into the cache of the RAID system. In the rare event of a power failure, a hardware failure in the controller, switch, or attached cables, or a failure in the RAID firmware, the record stays in cache until it can be completely externalized as part of a recovery process. This mode of operation reduces the response time (or latency) associated with the command. This mode of operation also reduces the throughput of the controller


due to the mirroring of cache operations between the two controllers in a dual-controller configuration. This mirroring operation is usually done through dedicated channels on the base plane of the controller.

Another option found in most of today's RAID systems allows the system administrator to turn off the RAID system cache for writes. Many large data transfer applications can get higher throughput (MB/s) in this mode, as records are externalized directly to the member drives in an array without relying on the RAID cache. This is called write-through mode.

Some RAID offerings allow the system administrator to partition the cache for writes, protecting the performance of other applications sharing the RAID system cache, and to turn off cache mirroring while in writeback mode. The RAID system needs to revert to write-through when cache capacity is unavailable to commands. Partitioning of cache defines the threshold for that transition, and also conserves cache for servicing commands to other arrays. Turning off cache mirroring provides the command response time benefits of cached writes and fast response times, while sacrificing fault tolerance.
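The practical difference between the two cache modes is where in the write path the host acknowledgement is returned. A minimal sketch under that assumption, with hypothetical class and method names (real controllers also mirror cache, track battery state, and destage in the background):

```python
# Minimal sketch of where the host acknowledgement occurs in each cache
# mode. Hypothetical names only; not a real controller implementation.

class WritePath:
    def __init__(self, mode):
        self.mode = mode            # "writeback" or "write-through"
        self.cache = []

    def host_write(self, data):
        if self.mode == "writeback":
            self.cache.append(data)        # land the data in controller cache
            return "ACK"                   # acknowledge immediately (low latency)
        self.write_to_member_drives(data)  # otherwise wait for the drives
        return "ACK"                       # acknowledge only after the media write

    def write_to_member_drives(self, data):
        pass                               # placeholder for the drive I/O

print(WritePath("writeback").host_write(b"record"))
print(WritePath("write-through").host_write(b"record"))
```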

Performance Implications of Configuration Options

Many factors impact the performance of an array and thus the performance of host-based applications that access one or more arrays. Significant factors include:

1. Data transfer size to and from the host. This is usually set by the application.

2. Access pattern to the array by the host or hosts (concurrency of multiple hosts, sequential, random, random-sequential, queue depth).

3. RAID level.

4. RAID cache for writes (writeback mode or write-through mode, amount of cache allocated for writes).

5. Chunk size and therefore stripe size.

6. Number of member drives in the array.

7. Drive performance.

The system administrator usually has options for configuring RAID level, RAID cache operations, chunk size, number of member drives, and drive type (affecting drive performance). The application usually dictates data transfer size and access pattern, although these can be tuned with some operating systems.

Furthermore, vendor RAID strategies provide differing options to choose from for configuring arrays. Some RAID vendors let system administrators set the chunk size, others the stripe size. With either approach, the vendor offering will provide a list of available options ranging from as small as 4K up to 256K chunk sizes. Each vendor strategy includes a selection of RAID levels and has slightly differing strategies for implementing each. The number of drives can vary, but usually ranges from 1 for simple RAID 0 up to 16. RAID vendors today support a variety of different drive types, including Serial Attached SCSI (SAS), Serial ATA (SATA), Fibre Channel (FC), and SCSI, each with its own performance characteristics and capacities. Usually an array is limited to a single drive type and capacity.


Given the available options to the system administrator, several general recommendations can be made relative to the required performance of host commands to the array, based on the access patterns and data transfer size dictated by the host application.

Large Data Transfer - Sequential Reading and Writing Access Pattern

This workload will have data transfer sizes of at least 32K, and more typically 1MB or larger. A special case can be made for workloads that read and write larger than 1MB to the array. Some vendor RAID strategies may be limited to 1MB or perform less than optimally when supporting greater host transfer sizes. The access pattern is predominantly sequential: read, write, or alternating between read and write for very long intervals of time.

Especially important for sequential write workloads is the ability to fill up stripes with data to maximize performance (the effect of full stripe writes and stripe coalescing). Maintaining enough application data to fill up stripes with sequential write access patterns can happen in several ways:

1. Large data transfer sizes from the host. Some applications, such as video streaming, allow for a configurable data transfer size. Ideally the application's data transfer size should be aligned with the stripe size.

2. The application can issue multiple outstanding commands to build queues of commands that can be sequentially executed by the RAID controller to fill up stripes across the array.

3. The writeback cache setting can fill up stripes from applications that maintain low or non-existent queue depths of outstanding commands.

When the application fills up the stripe during write sequences, large chunks result in maximum throughput capabilities. Larger data transfer sizes and large stripes facilitate segmenting of small records into larger data transfers, and of multiple data transfers into larger stripes and even larger coalesced data transfers to member disk drives in the array. If the application cannot fill up the stripe, the system administrator should consider smaller stripes, either via fewer drives in the array or by choosing a smaller chunk size.
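As a quick way to judge whether an application can fill stripes, the host data needed per full stripe write is the chunk size multiplied by the number of data (non-parity) drives. A short sketch with assumed example values:

```python
# How much host data fills one stripe? chunk size x data drives
# (parity excluded). The values below are assumptions for illustration.

def data_per_full_stripe_kib(chunk_kib, member_drives, parity_drives):
    return chunk_kib * (member_drives - parity_drives)

# A 5+1 RAID 5 array with 256K chunks: 1280K of host data per full stripe.
fill = data_per_full_stripe_kib(chunk_kib=256, member_drives=6, parity_drives=1)
print(fill)          # 1280

# A 1MB application transfer would not quite fill it; either raise the
# transfer size, queue multiple writes, or pick a smaller chunk/stripe.
print(1024 >= fill)  # False
```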

Many large data transfer applications perform better with write cache disabled, that is, with the controller in write-through mode. Write-through mode optimizes RAID performance for high throughput rates (MB/s) by bypassing cache coherency operations and writing aggregated streams of data to the member drives in an array. Writeback cache should only be considered for small data transfer applications or for filling up stripe writes.

Command response time (latency) and a high number of commands per second (IOPS) are typically not important to these workloads. The data transfers are large or very large, and will by nature take longer to service as they are striped across many drives in an array. Typical workloads in this category include disk-to-disk backup, virtual tape library, archiving applications, and some network file system deployments.


Small Data Transfer - Random Reading and Writing Access Pattern

This workload will typically have data transfer sizes of less than 32K, and more typically 4K or 8K in size. The access pattern is predominantly random reads or writes, or intermixed with updates. These workloads typically do not maintain much of a queue of outstanding commands to the array, usually less than 6. The depth of the queue is very application dependent and can vary depending on the number of threads the application can generate, the number of processors, the number of clients supported, or other programming constructs used.

In this case, writeback cache operations result in minimum response time and a maximum number of commands per second (IOPS) from the array, regardless of the chunk size and corresponding stripe size. In writeback mode, larger cache sizes will increase the likelihood of filling up stripes to leverage the effect of full stripe writes.

Throughput is typically not important to these workloads. Typical workloads in this category include databases, transactional systems, web servers, and many file servers.

ESG Defined Workloads and Associated Access Patterns

Figure 6 gives a detailed description of the access patterns, by workload, used by ESG in benchmarking array performance. The queue depth is fixed at 16, which may not be typical for most database or transaction workloads.

Figure 6: ESG Defined Workloads


Performance of Arrays and Varying Chunk Sizes3

3 Nigel Hart, Xyratex CDP Engineering Manager, "ESG Chunk Effects.xls", dated 3 July 2007.

Xyratex CDP Engineering used ESG-specified workloads, specially modified versions of the latest Xyratex RAID offerings, and a recent DotHill RAID offering to test the implications of varying chunk sizes on fixed and repeatable workloads (Model RS-1220-F4-5412E and Model RS-1220-F4-5402E with 1GB of cache per controller from Xyratex, and the Model 2730 from DotHill).

Xyratex used Intel's IOMeter to simulate the workload patterns specified by ESG as typical for various applications. See the table below for a detailed description of each workload. The configuration consisted of one system using a Microsoft Windows 2003 host with a Fibre Channel interface connected (at 4Gb/s) to the Xyratex RAID system, configured with a single RAID 5 (5+1) array and a single LUN externalized to the host. Seagate X15-5 SAS drives and Seagate Barracuda SATA-II drives were used. The RAID systems were configured with dual controllers and cache mirroring in effect, even though only one controller was active.

The tests were run with IOMeter against this common single array/single LUN configuration, against each of the ESG workload categories, and with the following different combinations of platform and chunk size.

RAID Offering              Chunk Size   Drive Type
Xyratex RS-1220-F4-5402E   16K          SAS
Xyratex RS-1220-F4-5402E   64K          SAS
Xyratex RS-1220-F4-5412E   16K          SAS
Xyratex RS-1220-F4-5412E   64K          SAS
DotHill 2730               16K          SAS
DotHill 2730               64K          SAS
Xyratex RS-1220-F4-5402E   16K          SATA
Xyratex RS-1220-F4-5402E   64K          SATA
Xyratex RS-1220-F4-5412E   16K          SATA
Xyratex RS-1220-F4-5412E   64K          SATA
DotHill 2730               16K          SATA
DotHill 2730               64K          SATA

Both 16K and 64K chunk sizes were common to DotHill and Xyratex, and were therefore used for comparison purposes. Xyratex RAID systems offer chunk size options of 64K, 128K, and 256K, and derive the stripe size from the number of drives in the array multiplied by the selected chunk size.

DotHill offers 16K, 32K, and 64K chunk sizes. Since these are lower than the chunk sizes available on Xyratex RAID systems, Xyratex created a special set of RAID firmware to enable identical configuration options for one-to-one comparisons.


For a 16K chunk size the derived stripe size is 96K. For a 32K chunk size the derived stripe size is 192K. If chunk size made a difference, it would be reflected in lower command response times (latency) and higher command throughput (IOPS) in these tests.

The RS-1220-F4-5412E is the current shipping version of the RAID system from Xyratex. The previous model, the RS-1220-F4-5402E, used an older version of the LSI 1064 SAS controller and Intel RAID ASIC (IOP331). That model was included only for comparison purposes.

Test Results

Both the Xyratex and DotHill RAID systems with the smaller chunk size (16K) and stripe size (96K) exhibited slower command response times (latency) and lower throughput (IOPS) than the same RAID offering with the larger chunk size (64K) and its correspondingly larger stripe size. This effect was noted in almost all workloads and RAID offerings, except one DotHill test case, regardless of drive type (SAS or SATA).

SAS Drive Tests

Workload (SAS, 64K -> 16K chunk)   5402    5412    2730
4K OLTP                            -11%    -13%    -5%
File Server                        -19%    -19%    -14%
Web Server                         -20%    -20%    -20%
4 KB Random Read                   -10%    -10%    -10%
4 KB Random Write                  -15%    -9%     2%
512 KB Seq Write                   -37%    -37%    -51%
512 KB Seq Read                    -21%    -21%    -11%
Media Server                       -26%    -26%    -35%
Exchange Data                      -10%    -10%    -5%
Exchange Logs                      -41%    -42%    -33%
File Server                        -19%    -20%    -15%
Web Server                         -20%    -20%    -21%
Windows Media Player               -26%    -26%    -36%
Exchange Data (EBD)                -10%    -10%    -4%
Exchange Logs (LOG)                -41%    -42%    -31%

Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each workload. Negative numbers indicate that the 16K chunk size delivered fewer IOPS than the 64K chunk size test case, in relative terms.
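One reading of how those percentages are derived is sketched below, with placeholder IOPS values (not figures from the benchmark): the change in total IOPS when moving from the 64K-chunk case to the 16K-chunk case, relative to the 64K baseline.

```python
# Sketch of the table's relative-percent calculation, assuming the 64K-chunk
# result is the baseline. Sample numbers are placeholders, not benchmark data.

def relative_percent(iops_64k, iops_16k):
    return round(100.0 * (iops_16k - iops_64k) / iops_64k)

print(relative_percent(iops_64k=1000, iops_16k=810))   # -19, shown as "-19%"
```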

Sequential workloads lost significantly more performance with smaller chunk sizes (and smaller stripe sizes) than random workloads did. Write workloads lost significantly more performance with smaller chunk sizes (and smaller stripe sizes) than read workloads did. This effect is attributed to the lack of 1st-level segmentation within a stripe for random and


write workloads, whereas sequential workloads benefit from both 1st-level segmentation within the stripe (with larger stripes) and 2nd-level coalescing of multiple stripes at a queue depth of 16.

SATA Drive Tests

Workload (SATA, 64K -> 16K chunk)  5402    5412    2730
4K OLTP                            -10%    -10%    -7%
File Server                        -21%    -22%    -18%
Web Server                         -22%    -21%    -23%
4 KB Random Read                   -12%    -10%    -11%
4 KB Random Write                  -13%    -11%    -3%
512 KB Seq Write                   -82%    -58%    13%
512 KB Seq Read                    -10%    -10%    -1%
Media Server                       -15%    -16%    -32%
Exchange Data                      -10%    -8%     -19%
Exchange Logs                      -49%    -76%    -43%
File Server                        -20%    -22%    -19%
Web Server                         -22%    -21%    -23%
Windows Media Player               -15%    -16%    -33%
Exchange Data (EBD)                -10%    -8%     -8%
Exchange Logs (LOG)                -49%    -75%    -7%

Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each workload. Negative numbers indicate that the 16K chunk size delivered fewer IOPS than the 64K chunk size test case, in relative terms.

Similar differences between random and sequential workloads as seen with SAS drives are seen with SATA drives. In general, the relative differences for sequential workloads are exacerbated by the slower performance of SATA drives.

Test Conclusions

A smaller chunk size does not improve latency or increase IOPS for any of the ESG workloads, at the specified queue depth (16) and for the specified RAID offerings, other than in one test case with the DotHill RAID controller. The ESG-defined workloads are mostly random, small data transfer size workloads, where system administrators would typically assign small chunk sizes or stripe sizes to improve performance and increase IOPS. Even for the sequential workloads (Media Server, Exchange Data, Exchange Logs, and Media Player), the smaller chunk size increased command response time (latency) and reduced IOPS.

The effect of the very fast linear transfer rates of today's disk drives translates to larger chunks for high RAID performance. The use of a small chunk size, 16K, did not improve performance when compared to a larger 64K chunk.


System administrators need to configure external RAID systems such that full stripe reads and writes leverage these fast linear transfer rates by handling as much data transfer as possible without incurring the penalty of slow seek times. Today's hard disk drives have very large track capacities. The Barracuda ES family of Serial ATA drives from Seagate used in this study has an average track size of 32K.4 The largest-capacity Barracuda ES drive has 8 tracks per cylinder, translating to 256K per data transfer between seeks. The smallest Barracuda ES drive has 3 tracks per cylinder, translating to 96K per data transfer between seeks.

Coalescing of data transfers when writing data into stripes also helps produce the higher performance and IOPS of the larger chunk size. Coalescing is the process of aggregating several command requests together to form a single, larger sequential access on the member drives in an array. A single large sequential access reduces the number of disk commands and data transfers significantly, thus greatly increasing the amount of host I/O that can be processed in any specific time interval. RAID in this mode of operation becomes constrained by the speed of a chunk update alone. The high queue depth (16) used in the test cases contributed to the effect of coalescing on the command response times (latency) and throughput (IOPS).

Expect application workloads with lower queue depths (from 4 to 16) to also benefit from larger chunk sizes and stripe sizes, but to a lesser extent than reported with a queue depth of 16. Applications with queue depths approaching 1 or 2 may benefit from smaller chunk sizes (and smaller stripes), as the total data transfer operation is reduced in size with fewer commands being aggregated into larger single sequential accesses.

4 Track size will vary depending on the location of the cylinder. Cylinders towards the outside of the drive have larger tracks because of the size of the cylinder, versus cylinders towards the inside of the drive, which are smaller. Seagate Barracuda ES Serial ATA Product Manual, Revision A, 6/26/06.

Figure 7: Number of I/O Operations/Sec by Application Workload for 16K and 64K Chunk Sizes


A Case for Smaller Chunk Size with Wide Arrays

For performance reasons, system administrators may choose to configure a large number of drives in a single array. This configuration maximizes the parallelism of RAID by breaking up a large single data transfer into multiple, parallel disk commands with smaller data transfers to each drive. Some vendor RAID offerings allow the system administrator to choose from a range of chunk sizes. In some cases a chunk size less than the maximum available may make sense, to reduce the overall stripe size of the array.

For example, a 12-drive RAID 5 array with 256K chunks results in an array with a 3MB stripe. Applications with typically small data transfer requests, especially when made in a random fashion, may perform better with a smaller stripe. The small data transfer size or randomness of the write requests may limit the effect of writeback cache, coalescing, read-modify-write, and segmentation at the stripe level. Performance may improve by specifying a 64K chunk (providing a 768K stripe) or a 128K chunk (providing a 1.5MB stripe).

Smaller chunks and smaller stripes should definitely be considered for workloads against wide arrays that require disabling writeback cache (write-through mode).
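A worked check of the stripe sizes quoted above for a hypothetical 12-drive wide array, using the glossary relationship stripe size = chunk size x stripe width:

```python
# Stripe sizes for a hypothetical 12-drive wide array at the chunk sizes
# discussed above (stripe size = chunk size x members, per the glossary).

members = 12
for chunk_kib in (256, 128, 64):
    stripe_kib = chunk_kib * members
    print(f"{chunk_kib}K chunk -> {stripe_kib}K stripe ({stripe_kib / 1024:g} MB)")

# 256K -> 3072K (3 MB), 128K -> 1536K (1.5 MB), 64K -> 768K (0.75 MB)
```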

Glossary

RAID: (Redundant Array of Independent Disks) A disk subsystem that is used to increase performance, provide fault tolerance, or both. RAID uses two or more ordinary hard disks and a RAID disk controller. In the past, RAID has also been implemented via software only.

In the late 1980s, the term stood for "redundant array of inexpensive disks", in comparison to the large, expensive disks of the time. As hard disks became cheaper, the RAID Advisory Board changed "inexpensive" to "independent".

Small and Large: RAID subsystems come in all sizes, from desktop units to floor-standing models (see NAS and SAN). Stand-alone units may include large amounts of cache as well as redundant power supplies. Initially used with servers, desktop PCs are increasingly being retrofitted by adding a RAID controller and extra IDE or SCSI disks. Newer motherboards often have RAID controllers.

Disk Striping: RAID improves performance by disk striping, which interleaves bytes or groups of bytes across multiple drives, so more than one disk is reading and writing simultaneously.

Mirroring and Parity: Fault tolerance is achieved by mirroring or parity. Mirroring is 100% duplication of the data on two drives (RAID 1). Parity is used to calculate the data on two drives and store the result on a third (RAID 3 or 5). After a failed drive is replaced, the RAID controller automatically rebuilds the lost data from the other two. RAID systems may have a spare drive (hot spare) ready and waiting to be the replacement for a drive that fails.

The parity calculation is performed in the following manner: a bit from drive 1 is XORed with a bit from drive 2, and the result bit is stored on drive 3.
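A small sketch of the rebuild property described above: the XOR of the surviving drives' data reproduces the failed drive's data. The byte values are arbitrary toy examples.

```python
# Toy illustration of parity-based rebuild: XOR of the surviving chunks
# reproduces the lost chunk. Byte values are arbitrary examples.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

drive1 = b"\x0F\x33\x5A"
drive2 = b"\xF0\x0F\xA5"
parity = xor_bytes(drive1, drive2)        # stored on the third drive

rebuilt_drive2 = xor_bytes(drive1, parity)
assert rebuilt_drive2 == drive2           # lost data recovered from the other two
print(rebuilt_drive2.hex())
```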


External/GUI terms, with other similar names in parentheses:

Drive Array (also: Storage Pool, Storage Pool Group, Device Group, Sub-device Group): A group of drives that all have the same array function placed on them. The array functions of the controller are RAID 0 and RAID 5 with an appropriate strip or chunk size. The array is constructed over the minimum useable capacity of all drives (the smallest-capacity drive extrapolated to all remaining drives in the array). The array can be broken into logical drives that are constructed from one or many parts of the array. All logical drives allocated from the same drive array will have the same RAID characteristics.

Full Stripe Write: The amount of data written by the host, before parity is added, that will result in the writing of a complete stripe.

Logical Drive, Host Logical Drive (also: Volume): The logical construction of a direct-access block device as defined by the ANSI SCSI specification. Each logical drive has a number of host logical blocks (512 bytes), resulting in the capacity of the drive. Blocks can be written to or read from. The controller enforces precedence in the order of commands as received at the controller from connected hosts. Logical drives will be in either a normal state or a failed state depending on the physical state of the drive array. Task management is carried out at the logical drive level.

LUN: The protocol endpoint of SCSI communication. There can be multiple LUNs presented for a given logical drive, thus giving multiple paths to that logical drive. For any host port, a given logical drive can be reported only once as a LUN. The logical drive the LUN is assigned to may be a host logical drive, including a source or virtual logical drive (associated with a snapshot).

Overwrite Data Area, ODA (also: Delta area): Protected storage area on an array that has been dedicated to storing data from the snapped logical drive that is overwritten (before-images). In itself, the ODA is a logical drive, though not externalized.

Snap Back: The process of restoring a logical drive from a selected snapshot. This process takes place internally in the RAID controller firmware and needs no support from any backup utility or other third-party software.

Snapshot (also: Snap, Snapshot Volume): A method for producing a point-in-time virtual copy of a logical drive. The state of all data on the virtual logical drive remains frozen in time and can be retrieved at a later point in time. In the process of initiating the snapshot, no data is actually copied from the snapped logical drive. As updates are made to the snapped logical drive, the blocks being updated are copied to the ODA.

Snapshot Identifier (also: Snapshot Number): A unique number identifying the snap relationship to the source logical drive. The snapshot number is not necessarily unique to the controller, only unique to the host logical drive. Snapshots are sequentially numbered for an individual host logical drive, starting at 0 through N.

Number of Snapshots: The number of total snapshots on the same host logical drive.

Source Logical Drive (also: Snapshotted Logical Drive): A logical drive that has a snapshot initiated on it.

Strip, Chunk: The minimum unit of storage managed on a given drive as a contiguous storage unit (usually a number of drive blocks) by the controller. The RAID controller writes data to the drives in a drive array in this size.

Stripe Size: The amount of host data plus any parity at which an array function, such as a host read or write, rotates across all members of the array; the optimum unit of access of the array. Chunk size * Stripe Width = Stripe Size.

Stripe Width: The number of members in the array.

Virtual Logical Drive, Snapshot Logical Drive: A special logical drive created from the source logical drive and the data contained in the overwrite data area, implemented by the snapshot function.


© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered in England & Wales. Company no: 03134912. Registered Office: Langstone Road, Havant, Hampshire PO9 1SA, England. The information given in this brochure is for marketing purposes and is not intended to be a specification nor to provide the basis for a warranty. The products and their details are subject to change. For a detailed specification or if you need to meet a specific requirement please contact Xyratex.

UK HQ
Langstone Road
Havant
Hampshire PO9 1SA
United Kingdom

T +44 (0)23 9249 6000
F +44 (0)23 9249 2284

www.xyratex.com

ISO 14001: 2004 Cert No. EMS91560