RAID Chunk Size
Notices
The information in this document is subject to change without notice.
While every effort has been made to ensure that all information in this document is
accurate, Xyratex accepts no liability for any errors that may arise.
© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered Office:
Langstone Road, Havant, Hampshire, PO9 1SA, England. Registered number 03134912.
No part of this document may be transmitted or copied in any form, or by any means,
for any purpose, without the written permission of Xyratex.
Xyratex is a trademark of Xyratex Technology Limited. All other brand and product
names are registered marks of their respective proprietors.
Acknowledgements
Jerry Hoetger, Director, Product Management, Xyratex International Ltd.
Nigel Hart, Director CPD Engineering, Xyratex International Ltd.
Stuart Campbell, Lead Firmware Architect and author of the Xyratex RAID firmware,
Cork, Ireland, Xyratex International Ltd.
Xyratex CPD Engineering, Xyratex Lake Mary, FL and Xyratex Naperville, IL
Issue 1.0 | January, 2008
Contents
Executive Summary
Abstract
Chunk Size
Striping and Chunks (or Strips)
RAID 0: Striping
RAID 1: Mirroring
RAID 5
RAID 10: A Mix of RAID 0 & RAID 1
Stripe
Adjusting for Performance and Chunk Size
Writing and Reading Chunks and Stripes
Read-Modify-Write for Writing Stripes
Coalesce
RAID Cache Options
Performance Implications of Configuration Options
Large Data Transfer - Sequential Reading and Writing Access Pattern
Small Data Transfer Random Reading and Writing Access Pattern
ESG Defined Workloads and Associated Access Patterns
Performance of Arrays and Varying Chunk Sizes
Test Results
SAS Drive Tests
SATA Drive Tests
Test Conclusions
A Case for Smaller Chunk (or Strip) Size with Wide Arrays
Glossary
Executive Summary
In general, application workloads benefit from large chunk sizes. The most significant reason for using large chunk
sizes today in configuring RAID arrays for high throughput rates and performance is the recent advancement
in linear transfer rates of hard disk drives. The most common drive sizes, 3.5-inch SCSI and Serial ATA, have
scaled from 36GB up to 1TB in today's SATA drives. This phenomenal increase in capacity came about by
increasing the amount of data transferred while the platter rotates under the head/actuator assembly. RAID
systems can now transfer much more data in a single rotation of a platter without the response penalty of a
seek operation. Configuration of RAID arrays for high throughput rates and performance needs to consider the
increased performance of large chunk (or, as some call it, strip) sizes in RAID arrays.
Other RAID features also support larger chunk sizes: write-back cache capabilities, coalescing of stripe
writes, segmentation of data transfer requests into stripes, and the deployment of read-modify-write operations
all optimize RAID performance with large chunk (and large stripe) sizes.
Some applications use wide arrays, with between 12 and 16 drives in an array, and therefore have extra large
stripes. These arrays may benefit from using the smaller chunk size options available in most controllers.
Benchmarks show that commands to RAID arrays with the smaller chunk size (and stripe size) exhibited slower
command response times (latency) and lower throughput (IOPS) than the same commands with larger chunk
size and larger stripe size. This effect was noted in almost all workloads and on several different RAID offerings,
regardless of drive type (SAS or SATA). Increased linear transfer rate (areal density), RAID firmware write-back
cache and the ability to coalesce write commands contribute to the benefits of larger chunks and larger
stripes.
Abstract
Considering the recent advances in disk drive technology, configuring RAID systems for optimum performance
presents new challenges. Over the last five years, drive manufacturers have been steadily increasing the capacity
of disk drives by increasing the density and linear transfer rates.
Although seek times have not improved in relation to linear transfer rates, to optimize performance it is necessary
to transfer as much data as possible while the disk drive heads are in one position. We accomplish this by
increasing the chunk size of the disk array. The calculations are based on the number of drives, the RAID level of
the array, and the type of use intended for the array.
For example, database servers would have a different chunk size requirement than one for a general-use server
or mail server.
Throughout this paper the core concepts of chunk size and full stripe writes are examined, as well as benchmark
test results.
Chunk Size
Striping and Chunks
All RAID levels used in today's storage rely on the concept of striping data across multiple disk drives in an array.
The data in a stripe is usually made up of multiple records or segments of information, in accordance with a layout
defined by an application running on the host. The particular record layout is of no significance to the RAID processing.
Host applications access the logical data records or segments with varying access patterns. Access patterns and
configuration options defined as part of the RAID array will impact the performance of the array.
RAID 0: Striping
Figure 1: RAID 0 Array
In a RAID 0 system, data is split up into blocks that will get written across all the drives in the array. By using
multiple disks (at least 2) at the same time, RAID 0 offers superior I/O performance. This performance can be
enhanced further by using multiple controllers, ideally one controller per disk.
Advantages
• RAID 0 offers great performance for both read and write operations.
• There is no overhead caused by parity controls.
• All storage capacity can be used; there is no disk overhead.
• The technology is easy to implement.
Disadvantages
• RAID 0 is not fault-tolerant. If one disk fails, all data in the RAID 0 array is lost.
• It should not be used on mission-critical systems.
Ideal Use
RAID 0 is ideal for non-critical storage of data that has to be read from or written to a workstation; for example:
a workstation using Adobe Photoshop for image processing.
RAID 1: Mirroring
Data is stored twice by writing to both the data disk (or sets of data disks) and a mirror disk (or sets of disks). If
a disk fails, the controller uses either the data disk or the mirror disk for data recovery, and continues operation.
You need at least two disks for a RAID 1 array.
RAID level 1 is often combined with RAID level 0 to improve performance. This RAID level is sometimes referred
to as RAID level 10.
Advantages
• RAID 1 offers excellent read speed and a write speed that is comparable to that of a single disk.
• In case a disk fails, data does not have to be rebuilt; it just has to be copied to the replacement disk.
• RAID 1 is a very simple technology.
Disadvantages
• The main disadvantage is that the effective storage capacity is only half of the total disk capacity because all
data gets written twice.
• Software RAID 1 solutions do not always allow a hot swap of a failed disk (meaning it cannot be replaced
while the server continues to operate). Ideally a hardware controller is used.
Ideal use
RAID 1 is ideal for mission-critical storage, for instance with accounting systems. It is also suitable for small
servers in which only two disks will be used.
RAID 5
RAID 5 is the most common secure RAID level. It is similar to RAID 3 except that data is transerred to disks by
independent read and write operations (not in parallel). The data chunks that are written are also larger. Instead
o a dedicated parity disk, parity inormation is spread across all the drives. You need at least 3 disks or a RAID
5 array.
A RAID 5 array can withstand a single disk ailure without losing data or access to data. Although RAID 5 can
be achieved in sotware, a hardware controller is recommended. Oten extra cache memory is used on these
controllers to improve the write perormance.
Figure 2: RAID 1 Array
Figure 3: RAID 5 Array
Figure 4: RAID 10 Array
Advantages
Read data transactions are very fast, while write data transactions are somewhat slower (due to the parity that
has to be calculated).
Disadvantages
• Disk failures have an effect on throughput, although this is still acceptable.
• Like RAID 3, this is complex technology.
Ideal use
RAID 5 is a good all-round system that combines efficient storage with excellent security and decent performance.
It is ideal for file and application servers.
RAID 10: A Mix of RAID 0 & RAID 1
RAID 10 combines the advantages (and disadvantages) of RAID 0 and RAID 1 in a single system. It provides
security by mirroring all data on a secondary set of disks (disks 3 and 4 in Figure 4) while using striping
across each set of disks to speed up data transfers.
Stripe
The aggregation of records or segments is commonly called a stripe and represents a common object in the
array that is maintained by the firmware in the controller, based on commands from the host application. For
each stripe read or written (full or partial) by the RAID firmware, multiple I/O requests are made to each
member drive in the array, depending on the RAID type and number of drives in the array.
For RAID 0 and 10, no read-modify-write is performed, so there is a single I/O request to a member drive for each
host I/O, which can be reduced by coalescing. For RAID 5/50/6, a host read command translates to a single I/O
to a single drive, not multiple I/Os to the same drives. For an update (read-modify-write), there are multiple I/Os
to the data and parity drive(s). For a full stripe write, there is a single I/O for each drive. These requests are made
in parallel. This parallelism of full stripe writes in RAID striping reduces command response time
(latency) and increases I/O commands per second (IOPS) for hosts writing to the array sequentially.
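
As a rough illustration of the per-drive I/O counts described above, the following Python sketch estimates how many back-end drive I/Os a single host write can generate. The function name and the exact counts (for instance, four I/Os for a parity-RAID read-modify-write touching one data drive and one parity drive) are simplifying assumptions for illustration, not a description of the Xyratex firmware.

# Sketch only: estimate back-end drive I/Os generated by one host write,
# following the behavior outlined above. Counts are simplified assumptions.
def drive_ios_for_host_write(raid_level, members, full_stripe=False):
    if raid_level == 0:                    # striping only
        return 1                           # one write to one member drive
    if raid_level in (1, 10):              # mirrored data
        return 2                           # data copy plus mirror copy
    if raid_level in (5, 50):              # parity RAID
        if full_stripe:
            return members                 # one write per member, parity included
        return 4                           # read data, read parity, write data, write parity
    raise ValueError("RAID level not covered by this sketch")

print(drive_ios_for_host_write(5, members=6))                    # 4 (read-modify-write)
print(drive_ios_for_host_write(5, members=6, full_stripe=True))  # 6 (full stripe write)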
Adjusting for Performance and Chunk Size
System administrators have options when defining the RAID systems used for storing and retrieving data
by applications. Each application has a unique pattern for storing and retrieving the data based on how the
application works, and how users of the application make requests. This pattern of reading and writing data,
along with the system administrator specified options, affects the performance of the application.
Chunk size is one of the configuration options; it is the portion of the stripe apportioned to an individual disk
drive. Some RAID offerings let the system administrator define the size of the stripe across all of the member drives
(and derive the size of the chunk written to individual drives based on the number of member drives). Other RAID
offerings have the system administrator configure the chunk or strip size (and derive the larger stripe size based
on the number of member drives in the array). In either case the chunk size has performance implications.
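
A minimal sketch of the two configuration styles just described, assuming the relation given in the glossary (Chunk size * Stripe Width = Stripe Size, where stripe width is the number of member drives). The function names and the example array are illustrative only.

# Minimal sketch, assuming stripe size = chunk size x number of member drives
# (the relation stated in the glossary); names are illustrative only.
def derive_stripe_size_kib(chunk_kib, member_drives):
    """Vendor lets you pick the chunk size; the stripe size follows."""
    return chunk_kib * member_drives

def derive_chunk_size_kib(stripe_kib, member_drives):
    """Vendor lets you pick the stripe size; the per-drive chunk follows."""
    return stripe_kib // member_drives

# Example: a 6-drive array configured with a 64 KiB chunk.
print(derive_stripe_size_kib(64, 6))   # 384 KiB stripe
print(derive_chunk_size_kib(384, 6))   # 64 KiB chunk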
Writing and Reading Chunks and Stripes
RAID controller firmware reads and writes stripes to member drives in an array. Writes can be either full stripe
or partial stripe write operations (see the Read-Modify-Write section for more information). System administrators
stipulate a stripe size (and therefore derive chunk size), or stipulate the chunk size (and therefore derive stripe
size). Full stripe write operations stripe across all member drives.
Partial stripe operations read and write only to the affected portions of the stripe (usually one data drive and one
parity drive). Read operations read only the amount of data associated with the host command. The presence of
read-ahead capabilities in a RAID controller may increase the amount of data read beyond that which is required
to satisfy the host command, in preparation for subsequent read commands.
For RAID levels with parity, parity chunks are maintained on writes. Different RAID firmware handles reads
and writes somewhat differently, depending on the vendor and the RAID firmware strategy for full stripe reads
and writes.
Writes can be handled in slightly different ways, depending on the vendor RAID strategy. Common in early RAID
deployments, vendor RAID strategies stipulated that with writes and updates, the complete stripe was first read
from all member drives in an array, locked for update with appropriate data and parity, verified, then completely
striped across all member drives in the array with updated data and parity, at which time the lock for that stripe
would be released. Xyratex RAID deploys a read-modify-write strategy where only the portion of the stripe
written to and the portion of the parity chunk updated are transmitted to member disk drives as part of the write
or update operation. This mitigates the performance implications of alternative full stripe writes.
Read-Modify-Write for Writing Stripes¹
Xyratex has implemented read-modify-write for writing stripes. Only the affected portion of the stripe (and parity
chunks) is written to the array for servicing writes. This mitigates the performance implication of full stripe writes
for random workloads, and is one of the reasons system administrators can use larger chunks and stripes for
all workloads.²
The write operation is responsible for generating and maintaining parity data. This function is typically referred
to as a read-modify-write operation. Consider a stripe composed of four chunks of data and one chunk of parity.
Suppose the host wants to change just a small amount of data that takes up the space of only one chunk within
the stripe. The RAID firmware cannot simply write that small portion of data and consider the request complete.
It also must update the parity data, which is calculated by performing XOR operations on every chunk within the
stripe. So parity must be recalculated when one or more chunks within a stripe change.
¹ Dell Power Solutions, "Understanding RAID-5 and I/O Processors," Paul Luse, May 2003
² Full stripe writes for sequential write workloads are faster than read-modify-write commands. If the application can fill up
a stripe completely, overall full stripe writes perform faster.
Figure 5: Step-by-Step: Read-Modify-Write Operation to a Four-Disk RAID Array
Figure 5 shows a typical read-modify-write operation in which the data that the host is writing to disk is contained
within just one strip, in position D5. The read-modify-write operation consists of the following steps:
1. Read new data from host: The host operating system requests that the RAID subsystem write a piece of
data to location D5 on disk 2.
2. Read old data from target disk for new data: Reading only the data in the location that is about to be
written to eliminates the need to read all the other disks. The number of steps involved in the read-modify-
write operation is the same regardless of the number of disks in the array.
3. Read old parity from target stripe for new data: A read operation retrieves the old parity. This function
is independent of the number of physical disks in the array.
4. Calculate new parity by performing an XOR operation on the data from steps 1, 2, and 3: The
XOR calculation of steps 2 and 3 provides the resultant parity of the stripe, minus the contribution of the data
that is about to be overwritten. To determine new parity for the D5 stripe that contains the new data, an
XOR calculation is performed on the new data read from the host in step 1 with the result of the XOR
procedure performed in steps 2 and 3.
5. Handle coherency: This process is not detailed in Figure 5 because its implementation varies greatly from
vendor to vendor. Ensuring coherency involves monitoring the write operations that occur from the start of
step 6 to the end of step 7. For the disk array to be considered coherent, or clean, the subsystem must ensure
that the parity data block is always current for the data on the stripe. Because it is not possible to guarantee
that the new target data and the new parity will be written to separate disks at exactly the same instant, the
RAID subsystem must identify the stripe being processed as inconsistent, or dirty, in RAID vernacular.
6. Write new data to target location: The new data was received from the host in step 1; now the RAID
mappings determine on which physical disk, and where on the disk, the data will be written. With Xyratex
RAID only the portion of the chunk updated will be written, not the complete chunk.
7. Write new parity: The new parity was calculated in step 4; now the RAID subsystem writes it to disk.
8. Handle coherency: Once the RAID subsystem verifies that steps 6 and 7 have been completed successfully,
and the data and parity are both on disk, the stripe is considered coherent.
This optimized method is fully scalable. The number of read, write, and XOR operations is independent of the
number of disks in the array. Because the parity disk is involved in every write operation (steps 6 and 7), parity
is rotated to a different disk with each stripe. If all the parity were stored on the same disk all the time, that disk
could become a performance bottleneck.
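
The parity arithmetic behind steps 2 through 4 can be shown with a short Python sketch. This is purely illustrative code for the XOR relationships described above, not the controller implementation; the helper names are invented.

# Illustrative sketch of the read-modify-write parity math described above
# (byte-wise XOR in pure Python); not the controller implementation.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    # New parity = old parity XOR old data XOR new data (steps 2-4),
    # so the cost does not depend on how many disks are in the array.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    return new_data, new_parity

# Example: a stripe of four data chunks plus parity (parity = XOR of all chunks).
chunks = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6]), bytes([7, 8])]
parity = bytes(len(chunks[0]))
for c in chunks:
    parity = xor_blocks(parity, c)

# Update one chunk with read-modify-write and check parity stays consistent.
chunks[2], parity = read_modify_write(chunks[2], parity, bytes([9, 9]))
check = bytes(len(chunks[0]))
for c in chunks:
    check = xor_blocks(check, c)
assert check == parity  # parity is still the XOR of all data chunks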
Coalesce
Coalescing is the process of aggregating several command requests together to form a single and larger
sequential access on the member drives in an array. Coalescing represents a level of buffering commands to the
member drives of an array. In this way, multiple write commands and associated data can be aggregated into a
stripe. A single large sequential access reduces the number of disk commands and data transfers significantly,
thus greatly increasing the amount of host I/O that can be processed in any specific time interval.
Drive level coalescing by the RAID controller significantly contributes to the overall performance of an array
and is very dependent on the access patterns of the application, as well as the algorithms of the RAID offering.
Sequential applications with high queue depths leverage the improved performance of write coalescing more
than applications that have low queue depths and are more random. With a queue depth of one and in write-through
mode, only one write command will be issued at a time. The host will wait to receive an acknowledgement
that an initial write completes before issuing subsequent write commands. The RAID firmware will not have
an opportunity to coalesce, issuing an individual write operation for each write command received. The use of
writeback cache settings allows the controller to acknowledge the write to the host application while
the data associated with the command is coalesced into a common stripe.
As the application (and the host) issues multiple outstanding write commands, coalescing can increasingly
improve performance of the array, regardless of the write-through or writeback cache setting. Each data transfer
write request potentially requires a corresponding write operation of the stripe, with individual reads and writes
to each member drive of the array (read-modify-write), or two I/O operations per drive required for each write.
Some write operations that span stripes (not aligned with stripe boundaries) might require three. With each
extra outstanding write operation per drive (queue depth), the RAID firmware has an increasing chance of
grouping such commands together in a single sequential data transfer.
The size of the logical drive will also impact the effect of coalescing random workloads. Larger logical drives that
span the full range of the disk platters of member drives in the array will require longer command response times
for coalesced random commands than smaller logical drives that are limited to just several cylinders of each set
of drives' disk platters. The smaller range of cylinders will limit the effect of the randomness of the commands,
reporting almost sequential-like application performance.
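
As a simplified sketch of the coalescing idea above, the following Python fragment groups queued write requests by the stripe they fall in, so each group could be issued as one larger sequential access. The stripe geometry and function name are assumptions for illustration, not the Xyratex algorithm.

# Simplified sketch of write coalescing: queued writes that land in the
# same stripe are grouped so they could be issued as one larger access.
from collections import defaultdict

def coalesce_by_stripe(writes, stripe_size):
    """Group (offset, length) write requests by the stripe index they fall in."""
    groups = defaultdict(list)
    for offset, length in writes:
        groups[offset // stripe_size].append((offset, length))
    return dict(groups)

# Example: with a deep queue, small writes that share a 384 KiB stripe
# collapse into far fewer back-end operations.
queued = [(0, 4096), (8192, 4096), (393216, 4096), (12288, 4096)]
print(coalesce_by_stripe(queued, stripe_size=384 * 1024))
# {0: [(0, 4096), (8192, 4096), (12288, 4096)], 1: [(393216, 4096)]}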
RAID Cache Options
Many RAID offerings today offer the option of battery-backed cache and an associated writeback mode
of operation where, for write commands, the host is acknowledged immediately after the data transfer is
received into the cache of the RAID system. In the rare event of a power failure, a hardware failure in the
controller, switch, or attached cables, or a failure in the RAID firmware, the record stays in cache until it can be
completely externalized as part of a recovery process. This mode of operation reduces the response time (or
latency) associated with the command. This mode of operation also reduces the throughput of the controller
due to the mirroring of cache operations between the two controllers in a dual controller configuration. This
mirroring operation is usually done through dedicated channels on the base plane of the controller.
Another option found in most of today's RAID systems allows the system administrator to turn off the RAID
system cache for writes. Many large data transfer applications can get higher throughput (MB/s) in this mode
as records are externalized directly to the member drives in an array, without relying on the RAID cache. This is
called write-through mode.
Some RAID offerings allow the system administrator to partition the cache for writes, protecting the performance
of other applications sharing the RAID system cache, and to turn off cache mirroring while in writeback
mode. The RAID system needs to revert to write-through when cache capacity is unavailable to commands.
Partitioning of cache defines the threshold for transitioning, as well as conserving cache for servicing commands
to other arrays. Turning off cache mirroring provides the command response time benefits of cached writes and
fast response times, while sacrificing fault tolerance.
Performance Implications of Configuration Options
Many factors impact the performance of an array and thus the performance of host-based applications that
access one or more arrays. Significant factors include:
1. Data transfer size to and from the host. This is usually set by the application.
2. Access pattern to the array by the host or hosts (concurrency of multiple hosts, sequential, random, random-
sequential, queue depth).
3. RAID level.
4. RAID cache for writes (writeback mode or write-through mode, amount of cache allocated for writes).
5. Chunk size and therefore stripe size.
6. Number of member drives in the array.
7. Drive performance.
The system administrator usually has options for configuring RAID level, RAID cache operations, chunk size,
number of member drives and drive type (affecting drive performance). The application usually dictates data
transfer size and access pattern, although these can be tuned with some operating systems.
Furthermore, vendor RAID strategies will provide differing options to choose from for configuring arrays. Some
RAID vendors let system administrators set chunk size, others stripe size. With either approach, the vendor
offering will provide a list of available options ranging from as small as 4K up to 256K chunk sizes. Each vendor
strategy includes a selection of RAID levels and has slightly differing strategies for implementing each. The
number of drives can vary, but usually ranges from 1 for simple RAID 0 to 16. RAID vendors support a variety
of different drive types today, including Serial Attached SCSI (SAS), Serial ATA (SATA), Fibre Channel (FC),
and SCSI, each with its own performance characteristics and capacities. Usually an array is limited to
a single drive type and capacity.
Given the available options to the system administrator, several general recommendations can be made relative
to the required performance of host commands to the array based on the access patterns and data transfer size
dictated by the host application.
Large Data Transfer - Sequential Reading and Writing Access Pattern
This workload will have data transfer sizes of at least 32K, and more typically 1MB or larger. A special case can
be made for workloads that read and write larger than 1MB to the array. Some vendor RAID strategies may be
limited to 1MB or perform less than optimally when supporting greater host transfer sizes. The access pattern is
predominantly sequential: read, write, or alternating between read and write for very long intervals of time.
Especially important for sequential write workloads is the ability to fill up stripes with data to maximize
performance (the effect of full stripe writes and stripe coalescing). Maintaining enough application data to fill up
stripes with sequential write access patterns can happen in several ways:
1. Large data transfer sizes from the host. Some applications, such as video streaming, allow for configurable data
transfer size. Ideally the application's data transfer size should be aligned with the stripe size.
2. The application can issue multiple outstanding commands to build queues of commands that can be
sequentially executed to fill up stripes by the RAID controller across the array.
3. The writeback cache setting can fill up stripes from applications that maintain low or non-existent queue depths
of outstanding commands.
When the application fills up the stripe during write sequences, large chunks result in maximum throughput
capabilities. Larger data transfer sizes and large stripes facilitate segmenting of small records into larger data
transfers, and multiple data transfers into larger stripes and even larger coalesced data transfers to member disk
drives in the array. If the application cannot fill up the stripe, the system administrator should consider smaller
stripes, either via fewer drives in the array or by choosing a smaller chunk size.
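
A small hedged sketch of the alignment point above: given a host transfer size and a stripe size, it counts how many full stripe writes result and whether a partial tail (which would need a read-modify-write) remains. The sizes are illustrative.

# Sketch only: count full stripe writes and any partial tail for a given
# host transfer size and stripe size (sizes here are illustrative).
def stripe_fill(transfer_bytes, stripe_bytes):
    full_stripes, remainder = divmod(transfer_bytes, stripe_bytes)
    return full_stripes, remainder

# A 1 MiB transfer against a 384 KiB stripe: two full stripes plus a
# 256 KiB partial write that still needs a read-modify-write.
print(stripe_fill(1024 * 1024, 384 * 1024))   # (2, 262144)

# The same transfer against a 256 KiB stripe fills stripes exactly.
print(stripe_fill(1024 * 1024, 256 * 1024))   # (4, 0)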
Many large data transfer applications perform better with write cache disabled, or when the controller is in
write-through mode. Write-through mode optimizes the RAID performance for high throughput rates (MB/s)
by bypassing cache coherency operations and writing aggregated streams of data to member drives in an array.
Writeback cache should only be considered for small data transfer applications or filling up stripe writes.
Command response time (latency) or a high number of commands per second (IOPS) are typically not important
to these workloads. The data transfers are large or very large, and will by nature take longer to service as they
are striped across many drives in an array. Typical workloads in this category include disk-to-disk backup, virtual
tape library, archiving applications, and some network file system deployments.
Small Data Transfer Random Reading and Writing Access Pattern
This workload will typically have data transfer sizes of less than 32K, and more typically 4K or 8K in size. The
access pattern is predominantly random reads or writes, or intermixed with updates. These workloads typically
do not maintain much of a queue of outstanding commands to the array, usually less than 6. The depth of the
queue is very application dependent and can vary depending on the number of threads the application can
generate, the number of processors, the number of clients supported, or other programming constructs used.
In this case, writeback cache operations result in minimum response time and the maximum number of commands
per second (IOPS) from the array, regardless of the chunk size and corresponding stripe size. In writeback mode,
larger cache sizes will increase the likelihood of filling up stripes to leverage the effect of full stripe writes.
Throughput is typically not important to these workloads. Typical workloads in this category include database,
transactional systems, web servers, and many file servers.
ESG Defined Workloads and Associated Access Patterns
Here is a very detailed description of access patterns by workload used in benchmarking array performance by
ESG. The queue depth is fixed at 16, which may not be typical for most database or transaction workloads.
Figure 6: ESG Defined Workloads
³ Nigel Hart, Xyratex CDP Engineering Manager, ESG Chunk Effects.xls, dated 3 July 2007.
Performance of Arrays and Varying Chunk Sizes³
Xyratex CDP Engineering used ESG-specified workloads, specially modified versions of the latest Xyratex RAID
offerings, and a recent DotHill RAID offering to test the implications of varying chunk sizes on fixed and repeatable
workloads (Model RS-1220-F4-5412E and Model RS-1220-F4-5402E with 1Gb of cache per controller from
Xyratex, and the Model 2730 from DotHill).
Xyratex used Intel's IOMeter to simulate the workload patterns specified by ESG as typical for various
applications. See Figure 6 for a detailed description of each workload. The configuration consisted of one
system using a Microsoft Windows 2003 host with a fibre channel interface connected (at 4Gb/s) to the Xyratex
RAID system configured with a single RAID 5 (5+1) array, and a single LUN externalized to the host. Seagate
X15-5 SAS drives and Seagate Barracuda SATA-II drives were used. The RAID systems were configured with dual
controllers and cache mirroring in effect, even though only one controller was active.
The tests were run with IOMeter against this common single array/single LUN configuration, against each of the
ESG workload categories, and with the following different combinations of platform and chunk size.
RAID Offering              Chunk Size   Drive Type
Xyratex RS-1220-F4-5402E   16K          SAS
Xyratex RS-1220-F4-5402E   64K          SAS
Xyratex RS-1220-F4-5412E   16K          SAS
Xyratex RS-1220-F4-5412E   64K          SAS
DotHill 2730               16K          SAS
DotHill 2730               64K          SAS
Xyratex RS-1220-F4-5402E   16K          SATA
Xyratex RS-1220-F4-5402E   64K          SATA
Xyratex RS-1220-F4-5412E   16K          SATA
Xyratex RS-1220-F4-5412E   64K          SATA
DotHill 2730               16K          SATA
DotHill 2730               64K          SATA
Both 16K and 64K chunk sizes were common to both DotHill and Xyratex, and therefore were used for comparison
purposes. Xyratex RAID systems offer chunk size options of 64K, 128K, and 256K, and derive the stripe size
from the number of drives in the array multiplied by the selected chunk size.
DotHill offers 16K, 32K, and 64K chunk sizes. Since these are lower than the chunk sizes available on Xyratex
RAID systems, Xyratex created a special set of RAID firmware to enable identical configuration options for one-
to-one comparisons.
For a 16K chunk size the derived stripe size is 96K. For a 32K chunk size the derived stripe size is 192K. If chunk
size made a difference, it would be reflected in lower command response times (latency) and higher command
throughput (IOPS) in these tests.
The RS-1220-F4-5412E is the current shipping version of the RAID system from Xyratex. The previous model, the
RS-1220-F4-5402E, used an older version of the LSI 1064 SAS controller and Intel RAID ASIC (IOP331). That model
was included only for comparison purposes.
Test Results
Both Xyratex and DotHill RAID with the smaller chunk size (16K) and stripe size (96K) exhibited slower command
response times (latency) and lower throughput (IOPS) than the same RAID offering with larger chunk size (64K)
and stripe size (192K). This effect was noted in almost all workloads and RAID offerings, except one DotHill test
case, regardless of drive type (SAS or SATA).
SAS Drive Tests
SAS, 64K chunk to 16K      5402    5412    2730
4K OLTP                    -11%    -13%    -5%
File Server                -19%    -19%    -14%
Web Server                 -20%    -20%    -20%
4 KB Random Read           -10%    -10%    -10%
4 KB Random Write          -15%    -9%     2%
512 KB Seq Write           -37%    -37%    -51%
512 KB Seq Read            -21%    -21%    -11%
Media Server               -26%    -26%    -35%
Exchange Data              -10%    -10%    -5%
Exchange Logs              -41%    -42%    -33%
File Server                -19%    -20%    -15%
Web Server                 -20%    -20%    -21%
Windows Media Player       -26%    -26%    -36%
Exchange Data (EBD)        -10%    -10%    -4%
Exchange Logs (LOG)        -41%    -42%    -31%
Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each
workload. Negative numbers indicate that the 16K chunk size performed with fewer IOPS than the 64K chunk size
test case, in relative terms.
Sequential workloads performed significantly worse with smaller chunk sizes (and smaller stripe sizes) than random
workloads did. Write workloads performed significantly worse with smaller chunk sizes (and smaller stripe sizes) than
read workloads did. This effect is attributed to the lack of 1st-level segmentation within a stripe with random and
write workloads, whereas sequential workloads benefit from both 1st-level segmentation within the stripe (with
larger stripes) and 2nd-level coalescing of multiple stripes at a queue depth of 16.
SATA Drive Tests
SATA, 64K chunk to 16K     5402    5412    2730
4K OLTP                    -10%    -10%    -7%
File Server                -21%    -22%    -18%
Web Server                 -22%    -21%    -23%
4 KB Random Read           -12%    -10%    -11%
4 KB Random Write          -13%    -11%    -3%
512 KB Seq Write           -82%    -58%    13%
512 KB Seq Read            -10%    -10%    -1%
Media Server               -15%    -16%    -32%
Exchange Data              -10%    -8%     -19%
Exchange Logs              -49%    -76%    -43%
File Server                -20%    -22%    -19%
Web Server                 -22%    -21%    -23%
Windows Media Player       -15%    -16%    -33%
Exchange Data (EBD)        -10%    -8%     -8%
Exchange Logs (LOG)        -49%    -75%    -7%
Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each
workload. Negative numbers indicate that the 16K chunk size performed with fewer IOPS than the 64K chunk size
test case, in relative terms.
Similar differences between random and sequential workloads as were seen with SAS drives are seen with SATA drives.
In general, the relative differences with sequential workloads are exacerbated by slower SATA drive
performance.
Test Conclusions
Smaller chunk size does not improve latency or increase IOPS for any of the ESG workloads, at the specified queue
depth (16) and for the specified RAID offerings, other than one test case with the DotHill RAID controller. The
ESG defined workloads are mostly random and small data transfer size workloads, where system administrators
would typically assign small chunk sizes or stripe sizes to improve performance and increase IOPS. For the
sequential workloads (Media Server, Exchange Data, Exchange Logs, and Media Player), smaller chunk size did
impact command response time (latency) and reduce IOPS.
The effect of the very fast linear transfer rates of today's disk drives translates to larger chunks for high RAID
performance. The use of a small chunk size, 16K, did not improve performance when compared to a larger 64K
chunk.
System administrators need to configure external RAID systems such that full stripe reads and writes leverage
these fast linear transfer rates by handling as much data transfer as possible without incurring the penalty of slow
seek times. Today's hard disk drives have very large track capacities. The Barracuda ES family of Serial ATA drives
used in this study from Seagate has an average track size of 32K.⁴ The largest capacity Barracuda ES drive has 8
tracks per cylinder, translating to 256K per data transfer between seeks. The smallest Barracuda ES drive has 3
tracks per cylinder, translating to 96K per data transfer between seeks.
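
The arithmetic behind those figures can be written as a one-line sketch; the function name is illustrative and the numbers simply restate the Barracuda ES values quoted above.

# Back-of-the-envelope sketch: data transferable from one cylinder position
# before a seek, using the average track size cited above.
def transfer_per_seek_kib(avg_track_kib, tracks_per_cylinder):
    return avg_track_kib * tracks_per_cylinder

print(transfer_per_seek_kib(32, 8))   # 256K (largest Barracuda ES)
print(transfer_per_seek_kib(32, 3))   # 96K (smallest Barracuda ES)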
Coalescing of data transfers with the writing of data into stripes also helps produce the higher performance and
IOPS of the larger chunk size. Coalescing is the process of aggregating several command requests together to
form a single and larger sequential access on the member drives in an array. A single large sequential access
reduces the number of disk commands and data transfers significantly, thus greatly increasing the amount of
host I/O that can be processed in any specific time interval. RAID, in this mode of operation, becomes constrained
by the speed of a chunk update alone. The high queue depth used in the sixteen test cases contributed to the effect
of coalescing on the command response times (latency) and throughput (IOPS).
Expect application workloads with lower queue depths (from 4 to 16) to also benefit from larger chunk sizes
and stripe sizes, but to a lesser extent than reported with a queue depth of 16. Applications with queue depths
approaching 1 or 2 may benefit from smaller chunk sizes (and smaller stripes) as the total data transfer operation
is reduced in size, with fewer commands being aggregated into larger single sequential accesses.
⁴ Track size will vary, depending on the location of the cylinder. Cylinders towards the outside of the drive will have larger
tracks because of the size of the cylinder, versus cylinders towards the inside of the drive, which are smaller. Seagate Barracuda
ES Serial ATA Product Manual, Revision A, 6/26/06.
Figure 7: Number of I/O Operations/Sec by Application Workload for 16K and 64K Chunk Sizes
A Case for Smaller Chunk Size with Wide Arrays
For performance reasons, systems administrators may choose to configure a large number of drives in a single
array. This configuration maximizes the parallelism of RAID by breaking up a large single data transfer into
multiple and parallel disk commands with smaller data transfers to each drive. Some vendor RAID offerings
allow the system administrator to choose from a range of chunk sizes. In some cases a chunk size less than the
maximum available may make sense to reduce the overall stripe size of the array.
For example: a 12-drive RAID 5 array with 256K chunks will result in an array with a 3MB stripe. Applications
with typically small data transfer requests, especially when made in a random fashion, may perform better with a
smaller stripe. The small data transfer size or randomness of the write requests may limit the effect of writeback
cache, coalescing, read-modify-write and segmentation at the stripe level. Performance may improve by specifying
a 64K chunk (providing a 768K stripe) or a 128K chunk (providing a 1.5MB stripe).
Smaller chunks and smaller stripes should definitely be considered for workloads against wide arrays that require
disabling writeback cache (write-through mode).
Glossary
RAID: (Redundant Array of Independent Disks) A disk subsystem that is used to increase performance or provide
fault tolerance or both. RAID uses two or more ordinary hard disks and a RAID disk controller. In the past, RAID
has also been implemented via software only.
In the late 1980s, the term stood for "redundant array of inexpensive disks," being compared to large,
expensive disks at the time. As hard disks became cheaper, the RAID Advisory Board changed "inexpensive" to
"independent."
Small and Large: RAID subsystems come in all sizes from desktop units to floor-standing models (see NAS and
SAN). Stand-alone units may include large amounts of cache as well as redundant power supplies. Initially used
with servers, RAID is increasingly being added to desktop PCs by retrofitting a RAID controller and extra IDE or SCSI
disks. Newer motherboards often have RAID controllers.
Disk Striping: RAID improves performance by disk striping, which interleaves bytes or groups of bytes across
multiple drives, so more than one disk is reading and writing simultaneously.
Mirroring and Parity: Fault tolerance is achieved by mirroring or parity. Mirroring is 100% duplication of the
data on two drives (RAID 1). Parity is used to calculate the data in two drives and store the results on a third
(RAID 3 or 5). After a failed drive is replaced, the RAID controller automatically rebuilds the lost data from the
other two. RAID systems may have a spare drive (hot spare) ready and waiting to be the replacement for a drive
that fails.
The parity calculation is performed in the following manner: a bit from drive 1 is XORed with a bit from drive 2,
and the result bit is stored on drive 3 (see OR for an explanation of XOR).
The entries below give the external/GUI name, other similar names, and a description.
Drive Array (other names: Storage Pool, Storage Pool Group, Device Group, Sub-device Group): A group of drives
that all have the same array function placed on them. The array functions of the controller are RAID 0 and RAID 5
with an appropriate strip or chunk size. The array is constructed over the minimum useable capacity of all drives
(smallest capacity drive extrapolated to all remaining drives in the array). The array can be broken into logical
drives that are constructed from 1 or many parts of the array. All logical drives allocated from the same drive
array will have the same RAID characteristics.
Full Stripe Write: The amount of data written by the host, before parity, that will result
Logical Drive, Host Logical Drive (other name: Volume): The logical construction of a direct access block device as
defined by the ANSI SCSI specification. Each logical drive has a number of host logical blocks (512 bytes), resulting
in the capacity of the drive. Blocks can be written to or read from. The controller enforces precedence to the order
of commands as received at the controller from connected hosts. Logical drives will either be in a normal state or
a failed state depending on the physical state of the drive array. Task management is carried out at the logical
drive level.
LUN: The protocol endpoint of SCSI communication. There can be multiple LUNs presented for a given logical
drive, thus giving multiple paths to that logical drive. For any host port, a given logical drive can be reported only
once as a LUN. The logical drive the LUN is assigned to may be a host logical drive, including a source or virtual
logical drive (associated with a snapshot).
Overwrite Data Area, ODA (other name: Delta area): Protected storage area on an array that has been dedicated
to storing data from the snapped logical drive that is overwritten (before images). In itself, the ODA is a logical
drive, though not externalized.
Snap Back: The process of restoring a logical drive from a selected snapshot. This process takes place internally
in the RAID controller firmware and needs no support from any backup utility or other third-party software.
Snapshot (other names: Snap, Snapshot Volume): A method for producing a point-in-time virtual copy of a logical
drive. The state of all data on the virtual logical drive remains frozen in time and can be retrieved at a later point
in time. In the process of initiating the snapshot, no data is actually copied from the snapped logical drive. As
updates are made to the snapped logical drives, updated blocks are copied to the ODA.
Snapshot Identifier (other name: Snapshot Number): A unique number identifying the snap relationship to the
source logical drive. The snapshot number is not necessarily unique to the controller, only unique to the host
logical drive. Snapshots are sequentially numbered for an individual host logical drive starting at 0 through N.
Number of Snapshots: The number of total snapshots on the same host logical drive.
Source Logical Drive (other name: Snapshotted Logical Drive): A logical drive that has a snapshot initiated on it.
Strip, Chunk: The minimum unit of storage managed on a given drive as a contiguous storage unit (usually a
number of drive blocks) by the controller. The RAID controller writes data to drives in a drive array in this size.
Stripe Size: The amount of host data plus any parity at which an array function, such as a host read or write,
rotates across all members of the array; the optimum unit of access of the array.
Chunk size * Stripe Width = Stripe Size.
Stripe Width: The number of members in the array.
Virtual Logical Drive, Snapshot Logical Drive: A special logical drive created from the source logical drive and the
data contained in the overwrite data area, implemented by the snapshot function.
© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered in England & Wales. Company no: 03134912. Registered Office: Langstone Road, Havant, Hampshire PO9 1SA, England. The information given in this brochure is for marketing purposes and is not intended to be a specification nor to provide the basis for a warranty. The products and their details are subject to change. For a detailed specification or if you need to meet a specific requirement please contact Xyratex.
UK HQ
Langstone Road
Havant
Hampshire PO9 1SA
United Kingdom
T +44(0)23 9249 6000
F +44(0)23 9249 2284
www.xyratex.com
ISO 14001: 2004 Cert No. EMS91560