RAID Chunk Size
Notices
The information in this document is subject to change without notice.
While every effort has been made to ensure that all information in this document is
accurate, Xyratex accepts no liability for any errors that may arise.
© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered Office:
Langstone Road, Havant, Hampshire, PO9 1SA, England. Registered number 03134912.
No part of this document may be transmitted or copied in any form, or by any means,
for any purpose, without the written permission of Xyratex.
Xyratex is a trademark of Xyratex Technology Limited. All other brand and product
names are registered marks of their respective proprietors.
Acknowledgements
Jerry Hoetger, Director, Product Management, Xyratex International Ltd.
Nigel Hart, Director CPD Engineering, Xyratex International Ltd.
Stuart Campbell, Lead Firmware Architect and author of the Xyratex RAID firmware,
Cork, Ireland, Xyratex International Ltd.
Xyratex CPD Engineering, Xyratex Lake Mary, FL and Xyratex Naperville, IL
Issue 1.0 | January, 2008
Contents
Executive Summary
Abstract
Chunk Size
Striping and Chunks (or Strips)
RAID 0: Striping
RAID 1: Mirroring
RAID 5
RAID 10: A Mix of RAID 0 & RAID 1
Stripe
Adjusting for Performance and Chunk Size
Writing and Reading Chunks and Stripes
Read-Modify-Write for Writing Stripes
Coalesce
RAID Cache Options
Performance Implications of Configuration Options
Large Data Transfer - Sequential Reading and Writing Access Pattern
Small Data Transfer Random Reading and Writing Access Pattern
ESG Defined Workloads and Associated Access Patterns
Performance of Arrays and Varying Chunk Sizes
Test Results
SAS Drive Tests
SATA Drive Tests
Test Conclusions
A Case for Smaller Chunk (or Strip) Size with Wide Arrays
Glossary
Executive Summary
In general, application workloads benefit from large chunk sizes. The most significant reason for using large chunk
sizes today in configuring RAID arrays for high throughput rates and performance is the recent advancement
in linear transfer rates of hard disk drives. The most common drive sizes, 3.5-inch SCSI and Serial ATA, have
scaled from 36GB up to 1TB in today's SATA drives. This phenomenal increase in capacity came about by
increasing the amount of data transferred while the platter rotates under the head/actuator assembly. RAID
systems can now transfer much more data in a single rotation of a platter without the response penalty of a
seek operation. Configuration of RAID arrays for high throughput rates and performance needs to consider the
increased performance of large chunk (or, as some call it, strip) sizes in RAID arrays.
Other RAID features also support larger chunk sizes: write-back cache capabilities, coalescing of stripe
writes, segmentation of data transfer requests into stripes, and the deployment of read-modify-write operations
all optimize RAID performance with large chunk (and large stripe) sizes.
Some applications use wide arrays, with between 12 and 16 drives in an array, and therefore have extra large
stripes. These arrays may benefit from using the smaller chunk size options available in most controllers.
Benchmarks show that commands to RAID arrays with the smaller chunk size (and stripe size) exhibited slower
command response times (latency) and lower throughput (IOPS) than the same commands with larger chunk
size and larger stripe size. This effect was noted in almost all workloads and on several different RAID offerings,
regardless of drive type (SAS or SATA). Increased linear transfer rate (areal density), RAID firmware write-back
cache and the ability to coalesce write commands contribute to the benefits of larger chunks and larger
stripes.
Abstract
Considering the recent advances in disk drive technology, configuring RAID systems for optimum performance
presents new challenges. Over the last five years, drive manufacturers have been steadily increasing the capacity
of disk drives by increasing the density and linear transfer rates.
Although seek times have not improved in relation to linear transfer rates, to optimize performance it is necessary
to transfer as much data as possible while the disk drive heads are in one position. We accomplish this by
increasing the chunk size of the disk array. The calculations are based on the number of drives, the RAID level of
the array, and the type of use intended for the array.
For example, database servers would have a different chunk size requirement than one for a general-use server
or mail server.
Throughout this paper the core concepts of chunk size and full stripe writes are examined, as well as benchmark
test results.
Chunk Size
Striping and Chunks
All RAID levels used in today's storage rely on the concept of striping data across multiple disk drives in an array.
The data in a stripe is usually made up of multiple records or segments of information, in accordance with a layout
defined by an application running on the host. The particular record layout is of no significance to the RAID processing.
Host applications access the logical data records or segments with varying access patterns. Access patterns and
configuration options defined as part of the RAID array will impact the performance of the array.
RAID 0: Striping
Figure 1: RAID 0 Array
In a RAID 0 system, data is split up into blocks that will get written across all the drives in the array. By using
multiple disks (at least 2) at the same time, RAID 0 offers superior I/O performance. This performance can be
enhanced further by using multiple controllers, ideally one controller per disk.
Advantages
• RAID 0 offers great performance for both read and write operations.
• There is no overhead caused by parity controls.
• All storage capacity can be used; there is no disk overhead.
• The technology is easy to implement.
Disadvantages
• RAID 0 is not fault-tolerant. If one disk fails, all data in the RAID 0 array is lost.
• It should not be used on mission-critical systems.
Ideal Use
RAID 0 is ideal for non-critical storage of data that has to be read from or written to a workstation; for example:
a workstation using Adobe Photoshop for image processing.
RAID 1: Mirroring
Data is stored twice by writing to both the data disk (or sets of data disks) and a mirror disk (or sets of disks). If
a disk fails, the controller uses either the data disk or the mirror disk for data recovery, and continues operation.
You need at least two disks for a RAID 1 array.
RAID level 1 is often combined with RAID level 0 to improve performance. This RAID level is sometimes referred
to as RAID level 10.
Advantages
• RAID 1 offers excellent read speed and a write speed that is comparable to that of a single disk.
• In case a disk fails, data does not have to be rebuilt; it just has to be copied to the replacement disk.
• RAID 1 is a very simple technology.
Disadvantages
• The main disadvantage is that the effective storage capacity is only half of the total disk capacity because all
data gets written twice.
• Software RAID 1 solutions do not always allow a hot swap of a failed disk (meaning it cannot be replaced
while the server continues to operate). Ideally a hardware controller is used.
Ideal use
RAID 1 is ideal for mission-critical storage, for instance with accounting systems. It is also suitable for small
servers in which only two disks will be used.
RAID 5
RAID 5 is the most common secure RAID level. It is similar to RAID 3 except that data is transerred to disks by
independent read and write operations (not in parallel). The data chunks that are written are also larger. Instead
o a dedicated parity disk, parity inormation is spread across all the drives. You need at least 3 disks or a RAID
5 array.
A RAID 5 array can withstand a single disk ailure without losing data or access to data. Although RAID 5 can
be achieved in sotware, a hardware controller is recommended. Oten extra cache memory is used on these
controllers to improve the write perormance.
Figure 2: RAID 1 Array
Figure 3: RAID 5 Array
Figure 4: RAID 10 Array
Advantages
Read data transactions are very fast, while write data transactions are somewhat slower (due to the parity that
has to be calculated).
Disadvantages
• Disk failures have an effect on throughput, although this is still acceptable.
• Like RAID 3, this is complex technology.
Ideal use
RAID 5 is a good all-round system that combines efficient storage with excellent security and decent performance.
It is ideal for file and application servers.
RAID 10: A Mix of RAID 0 & RAID 1
RAID 10 combines the advantages (and disadvantages) of RAID 0 and RAID 1 in a single system. It provides
security by mirroring all data on a secondary set of disks (disks 3 and 4 in Figure 4) while using striping
across each set of disks to speed up data transfers.
Stripe
The aggregation of records or segments is commonly called a stripe and represents a common object in the
array that is maintained by the firmware in the controller, based on commands from the host application. For
each stripe read or written (full or partial) by the RAID firmware, multiple I/O requests are made to each
member drive in the array, depending on the RAID type and number of drives in the array.
For RAID 0 and 10, no read-modify-write is performed, so there is a single I/O request to a member drive for each
host I/O, which can be reduced by coalescing. For RAID 5/50/6, a host read command translates to a single I/O
to a single drive, not multiple I/Os to the same drives. For an update (read-modify-write), there are multiple I/Os
to the data and parity drive(s). For a full stripe write, there is a single I/O for each drive. These requests are made
in parallel. This parallelism of full stripe writes in RAID striping reduces command response time
(latency) and increases I/O commands per second (IOPS) for hosts writing to the array sequentially.
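
As a rough illustration of the per-drive I/O counts described above, the following Python sketch estimates how many back-end drive I/Os a single host write can generate. The function name and the exact counts (for instance, four I/Os for a parity-RAID read-modify-write touching one data drive and one parity drive) are simplifying assumptions for illustration, not a description of the Xyratex firmware.

# Sketch only: estimate back-end drive I/Os generated by one host write,
# following the behavior outlined above. Counts are simplified assumptions.
def drive_ios_for_host_write(raid_level, members, full_stripe=False):
    if raid_level == 0:                    # striping only
        return 1                           # one write to one member drive
    if raid_level in (1, 10):              # mirrored data
        return 2                           # data copy plus mirror copy
    if raid_level in (5, 50):              # parity RAID
        if full_stripe:
            return members                 # one write per member, parity included
        return 4                           # read data, read parity, write data, write parity
    raise ValueError("RAID level not covered by this sketch")

print(drive_ios_for_host_write(5, members=6))                    # 4 (read-modify-write)
print(drive_ios_for_host_write(5, members=6, full_stripe=True))  # 6 (full stripe write)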
Adjusting for Performance and Chunk Size
System administrators have options when defining the RAID systems used for storing and retrieving data
by applications. Each application has a unique pattern for storing and retrieving the data based on how the
application works, and how users of the application make requests. This pattern of reading and writing data,
along with the system administrator specified options, affects the performance of the application.
Chunk size is one of the configuration options; it is the portion of the stripe apportioned to an individual disk
drive. Some RAID offerings let the system administrator define the size of the stripe across all of the member drives
(and derive the size of the chunk written to individual drives based on the number of member drives). Other RAID
offerings have the system administrator configure the chunk or strip size (and derive the larger stripe size based
on the number of member drives in the array). In either case the chunk size has performance implications.
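
A minimal sketch of the two configuration styles just described, assuming the relation given in the glossary (Chunk size * Stripe Width = Stripe Size, where stripe width is the number of member drives). The function names and the example array are illustrative only.

# Minimal sketch, assuming stripe size = chunk size x number of member drives
# (the relation stated in the glossary); names are illustrative only.
def derive_stripe_size_kib(chunk_kib, member_drives):
    """Vendor lets you pick the chunk size; the stripe size follows."""
    return chunk_kib * member_drives

def derive_chunk_size_kib(stripe_kib, member_drives):
    """Vendor lets you pick the stripe size; the per-drive chunk follows."""
    return stripe_kib // member_drives

# Example: a 6-drive array configured with a 64 KiB chunk.
print(derive_stripe_size_kib(64, 6))   # 384 KiB stripe
print(derive_chunk_size_kib(384, 6))   # 64 KiB chunk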
Writing and Reading Chunks and Stripes
RAID controller firmware reads and writes stripes to member drives in an array. Writes can be either full stripe
or partial stripe write operations (see the Read-Modify-Write section for more information). System administrators
stipulate a stripe size (and therefore derive chunk size), or stipulate the chunk size (and therefore derive stripe
size). Full stripe write operations stripe across all member drives.
Partial stripe operations read and write only to the affected portions of the stripe (usually one data drive and one
parity drive). Read operations read only the amount of data associated with the host command. The presence of
read-ahead capabilities in a RAID controller may increase the amount of data read beyond that which is required
to satisfy the host command, in preparation for subsequent read commands.
For RAID levels with parity, parity chunks are maintained on writes. Different RAID firmware handles reads
and writes somewhat differently, depending on the vendor and the RAID firmware strategy for full stripe reads
and writes.
Writes can be handled in slightly different ways, depending on the vendor RAID strategy. Common in early RAID
deployments, vendor RAID strategies stipulated that with writes and updates, the complete stripe was first read
from all member drives in an array, locked for update with appropriate data and parity, verified, then completely
striped across all member drives in the array with updated data and parity, at which time the lock for that stripe
would be released. Xyratex RAID deploys a read-modify-write strategy where only the portion of the stripe
written to and the portion of the parity chunk updated are transmitted to member disk drives as part of the write
or update operation. This mitigates the performance implications of alternative full stripe writes.
Read-Modify-Write for Writing Stripes¹
Xyratex has implemented read-modify-write for writing stripes. Only the affected portion of the stripe (and parity
chunks) is written to the array for servicing writes. This mitigates the performance implication of full stripe writes
for random workloads, and is one of the reasons system administrators can use larger chunks and stripes for
all workloads.²
The write operation is responsible for generating and maintaining parity data. This function is typically referred
to as a read-modify-write operation. Consider a stripe composed of four chunks of data and one chunk of parity.
Suppose the host wants to change just a small amount of data that takes up the space of only one chunk within
the stripe. The RAID firmware cannot simply write that small portion of data and consider the request complete.
It also must update the parity data, which is calculated by performing XOR operations on every chunk within the
stripe. So parity must be recalculated when one or more chunks within a stripe change.
¹ Dell Power Solutions, "Understanding RAID-5 and I/O Processors," Paul Luse, May 2003
² Full stripe writes for sequential write workloads are faster than read-modify-write commands. If the application can fill up
a stripe completely, overall full stripe writes perform faster.
Figure 5: Step-by-Step: Read-Modify-Write Operation to a Four-Disk RAID Array
Figure 5 shows a typical read-modify-write operation in which the data that the host is writing to disk is contained
within just one strip, in position D5. The read-modify-write operation consists of the following steps:
1. Read new data from host: The host operating system requests that the RAID subsystem write a piece of
data to location D5 on disk 2.
2. Read old data from target disk for new data: Reading only the data in the location that is about to be
written to eliminates the need to read all the other disks. The number of steps involved in the read-modify-
write operation is the same regardless of the number of disks in the array.
3. Read old parity from target stripe for new data: A read operation retrieves the old parity. This function
is independent of the number of physical disks in the array.
4. Calculate new parity by performing an XOR operation on the data from steps 1, 2, and 3: The
XOR calculation of steps 2 and 3 provides the resultant parity of the stripe, minus the contribution of the data
that is about to be overwritten. To determine new parity for the D5 stripe that contains the new data, an
XOR calculation is performed on the new data read from the host in step 1 with the result of the XOR
procedure performed in steps 2 and 3.
5. Handle coherency: This process is not detailed in Figure 5 because its implementation varies greatly from
vendor to vendor. Ensuring coherency involves monitoring the write operations that occur from the start of
step 6 to the end of step 7. For the disk array to be considered coherent, or clean, the subsystem must ensure
that the parity data block is always current for the data on the stripe. Because it is not possible to guarantee
that the new target data and the new parity will be written to separate disks at exactly the same instant, the
RAID subsystem must identify the stripe being processed as inconsistent, or dirty, in RAID vernacular.
6. Write new data to target location: The new data was received from the host in step 1; now the RAID
mappings determine on which physical disk, and where on the disk, the data will be written. With Xyratex
RAID only the portion of the chunk updated will be written, not the complete chunk.
7. Write new parity: The new parity was calculated in step 4; now the RAID subsystem writes it to disk.
8. Handle coherency: Once the RAID subsystem verifies that steps 6 and 7 have been completed successfully,
and the data and parity are both on disk, the stripe is considered coherent.
This optimized method is fully scalable. The number of read, write, and XOR operations is independent of the
number of disks in the array. Because the parity disk is involved in every write operation (steps 6 and 7), parity
is rotated to a different disk with each stripe. If all the parity were stored on the same disk all the time, that disk
could become a performance bottleneck.
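
The parity arithmetic behind steps 2 through 4 can be shown with a short Python sketch. This is purely illustrative code for the XOR relationships described above, not the controller implementation; the helper names are invented.

# Illustrative sketch of the read-modify-write parity math described above
# (byte-wise XOR in pure Python); not the controller implementation.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    # New parity = old parity XOR old data XOR new data (steps 2-4),
    # so the cost does not depend on how many disks are in the array.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    return new_data, new_parity

# Example: a stripe of four data chunks plus parity (parity = XOR of all chunks).
chunks = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6]), bytes([7, 8])]
parity = bytes(len(chunks[0]))
for c in chunks:
    parity = xor_blocks(parity, c)

# Update one chunk with read-modify-write and check parity stays consistent.
chunks[2], parity = read_modify_write(chunks[2], parity, bytes([9, 9]))
check = bytes(len(chunks[0]))
for c in chunks:
    check = xor_blocks(check, c)
assert check == parity  # parity is still the XOR of all data chunks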
Coalesce
Coalescing is the process of aggregating several command requests together to form a single and larger
sequential access on the member drives in an array. Coalescing represents a level of buffering commands to the
member drives of an array. In this way, multiple write commands and associated data can be aggregated into a
stripe. A single large sequential access reduces the number of disk commands and data transfers significantly,
thus greatly increasing the amount of host I/O that can be processed in any specific time interval.
Drive level coalescing by the RAID controller significantly contributes to the overall performance of an array
and is very dependent on the access patterns of the application, as well as the algorithms of the RAID offering.
Sequential applications with high queue depths leverage the improved performance of write coalescing more
than applications that have low queue depths and are more random. With a queue depth of one and in write-through
mode, only one write command will be issued at a time. The host will wait to receive an acknowledgement
that an initial write completes before issuing subsequent write commands. The RAID firmware will not have
an opportunity to coalesce, issuing an individual write operation for each write command received. The use of
writeback cache settings allows the controller to acknowledge the write to the host application while
the data associated with the command is coalesced into a common stripe.
As the application (and the host) issues multiple outstanding write commands, coalescing can increasingly
improve performance of the array, regardless of the write-through or writeback cache setting. Each data transfer
write request potentially requires a corresponding write operation of the stripe, with individual reads and writes
to each member drive of the array (read-modify-write), or two I/O operations per drive required for each write.
Some write operations that span stripes (not aligned with stripe boundaries) might require three. With each
extra outstanding write operation per drive (queue depth), the RAID firmware has an increasing chance of
grouping such commands together in a single sequential data transfer.
The size of the logical drive will also impact the effect of coalescing random workloads. Larger logical drives that
span the full range of the disk platters of member drives in the array will require longer command response times
for coalesced random commands than smaller logical drives that are limited to just several cylinders of each set
of drives' disk platters. The smaller range of cylinders will limit the effect of the randomness of the commands,
reporting almost sequential-like application performance.
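
As a simplified sketch of the coalescing idea above, the following Python fragment groups queued write requests by the stripe they fall in, so each group could be issued as one larger sequential access. The stripe geometry and function name are assumptions for illustration, not the Xyratex algorithm.

# Simplified sketch of write coalescing: queued writes that land in the
# same stripe are grouped so they could be issued as one larger access.
from collections import defaultdict

def coalesce_by_stripe(writes, stripe_size):
    """Group (offset, length) write requests by the stripe index they fall in."""
    groups = defaultdict(list)
    for offset, length in writes:
        groups[offset // stripe_size].append((offset, length))
    return dict(groups)

# Example: with a deep queue, small writes that share a 384 KiB stripe
# collapse into far fewer back-end operations.
queued = [(0, 4096), (8192, 4096), (393216, 4096), (12288, 4096)]
print(coalesce_by_stripe(queued, stripe_size=384 * 1024))
# {0: [(0, 4096), (8192, 4096), (12288, 4096)], 1: [(393216, 4096)]}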
RAID Cache Options
Many RAID offerings today offer the option of battery-backed cache and an associated writeback mode
of operation where, for write commands, the host is acknowledged immediately after the data transfer is
received into the cache of the RAID system. In the rare event of a power failure, a hardware failure in the
controller, switch, or attached cables, or a failure in the RAID firmware, the record stays in cache until it can be
completely externalized as part of a recovery process. This mode of operation reduces the response time (or
latency) associated with the command. This mode of operation also reduces the throughput of the controller
due to the mirroring of cache operations between the two controllers in a dual controller configuration. This
mirroring operation is usually done through dedicated channels on the base plane of the controller.
Another option found in most of today's RAID systems allows the system administrator to turn off the RAID
system cache for writes. Many large data transfer applications can get higher throughput (MB/s) in this mode
as records are externalized directly to the member drives in an array, without relying on the RAID cache. This is
called write-through mode.
Some RAID offerings allow the system administrator to partition the cache for writes, protecting the performance
of other applications sharing the RAID system cache, and to turn off cache mirroring while in writeback
mode. The RAID system needs to revert to write-through when cache capacity is unavailable to commands.
Partitioning of cache defines the threshold for transitioning, as well as conserving cache for servicing commands
to other arrays. Turning off cache mirroring provides the command response time benefits of cached writes and
fast response times, while sacrificing fault tolerance.
Performance Implications of Configuration Options
Many factors impact the performance of an array and thus the performance of host-based applications that
access one or more arrays. Significant factors include:
1. Data transfer size to and from the host. This is usually set by the application.
2. Access pattern to the array by the host or hosts (concurrency of multiple hosts, sequential, random, random-
sequential, queue depth).
3. RAID level.
4. RAID cache for writes (writeback mode or write-through mode, amount of cache allocated for writes).
5. Chunk size and therefore stripe size.
6. Number of member drives in the array.
7. Drive performance.
The system administrator usually has options for configuring RAID level, RAID cache operations, chunk size,
number of member drives and drive type (affecting drive performance). The application usually dictates data
transfer size and access pattern, although these can be tuned with some operating systems.
Furthermore, vendor RAID strategies will provide differing options to choose from for configuring arrays. Some
RAID vendors let system administrators set chunk size, others stripe size. With either approach, the vendor
offering will provide a list of available options ranging from as small as 4K up to 256K chunk sizes. Each vendor
strategy includes a selection of RAID levels and has slightly differing strategies for implementing each. The
number of drives can vary, but usually ranges from 1 for simple RAID 0 to 16. RAID vendors support a variety
of different drive types today, including Serial Attached SCSI (SAS), Serial ATA (SATA), Fibre Channel (FC),
and SCSI, each with its own performance characteristics and capacities. Usually an array is limited to
a single drive type and capacity.
Given the available options to the system administrator, several general recommendations can be made relative
to the required performance of host commands to the array based on the access patterns and data transfer size
dictated by the host application.
Large Data Transfer - Sequential Reading and Writing Access Pattern
This workload will have data transfer sizes of at least 32K, and more typically 1MB or larger. A special case can
be made for workloads that read and write larger than 1MB to the array. Some vendor RAID strategies may be
limited to 1MB or perform less than optimally when supporting greater host transfer sizes. The access pattern is
predominantly sequential: read, write, or alternating between read and write for very long intervals of time.
Especially important for sequential write workloads is the ability to fill up stripes with data to maximize
performance (the effect of full stripe writes and stripe coalescing). Maintaining enough application data to fill up
stripes with sequential write access patterns can happen in several ways:
1. Large data transfer sizes from the host. Some applications, such as video streaming, allow for configurable data
transfer size. Ideally the application's data transfer size should be aligned with the stripe size.
2. The application can issue multiple outstanding commands to build queues of commands that can be
sequentially executed to fill up stripes by the RAID controller across the array.
3. The writeback cache setting can fill up stripes from applications that maintain low or non-existent queue depths
of outstanding commands.
When the application fills up the stripe during write sequences, large chunks result in maximum throughput
capabilities. Larger data transfer sizes and large stripes facilitate segmenting of small records into larger data
transfers, and multiple data transfers into larger stripes and even larger coalesced data transfers to member disk
drives in the array. If the application cannot fill up the stripe, the system administrator should consider smaller
stripes, either via fewer drives in the array or by choosing a smaller chunk size.
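
A small hedged sketch of the alignment point above: given a host transfer size and a stripe size, it counts how many full stripe writes result and whether a partial tail (which would need a read-modify-write) remains. The sizes are illustrative.

# Sketch only: count full stripe writes and any partial tail for a given
# host transfer size and stripe size (sizes here are illustrative).
def stripe_fill(transfer_bytes, stripe_bytes):
    full_stripes, remainder = divmod(transfer_bytes, stripe_bytes)
    return full_stripes, remainder

# A 1 MiB transfer against a 384 KiB stripe: two full stripes plus a
# 256 KiB partial write that still needs a read-modify-write.
print(stripe_fill(1024 * 1024, 384 * 1024))   # (2, 262144)

# The same transfer against a 256 KiB stripe fills stripes exactly.
print(stripe_fill(1024 * 1024, 256 * 1024))   # (4, 0)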
Many large data transfer applications perform better with write cache disabled, or when the controller is in
write-through mode. Write-through mode optimizes the RAID performance for high throughput rates (MB/s)
by bypassing cache coherency operations and writing aggregated streams of data to member drives in an array.
Writeback cache should only be considered for small data transfer applications or filling up stripe writes.
Command response time (latency) or a high number of commands per second (IOPS) are typically not important
to these workloads. The data transfers are large or very large, and will by nature take longer to service as they
are striped across many drives in an array. Typical workloads in this category include disk-to-disk backup, virtual
tape library, archiving applications, and some network file system deployments.
Small Data Transfer Random Reading and Writing Access Pattern
This workload will typically have data transfer sizes of less than 32K, and more typically 4K or 8K in size. The
access pattern is predominantly random reads or writes, or intermixed with updates. These workloads typically
do not maintain much of a queue of outstanding commands to the array, usually less than 6. The depth of the
queue is very application dependent and can vary depending on the number of threads the application can
generate, the number of processors, the number of clients supported, or other programming constructs used.
In this case, writeback cache operations result in minimum response time and the maximum number of commands
per second (IOPS) from the array, regardless of the chunk size and corresponding stripe size. In writeback mode,
larger cache sizes will increase the likelihood of filling up stripes to leverage the effect of full stripe writes.
Throughput is typically not important to these workloads. Typical workloads in this category include database,
transactional systems, web servers, and many file servers.
ESG Defined Workloads and Associated Access Patterns
Here is a very detailed description of access patterns by workload used in benchmarking array performance by
ESG. The queue depth is fixed at 16, which may not be typical for most database or transaction workloads.
Figure 6: ESG Defined Workloads
³ Nigel Hart, Xyratex CDP Engineering Manager, ESG Chunk Effects.xls, dated 3 July 2007.
Performance of Arrays and Varying Chunk Sizes³
Xyratex CDP Engineering used ESG-specified workloads, specially modified versions of the latest Xyratex RAID
offerings, and a recent DotHill RAID offering to test the implications of varying chunk sizes on fixed and repeatable
workloads (Model RS-1220-F4-5412E and Model RS-1220-F4-5402E with 1Gb of cache per controller from
Xyratex, and the Model 2730 from DotHill).
Xyratex used Intel's IOMeter to simulate the workload patterns specified by ESG as typical for various
applications. See Figure 6 for a detailed description of each workload. The configuration consisted of one
system using a Microsoft Windows 2003 host with a fibre channel interface connected (at 4Gb/s) to the Xyratex
RAID system configured with a single RAID 5 (5+1) array, and a single LUN externalized to the host. Seagate
X15-5 SAS drives and Seagate Barracuda SATA-II drives were used. The RAID systems were configured with dual
controllers and cache mirroring in effect, even though only one controller was active.
The tests were run with IOMeter against this common single array/single LUN configuration, against each of the
ESG workload categories, and with the following different combinations of platform and chunk size.
RAID Offering              Chunk Size   Drive Type
Xyratex RS-1220-F4-5402E   16K          SAS
Xyratex RS-1220-F4-5402E   64K          SAS
Xyratex RS-1220-F4-5412E   16K          SAS
Xyratex RS-1220-F4-5412E   64K          SAS
DotHill 2730               16K          SAS
DotHill 2730               64K          SAS
Xyratex RS-1220-F4-5402E   16K          SATA
Xyratex RS-1220-F4-5402E   64K          SATA
Xyratex RS-1220-F4-5412E   16K          SATA
Xyratex RS-1220-F4-5412E   64K          SATA
DotHill 2730               16K          SATA
DotHill 2730               64K          SATA
Both 16K and 64K chunk sizes were common to both DotHill and Xyratex, and therefore were used for comparison
purposes. Xyratex RAID systems offer chunk size options of 64K, 128K, and 256K, and derive the stripe size
from the number of drives in the array multiplied by the selected chunk size.
DotHill offers 16K, 32K, and 64K chunk sizes. Since these are lower than the chunk sizes available on Xyratex
RAID systems, Xyratex created a special set of RAID firmware to enable identical configuration options for one-
to-one comparisons.
For a 16K chunk size the derived stripe size is 96K. For a 32K chunk size the derived stripe size is 192K. If chunk
size made a difference, it would be reflected in lower command response times (latency) and higher command
throughput (IOPS) in these tests.
The RS-1220-F4-5412E is the current shipping version of the RAID system from Xyratex. The previous model, the
RS-1220-F4-5402E, used an older version of the LSI 1064 SAS controller and Intel RAID ASIC (IOP331). That model
was included only for comparison purposes.
Test Results
Both Xyratex and DotHill RAID with the smaller chunk size (16K) and stripe size (96K) exhibited slower command
response times (latency) and lower throughput (IOPS) than the same RAID offering with larger chunk size (64K)
and stripe size (192K). This effect was noted in almost all workloads and RAID offerings, except one DotHill test
case, regardless of drive type (SAS or SATA).
SAS Drive Tests
SAS, 64K chunk to 16K      5402    5412    2730
4K OLTP                    -11%    -13%    -5%
File Server                -19%    -19%    -14%
Web Server                 -20%    -20%    -20%
4 KB Random Read           -10%    -10%    -10%
4 KB Random Write          -15%    -9%     2%
512 KB Seq Write           -37%    -37%    -51%
512 KB Seq Read            -21%    -21%    -11%
Media Server               -26%    -26%    -35%
Exchange Data              -10%    -10%    -5%
Exchange Logs              -41%    -42%    -33%
File Server                -19%    -20%    -15%
Web Server                 -20%    -20%    -21%
Windows Media Player       -26%    -26%    -36%
Exchange Data (EBD)        -10%    -10%    -4%
Exchange Logs (LOG)        -41%    -42%    -31%
Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each
workload. Negative numbers indicate that the 16K chunk size performed with fewer IOPS than the 64K chunk size
test case, in relative terms.
Sequential workloads performed significantly worse with smaller chunk sizes (and smaller stripe sizes) than random
workloads did. Write workloads performed significantly worse with smaller chunk sizes (and smaller stripe sizes) than
read workloads did. This effect is attributed to the lack of 1st-level segmentation within a stripe with random and
write workloads, whereas sequential workloads benefit from both 1st-level segmentation within the stripe (with
larger stripes) and 2nd-level coalescing of multiple stripes at a queue depth of 16.
SATA Drive Tests
SATA, 64K chunk to 16K     5402    5412    2730
4K OLTP                    -10%    -10%    -7%
File Server                -21%    -22%    -18%
Web Server                 -22%    -21%    -23%
4 KB Random Read           -12%    -10%    -11%
4 KB Random Write          -13%    -11%    -3%
512 KB Seq Write           -82%    -58%    13%
512 KB Seq Read            -10%    -10%    -1%
Media Server               -15%    -16%    -32%
Exchange Data              -10%    -8%     -19%
Exchange Logs              -49%    -76%    -43%
File Server                -20%    -22%    -19%
Web Server                 -22%    -21%    -23%
Windows Media Player       -15%    -16%    -33%
Exchange Data (EBD)        -10%    -8%     -8%
Exchange Logs (LOG)        -49%    -75%    -7%
Relative percent calculations are based on the difference in total IOPS between the 64K and 16K chunk sizes for each
workload. Negative numbers indicate that the 16K chunk size performed with fewer IOPS than the 64K chunk size
test case, in relative terms.
Similar differences between random and sequential workloads as were seen with SAS drives are seen with SATA drives.
In general, the relative differences with sequential workloads are exacerbated by slower SATA drive
performance.
Test Conclusions
Smaller chunk size does not improve latency or increase IOPS for any of the ESG workloads, at the specified queue
depth (16) and for the specified RAID offerings, other than one test case with the DotHill RAID controller. The
ESG defined workloads are mostly random and small data transfer size workloads, where system administrators
would typically assign small chunk sizes or stripe sizes to improve performance and increase IOPS. For the
sequential workloads (Media Server, Exchange Data, Exchange Logs, and Media Player), smaller chunk size did
impact command response time (latency) and reduce IOPS.
The effect of the very fast linear transfer rates of today's disk drives translates to larger chunks for high RAID
performance. The use of a small chunk size, 16K, did not improve performance when compared to a larger 64K
chunk.
System administrators need to configure external RAID systems such that full stripe reads and writes leverage
these fast linear transfer rates by handling as much data transfer as possible without incurring the penalty of slow
seek times. Today's hard disk drives have very large track capacities. The Barracuda ES family of Serial ATA drives
used in this study from Seagate has an average track size of 32K.⁴ The largest capacity Barracuda ES drive has 8
tracks per cylinder, translating to 256K per data transfer between seeks. The smallest Barracuda ES drive has 3
tracks per cylinder, translating to 96K per data transfer between seeks.
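
The arithmetic behind those figures can be written as a one-line sketch; the function name is illustrative and the numbers simply restate the Barracuda ES values quoted above.

# Back-of-the-envelope sketch: data transferable from one cylinder position
# before a seek, using the average track size cited above.
def transfer_per_seek_kib(avg_track_kib, tracks_per_cylinder):
    return avg_track_kib * tracks_per_cylinder

print(transfer_per_seek_kib(32, 8))   # 256K (largest Barracuda ES)
print(transfer_per_seek_kib(32, 3))   # 96K (smallest Barracuda ES)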
Coalescing of data transfers with the writing of data into stripes also helps produce the higher performance and
IOPS of the larger chunk size. Coalescing is the process of aggregating several command requests together to
form a single and larger sequential access on the member drives in an array. A single large sequential access
reduces the number of disk commands and data transfers significantly, thus greatly increasing the amount of
host I/O that can be processed in any specific time interval. RAID, in this mode of operation, becomes constrained
by the speed of a chunk update alone. The high queue depth used in the sixteen test cases contributed to the effect
of coalescing on the command response times (latency) and throughput (IOPS).
Expect application workloads with lower queue depths (from 4 to 16) to also benefit from larger chunk sizes
and stripe sizes, but to a lesser extent than reported with a queue depth of 16. Applications with queue depths
approaching 1 or 2 may benefit from smaller chunk sizes (and smaller stripes) as the total data transfer operation
is reduced in size, with fewer commands being aggregated into larger single sequential accesses.
⁴ Track size will vary, depending on the location of the cylinder. Cylinders towards the outside of the drive will have larger
tracks because of the size of the cylinder, versus cylinders towards the inside of the drive, which are smaller. Seagate Barracuda
ES Serial ATA Product Manual, Revision A, 6/26/06.
Figure 7: Number of I/O Operations/Sec by Application Workload for 16K and 64K Chunk Sizes
A Case for Smaller Chunk Size with Wide Arrays
For performance reasons, systems administrators may choose to configure a large number of drives in a single
array. This configuration maximizes the parallelism of RAID by breaking up a large single data transfer into
multiple and parallel disk commands with smaller data transfers to each drive. Some vendor RAID offerings
allow the system administrator to choose from a range of chunk sizes. In some cases a chunk size less than the
maximum available may make sense to reduce the overall stripe size of the array.
For example: a 12-drive RAID 5 array with 256K chunks will result in an array with a 3MB stripe. Applications
with typically small data transfer requests, especially when made in a random fashion, may perform better with a
smaller stripe. The small data transfer size or randomness of the write requests may limit the effect of writeback
cache, coalescing, read-modify-write and segmentation at the stripe level. Performance may improve by specifying
a 64K chunk (providing a 768K stripe) or a 128K chunk (providing a 1.5MB stripe).
Smaller chunks and smaller stripes should definitely be considered for workloads against wide arrays that require
disabling writeback cache (write-through mode).
Glossary
RAID: (Redundant Array of Independent Disks) A disk subsystem that is used to increase performance or provide
fault tolerance or both. RAID uses two or more ordinary hard disks and a RAID disk controller. In the past, RAID
has also been implemented via software only.
In the late 1980s, the term stood for "redundant array of inexpensive disks," being compared to large,
expensive disks at the time. As hard disks became cheaper, the RAID Advisory Board changed "inexpensive" to
"independent."
Small and Large: RAID subsystems come in all sizes from desktop units to floor-standing models (see NAS and
SAN). Stand-alone units may include large amounts of cache as well as redundant power supplies. Initially used
with servers, RAID is increasingly being added to desktop PCs by retrofitting a RAID controller and extra IDE or SCSI
disks. Newer motherboards often have RAID controllers.
Disk Striping: RAID improves performance by disk striping, which interleaves bytes or groups of bytes across
multiple drives, so more than one disk is reading and writing simultaneously.
Mirroring and Parity: Fault tolerance is achieved by mirroring or parity. Mirroring is 100% duplication of the
data on two drives (RAID 1). Parity is used to calculate the data in two drives and store the results on a third
(RAID 3 or 5). After a failed drive is replaced, the RAID controller automatically rebuilds the lost data from the
other two. RAID systems may have a spare drive (hot spare) ready and waiting to be the replacement for a drive
that fails.
The parity calculation is performed in the following manner: a bit from drive 1 is XORed with a bit from drive 2,
and the result bit is stored on drive 3 (see OR for an explanation of XOR).
The entries below give the external/GUI name, other similar names, and a description.
Drive Array (other names: Storage Pool, Storage Pool Group, Device Group, Sub-device Group): A group of drives
that all have the same array function placed on them. The array functions of the controller are RAID 0 and RAID 5
with an appropriate strip or chunk size. The array is constructed over the minimum useable capacity of all drives
(smallest capacity drive extrapolated to all remaining drives in the array). The array can be broken into logical
drives that are constructed from 1 or many parts of the array. All logical drives allocated from the same drive
array will have the same RAID characteristics.
Full Stripe Write: The amount of data written by the host, before parity, that will result
Logical Drive, Host Logical Drive (other name: Volume): The logical construction of a direct access block device as
defined by the ANSI SCSI specification. Each logical drive has a number of host logical blocks (512 bytes), resulting
in the capacity of the drive. Blocks can be written to or read from. The controller enforces precedence to the order
of commands as received at the controller from connected hosts. Logical drives will either be in a normal state or
a failed state depending on the physical state of the drive array. Task management is carried out at the logical
drive level.
LUN: The protocol endpoint of SCSI communication. There can be multiple LUNs presented for a given logical
drive, thus giving multiple paths to that logical drive. For any host port, a given logical drive can be reported only
once as a LUN. The logical drive the LUN is assigned to may be a host logical drive, including a source or virtual
logical drive (associated with a snapshot).
Overwrite Data Area, ODA (other name: Delta area): Protected storage area on an array that has been dedicated
to storing data from the snapped logical drive that is overwritten (before images). In itself, the ODA is a logical
drive, though not externalized.
Snap Back: The process of restoring a logical drive from a selected snapshot. This process takes place internally
in the RAID controller firmware and needs no support from any backup utility or other third-party software.
Snapshot (other names: Snap, Snapshot Volume): A method for producing a point-in-time virtual copy of a logical
drive. The state of all data on the virtual logical drive remains frozen in time and can be retrieved at a later point
in time. In the process of initiating the snapshot, no data is actually copied from the snapped logical drive. As
updates are made to the snapped logical drives, updated blocks are copied to the ODA.
Snapshot Identifier (other name: Snapshot Number): A unique number identifying the snap relationship to the
source logical drive. The snapshot number is not necessarily unique to the controller, only unique to the host
logical drive. Snapshots are sequentially numbered for an individual host logical drive starting at 0 through N.
Number of Snapshots: The number of total snapshots on the same host logical drive.
Source Logical Drive (other name: Snapshotted Logical Drive): A logical drive that has a snapshot initiated on it.
Strip, Chunk: The minimum unit of storage managed on a given drive as a contiguous storage unit (usually a
number of drive blocks) by the controller. The RAID controller writes data to drives in a drive array in this size.
Stripe Size: The amount of host data plus any parity at which an array function, such as a host read or write,
rotates across all members of the array; the optimum unit of access of the array.
Chunk size * Stripe Width = Stripe Size.
Stripe Width: The number of members in the array.
Virtual Logical Drive, Snapshot Logical Drive: A special logical drive created from the source logical drive and the
data contained in the overwrite data area, implemented by the snapshot function.
© 2008 Xyratex (the trading name of Xyratex Technology Limited). Registered in England & Wales. Company no: 03134912. Registered Office: Langstone Road, Havant, Hampshire PO9 1SA, England. The information given in this brochure is for marketing purposes and is not intended to be a specification nor to provide the basis for a warranty. The products and their details are subject to change. For a detailed specification or if you need to meet a specific requirement please contact Xyratex.
UK HQ
Langstone Road
Havant
Hampshire PO9 1SA
United Kingdom
T +44(0)23 9249 6000
F +44(0)23 9249 2284
www.xyratex.com
ISO 14001: 2004 Cert No. EMS91560