Multi-Box RAID with 3rd Party Transfer and Parity Calculation in Targets using RDMA
Erez Zilber, Yitzhak Birk
Technion
TPT-RAID
Agenda
- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance
Motivation
- Storage devices are becoming cheaper.
- However, highly available single-box storage systems are still expensive.
- Even such systems are susceptible to failures that affect the entire box.
[Figure: single-box storage system — a RAID controller and its disks in one box.]
Multi-Box RAID Systems
- A single, fault-tolerant controller connected to multiple storage boxes (targets).
- Any given parity group contains at most one disk drive from any given box.
- The controller and the disks reside in separate machines.
  - iSCSI may be used to send SCSI commands and data.
[Figure: multi-box storage system — hosts, RAID controller, and targets connected by a switch.]
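The placement rule above — at most one drive of any parity group per box — is what removes the single point of storage-box failure: losing an entire box costs each parity group at most one member, which RAID-5 can rebuild. A minimal sketch of such a layout and the failure check (function names and the round-robin scheme are illustrative, not from the talk):

```python
def layout_parity_groups(num_boxes, drives_per_box, group_size):
    """Assign drives to parity groups so that no group uses two drives
    from the same box: each group takes one drive from each of
    `group_size` distinct boxes."""
    assert group_size <= num_boxes, "a group may take at most one drive per box"
    groups = []
    for row in range(drives_per_box):
        for start in range(0, num_boxes - group_size + 1, group_size):
            # each member (box, drive) comes from a distinct box
            groups.append([(start + i, row) for i in range(group_size)])
    return groups

def survives_box_failure(groups, failed_box):
    """RAID-5 tolerates one lost member per group: true iff every group
    loses at most one drive when an entire box fails."""
    return all(sum(1 for box, _ in g if box == failed_box) <= 1 for g in groups)
```

With 5 boxes of 2 drives each and 5-wide groups, every single-box failure is survivable, whereas packing a group into one box would not be.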
Multi-Box RAID Systems (cont.)
- Advantages:
  - There is no single point of storage-box failure.
  - Highly available, expensive storage boxes are no longer needed.
- Disadvantages:
  - Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system.
  - Merely using storage protocols (e.g., iSCSI) over conventional network infrastructure is not enough.
  - Bottleneck in the controller ⇒ poor scalability.
InfiniBand
- InfiniBand defines a high-speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end-to-end).
- InfiniBand supports RDMA (Remote DMA):
  - High speed
  - Low latency
  - Very lean, with no CPU involvement
iSCSI Extensions for RDMA (iSER)
- iSER is an IETF standard.
- Maps iSCSI onto a network that provides RDMA services.
- Data is transferred directly into SCSI I/O buffers without intermediate data copies.
- Splits control and data:
  - RDMA is used for data transfer.
  - The sending of control messages is left unchanged.
  - The same physical path may be used for both.
iSCSI over iSER: Read Requests

[Figure: message sequence charts comparing a Read request in plain iSCSI (TCP packets: Command Request → Queue Command → a series of Data-in PDUs ending with Final Data-in → SCSI Response with Status and Sense) against iSCSI over iSER (Send Control delivers the command; the target transfers the data with RDMA Writes directly into the initiator's buffers; a final Send Control returns Status and Sense).]
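The Read exchange above can be modeled in a few lines: the initiator sends the command and advertises a pre-registered buffer, and the target deposits the data there itself, with no Data-in PDUs and no intermediate copies. A toy sketch (the classes and method names are illustrative, not a real iSER API):

```python
# Toy model of the iSCSI-over-iSER Read exchange.
class Initiator:
    def __init__(self, length):
        self.buffer = bytearray(length)  # pre-registered SCSI I/O buffer
        self.status = None

    def read(self, target, lba, length):
        # Send Control: the command PDU, plus the advertised RDMA buffer
        target.queue_command(self, lba, length)
        return self.status, bytes(self.buffer)

class Target:
    def __init__(self, media):
        self.media = media

    def queue_command(self, initiator, lba, length):
        # RDMA Write: data lands directly in the initiator's buffer
        initiator.buffer[:length] = self.media[lba:lba + length]
        # Send Control: SCSI Response (Status and Sense)
        initiator.status = "GOOD"
```

The point of the model is the absence of any per-packet data path on the initiator side: the only messages are the two Send Controls, with the payload moved entirely by the target's RDMA engine.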
iSCSI over iSER: Write Requests

[Figure: message sequence charts comparing a Write request in plain iSCSI (Command Request with unsolicited Data-Out PDUs, then Ready to Transmit (R2T) exchanges answered by solicited Data-Out PDUs ending with Final Data-Out, then a SCSI Response) against iSCSI over iSER (Send Control delivers the command and the unsolicited data; the target pulls the solicited data with RDMA Reads; a final Send Control returns Status and Sense).]
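The Write exchange differs from the Read: instead of answering each R2T with Data-Out PDUs, the target pulls the solicited data itself, issuing one RDMA Read where it would have sent an R2T. A toy sketch of that accounting (names are illustrative, not a real iSER API):

```python
# Toy model of the iSCSI-over-iSER Write exchange: the first burst of
# data is unsolicited and rides with the command; the target fetches
# the rest with RDMA Reads instead of R2T/Data-Out round trips.
class WriteTarget:
    def __init__(self, capacity):
        self.media = bytearray(capacity)
        self.rdma_reads = 0

    def queue_command(self, host_buffer, lba, unsolicited, chunk=4):
        n = len(unsolicited)
        self.media[lba:lba + n] = unsolicited  # unsolicited Data-Out
        # solicited data: one RDMA Read per R2T the target would have sent
        for off in range(n, len(host_buffer), chunk):
            self.rdma_reads += 1
            self.media[lba + off:lba + off + chunk] = host_buffer[off:off + chunk]
        return "GOOD"  # Send Control: Status and Sense
```

Each R2T round trip of plain iSCSI collapses into a single target-initiated RDMA Read, which is what removes the initiator's CPU from the solicited-data path.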
iSER + Multi-Box RAID
- iSCSI over iSER solves the problem of inefficient data transfer.
- The separation of control and data is really a protocol separation over the same path.
- The scalability problem remains:
  - All data passes through the controller.
  - When using RAID-4/5, the controller has to perform the parity calculations.
Removing the Controller from the Data Path – 3rd Party Transfer
- 3rd Party Transfer: one iSCSI entity instructs a 2nd iSCSI entity to read or write data from or to a 3rd iSCSI entity.
- Data is transferred directly between hosts and targets under controller command:
  - Lower zero-load latency, especially for large requests: a single hop instead of two.
  - The controller's memory and InfiniBand link do not become a bottleneck.
- Out-of-band controllers already exist, but:
  - RDMA makes out-of-band data transfers more transparent.
  - We carry the idea into the RAID.
RDMA and Out-of-Band Controller
- 3rd Party Transfer is more transparent when combined with RDMA:
  - Transparent from the host's point of view.
  - Almost transparent from the target's point of view.
- Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.
Distributed Parity Calculation
- The controller is not in the data path ⇒ it cannot perform the parity calculations itself; the targets compute parity among themselves instead.
- Side benefit: this relieves another possible controller bottleneck.
Distributed Parity Calculation – a Binary Tree
[Figure: binary XOR tree — data blocks 0 through N-2 plus the parity block are XORed in pairs; the temporary results are XORed level by level until a single new parity block remains.]
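The tree above is a pairwise XOR reduction: all XORs on one level are independent, so different targets can compute them in parallel, and only the number of levels (⌈log₂ N⌉ for N blocks) is sequential. A sketch, with blocks as `bytes` objects (the talk does not prescribe an implementation):

```python
def xor_blocks(a, b):
    """XOR two equal-length blocks (what one target computes per step)."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_tree(blocks):
    """Reduce a list of blocks to a single parity block with a binary
    XOR tree. Each level's XORs are independent, so they can run on
    different targets in parallel."""
    while len(blocks) > 1:
        nxt = [xor_blocks(blocks[i], blocks[i + 1])
               for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:      # an odd block passes up to the next level
            nxt.append(blocks[-1])
        blocks = nxt
    return blocks[0]
```

Because XOR is associative and commutative, the tree produces the same parity block as a sequential left-to-right XOR, just in fewer dependent steps.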
Example: 3rd Party Transfer and Distributed Parity Calculation
- The host sends a command to the RAID controller.
- The RAID controller sends commands to the targets.
- The targets perform RDMA operations to or from the host.
- The RAID controller sends commands to recalculate the parity block (only for WRITE requests).
- The targets calculate the new parity block.

[Figure: a host, the RAID controller, and targets 0–4 — CMD arrows run from the host to the controller and from the controller to the targets; RDMA transfers run directly between the targets and the host; the parity calculation is carried out among the targets.]
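The WRITE flow above can be sketched as control logic: the controller issues commands but never touches the data, which the targets pull straight from the host by RDMA. All class and method names here are hypothetical, invented for illustration — the talk defines no such API:

```python
# Control-path sketch of a full-stripe TPT-RAID WRITE.
class Host:
    def __init__(self, blocks):
        self.blocks = blocks              # new data, one block per target

class Target:
    def __init__(self):
        self.stored = {}

    def rdma_read_from_host(self, host, block_id):
        # 3rd Party Transfer: the target pulls the block directly from
        # the host's memory; the controller never sees the data.
        self.stored[block_id] = host.blocks[block_id]

def tpt_write(targets, host, stripe):
    """Return the number of commands the controller issues for one
    full-stripe WRITE: one fetch command per data target, plus one
    command per XOR step of the distributed parity tree."""
    commands = 0
    for target, block_id in zip(targets, stripe):
        target.rdma_read_from_host(host, block_id)
        commands += 1
    commands += len(targets) - 1          # parity recalculation commands
    return commands
```

The controller's cost per request is a handful of small command messages, while every data byte crosses the network exactly once, host to target — which is the source of the scalability results later in the talk.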
Mirroring with 3rd Party Transfer
- READ commands: similar to RAID-5.
- WRITE commands may be executed in one of two ways:
  - All targets read the new data directly from the host.
  - A single target reads the new data directly from the host and transfers it to the other targets.
- Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.
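The two WRITE strategies trade load on the host's link against target-to-target traffic; a toy count of full-copy transfers per link type makes the trade-off explicit (a hypothetical model, not measurements from the talk):

```python
def fanout_transfers(num_mirrors):
    """Way 1: every mirror target RDMA-reads the new data from the
    host, so the host's link carries one copy per mirror."""
    return {"host_link": num_mirrors, "target_links": 0}

def chain_transfers(num_mirrors):
    """Way 2: one target reads from the host and forwards the data to
    the other mirrors, shifting traffic onto target-to-target links."""
    return {"host_link": 1, "target_links": num_mirrors - 1}
```

Either way, the controller's link carries no data at all — only the commands.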
Required Protocol Changes
- Host:
  - Minor change: the host must accept InfiniBand connection requests.
- RAID controller and targets:
  - SCSI:
    - Additional commands.
    - No SCSI hardware changes are required.
  - iSCSI:
    - Small changes in the login/logout process.
    - An extra field added to the iSCSI Command PDU.
  - iSER:
    - Added and modified iSER primitives.
Test Setup
- Hardware:
  - Nodes (all types): Intel dual-Xeon 3.2 GHz
  - Memory disks
  - Mellanox MHEA28-1T (10 Gb/s) InfiniBand HCAs
  - Mellanox MTS2400 InfiniBand switch
- Software:
  - SuSE Linux 9.1 Professional (2.6.4-52 kernel)
  - Voltaire InfiniBand host stack
  - Voltaire iSER initiator and target
System Configurations
- Baseline system: host, in-band RAID controller, 5 targets.
- TPT-RAID system: host, TPT RAID controller, 5 TPT targets.
- Both systems use iSER (RDMA) over InfiniBand.

[Figure: host, RAID controller, and targets connected by an InfiniBand switch.]
Scalability
- TPT-RAID adds (almost) no work relative to the Baseline:
  - No extra disk (media) operations.
  - No extra XOR operations.
- Added communication to the targets:
  - More commands.
  - More data transfers.
- The extra communication is divided among all the targets.
Controller Scalability – RAID-5 (WRITE)
- Unlimited number of hosts; unlimited number of targets.
- InfiniBand BW is not a limiting factor (multiple hosts and targets).

Req. size | Block size | Max. hosts (Baseline) | Max. hosts (TPT)
1 MB      | 32 KB      | 1 (75%)               | 1
1 MB      | 64 KB      | 1 (72%)               | 2
8 MB      | 32 KB      | 1 (78%)               | 2
8 MB      | 64 KB      | 1 (78%)               | 4
Max. Thpt. with One Host – RAID-5 (WRITE)
- Even when only a single host is used, the Baseline controller is the bottleneck!
Controller Scalability – Mirroring (WRITE)
- Unlimited number of hosts; unlimited number of targets.

Req. size (block = 32 KB) | Max. hosts (Baseline) | Max. hosts (TPT)
256 KB                    | 1 (50%)               | 9
512 KB                    | 1 (50%)               | 18
1 MB                      | 1 (50%)               | 36
8 MB                      | 1 (50%)               | 293
Max. Thpt. with One Host - Mirroring (WRITE)
- Even when a single host is used, the Baseline controller is a bottleneck.
- For TPT, the bottleneck is the host or the targets.
Summary
- Multi-box RAID: improved availability and low cost.
- Using a single controller retains simplicity.
- The single-box DMA engine is replaced by RDMA.
- Adding 3rd Party Transfer and distributed parity calculation allows scalability:
  - The controller can manage a larger system with more activity.
  - For a given workload: higher max. throughput ⇒ shorter waiting times ⇒ lower latency.
- Cost reduction is taken another step while retaining performance and simplicity.