Multi-Box RAID with 3rd Party Transfer and Parity Calculation in Targets using RDMA
Erez Zilber, Yitzhak Birk (Technion)
TPT-RAID


Page 1

Multi-Box RAID with 3rd Party Transfer and Parity Calculation in Targets using RDMA

Erez Zilber, Yitzhak Birk

Technion

TPT-RAID

Page 2

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Page 3

Motivation

- Storage devices are becoming cheaper.
- However, highly-available single-box storage systems are still expensive.
- Even such systems are susceptible to failures that affect the entire box.

[Figure: a single-box storage system (RAID controller and disks)]

Page 4

Multi-Box RAID Systems

- A single, fault-tolerant controller connected to multiple storage boxes (targets).
- Any given parity group contains at most one disk drive from any given box.
- The controller and the disks reside in separate machines.
  - iSCSI may be used in order to send SCSI commands and data.

[Figure: a multi-box storage system (hosts, switch, RAID controller, targets)]
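The placement rule above can be made concrete with a small sketch (`place_stripe` is a hypothetical layout helper for illustration, not the system's actual allocator). Rotating the starting box per stripe spreads the load RAID-5 style while each parity group still uses each box at most once:

```python
def place_stripe(stripe: int, num_boxes: int) -> list[tuple[int, int]]:
    """Map the blocks of one parity group (stripe) to (box, disk) pairs.

    Rotating the starting box per stripe distributes load across boxes
    while guaranteeing each parity group touches any box at most once,
    so a whole-box failure costs the group at most one block.
    """
    return [((stripe + j) % num_boxes, stripe) for j in range(num_boxes)]

# Every parity group touches each box at most once:
for s in range(10):
    boxes = [box for box, _disk in place_stripe(s, 5)]
    assert len(set(boxes)) == len(boxes)
```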

Page 5

Multi-Box RAID Systems (cont.)

- Advantages:
  - There is no single point of storage-box failure.
  - Highly available, expensive storage boxes are no longer needed.
- Disadvantages:
  - Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system.
  - Merely using storage protocols (e.g. iSCSI) over conventional network infrastructure is not enough.
  - Bottleneck in the controller ⇒ poor scalability.

[Figure: a multi-box storage system (hosts, switch, RAID controller, targets)]

Page 6

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Page 7

InfiniBand

- InfiniBand defines a high-speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end-to-end).
- InfiniBand supports RDMA (Remote DMA):
  - High speed
  - Low latency
  - Very lean, with no CPU involvement

Page 8

iSCSI Extensions for RDMA (iSER)

- iSER is an IETF standard.
- Maps iSCSI over a network that provides RDMA services.
- Data is transferred directly into SCSI I/O buffers without intermediate data copies.
- Splits control and data:
  - RDMA is used for data transfer.
  - Sending of control messages is left unchanged.
  - The same physical path may be used for both.

Page 9

iSCSI over iSER: Read Requests

[Diagram: message ladders comparing a READ over plain iSCSI (TCP packets) with a READ over iSCSI/iSER. In plain iSCSI, the initiator sends a Command Request, the target queues the command and returns the data in a series of Data-in PDUs followed by a SCSI Response (Status and Sense). Over iSER, the target instead pushes the data directly into the initiator's buffers with RDMA Writes (Put Data); only the Command Request and the SCSI Response travel as Send Control messages, delivered via Control Notify.]

Page 10

iSCSI over iSER: Write Requests

[Diagram: message ladders comparing a WRITE over plain iSCSI (TCP packets) with a WRITE over iSCSI/iSER. In plain iSCSI, the initiator sends a Command Request with unsolicited Data-Out PDUs, then solicited Data-Out PDUs in response to the target's Ready-to-Transmit (R2T) messages, ending with a SCSI Response (Status and Sense). Over iSER, each R2T becomes an RDMA Read issued by the target (Get Data), which pulls the data directly from the initiator's buffers; only control messages (Command Request, R2T, SCSI Response) use Sends, delivered via Control Notify.]

Page 11

iSER + Multi-Box RAID

- iSCSI over iSER solves the problem of inefficient data transfer.
- The separation of control and data is really a protocol separation over the same path.
- The scalability problem remains:
  - All data passes through the controller.
  - When using RAID-4/5, the controller has to perform parity calculations.

Page 12

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Page 13

Removing the Controller from the Data Path – 3rd Party Transfer

- 3rd Party Transfer: one iSCSI entity instructs a 2nd iSCSI entity to read or write data to a 3rd iSCSI entity.
- Data is transferred directly between hosts and targets under controller command:
  - Lower zero-load latency, especially for large requests: a single hop instead of two hops.
  - The controller's memory and InfiniBand link do not become a bottleneck.
- Out-of-band controllers already exist, but:
  - RDMA makes out-of-band data transfers more transparent.
  - We carry the idea into the RAID.

Page 14

RDMA and Out-of-Band Controller

- 3rd Party Transfer is more transparent when combined with RDMA:
  - Transparent from a host's point of view.
  - Almost transparent from a target's point of view.
- Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.

Page 15

Distributed Parity Calculation

- The controller is not in the data path ⇒ it cannot perform parity calculations, so the targets must.
- Side benefit: relieves another possible controller bottleneck.

Page 16

Distributed Parity Calculation – a Binary Tree

[Diagram: a binary XOR tree. Data blocks 0 through N-2 and the parity block feed pairwise XOR operations; the temporary results are XORed again, level by level, until a single new parity block remains.]
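The tree can be modeled functionally (a sketch, not the actual target-side implementation; `parity_tree` and `xor_blocks` are illustrative names). Because XOR is associative and commutative, the pairwise tree yields the same parity as a sequential chain, but each level's XORs can run on different targets, so the critical path is logarithmic in the number of blocks:

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_tree(blocks: list[bytes]) -> bytes:
    """Pairwise-reduce blocks as a binary XOR tree.

    Each level halves the number of partial results; spreading the XORs
    at a level across targets gives O(log N) depth instead of the O(N)
    chain a single controller would compute.
    """
    level = blocks
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # odd block is carried up to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# The reduction order does not change the result:
data = [bytes([i] * 8) for i in range(1, 6)]
assert parity_tree(data) == reduce(xor_blocks, data)
```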

Page 17

Example: 3rd Party Transfer and Distributed Parity Calculation

- The host sends a command to the RAID controller.
- The RAID controller sends commands to the targets.
- The targets perform RDMA operations to the host.
- The RAID controller sends commands to recalculate the parity block (only for WRITE requests).
- The targets calculate the new parity block.

[Figure: host, RAID controller and five targets (0-4); CMD arrows from the host to the controller and from the controller to the targets, RDMA arrows directly between the host and the targets, and a parity-calculation step among the targets.]
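For partial-stripe WRITEs, the parity recalculation in the last two steps follows the standard RAID-5 small-write rule (general RAID-5 background, not spelled out on the slide): the new parity is the old parity with the overwritten block's old contents XORed out and its new contents XORed in.

```python
def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """RAID-5 small-write rule, per byte: p' = p ^ d_old ^ d_new.

    XORing out the old data cancels its contribution to the parity;
    XORing in the new data adds the new contribution.
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Incrementally updated parity matches parity rebuilt from scratch:
blocks = [bytes([b] * 4) for b in (3, 5, 9)]
parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))
new_block0 = bytes([7] * 4)
incremental = updated_parity(parity, blocks[0], new_block0)
from_scratch = bytes(a ^ b ^ c for a, b, c in zip(new_block0, blocks[1], blocks[2]))
assert incremental == from_scratch
```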

Page 18

Mirroring with 3rd Party Transfer

- READ commands: similar to RAID-5.
- WRITE commands may be executed in one of the following two ways:
  - All targets read the new data directly from the host.
  - A single target reads the new data directly from the host and transfers it to the other targets.
- Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.
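The trade-off between the two WRITE variants can be quantified with a toy traffic model (an illustrative sketch; `link_bytes` is a hypothetical helper, and the model ignores command overhead): direct fan-out loads the host's link N times, while relaying loads it once and shifts the remaining copies onto target-to-target links.

```python
def link_bytes(n_targets: int, block: int, relay: bool) -> dict[str, int]:
    """Bytes crossing each link class for one mirrored WRITE of `block` bytes.

    relay=False: every target RDMA-reads the block from the host.
    relay=True : one target reads it once, then forwards it to the others.
    """
    if relay:
        return {"host_link": block, "inter_target": (n_targets - 1) * block}
    return {"host_link": n_targets * block, "inter_target": 0}

# With 4 replicas of a 1 MB block, relaying cuts host-link traffic 4x
# at the cost of target-to-target transfers:
direct = link_bytes(4, 1 << 20, relay=False)
relayed = link_bytes(4, 1 << 20, relay=True)
assert direct["host_link"] == 4 * relayed["host_link"]
```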

Page 19

Required Protocol Changes

- Host:
  - Minor change: the host must accept InfiniBand connection requests.
- RAID controller and targets:
  - SCSI:
    - Additional commands.
    - No SCSI hardware changes are required.
  - iSCSI:
    - Small changes in the login/logout process.
    - An extra field added to the iSCSI Command PDU.
  - iSER:
    - Added and modified iSER primitives.

Page 20

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Page 21

Test Setup

- Hardware:
  - Nodes (all types): Intel dual-Xeon 3.2 GHz
  - Memory disks
  - Mellanox MHEA28-1T (10 Gb/s) InfiniBand HCA
  - Mellanox MTS2400 InfiniBand switch
- Software:
  - Linux SuSE 9.1 Professional (2.6.4-52 kernel)
  - Voltaire InfiniBand host stack
  - Voltaire iSER initiator and target

Page 22

System Configurations

- Baseline system:
  - Host
  - In-band RAID controller
  - 5 targets
- TPT-RAID system:
  - Host
  - TPT RAID controller
  - 5 TPT targets
- Both systems use iSER (RDMA) over InfiniBand.

[Figure: host, RAID controller and targets connected through an InfiniBand switch]

Page 23

Scalability

- TPT-RAID (almost) doesn't add work relative to the Baseline:
  - No extra disk (media) operations.
  - No extra XOR operations.
- Added communication to the targets:
  - More commands
  - More data transfers
- The extra communication is divided among all targets.
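A back-of-envelope model makes the controller-side argument concrete (a sketch under assumed numbers: the 64-byte command size is hypothetical). An in-band controller moves every written byte twice, in from the host and out to the targets, while a TPT controller handles only commands:

```python
def controller_bytes(req: int, in_band: bool, cmd: int = 64) -> int:
    """Rough bytes handled by the controller for one WRITE of `req` bytes.

    In-band (Baseline): data enters from the host and leaves to the
    targets, so the controller moves ~2x the request size plus a command.
    TPT: only the command passes through; data flows host<->targets
    directly via 3rd Party Transfer.
    """
    return 2 * req + cmd if in_band else cmd  # hypothetical 64 B command

# For a 1 MB request, the in-band controller moves tens of thousands of
# times more bytes than the TPT controller:
req = 1 << 20
assert controller_bytes(req, in_band=True) > 1000 * controller_bytes(req, in_band=False)
```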

Page 24

Controller Scalability – RAID-5 (WRITE)

- Unlimited number of hosts
- Unlimited number of targets
- InfiniBand BW is not a limiting factor (multiple hosts and targets).

Max. hosts supported:

  Req. size   Block size   Baseline   TPT
  1MB         32KB         1 (75%)    11
  1MB         64KB         1 (72%)    21
  8MB         32KB         1 (78%)    21
  8MB         64KB         1 (78%)    41

Page 25

Max. Thpt. with One Host – RAID-5 (WRITE)

Even when only a single host is used, the Baseline controller is the bottleneck!

Page 26

Controller Scalability – Mirroring (WRITE)

- Unlimited number of hosts
- Unlimited number of targets

Max. hosts supported (block size = 32KB):

  Req. size   Baseline   TPT
  256KB       1 (50%)    9
  512KB       1 (50%)    18
  1MB         1 (50%)    36
  8MB         1 (50%)    293

Page 27

Max. Thpt. with One Host – Mirroring (WRITE)

- Even when a single host is used, the Baseline controller is a bottleneck.
- For TPT, the bottleneck is the host or the targets.

Page 28

Summary

- Multi-box RAID: improved availability and low cost.
- Using a single controller retains simplicity.
- The single-box DMA engine is replaced by RDMA.
- Adding 3rd Party Transfer and distributed parity calculation allows scalability:
  - Can manage a larger system with more activity.
  - For a given workload: larger max. throughput ⇒ shorter waiting times ⇒ lower latency.

Cost reduction is taken another step while retaining performance and simplicity.