
Towards a Common Communication Infrastructure for Clusters and Grids

Darius Buntinas

Argonne National Laboratory

2

Overview

Cluster Computing vs. Distributed Grids

InfiniBand
– IB for WAN

IP and Ethernet
– Improving performance

Other LAN/WAN Options

Summary

3

Cluster Computing vs. Distributed Grids

Typical clusters

– Homogeneous architecture
– Dedicated environments

Compatibility is not a concern
– Clusters can use high-speed LAN networks
  • E.g., VIA, Quadrics, Myrinet, InfiniBand
– And specific hardware accelerators
  • E.g., Protocol offload, RDMA

4

Cluster Computing vs. Distributed Grids cont'd

Distributed environments

– Heterogeneous architecture

– Communication over WAN

– Multiple administrative domains

Compatibility is critical
– Most WAN stacks are IP/Ethernet
– Popular grid communication protocols
  • TCP/IP/Ethernet
  • UDP/IP/Ethernet

But what about performance?

– TCP/IP/Ethernet latency: 10s of µs

– InfiniBand latency: 1s of µs

How do you maintain high intra-cluster performance while enabling inter-cluster communication?

5

Solutions

Use one network for LAN and another for WAN

– You need to manage two networks

– Your communication library needs to be multi-network capable
  • May have an impact on performance or resource utilization

Maybe a better solution: A common network subsystem

– One network for both LAN and WAN

– Two popular network families
  • InfiniBand
  • Ethernet

6

InfiniBand

Initially introduced as a LAN interconnect
– Now expanding onto the WAN

Issues with using IB on the WAN

– IB copper cables have limited lengths

– IB uses end-to-end credit-based flow control

7

Cable Lengths

IB copper cabling

– Signal integrity decreases with length and data rate

– IB 4x QDR (32 Gbps) max cable length is < 1 m

Solution: optical cabling for IB
E.g., Intel Connects Cables
– Optical cables
– Electrical-to-optical converters at ends
  • ~50 ps conversion delay

– Plug into existing copper-based adapters

8

End-to-End Flow Control

IB uses end-to-end credit-based flow control

– One credit corresponds to one buffer unit at receiver

– Sender can send one unit of data per credit

– Long one-way latencies limit achievable throughput (see the sketch below)
  • WAN latencies are on the order of milliseconds

Solution: Hop-by-hop flow control

– E.g., Obsidian Networks Longbow switches

– Switches have internal buffering

– Link-level flow control is performed between node and switch
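To make the credit limit concrete, a minimal sketch in C of the throughput ceiling it imposes: with a fixed number of credits outstanding, at most credits × buffer-size bytes can be in flight per round trip. The credit count, buffer size, and RTT values below are illustrative assumptions, not InfiniBand defaults.

/* Illustrative only: throughput ceiling imposed by end-to-end
 * credit-based flow control.  Credit count, buffer size, and RTTs
 * are assumed values, not InfiniBand defaults. */
#include <stdio.h>

int main(void)
{
    const double credits     = 64;     /* receiver buffer units advertised */
    const double buffer_size = 4096;   /* bytes per credit */
    const double rtt_us[]    = { 10, 100, 1000, 10000, 20000 }; /* LAN .. WAN */

    for (int i = 0; i < 5; i++) {
        /* At most credits * buffer_size bytes can be in flight per RTT. */
        double bytes_per_sec = credits * buffer_size / (rtt_us[i] * 1e-6);
        printf("RTT %8.0f us -> max throughput %8.2f MB/s\n",
               rtt_us[i], bytes_per_sec / 1e6);
    }
    return 0;
}

At a 10 ms WAN round trip, 64 credits of 4 KB cap throughput at roughly 26 MB/s regardless of link rate, which is why buffering and hop-by-hop flow control in the WAN switches help.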

9

Effect of Delay on Bandwidth

Distance (km)   Delay (µs)
     1               5
     2              10
    20             100
   200            1000
  2000           10000

Source: S. Narravula et al., "Performance of HPC Middleware over InfiniBand WAN," Ohio State University Technical Report OSU-CISRC-12/07-TR77, 2007.
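The delays in the table correspond to roughly 5 µs of propagation delay per km of fiber. A short C sketch (assuming that 5 µs/km figure and an illustrative 10 Gb/s link rate) that also computes the bandwidth-delay product, i.e., how many bytes must be in flight to keep the link busy:

/* Sketch: propagation delay and bandwidth-delay product for the
 * distances in the table above.  Assumes ~5 us/km in fiber and an
 * illustrative 10 Gb/s link rate. */
#include <stdio.h>

int main(void)
{
    const double km[]      = { 1, 2, 20, 200, 2000 };
    const double us_per_km = 5.0;    /* propagation delay in fiber */
    const double link_gbps = 10.0;   /* assumed link rate */

    for (int i = 0; i < 5; i++) {
        double one_way_us = km[i] * us_per_km;
        /* Bytes that must be in flight to cover one round trip. */
        double bdp_bytes = link_gbps * 1e9 / 8.0 * (2.0 * one_way_us * 1e-6);
        printf("%6.0f km: delay %7.0f us, BDP %10.0f bytes\n",
               km[i], one_way_us, bdp_bytes);
    }
    return 0;
}

Unless the amount of data kept in flight grows with distance, achievable bandwidth drops accordingly, which is the effect of delay on bandwidth this slide refers to.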

10

IP and Ethernet

Traditionally

– IP/Ethernet is used for WAN

– and as a low-cost alternative on the LAN
– Software-based TCP/IP stack implementations
  • Software overhead limits performance

Performance limitations

– Small 1500-byte maximum transmission unit (MTU)

– TCP/IP software stack overhead

11

Increasing Maximum Transfer Unit

Ethernet standard specifies 1500-byte MTU

– Each packet requires hardware and software processing

– This per-packet overhead is considerable at gigabit speeds (see the calculation after this slide)

MTU can be increased

– 9K Jumbo frames

– Reduce per-byte processing overhead

But jumbo frames are not compatible across the WAN
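The calculation referenced above, as a small C sketch: the per-packet rate a host must sustain at an assumed 10 Gb/s line rate, for the standard 1500-byte MTU versus 9K jumbo frames (header overhead ignored for simplicity):

/* Back-of-the-envelope packet rates at a given line rate for the
 * standard 1500-byte MTU versus 9000-byte jumbo frames.
 * Header overhead is ignored for simplicity. */
#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 10e9;   /* assumed 10 Gb/s Ethernet */
    const double mtus[] = { 1500, 9000 };

    for (int i = 0; i < 2; i++) {
        double pkts_per_sec = line_rate_bps / 8.0 / mtus[i];
        printf("MTU %5.0f bytes: ~%.0f packets/s to process\n",
               mtus[i], pkts_per_sec);
    }
    return 0;
}

At 10 Gb/s the 1500-byte MTU works out to roughly 830,000 packets per second, versus about 140,000 with 9K frames, a 6x reduction in per-packet processing.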

12

Large Segment Offload Engine on NIC

a.k.a. Virtual MTU

Introduced by Intel and Broadcom

Allows the TCP/IP software stack to use 9K or 16K MTUs
– Reducing software overhead

Fragmentation performed by the NIC (see the sketch after this slide)

Standard 1500-byte MTU on the wire
– Compatible with upstream switches and routers
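Conceptually, the host stack hands the NIC one large (virtual-MTU) segment and the NIC emits standard 1500-byte frames on the wire. A minimal sketch of that split; the segment and frame sizes below are illustrative.

/* Conceptual sketch of large segment offload: the host stack hands the
 * NIC one large segment; the NIC emits standard 1500-byte frames on
 * the wire.  Sizes are illustrative. */
#include <stdio.h>

#define WIRE_MTU   1500   /* what upstream switches and routers see */
#define HOST_SEG  16384   /* virtual MTU used by the host stack */

int main(void)
{
    int offset = 0, frames = 0;
    while (offset < HOST_SEG) {
        int chunk = HOST_SEG - offset;
        if (chunk > WIRE_MTU)
            chunk = WIRE_MTU;
        /* In hardware, a 1500-byte wire frame would be built here. */
        offset += chunk;
        frames++;
    }
    printf("1 host segment of %d bytes -> %d wire frames\n",
           HOST_SEG, frames);
    return 0;
}

The host pays per-segment software overhead once, while the wire still carries frames that every upstream switch and router understands.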

13

Offload Protocol Processing to NIC

Handling packets at gigabit speeds requires considerable processing

– Even with large MTU

– Uses CPU time that would otherwise be used by the application

Protocol Offload Engines (POE)
– Perform communication processing on the NIC
– E.g., Myrinet, Quadrics, IB

TCP Offload Engines (TOE) are a specific kind of POE
– E.g., Chelsio, NetEffect

14

TOE vs Non-TOE: Latency

Source: P. Balaji, W. Feng, and D. K. Panda, "Bridging the Ethernet-Ethernot Performance Gap," IEEE Micro, Special Issue on High-Performance Interconnects, no. 3, pp. 24-40, May/June 2006.

15

TOE vs Non-TOE: Bandwidth and CPU Utilization

16

TOE vs Non-TOE: Bandwidth and CPU Utilization (9K MTU)

17

Other LAN/WAN Options

iWARP protocol offload
– Runs over IP
– Has functionality similar to TCP
– Adds RDMA

Myricom
– Myri-10G adapter
– Uses the 10G Ethernet physical layer
– POE
– Can handle both TCP/IP and MX

Mellanox
– ConnectX adapter
– Has multiple ports that can be configured for IB or Ethernet
– POE
– Can handle both TCP/IP and IB

Convergence in the software stack: OpenFabrics
– Supports IB and Ethernet adapters
– Provides a common API to upper layers
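As a taste of that common API, a minimal sketch using the OpenFabrics verbs library (libibverbs) to enumerate whatever RDMA-capable devices are present; the same call works whether the adapter is InfiniBand or an iWARP-capable Ethernet NIC. This is only a device listing, not a full communication example; build with gcc file.c -libverbs.

/* Minimal sketch: list RDMA-capable devices (InfiniBand or Ethernet)
 * through the common OpenFabrics verbs API (libibverbs). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}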

18

Summary

Clusters can take advantage of high-performance LAN NICs

– E.g., InfiniBand

Grids need interoperability
– TCP/IP is ubiquitous

Performance gap

Bridging the gap
– IB over the WAN
– POE for Ethernet

Alternatives
– iWARP, Myricom's Myri-10G, Mellanox ConnectX