
Page 1:

Integrating New Capabilities into NetPIPE

Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes

Scalable Computing Laboratory of Ames Laboratory

This work was funded by the MICS office of the US Department of Energy

Page 2:

[Diagram: NetPIPE (Network Protocol Independent Performance Evaluator) module structure.
2-sided protocols: MPI (MPICH, LAM/MPI, MPI/Pro, MP_Lite), PVM, and TCGMSG (which runs on ARMCI or MPI).
1-sided protocols: MPI-2 1-sided (MPI_Put or MPI_Get), SHMEM and GPSHMEM puts and gets, and ARMCI (over TCP, GM, VIA, Quadrics, LAPI).
Native software layers: TCP (workstations, PCs, clusters), GM (Myrinet cards), InfiniBand (Mellanox VAPI), LAPI (IBM SP), SHMEM (Cray T3E, SGI systems), and internal systems such as memcpy.]
+ Basic send/recv with options to guarantee pre-posting or use MPI_ANY_SOURCE.
+ Option to measure performance without cache effects.
+ One-sided communications using either Get or Put, with or without fence calls.
+ Measure performance or do an integrity test.
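A minimal MPI sketch (not the NetPIPE source) of the difference between the pre-posted and MPI_ANY_SOURCE receive options; the function name and structure are illustrative only.

/* Sketch: pre-posted receive vs. receive from MPI_ANY_SOURCE in a ping-pong. */
#include <mpi.h>

void exchange(char *buf, int nbytes, int rank, int other, int preposted)
{
    MPI_Request req;
    MPI_Status  status;

    if (rank == 0) {
        if (preposted) {
            /* Post the receive before the matching send can arrive. */
            MPI_Irecv(buf, nbytes, MPI_BYTE, other, 0, MPI_COMM_WORLD, &req);
            MPI_Send (buf, nbytes, MPI_BYTE, other, 0, MPI_COMM_WORLD);
            MPI_Wait (&req, &status);
        } else {
            /* Receive from any source: the library must match the message on arrival. */
            MPI_Send(buf, nbytes, MPI_BYTE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
        }
    } else {
        /* The partner simply bounces the message back. */
        MPI_Recv(buf, nbytes, MPI_BYTE, other, 0, MPI_COMM_WORLD, &status);
        MPI_Send(buf, nbytes, MPI_BYTE, other, 0, MPI_COMM_WORLD);
    }
}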

http://www.scl.ameslab.gov/Projects/NetPIPE/

Page 3:

The NetPIPE utility

NetPIPE does a series of ping-pong tests between two nodes.

Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.

Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
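A minimal sketch of the core ping-pong timing loop, assuming MPI as the transport; NetPIPE's own modules add repeated trials and the perturbed message sizes described above.

/* Sketch of the core ping-pong timing loop (MPI assumed as transport). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, trials = 100, nbytes = 1024;
    char *buf;
    double t0, t1, rtt, latency, mbps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < trials; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    rtt     = (t1 - t0) / trials;                  /* one ping-pong round trip        */
    latency = rtt / 2.0;                           /* half the round trip, as above   */
    mbps    = 8.0 * nbytes / (rtt / 2.0) / 1.0e6;  /* one-way throughput in Mbps      */

    if (rank == 0)
        printf("%d Bytes: %g us latency, %g Mbps\n", nbytes, latency * 1e6, mbps);

    free(buf);
    MPI_Finalize();
    return 0;
}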

Some typical uses
• Measuring the overhead of message-passing protocols.
• Helping to tune the optimization parameters of message-passing libraries.
• Optimizing driver and OS parameters (socket buffer sizes, etc.; see the sketch below).
• Identifying dropouts in networking hardware and drivers.
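A minimal sketch of the socket-buffer tuning mentioned above, using the standard setsockopt() call; the helper name and the size passed in are illustrative, not recommended values.

/* Sketch: enlarging TCP socket buffers, the kind of OS parameter a NetPIPE
 * sweep helps tune.  The requested size is a placeholder. */
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

int set_socket_buffers(int sock, int nbytes)
{
    /* Request larger send and receive buffers on this socket. */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &nbytes, sizeof(nbytes)) < 0 ||
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &nbytes, sizeof(nbytes)) < 0) {
        perror("setsockopt");
        return -1;
    }

    /* Read back what the kernel actually granted (it may cap or adjust the request). */
    socklen_t len = sizeof(nbytes);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &nbytes, &len);
    printf("effective send buffer: %d Bytes\n", nbytes);
    return 0;
}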

What is not measured
• NetPIPE cannot yet measure the load on the CPU.
• The effects of the different methods for maintaining message progress.
• Scalability with system size.

Page 4:

Recent additions to NetPIPE
• Can do an integrity test instead of measuring performance.
• Streaming mode measures performance in one direction only.
  – Sockets must be reset to avoid effects from a collapsing window size.
• A bi-directional ping-pong mode has been added (-2).
• One-sided Get and Put calls can be measured (MPI or SHMEM).
  – Can choose whether to use an intervening MPI_Win_fence call to synchronize.
• Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
  – There are many cache effects in SMP message-passing.
  – InfiniBand can show similar effects since memory must be registered with the card.

[Diagram: Processes 0 and 1 cycling through send/receive buffers 0, 1, 2, 3 for the no-cache mode.]
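A minimal sketch, not the NetPIPE implementation, of how the no-cache mode can walk through a large memory pool so each trial touches cold memory; the pool size and names are illustrative.

/* Sketch: advancing through a large memory pool so that each trial starts
 * from cold memory (the "no cache effects" mode).  Sizes are illustrative. */
#include <stdlib.h>

#define POOL_BYTES (64 * 1024 * 1024)   /* pool much larger than any cache */

static char  *pool;
static size_t offset;

void init_pool(void)
{
    pool   = malloc(POOL_BYTES);
    offset = 0;
}

char *next_buffer(size_t nbytes)
{
    char *buf;

    if (offset + nbytes > POOL_BYTES)   /* wrap around when the pool is exhausted */
        offset = 0;

    buf = pool + offset;                /* cold buffer for the next send/recv     */
    offset += nbytes;
    return buf;
}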

Page 5:

Current projects
• Overlapping pair-wise ping-pong tests.
  – Synchronization must be considered if bi-directional communications are not used.
• Investigate other methods for testing the global network.
  – Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all.

[Diagram: nodes n0-n3 communicating pairwise and through an Ethernet switch, contrasting line-speed-limited with end-point-limited performance.]
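One possible shape for an overlapping pair-wise test, sketched with MPI_Sendrecv and an assumed even rank count; this illustrates the idea rather than the design NetPIPE will adopt.

/* Sketch: overlapping pair-wise exchanges.  Each rank pairs with its
 * neighbour (0<->1, 2<->3, ...; an even rank count is assumed) and all
 * pairs communicate at once, loading the switch backplane rather than a
 * single pair of endpoints. */
#include <mpi.h>

double pairwise_exchange(char *sbuf, char *rbuf, int nbytes, int trials)
{
    int rank, partner, i;
    double t0, t1;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = rank ^ 1;                 /* nearest-neighbour pairing */

    MPI_Barrier(MPI_COMM_WORLD);        /* start all pairs together  */
    t0 = MPI_Wtime();
    for (i = 0; i < trials; i++)
        MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                     rbuf, nbytes, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();

    /* Per-pair bi-directional throughput in Mbps. */
    return 2.0 * 8.0 * (double)nbytes * trials / (t1 - t0) / 1.0e6;
}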

Page 6:

Performance on Mellanox InfiniBand cards

A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv).

Burst mode preposts all receives to duplicate the Mellanox test.

The no-cache performance is much lower when the memory has to be registered with the card.
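The burst mode itself preposts work requests at the VAPI level; the same idea expressed at the MPI level might look like the following sketch (function name and tags are illustrative).

/* Sketch of the burst-mode idea at the MPI level: every receive is posted
 * before the sender starts, so no message ever waits for a matching receive.
 * buf must hold nmsgs * nbytes Bytes.  (The actual NetPIPE burst mode works
 * at the VAPI verbs level, not through MPI.) */
#include <mpi.h>
#include <stdlib.h>

void burst_receive(char *buf, int nbytes, int nmsgs, int sender)
{
    MPI_Request *reqs = malloc(nmsgs * sizeof(MPI_Request));
    int i;

    for (i = 0; i < nmsgs; i++)                 /* prepost every receive      */
        MPI_Irecv(buf + (size_t)i * nbytes, nbytes, MPI_BYTE,
                  sender, i, MPI_COMM_WORLD, &reqs[i]);

    MPI_Barrier(MPI_COMM_WORLD);                /* signal the sender to start */
    MPI_Waitall(nmsgs, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}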

An MP_Lite InfiniBand module will be incorporated into LAM/MPI.

[Chart: throughput in Mbps vs. message size in Bytes (100 to 1,000,000) on Mellanox InfiniBand cards, comparing MVAPICH 0.9.1 (7.5 us latency), MVAPICH without cache effects, IB VAPI Send/Recv, and IB VAPI burst mode.]

Page 7:

10 Gigabit Ethernet

Intel 10 Gigabit Ethernet cards

133 MHz PCI-X bus

Single mode fiber

Intel ixgb driver

Can only achieve 2 Gbps now.

Latency is 75 us.

Streaming mode delivers up to 3 Gbps.

Much more development work is needed.
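A sketch of the streaming idea referred to above: messages flow in one direction and only a single acknowledgement closes the burst. This MPI version is an analogue of the socket-level module, which also resets its connections between bursts.

/* Sketch of a streaming (uni-directional) measurement: the sender pushes a
 * burst of messages one way and waits for a single acknowledgement at the
 * end, so the reported rate is not limited by round trips. */
#include <mpi.h>

double stream_send(char *buf, int nbytes, int nmsgs, int rank)
{
    double t0, t1;
    char ack = 0;
    int i;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    if (rank == 0) {
        for (i = 0; i < nmsgs; i++)
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&ack, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        for (i = 0; i < nmsgs; i++)
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&ack, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    }
    t1 = MPI_Wtime();

    return 8.0 * (double)nbytes * nmsgs / (t1 - t0) / 1.0e6;   /* Mbps */
}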

[Chart: throughput in Mbps vs. message size in Bytes for Intel 10 Gigabit Ethernet, comparing ping-pong mode (75 us latency) with streaming mode, which peaks near 3000 Mbps.]

Page 8:

Channel-bonding Gigabit Ethernet for better communications between nodes

Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster.

GigE cards cost ~$40 each.

24-port switches cost ~$1400.

Roughly $100 per computer.

This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.

[Diagram: channel bonding in a cluster. Each PC (CPU, cache, memory) has two GigE NICs on its PCI bus, and both NICs of every node connect to the network switch.]

Page 9:

Performance for channel-bonded Gigabit Ethernet

Channel-bonding multiple GigE cards using MP_Lite and Linux kernel bonding

GigE can deliver 900 Mbps with latencies of 25-62 us for PCs with 64-bit / 66 MHz PCI slots.

Channel-bonding 2 GigE cards / PC using MP_Lite doubles the performance for large messages.

Adding a 3rd card does not help much.

Channel-bonding 2 GigE cards / PC using Linux kernel level bonding actually results in poorer performance.

The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better.

Any message-passing system could then make use of channel-bonding on Linux systems.

[Chart: throughput in Mbps vs. message size in Bytes for channel-bonded Gigabit Ethernet, comparing 1 GigE card, MP_Lite with 2 GigE cards, MP_Lite with 3 GigE cards, and Linux kernel bonding with 2 GigE cards.]

Page 10:

Channel-bonding in MP_Lite

[Diagram: MP_Lite channel-bonding data path on node 0. The application's message is split into parts a and b in user space; each part goes into its own large socket buffer, then through its own TCP/IP stack (dev_q_xmit), device queue, and DMA engine to a separate GigE card.]

Flow control may stop a given stream at several places.

With MP_Lite channel-bonding, each stream is independent of the others.
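A minimal sketch of the striping idea, assuming two already-connected TCP sockets (sock0, sock1), one per GigE card. Real MP_Lite drives the streams asynchronously so that a stall on one card does not block the other; this blocking version only illustrates the striping itself.

/* Sketch: stripe each message across two TCP sockets, one per GigE card. */
#include <unistd.h>
#include <sys/types.h>

/* Write exactly nbytes, retrying on short writes. */
static int write_full(int sock, const char *buf, size_t nbytes)
{
    while (nbytes > 0) {
        ssize_t n = write(sock, buf, nbytes);
        if (n <= 0)
            return -1;
        buf    += n;
        nbytes -= (size_t)n;
    }
    return 0;
}

/* Send half of the message down each socket. */
int striped_send(int sock0, int sock1, const char *buf, size_t nbytes)
{
    size_t half = nbytes / 2;

    if (write_full(sock0, buf, half) < 0)                  /* first half  -> card 0 */
        return -1;
    return write_full(sock1, buf + half, nbytes - half);   /* second half -> card 1 */
}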

Page 11:

Linux kernel channel-bonding

[Diagram: Linux kernel channel-bonding data path on node 0. The application writes into a single large socket buffer; one TCP/IP stack (dev_q_xmit) feeds bonding.c, which distributes packets to the two device queues, DMA engines, and GigE cards.]

A full device queue will stop the flow at bonding.c to both device queues.

Flow control on the destination node may stop the flow out of the socket buffer.

In both of these cases, problems with one stream can affect both streams.


Page 12:

Comparison of high-speed interconnects

InfiniBand can deliver 4500 - 6500 Mbps at a 7.5 us latency.

Atoll delivers 1890 Mbps with a 4.7 us latency.

SCI delivers 1840 Mbps with only a 4.2 us latency.

Myrinet performance reaches 1820 Mbps with an 8 us latency.

Channel-bonded GigE offers 1800 Mbps for very large messages.

Gigabit Ethernet delivers 900 Mbps with a 25-62 us latency.

10 GigE only delivers 2 Gbps with a 75 us latency.

[Chart: throughput in Mbps vs. message size in Bytes comparing InfiniBand RDMA (7.5 us latency), InfiniBand without cache effects, Atoll (4.7 us), SCI (4.2 us), Myrinet (8 us), channel-bonded 2xGigE (62 us), and GigE (62 us).]

Page 13:

Conclusions
• NetPIPE provides a consistent set of analytical tools in the same flexible framework to many message-passing and native communication layers.
• New modules have been developed.
  – 1-sided MPI and SHMEM
  – GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
  – Internal tests like memcpy
• New modes have been incorporated into NetPIPE.
  – Streaming and bi-directional modes.
  – Testing without cache effects.
  – The ability to test integrity instead of performance.

Page 14:

Current projects
• Developing new modules.
  – ATOLL
  – IBM Blue Gene/L
  – I/O performance
• Need to be able to measure CPU load during communications.
• Expanding NetPIPE to do multiple pair-wise communications.
  – Can measure the backplane performance on switches.
  – Compare the line speed to end-point limited performance.
• Working toward measuring more of the global properties of a network.
  – The network topology will need to be considered.

Page 15:

Contact information

Dave Turner - [email protected]

http://www.scl.ameslab.gov/Projects/MP_Lite/

http://www.scl.ameslab.gov/Projects/NetPIPE/

Page 16:

One-sided Puts between two Linux PCs

[Chart: throughput in Mbps (0-700) vs. message size in Bytes (1 to 1,000,000) for one-sided Puts between two Linux PCs, comparing MP_Lite raw TCP, ARMCI, and LAM/MPI.]

MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.

LAM/MPI has no message progress, so a fence is required.

ARMCI uses a polling method, and therefore does not require a fence.

MPI-2 implementations of MPICH and MPI/Pro are under development.
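A minimal MPI-2 sketch of the pattern being timed: an MPI_Put bracketed by MPI_Win_fence calls. The fence-free curves above instead rely on the library's own progress (SIGIO in MP_Lite, polling in ARMCI).

/* Sketch of the timed pattern: a one-sided MPI_Put synchronized with
 * MPI_Win_fence. */
#include <mpi.h>

void put_with_fence(char *sbuf, int nbytes, int target, MPI_Win win)
{
    MPI_Win_fence(0, win);                        /* open the access epoch         */
    MPI_Put(sbuf, nbytes, MPI_BYTE,
            target, 0, nbytes, MPI_BYTE, win);    /* write into the target window  */
    MPI_Win_fence(0, win);                        /* complete the transfer         */
}

/* The window would be created once on each process, e.g.:
 *   MPI_Win_create(rbuf, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
 */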

Hardware: Netgear GA620 fiber GigE cards (32/64-bit, 33/66 MHz PCI) with the AceNIC driver.

Page 17:

The MP_Lite message-passing library

• A light-weight MPI implementation
• Highly efficient for the architectures supported
• Designed to be very user-friendly
• Ideal for performing message-passing research

http://www.scl.ameslab.gov/Projects/MP_Lite/

[Diagram: MP_Lite structure. MPI applications restricted to a subset of the MPI commands, or codes written to the MP_Lite syntax, run over MP_Lite, which in turn uses TCP (workstations, PCs), VIA OS-bypass (Giganet hardware, M-VIA Ethernet), SMP shared-memory segments, SHMEM one-sided functions (Cray T3E, SGI Origins), InfiniBand (Mellanox VAPI), or MPI itself (to retain portability for the MP_Lite syntax on mixed systems of distributed SMPs).]

Page 18:

A NetPIPE example: Performance on a Cray T3E

[Chart: throughput in Mbps (0-3000) vs. message size in Bytes (1 to 1,000,000) on a Cray T3E, comparing MP_Lite over raw SHMEM, the old Cray MPI, and the new Cray MPI.]

Raw SHMEM delivers: 2600 Mbps 2-3 us latency

Cray MPI originally delivered: 1300 Mbps 20 us latency

MP_Lite delivers: 2600 Mbps 9-10 us latency

New Cray MPI delivers: 2400 Mbps 20 us latency

The tops of the spikes occur where the message size is divisible by 8 Bytes.
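For reference, a hedged sketch of what a raw SHMEM ping-pong looks like; it assumes SHMEM is already initialized and the buffer is symmetric, and it differs in detail from the actual NetPIPE SHMEM module.

/* Sketch of a raw SHMEM transfer of the kind behind the T3E curve above.
 * Assumes SHMEM is already initialized and buf was allocated symmetrically
 * (e.g. with shmalloc) so it exists at the same address on every PE. */
#include <shmem.h>

void shmem_pingpong(char *buf, size_t nbytes, int mype, int other)
{
    if (mype == 0) {
        shmem_putmem(buf, buf, nbytes, other);  /* push the data into PE 'other' */
        shmem_quiet();                          /* wait for the put to complete  */
    }
    shmem_barrier_all();                        /* partner knows the data arrived */

    if (mype == 1) {
        shmem_putmem(buf, buf, nbytes, other);  /* bounce it back                */
        shmem_quiet();
    }
    shmem_barrier_all();
}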