Ethernet and TCP Optimizations


DESCRIPTION

With a trivial bit of tuning, you can extract fairly amazing small-message latencies out of TCP. This ain't your father's Ethernet (or TCP).

TRANSCRIPT


Ethernet: Hidden Secrets

Jeff Squyres

First: some background information…


Jeff’s work: Parallel computing at Cisco

Using lots and lots and lots of servers simultaneously to solve one computational problem


Supercomputing applications

Racks of 36 1U servers

Tend to send lots and lots and lots of small messages across the network to stay in sync with each other


Network message traversal

[Diagram: server A sends a message across the underlying network; server B receives the message]

Today's fastest networks: 1-3μs (!)


Today’s fastest networks

• Typically not Ethernet networks

• Usually have supercomputer-specific networks (example: highly tuned for short-message latency)

• …but that is changing

[Diagram: "Ethernet" vs. "Ethernot"]


Cisco’s ultra low latency Ethernet

• Userspace NIC ("USNIC"): expose Cisco NIC hardware directly to Linux userspace

Bypass the OS

Bypass the TCP stack

• Send raw Ethernet frames directly from user applications: much, much faster than traditional TCP-based networking

Especially for latency of short messages
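USNIC itself reaches the hardware through a verbs-style interface (see the architecture diagrams below) rather than through the kernel, so the following is not the USNIC fast path. It is only a rough sketch of what "sending a raw Ethernet frame from a user application" means at the API level, using a standard Linux AF_PACKET socket (which still traverses the kernel); the interface name, MAC addresses, and EtherType here are placeholders.

```c
/* Sketch: send one raw Ethernet frame from userspace with AF_PACKET.
 * Not USNIC (this still goes through the kernel), just an illustration
 * of building and sending a raw frame.  Needs CAP_NET_RAW / root.      */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Placeholder interface name and MAC addresses. */
    const char *ifname = "eth0";
    unsigned char dst[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x02 };
    unsigned char src[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x01 };

    /* Build a minimal frame: dst MAC, src MAC, EtherType, tiny payload. */
    unsigned char frame[ETH_ZLEN] = { 0 };
    memcpy(frame, dst, ETH_ALEN);
    memcpy(frame + ETH_ALEN, src, ETH_ALEN);
    frame[12] = 0x88;                  /* EtherType 0x88b5:            */
    frame[13] = 0xb5;                  /* IEEE local experimental use  */
    frame[14] = 'h'; frame[15] = 'i';

    struct sockaddr_ll addr = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(0x88b5),
        .sll_ifindex  = if_nametoindex(ifname),
        .sll_halen    = ETH_ALEN,
    };
    memcpy(addr.sll_addr, dst, ETH_ALEN);

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}
```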


Normal TCP software architecture

[Software stack diagram, top to bottom]

Userspace: Application, MPI library, userspace sockets library

Kernel: TCP/IP stack, Cisco VIC driver

Hardware: Cisco VIC


USNIC software architecture

[Software stack diagram, top to bottom]

Userspace: Application, MPI library, userspace verbs library, Cisco USNIC software

Kernel: Verbs IB core, Cisco USNIC driver (used for bootstrapping and setup)

Hardware: Cisco VIC (the send and receive fast path goes from userspace directly to the VIC)

With all that background…


Doing some performance testing last week…

Two servers

Each with a 2 x 10Gb NIC

Connected back-to-back


"Ping pong" latency test

Ping! Send a message from one server; receive the message on the other.

Pong! Send the message back; get the message back on the original server.


“Ping pong” latency test

Because each ping and pong is soooo short, do this ping-pong exchange N times

Ping! / Pong!


"Ping pong" latency test

Time for one ping-pong = (total time for N ping-pongs) / N

Time for one ping = (time for one ping-pong) / 2

Time for one ping = half-round-trip (HRT) ping-pong latency
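The slides don't show the benchmark code itself; below is a minimal sketch of this kind of measurement, assuming plain TCP sockets and a 1-byte payload (the port number and iteration count are arbitrary). A real MPI ping-pong benchmark would use MPI_Send/MPI_Recv instead, but the timing arithmetic is the same.

```c
/* Minimal TCP ping-pong sketch (not the benchmark used in the talk).
 * One host runs "./pingpong server", the other "./pingpong <server-ip>".
 * Error handling is mostly omitted to keep the sketch short.            */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define N    10000          /* number of ping-pong iterations */
#define PORT 9000           /* hypothetical benchmark port    */

static double now_us(void)  /* wall-clock time in microseconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(int argc, char **argv)
{
    char buf = 'x';
    int one = 1;
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(PORT) };

    if (argc > 1 && strcmp(argv[1], "server") != 0) {       /* client side */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        inet_pton(AF_INET, argv[1], &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        double t0 = now_us();
        for (int i = 0; i < N; i++) {
            write(fd, &buf, 1);                             /* ping!    */
            read(fd, &buf, 1);                              /* ...pong! */
        }
        double total = now_us() - t0;
        /* one ping-pong = total / N; HRT latency = one ping-pong / 2 */
        printf("half-round-trip latency: %.2f us\n", total / N / 2.0);
    } else {                                                /* server side */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        addr.sin_addr.s_addr = INADDR_ANY;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);
        int fd = accept(lfd, NULL, NULL);
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        for (int i = 0; i < N; i++) {
            read(fd, &buf, 1);                              /* receive ping */
            write(fd, &buf, 1);                             /* send pong    */
        }
    }
    return 0;
}
```

TCP_NODELAY is set on both sides so that Nagle's algorithm doesn't add its own delays to the tiny messages and the measurement reflects the network and interrupt path.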


Results: using 1x10G Ethernet port

1 byte: ~60μs

8MB: ~150ms


Results: using 2x10G Ethernet ports

1 byte: ~60μs (1 port) → ~30μs (2 ports) (!)

8MB: ~150ms (1 port) → ~8.3ms (2 ports)

WHOA!


Results: just the small messages

The facts: from 1-1024 bytes, latency is flat

Using 1 interface: ~60μs

Using 2 interfaces: ~30μs

WHY?


Must look at how TCP works…

1. Ethernet frame arrives

2. NIC sends interrupt to OS Ethernet driver

3. OS Ethernet driver copies the packet to RAM

4. OS TCP stack hands packet off to (whatever)


The Costco Rule

It’s always better in bulk


Why copy one packet at a time?

Let's optimize this part


Two (commonly used) optimizations

1. Copy a bunch of packets across PCI at one time

2. Only raise one interrupt for all of those packet copies

A.k.a. "interrupt coalescing"


Interrupt coalescing

1. Ethernet frame arrives

2. Has N time passed since we sent an interrupt to the OS?

✖ No: queue up the frame

✔ Yes: send all queued frames and interrupt
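For illustration only, here is that decision as a tiny C sketch. Real coalescing is implemented in NIC hardware/firmware (often with a max-frames threshold as well); every name below is hypothetical.

```c
/* Conceptual sketch of the NIC-side coalescing decision described above.
 * The struct and helpers are hypothetical; real coalescing lives in NIC
 * hardware/firmware.                                                     */
#include <stdint.h>
#include <stdio.h>

struct rx_state {
    uint64_t last_interrupt_us;  /* when we last interrupted the OS    */
    uint64_t coalesce_usecs;     /* "N": the coalescing timeout        */
    unsigned queued_frames;      /* frames buffered since the last IRQ */
};

/* Stand-in for the hardware action, just so the sketch runs. */
static void flush_frames_to_host_and_interrupt(struct rx_state *rx)
{
    printf("IRQ: delivering %u queued frame(s) to the OS\n", rx->queued_frames);
    rx->queued_frames = 0;
}

/* Called (conceptually) for every arriving Ethernet frame. */
static void on_frame_arrival(struct rx_state *rx, uint64_t now_us)
{
    rx->queued_frames++;                              /* 1. frame arrives  */

    /* 2. Has N time passed since we last interrupted the OS? */
    if (now_us - rx->last_interrupt_us < rx->coalesce_usecs)
        return;                                       /* No: just queue it */

    flush_frames_to_host_and_interrupt(rx);           /* Yes: one IRQ for  */
    rx->last_interrupt_us = now_us;                   /* all queued frames */
}

int main(void)
{
    struct rx_state rx = { .coalesce_usecs = 125 };   /* 125us, as in the talk */
    /* Frames arriving every 10us: most get queued, a few raise IRQs. */
    for (uint64_t t = 0; t < 1000; t += 10)
        on_frame_arrival(&rx, t);
    return 0;
}
```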

Ok… So what?


The key: NIC interrupt coalescing timers


Timeline of a ping pong

[Timeline diagram: NIC A and NIC B, each with a periodic interrupt coalescing timeout of 125μs]

1. A sends ping frame

2. B receives ping frame

3. Coalesce timer expires; B sends interrupt

4. B sends pong frame

5. Coalesce timer expires; A sends interrupt

6. A sends ping frame

7. Rinse, repeat

4 ping-pongs take ~8x the timer duration: as drawn, each direction waits for one coalescing timeout before the receiving OS sees the frame, so each ping-pong costs roughly two timer periods.

In general, coalescing interrupts is a very Very Good Thing

But it definitely hurts low-latency traffic

How do we reduce those artificial delays?


Two Ethernet ports with out-of-sync timers

[Timeline diagram: Port 0 (NIC A, NIC B) and Port 1 (NIC A, NIC B), with interrupt coalescing timers that do not expire at the same time]


Get more round trips in the same amount of time

[Timeline diagram: ping-pongs proceeding over both Port 0 and Port 1, whose coalescing timers expire at different times]

In reality, the sender and receiver timers on each port are wholly unrelated; they don't line up as nicely as in these examples.

Meaning: in general, you actually usually get better overlap.


Results: just the small messages

Using 1 interface: ~60μs; using 2 interfaces: ~30μs

In this case we got such good asymmetry that the 2-port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time).


Lies, damn lies, and statistics

Remember: these are AVERAGE latencies!

Individual ping-pong times are the same as in the 1-port case (from the network)…

…but you get higher throughput because we're reducing the gaps between each ping-pong.

Now let's try something else…


Set the coalesce timer to 0
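On Linux, the receive coalescing timer is the rx-usecs value you can read with `ethtool -c` and set with `ethtool -C`; the sketch below does the same thing programmatically through the ETHTOOL_SCOALESCE ioctl. The interface name is a placeholder, root privileges are required, and whether a value of 0 is honored (or silently adjusted) depends on the NIC driver.

```c
/* Sketch: set the NIC's RX interrupt-coalescing timer to 0 from C via the
 * ETHTOOL_SCOALESCE ioctl; equivalent in spirit to "ethtool -C eth0 rx-usecs 0".
 * "eth0" is just an example interface name; must run as root.              */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example interface name */
    ifr.ifr_data = (void *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {        /* read current settings */
        perror("ETHTOOL_GCOALESCE");
        return 1;
    }
    printf("current rx-usecs: %u\n", ec.rx_coalesce_usecs);

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = 0;                      /* interrupt per frame   */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_SCOALESCE");
        return 1;
    }
    close(fd);
    return 0;
}
```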


New ping-pongs are much faster!

Small messages: ~10.5μs (1 port), ~10.6μs (2 ports)

Large messages: ~7.2ms (1 port), ~5.5ms (2 ports)


What are the tradeoffs?

Pros

• (Much) faster TCP latency …without changing the app!

• Faster speeds seem to scale up to large messages, too

• Great for low-latency, sparse-communication apps

• Best for NICs that are dedicated to MPI communication

Cons

• May not scale well for the case of an MPI process running on every core

• Lots and lots of interrupts going to socket:0.core:0

• May need to run (N-1) MPI processes…?

• May also want to avoid socket:0.core:0, or move IRQ affinity (a sketch of moving IRQ affinity follows below)

Your mileage may vary
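As a concrete example of the "move IRQ affinity" point above, here is a minimal sketch that steers a NIC interrupt to a single core by writing a CPU mask to /proc/irq/<N>/smp_affinity. The IRQ number and CPU choice are hypothetical (look up the real IRQ for your NIC queue in /proc/interrupts), and it must run as root.

```c
/* Sketch: steer a NIC interrupt away from core 0 by writing a CPU mask to
 * /proc/irq/<N>/smp_affinity.  IRQ 42 and CPU 2 are hypothetical values;
 * find the real IRQ number for your NIC queue in /proc/interrupts.        */
#include <stdio.h>

int main(void)
{
    const int irq = 42;                 /* hypothetical NIC RX queue IRQ */
    const unsigned cpu_mask = 1u << 2;  /* bit 2 set: allow only CPU 2   */
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%x\n", cpu_mask);       /* mask is written in hex        */
    fclose(f);
    return 0;
}
```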


But it’s interesting, nonetheless!

• Some experimentation might be worth trying with real-world HPC apps:

• Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)

• Set the coalesce timer to something more than 0μs but less than 125μs; there's a whole spectrum with which to play


My overall points:

• Many in HPC have "Ethernot" networks… but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments

• Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with

• Even good ol’ TCP is amazingly fast and optimized today

• You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance

The default settings on your Ethernet NIC / fabric are likely set for general TCP traffic, which produces very different performance characteristics from what HPC applications typically need

Thank you.
