message passing architecture in intra-cluster communicationbhuyan/cs213/2004/lecture14a.pdfcs213...

26
CS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan <xzhang, bhuyan>@cs.ucr.edu February 8, 2004 UC Riverside Slide 1

Upload: others

Post on 10-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

CS213

Message Passing Architecture in

Intra-Cluster Communication

Xiao Zhang

Lamxi Bhuyan

<xzhang, bhuyan>@cs.ucr.edu

February 8, 2004 UC Riverside Slide 1

Page 2: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

CS213

Outline

1 – Kernel-based Message Passing Architecture . . . . . . . . . . . . . 3

1.1 – How UDP send/receive work . . . . . . . . . . . . . . . . . . . 5

1.2 – UDP critical path analysis . . . . . . . . . . . . . . . . . . . . . 10

2 – User-Level Message Passing Architecture . . . . . . . . . . . . . . 13

2.1 – VIA Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 – How M-VIA send/receive work . . . . . . . . . . . . . . . . . . 17

2.3 – M-VIA critical path analysis . . . . . . . . . . . . . . . . . . . 22

2.4 – M-VIA performance evaluation . . . . . . . . . . . . . . . . . . 24

3 – Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

February 8, 2004 UC Riverside Slide 2

Page 3: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

1 – Kernel-based Message Passing Architecture

IP

TCP/UDP

Socket

User Applications

Kernel

User

Network Interface Controller

Driver

Figure 1: Traditional kernel-based mes-

sage passing architecture for intra-cluster

communication.

• User application initiates communi-

cation.

• Socket is the interface to user appli-

cation.

• TCP and UDP provide end-to-end

channel for users.

• IP is for routing.

• Driver/NIC perform transmit and

receive.

February 8, 2004 UC Riverside Slide 3

Page 4: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

User Applications:

s = socket(SOCK DGRAM)

bind(s,localAddr)

· · ·

Allocate 2 msg buf

· · ·

connect(s,remoteAddr)

· · ·

Prepare msg[1] for sending

send(s, msg[1])

recv(s, &msg[0])

· · ·

close(s)

s = socket(SOCK DGRAM)

bind(s,localAddr)

· · ·

Allocate 1 msg buf

· · ·

connect(s,remoteAddr)

· · ·

· · ·

recv(s, &msg)

send(s, msg)

· · ·

close(s)

Echo client using UDP/IP socket Echo server using UDP/IP socket

February 8, 2004 UC Riverside Slide 4

Page 5: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

1.1 – How UDP send/receive work

rx_ring tx_ring

application application

socket socket

skbuf(xmit)

skbuf(recv)

Driver

IPTCP/UDPSocket

PCI bus

pointer

copy

CSRs DMA engine CRC generator/checker

DEC 21140 device structures

Kernel

User

HOST

DEC 21140 NIC

Figure 2: TCP/UDP/IP over DEC 21140 NIC (tulip)

February 8, 2004 UC Riverside Slide 5

Page 6: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

UDP send operations

Socket:

• get the BSD socket structure.

• build msghdr structure for the sending message.

• pass control from BSD socket to INET socket.

UDP:

• get and verify the address. For connected socket, a flag is set to avoid routing.

• For unconnected socket, call routing function to get the routing table entry.

• fill in the UDP header (except checksum).

IP:

• segment large message if necessary.

• allocate a socket buffer (skbuf) and a data buffer for the sending message. The data

buffer includes the message and udp/ip/ethernet headers. The ethernet header is

16-byte aligned. The data buffer is 128-byte aligned.

• fill in the IP header and calculate the IP header checksum.

February 8, 2004 UC Riverside Slide 6

Page 7: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

IP (continue):

• copy the message from user space to skbuf and compute the message checksum at

the same time.

• compute the UDP header checksum and TCP/UDP pseudo-header checksum.

• copy the UDP header to skbuf.

Driver(xmit):

• enqueue skbuf to tx ring.

• trigger the immediate transmit.

NIC:

• DMA fetch tx ring descriptor.

• DMA fetch data buffer.

• CRC checksum.

• put packet onto the wire.

After NIC complete transmission:

• send a interrupt signal to CPU.

• CPU calls the NIC interrupt handler to dequeue skbuf from tx ring and free the

buffer.

0 31

source ip address

destination ip address

0 Protocol segment length

February 8, 2004 UC Riverside Slide 7

Page 8: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

UDP receive operations

when NIC receives a packet:

• DMA the incoming packet to a skbuf pointed by rx ring tail descriptor.

• send an interrupt to CPU.

Driver(recv):

• determine the packet protocol id.

• push skbuf into a queue.

• add the NIC to the poll list, and raise a soft interrupt to trigger the bottom-half of

the receive process.

IP (in bottom half):

• check IP header (length, version, checksum).

• handle IP options if any.

• reassemble IP fragments if necessary.

February 8, 2004 UC Riverside Slide 8

Page 9: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

UDP (in bottom half):

• check UDP header (length, checksum).

• find the INET socket that matches the address/port.

• enqueue skbuf into the socket receive queue.

• wake up a sleeping process waiting for packets.

Socket:

• get the BSD socket structure.

• build msghdr structure for receiving message.

• pass control from BSD socket to INET socket.

UDP (top half):

• If no packet arrives and recv() is a blocking call, the current process will sleep.

• After being waked up, copy and checksum the message from skbuf to user space.

• copy the sender’s address to user space.

February 8, 2004 UC Riverside Slide 9

Page 10: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

1.2 – UDP critical path analysis

� � � � � � � � � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � � � � � � � � �

� � � � � � � � � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � � � � � � � � �

send

0 1 2 3 4 5 6 7 8 9 10

SystemCall

trigger transmit

DMA(xmit)

IPUDP

socket

driver

11 12 13 14 15 16 17 18 19 20

intrStartintr(xmit)

intrReturn

transmitOne copy involved

time (usec)

User

Kernel

13 14 1615 17 18 19 20 21 22 23

DMA(recv)

IPUDP

intrReturn

schedule

One copy involved

24 25 26 27 28 30 3129 32 33 34 time (usec)

Kernel

User

35

receive

critical time

non−critical time

t2 t3 t4 t5 t6 t7 t8

t9 t10 t11 t12 t13 t14 t15 t16

t19

t18

t17

t1

t8

another processrecv()

SystemCall

sys_recvfromudp_recvmsg

intr(recv)interrupt

wakeupsleep

intrStart

Figure 3: Time line of transferring 1-byte message using UDP/IP

February 8, 2004 UC Riverside Slide 10

Page 11: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

Categories Intervals usec percentage

User application t1 + t19 0.3 0.9

Protocol processing t3 + t4 + t5 + t6 + t11 + t13 +

t14 + t16 + t17

14.5 41.0

OS system call overhead t2 + t18 1.6 4.5

OS interrupt overhead t10 + t12 5 14.1

OS context switch overhead t15 2 5.6

DMA/transmit/receive t7 + t8 + t9 12 33.9

Total 35.4 100

Table 1: UDP critical time break down

Problem: LARGE OVERHEAD!

February 8, 2004 UC Riverside Slide 11

Page 12: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

1 – Kernel-based Message Passing Architecture CS213

Goal: reduce software overhead to approach hardware limits.

Quick solution:

User Applications

Network Interface Controller

Driver

User

Kernel

IP

TCP/UDP

Socket

User Applications

Network Interface ControllerDriver

User

Kernel

This idea is reasonable:

• Cluster network can be assumed to be reliable and secure. A thick procotol stack like

TCP/IP is not necessary.

• An increasing number of applications are more sensitive to communication latency.

But,

• How to allow direct access yet provide protection?

• How to design an efficient, yet versatile programming interface?

• How to manage resources, in particular memory?

• How to fair share network without a kernel path?

February 8, 2004 UC Riverside Slide 12

Page 13: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

2 – User-Level Message Passing Architecture

In user-level message passing architecture, the operating system kernel and its

centralized networking stack are removed from the critical communication path

in a protected fashion, providing user applications a user-level network

interface, through which users can directly access to their network interface.

The operating system is only involved in setting up the communication.

In this way, the communicating parties can avoid intermediate copies of data,

interrupts, and context switches in the critical path, thus greatly decreasing

communication latency and increasing network throughput.

February 8, 2004 UC Riverside Slide 13

Page 14: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

2.1 – VIA Overview

Send

Doo

rbel

l

SEN

D

VI

RE

CV

RecvD

oorbell

CO

MP

CQ

Register Memory

Kernel Mode

User Mode

VI Provider

VIConsumer

Open / Connect / Send / Receive /RDMARead / RDMAWrite

Send

Doo

rbel

l

SEN

D

VIR

EC

VR

ecvDoorbell

VI User Agent (User Level Library)

Sockets, MPI, Cluster, OtherOS communication Interface

User Applications

(Kernel Level Driver)VI Kernel Agent

VI−capable Network Interface Controller

Figure 4: The Virtual Interface Architecture Model

February 8, 2004 UC Riverside Slide 14

Page 15: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

VI

Desc N

SendQ

Desc 2

Desc 1

RecvQ

Desc 1

Desc 2

Desc N

Status Status

Network Interface Controller

Send

Doo

rBel

l Recv D

oorBell

VI Consumer

Packets to/from network

Figure 5: A Virtual Interface

February 8, 2004 UC Riverside Slide 15

Page 16: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

VipOpenNic(&nicHand)

· · ·

VipCreateVI(&viHand)

· · ·

Allocate 3 desc and 3 msg

Register 3 desc and 3 msg

desc[i].buf ← msg[i]

· · ·

VipostRecv(viHand, &desc[1])

descp ← &desc[2]

VipConnectRequest(viHand,lAddr,rAddr)

· · ·

Prepare msg[0] for sending

VipPostSend(viHand,desc[0])

VipSendWait(viHand,&descp2)

VipPostRecv(viHand,descp)

VipRecvWait(viHand,&descp)

· · ·

VipDisconnect(viHand)

Deregister desc and msg

VipDestroyVI(viHand)

VipOpenNic(&nicHand)

· · ·

VipCreateVI(&viHand)

· · ·

Allocate desc[2] and msg[2]

Register desc[2] and msg[2]

desc[i].buf ← msg[i]

· · ·

VipPostRecv(viHand,&desc[0])

descp ← &desc[1]

VipConnectWait(nicHand,lAddr,rAddr,&connHand)

VipConnectAccept(connHand,viHand)

· · ·

VipPostRecv(viHand,descp)

VipRecvWait(viHand,&descp)

VipPostSend(viHand,descp)

VipSendWait(viHand,&descp)

· · ·

VipDisconnect(viHand)

Deregister desc and msg

VipDestroyVI(viHand)

Echo client using VIA Echo server using VIA

February 8, 2004 UC Riverside Slide 16

Page 17: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

2.2 – How M-VIA send/receive work

recv

_que

ue

send_queue recv

_que

ue

send_queuecopycopy

VIVI user registered buffers

HOST

DEC 21140 NIC

Kernel

User

PCI bus

DEC 21140 device structures

CSRs DMA engine CRC generator/checker

rx_ring tx_ringrecv buffers

Figure 6: M-VIA over DEC 21140 NIC (tulip). M-VIA is a modular VIA imple-

mentation for Linux over Ethernet.

February 8, 2004 UC Riverside Slide 17

Page 18: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

M-VIA send operations

VipPostSend:

• enqueue a send descriptor to the sending queue.

Driver(xmit):

• prepares send header, waits for device stop, gets the next send decriptor entry, and

fill in the header.

• segments large packets into chunks.

• enqueue each chunk to tx ring.

• trigger the immediate transmit.

NIC:

• DMA fetch tx ring descriptor.

• DMA fetch data buffer.

• CRC checksum.

• put packet onto the wire.

February 8, 2004 UC Riverside Slide 18

Page 19: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

After NIC complete transmission:

• send a interrupt signal to CPU.

• CPU calls the NIC interrupt handler to dequeue head entries from tx ring, mark the

descriptor complete and wake up a block process.

VipSendWait:

• check the head descriptor of the sending queue for N times. If the head descriptor is

marked complete, remove it from the sending queue and return success.

• otherwise, sleep.

• after being waked up, check the head descriptor again. If the head descriptor is

marked complete, remove it from the sending queue, and return success. Otherwise,

return error.

February 8, 2004 UC Riverside Slide 19

Page 20: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

M-VIA receive operations

VipPostRecv:

• enqueue a recv descriptor to the receiving queue.

when NIC receivess a packet:

• DMA the incoming packet to a buffer pointed by rx ring tail descriptor.

• send a interrupt signal to CPU.

Driver(recv):

• get a corresponding VI structure from the received packet.

• check the sequence number of the received packet.

• copy the packet from rx ring to VI specified buffer.

• dequeue head entry from rx ring.

• mark the descritpor complete.

• wake up a block process.

February 8, 2004 UC Riverside Slide 20

Page 21: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

VipRecvWait:

• checking the head descriptor of the receiving queue for N times. If the head

descriptor is marked complete, remove it from the receiving queue and return success.

• otherwise, sleep.

• after being waked up, check the head descriptor again. If the head descriptor is

marked complete, remove it from the receiving queue, and return success. Otherwise,

return error.

February 8, 2004 UC Riverside Slide 21

Page 22: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

2.3 – M-VIA critical path analysis

FastTrapReturn

VipPostRecv

FastTrapVipkPostRecv

FastTrapReturn

0 2 31 14 15 17 18 19 2016 21 22 2313124 5 6 7 8 9 10 11

FastTrap

polling

Kernel

User

time (usec)

VipPostSend

t1

NetworkReceiver

Sender

interrupt

polling

Kernel

User

interrupt

t5 t6 t7

Transmit

Receive

VipRecvWait

intr(recv)

intrStart

t10

critical time

non−critical time

intr(xmit)

intrReturn

VipSendWait

DMA(recv)

DMA(xmit)

t3 t4 t8 t9

one copy involved

trigger transmitVipkERingtulipPostSend

t2

intrReturn

intrStart

Figure 7: Time line of transferring 1-byte message using M-VIA

February 8, 2004 UC Riverside Slide 22

Page 23: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

Interval Description Time

cycles usec

t1 user post send descriptor 31 0.08

t2 Fast trap 138 0.34

t3 device send (first chunk) 733 1.83

t4 + t5 DMA(xmit)/transmit 10.1 + 0.08s

t6 DMA (receive) ???

t7 interrupt start 1268 3.16

t8 interrupt recv processing 3.1 + 0.272b(s + 25)/32c

t9 interrupt return 627 1.57

t10 user receive (polling) 20 0.05

Table 2: M-VIA critical time breakdown

February 8, 2004 UC Riverside Slide 23

Page 24: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

2.4 – M-VIA performance evaluation

vipl.a

Modules:via_ka.ovia_ering.ovia_tulip.o

Library:

Header:vipl.h

M−VIA 1.2

PentiumCeleron 400MHz

256M SDRAM

DEC 21140 NIC

Applications

Node 1 Node 2

RedHat Linux 2.4.20−8

crossover cable

PentiumCeleron 400MHz

256M SDRAM

DEC 21140 NIC

Applications

RedHat Linux 2.4.20−8

vpingpongrpingpong

upingpongtpingpong

vpingpongrpingpong

upingpongtpingping

TCP/IP TCP/IPM−VIA M−VIA

Figure 8: Experimental Setting

February 8, 2004 UC Riverside Slide 24

Page 25: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

2 – User-Level Message Passing Architecture CS213

0

50

100

150

200

0 200 400 600 800 1000 1200 1400

Lat

ency

(us

)

Packet size (bytes)

Linux TCP/IP socketLinux UDP/IP socket

M-VIA (reliable)M-VIA (unreliable)

Figure 9: One-way lentency vs. message

size

• entirely remove socket,

TCP/UDP and IP layers.

• zero copy when sending.

Sending overhead is constant.

• use fast trap instead of normal

system call.

• use polling instead of interrupt

to avoid context switch.

February 8, 2004 UC Riverside Slide 25

Page 26: Message Passing Architecture in Intra-Cluster Communicationbhuyan/CS213/2004/LECTURE14a.pdfCS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan

3 – Conclusion CS213

3 – Conclusion

• Traditional kernel-based communication has large software overhead.

• User-level communication is a good solution for intra-cluster

communication.

February 8, 2004 UC Riverside Slide 26