CS213
Message Passing Architecture in
Intra-Cluster Communication
Xiao Zhang
Laxmi Bhuyan
<xzhang, bhuyan>@cs.ucr.edu
February 8, 2004 UC Riverside Slide 1
Outline
1 – Kernel-based Message Passing Architecture
1.1 – How UDP send/receive work
1.2 – UDP critical path analysis
2 – User-Level Message Passing Architecture
2.1 – VIA Overview
2.2 – How M-VIA send/receive work
2.3 – M-VIA critical path analysis
2.4 – M-VIA performance evaluation
3 – Conclusion
1 – Kernel-based Message Passing Architecture
Figure 1: Traditional kernel-based message passing architecture for intra-cluster communication.
[Diagram: User Applications in user space; Socket, TCP/UDP, IP, and Driver in the kernel; Network Interface Controller below.]
• User application initiates communication.
• Socket is the interface to user application.
• TCP and UDP provide end-to-end channel for users.
• IP is for routing.
• Driver/NIC perform transmit and receive.
User Applications:
Echo client using UDP/IP socket:
s = socket(SOCK_DGRAM)
bind(s,localAddr)
· · ·
Allocate 2 msg buf
· · ·
connect(s,remoteAddr)
· · ·
Prepare msg[1] for sending
send(s, msg[1])
recv(s, &msg[0])
· · ·
close(s)
Echo server using UDP/IP socket:
s = socket(SOCK_DGRAM)
bind(s,localAddr)
· · ·
Allocate 1 msg buf
· · ·
connect(s,remoteAddr)
· · ·
recv(s, &msg)
send(s, msg)
· · ·
close(s)
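The echo exchange above can be tried on a single machine. The following is a minimal sketch in Python, whose socket module wraps the same BSD calls. It is not the code used in the lecture: the loopback address, the 2048-byte buffer, the 2-second timeout, and the helper names echo_server and echo_once are assumptions made for illustration. The server here replies with recvfrom/sendto instead of connecting to a known client.

```python
import socket
import threading


def echo_server(sock):
    """Receive one datagram and send it straight back to its sender."""
    msg, peer = sock.recvfrom(2048)
    sock.sendto(msg, peer)


def echo_once(payload):
    """Run one echo round trip over loopback and return the reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.settimeout(2.0)                 # assumption: give up after 2 s
    server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    server_addr = server.getsockname()

    worker = threading.Thread(target=echo_server, args=(server,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    client.bind(("127.0.0.1", 0))
    client.connect(server_addr)            # fixes the peer, as in the client above
    client.send(payload)
    reply = client.recv(2048)

    worker.join()
    client.close()
    server.close()
    return reply


reply = echo_once(b"x")                    # 1-byte message, as in the timing study
```

Every send and recv in this sketch is a system call into the kernel path described in the following slides.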
1.1 – How UDP send/receive work
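Figure 2: TCP/UDP/IP over DEC 21140 NIC (tulip)
[Diagram. HOST, user space: application and socket on both the send and receive side. HOST, kernel: Socket, TCP/UDP, IP, Driver, with skbuf(xmit) and skbuf(recv) attached to tx_ring and rx_ring. Across the PCI bus, the DEC 21140 NIC with its device structures: CSRs, DMA engine, CRC generator/checker. The legend distinguishes pointer passing from data copy.]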
UDP send operations
Socket:
• get the BSD socket structure.
• build msghdr structure for the sending message.
• pass control from BSD socket to INET socket.
UDP:
• get and verify the address. For a connected socket, a flag is set so that routing is skipped.
• For an unconnected socket, call the routing function to get the routing table entry.
• fill in the UDP header (except checksum).
IP:
• segment large messages if necessary.
• allocate a socket buffer (skbuf) and a data buffer for the outgoing message. The data buffer holds the message and the UDP/IP/Ethernet headers. The Ethernet header is 16-byte aligned. The data buffer is 128-byte aligned.
• fill in the IP header and calculate the IP header checksum.
IP (continued):
• copy the message from user space to skbuf and compute the message checksum at the same time.
• compute the UDP header checksum and TCP/UDP pseudo-header checksum.
• copy the UDP header to skbuf.
Driver(xmit):
• enqueue skbuf to tx ring.
• trigger the immediate transmit.
NIC:
• DMA fetch tx ring descriptor.
• DMA fetch data buffer.
• CRC checksum.
• put packet onto the wire.
After the NIC completes transmission:
• send an interrupt signal to the CPU.
• CPU calls the NIC interrupt handler to dequeue skbuf from tx ring and free the buffer.
TCP/UDP pseudo-header (three 32-bit words, bits 0 to 31):
source IP address
destination IP address
0 | protocol | segment length
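The checksum steps in the send path can be sketched as follows. This is an illustrative Python version of the standard Internet checksum (RFC 1071) applied over the pseudo-header plus the UDP segment, not the kernel's implementation, which folds the sum into the user-to-skbuf copy. The function names, addresses, and ports are hypothetical.

```python
import socket
import struct


def internet_checksum(data):
    """One's-complement sum of 16-bit words, then complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                    # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                     # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip, dst_ip, segment_len):
    """Source address, destination address, zero byte, protocol, segment length."""
    return struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,
        socket.IPPROTO_UDP,                # protocol number 17
        segment_len,
    )


def udp_checksum(src_ip, dst_ip, segment):
    """Checksum for a UDP segment whose checksum field is currently zero."""
    value = internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return value or 0xFFFF                 # a computed 0 is sent as all ones (RFC 768)


# Hypothetical 1-byte datagram: source port 1234, destination port 7, length 9.
segment = struct.pack("!HHHH", 1234, 7, 9, 0) + b"x"
checksum = udp_checksum("10.0.0.1", "10.0.0.2", segment)
filled = segment[:6] + struct.pack("!H", checksum) + segment[8:]
```

A receiver repeats the sum over the pseudo-header and the filled segment; a result of zero means the datagram passed the check.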
UDP receive operations
When the NIC receives a packet:
• DMA the incoming packet to an skbuf pointed to by the rx ring tail descriptor.
• send an interrupt to the CPU.
Driver(recv):
• determine the packet protocol id.
• push skbuf into a queue.
• add the NIC to the poll list, and raise a soft interrupt to trigger the bottom-half of the receive process.
IP (in bottom half):
• check IP header (length, version, checksum).
• handle IP options if any.
• reassemble IP fragments if necessary.
UDP (in bottom half):
• check UDP header (length, checksum).
• find the INET socket that matches the address/port.
• enqueue skbuf into the socket receive queue.
• wake up a sleeping process waiting for packets.
Socket:
• get the BSD socket structure.
• build msghdr structure for receiving message.
• pass control from BSD socket to INET socket.
UDP (top half):
• If no packet has arrived and recv() is a blocking call, the current process sleeps.
• After being woken up, copy and checksum the message from skbuf to user space.
• copy the sender’s address to user space.
1.2 – UDP critical path analysis
Figure 3: Time line of transferring 1-byte message using UDP/IP
[Timeline figure. Time axis in usec, about 0 to 20 for the sender and 13 to 35 for the receiver. Intervals t1 to t19 are marked as critical or non-critical time. Sender labels: send, SystemCall, socket, UDP, IP (one copy involved), driver, trigger transmit, DMA(xmit), transmit, intrStart, intr(xmit), intrReturn. Receiver labels: recv(), SystemCall, sys_recvfrom, udp_recvmsg, sleep, another process, DMA(recv), interrupt, intrStart, intr(recv), IP, UDP, wakeup, intrReturn, schedule, receive (one copy involved).]
Categories                  Intervals                                         usec   percentage
User application            t1 + t19                                           0.3      0.9
Protocol processing         t3 + t4 + t5 + t6 + t11 + t13 + t14 + t16 + t17   14.5     41.0
OS system call overhead     t2 + t18                                           1.6      4.5
OS interrupt overhead       t10 + t12                                          5       14.1
OS context switch overhead  t15                                                2        5.6
DMA/transmit/receive        t7 + t8 + t9                                      12       33.9
Total                                                                         35.4    100
Table 1: UDP critical time breakdown
Problem: LARGE OVERHEAD!
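To put a number on the overhead, the short Python sketch below sums the rows of Table 1. Treating everything except the DMA/transmit/receive row as software overhead is an assumption made here for illustration; the slides do not define the split this way. Recomputed percentages can differ from the table in the last digit, presumably because the table was built from unrounded times.

```python
# Critical-path times from Table 1, in microseconds.
breakdown_usec = {
    "User application": 0.3,
    "Protocol processing": 14.5,
    "OS system call overhead": 1.6,
    "OS interrupt overhead": 5.0,
    "OS context switch overhead": 2.0,
    "DMA/transmit/receive": 12.0,
}

total = sum(breakdown_usec.values())
hardware = breakdown_usec["DMA/transmit/receive"]
software = total - hardware               # assumption: all other rows are software
software_share = 100 * software / total

for name, usec in breakdown_usec.items():
    print(f"{name:28s} {usec:5.1f} usec {100 * usec / total:5.1f}%")
print(f"software share of the critical path: {software_share:.1f}%")
```

Under that assumption, roughly two thirds of the 35.4 usec is spent in software, not on the DMA engine or the wire.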
Goal: reduce software overhead to approach hardware limits.
Quick solution:
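[Diagram: the traditional stack (User Applications in user space; Socket, TCP/UDP, IP, and Driver in the kernel; Network Interface Controller) shown next to a reduced stack in which only User Applications, Driver, and Network Interface Controller remain.]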
This idea is reasonable:
• The cluster network can be assumed to be reliable and secure. A thick protocol stack like TCP/IP is not necessary.
• An increasing number of applications are more sensitive to communication latency.
But,
• How to allow direct access yet provide protection?
• How to design an efficient, yet versatile programming interface?
• How to manage resources, in particular memory?
• How to share the network fairly without a kernel path?
2 – User-Level Message Passing Architecture
In a user-level message passing architecture, the operating system kernel and its
centralized networking stack are removed from the critical communication path
in a protected fashion. User applications are given a user-level network
interface through which they can access the network interface controller directly.
The operating system is only involved in setting up the communication.
In this way, the communicating parties can avoid intermediate copies of data,
interrupts, and context switches in the critical path, thus greatly decreasing
communication latency and increasing network throughput.
2.1 – VIA Overview
Figure 4: The Virtual Interface Architecture Model
[Diagram. VI Consumer (user mode): User Applications, an OS communication interface (Sockets, MPI, Cluster, Other), and the VI User Agent (user-level library). Each VI has SEND and RECV queues with a Send Doorbell and a Recv Doorbell; a completion queue (CQ) collects completions (COMP). VI Provider: the VI Kernel Agent (kernel-level driver, kernel mode) and the VI-capable Network Interface Controller. Operations shown: Open / Connect / Register Memory / Send / Receive / RDMA Read / RDMA Write.]
Figure 5: A Virtual Interface
[Diagram. The VI Consumer posts descriptors (Desc 1, Desc 2, ... Desc N) to the VI's SendQ and RecvQ. Each queue has a Status and its own doorbell (Send Doorbell, Recv Doorbell) toward the Network Interface Controller, which moves packets to and from the network.]
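The queue-and-doorbell structure of a Virtual Interface can be mimicked with a small toy model. The Python sketch below is an illustration only: the class and function names (Descriptor, ToyVI, toy_nic_transfer) are hypothetical, the doorbell is reduced to a counter, and the NIC is replaced by a plain function that copies one message between two VIs in the same process.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buf: bytes = b""
    complete: bool = False                 # stands in for the Status in Figure 5


class ToyVI:
    """One Virtual Interface: a send queue, a receive queue, and two doorbells."""

    def __init__(self):
        self.send_q = deque()
        self.recv_q = deque()
        self.send_doorbell = 0             # posted descriptors not yet serviced
        self.recv_doorbell = 0

    def post_send(self, desc):
        self.send_q.append(desc)
        self.send_doorbell += 1            # "ring" the send doorbell

    def post_recv(self, desc):
        self.recv_q.append(desc)
        self.recv_doorbell += 1            # "ring" the receive doorbell


def toy_nic_transfer(src, dst):
    """Stand-in for the NIC: deliver the oldest pending send to the oldest pending receive."""
    if src.send_doorbell == 0:
        raise RuntimeError("no send descriptor posted")
    if dst.recv_doorbell == 0:
        raise RuntimeError("no receive descriptor posted")
    send_desc = next(d for d in src.send_q if not d.complete)
    recv_desc = next(d for d in dst.recv_q if not d.complete)
    recv_desc.buf = send_desc.buf
    send_desc.complete = recv_desc.complete = True
    src.send_doorbell -= 1
    dst.recv_doorbell -= 1


client, server = ToyVI(), ToyVI()
server.post_recv(Descriptor())             # receive must be posted before data arrives
client.post_send(Descriptor(b"ping"))
toy_nic_transfer(client, server)
```

The model makes one point visible: the receiver has to post a descriptor before the sender's message arrives, because the data needs a posted buffer to land in.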
Echo client using VIA:
VipOpenNic(&nicHand)
· · ·
VipCreateVI(&viHand)
· · ·
Allocate 3 desc and 3 msg
Register 3 desc and 3 msg
desc[i].buf ← msg[i]
· · ·
VipPostRecv(viHand, &desc[1])
descp ← &desc[2]
VipConnectRequest(viHand,lAddr,rAddr)
· · ·
Prepare msg[0] for sending
VipPostSend(viHand,&desc[0])
VipSendWait(viHand,&descp2)
VipPostRecv(viHand,descp)
VipRecvWait(viHand,&descp)
· · ·
VipDisconnect(viHand)
Deregister desc and msg
VipDestroyVI(viHand)
Echo server using VIA:
VipOpenNic(&nicHand)
· · ·
VipCreateVI(&viHand)
· · ·
Allocate desc[2] and msg[2]
Register desc[2] and msg[2]
desc[i].buf ← msg[i]
· · ·
VipPostRecv(viHand,&desc[0])
descp ← &desc[1]
VipConnectWait(nicHand,lAddr,rAddr,&connHand)
VipConnectAccept(connHand,viHand)
· · ·
VipPostRecv(viHand,descp)
VipRecvWait(viHand,&descp)
VipPostSend(viHand,descp)
VipSendWait(viHand,&descp)
· · ·
VipDisconnect(viHand)
Deregister desc and msg
VipDestroyVI(viHand)
2.2 – How M-VIA send/receive work
Figure 6: M-VIA over DEC 21140 NIC (tulip). M-VIA is a modular VIA implementation for Linux over Ethernet.
[Diagram. HOST, user space: two VIs, each with a send_queue and a recv_queue, and the user registered buffers. HOST, kernel: tx_ring, rx_ring, and recv buffers, with copy arrows between the kernel buffers and the user registered buffers. Across the PCI bus, the DEC 21140 NIC with its device structures: CSRs, DMA engine, CRC generator/checker.]
M-VIA send operations
VipPostSend:
• enqueue a send descriptor to the sending queue.
Driver(xmit):
• prepare the send header, wait for the device to stop, get the next send descriptor entry, and fill in the header.
• segments large packets into chunks.
• enqueue each chunk to tx ring.
• trigger the immediate transmit.
NIC:
• DMA fetch tx ring descriptor.
• DMA fetch data buffer.
• CRC checksum.
• put packet onto the wire.
After the NIC completes transmission:
• send an interrupt signal to the CPU.
• CPU calls the NIC interrupt handler to dequeue head entries from tx ring, mark the descriptor complete and wake up a blocked process.
VipSendWait:
• check the head descriptor of the sending queue up to N times. If the head descriptor is marked complete, remove it from the sending queue and return success.
• otherwise, sleep.
• after being woken up, check the head descriptor again. If the head descriptor is marked complete, remove it from the sending queue and return success. Otherwise, return error.
M-VIA receive operations
VipPostRecv:
• enqueue a recv descriptor to the receiving queue.
When the NIC receives a packet:
• DMA the incoming packet to a buffer pointed to by the rx ring tail descriptor.
• send an interrupt signal to the CPU.
Driver(recv):
• get a corresponding VI structure from the received packet.
• check the sequence number of the received packet.
• copy the packet from rx ring to the buffer specified by the VI.
• dequeue head entry from rx ring.
• mark the descriptor complete.
• wake up a blocked process.
VipRecvWait:
• check the head descriptor of the receiving queue up to N times. If the head descriptor is marked complete, remove it from the receiving queue and return success.
• otherwise, sleep.
• after being woken up, check the head descriptor again. If the head descriptor is marked complete, remove it from the receiving queue and return success. Otherwise, return error.
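The wait logic shared by VipSendWait and VipRecvWait (poll N times, sleep, check once more) can be sketched as below. This is a Python illustration, not M-VIA code: the function wait_for_completion, the use of threading.Event as the wakeup mechanism, the spin count, and the sleep timeout are all assumptions, and a timer thread plays the role of the interrupt handler.

```python
import threading


class Descriptor:
    def __init__(self):
        self.complete = False


def wait_for_completion(queue, wakeup, spins=1000, timeout=1.0):
    """Return the completed head descriptor, or None to signal an error."""
    for _ in range(spins):                 # phase 1: poll the head descriptor N times
        if queue and queue[0].complete:
            return queue.pop(0)
    wakeup.wait(timeout)                   # phase 2: sleep until woken (or timed out)
    if queue and queue[0].complete:        # phase 3: check once more after waking
        return queue.pop(0)
    return None                            # still not complete: report an error


def interrupt_handler(desc, wakeup):
    """Stand-in for the driver: mark the descriptor complete, then wake the waiter."""
    desc.complete = True
    wakeup.set()


# Slow path: completion arrives after the polling phase has given up.
desc = Descriptor()
recv_queue = [desc]
wakeup = threading.Event()
timer = threading.Timer(0.05, interrupt_handler, args=(desc, wakeup))
timer.start()
slow_result = wait_for_completion(recv_queue, wakeup, spins=10)
timer.join()

# Fast path: the descriptor is already complete, so polling succeeds at once.
done = Descriptor()
done.complete = True
fast_result = wait_for_completion([done], threading.Event())
```

When completion arrives during the polling phase, the waiter never sleeps, so no context switch lands on the critical path; the sleep is only the fallback.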
2.3 – M-VIA critical path analysis
Figure 7: Time line of transferring 1-byte message using M-VIA
[Timeline figure. Time axis in usec, 0 to about 23, with Sender, Network, and Receiver rows split into User and Kernel. Intervals t1 to t10 are marked as critical or non-critical time. Sender labels: VipPostSend, FastTrap, VipkERingtulipPostSend, trigger transmit, DMA(xmit), Transmit, FastTrapReturn, VipSendWait, polling, interrupt, intrStart, intr(xmit), intrReturn. Receiver labels: VipPostRecv, FastTrap, VipkPostRecv, FastTrapReturn, VipRecvWait, polling, Receive, DMA(recv), interrupt, intrStart, intr(recv) (one copy involved), intrReturn.]
Interval   Description                 Time (cycles)   Time (usec)
t1         user post send descriptor   31              0.08
t2         Fast trap                   138             0.34
t3         device send (first chunk)   733             1.83
t4 + t5    DMA(xmit)/transmit                          10.1 + 0.08s
t6         DMA (receive)                               ???
t7         interrupt start             1268            3.16
t8         interrupt recv processing                   3.1 + 0.272⌊(s + 25)/32⌋
t9         interrupt return            627             1.57
t10        user receive (polling)      20              0.05
Table 2: M-VIA critical time breakdown
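The entries of Table 2 can be combined into a rough latency estimate. The Python sketch below rests on several assumptions: s is taken to be the message size in bytes (the slides do not define it), t6 is left out because the table gives no value for it, and t3 covers only the first chunk, so the estimate is a lower bound that gets looser for large messages. The function name is hypothetical.

```python
# Fixed-cost intervals from Table 2, in microseconds.
FIXED_USEC = {
    "t1": 0.08,    # user post send descriptor
    "t2": 0.34,    # fast trap
    "t3": 1.83,    # device send (first chunk)
    "t7": 3.16,    # interrupt start
    "t9": 1.57,    # interrupt return
    "t10": 0.05,   # user receive (polling)
}


def mvia_critical_time_usec(s):
    """Sum of the known critical-path intervals for a message of size s.

    Assumption: s is in bytes. t6 (receive DMA) is unknown and excluded.
    """
    wire = 10.1 + 0.08 * s                     # t4 + t5: DMA(xmit) and transmit
    recv = 3.1 + 0.272 * ((s + 25) // 32)      # t8: interrupt recv processing
    return sum(FIXED_USEC.values()) + wire + recv


one_byte = mvia_critical_time_usec(1)
kilobyte = mvia_critical_time_usec(1000)
print(f"1 byte: {one_byte:.2f} usec, 1000 bytes: {kilobyte:.2f} usec (t6 excluded)")
```

For a 1-byte message the floor term is zero, so the size-dependent cost comes entirely from the transmit interval t4 + t5.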
2.4 – M-VIA performance evaluation
Figure 8: Experimental Setting
[Diagram. Node 1 and Node 2 are connected by a crossover cable. Each node: Pentium Celeron 400MHz, 256M SDRAM, DEC 21140 NIC, RedHat Linux 2.4.20-8. Software on each node: TCP/IP and M-VIA 1.2 (modules via_ka.o, via_ering.o, via_tulip.o; library vipl.a; header vipl.h). Applications on each node: vpingpong, rpingpong, upingpong, tpingpong.]
Figure 9: One-way latency vs. message size
[Plot. Latency (us), 0 to 200, against packet size (bytes), 0 to 1400. Curves: Linux TCP/IP socket, Linux UDP/IP socket, M-VIA (reliable), M-VIA (unreliable).]
• entirely remove socket, TCP/UDP and IP layers.
• zero copy when sending. Sending overhead is constant.
• use fast trap instead of normal system call.
• use polling instead of interrupt to avoid context switch.
3 – Conclusion
• Traditional kernel-based communication has large software overhead.
• User-level communication is a good solution for intra-cluster communication.