CS-556: Distributed Systems, Manolis Marazakis: Inter-Process Communication (III)
Fall Semester 2005 CS-556: Distributed Systems
Berkeley Sockets (I)
Socket primitives for TCP/IP.
Primitive   Meaning
Socket      Create a new communication endpoint
Bind        Attach a local address to a socket
Listen      Announce willingness to accept connections
Accept      Block caller until a connection request arrives
Connect     Actively attempt to establish a connection
Send        Send some data over the connection
Receive     Receive some data over the connection
Close       Release the connection
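
The primitives in the table can be sketched with Python's standard socket module. This is a minimal illustration, not part of the original slides: the helper names (run_echo_server, echo_once) are hypothetical, both endpoints run in one process on the loopback interface, and the message is assumed small enough to arrive in a single recv().

```python
import socket
import threading

def run_echo_server(listener: socket.socket) -> None:
    """Accept one connection, echo one message back, then close."""
    conn, _peer = listener.accept()      # Accept: block until a request arrives
    with conn:                           # Close happens when the block exits
        data = conn.recv(1024)           # Receive
        conn.sendall(data)               # Send

def echo_once(message: bytes) -> bytes:
    """Run a one-shot echo exchange over TCP on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # Socket
    listener.bind(("127.0.0.1", 0))      # Bind: port 0 lets the OS pick a port
    listener.listen(1)                   # Listen
    port = listener.getsockname()[1]
    server = threading.Thread(target=run_echo_server, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # Socket
    client.connect(("127.0.0.1", port))  # Connect
    client.sendall(message)              # Send
    reply = client.recv(1024)            # Receive
    client.close()                       # Close
    server.join()
    listener.close()
    return reply
```

Calling echo_once(b"hello") should return the same bytes that were sent.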
Berkeley Sockets (II)
Connection-oriented communication pattern using sockets.
Connected vs Connectionless (I)
• IP: best-effort, unreliable, connectionless
    • Remembers nothing about a packet after it has sent it
    • Checksum computed on the header only
• No assumptions about the underlying physical medium
    • Serial link, Ethernet, Token Ring, X.25, ATM, wireless CDPD, …
• UDP:
    • (optional) checksum
    • notion of port
Connected vs Connectionless (II)
• TCP: reliable, connection-oriented service
    • Segments are sent in IP datagrams
    • Checksum of data in each segment
    • Sequence # of the 1st byte in the segment
    • Acknowledge-and-retransmit mechanism
• Each side maintains a receive window
    • Range of sequence #s that this side is prepared to receive
    • Any arriving data with a sequence # outside the receive window is discarded
    • Queuing of data arriving out of order
    • Window slides to the right if the next expected sequence # has arrived … and an ACK is sent back with the sequence # expected next
• Send window:
    • Bytes sent but not yet acknowledged
        • RTO timer (retransmission timeout)
        • Timeout does not always mean that the data was lost!
    • Bytes that can be sent but have not yet been sent
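
The receive-window rules above can be illustrated with a toy model. This sketch is an assumption-laden simplification, not real TCP: the class name ReceiveWindow is hypothetical, sequence numbers never wrap around, and the window is tracked per byte.

```python
class ReceiveWindow:
    """Toy byte-granularity receive window (no wrap-around, not real TCP)."""

    def __init__(self, next_expected: int, size: int) -> None:
        self.next_expected = next_expected   # left edge of the window
        self.size = size                     # number of sequence #s accepted
        self.pending = {}                    # seq # -> byte, queued out of order
        self.delivered = bytearray()         # in-order data ready for the app

    def receive(self, seq: int, data: bytes) -> int:
        """Process a segment starting at seq; return the ACK (next expected seq #)."""
        for offset, byte in enumerate(data):
            s = seq + offset
            if self.next_expected <= s < self.next_expected + self.size:
                self.pending[s] = byte       # inside the window: queue it
            # bytes outside the window (old duplicates, too far ahead) are dropped
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1          # window slides to the right
        return self.next_expected
```

With a window of 8 starting at 0, a segment at sequence # 4 is queued and ACKed with 0; once the segment at 0 arrives, the ACK jumps to 8 and all eight bytes become deliverable in order.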
UDP Failure Model
• Omission failures
    • timeouts, duplicate messages, lost messages
    • Need to maintain history
        • Last reply sent to each client, provided that a client can make only one request at a time
        • Server interprets each request as the ACK for the previous reply
        • Periodic “purge” of history
        • No ACK for the last response received before the client terminates
    • Fixed max. buffer size (8 KB)
    • No message order guarantee
• Process crash failures
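
The reply history described above might look like the following sketch. It assumes each client tags its requests with an increasing request id and has at most one request outstanding; the class and method names are hypothetical, and no real UDP traffic is involved.

```python
class HistoryServer:
    """Request-reply server that remembers the last reply sent to each client."""

    def __init__(self, handler) -> None:
        self.handler = handler
        self.history = {}        # client id -> (request id, last reply)
        self.executions = 0      # how many times the handler really ran

    def handle(self, client: str, request_id: int, payload):
        last = self.history.get(client)
        if last is not None and last[0] == request_id:
            return last[1]       # duplicate request: resend the saved reply
        # A new request id implicitly ACKs the previous reply, so overwrite it.
        self.executions += 1
        reply = self.handler(payload)
        self.history[client] = (request_id, reply)
        return reply

    def purge(self, client: str) -> None:
        """Drop the history of a client that is assumed to have terminated."""
        self.history.pop(client, None)
```

A retransmitted request with the same id returns the stored reply without re-executing the operation, which matters when the operation is not idempotent.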
TCP Failure Model
• Reliable message delivery
    • checksums, sequence numbers, timeouts
    • no need for applications to deal with retransmissions, duplicates, reordering
    • no need for histories
• Flow control mechanism
    • large transfers without overwhelming the receiver
• … BUT not reliable sessions:
    • Connections may be severed or severely congested
    • Processes cannot distinguish network from process failure
    • Processes cannot tell if their recent messages were received
TCP is a stream protocol
• No inherent notion of “message boundary”
    • The amount of data in a packet is not directly related to the amount of data delivered to TCP in the send() call
• No reliable way for the receiver to determine how the data was packetized
    • Several packets may have arrived between recv() calls
    • The amount of data returned in any given read() is unpredictable
• Fixed-length messages
• Variable-length messages
    • End-of-record marker
    • Fixed-length header (including record length) + variable data
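
The "fixed-length header + variable data" option can be sketched as follows. This is one possible framing, assuming a 4-byte big-endian length header; the function names are hypothetical.

```python
import socket
import struct

HEADER = struct.Struct("!I")   # 4-byte record length, network byte order

def send_record(sock: socket.socket, payload: bytes) -> None:
    """Send one record: fixed-length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until exactly n bytes have arrived; recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-record")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_record(sock: socket.socket) -> bytes:
    """Read the header first, then exactly as many bytes as it announces."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

The receiver recovers record boundaries regardless of how the byte stream was split into segments, because it never relies on a single recv() returning a whole record.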
TCP Failure Modes (I)
• “TCP guarantees delivery of the data it sends”: True or False?
    • Guarantee to whom?
• False … How can we handle outages & crashes?
[Figure: data path between Application (A) and Application (B) through the TCP, IP and NIC layers; applications in user-space, protocol stack in kernel-space.]
TCP Failure Modes (II)
• IP is a best-effort, unreliable protocol
    • … so the TCP layer is the first place in the data path where it makes sense to even talk about guarantees
• The sender’s TCP layer can make no guarantee about segments that arrive at the receiver’s TCP layer
    • An arriving segment may be corrupted, or it may contain duplicate data, or it may be out of order …
• The receiver’s TCP layer guarantees to the sender’s TCP layer that any segment that it ACKs & all data that came before it have been correctly received
    • This does not mean that the data has been delivered to the application … or that it will ever be delivered!
    • For example, the receiving host may crash after the ACK but before delivery …
TCP Failure Modes (III)
• It also makes sense to talk about guarantees at application B (receiver)
    • There can be no guarantee that all data sent by application A will arrive
    • However, all data that does arrive will be in order and uncorrupted
• Avoid the attitude that “TCP will take care of everything”
• TCP is an end-to-end protocol, providing a reliable transport mechanism between peers …
    • The “peers” are the TCP layers of the sender & the receiver!
TCP Failure Modes (IV)
• Explicit acknowledgements
    • What does the client do if the server does not ACK receipt?
    • It may not be safe to simply resend a request …
• Network outage, peer crash, peer’s host crash
• When a problem occurs at an endpoint, there is generally no alternative path
    • The problem persists until it is repaired
• An intermediate router may send the originator an ICMP message indicating that the destination network or the host is unreachable
• OR: the sender eventually times out & resends the segments not ACKed. This continues until the sender gives up & drops the connection (~9 minutes).
    • Pending read -> ETIMEDOUT
    • Otherwise, the next write fails -> SIGPIPE or EPIPE
TCP Failure Modes (V)
• Peer crash:
    • Indistinguishable from the case of the peer calling close() and then exit()
    • The peer’s TCP layer issues a FIN segment
        • The FIN only says that the peer will send no more data; it does not necessarily imply that the peer is unwilling to receive more data …
    • Reception of the FIN may come at different execution states of the application
        • If the client is blocked, TCP has no way of notifying it
            • The next transmission generates a RST segment -> ECONNRESET
            • If the RST is ignored & more data is transmitted -> SIGPIPE
            • This may occur if the client performs >=2 consecutive write() calls without an intervening read()
            • Notification takes place only after the 2nd write()
        • If the client has a pending read(), it gets an immediate error indication (eg: read() returns EOF)
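
The "read() returns EOF" case can be observed on the loopback interface. In this sketch (helper names are hypothetical) the peer accepts the connection and closes it immediately; the client's recv() then returns an empty bytes object, which is how Python reports EOF on a socket. On the write side, Python normally surfaces EPIPE and ECONNRESET as the exceptions BrokenPipeError and ConnectionResetError rather than as a SIGPIPE signal.

```python
import socket
import threading

def peer_that_exits(listener: socket.socket) -> None:
    """Accept one connection and close it at once, as a terminating peer would."""
    conn, _ = listener.accept()
    conn.close()                 # the peer's TCP layer sends a FIN

def read_after_peer_close() -> bytes:
    """Return what recv() yields after the peer has closed the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    t = threading.Thread(target=peer_that_exits, args=(listener,))
    t.start()
    client = socket.create_connection(listener.getsockname())
    t.join()
    data = client.recv(1024)     # blocks until the FIN arrives, then returns b""
    client.close()
    listener.close()
    return data
```

An application that treats an empty result from recv() as "no data yet" instead of "peer is gone" will loop forever, so the EOF check belongs in every read loop.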
TCP Failure Modes (VI)
• Peer’s host crash:
    • The peer’s TCP cannot issue the FIN segment
    • Until recovery, this case cannot be distinguished from a network outage
        • The peer’s TCP no longer responds, but the sender keeps retransmitting …
    • Until either the host recovers, or the sender gives up the connection -> ETIMEDOUT
    • If the host reboots before the sender gives up, a retransmitted segment may arrive at the TCP layer … without it having knowledge of the connection -> RST
        • If the sender has a read() pending -> ECONNRESET
        • Else, the next write() results in a SIGPIPE signal
Behavior of Peers
• Checking for client termination
    • Heartbeats, timeouts for read operations, SO_KEEPALIVE option, …
• Checking for valid input
    • Buffer overflow errors
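
Two of the techniques listed above, read timeouts and SO_KEEPALIVE, can be enabled with standard socket options. A minimal sketch follows; the function name and the timeout value are illustrative assumptions.

```python
import socket

def make_watchful_socket(read_timeout: float) -> socket.socket:
    """Create a TCP socket with keep-alive probes and a read timeout enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection and report a dead peer.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Blocking calls such as recv() now raise a timeout error after this delay.
    sock.settimeout(read_timeout)
    return sock
```

Keep-alive probes typically start only after a long idle period chosen by the operating system, so an application-level heartbeat or a read timeout is usually needed when faster detection matters.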
We rely on DNS …
The Message-Passing Interface
Some of the most intuitive primitives of MPI.
Primitive      Meaning
MPI_bsend      Append outgoing message to a local send buffer
MPI_send       Send a message and wait until copied to local or remote buffer
MPI_ssend      Send a message and wait until receipt starts
MPI_sendrecv   Send a message and wait for reply
MPI_isend      Pass reference to outgoing message, and continue
MPI_issend     Pass reference to outgoing message, and wait until receipt starts
MPI_recv       Receive a message; block if there are none
MPI_irecv      Check if there is an incoming message, but do not block
Group Communication
• Multicasting: 1-to-many comm. pattern
• Applications:
    • replicated services (better fault tolerance)
    • discovery of services
    • replicated data (better performance)
    • propagation of event notifications
• Failure model: depends on implementation:
    • IP multicast (UDP datagrams): omission failures
        • class-D Inet addresses: “1110” bit prefix
        • TTL
    • reliable multicast
    • ordered multicast
        • FIFO, Causal, Total
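
The "1110" prefix of class-D addresses can be checked directly on the 32-bit address value. A small sketch, with a hypothetical function name:

```python
import ipaddress

def is_class_d(addr: str) -> bool:
    """True if the top four bits of the IPv4 address are 1110."""
    value = int(ipaddress.IPv4Address(addr))
    return (value >> 28) == 0b1110
```

This covers 224.0.0.0 through 239.255.255.255, the same range that the standard library reports through the is_multicast property of IPv4Address.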
Conventional Procedure Call
a) Parameter passing in a local procedure call: the stack before the call to read
b) The stack while the called procedure is active
Software layers
• Applications and services
• Middleware layers:
    • RPC and RMI
    • request-reply protocol; marshalling and external data representation
• UDP and TCP
RPC is more than a (transport) protocol: a structuring mechanism for distributed systems
Steps of a Remote Procedure Call
1. Client procedure calls client stub in normal way
2. Client stub builds message, calls local OS
3. Client's OS sends message to remote OS
4. Remote OS gives message to server stub
5. Server stub unpacks parameters, calls server
6. Server does work, returns result to the stub
7. Server stub packs it in message, calls local OS
8. Server's OS sends message to client's OS
9. Client's OS gives message to client stub
10. Stub unpacks result, returns to client
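
The ten steps can be mimicked in a single process. In this sketch the network and both operating systems are replaced by a plain function call (transport), and JSON stands in for the marshalling format; both choices, like all the names used, are assumptions made for illustration and are not how Sun RPC or DCE RPC encode messages.

```python
import json

def square(x: int) -> int:
    """The server procedure (step 6: the server does the work)."""
    return x * x

PROCEDURES = {"square": square}

def server_stub(request: bytes) -> bytes:
    """Steps 5-7: unpack parameters, call the server, pack the result."""
    call = json.loads(request.decode("utf-8"))
    result = PROCEDURES[call["proc"]](*call["args"])
    return json.dumps({"result": result}).encode("utf-8")

def transport(request: bytes) -> bytes:
    """Stand-in for steps 3-4 and 8-9 (the two operating systems and the network)."""
    return server_stub(request)

def client_stub(proc: str, *args):
    """Steps 2 and 10: build the request message, then unpack the reply."""
    request = json.dumps({"proc": proc, "args": list(args)}).encode("utf-8")
    reply = transport(request)
    return json.loads(reply.decode("utf-8"))["result"]
```

The client code (step 1) simply calls client_stub("square", 11) and gets 121 back, without seeing any message handling.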
Client and Server Stubs
Principle of RPC between a client & server program.
Example (Sun RPC - ONC)
• long square(long)
• example: client ren.eecis.udel.edu 11 -> result: 121
• Need RPC specification file (square.x): defines procedure name, arguments & results
• Run rpcgen square.x: generates square.h, square_clnt.c, square_xdr.c, square_svc.c
    • square_clnt.c & square_svc.c: stub routines for client & server
    • square_xdr.c: XDR (External Data Representation) code; takes care of data type conversions
RPC Specification File (square.x)

    struct square_in {
        long arg1;
    };

    struct square_out {
        long res1;
    };

    program SQUARE_PROG {
        version SQUARE_VERS {
            square_out SQUAREPROC(square_in) = 1;  /* procedure # */
        } = 1;                                     /* version # */
    } = 0x31230000;                                /* program # (must fit in 32 bits) */

IDL – Interface Definition Language
Parameter Specification & Stub Generation
[Figure: a procedure and its corresponding message.]
Writing a Client & a Server
The steps in writing a client & a server in DCE RPC.
Binding (SUN RPC)
• Port Mapper (rpcbind) listens at UDP port 111
• Server registers program ID & version
• rpcinfo -p -> display all registered RPC servers
• When client issues clnt_create, the port mapper is contacted:
    • program-to-port number mapping
    • arguments: (program ID, version, protocol)
    • response: server’s port number
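
The program-to-port mapping can be pictured as a lookup table keyed by (program ID, version, protocol). The sketch below is an in-memory toy, not the real port mapper protocol; the class name, the program number 0x20000001 and the port 40123 are hypothetical.

```python
class PortMapper:
    """Toy in-memory registry: (program, version, protocol) -> port."""

    def __init__(self) -> None:
        self.table = {}

    def register(self, program: int, version: int, protocol: str, port: int) -> None:
        """What a server does at start-up."""
        self.table[(program, version, protocol)] = port

    def lookup(self, program: int, version: int, protocol: str) -> int:
        """What the client side does before connecting; 0 means 'not registered' in this sketch."""
        return self.table.get((program, version, protocol), 0)
```

A server would call register(0x20000001, 1, "tcp", 40123) once, and a client asking for the same triple would get 40123 back and connect to that port.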
Binding (DCE)
Passing Value Parameters (I)
Passing Value Parameters (II)
a. Original message on Pentium (little-endian)
b. The message after receipt on SPARC (big-endian)
c. The message after being inverted.
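
The three cases (a, b, c) can be reproduced with the struct module. The message contents used here, the integer 5 followed by the 4-character string "JILL", are illustrative values chosen for this sketch, and swap_words is a hypothetical helper.

```python
import struct

def swap_words(message: bytes) -> bytes:
    """Reverse the bytes of every 4-byte word, as a naive 'inversion' would."""
    return b"".join(message[i:i + 4][::-1] for i in range(0, len(message), 4))

# a. Message as laid out by a little-endian sender: integer, then string.
message = struct.pack("<I4s", 5, b"JILL")

# b. A big-endian receiver reads the same bytes: the string is fine,
#    but the integer comes out as 5 * 2**24.
wrong_int, text = struct.unpack(">I4s", message)

# c. Inverting every word repairs the integer but garbles the string.
fixed_int, garbled = struct.unpack(">I4s", swap_words(message))
```

Here wrong_int is 83886080 while fixed_int is 5, and garbled is b"LLIJ": blind byte swapping cannot work without knowing which fields are integers and which are strings.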
Passing Value Parameters (III)
• How to pass pointers?
    • Meaningful only within a specific address space!
• Arrays (of known length) & structures:
    • Copy/restore semantics (bet. stubs)
    • IN/OUT/INOUT markers
        • Optimization: may eliminate one copy operation
• Pointer to an arbitrary data structure?
    • No general solution
    • Work-around: pass back the pointer to its “source”
External Data Representation (I)
• Data structures:
    • “flattened” on transmission
    • rebuilt upon reception
• Primitive data types:
    • byte order (big-endian: MSB comes first)
    • ASCII vs UNICODE (2 bytes per character)
• marshalling/unmarshalling
    • to/from agreed external format
External Data Representation (II)
• XDR (RFC 1832), CDR (CORBA), Java:
    • data -> byte stream
    • object references
• HTTP/MIME:
    • data -> ASCII text
Object reference fields: IP address | port | time | object ID | interface ID
CORBA CDR example:
index in sequence of bytes   4 bytes   notes on representation
0–3                          5         length of string
4–7                          "Smit"    ‘Smith’
8–11                         "h___"
12–15                        6         length of string
16–19                        "Lond"    ‘London’
20–23                        "on__"
24–27                        1934      unsigned long
The flattened form represents a Person struct with value: {‘Smith’, ‘London’, 1934}
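
The byte layout in the table can be reproduced with a few lines of code. This sketch follows the slide's simplified layout only and is not a full CDR encoder: big-endian byte order is an assumption, the underscore used as padding mirrors the slide's notation for filler bytes, and the function names are hypothetical.

```python
import struct

def pad4(data: bytes) -> bytes:
    """Pad with '_' filler bytes up to the next multiple of 4."""
    return data + b"_" * (-len(data) % 4)

def marshal_string(text: str) -> bytes:
    """4-byte length followed by the characters, padded to a 4-byte boundary."""
    raw = text.encode("ascii")
    return struct.pack(">I", len(raw)) + pad4(raw)

def marshal_person(name: str, place: str, year: int) -> bytes:
    """Flatten the Person struct field by field, in declaration order."""
    return marshal_string(name) + marshal_string(place) + struct.pack(">I", year)
```

For {‘Smith’, ‘London’, 1934} the result is 28 bytes long, with the two length words at offsets 0 and 12 and the year at offset 24, matching the index column of the table.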
Properties of TCP
• Connected vs Connectionless Protocols
• TCP is a stream protocol
• Performance of TCP
• Avoid re-inventing TCP!!
• TCP failure modes
• Behaviour of peers
• LAN vs WAN testing
• Tools & Resources
Basic socket calls
[Figure: sequence of socket calls on each side.]
SERVER: socket -> bind (localhost sockaddr_in) -> listen -> accept (peer sockaddr_in) -> recv/send
CLIENT: socket -> connect (peer sockaddr_in) -> recv/send
Performance of TCP (I)
• 4.4BSD implementation:
    • UDP: ~800 LOC
    • TCP: ~4,500 LOC
• CPU processing: checksums, data copying
• TCP ACKs:
    • Receiver can piggyback the ACK
    • Usually every second segment is ACKed …
    • May even delay ACKs (up to 0.5 sec)
• Connection setup: 3 segments
    • 1 ½ RTT: SYN, SYN+ACK, ACK
• Connection tear-down: 4 segments
    • FIN, ACK, FIN (server-to-client), ACK
    • Except the last segment, these can be combined with data-bearing segments
Performance of TCP (II)
• Results from a benchmark involving transmission of 5,000 data blocks
    • UDP datagram size = TCP write size = 1,440 bytes
        • Ethernet frame: 1,500 bytes
        • IP header: 20 bytes, TCP header: 20 bytes
        • TCP options: 12 bytes
    • Average over 50 runs
• Client produces data blocks, transmits them, and then exits
• Server may run on:
    • localhost (127.0.0.1)
    • Same host as the client, but given as an address
    • Other host
Performance of TCP (III)
             TCP               UDP
Server       time    MB/sec    time    MB/sec    drops
Client       2.88    2.5       1.96    3.67      336
Localhost    0.95    7.53      1.97    3.64      272
Remote       7.18    1.002     5.82    1.23      440
• Localhost (loop-back): MTU=16,384
• Client (network I/f): MTU=1,500
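
Most MB/sec entries in the table above appear consistent with dividing the total volume (5,000 blocks of 1,440 bytes) by the elapsed time. The sketch below makes that arithmetic explicit, assuming times in seconds and 1 MB = 1,000,000 bytes; neither unit is stated on the slide, and the function name is hypothetical.

```python
def throughput_mb_per_sec(blocks: int, block_size: int, seconds: float) -> float:
    """Total bytes transferred divided by elapsed time, in units of 10**6 bytes/sec."""
    return blocks * block_size / seconds / 1_000_000
```

For example, throughput_mb_per_sec(5000, 1440, 2.88) gives 2.5, which matches the TCP entry of the "Client" row.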
Performance of TCP (IV)
             TCP               UDP
Server       time    MB/sec    time    MB/sec    drops
Client       1.05    1.41      1.63    0.91      212
Remote       1.55    0.965     1.91    0.78      306
Results for write size=300 bytes
Avoid re-inventing TCP !!
• Retransmissions?
    • RTO
        • Must be adjustable
        • Exponential back-off
• Flow control
    • Sliding window
• Congestion control
• Matching replies to requests?
    • Sequence # for each request
• Efficiency of the implementation?
    • TCP code base is highly optimized … and runs in kernel-space
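
An application that builds reliability on top of UDP has to implement at least a retransmission timer with exponential back-off. A minimal sketch of such a schedule follows; the function name, the doubling factor and the cap are illustrative assumptions, and real TCP additionally adapts the initial RTO to measured round-trip times.

```python
def rto_schedule(initial_rto: float, max_rto: float, retries: int) -> list:
    """Timeouts to wait before each retransmission, doubling up to a cap."""
    schedule = []
    rto = initial_rto
    for _ in range(retries):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)   # exponential back-off, bounded by max_rto
    return schedule
```

Starting at 1 second with a 64-second cap, eight retries wait 1, 2, 4, 8, 16, 32, 64 and 64 seconds.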
LAN vs WAN testing
• Performance on the WAN may not be satisfactory, due to the extra latency
    • … may have to reconsider the design
• Incorrect code is more likely to be triggered on the WAN
    • … assumptions on volume/rate of arriving data
HTTP
Request message:
    method | URL or pathname                | HTTP version | headers | message body
    GET    | //www.dcs.qmw.ac.uk/index.html | HTTP/1.1     |         |
Reply message:
    HTTP version | status code | reason | headers | message body
    HTTP/1.1     | 200         | OK     |         | resource data
•Resource := MIME-encoded data
•Content negotiation
•Authentication
Methods:
•GET, HEAD, POST
•PUT, DELETE, TRACE, OPTIONS
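
The request and reply layouts above can be made concrete with two small helpers. This is a sketch with hypothetical function names: it only builds and parses text and does not contact any server, and the Host and Connection headers are additions assumed here for a well-formed HTTP/1.1 request.

```python
def build_get(host: str, path: str) -> bytes:
    """Request line, headers, then an empty line that ends the header section."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines).encode("ascii")

def parse_status_line(line: str):
    """Split a reply's first line into HTTP version, status code and reason."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason
```

Parsing "HTTP/1.1 200 OK" yields the version "HTTP/1.1", the integer status code 200 and the reason "OK"; splitting at most twice keeps multi-word reasons such as "Not Found" intact.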
Tools (I)
• ping
    • IP header + ICMP echo request/reply
• tcpdump
    • Network analyzer – “sniffer”
• traceroute
    • Determine the network path by forcing each intermediate router to send an ICMP error message to the originator
        • Send a UDP datagram with TTL=1, so that the 1st router in the path will discard it!
        • Send a 2nd UDP datagram with TTL=2, so that the 2nd router in the path will discard it!
        • …
        • At the last hop, TTL=1 & an attempt is made to deliver the datagram (generating the ICMP error message “port unreachable”)
Tools (II)
• ttcp
    • Benchmarking tool, with many parameters
    • UDP or TCP transfers, buffers, size of reads/writes
• lsof
    • Determine which process has a “file descriptor” open (file or socket)
    • lsof -i TCP:6000
    • lsof -i @remotehost.xdomain.net
• netstat
    • Active sockets: netstat -af inet
    • Interfaces: netstat -i
    • Routing table: netstat -rn
    • Protocol statistics: netstat -sp tcp
• System call tracers: strace, truss, ktrace
Resources
• Books:
    • Richard Stevens:
        • TCP/IP Illustrated series
            • Protocols, Implementation, T/TCP/HTTP/NNTP/Domain Sockets
        • UNIX Network Programming series
            • Networking APIs: Sockets, XTI
            • Interprocess Communication
    • J.C. Snader: “Effective TCP/IP Programming”
• RFCs: http://www.rfc-editor.org