![Page 1: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/1.jpg)
Implementation of TCP/IP in Linux (kernel 2.2)
Rishi Sinha
![Page 2: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/2.jpg)
Goals
To help you implement your customized stack by identifying key points of the code structure
To point out some tricks and optimizations that evolved after 4.3BSD and that are part of Linux TCP/IP code
![Page 3: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/3.jpg)
TCP/IP source code
/usr/src/linux/net/
All relative pathnames in this document are relative to /usr/src/linux/
http://lxr.linux.no cross-references all the Linux kernel code
You can install and run it locally; I haven’t tried
![Page 4: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/4.jpg)
The various layers (yawn…)
IP
TCP/UDP
INET socket
BSD socket
Appletalk IPX
(Physical)
(Link)
![Page 5: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/5.jpg)
Address families supported
include/linux/socket.h
UNIX: Unix domain sockets
INET: TCP/IP
AX25: Amateur radio
IPX: Novell IPX
APPLETALK: Appletalk
X25: X.25
More; about 24 in all
![Page 6: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/6.jpg)
Setting things up – socket-side
How the INET address family registers itself with the BSD socket layer
![Page 7: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/7.jpg)
struct socket
BSD socket
short type – SOCK_DGRAM, SOCK_STREAM
struct proto_ops *ops – TCP/UDP operations for this socket; bind, close, read, write etc.
struct inode *inode – the file inode associated with this socket
struct sock *sk – the INET socket associated with this socket
![Page 8: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/8.jpg)
BSD socket
INET socket? Operations to use? (How to create socket?)
No connections
![Page 9: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/9.jpg)
struct sock
INET socket
struct socket *socket – associated BSD socket
struct sock *next, **pprev – socks are in linked lists
struct dst_entry *dst_cache – pointer to the route cache entry used by this socket
struct sk_buff_head *receive_queue – head of the receive queue
struct sk_buff_head *write_queue – head of the send queue
![Page 10: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/10.jpg)
struct sock continued
__u32 daddr – foreign IP address
__u32 rcv_saddr – bound local IP address
__u16 dport – destination port
unsigned short num – local port
struct proto *prot – contains TCP/UDP specific operations (repetition with struct socket’s ops field)
![Page 11: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/11.jpg)
INET socket
Reaching transport layer?
BSD socket?
No connections
![Page 12: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/12.jpg)
protocols vector
Array of struct net_proto, each entry holding a name (say INET, UNIX, IPX) and an initialization function (say inet_proto_init)
This protocols array is static in net/protocols.c
This file uses conditional compilation to include protocols as chosen in make config
![Page 13: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/13.jpg)
inet_proto_init
The protocols vector is traversed at system init time, and each init function is called
Each of these protocol init functions registers itself with BSD sockets by giving its name and socket create function
Where does the BSD socket layer store this information?
![Page 14: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/14.jpg)
net_families
BSD socket layer stores info for each registering protocol in this array
This is an array of struct net_proto_family, which has
int family
int (*create)(struct socket *sock, int protocol)
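The registration and lookup described above can be sketched as a small table indexed by family number. This is a simplified stand-in, not the kernel source: `NPROTO` as a table size, `register_family()`, `create_socket()`, `demo_inet_create()` and the one-field `struct socket` are assumptions made for illustration (the kernel's own registration entry point is `sock_register()`).

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO 32                      /* table size: an assumption for this sketch */

struct socket { short type; };         /* stand-in for the BSD socket */

/* Shape given on the slide: a family number plus a create hook. */
struct net_proto_family {
    int family;
    int (*create)(struct socket *sock, int protocol);
};

static struct net_proto_family *net_families[NPROTO];

/* Store the entry in the slot indexed by its family number. */
static int register_family(struct net_proto_family *fam)
{
    if (fam->family < 0 || fam->family >= NPROTO)
        return -1;
    net_families[fam->family] = fam;
    return 0;
}

/* What socket() does at this layer: find the family, call its hook. */
static int create_socket(int family, struct socket *sock, int protocol)
{
    if (family < 0 || family >= NPROTO || net_families[family] == NULL)
        return -1;                     /* address family never registered */
    return net_families[family]->create(sock, protocol);
}

/* Hypothetical create hook standing in for inet_create(). */
static int demo_inet_create(struct socket *sock, int protocol)
{
    (void)protocol;
    assert(sock != NULL);
    return 0;
}

/* Family number 2 is AF_INET on Linux. */
static struct net_proto_family demo_inet_family = { 2, demo_inet_create };
```

Until `register_family(&demo_inet_family)` has run, `create_socket(2, ...)` fails, which mirrors why the init functions must run before any socket() call can succeed.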
![Page 15: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/15.jpg)
BSD socket layer now has
INET
inet_create()
IPX
ipx_create()
UNIX
unix_create()
![Page 16: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/16.jpg)
So in socket() call
BSD socket layer looks for specified address family, say INET
BSD socket layer calls create function for that family, say inet_create()
inet_create() does switch (BSD_socket->type)
case SOCK_DGRAM: fill BSD_socket->ops with UDP operations
case SOCK_STREAM: fill BSD_socket->ops with TCP operations
![Page 17: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/17.jpg)
Socket layer is satisfied
BSD socket: AF_INET, SOCK_STREAM
INET socket
TCP’s proto_ops
Write queue
Receive queue
Lots of other TCP data
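A minimal sketch of the type switch that leaves the BSD socket holding the right operations table. The names `demo_inet_create`, `demo_tcp_ops`, `demo_udp_ops` and the `DEMO_SOCK_*` constants are hypothetical, and `struct proto_ops` is reduced to a label here; the real table holds function pointers for bind, close, read, write and so on.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_SOCK_STREAM = 1, DEMO_SOCK_DGRAM = 2 };

struct proto_ops { const char *name; };    /* real table: function pointers */

struct socket {
    short type;
    const struct proto_ops *ops;
};

static const struct proto_ops demo_tcp_ops = { "tcp" };
static const struct proto_ops demo_udp_ops = { "udp" };

/* Pick the operations table from the socket type, as the slide's switch does. */
static int demo_inet_create(struct socket *sock, int protocol)
{
    (void)protocol;
    assert(sock != NULL);
    switch (sock->type) {
    case DEMO_SOCK_STREAM:
        sock->ops = &demo_tcp_ops;
        break;
    case DEMO_SOCK_DGRAM:
        sock->ops = &demo_udp_ops;
        break;
    default:
        return -1;                         /* type not handled in this sketch */
    }
    return 0;
}
```

After this call, every later operation on the socket goes through `sock->ops` without the BSD layer needing to know which transport sits underneath.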
![Page 18: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/18.jpg)
Reaching sockets through file descriptors
Per process file table > inode > BSD socket etc.
Not describing here
![Page 19: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/19.jpg)
Setting things up – device side
How network interfaces come up and attach themselves to the stack
![Page 20: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/20.jpg)
No connections
Network interface card
What is my name (since I don’t have a /dev file)?
Give packets to whom?
![Page 21: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/21.jpg)
struct device
No device file for network devices
Why? Design choice, probably because network devices “push” data
Each interface is represented by a struct device
All struct devices are chained and the chain head is called dev_base
![Page 22: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/22.jpg)
struct device continued
char *name – say eth0
unsigned long base_addr – I/O address
unsigned int irq – IRQ number
struct device *next
int (*init)(struct device *dev)
int (*hard_start_xmit)(struct sk_buff *skb, struct device *dev) – transmission function
![Page 23: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/23.jpg)
dev_base
drivers/net/Space.c cleverly threads struct devices for all possible interfaces into a list starting at dev_base (static data structure declaration, no code execution yet)
The list includes a limited number of devices of each type, i.e. eth0 to eth7 and no more
![Page 24: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/24.jpg)
ethif_probe()
For each of these 8 struct devices, names are eth0 to eth7 and the init function is ethif_probe()
During system init time the list of struct devices is traversed, and the init function is called for each
So ethif_probe() is called for eth0; it calls probe_list()
![Page 25: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/25.jpg)
probe_list()
probe_list() goes through a list of all ethernet devices the system has drivers for
The probe function for each driver is called in turn
If one succeeds, the proper function pointers from the driver code are assigned to this struct device (ethx)
If all fail, no more eth devices exist; this struct device is removed from the list and the function returns
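The statically threaded list and the probe pass over it can be sketched as follows. Everything prefixed `demo_` is hypothetical, the two fake drivers each pretend that exactly one card is installed, and only four slots are declared instead of eight to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff;                            /* opaque in this sketch */

struct device {
    const char *name;
    struct device *next;
    int (*init)(struct device *dev);
    int (*hard_start_xmit)(struct sk_buff *skb, struct device *dev);
};

typedef int (*probe_fn)(struct device *dev);

static int card_a_taken, card_b_taken;     /* pretend hardware state */

static int demo_xmit_a(struct sk_buff *skb, struct device *dev)
{ (void)skb; (void)dev; return 0; }
static int demo_xmit_b(struct sk_buff *skb, struct device *dev)
{ (void)skb; (void)dev; return 0; }

static int demo_probe_a(struct device *dev)
{
    if (card_a_taken) return -1;
    card_a_taken = 1;
    dev->hard_start_xmit = demo_xmit_a;    /* driver fills in its functions */
    return 0;
}

static int demo_probe_b(struct device *dev)
{
    if (card_b_taken) return -1;
    card_b_taken = 1;
    dev->hard_start_xmit = demo_xmit_b;
    return 0;
}

static const probe_fn demo_probes[] = { demo_probe_a, demo_probe_b };

/* Try every known driver; the first one that finds a card claims the slot. */
static int demo_probe_list(struct device *dev)
{
    for (size_t i = 0; i < sizeof demo_probes / sizeof demo_probes[0]; i++)
        if (demo_probes[i](dev) == 0)
            return 0;
    return -1;
}

/* Slots linked by static initializers, declared back to front. */
static struct device eth3_dev = { "eth3", NULL,      demo_probe_list, NULL };
static struct device eth2_dev = { "eth2", &eth3_dev, demo_probe_list, NULL };
static struct device eth1_dev = { "eth1", &eth2_dev, demo_probe_list, NULL };
static struct device eth0_dev = { "eth0", &eth1_dev, demo_probe_list, NULL };
static struct device *dev_base = &eth0_dev;

/* Init-time pass: call each init function, unlink slots whose probe failed. */
static void demo_init_devices(void)
{
    struct device **pp = &dev_base;
    while (*pp != NULL) {
        struct device *dev = *pp;
        assert(dev->init != NULL);
        if (dev->init(dev) != 0)
            *pp = dev->next;               /* no card for this slot: drop it */
        else
            pp = &dev->next;
    }
}
```

With two fake cards present, the pass leaves only eth0 and eth1 on the list, each carrying the transmit function of the driver that claimed it.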
![Page 26: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/26.jpg)
After all devices in Space.c traversed through
lo0
eth0, 3Com card
eth1, HP card
functions from 3com driver
functions from HP driver
Give packets to whom?
dev_base
![Page 27: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/27.jpg)
Modularized driver
Much simpler, because the driver’s probe is executed at module load time
If it finds a device, it appends a struct device to the end of the dev_base list
![Page 28: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/28.jpg)
backlog queue
Very very distinct from socket listen backlog queue!
Systemwide queue that interfaces immediately drop packets onto
Device driver writers simply call netif_rx(), which does the actual queueing
![Page 29: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/29.jpg)
Link layer is satisfied
lo0
eth0, 3Com card
eth1, HP card
functions from 3com driver
functions from HP driver
dev_base
backlog queue
![Page 30: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/30.jpg)
Setting things up – between link and network layers
How packets reach the correct protocol stack
![Page 31: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/31.jpg)
No connections
backlog queue
IP? ARP? IPX? BOOTP?
Who takes packets off the backlog queue?
Who gets these packets?
![Page 32: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/32.jpg)
net_bh()
Bottom-half handler for the network interrupt
Executes when network interrupt is not masked
So the fast handler (actual ISR) is driver code that calls netif_rx() to queue the packet onto the backlog queue, and marks net_bh() for execution
net_bh() takes packets off the backlog and passes them to the protocol specified in the ethernet header
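The split between the fast handler and the bottom half can be sketched with a plain FIFO and a pending flag. This is a single-threaded illustration with no locking or interrupt masking; `demo_netif_rx()`, `demo_net_bh()`, the `net_bh_pending` flag (standing in for marking the bottom half) and the two counters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff {
    struct sk_buff *next;
    int id;
};

/* System-wide FIFO shared by all interfaces. */
static struct sk_buff *backlog_head, *backlog_tail;
static int net_bh_pending;     /* set when the bottom half has work to do */
static int delivered;          /* packets handed upward so far */
static int last_id;            /* id of the most recently delivered packet */

/* Fast path, called from the driver's ISR: enqueue and return at once. */
static void demo_netif_rx(struct sk_buff *skb)
{
    assert(skb != NULL);
    skb->next = NULL;
    if (backlog_tail != NULL)
        backlog_tail->next = skb;
    else
        backlog_head = skb;
    backlog_tail = skb;
    net_bh_pending = 1;
}

/* Deferred path: drain the backlog in arrival order. */
static void demo_net_bh(void)
{
    net_bh_pending = 0;
    while (backlog_head != NULL) {
        struct sk_buff *skb = backlog_head;
        backlog_head = skb->next;
        if (backlog_head == NULL)
            backlog_tail = NULL;
        last_id = skb->id;     /* a real handler would dispatch by protocol */
        delivered++;
    }
}
```

The point of the design is that the interrupt handler does constant work per packet, while all protocol processing happens later with interrupts enabled.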
![Page 33: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/33.jpg)
ptype_base
ptype_base is the head of a list of possible packet types the link layer may receive (IP, ARP, IPX, BOOTP, etc.) that the system can handle
How is it built?
For every protocol in the protocols vector, when its init function is called (inet_proto_init), it calls functions like ip_init(), tcp_init() and arp_init()
![Page 34: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/34.jpg)
dev_add_pack completes the picture
Those subprotocols interested in registering a packet type (IP, ARP) get their init functions (ip_init(), arp_init()) to call dev_add_pack(), specifying a handler function
This adds the packet type to ptype_base
So net_bh() hands off packets to the right protocol stack
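A minimal sketch of packet-type registration and dispatch, following the slide's description of ptype_base as a single list. The `demo_` functions are hypothetical, the handler signature is shortened to one argument, and the handlers only return a marker value; 0x0800 and 0x0806 are the EtherType values for IPv4 and ARP.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff { unsigned short protocol; };   /* EtherType of the frame */

struct packet_type {
    unsigned short type;                       /* EtherType this entry wants */
    int (*func)(struct sk_buff *skb);          /* handler for matching frames */
    struct packet_type *next;
};

static struct packet_type *ptype_base;

/* Link a new packet type at the head of the list. */
static void demo_dev_add_pack(struct packet_type *pt)
{
    assert(pt != NULL && pt->func != NULL);
    pt->next = ptype_base;
    ptype_base = pt;
}

/* What the bottom half does per packet: find the handler by type. */
static int demo_deliver(struct sk_buff *skb)
{
    for (struct packet_type *pt = ptype_base; pt != NULL; pt = pt->next)
        if (pt->type == skb->protocol)
            return pt->func(skb);
    return -1;                                 /* nobody registered: drop */
}

/* Hypothetical handlers standing in for ip_rcv() and arp_rcv(). */
static int demo_ip_rcv(struct sk_buff *skb)  { (void)skb; return 4; }
static int demo_arp_rcv(struct sk_buff *skb) { (void)skb; return 6; }

static struct packet_type demo_ip_ptype  = { 0x0800, demo_ip_rcv,  NULL };
static struct packet_type demo_arp_ptype = { 0x0806, demo_arp_rcv, NULL };
```

A frame whose type was never registered, for example IPX on a system without the IPX stack, falls through the loop and is dropped.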
![Page 35: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/35.jpg)
Setting things up – between network and transport layers
How packets reach the correct transport protocol
![Page 36: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/36.jpg)
inet_protos
An array of transport layer protocols in INET
Built at the time of inet_proto_init()
By calling inet_add_protocol() for every transport protocol
Registers handlers for transport protocols
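The transport-level table can be sketched as an array indexed by a hash of the IP protocol number, with a short chain per slot. The table size of 32, the chaining scheme, and all `demo_` names are assumptions of this sketch; 6 and 17 are the IP protocol numbers for TCP and UDP.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_INET_PROTOS 32        /* power of two; assumed for this sketch */

struct sk_buff { unsigned char ip_protocol; };   /* from the IP header */

struct inet_protocol {
    int (*handler)(struct sk_buff *skb);
    struct inet_protocol *next;        /* chain for protocols sharing a slot */
    unsigned char protocol;            /* IP protocol number */
};

static struct inet_protocol *inet_protos[DEMO_MAX_INET_PROTOS];

/* Register a transport protocol's receive handler. */
static void demo_inet_add_protocol(struct inet_protocol *p)
{
    unsigned hash = p->protocol & (DEMO_MAX_INET_PROTOS - 1);
    assert(p->handler != NULL);
    p->next = inet_protos[hash];
    inet_protos[hash] = p;
}

/* Local delivery: hand the packet to the handler for its protocol number. */
static int demo_ip_local_deliver(struct sk_buff *skb)
{
    unsigned hash = skb->ip_protocol & (DEMO_MAX_INET_PROTOS - 1);
    for (struct inet_protocol *p = inet_protos[hash]; p != NULL; p = p->next)
        if (p->protocol == skb->ip_protocol)
            return p->handler(skb);
    return -1;                         /* no transport registered */
}

/* Hypothetical handlers standing in for tcp_v4_rcv() and udp_rcv(). */
static int demo_tcp_rcv(struct sk_buff *skb) { (void)skb; return 6; }
static int demo_udp_rcv(struct sk_buff *skb) { (void)skb; return 17; }

static struct inet_protocol demo_tcp_protocol = { demo_tcp_rcv, NULL, 6 };
static struct inet_protocol demo_udp_protocol = { demo_udp_rcv, NULL, 17 };
```

Protocol number 38 hashes to the same slot as TCP in this table, so the lookup compares the full protocol number before calling a handler.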
![Page 37: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/37.jpg)
Packet movement through stack
Transmission and reception, queues, interrupts
![Page 38: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/38.jpg)
struct sk_buff
Each packet that arrives on the wire is encased in a buffer called sk_buff
An sk_buff is just the data with a lot of additional information about the packet
There is a one-to-one relationship between packets and sk_buffs, i.e. one packet, one buffer
sk_buffs can be allocated in multiples of 16 bytes
![Page 39: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/39.jpg)
struct sk_buff continued
INET sock queues are queues of sk_buffs
Data coming from the socket calls is copied into sk_buffs
Data arriving from the network is copied into sk_buffs
sk_buff picture with fields
![Page 40: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/40.jpg)
struct sk_buff continued
![Page 41: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/41.jpg)
Queues
backlog queue
INET sock queues
TCP has a number of queues for out-of-order, connection backlog, error packets (?)
![Page 42: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/42.jpg)
Packet reception
Packet received by hardware
Receive interrupt generated
Driver handler copies data from hardware into fresh sk_buff
Calls netif_rx() to queue on backlog
Schedules net_bh() with mark_bh(NET_BH)
net_bh() executes the next time the scheduler is run or a system call returns or a slow interrupt handler returns
![Page 43: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/43.jpg)
Packet reception continued
net_bh() tries to send any pending packets, then dequeues packets from the backlog and passes them to the correct handler, say ip_rcv()
ip_rcv() may call ip_local_deliver() or ip_forward()
ip_local_deliver() results in a call to tcp_v4_rcv() through the inet_protos list
tcp_v4_rcv() queues data at the correct socket’s queue
![Page 44: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/44.jpg)
Packet reception continued
When the socket’s owner reads, tcp_recvmsg() is invoked through BSD socket’s proto_ops
If instead the socket’s owner had blocked on a read, that process will be woken using wake_up (wait queue)
![Page 45: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/45.jpg)
Packet transmission
Quite different for TCP and UDP in terms of copying of user data to kernel space
TCP does its own checksumming, while IP does checksumming for UDP. Why? Next section.
net_bh() again takes care of flushing out packets that have piled up at the device’s queue
![Page 46: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/46.jpg)
Tricks and optimizations
TCP/IP enhancements, most due to Van Jacobson, arrived after 4.3BSD
![Page 47: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/47.jpg)
Checksum and copy
![Page 48: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/48.jpg)
Checksum and copy continued
Linux goes over every byte of data only once (if the packet does not get fragmented)
Uses checksum_and_copy()
TCP data from socket gets filled into MSS-sized segments by TCP, so checksum-copying happens here
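The single-pass idea can be sketched as one loop that copies each byte and adds it to the Internet checksum (RFC 1071) at the same time. This is a portable byte-wise illustration, not the kernel's optimized routine; the names `demo_csum_and_copy()` and `demo_csum_fold()` are hypothetical, and the 32-bit accumulator is assumed large enough for packet-sized inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst while adding them, taken as big-endian
 * 16-bit words, to the running sum. Returns the updated, unfolded sum. */
static uint32_t demo_csum_and_copy(const uint8_t *src, uint8_t *dst,
                                   size_t len, uint32_t sum)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                     /* odd trailing byte, padded with zero */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}

/* Fold the carries back in and take the one's complement. */
static uint16_t demo_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the example bytes used in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the unfolded sum is 0x2ddf0 and the folded checksum is 0x220d. Because the sum is returned unfolded, a caller can keep feeding it further chunks before folding once at the end.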
![Page 49: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/49.jpg)
Checksum and copy continued
INET Socket(struct sock)
write_queue
User Buffer (ubuff)
sk_buff structure
partially used sk_buff
newly allocated sk_buff
![Page 50: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/50.jpg)
Checksum and copy continued
UDP, on the other hand, does not stuff anything into MSS-sized buffers, so there is no need to copy data from user space at the UDP layer
UDP passes data and a callback function to IP
IP copies this data into an sk_buff, using the callback function, which is a checksum_and_copy function
Large ping replies from a Linux host arrive in reverse order of fragments! Why?
![Page 51: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/51.jpg)
This fragment leaves first, the partial checksum for its data calculated and remembered
This fragment leaves second, its checksum added to the partial checksum
This fragment leaves last, so that final checksum can be written into the UDP header
UDP datagram
UDP header
Why UDP fragmentation happens in reverse order
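The figure's argument can be sketched in code: walk the datagram from the last fragment to the first, adding each fragment's bytes to a running sum as it is "sent", so the complete checksum is known by the time the first fragment, the one carrying the UDP header, is built. This sketch covers only the payload sum; the UDP header and pseudo-header are left out, and `demo_send_reversed()` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, taken as big-endian 16-bit words, to a running sum. */
static uint32_t demo_csum_add(const uint8_t *p, size_t len, uint32_t sum)
{
    size_t i = 0;
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8;    /* odd trailing byte, zero padded */
    return sum;
}

static uint16_t demo_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Process fragments of size frag from the tail of the datagram to its head.
 * order[] records the fragment indices in the order they would leave. */
static uint16_t demo_send_reversed(const uint8_t *data, size_t len, size_t frag,
                                   size_t *order, size_t max_frags, size_t *sent)
{
    /* Fragment payloads other than the last are multiples of 8 bytes. */
    assert(frag > 0 && frag % 8 == 0);
    size_t n = (len + frag - 1) / frag;
    uint32_t sum = 0;
    assert(n <= max_frags);
    *sent = 0;
    for (size_t i = n; i-- > 0; ) {
        size_t off = i * frag;
        size_t flen = (len - off < frag) ? len - off : frag;
        sum = demo_csum_add(data + off, flen, sum);   /* one pass per byte */
        order[(*sent)++] = i;
    }
    return demo_csum_fold(sum);        /* ready when fragment 0 goes out */
}
```

Since every fragment starts at an even offset and one's-complement addition is commutative, the result equals the checksum computed over the whole buffer front to back.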
![Page 52: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/52.jpg)
Fixed size buffer, sk_buff
mbufs were potentially very clumsy
“There is exactly one, contiguous, packet per pbuf (none of that mbuf chain stupidity).” Van Jacobson
Allocation of fixed size buffers at the transport layer implies knowledge of network and link layer header sizes
Linux is not shy of such indiscretions
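A minimal sketch of why a contiguous buffer forces the transport layer to know the lower-layer header sizes: headroom is reserved up front, the payload is appended behind it, and each layer then prepends its header into the reserved space without any copying. The `demo_skb` type and functions are hypothetical and only loosely modeled on the kernel's `skb_reserve()`, `skb_put()` and `skb_push()`; the 256-byte size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 256              /* arbitrary size for this sketch */

struct demo_skb {
    unsigned char buf[DEMO_BUF_SIZE];
    unsigned char *data;               /* start of valid bytes */
    unsigned char *tail;               /* one past the end of valid bytes */
};

/* Leave headroom bytes free in front for headers added later. */
static void demo_skb_init(struct demo_skb *skb, size_t headroom)
{
    assert(headroom <= DEMO_BUF_SIZE);
    skb->data = skb->tail = skb->buf + headroom;
}

/* Append payload behind the current data. */
static void demo_skb_append(struct demo_skb *skb, const void *src, size_t len)
{
    assert((size_t)(skb->buf + DEMO_BUF_SIZE - skb->tail) >= len);
    memcpy(skb->tail, src, len);
    skb->tail += len;
}

/* Claim len bytes in front of the data for a header; returns its start. */
static unsigned char *demo_skb_push(struct demo_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->buf) >= len);
    skb->data -= len;
    return skb->data;
}
```

With 20-byte TCP and IP headers (no options) and a 14-byte Ethernet header, the transport layer would reserve 54 bytes before copying in user data; each of the three pushes then lands exactly in the space set aside for it.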
![Page 53: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/53.jpg)
Incremental checksum updates
At every hop, TTL changes (is decremented)
But the IP checksum covers the header, and therefore the TTL also
So it needs to be recalculated at every hop
Linux does this in one step
RFCs 1071, 1141 and 1624 discuss both copy_and_checksum and this incremental checksum update
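The one-step update can be worked through with the formula from RFC 1624, HC' = ~(~HC + ~m + m'), where m is the old 16-bit word and m' the new one. The sketch below applies it to the TTL/protocol word of an IPv4 header held as raw bytes; the function names are hypothetical, and this is the general RFC form rather than the kernel's own shortcut.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full checksum over a header whose checksum field is currently zero. */
static uint16_t demo_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* RFC 1624, equation 3: update a checksum when one word changes. */
static uint16_t demo_csum_update(uint16_t check, uint16_t old_word,
                                 uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;
    sum += (uint16_t)~old_word;
    sum += new_word;
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement TTL (byte 8) and patch the checksum (bytes 10-11) in place. */
static void demo_decrement_ttl(uint8_t *hdr)
{
    uint16_t old_word = (uint16_t)((hdr[8] << 8) | hdr[9]);
    uint16_t check = (uint16_t)((hdr[10] << 8) | hdr[11]);
    assert(hdr[8] > 0);
    hdr[8]--;
    uint16_t new_word = (uint16_t)((hdr[8] << 8) | hdr[9]);
    check = demo_csum_update(check, old_word, new_word);
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xff);
}
```

As a worked case, take the 20-byte header 45 00 00 73 00 00 40 00 40 11 b8 61 c0 a8 00 01 c0 a8 00 c7. Its checksum is 0xb861 with TTL 0x40. Decrementing the TTL changes the word 0x4011 to 0x3f11, and the update gives ~(0x479e + 0xbfee + 0x3f11) = ~0x469e = 0xb961, the same value a full recomputation produces, at the cost of three additions instead of a pass over the header.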
![Page 54: Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha](https://reader036.vdocument.in/reader036/viewer/2022062320/56649ce35503460f949af354/html5/thumbnails/54.jpg)
Cached hardware headers
Routes cache hardware headers for quick construction of outgoing packets.