A TRANSPORT PROTOCOL FOR
DEDICATED END-TO-END CIRCUITS
A Thesis
Presented to
the faculty of the School of Engineering and Applied Science
University of Virginia
In Partial Fulfillment
of the requirements for the Degree
Master of Science
Computer Engineering
by
Anant P. Mudambi
January 2006
APPROVAL SHEET
This thesis is submitted in partial fulfillment of the requirements for the degree of
Master of Science
Computer Engineering
Anant P. Mudambi
This thesis has been read and approved by the examining committee:
Malathi Veeraraghavan (Advisor)
Marty A. Humphrey (Chair)
Stephen G. Wilson
Accepted for the School of Engineering and Applied Science:
Dean, School of Engineering and Applied Science
January 2006
Abstract
E-science projects involving geographically distributed data sources, computing resources and
scientists have special networking requirements, such as steady throughput and deterministic
behavior. The connectionless Internet model is not well suited to meeting such requirements.
Connection-oriented networks that offer guaranteed-rate, dedicated circuits have been proposed to meet the
high-end networking needs of distributed scientific research. In this work we describe the design
and implementation of a transport protocol for such dedicated circuits.
We present an initial user-space, UDP-based implementation called Fixed Rate Transport Proto-
col (FRTP). The constraints imposed by a user-space implementation led us to implement a lower-
overhead kernel-space solution that we call Circuit-TCP (C-TCP). The key feature of C-TCP is to
maintain a fixed sending rate, closely matched to the circuit rate, with the aim of achieving high
circuit utilization. We implemented C-TCP by modifying the Linux TCP/IP stack. Experimental
results on a wide-area circuit-switched testbed show that C-TCP is able to quickly utilize circuit
bandwidth and sustain a high data transfer rate.
Acknowledgments
I would like to thank Prof. Malathi Veeraraghavan, for her advice and for keeping me on the right
track. I thank the members of the CHEETAH research group, Xuan, Xiangfei, Zhanxiang and
Xiuduan, for all their help.
Anil and Kavita, thank you for keeping me motivated. Finally, the biggest thank you to my
parents, for their incredible support and love.
Contents
1 INTRODUCTION 1
2 BACKGROUND 3
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 TCP Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 UDP-based Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Novel Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 End-host Factors that Affect Data Transfer Performance . . . . . . . . . . . . . . . 6
2.2.1 Memory and I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1.1 Zero-copy Networking . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Protocol Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Disk Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Process scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Components of CHEETAH . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Features of a CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 The CHEETAH Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 End-host Software Support for CHEETAH . . . . . . . . . . . . . . . . . 14
3 UDP-BASED TRANSPORT PROTOCOL 16
3.1 SABUL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 SABUL Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Modifications to SABUL : FRTP . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Problems with the FRTP Implementation . . . . . . . . . . . . . . . . . . 22
3.2.2 Possible Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 TCP-BASED SOLUTION 27
4.1 Transmission Control Protocol - A Primer . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.3 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.4 Self Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Reasons for Selecting TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Circuit-TCP Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 Connection Establishment . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.3 Multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.4 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.5 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 C-TCP Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Web100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Utility of Disabling Slow Start . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.2 Sustained Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.2.1 Reno-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.2 BIC-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.3 C-TCP Performance . . . . . . . . . . . . . . . . . . . . . . . . 47
5 CONTROL-PLANE FUNCTIONS 49
5.1 Selecting the Circuit Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Setting up the Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 CONCLUSIONS 56
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.1.1 Transport Protocol Design for Dedicated Circuits . . . . . . . . . . . . . . 56
6.1.2 Transport Protocol Implementation . . . . . . . . . . . . . . . . . . . . . 57
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A DISK WRITE RATE ESTIMATION 59
A.1 How Linux Handles Disk Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Bibliography 66
List of Figures
2.1 Memory I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 CHEETAH experimental testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Architecture of CHEETAH end-host software . . . . . . . . . . . . . . . . . . . . 15
3.1 Architecture of a generic UDP-based protocol . . . . . . . . . . . . . . . . . . . . 17
3.2 Need for receiver flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 TCP self clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Congestion control in the control plane . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Structure of the Web100 stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Maximum buffer space required for a C-TCP burst . . . . . . . . . . . . . . . . . 41
4.5 Testbed configuration for C-TCP tests . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 TCP and C-TCP comparison for different transfer sizes . . . . . . . . . . . . . . . 43
4.7 Start-up behavior of TCP and C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8 Throughput and RTT using Reno-TCP . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Throughput and RTT using BIC-TCP . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 Throughput and RTT using C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Throughput variability of disk-to-disk transfers . . . . . . . . . . . . . . . . . . . 50
5.2 Trade-off between circuit utilization and delay . . . . . . . . . . . . . . . . . . . . 51
List of Tables
5.1 xdd benchmark results on zelda4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Disk write rate (Mbps) for individual runs using 32 KB request sizes . . . . . . . . 52
A.1 End host configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.2 Disk write rate results using xdd . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Abbreviations
ACK Acknowledgement
AIMD Additive Increase Multiplicative Decrease
API Application Programming Interface
AQM Active Queue Management
BDP Bandwidth Delay Product
BIC-TCP Binary Increase Congestion control TCP
CHEETAH Circuit-switched High-speed End-to-End Transport ArcHitecture
COW Copy On Write
C-TCP Circuit TCP
cwnd congestion window
DMA Direct Memory Access
DNS Domain Name System
DRAGON Dynamic Resource Allocation via GMPLS Optical Networks
FAST Fast AQM Scalable TCP
FRTP Fixed Rate Transport Protocol
GbE Gigabit Ethernet
Gbps Gigabits per second
GB Gigabyte
GMPLS Generalized Multiprotocol Label Switching
HS-TCP HighSpeed TCP
I/O Input/Output
IP Internet Protocol
KB Kilobyte
LAN Local Area Network
LMP Link Management Protocol
Mbps Megabits per second
MB Megabyte
MSPP Multi-Service Provisioning Platform
MTU Maximum Transmission Unit
NAK Negative ACK
NETBLT Network Block Transfer
NIC Network Interface Card
OC Optical Carrier
OCS Optical Connectivity Service
OS Operating System
OSPF Open Shortest Path First
RBUDP Reliable Blast UDP
RED Random Early Detection
RSVP-TE Resource Reservation Protocol - Traffic Engineering
RTO Retransmission Time-out
RTT Round Trip Time
rwnd receiver advertised window
SABUL Simple Available Bandwidth Utilization Library
SACK Selective ACK
SONET Synchronous Optical Network
ssthresh slow start threshold
TCP Transmission Control Protocol
TDM Time Division Multiplexing
TSI Terascale Supernova Initiative
UDP User Datagram Protocol
UDT UDP-based Data Transfer protocol
XCP eXplicit Control Protocol
Chapter 1
INTRODUCTION
Many fields of research require significant computing resources to conduct simulations and/or to
analyze large amounts of data. Large data sets collected by remote instruments may need to be
processed. The SETI@home project [2], which uses data collected by the National Astronomy
and Ionosphere Center's radio telescope in Arecibo, Puerto Rico, is one such example. The telescope
generates about 35 GB of data per day, which is stored on removable tapes and physically transported
to the server in Berkeley, California. In some cases, computations generate massive amounts of
output that has to be distributed to scientists who are physically at a distance from the computation
resource. For instance, the Terascale Supernova Initiative (TSI) project involves simulations run on
supercomputers at the Oak Ridge National Laboratory (ORNL), the results of which are used by
physicists at remote sites like the North Carolina State University (NCSU).
Networks connecting the data generation point, the computation resource and the scientists'
workplace make collaborative e-science much more practical. The large amounts of data involved
and, in some cases (e.g., real-time visualization), stringent delay/jitter requirements make it nec-
essary to use networks with large bandwidths and deterministic behavior. E-science applications
require high, constantly available bandwidth for their data transfer needs. It is difficult to provide
such rate-guaranteed services in packet-switched, connectionless networks, such as the present-day
Internet. This is because of the possibility of a large number of simultaneous flows competing for
the available network capacity. Therefore, the use of connection-oriented, dedicated circuits has
been proposed as a solution. Many research groups are implementing testbeds and the supporting
software to show the feasibility of such a solution.
The problem addressed in this thesis is the design of a transport protocol for dedicated circuits.
Many of the assumptions on which traditional transport protocols for packet-switched networks
are based need to be examined. For instance, the possibility of losses due to network buffer over-
flows makes congestion control an important function on connectionless networks. On connection-
oriented networks, because network resources are reserved for each data transfer, the end points of
the transfer have more control over whether or not network buffers will overflow. By maintaining
a data transfer rate that is matched to the reserved circuit’s rate the need for congestion control
can be eliminated. On the other hand, a transport layer function such as flow control is needed on
both connectionless and connection-oriented networks because it addresses a problem that network
resource reservation does not solve.
Our approach is to design the transport protocol under the assumption that resources are re-
served for a data transfer’s exclusive use. The transport protocol should not have any “features”
that leave the reserved circuit unutilized. We implemented the transport protocol and tested it on a
wide-area, connection-oriented network testbed. This protocol is called Circuit-TCP (C-TCP).
The rest of this thesis is organized as follows. Chapter 2 provides background information on
previous work in this area as well as issues that affect the design and performance of our transport
protocol. In Chapter 3, we describe the Fixed Rate Transport Protocol (FRTP) that was implemented
in the user space over UDP. The shortcomings of a user-space implementation are pointed out.
Chapter 4 describes the design and implementation of C-TCP, our kernel space transport protocol
based on TCP. Experimental results over a testbed are used to compare C-TCP with TCP over
dedicated circuits. In Chapter 5 the control plane issues of determining the circuit rate and then
setting up the circuit are considered. The conclusions of this work are presented in Chapter 6.
Chapter 2
BACKGROUND
In this chapter we first look at other work that has been done in the development of transport pro-
tocols for high-performance networks. Next we point out some of the factors that play a significant
role in achieving high throughput on dedicated circuits. Many of these are end-host issues that we
discovered while implementing our transport protocol. This work has been conducted as a part
of the Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH) project. An
overview of CHEETAH is presented at the end of this chapter.
2.1 Related Work
There has been significant activity in developing transport protocols suitable for high-bandwidth
and/or high-delay networks. Even though very little of it is focused explicitly on dedicated
circuits, there is enough of an overlap in the problems to justify a closer examination. High-
performance protocols can be classified as TCP enhancements, UDP-based and novel protocols.
Ease of deployment and familiarity with the sockets API to the TCP and UDP protocol stacks are
reasons for the popularity of TCP and UDP-based solutions.
2.1.1 TCP Enhancements
TCP is the most widely used reliable transport protocol on connectionless, packet-switched net-
works. We describe basic TCP operation in Chapter 4. It is designed to work under a wide range
of conditions and this makes a few of its design decisions non-optimal for high-speed networks.
In recent years a number of protocol extensions to TCP have been proposed and implemented to
address this issue. Selective acknowledgements (SACKs) [27, 16] have been proposed to deal more
efficiently with multiple losses in a round trip time (RTT) [13]. TCP uses cumulative acknowl-
edgements (ACKs) which means a data byte is not ACKed unless all data earlier in the sequence
space has been received successfully. SACKs inform the sender about out-of-sequence data already
received and help prevent unnecessary retransmissions. Two protocol extensions, the timestamps
option and window scaling, were proposed in [22]. The timestamps option field in a data packet's
TCP header is filled in by a sender and echoed back in the corresponding ACK. It serves two pur-
poses. First, the timestamp can be used to estimate the round trip time more accurately and more
often. This gives the sender a better value for retransmission timeout (RTO) computation. Second,
the timestamp in a received packet can be used to prevent sequence number wraparound. The TCP
header has a 16-bit field for the window size, which limits the window size to 64 KB. This is insuf-
ficient for high-bandwidth, high-delay networks. The window scaling option allows a scaling factor
to be chosen during connection establishment. Subsequent window advertisements are right shifted
by the selected scaling factor. Scaling factors of up to 14 are allowed; thus, by using this option, a
window size of up to 1 GB can be advertised.
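The window-scaling arithmetic can be illustrated with a small sketch (the helper name is ours, not from any TCP implementation; real stacks apply the shift inside the protocol code):

```c
#include <stdint.h>

/* Effective window = 16-bit advertised value shifted left by the
 * scale factor negotiated at connection setup.
 * Illustrative helper only. */
uint32_t effective_window(uint16_t advertised, unsigned scale)
{
    return (uint32_t)advertised << scale;
}
/* effective_window(65535, 0)  -> 65,535 bytes (the unscaled cap)
 * effective_window(65535, 14) -> 1,073,725,440 bytes, close to 1 GB */
```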
Standard TCP (also called Reno TCP) has been found wanting in high-bandwidth, high-delay
environments, mainly due to its congestion control algorithm. TCP’s Additive Increase Multi-
plicative Decrease (AIMD) algorithm is considered too slow in utilizing available capacity and too
drastic in cutting back when network congestion is inferred. Modifications to the TCP conges-
tion control algorithm have led to the development of HighSpeed TCP [14], Scalable TCP [25],
FAST [23], and BIC-TCP [39], among others. Standard TCP requires unrealistically low loss rates
to achieve high throughputs. HighSpeed TCP is a proposed change to the TCP AIMD parameters
that allows a TCP connection to achieve high sending rates under more realistic loss conditions.
Scalable TCP also proposes modified AIMD parameters that speed up TCP’s recovery from loss.
FAST infers network congestion and adjusts its window size based on queueing delays rather than
loss. BIC-TCP (BIC stands for Binary Increase Congestion control) is a new congestion control
algorithm that scales well to high bandwidth (i.e., it can achieve a high throughput at reasonable
packet loss rates) and is TCP-friendly (i.e., when the loss rate is high its performance is the same
as standard TCP’s). In addition, unlike HighSpeed or Scalable TCP, BIC-TCP’s congestion control
is designed such that two flows with different RTTs share the available bandwidth in a reasonably
fair manner.
2.1.2 UDP-based Protocols
To overcome the shortcomings of TCP, many researchers have implemented protocols over UDP by
adding required functionality, such as reliability, in the user space. The most common model is to
use UDP for the data transfer and a separate TCP or UDP channel for control traffic. SABUL [18],
Tsunami, Hurricane [38], and RBUDP [20] use a TCP control channel and UDT [19] uses UDP
for both data and control channels. The advantage of these solutions is that their user-space imple-
mentation makes deployment easy. At the same time, there are some limitations that arise because
these protocols are implemented in the user-space. In Chapter 3, we describe SABUL. Our attempt
at modifying SABUL to implement a transport protocol for dedicated circuits and the shortcomings
of a user-space transport protocol implementation are also pointed out.
2.1.3 Novel Protocols
Some novel protocols designed exclusively for high-performance data transfer have also been pro-
posed. The eXplicit Control Protocol (XCP) [24] was proposed to solve TCP’s stability and effi-
ciency problems. By separating link utilization control from fairness control, XCP is able to make
more efficient use of network resources in a fair manner. XCP’s requirement of multi-bit congestion
signals from the network makes it harder to deploy since routers in the network need to be modified.
NETBLT [10] was proposed for high-speed bulk data transfer. It provides reliable data transfer by
sending blocks of data in a lock-step manner. This degrades bandwidth utilization while the sender
awaits an acknowledgement (ACK) for each block.
2.2 End-host Factors that Affect Data Transfer Performance
Setting up a dedicated circuit involves resource reservation in the network. Depending on the
network composition, the resources reserved could be wavelengths, ports on a switch or time slots.
Ideally, we would like to fully use the reserved resources for exactly the time required to complete
the transfer. During the implementation of our transport protocol, we found that there are many
factors that make it hard to achieve this ideal. In this section we list a few of these factors that
impact the performance of transport protocol implementations.
2.2.1 Memory and I/O bus usage
First, consider an application that uses the transport protocol to carry out a file transfer. At the
sending end, the application has to
1. Read data from the disk, e.g. by invoking a read system call.
2. Send the data out on the network, e.g. by invoking a send system call.
There are two types of overhead in carrying out these operations. The system calls involve the over-
head of saving the process registers on stack before the system call handler is invoked. Secondly,
the two steps shown above could involve multiple passes over the memory and I/O bus. This is
illustrated in Figure 2.1(a). The figure shows the bus operations involved in moving data from the
disk to user space buffers (step 1 above), and from the user space buffer to kernel network buffers
(part of step 2). To avoid having to access the disk each time, for multiple accesses to a chunk of
data, the operating system caches recently accessed disk data in memory. This cache is called the
page cache, and direct memory access (DMA) is used for transfers between the page cache and the
disk (operation I in Figure 2.1(a)). Two passes over the memory bus are needed to transfer the data
from the page cache to the user space buffer (operation II). To send data out to the network, it is
again copied from the user space buffer to kernel network buffers (operation III). We do not show
the transfer from the kernel network buffer to the NIC, which is the final step in getting data out
into the network. For data transfers using TCP sockets on Linux, the sendfile system call can be
[Figure 2.1: Memory I/O bus usage. (a) Using read and send; (b) Using sendfile. The diagram shows the DMA transfer I between the hard disk and the page cache, the copy II from the page cache to the user-space buffer, and the copy III from the user-space buffer to the kernel network buffers.]
used to cut down the number of passes over the memory bus to three. As shown in Figure 2.1(b),
sendfile copies data directly from the page cache to the kernel network buffers, thus avoiding the
copy to user space and back. In addition, sendfile needs to be invoked just once for a single file, so
the overhead of making a system call is paid only once per file.
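A minimal sketch of this usage follows (error handling abbreviated; the function name is ours):

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send an entire file to a connected socket (or, on newer Linux
 * kernels, any output descriptor) without a user-space bounce buffer.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_whole_file(int out_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;                  /* sendfile advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) {                  /* error or unexpected EOF */
            close(fd);
            return -1;
        }
    }
    close(fd);
    return (ssize_t)offset;
}
```

A single send_whole_file() call replaces the read()/send() loop, so the system-call overhead is paid once per file rather than once per buffer.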
2.2.1.1 Zero-copy Networking
Other methods for avoiding the copy from user-space memory to kernel-space memory have been
proposed. Such methods are known by the common term zero-copy networking. For a classification
of zero-copy schemes see [7]. The zero in zero-copy networking indicates that there is no memory-
to-memory copy involved in the transfer of data between a user space buffer and the network. So,
in Figure 2.1(a), a zero-copy scheme would eliminate memory-to-memory copies after operation
II. How the data got into the user- or kernel-space buffer in the first place, and whether that required
a copy is not considered. Zero-copy schemes can be supported if an application interacts directly
with the NIC without passing through the kernel, or if the buffers are shared between user and
kernel space, rather than being copied. For an application to directly read from and write to the NIC
buffer, protocol processing has to be done on the NIC. At the sender, buffers can be shared between
the application and the kernel if the application can guarantee that a buffer that has not yet been
transmitted will not be overwritten. One way to ensure this would be if the system call invoked to
send some data returns only after all of that data has been successfully transmitted. Since a reliable
transport protocol can consider a buffer to have been successfully transmitted only when all of the
data in that buffer has successfully reached the intended receiver, the application may need to wait
a while before it can reuse a buffer. An interesting alternative is to mark the buffer as copy-on-write
(COW), so that the contents of the buffer are copied to a separate buffer if and when the application
tries to overwrite it. Implementation of send-side zero-copy schemes for different operating systems
are described in [28].
Now consider the steps at a receiver. A receiver performs the steps shown in Figure 2.1(a) in
reverse order (there is no sendfile equivalent for the receiver). One way to implement zero-copy
on the receiver is to change the page table of the application process when it issues a recv system
call. This is called page flipping in [28]. Page flipping works only if the NIC separates the packet
payload and header, if the packet payload is an exact multiple of the page size and if the buffer
provided by the application is aligned to page boundaries. Because of these requirements there has
been little effort to implement such a scheme.
Several factors that influence communication overhead are presented in [33]. The memory and
I/O bus usage for schemes with different kernel and interface hardware support are compared. For
instance, the author shows how, by using DMA, NIC buffering and checksum offload, the number
of passes over the bus can be reduced from six to one.
2.2.2 Protocol Overhead
Apart from the memory and I/O bus, the other main end host resource that could become a bottle-
neck is processor cycles. TCP/IP, being the most widely used protocol stack, has received attention
in this regard. In [9] the processing overhead of TCP/IP is estimated and the authors’ conclusion
is that with a proper implementation, TCP/IP can sustain high throughputs efficiently. More recent
work presented in [17] takes into consideration the OS and hardware support that a TCP implemen-
tation will require.
The overhead of a transport layer protocol can be divided into two categories: per-packet costs
and per-byte costs [9, 28, 6]. Per-packet costs include protocol processing (e.g., processing the
sequence numbers on each packet in TCP) and interrupt processing. Per-byte costs are incurred
when data is copied or during checksum calculation.
Per-packet overhead can be reduced by reducing the number of packets handled during the
transfer. For a given transfer size, the number of packets can be reduced by using larger packets.
The maximum transmission unit (MTU) of the network constrains the packet size that an end host
can use. For instance, Ethernet imposes a 1500-byte limit on the IP datagram size. The concept
of jumbo frames was introduced by Alteon Networks to allow Ethernet frames of up to 9000 bytes,
and many gigabit Ethernet NICs now support larger frame sizes. Larger packet sizes can decrease
protocol processing overhead as well as the overhead of interrupt processing. NICs interrupt the
processor on frame transmission and reception. An interrupt is costly for the processor because
the state of the currently running process has to be saved and an interrupt handler invoked to deal
with the interrupt. As interface rates increase to 1 Gbps and higher, interrupt overhead can become
significant. Many high-speed NICs support interrupt coalescing so that the processor is interrupted
for a group of transmitted or received packets, rather than for each individual packet.
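The per-packet savings can be put in rough numbers (payload sizes are illustrative and ignore protocol headers):

```c
#include <stdint.h>

/* Ceiling division: packets needed to carry `bytes` of payload
 * at `payload_per_packet` bytes each. */
uint64_t packets_needed(uint64_t bytes, uint64_t payload_per_packet)
{
    return (bytes + payload_per_packet - 1) / payload_per_packet;
}
/* For a 1 GB (10^9 byte) transfer:
 *   packets_needed(1000000000, 1500) -> 666,667 packets
 *   packets_needed(1000000000, 9000) -> 111,112 packets
 * i.e., 9000-byte frames cut per-packet work roughly six-fold. */
```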
Schemes to reduce per-byte costs involved in copying data over the memory I/O bus were
described in Section 2.2.1. Checksum calculation can be combined with a copy operation and
carried out efficiently in software. For instance, the sender could calculate the checksum when data
is being copied from the user-space buffer to the kernel-space buffer. Another way to reduce the
processor’s checksum calculation burden is to offload it to the interface card.
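The combined copy-and-checksum idea can be sketched as follows (a simplified byte-at-a-time version of what optimized stacks do with word-sized operations; the function name is ours):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `len` bytes from src to dst while accumulating the 16-bit
 * Internet checksum (RFC 1071) in the same pass, so the data is
 * touched once instead of twice. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                       /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)                    /* fold carries into low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```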
2.2.3 Disk Access
All the factors considered so far affect data transfer throughput. In designing a transport protocol
for dedicated circuits, not only does a high throughput have to be maintained, the circuit utilization
should also be high. Thus end host factors that cause variability in the throughput also need to
be considered. For disk-to-disk data transfers, disk access can limit throughput as well as cause
variability. The file system used can have an effect on disk access performance. The time to
physically move the disk read/write head to the area on the hard disk where the desired data resides,
called seek time, is a major component of the disk access latency. File accesses tend to be sequential,
so a file system that tries to keep all parts of a file clustered together on the hard disk would perform
better than one in which a file is broken up into small pieces spread all over the hard disk.
At the sender, data needs to be read from the disk to memory. System calls to do this are
synchronous. When the system call returns successfully, the requested data is available in memory
for immediate use. Operating systems try to improve the efficiency of disk reads by reading in
more than the requested amount, so that some of the subsequent requests can be satisfied without
involving the disk hardware. At the data receiver, the system call to write to disk is asynchronous
by default. This means that when the system call returns it is not guaranteed that the data has been
written to disk; instead it could just be buffered in memory. Asynchronous writes are tailored to
make the common case of small, random writes efficient, since they allow the operating system
to schedule disk writes in an efficient manner. The operating system could reorder the writes to
minimize seeks. In Linux, for instance, data written to disk is actually copied to memory buffers
in the page cache and these buffers are marked dirty. Two kernel threads, bdflush and kupdate, are
responsible for flushing dirty buffers to disk. The bdflush kernel thread is activated when the number
of dirty buffers exceeds a threshold, and kupdate is activated whenever a buffer has remained dirty
too long. As a consequence of the kernel caching and delayed synchronization between memory
buffers and the disk, there can be significant variability in the conditions under which a disk write
system call operates.
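When a receiver needs to know that data has actually reached the disk, for instance to measure true disk write rates, the default asynchronous behavior can be overridden with fsync; a sketch (the function name is ours):

```c
#include <fcntl.h>
#include <unistd.h>

/* Write, then block until the kernel has flushed this file's dirty
 * page-cache buffers to disk. Returns bytes written, or -1 on error. */
ssize_t write_through(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);  /* normally returns once the data
                                         is buffered in the page cache */
    if (n < 0)
        return -1;
    if (fsync(fd) < 0)                /* force the flush that bdflush/
                                         kupdate would do later */
        return -1;
    return n;
}
```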
2.2.4 Process scheduling
The final factor we consider is the effect of the process scheduler. All modern operating sys-
tems are multitasking. Processes run on the processor for short intervals of time and then either
relinquish the CPU voluntarily (e.g. if they block waiting for I/O) or are forcibly evicted by the
operating system when their time slice expires. This gives users the impression that multiple processes
are running simultaneously. Multitasking, like packet-switched networking, tries to fairly divide up
a resource (processor cycles for multitasking; bandwidth for packet-switched networking) among
all contenders (multiple processes; multiple flows) for the resource. This behavior is at odds with
resource reservation in a connection-oriented network such as CHEETAH. If the degree of mul-
titasking at an end host is high then a data transfer application may not get the processor cycles
required to fully use the reserved circuit. Even if the required number of free cycles are available,
the process scheduler might not be able to schedule the data transfer application in the monotonic
fashion required to send and receive data at the fixed circuit rate.
2.3 CHEETAH Network
CHEETAH, which stands for Circuit-switched High-speed End-to-End Transport ArcHitecture, is a
network architecture that has been proposed [37] to provide high-speed, end-to-end connectivity on
a call-by-call basis. Since the transport protocol proposed in this thesis is to be used over a dedicated
circuit through a CHEETAH network, in this section we provide a description of CHEETAH.
2.3.1 Components of CHEETAH
Many applications in the scientific computing domain require high throughput transfers with deter-
ministic behavior. A circuit-switched path through the network can meet such requirements better
than a packet-switched path. CHEETAH aims to bring the benefits of a dedicated circuit to an end-
user. In order to allow wide implementation, CHEETAH has been designed to build on existing
network infrastructure instead of requiring radical changes. Ethernet and SONET (Synchronous
Optical Network) are the most widely used technologies in local area networks (LANs) and wide
area networks (WANs) respectively. To take advantage of this, a CHEETAH end-to-end path con-
sists of Ethernet links at the edges and Ethernet-over-SONET links in the core. Multi-Service
Provisioning Platforms (MSPPs) are hardware devices that make such end-to-end paths possible.
MSPPs are capable of mapping between the packet-switched Ethernet domain and the time divi-
sion multiplexed (TDM) SONET domain. MSPPs are an important component of the CHEETAH
architecture for three reasons.
1. The end hosts can use common Ethernet NICs and do not need, for instance, SONET line
cards.
2. Many enterprises already have MSPPs deployed to connect to their ISP’s backbone network.
3. Standard signaling protocols, as defined for Generalized Multiprotocol Label Switching
(GMPLS) networks, are (being) implemented in MSPPs. This is essential to support dynamic
call-by-call sharing in a CHEETAH network.
2.3.2 Features of a CHEETAH Network
One of the salient features of CHEETAH is that it is an add-on service to the existing packet-
switched service through the Internet. This means, firstly, that applications requiring CHEETAH
service can co-exist with applications for which a path through the packet-switched Internet is good
enough. Secondly, because network resources are finite, it is possible that an application’s request
for a dedicated circuit is rejected; in such cases, the Internet path provides an alternative so that
the application’s data transfer does not fail. To realize this feature, end hosts are equipped with an
additional NIC that is used exclusively for data transfer over a CHEETAH circuit.
To make the CHEETAH architecture scalable, the network resource reservation necessary to
set up an end-to-end circuit should be done in a distributed and dynamic manner. Standardized
signaling protocols that operate in a distributed manner, such as the hop-by-hop signaling in GM-
PLS protocols, are key to achieving scalability. CHEETAH uses RSVP-TE¹ signaling in the control
plane. Dynamic circuit set up and tear down means that these operations are performed when (and
only when) required, as opposed to statically provisioning a circuit for a long period of time. Dy-
namic operation is essential for scalability because it allows the resources to be better utilized, thus
driving down costs. End-host applications that want to use a CHEETAH circuit are best-placed
to decide when the circuit should be set up or torn down. Therefore an end host connected to the
CHEETAH network runs signaling software that can be used by applications to attempt circuit set
up on a call-by-call basis.
With end-host signaling in place, applications that want to use a CHEETAH circuit can do so
in a dynamic manner. This leads to the question of whether, just because it can be done, a circuit
set up should be attempted for a given data transfer. In [37], analytical arguments are used to show
that, for data transfers above a threshold size, transfer delay can be reduced by using a CHEETAH circuit rather than an Internet path. It is also worth noting that there are situations in which the overhead of circuit set up makes it advantageous to use a path through the Internet, although for wide-area bulk data transfer a dedicated circuit invariably trumps an Internet path.

¹ Resource Reservation Protocol-Traffic Engineering. This is the signaling component of the GMPLS protocols, the other components being the Link Management Protocol (LMP) and Open Shortest Path First-TE (OSPF-TE).
2.3.3 The CHEETAH Testbed
To study the feasibility of the CHEETAH concept, an experimental testbed has been set up. This
testbed extends between North Carolina State University (NCSU), Raleigh, NC, and Oak Ridge Na-
tional Laboratory (ORNL), Oak Ridge, TN and passes through the MCNC point-of-presence (PoP)
in Research Triangle Park, NC and the Southern Crossroads/Southern LambdaRail (SOX/SLR) PoP
in Atlanta, GA. The testbed layout is shown in Figure 2.2. In this testbed, the Sycamore SN16000
Intelligent Optical Switch is used as the MSPP. In the figure we show end hosts connected directly
or through Ethernet switches to the gigabit Ethernet card on the SN16000. The cross connect card
is configured through the control card to set up a circuit. The SN16000 has an implementation of
the GMPLS signaling protocol that follows the standard and has been tested for interoperability.
Figure 2.2: CHEETAH experimental testbed
The testbed has been designed to support the networking needs of the TSI project mentioned
at the beginning of this chapter. We present results of experiments conducted over this testbed in
Chapter 4.
2.3.4 End-host Software Support for CHEETAH
To allow applications to start using CHEETAH circuits, software support is required to make the
end hosts CHEETAH enabled. The architecture of the end-host software is shown in Figure 2.3.
The relevant components of the CHEETAH end-host software are shown inside a dotted box to
signify that the application could either interact with each component individually or make higher-
level calls that hide the details of the components being invoked. To be able to use a CHEETAH
circuit between two end hosts, both should support CHEETAH.
The Optical Connectivity Service (OCS) client allows applications to query whether a re-
mote host is on the CHEETAH network. OCS uses the Internet’s Domain Name Service (DNS)
to provide additional information such as the IP address of the remote end’s secondary NIC. As
mentioned earlier, depending on the situation, either a CHEETAH circuit or a path through the In-
ternet may be better for a particular transfer. The routing decision module takes measurements of
relevant network parameters (e.g., available bandwidth and average loss rate) and uses these along
with the parameters of a particular transfer (e.g., the file size and requested circuit rate) to decide
whether or not a CHEETAH circuit set up should be attempted. To achieve CHEETAH’s goal of
distributed circuit set up, an RSVP-TE signaling module runs on each end host. The RSVP-TE
module exchanges control messages with the enterprise MSPP to set up and tear down circuits.
These control messages are routed through the primary NIC over the Internet. The final software
component is the transport protocol module. Depending on whether a circuit or an Internet path
is being used, the transport protocol used will be C-TCP or TCP. In this thesis the focus will be on
the design, implementation and evaluation of C-TCP.
To end this chapter we mention some of the other projects focused on connection-oriented
networking for e-science projects. UltraScience Net [36] is a Department of Energy sponsored
research testbed connecting Atlanta, Chicago, Seattle and Sunnyvale. This network uses a centralized scheme for the control-plane functions.

Figure 2.3: Architecture of CHEETAH end-host software

Another effort is the Dynamic Resource Allocation via
GMPLS Optical Networks (DRAGON) project [12]. DRAGON uses GMPLS protocols to support
dynamic bandwidth provisioning.
Chapter 3
UDP-BASED TRANSPORT PROTOCOL
In Chapter 2 we mentioned a few protocols that are based on UDP. There are good reasons for
taking this approach:
• UDP provides the minimal functionality of a transport protocol. It transfers datagrams be-
tween two processes but makes no guarantees about their delivery. UDP’s minimalism leaves
no scope for anything to be taken out of its implementation. Thus a new protocol built over
UDP has to add extra (and only the required) functionality. The significance of this is that
these additions can be done in the user space, without requiring changes to the operating
system code. This makes UDP-based solutions as easy to use and portable as an application
program.
• The sockets API to the UDP and TCP kernel code is widely deployed and used. This makes
implementation easier and faster.
The basic design of all UDP-based protocols is similar and is shown in Figure 3.1. Data packets
are transferred using UDP sockets. A separate TCP or UDP channel is used to carry control pack-
ets. Control packets serve to add features to the data transfer not provided by UDP’s best-effort
service. We used the Simple Available Bandwidth Utilization Library (SABUL), a UDP-based data
transfer application, to implement the Fixed Rate Transport Protocol (FRTP). In this chapter we first
present an overview of the SABUL protocol and implementation. Then we describe the changes
that we made to SABUL to implement FRTP. The advantages and shortcomings of this approach are discussed.

Figure 3.1: Architecture of a generic UDP-based protocol
3.1 SABUL Overview
SABUL is designed for bulk data transfers over high-bandwidth networks. SABUL’s architecture
is the same as that shown in Figure 3.1. TCP is used for control packet transmission from the data
receiver to the data sender. SABUL adds reliability, rate-based congestion control and flow control
to UDP’s basic data transfer service.
Providing end-to-end reliability—guaranteeing that all the data sent is received in the same or-
der and without duplicates—is a function of the transport layer. SABUL implements the following
error control scheme for reliable transfer. It adds a sequence number to each UDP data packet.
The receiver detects packet loss using the sequence numbers of the received packets. On inferring
loss, the receiver immediately sends a negative-acknowledgement (NAK) control packet to convey
this information to the sender. The sender then recovers from the error by retransmitting the lost
packet(s). The receiver maintains an ERR timer to periodically send NAKs if there are missing
packets. This is to provide protection against lost retransmissions. For file transfers, reading data
from the disk for each retransmission is very expensive in time. Therefore, the sender keeps the
transmitted data in memory until it is acknowledged. A SABUL receiver periodically sends an ac-
knowledgement (ACK) control packet, acknowledging all packets received in-order. On receiving
an ACK, the sender can free the buffer space occupied by data that is confirmed to have been re-
ceived. In addition the SABUL sender has a timer that is reset each time a control packet is received.
If this timer (called the EXP timer) expires because no control information has been received, the
sender assumes that all unacknowledged packets have been lost and retransmits them.
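The receiver-side error control just described can be sketched as a small state machine. This is our own simplified rendering, not SABUL code: sequence numbers count packets, and names like next_expected and loss_list are ours.

```python
# Sketch of SABUL-style receiver loss detection (simplified).
def detect_loss(received_seqs):
    """Process arriving sequence numbers; return the final loss list."""
    next_expected = 0
    loss_list = []
    for seq in received_seqs:
        if seq > next_expected:
            # A gap: everything between next_expected and seq is missing.
            loss_list.extend(range(next_expected, seq))
            # An immediate NAK for the missing packets would be sent here.
        elif seq < next_expected and seq in loss_list:
            # A retransmission filling a hole.
            loss_list.remove(seq)
        next_expected = max(next_expected, seq + 1)
    return loss_list

# Packet 2 is lost in transit and never retransmitted:
assert detect_loss([0, 1, 3, 4]) == [2]
# The retransmission of packet 2 eventually arrives:
assert detect_loss([0, 1, 3, 4, 2]) == []
```

The periodic ERR timer then re-sends NAKs for whatever remains in the loss list, guarding against lost retransmissions.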
SABUL uses a rate-based congestion control scheme. The sender modifies the sending rate
depending on the degree of congestion in the network. The SABUL receiver sends a periodic syn-
chronization (SYN) control packet containing the number of data packets received in the previous
SYN period. The sender uses this information to estimate the amount of loss and hence the con-
gestion in the network. Depending on whether the loss is above or below a threshold, the sending
rate is reduced or increased, respectively. The sending rate is modified by changing the inter-packet
gap.
SABUL is a user space implementation which means a SABUL receiver cannot distinguish
between loss due to network congestion and loss due to its receive buffer (the kernel UDP buffer)
overflowing. The information in SYN packets represents both types of loss, and therefore, SABUL’s
rate-based congestion control also serves as a reactive flow control strategy. In addition, a fixed
window is used to limit the amount of unacknowledged data in the network.
3.1.1 SABUL Implementation
The SABUL implementation is described next. It is important to separate the SABUL transport
protocol from an application that uses it. In the description below we refer to an application using
SABUL as the sending application or receiving application. The sending application generates
the data that is to be transferred using SABUL, for example by reading it from a file on the hard
disk. The receiving application, likewise, consumes the data transferred using SABUL. SABUL
is implemented in C++. The sending application invokes a SABUL method to put data into the
protocol buffer. SABUL manages the protocol buffer and transmits or retransmits data packets
from it. Two threads are used. One handles the interface with the sending application, mainly the
filling of the protocol buffer. The other thread is responsible for sending out data packets. The
sequence numbers of packets that need to be retransmitted are recorded in a loss list. Pseudocode
for the sender side functionality is shown below:
INITIALIZATION:
Create TCP socket on well-known port number
Listen for a connection
Accept connection from client
Get the UDP port number on which the receiver is expecting data
Calculate the inter-packet gap required to maintain the desired sending rate
Fork a new thread to handle the data transmission
DATA TRANSMISSION:
WHILE data transfer is not over
WHILE protocol buffer is empty AND data transfer is not over
Wait for data from the sending application
ENDWHILE
Poll control channel for control packets
IF control packet received THEN
Process control packet /* See below */
ENDIF
IF loss list is not empty THEN
Remove first packet from the loss list
ELSE
Form a new packet
ENDIF
Send the data packet by invoking the send() system call on the UDP socket
Wait till it is time to send the next packet
ENDWHILE
CONTROL PACKET PROCESSING:
IF ACK packet THEN
Release buffer space held by the acknowledged packet(s)
Update loss list
Inform sending application of availability of buffer space
ELSE IF NAK packet THEN
Update loss list
ELSE IF SYN packet THEN
Adjust sending rate
ENDIF
Two threads are used at the receiver too. One thread (call it the network thread) is responsible
for receiving data packets, writing the data into the protocol buffer and sending control packets.
The other thread (main thread) handles the interface with the receiving application, transferring
data from the protocol buffer to the application buffer. SABUL uses an optimization when the
receiving application asks to read more data than the protocol buffer has. The main thread sets a
flag indicating such a situation. On seeing this flag, the network thread copies all available data
into the application buffer and resets the flag. As the rest of the data requested by the receiving
application arrives, it is copied directly into the application buffer saving a memory copy. The
receiver side pseudocode follows.
INITIALIZATION:
Create TCP and UDP sockets
Connect to the sender
Inform the sender of the UDP port number
Fork a new thread to receive data
RECEIVING DATA:
WHILE all the data has not been received
IF receiving application is waiting for data THEN
Copy all ACKed data from protocol buffer to application buffer
ENDIF
IF ACK timer expired THEN
Send ACK packet
ENDIF
IF ERR timer expired THEN
Send NAK packet with sequence numbers of missing packets
ENDIF
IF SYN timer expired THEN
Send SYN packet with number of packets received in previous SYN interval
ENDIF
Get the address into which to receive the next expected data packet
Receive a data packet on the UDP socket
IF missing packets THEN
Add missing packets’ sequence numbers to loss list
Send an immediate NAK packet
ENDIF
Update state variables like next expected sequence number, ACK sequence number
Update loss list
ENDWHILE
3.2 Modifications to SABUL : FRTP
Our initial idea for a transport protocol that can be used over dedicated circuits was that, since
bandwidth is reserved, the data should be just streamed across at the circuit rate. Transmitting at a
rate lower than the reserved circuit rate would leave bandwidth unutilized. Transmitting at a higher
rate would eventually lead to a buffer filling up and overflowing. Therefore we wanted a transport
protocol that could monotonically send data packets at a fixed rate. SABUL seemed like a perfect
match for doing this since it can maintain a fixed sending rate if its rate-based congestion control
was disabled. FRTP, our transport protocol for dedicated circuits, could be implemented just like
SABUL, except that the rate altering congestion control would be stripped out.
The first modification to SABUL code was to remove the rate-based congestion control that
modified the sending rate. Second, we added support for using separate NICs for the data and
control channels. This was in line with the CHEETAH concept of having two NICs on CHEETAH-
enabled hosts. SABUL (and hence, FRTP) has many parameters that can be tweaked to improve
its performance. The application, protocol and UDP buffer sizes can be changed. The values of
the different timers that SABUL uses are also available for adjustment. We ran experiments in a
laboratory setting [40] to determine the effect of some of these parameters on FRTP performance,
and possibly determine the optimal values. Although we failed to determine a set of optimal values
for the parameters, these experiments did reveal some of the flawed assumptions we were making.
3.2.1 Problems with the FRTP Implementation
We observed that even though FRTP was set up to send at a fixed rate, the throughput achieved
(amount of data transferred / transfer time) was lower than the sending rate. This difference grew as
the sending rate was increased. We found that the reasons for this discrepancy were two-fold. First,
the FRTP implementation was not able to maintain a monotonic sending rate. Second, even if the
sender was able to maintain a constant sending rate, the receiving application could not empty the
buffers at the same (or higher) rate. This led to receiver buffer overflow and retransmissions, which
reduced the throughput.
FRTP implements a fixed sending rate by maintaining a fixed inter-packet gap. For instance,
if 1500 byte packets are being used, a 1 Gbps sending rate can be maintained by ensuring that the
gap between successive transmitted packets is 12 µs (= 1500 bytes / 1 Gbps). Commodity operating
systems do not provide straightforward methods (if at all) to measure such small intervals of time
and certainly do not provide a method to reliably schedule a periodic action at such a fine granularity. For instance, the timer tick granularity available to user-space processes in Linux is 10 ms. To
overcome this, FRTP uses busy waiting to bide away the time between packet transmissions. If the
next packet needs to be sent at time t, FRTP does the following:
WHILE ((current time) < t)
NOP
ENDWHILE
The rdtsc (read time stamp counter) instruction, provided by Pentium processors, is used to get
an accurate value for the current time. The rdtsc instruction reads the time stamp counter that is
incremented at every processor tick. NOP is a no operation instruction. The busy waiting solution is
wasteful since the NOPs use up processor cycles that could have been used to accomplish something
more useful. It also does nothing to make the periodic invocation of an event reliable. If the sending
process were the only one running on the processor then the busy waiting scheme works to reliably
perform a periodic action. If a different process is running on the processor at t, the FRTP sending
process will miss its deadline. In fact, since FRTP itself uses 2 threads at the sender, the thread
responsible for filling the protocol buffer could interfere with the data sending thread’s busy waiting
induced periodicity.
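A user-space analogue of this busy-wait pacing can be written with an ordinary high-resolution clock in place of rdtsc (the gap, packet count and no-op send hook below are our own illustrative choices):

```python
import time

def paced_send(n_packets, gap_s, send=lambda i: None):
    """Busy-wait between transmissions to hold a fixed inter-packet gap.
    At 1 Gbps with 1500-byte packets, gap_s would be 12e-6 (12 us)."""
    t_next = time.perf_counter()
    for i in range(n_packets):
        send(i)                          # transmit packet i
        t_next += gap_s
        while time.perf_counter() < t_next:
            pass                         # NOP: burn cycles until the deadline

start = time.perf_counter()
paced_send(5, 0.001)                     # 5 packets, 1 ms apart
elapsed = time.perf_counter() - start
```

The loop guarantees the gap is never shorter than intended, but, as discussed above, it cannot guarantee it is never longer: any preemption during the wait makes the process miss its deadline.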
SABUL’s rate adjustment scheme has been removed from FRTP. Therefore FRTP does not have
even the reactive flow control of SABUL. This is acceptable if we can be sure that flow control is
not required. The FRTP receiver architecture for a transfer to disk can be represented as shown in
Figure 3.2. Using the notation introduced in Section 3.1, the network thread handles the transfer
marked I and the main thread and the receiving application handle II and III respectively. The
process scheduler has to put the threads on the processor for the transfers to take place. Transfer III
additionally depends on how long the write to disk takes. These factors introduce variability into
the receiving rate. Buffers can hide this variability so that even a constant sending rate does not
cause buffer overflow. For a sending rate S(t) held at a constant value S, a receiving rate R(t) and a
receive buffer of size B, for no loss to occur:
S·τ − ∫₀^τ R(t) dt ≤ B,   ∀ τ ∈ [0, T]    (3.1)
[Figure: the receiver data path. UDP buffer (kernel space) to protocol buffer to application buffer (user space) to disk; the three transfers between them are labeled I, II and III.]
Figure 3.2: Need for receiver flow control
where [0,T ] is the transfer interval. The (false) assumption behind our initial belief that it is enough
to just stream the data at the reserved circuit rate was that equation (3.1) holds throughout the
transfer. From our experiments we realized that not only is R(t) varying, we do not even know a
closed form expression for it, making the choice of S and B to satisfy equation (3.1) difficult. A
pragmatic approach is to assign sensible values to S and B, so that (3.1) is satisfied most of the time.
When it is not satisfied, there are losses and the error control algorithm will recover from the loss.
This is what we were seeing in our laboratory experiments (but with S(t) also varying with time).
A flow control protocol would attempt to ensure that the above equation is satisfied all the time, by
varying S(t). Unfortunately this implementation of FRTP has no flow control.
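The role of B in inequality (3.1) can be checked with a discrete-time simulation. The rates and buffer sizes below are arbitrary illustrative numbers, and the function name is our own; buffer occupancy at step τ is the discrete form of S·τ minus the cumulative receiving rate.

```python
def overflows(S, R, B):
    """True if a constant sending rate S ever overflows a buffer of size B,
    given per-interval receiving (draining) rates R[t].  Occupancy is the
    discrete-time form of S*tau - sum of R(t) in inequality (3.1)."""
    occupancy = 0.0
    for r in R:
        occupancy = max(0.0, occupancy + S - r)   # net fill in this interval
        if occupancy > B:
            return True
    return False

S = 10                     # units of data per interval (constant sending rate)
steady = [10, 10, 10, 10]  # receiver keeps up: no buffering needed
stalled = [10, 0, 0, 10]   # receiver stalls (e.g. a slow disk write)

assert not overflows(S, steady, B=0)
assert overflows(S, stalled, B=15)      # backlog reaches 20 > 15
assert not overflows(S, stalled, B=20)  # a larger buffer absorbs the stall
```

The sketch makes the dilemma concrete: without a closed form for R(t), choosing B is a guess about the worst-case stall, which is precisely why flow control is needed instead.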
3.2.2 Possible Solutions
Our attempts to solve the two problems we identified with the FRTP implementation, the use of busy waiting to ensure a steady rate and the lack of flow control, are described next. The ideal solution
for maintaining a fixed inter-packet gap would involve transmitting a packet, giving up the processor
and reclaiming it when it is time to send the next packet. Linux offers a system call to relinquish
the processor. To see why it is not possible to reclaim the processor at a deterministic future time
it is essential to understand how the Linux scheduler schedules processes to run. Two queues (for
our purposes only two of the queues are important) are maintained, one of processes that are ready
to run (the RUNNABLE queue) and the other of processes that are waiting for some condition that
will make them ready to run (the INTERRUPTIBLE queue). For instance, if a process executes
instructions to write to disk, it is put in the INTERRUPTIBLE queue. When the write to disk
completes and the hard drive interrupts the processor the process is put back in the RUNNABLE
queue. So what happens when, after transmitting a packet, the FRTP sending process gives up the
CPU? Usually, the system call used to relinquish the processor allows the process to specify a time
after which it is to be made runnable again. The process is put in the INTERRUPTIBLE queue and
when the operating system determines that the time for which the process had asked to sleep has
passed, it is put back in the RUNNABLE queue. The problem arises because the operating system
uses the timer interrupts (which have a 10 ms period in Linux) to check whether the sleep time has
passed. Therefore if a process asked to sleep for 1 second, it is guaranteed to become runnable
after a time between 1.0 and 1.01 seconds, but if it asks to sleep for 100 µs it will become runnable
after some time between 100 µs and 10100 µs. Note that if we give this process the highest priority
then its becoming runnable implies that it runs on the processor, so we ignore the scheduling delay
between a process becoming ready to run and actually running. Thus on Linux (and other operating
systems that don’t support real-time processes) it is not possible for a user space process to send
packets monotonically at a high rate.
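The "at least as long as requested" behavior of a short sleep can be observed directly. A small sketch (the 100 µs request mirrors the example above; note that a modern kernel with high-resolution timers overshoots by far less than the 10 ms tick described here):

```python
import time

requested = 100e-6                    # ask to sleep 100 microseconds
t0 = time.perf_counter()
time.sleep(requested)
elapsed = time.perf_counter() - t0

# The kernel only guarantees a lower bound: the process becomes
# runnable some time between `requested` and `requested` plus the
# timer-check granularity.
```

Only the lower bound is guaranteed; the upper bound depends entirely on how often the operating system checks whether sleep times have expired.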
An alternate approach would be to maintain the sending rate, not on a packet-by-packet basis,
but in a longer time frame. This can be done by ensuring that N packets are sent every T units
of time such that (N/T ) is the desired sending rate. This would cause a burst of N packets in the
network so we would like to keep T as small as possible. In the limit N becomes 1 and we get what
SABUL attempts to implement. The sending process should get a periodic impulse every T units
of time and in response send out the N packets. Linux offers user-space processes the ability to
receive such periodic impulses in the form of signals. A process can use the setitimer() system call
to activate a timer. This timer causes a signal to be sent periodically to the process. We modified the
FRTP code to use periodic signals to maintain the sending rate. This reduced the CPU utilization at
the sender compared to the earlier busy waiting scheme. But the lack of real-time support on Linux
meant that even if the signals were being sent like clockwork the user-space process was not always
able to start sending the next burst of packets immediately. We observed that occasionally some
signals would be missed because an earlier one was still pending.
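The periodic-impulse scheme can be sketched with the same mechanism exposed at a higher level (Unix-only; the 10 ms period T, the burst counter and the handler name are our own illustrative choices, and the real burst transmission is reduced to a comment):

```python
import signal

bursts = 0

def on_tick(signum, frame):
    """Runs on each SIGALRM: the burst of N packets would be sent here."""
    global bursts
    bursts += 1

signal.signal(signal.SIGALRM, on_tick)
# Fire every 10 ms: with N packets of 1500 bytes per tick, the average
# sending rate is N * 1500 * 8 / 0.01 bits per second.
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)
while bursts < 5:
    signal.pause()                        # relinquish the CPU until the next impulse
signal.setitimer(signal.ITIMER_REAL, 0)   # disarm the timer
```

Because the interval timer keeps firing, a signal that arrives just before pause() is not fatal: the process simply wakes on the next tick, which is the same best-effort periodicity (and the same missed-signal risk) observed in the FRTP experiments.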
We now consider the problem of adding flow control to FRTP. Since flow control is supposed to
avoid receiver buffer overflow, the data receiver is best placed to provide the information based on
which the sender can control the flow of data. SABUL’s sending rate adjustment in response to lost
packets is a form of flow control that does not use explicit information from the receiver. SABUL’s
flow control scheme was not very effective since we observed substantial loss and retransmission.
To be able to send back buffer status information, the receiver has to have timely access to this information. Although the FRTP receiver can accurately determine how much free space is available in the protocol and application buffers (see Figure 3.2), it does not have access to the current status
of the UDP buffer in kernel memory. The kernel does not make any effort to avoid UDP buffer
overflows. The filling and emptying of a user space buffer are fully in the control of a user space
process. So if a user space buffer is short on free space, the process can choose not to read in more
data. With the UDP buffer the kernel has no control over the filling of the buffer since packets arrive
asynchronously over the network. That is why flow control is necessary to prevent the UDP buffer
from overflowing. Therefore, any flow control scheme which requires explicit buffer status infor-
mation from the receiver would need support from the kernel. By choosing to implement FRTP in
the user space over UDP, we lose the opportunity to implement such a flow control scheme.
Chapter 4
TCP-BASED SOLUTION
In the previous chapter we pointed out the shortcomings of a UDP-based transport protocol that
were uncovered while implementing FRTP using SABUL. We realized that more support from
the operating system would be required to better match the behavior of the end hosts with that of
the network in which resources were reserved. This chapter describes our efforts to implement a
transport protocol for dedicated circuits that is more closely tied in with the operating system than
the user-space FRTP. Our protocol is based on the TCP implementation in Linux. To reiterate this
fact, we call this protocol Circuit-TCP (C-TCP).
In this chapter, first an overview of TCP is presented. Then we look at the advantages of using
TCP to implement a transport protocol for dedicated circuits. Next, we present the implementation
of C-TCP. C-TCP has been tested on the CHEETAH testbed. Results from these experiments and a
discussion of their significance concludes this chapter.
4.1 Transmission Control Protocol - A Primer
TCP is the transport protocol of the TCP/IP suite of protocols. It is a connection-oriented protocol
that provides reliability, distributed congestion control and end-to-end flow control. Note that the
meaning of TCP being a ‘connection-oriented’ protocol is different from the use of the phrase in
‘connection-oriented network’. In order to provide its end-to-end services, TCP maintains state
for each data stream. Thus, TCP creates a connection between two end points wishing to communicate reliably (the end points can be processes on end hosts), maintains state information about
the connection and disconnects the two end points when they no longer need TCP’s service. In
a connection-oriented network, a connection refers to physical network resources that have been
reserved, and that taken together form an end-to-end path.
Applications wishing to use TCP’s service use the sockets interface that the TCP/IP stack in the
operating system provides. Two processes that want to use TCP to communicate create sockets and
then one of the processes connects its socket to the remote socket. A connection is established if
the connection request is accepted by the remote end. TCP uses a 3-way handshake to establish a
connection. Connection establishment also initializes all of the state information that TCP requires
to provide its service. This state is stored in the data structures associated with the sockets on each
end of a connection. We now present brief descriptions of four of TCP’s functions. For a more
detailed description please see [29], [8] and [1].
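The sockets interface described above can be exercised in a few lines. A minimal sketch (the loopback address, echo behavior and payload are our own arbitrary choices): connect() triggers the SYN, SYN-ACK, ACK exchange, and accept() completes it on the passive side.

```python
import socket, threading

# Server: create a socket, listen, accept one connection, echo the data.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: let the kernel pick a free port
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()          # completes the 3-way handshake
    conn.sendall(conn.recv(1024))      # echo the data back
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Client: connect() initiates the 3-way handshake.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello over TCP")
reply = client.recv(1024)
client.close()
t.join()
server.close()
```

All of the per-connection state discussed above (sequence numbers, windows, timers) lives in the kernel data structures behind these two socket objects.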
4.1.1 Error Control
Each data byte transferred by TCP is assigned a unique sequence number. During connection
establishment the two ends of a connection exchange starting sequence numbers. The TCP at the
receiving end maintains information about sequence numbers that have been successfully received,
the next expected sequence number and so on. The receiver can make use of the sequence numbers
of received data to infer data reordering with certainty, but not data loss. In fact, neither the TCP
at the sender nor the one at the receiver can reliably detect packet loss since a packet presumed lost
could just be delayed in the network. TCP uses acknowledgements (ACKs) of successfully received
data and a sender-based retransmission time-out (RTO) mechanism to infer data loss. The time-out
value is calculated carefully using estimates of RTT and RTT variance, to reduce the possibility of
falsely detecting loss or waiting too long to retransmit lost data. An optimization that was proposed
and has been widely implemented is the use of triple duplicate ACKs to infer loss early rather than
wait for the RTO to expire. A TCP receiver sends back a duplicate ACK whenever an out-of-order
packet arrives. For instance, suppose packets Pn, Pn+1, Pn+2, Pn+3 and Pn+4 contain data that is
contiguous in the sequence number space. If Pn+1 goes missing, then the receiving TCP sends back
duplicate ACKs acknowledging the successful receipt of Pn when Pn+2, Pn+3 and Pn+4 arrive. On
getting 3 duplicate ACKs, a TCP sender assumes that the data packet immediately following the
(multiply) ACKed data was lost. The sender retransmits this packet immediately. This is called fast
retransmit. As was pointed out in Chapter 2, many enhancements to TCP have been proposed and
implemented, such as the use of SACKs, that improve TCP’s loss recovery, among other things.
4.1.2 Flow Control
Flow control allows a receiving TCP to control the amount of data sent by a sending TCP. With
each ACK, the receiving TCP returns the amount of free space available in its receive buffer. This
value is called the receiver advertised window (rwnd). The sending TCP accomplishes flow control
by ensuring that the amount of unacknowledged data (the demand for receiver buffer space) does
not exceed rwnd (the supply of buffer space on the receiver).
4.1.3 Congestion Control
The original specification of TCP [29] did not have congestion control. TCP’s congestion control
algorithm was proposed in [21]. Just as flow control tries to match the supply and demand for the
receiver buffer space, congestion control matches the supply and demand for network resources
like bandwidth and switch/router buffer space. This is a much more complex problem because
TCP is designed to work on packet-switched networks in which multiple data flows share network
resources. TCP’s congestion control algorithm is a distributed solution in which each data flow
performs congestion control using only its own state information, with no inter-flow information
exchange.
TCP congestion control is composed of three parts.
1. Estimate the current available supply of the network resources and match the flow’s demand
to that value.
2. Detect when congestion occurs (i.e. demand exceeds supply).
3. On detecting congestion, take steps to reduce it.
TCP maintains a state variable, congestion window (cwnd), which is its estimate of how much
data can be sustained in the network. TCP ensures that the amount of unacknowledged data does
not exceed cwnd,¹ and thus uses cwnd to vary a flow’s resource demand. Since a sending TCP
has no explicit, real-time information about the amount of resources available in the network, the
cwnd is altered in a controlled manner, in the hope of matching it to the available resources. The
cwnd is increased in two phases. The first phase, which is also the one in which TCP starts, is
called slow start. During slow start cwnd is incremented by one packet for each returning ACK that
acknowledges new data. Thus, if cwnd at time t was C(t), all of the unacknowledged data at t would
get acknowledged by time (t + RTT) and C(t + RTT) would be C(t) + C(t) = 2 × C(t). Slow start
is used whenever the value of cwnd is below a threshold value called slow start threshold (ssthresh).
When cwnd increases beyond ssthresh, TCP enters the congestion avoidance phase in which the rate
of cwnd increase is reduced. During congestion avoidance, each returning ACK increments cwnd
from C to (C + 1/C). An approximation used by many implementations is to increment C to (C + 1)
at the end of an RTT (assuming the unit for cwnd is packets).
The second component of congestion control is congestion detection. TCP uses packet loss as
an indicator of network congestion. Thus, each time a sending TCP infers loss, either through RTO
or triple duplicate ACKs, it is assumed that the loss was because of network congestion. Other
congestion indicators have been proposed. For instance, in Chapter 2 we mentioned that FAST
uses queueing delay to detect network congestion. Some researchers have proposed that a more
proactive approach should be adopted, and congestion should be anticipated and prevented, rather
than reacted to. Such a proactive approach would require congestion information from the network
nodes. See [5] for a discussion of the Active Queue Management (AQM) mechanisms that routers
need to implement, and [15] for a description of the Random Early Detect (RED) AQM scheme.
In [30], the modifications that need to be made to TCP in order to take advantage of the congestion
information provided by routers using AQM are presented.
The third component of congestion control is taking action to reduce congestion once it has been
detected. The fact that congestion occurred (and was detected) means that TCP’s estimate of the
¹Recall that flow control requires the amount of unacknowledged data to be less than rwnd. TCP implementations use min(rwnd, cwnd) to bound the amount of unacknowledged data.
available network resource supply is too high. Thus, to deal with congestion, TCP reduces its
estimate by cutting down cwnd. On detecting loss, the sending TCP first reduces ssthresh to half of
the flight size, where flight size is the amount of data that has been sent but not yet acknowledged
(the amount in flight). The next step is to reduce cwnd. The amount of reduction varies depending
on whether the loss detection was by RTO or triple duplicate ACKs. If an RTO occurred then the
congestion in the network is probably severe, so TCP sets cwnd to 1 packet. The receipt of duplicate
ACKs means that packets are getting through to the receiver and hence congestion is not that severe.
Therefore, in this case cwnd is set to (ssthresh + 3) packets and incremented by 1 packet for each
additional duplicate ACK. This is called fast recovery.
The linear increase of cwnd by one packet per RTT, during congestion avoidance, and its
decrease by a factor of two during recovery from loss is called Additive Increase Multiplicative
Decrease (AIMD). TCP uses an AI factor of one (cwnd ← cwnd + 1) and an MD factor of two
(cwnd ← cwnd × (1 − 1/2)).
4.1.4 Self Clocking
Although TCP does not explicitly perform rate control, the use of ACK packets leads to a handy
rate maintenance property called self clocking [21]. Consider the situation shown in Figure 4.1.
The node marked SENDER is sending data to the RECEIVER that is three hops away.² The links
LINK1, LINK2 and LINK3 are logically separated to show data flow in both directions. The width
of a link is indicative of its bandwidth, so LINK2 is the bottleneck in this network. The shaded
blocks are packets (data packets and ACKs), with packet size proportional to a block’s area. The
figure shows the time instant when the sender has transmitted a window’s worth of packets at the
rate of LINK1. Because all these packets have to pass through the bottleneck link, they reach the
receiver at LINK2’s rate. This is shown by the separation between packets on LINK3. The receiver
generates an ACK for each successfully received data packet. If we assume that the processing time
for each received data packet is the same, then the ACKs returned by the receiver have the same
spacing as the received data packets. This ACK spacing is preserved on the return path. Each ACK
²This figure is adapted from one in [21].
allows the sender to transmit new data packets. If a sender has cwnd worth of data outstanding in
the network, new data packets are transmitted only when ACKs arrive. Thus, the sending rate (in
data packets per unit time) is maintained at the rate of ACK arrival, which in turn is determined by
the bottleneck link rate. This property of returning ACKs ‘clocking’ out data packets is called self
clocking.
[Figure omitted: a SENDER, two NETWORK NODEs and a RECEIVER connected in series by LINK1, LINK2 and LINK3, each link split to show DATA in one direction and ACKs in the other. Link widths reflect bandwidth, with LINK2 the narrowest.]
Figure 4.1: TCP self clocking
4.2 Reasons for Selecting TCP
In Chapter 3, two problems were identified in a user-space UDP-based implementation of FRTP.
1. Use of busy waiting to maintain a fixed inter-packet gap, and thus a fixed rate, does not work
very well. Even if it did work perfectly, it is wasteful of CPU cycles.
2. The difficulty of maintaining a fixed receiving rate makes flow control very attractive. A
proactive scheme, in which the receiver is able to prevent buffer overflow, requires kernel
support that a user space FRTP cannot get. By removing SABUL’s rate-based congestion
control, FRTP forgoes SABUL’s reactive flow control too. Thus, FRTP has null flow control.
In this section, two issues are addressed: first, whether TCP is better at tackling the two problems
listed above; second, whether there are other issues unique to TCP that need to be considered.
The description of TCP’s slow start and AIMD schemes in Section 4.1.3 shows that TCP does
not maintain a fixed sending rate. TCP is designed with the assumption that the available bandwidth
in the network (called ‘supply’ in Section 4.1) is changing over time, as other data flows start or end,
and that its instantaneous value is not known. TCP’s congestion control algorithms attempt to match
a flow’s sending rate to the available network bandwidth, in spite of this incomplete knowledge. But
such a sending-rate-altering algorithm is not needed on dedicated circuits.
If we assume that TCP’s congestion control can be disabled, how well can TCP maintain a fixed
sending rate and at what granularity? The self clocking property provides a low-overhead way to
maintain a steady sending rate. In steady state, each returning ACK clocks out a data packet so a
steady sending rate can be maintained at a granularity of packets. Moreover, packet transmission is
initiated as a result of an interrupt (the NIC raises an interrupt when an ACK is received), and so is
much less likely to be disturbed by the behavior of the process scheduler. This is a major advantage
of shifting the responsibility of maintaining a steady rate to the kernel domain.
The variability in the receiving rate is because of the receiving application’s interaction with the
process scheduler and the disk. This problem is not solved by using a different transport protocol.
But, TCP’s flow control is designed to minimize the impact of such variability on data transfer
performance. TCP uses a window-based flow control scheme (see Section 4.1.2) that prevents
receive buffer overflow, unlike SABUL, which reacts to packet loss caused by buffer overflow.
TCP appears to adequately deal with the two problems identified in implementing FRTP. In
addition there are a few other reasons for choosing TCP which we point out next. Once it had been
established that flow control required kernel support, our choice was essentially made. We did not
have the expertise to implement a kernel-space protocol starting from scratch. So, our protocol had
to be implemented by modifying an existing, stable kernel-space transport protocol. TCP and UDP
are so widely used and well understood that, unless some other protocol is clearly more suitable, it
makes sense to modify TCP or UDP. Another reason for choosing to use TCP is that error control
comes for free. In the next section, the protocol design for C-TCP is presented and it should be clear
that for the majority of transport protocol functions, what TCP implements works—regardless of
whether the underlying network is connectionless or connection-oriented.
So is TCP the answer to all our problems? Well, no. Without any modifications TCP’s conges-
tion control algorithm is not suitable for use over a dedicated circuit. One of the main differences
between TCP and C-TCP is the congestion control algorithm used. We describe C-TCP in more
detail in the next two sections. A practical issue with any kernel-space modification is reduced ease of
use: unlike a user-space application, which can simply be downloaded, built and installed, a kernel
modification requires the host to be rebooted.
4.3 Circuit-TCP Design
In this section the design of C-TCP is described. Five functions of a transport protocol are con-
sidered, namely connection establishment, congestion control, multiplexing, flow control and error
control. For each of these functions, we consider whether it is required on a dedicated circuit and if
so, how to provide the function.
4.3.1 Connection Establishment
It is useful in the design of a transport protocol to think in terms of control and data planes. Control
plane functions support the data plane. For instance, TCP’s three-way handshake for connection
establishment is used to agree upon an initial sequence number to be used in the data transfer that
follows. C-TCP, too, requires state to be maintained for each data flow. The connection-establishment
and release schemes are used unaltered from TCP.
4.3.2 Congestion Control
Network congestion occurs when the demand for resources (e.g., bandwidth, switch buffers) ex-
ceeds the available supply. Congestion control attempts to match the demand to the supply. On
connection-oriented networks this matching occurs in two phases. The first phase of congestion
control occurs in the control plane and makes the decision of whether sufficient resources are available
for the data plane transfer to begin. It consists of the steps shown in Figure 4.2. The transport
protocol can react to congestion (i.e., circuit set up failure) in many ways. In this figure, the dotted
arrow shows one possibility—retrying circuit set up (possibly after waiting a random amount of
time). Another possibility would be to return an error. If no congestion is detected in this first
phase, the data transfer can start.
[Figure omitted: a flowchart in which the transport protocol initiates a request for a circuit by invoking a signaling protocol client, awaits the circuit set up status, initiates the data transfer on SUCCESS, and handles the circuit set up failure on FAILURE.]
Figure 4.2: Congestion control in the control plane
The second phase of congestion control occurs in the data plane and is responsible for main-
taining a steady data transfer rate, as closely matched to the reserved circuit rate as possible. This is
achieved by modifying TCP’s congestion control algorithms of slow start and AIMD. These modifi-
cations are described in the next section. Congestion control is the main function in which C-TCP’s
needs differ significantly from the services offered by TCP.
4.3.3 Multiplexing
Multiple processes within a single end host need to simultaneously use the transport layer’s commu-
nication services. To allow for this, the transport protocol has to distinguish between data destined
for or originating from different processes. This functionality is required when the simultaneously
running processes are using dedicated circuits too. When a process creates a TCP socket, a unique
port number is assigned to the socket. TCP uses the port number to multiplex and demultiplex.
This scheme works well for C-TCP too.
4.3.4 Flow Control
On a dedicated circuit, the sending rate should be matched to the reserved circuit bandwidth for op-
timal circuit utilization. Therefore null flow control would be the ideal choice. To send and receive
data at a fixed rate, resources, such as processor cycles and buffer space, need to be reserved on
the end hosts participating in the transfer. In practice, though, end hosts run general-purpose oper-
ating systems on which resource reservation is not possible without real-time support. Therefore,
flow control is required to match a sender’s demand for resources on the receiver with the available
supply. C-TCP requires flow control.
There are three well-known flow control methods: ON/OFF, rate-based and window-based [3].
The ON/OFF scheme is inefficient on a dedicated circuit because the circuit lies unused while a
sender waits for an ON signal. In a rate-based scheme, the receiver sends signals to the sender
to control the sending rate. The receiver needs to have a good estimate of its receiving capability
during the whole transfer, which is hard to implement. On the other hand, it is much simpler to
monitor the status of the receive buffer during the data transfer and send back window size reports.
TCP’s window-based flow control (see Section 4.1.2) is a good match for C-TCP’s flow control
requirements.
4.3.5 Error Control
Error control adds reliability to the data transfer by ensuring that all the data reaches the receiver
in order and without duplicates. Sources of errors in today’s wired networks are typically buffer
overflows at the packet switches and at the receiving end host. Dedicated circuits imply that re-
sources are reserved for the data transfer and thus no losses should occur at network switches. The
window-based flow control scheme should prevent losses from occurring at the receiver. How-
ever, link transmission errors are still possible and buggy implementations (of, e.g., the network
switches) could cause packet loss, reordering or duplication. Therefore, mechanisms to ensure end-
to-end reliability are needed in C-TCP. As TCP’s error control mechanism is able to handle network
unreliability, we make no modifications to it for C-TCP.
Finally, we note that because C-TCP uses TCP’s solution unmodified for four of the five transport
protocol functions identified at the beginning of this section, TCP’s packet format also does not need
to be changed. This is because the one function in which C-TCP differs from TCP, namely congestion
control, does not have an impact on the protocol header format. See [34] for the TCP protocol format.
4.4 C-TCP Implementation
The implementation of a transport protocol for dedicated circuits can be split into two parts. One
handles the control plane functions and the other handles the data plane functions. The control
plane functionality can be implemented to be independent of whether the transport protocol is UDP-
based or TCP-based. To focus on TCP-specific aspects of the implementation here, we postpone
discussion of the control plane component to the following chapter.
From the last section it is clear that major portions of an existing TCP implementation can be
used for C-TCP. C-TCP has been implemented by modifying the TCP stack in the Linux 2.6.11
kernel. The data plane function of C-TCP described in this section is that of maintaining a steady
data transfer rate, closely matched to the reserved circuit rate. As mentioned in Section 2.3, end-
host applications have a choice to use TCP over an Internet path going through the primary NIC, or
C-TCP over a CHEETAH circuit through the secondary NIC. Thus, an additional requirement on
the C-TCP implementation is that it should co-exist with TCP. One way to achieve this would be
to have separate kernel stacks for TCP and C-TCP, and to provide an interface to C-TCP (like the
sockets interface to TCP). Because the majority of TCP code can be used by C-TCP as is, we chose
to implement C-TCP by having only a single stack and API. This meant that the protocol stack must
be able to distinguish between packets from applications using TCP and those using C-TCP, and
treat them accordingly. To allow a user-space application to set parameters in the kernel indicating
that it wants to use C-TCP, we use the Web100 patch to the Linux kernel. A brief outline of Web100
is presented next.
[Figure omitted: in user space, an application plus the Web100 tool and library; in kernel space, the TCP/IP stack over the NIC together with the kernel instrumentation set, connected to user space through the sockets API and the /proc filesystem.]
Figure 4.3: Structure of the Web100 stack
4.4.1 Web100
The Web100 (and the related Net100) project [26] aims to address the issue of information hiding
that the layered TCP/IP stack entails. For instance, the effect seen by an application for any kind
of network problem is a longer elapsed time, whereas users would benefit from getting a more
detailed view of the inner workings of TCP and lower layers. The Web100 team has released
an instrumented TCP stack. The Web100 stack provides an interface for user space programs to
access many of TCP’s internal state variables. This interface is provided through the Linux /proc
filesystem. The /proc filesystem is a virtual filesystem that resides entirely in main memory;
certain kernel variables can be accessed, using regular filesystem commands, through ‘files’
under the /proc ‘directory’. The structure of the Web100 stack³ is shown in Figure 4.3. The kernel
instrumentation set (KIS) interacts with the network stack and interfaces to the user space through
the /proc filesystem. A Web100 library provides a standard set of calls to interact with the KIS
through /proc.
The same interface also allows a few fields, in the internal data structure that Linux maintains
for each TCP socket, to be set from user space. We make use of this capability in C-TCP. Associated
with each TCP socket that is created is a struct tcp_sock data structure. In the Web100
stack, a pointer to a struct web100stats data structure is added to the tcp_sock structure. The
³This figure is adapted from one in [26].
web100stats structure holds, for each TCP socket, all the Web100 relevant information. In the
TCP stack, at points where interesting TCP state variables change, the corresponding field in that
socket’s web100stats structure is updated. In the same way when a user-space process uses the
/proc filesystem to modify some TCP kernel parameter of a socket, the corresponding field in the
web100stats structure gets set. If the TCP stack has been modified to check for and use the value
of a field in the socket’s web100stats data structure, then a user space process can control the
behavior of the TCP stack in real-time, on a per-connection basis.
4.4.2 Implementation Details
The self clocking property of TCP (see Section 4.1.4) can help maintain a steady sending rate
provided each ACK clocks out the amount of data that it is acknowledging. This would ensure that
the amount of unacknowledged data in the network remains fixed. But, TCP uses slow start and
AIMD schemes that constantly modify the amount of outstanding data. C-TCP tries to maintain a
steady data transfer rate by disabling TCP’s slow start and AIMD.
Bandwidth reservation in the network means that the amount of outstanding data that the net-
work can sustain is fixed at the bandwidth delay product BDP (= circuit rate × RTT). Suppose a
sender transmits a packet at time t = 0 and the ACK for that packet comes back at t = RTT. The
BDP represents the amount of data that is unacknowledged just before t = RTT, if the sender has
been continuously transmitting at the circuit rate. Note that, since network nodes have buffers, a
sender could transmit more than BDP amount of data in an RTT interval of time; the data in excess
of the BDP would be stored in buffers. If the receiver’s buffer space is not a limiting factor, then
keeping less than BDP amount of data in the network is detrimental to circuit utilization. On the
other hand, pushing more and more data into the network, filling up the network buffers, experienc-
ing loss and then reducing the sending rate is also not desirable.
C-TCP maintains a fixed amount of unacknowledged data, greater than or equal to the BDP, in
the network. We added two control parameters to the Web100 stack, in order to achieve this.
1. useckt to select whether a TCP socket is to be used as one end of a CHEETAH circuit.
2. ncap (network capacity) to set the amount of unacknowledged data that is to be maintained
in the network.
The useckt parameter is required so that a single TCP stack can be used to provide standard TCP
service on Internet paths and C-TCP service when a reserved circuit is being used. In order to
disable the rate-altering congestion control schemes of TCP, we modified the TCP sender code to
ignore the congestion window cwnd, and instead maintain min(ncap, rwnd) amount of unacknowledged
data in the network throughout the transfer when useckt is set. Since flow control is still required,
the check against rwnd is retained.
Another change that we needed to make was in the TCP receiver code. Linux implements a
slow-start-like scheme to update rwnd. A receiver starts off advertising an rwnd that is small (two
packets) and increments it as more data packets are received, regardless of what the actual available
receive buffer is. This makes rwnd a bottleneck during the initial part of the transfer and defeats
the purpose of the changes made at the sender. Therefore, we modified the TCP receiver code to
advertise the maximum possible rwnd when the socket is being used over a CHEETAH circuit.
A consequence of using the fixed ncap, in place of cwnd, is that TCP’s slow start is disabled.
Thus, early on in a transfer, C-TCP utilizes the reserved bandwidth better, since for long delay
networks, slow start can reduce the transfer throughput significantly. However, slow start serves to
start up self clocking without sending a large burst of packets. In a CHEETAH network, an end
host’s 1 Gbps Ethernet NIC is connected to a dedicated Ethernet-over-SONET circuit. Disabling
slow start is not a problem if the circuit rate and the NIC rate are matched,⁴ since the sender cannot
transmit a burst. When the circuit rate is set to be less than 1 Gbps, C-TCP will send an initial burst
of packets that could cause buffer overflows at the first downstream MSPP. Consider a NIC rate N,
a circuit rate C and an RTT of T. In Figure 4.4, the line with slope N represents the amount of data
coming into the first downstream MSPP from the sender, and the line with slope C represents the
amount of data going out on the circuit from this MSPP. Since the sender does not send more than
BDP (= C×T ) amount of data in an RTT, the maximum buffer space required at the MSPP (marked
⁴A circuit rate greater than the NIC rate is also acceptable, but setting up a circuit that the end host cannot keep fully utilized does not make sense.
[Figure omitted: amount of data versus time at the first downstream MSPP; a line of slope N (arrivals from the sender) up to time T1, a line of slope C (departures on the circuit) up to T = RTT, the BDP level, and the peak buffer occupancy B.]
Figure 4.4: Maximum buffer space required for a C-TCP burst
B in the figure) is given by

B = BDP − C × T1 = C × T − C × (C × T / N) = T × (C − C²/N)

where T1 = (C × T)/N is the time at which the sender, transmitting at rate N, finishes sending BDP worth of data.
For a given T and N, the value of B is maximized when C = N/2. For instance, with N = 1 Gbps and
T = 200 ms, the maximum buffer space required at the MSPP is 6.25 MB. Our experimental results
show that MSPPs are capable of buffering this much data. In addition, the gigabit Ethernet cards in
MSPPs support data-link layer flow control in the form of PAUSE frames [11]. This too prevents
buffer overflow.
4.5 Experimental Results
The Linux implementation of C-TCP described in the previous section has been tested on the
CHEETAH experimental testbed. In this section these results are presented and discussed.
The portion of the testbed relevant for our experiments is shown in Figure 4.5. The path of
the reserved circuit is shown as a thicker line. The blocks marked zelda1 through zelda5 and
wukong are end hosts with a primary Internet-connected NIC and a secondary NIC (shown shaded).
The modified Linux 2.6.11 kernel was loaded on hosts zelda4 at ORNL, zelda3 in Atlanta and
wukong at MCNC. The hosts zelda3 and zelda4 are Dell PowerEdge 2850s, with dual 2.8 GHz
Xeon processors and 2 GB of memory. Wukong is a Dell PowerEdge 1850 with a 2.8 GHz Xeon
[Figure omitted: the end hosts zelda1 through zelda5 and wukong at ORNL, TN; SOX/SLR, GA; and MCNC/NCSU, NC, connected through Juniper routers (Internet path) and two Sycamore SN16000s, each with crossconnect, OC-192 and control cards (circuit path).]
Figure 4.5: Testbed configuration for C-TCP tests
processor and 1 GB of memory. All three have an 800 MHz front side bus, 146 GB SCSI disks and
a PERC4 RAID controller. The Sycamore SN16000s, which serve as MSPPs, are also shown.
In the experiments we are trying to compare the behavior of standard TCP and C-TCP over
dedicated circuits. To make this comparison, we use the iperf application (version 1.7.0) [35]
for memory-to-memory transfers. The iperf code was modified to allow selection of TCP or C-
TCP from the command line. For disk-to-disk transfers, a simple file transfer application was
implemented. An important control-plane function of C-TCP is selecting the rate of the circuit
to set up. For memory-to-memory transfers the bottleneck will usually be the NIC rate, since the
memory operations have a throughput much higher than the NIC’s 1 Gbps rate. Proper selection of
the circuit rate is therefore more critical for disk-to-disk transfers. Results for disk-to-disk transfers
are presented in the next chapter, after we describe C-TCP’s control plane functions.
4.5.1 Utility of Disabling Slow Start
The first set of experiments show the utility of disabling slow start in C-TCP. A 1 Gbps circuit was
set up between zelda4 and wukong. We ran memory-to-memory data transfers for various transfer
[Figure omitted: average throughput (Mbps) for TCP and C-TCP, and their relative delay, plotted against the amount of data transferred (10 KB to 100,000 KB, logarithmic scale).]
Figure 4.6: TCP and C-TCP comparison for different transfer sizes
sizes and gathered throughput and delay values. In Figure 4.6, TCP and C-TCP transfer throughput
and their relative delay (delay using TCP/delay using C-TCP) are plotted against transfer size. For
data transfers of a few MB, slow start takes up a substantial portion of the total transfer time. The
relative delay plot shows the utility of disabling slow start for such transfers. The data transfer
finishes in less than half the time when C-TCP is used, compared to the time it takes for TCP.
For very small transfers the three-way handshake connection establishment overhead dominates the
transfer time. As is to be expected, the returns of disabling slow start diminish as transfer size
increases. The throughput achieved by TCP and C-TCP start to converge as the amount of data
transferred is increased.
To show the difference in start-up behavior between C-TCP and TCP we captured packet traces
for a 5 MB transfer, using tcpdump. We then used tcptrace to generate, from the trace files, the plot
shown in Figure 4.7. Sequence numbers of received data bytes (relative to the first received data
byte) are plotted as a function of the time when the byte was received (relative to the start of the data
transfer). For ease of comparison, the plots for C-TCP and TCP are shown superimposed. For this
particular transfer, when using C-TCP the data transfer finishes around 0.06 s after starting, whereas
[Figure: sequence offset vs. relative time (seconds), with curves for C-TCP window, C-TCP data, TCP window, and TCP data]
Figure 4.7: Start-up behavior of TCP and C-TCP
only about a third of the total data has been received by that time using TCP. The plot also shows
the window advertised by the receiver. For instance, at time 0.06 s, the TCP receiver has received
all data bytes with relative sequence number less than 1.3 million. The TCP advertised window
at that time is 3.8 million, which means the receiver has buffer space to receive 2.5 MB of data.
The slow-start-like growth of the receiver advertised window in Linux TCP that was mentioned in
Section 4.4.2 can be seen clearly. Also visible is the modified rwnd behavior for C-TCP. Finally,
note that the gains from disabling slow start will be even more pronounced when the RTT is higher.
In this experiment, the RTT between the end hosts zelda4 and wukong, as measured using ping,
was 13.6 ms.
4.5.2 Sustained Data Transfer
C-TCP tries to maintain a fixed amount of outstanding data in the network (see Section 4.4.2). Our
next set of experiments show the utility of this scheme when there is a mismatch between the end
host’s NIC rate and the rate of the reserved circuit. TCP is not designed to take advantage of the
fact that, when resources in the network are dedicated for a single transfer, the available network
capacity is fixed and known upfront. In trying to guess what the available bandwidth is, TCP
overshoots the network capacity, experiences loss, slows down and starts over again. The results in
this section show that such a behavior leads to lower throughput and also causes the network buffer
occupancy to increase.
A 500 Mbps circuit was set up between zelda4 and wukong, so there is a mismatch between the
circuit rate and the 1 Gbps NIC rate. A circuit with a rate lower than the NIC rate might be set up,
for instance, because the control plane component of the transport protocol determines that there
are end host bottlenecks (like disk access rate) or if not enough network resources are available.
We used iperf to conduct sustained memory-to-memory transfers lasting 600 seconds each. TCP
causes queues in the network nodes to build up (and eventually leads to packet loss). To illustrate this,
we used ping to collect RTT values periodically during the data transfer. Since longer queues at
the network nodes add waiting time to ping packets, the RTT values give an estimate of the buffer
occupancy behavior. The baseline RTT (when the network buffer queues are not full) between
zelda4 and wukong is 13.6 ms.
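The RTT inflation seen by ping gives a rough estimate of how much data is queued in the network nodes: queued bytes ≈ (RTT − base RTT) × bottleneck rate / 8. A sketch of that arithmetic (illustrative, not part of the measurement tooling):

```c
#include <assert.h>
#include <math.h>

/* Estimate of the data queued in network-node buffers from the measured
 * RTT, the no-queueing baseline RTT, and the bottleneck (circuit) rate. */
double queued_bytes(double rtt_s, double base_rtt_s, double rate_bps)
{
    double q = (rtt_s - base_rtt_s) * rate_bps / 8.0;
    return q > 0.0 ? q : 0.0;   /* RTT below baseline means no queueing */
}
```

By this estimate, an RTT peak of 140 ms against the 13.6 ms baseline on the 500 Mbps circuit corresponds to roughly 7.9 MB of queued data.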
[Figure: throughput (Mbps) and RTT (ms) vs. time (seconds) over the 600 s transfer]
Figure 4.8: Throughput and RTT using Reno-TCP
4.5.2.1 Reno-TCP Performance
Figure 4.8 shows the results using Reno-TCP.5 The upper plot shows the throughput (calculated
by iperf ) and the bottom plot shows the RTT over the 600 s data transfer. TCP’s need to fill up
the network buffers and experience loss, to get an estimate of the available capacity, explains the
throughput variability that can be seen in the figure. The throughput reductions also match the peaks
in measured RTT, showing that TCP experiences loss due to full network queues. The markedly
poor throughput between 430 and 550 s, though, is not explained by the RTT behavior
during the same period. The average throughput achieved using Reno-TCP over 600 seconds was
389 Mbps.
4.5.2.2 BIC-TCP Performance
Next we look at the performance of BIC-TCP. As was mentioned in Section 2.1.1, BIC-TCP has
been designed to address some of the issues that TCP has over high bandwidth, high delay networks.
The throughput and RTT plots for BIC-TCP are shown in Figure 4.9. The first thing to notice is
5 The Linux 2.6.11 kernel uses BIC-TCP’s congestion control algorithm by default. To use Reno in place of BIC, the sysctl variable net.ipv4.tcp_bic is set to 0.
[Figure: throughput (Mbps) and RTT (ms) vs. time (seconds) over the 600 s transfer]
Figure 4.9: Throughput and RTT using BIC-TCP
that the variability in achieved throughput is much lower compared to Reno-TCP. To explain this
we need to understand how the window adjustment used by BIC differs from Reno TCP’s AIMD.
BIC-TCP also uses packet loss to infer congestion but, unlike Reno-TCP, uses an MD factor of 1/8
(i.e., cwnd ← cwnd × (1 − 1/8)). This explains the quicker recovery of BIC-TCP after a loss. During
the congestion avoidance phase, BIC-TCP also uses an additive increase scheme, but the AI factor
is a function of the current cwnd. As the cwnd gets closer to a maximum window size Wmax, the
AI factor grows progressively smaller (see [39] for the details). We believe BIC-TCP’s congestion
avoidance scheme causes the network queues to stay full for a longer time, thus explaining the
increased RTT values seen in Figure 4.9. BIC-TCP is able to achieve an average throughput of 423
Mbps.
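The window adjustment just described can be sketched as follows. This is a simplified model of BIC-TCP's behavior per the description above; the real algorithm in [39] adds max-probing and further clamps, and the SMAX cap here is an assumed value.

```c
#include <assert.h>
#include <math.h>

#define BIC_BETA 0.125   /* multiplicative-decrease factor of 1/8 */
#define SMAX     32.0    /* cap on the per-RTT increase (assumed value) */

static double wmax = 0.0;   /* window size at which the last loss occurred */

/* On packet loss: remember the loss window and back off by 1/8. */
double bic_on_loss(double cwnd)
{
    wmax = cwnd;
    return cwnd * (1.0 - BIC_BETA);
}

/* Each RTT in congestion avoidance: binary-search toward wmax, so the
 * additive-increase step shrinks as cwnd approaches wmax. */
double bic_on_rtt(double cwnd)
{
    double step = (wmax - cwnd) / 2.0;
    if (step > SMAX) step = SMAX;   /* don't grow too aggressively */
    if (step < 1.0)  step = 1.0;    /* at least a Reno-like +1 per RTT */
    return cwnd + step;
}
```

With a loss at a window of 800 segments, the window drops only to 700 and then climbs back quickly at first, creeping as it nears 800, which matches the quick recovery and long full-queue periods discussed above.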
4.5.2.3 C-TCP Performance
Finally, in Figure 4.10, the results using C-TCP are shown. Because C-TCP tries to maintain a fixed
amount of outstanding data (equal to the BDP in this case) in the network at all times, it achieves
much more steady throughput. The RTT stays more or less at the baseline value throughout the
[Figure: throughput (Mbps) and RTT (ms) vs. time (seconds) over the 600 s transfer]
Figure 4.10: Throughput and RTT using C-TCP
transfer. Thus, C-TCP does not stress the switch buffers. The average throughput achieved by C-
TCP was 458 Mbps. As pointed out in [5], keeping the network node buffers empty most of the
time gives lower end-to-end delay and, paradoxically, better throughput. This is because empty
queues serve to buffer bursts of packets, whereas if a queue was more likely to be full, then bursts
could cause packet losses and reduction in the cwnd by the sending host. Thus, the performance
improvement of C-TCP is expected to be more pronounced in the presence of interfering traffic.
Chapter 5
CONTROL-PLANE FUNCTIONS
In this chapter we consider issues associated with the control plane functions of a transport protocol
for dedicated circuits. The division into control and data plane functions is particularly clear in
a transport protocol for connection-oriented networks. Reservation of the network resources that
are subsequently used in the data plane transfer is a control plane function. The control plane
component of the transport protocol can be implemented independent of the data plane component,
i.e., the issues considered in this chapter apply equally to the UDP-based FRTP (Chapter 3) and
the TCP-based C-TCP (Chapter 4). For ease of exposition, we use “C-TCP” instead of “a transport
protocol for dedicated circuits” in the rest of this chapter.
The control plane functions of the transport protocol are divided into two parts. The first is the
selection of a circuit rate and is covered in Section 5.1. Once the circuit rate is selected, the required
amount of network resources need to be reserved. This part is covered in Section 5.2.
5.1 Selecting the Circuit Rate
Proper selection of the rate of the circuit to set up is important to achieve a balance between two
data transfer performance criteria. If a high-rate circuit is set up, and the end hosts are not capable
of maintaining data flow at that rate, circuit utilization will suffer. On the other hand, being
overcautious and requesting a low-rate circuit would lead to longer transfer delays than necessary.
The circuit rate should be set to the maximum data transfer rate that can be sustained by the
[Figure: throughput (Mbps) for ten experiment runs each of TCP and C-TCP]
Figure 5.1: Throughput variability of disk-to-disk transfers
end hosts involved. The problem thus boils down to finding out this sustainable rate. Variability
in the data sending and receiving rate at the end hosts can arise due to factors such as multitasking
(see Section 2.2.4) and disk access rate variability (see Section 2.2.3). This makes the selection of
the optimum circuit rate a difficult problem. To illustrate this, a set of disk-to-disk transfers were
carried out using a simple file transfer program that we implemented. The achieved throughput
values for transferring a 1.6 GB file between zelda3 and zelda4 (see Figure 4.5), over a 1 Gbps
circuit, are shown in Figure 5.1. The variability of the throughput evident in the ten runs using TCP
and C-TCP shows the difficulty in assigning a single number to the performance of disk-to-disk
transfers. A pragmatic approach to this problem is the following.
1. To the extent possible, we require users of C-TCP to reduce the sources of variability, e.g.,
by not running other processes while the data transfer is in progress. We believe this is a
reasonable requirement in the scientific computing domain.
2. Use an empirical method to estimate the average transfer rate that the end hosts can support.
The disk write rate is the limiting factor, hence a disk I/O benchmark program can be used to
estimate the average rate at which data can be written to disk on the receiving host.
[Figure: transfer delay (seconds) and circuit utilization (%) vs. circuit rate (Mbps)]
Figure 5.2: Trade-off between circuit utilization and delay
If circuit utilization is not an important requirement, a user might choose to run other processes
simultaneously with the data transfer. The circuit rate, in this scheme, is only an empirically esti-
mated average, and this makes flow control all the more important.
To test the utility of this approach, we used the method described next. First, we obtain a
value for the circuit rate for which disk-to-disk transfers get acceptable delay and circuit utilization.
Then we use a disk I/O benchmark to determine whether that circuit rate can be estimated using
the benchmark. In Figure 5.2, the transfer delay and circuit utilization (throughput/circuit rate) are
plotted against circuit rate for file transfers between zelda3 and zelda4. A 1.6 GB file was transferred
in each case and circuits with rates ranging from 400 Mbps to 1 Gbps were used. Average values
and standard deviation for five runs are shown.
There are a few points worth noting in this plot. First, the shape of the two graphs broadly
confirms our intuition that a better transfer delay can be achieved at the expense of circuit utilization
and vice versa. The transfer delay can be halved with a circuit rate of 1 Gbps as compared to a 400
Mbps circuit, but at the cost of a 20% reduction in the bandwidth utilization (70% compared to
85%). Second, for circuit rates greater than the end host supportable data transfer rate, the circuit
Disk Write Rate (Average ± Std. Dev.) Mbps
Request Size (KB)   Normal I/O    Synchronous I/O   Direct I/O
8                   1956 ± 45.0   892 ± 63.8        679 ± 4.7
16                  2238 ± 32.0   919 ± 12.2        990 ± 5.9
32                  2333 ± 40.3   954 ± 54.2        893 ± 41.6
64                  2417 ± 41.7   926 ± 30.4        910 ± 79.3
128                 2456 ± 54.6   904 ± 55.8        955 ± 84.5
Table 5.1: xdd benchmark results on zelda4
Run #             1     2     3     4     5     6     7     8     9     10
Normal I/O        2315  2349  2341  2320  2273  2427  2324  2341  2298  2336
Synchronous I/O   1023  924   901   1010  945   922   933   1055  905   926
Direct I/O        886   935   933   922   855   928   930   817   867   862
Table 5.2: Disk write rate (Mbps) for individual runs using 32 KB request sizes
rate is no longer the bottleneck, therefore the transfer delay should remain more or less constant. In
the figure, the plot for transfer delay appears to flatten out for circuit rates greater than 700 Mbps.
Thus, from this figure, an optimum circuit rate value of 700 Mbps can be inferred.
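The trade-off can be captured in a simple model (an illustration, not the measured data): the achievable rate is the minimum of the circuit rate and the end-host-sustainable rate, so delay flattens once the circuit rate exceeds what the hosts can sustain, while utilization falls. The 700 Mbps host rate below is the value inferred from the figure.

```c
#include <assert.h>
#include <math.h>

/* Transfer delay (s): the effective rate is capped by whichever of the
 * circuit and the end hosts is the bottleneck. */
double transfer_delay(double bytes, double ckt_bps, double host_bps)
{
    double eff = ckt_bps < host_bps ? ckt_bps : host_bps;
    return bytes * 8.0 / eff;
}

/* Circuit utilization: achieved throughput divided by circuit rate. */
double utilization(double ckt_bps, double host_bps)
{
    double eff = ckt_bps < host_bps ? ckt_bps : host_bps;
    return eff / ckt_bps;
}
```

For the 1.6 GB file, the model gives 32 s at full utilization for a 400 Mbps circuit, versus about 18 s at 70% utilization for 1 Gbps, matching the shape (though not the exact utilization values) of Figure 5.2.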
Now we need to determine whether a disk I/O benchmark can be used to estimate this opti-
mum circuit rate before setting up the circuit. We use the xdd (version 6.3) disk I/O benchmark
program [32] to gather disk write rate estimates for three types of disk write operations: normal,
synchronous, and direct I/O. Normal I/O makes full use of the page cache, synchronous I/O ensures
that the data in the page cache and on the disk are synchronized, and direct I/O bypasses the page cache
entirely. For a more detailed explanation of how disk writes are handled in the Linux kernel see
Appendix A or [4]. The xdd program was used to write 1 GB files using different request sizes.
The file size divided by the request size determines the number of write operations (calls to the
write() system call) performed by xdd. In Table 5.1 the average disk write rate and its standard
deviation on the zelda4 host, for ten runs, using five request sizes are shown. In Table 5.2, the disk write rates
for individual runs, using a 32 KB request size, are shown. The operating system, file system and
amount of RAM on zelda4 were Linux 2.6.11, ext3 and 2 GB, respectively.
These results show that the disk write rate estimated by the benchmark is higher than the circuit
rate that was inferred from Figure 5.2. This is true for all three kinds of disk I/O. The disk write
rate achieved using normal I/O, e.g., is more than three times the 700 Mbps number we were
expecting. This difference is explained by the fact that when data is being received over a circuit
and then written to disk, the network processing also plays a part. With the disk I/O benchmark
experiments only the disk write path is being tested and the interaction between the disk I/O and
network processing is ignored. Such interactions might include, for instance, contention for the
memory bus (see Section 2.2.1), or scheduling of the disk write and the network handling kernel
code paths. Also notice the large values for the standard deviation of the disk write rates. Direct
I/O gives a stable value for the achieved rate when request sizes of 8 KB and 16 KB are used, but
the variability in write rate increases dramatically for larger request sizes. For more results using
the xdd benchmark and possible explanations, see Appendix A. Thus, finding the optimal circuit
rate is not as straightforward as estimating the disk write rate on the receiver. First, the estimate for
the disk write rate has a large variability. Second, there are factors that cannot be accounted for by
a benchmark that estimates disk I/O performance.
5.2 Setting up the Circuit
Network resources need to be reserved before data transfer can begin. The actual circuit set-up
is handled by the RSVP-TE client implementation. The current implementation of the RSVP-TE
client [41] is in user space. It provides an API through a library as well as an application program
called bwrequestor (for bandwidth requestor) that can be called from the command-line to set up a
circuit. An application that wants to use a dedicated circuit would make API calls to the RSVP-TE
library. Occasionally, though, it may not be possible to modify an application to make such calls.
The bwrequestor can be used to set up a circuit in such cases. C-TCP interacts with the signaling
module differently in these two cases.
For applications that wish to use C-TCP over a dedicated circuit, and can be modified to do
so, an API is provided that closely matches TCP’s sockets API. The C-TCP API modifies only the
connect() and accept() calls from the TCP API. The C-TCP calls are actually wrappers around the
TCP calls. Thus, a system call syscall() is wrapped into the C-TCP specific syscallCTCP() that is
implemented like a regular user space function:
syscallCTCP() {
/* C-TCP specific operations */
...
syscall();
/* C-TCP specific operations */
...
}
There are two reasons for choosing this approach. First, the RSVP-TE signaling library that is used by
C-TCP is implemented in user space, and therefore calls to these functions have to originate from a
user space program. Second, applications must still be able to make calls to syscall() directly.
The C-TCP API provides four calls that differ from the TCP sockets API. The connect() call’s
functionality is provided by CTCP_sender_connect() and CTCP_receiver_connect(), and the
accept() call’s functionality is provided by CTCP_sender_accept() and CTCP_receiver_accept().
The names of the API calls indicate whether they are invoked by the data sender or receiver. In any data
transfer application using TCP, there will be a separation between the end that invokes connect() and
the one that invokes listen()/accept(). Because of the similarity between initiating a TCP connection
using connect() and setting up a circuit, C-TCP initiates circuit set up from the end that invokes the
C-TCP equivalent of connect(). The end initiating circuit set up needs to know the circuit rate; since
the circuit rate is determined by the receiving end host’s capability, C-TCP needs to know which
end is the receiver. One way to do this would be to impose strict constraints on who (sender or
receiver) can invoke which call (connect or accept). For instance, only the data sender could be
allowed to invoke the connect() call. Instead, since the application knows whether it is the sender
or receiver, we take the approach of providing different calls for the sender and receiver, leaving it
to the application programmer to make the appropriate call. Within the wrapper calls, two kinds of
C-TCP specific operations are performed. The first is circuit set up using the RSVP-TE library and
the second is setting up the C-TCP parameters (see Section 4.4.2) using the Web100 API.
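The sender-side wrapper can be sketched as below. This is a hypothetical illustration: rsvp_setup_circuit() and ctcp_set_params() are stand-ins for the user-space RSVP-TE library call and the Web100-based parameter setup (their names and signatures are assumptions, not the actual CHEETAH API), and connect() is replaced by a stub so the sketch runs anywhere.

```c
#include <assert.h>
#include <stddef.h>

static int circuit_rate_mbps = 0;   /* records the rate of the "circuit" */

/* Stand-in for the user-space RSVP-TE library call (hypothetical name). */
static int rsvp_setup_circuit(const char *dst, int rate_mbps)
{
    (void)dst;
    circuit_rate_mbps = rate_mbps;  /* pretend signaling succeeded */
    return 0;
}

/* Stand-in for setting the C-TCP socket parameters via Web100
 * (hypothetical name). */
static int ctcp_set_params(int sockfd, int rate_mbps)
{
    (void)sockfd;
    return rate_mbps > 0 ? 0 : -1;
}

/* Stand-in for connect(2) so the sketch needs no network. */
static int stub_connect(int sockfd)
{
    (void)sockfd;
    return 0;
}

/* The wrapper pattern from the text: C-TCP control-plane work before,
 * and parameter setup after, the ordinary connect(). */
int CTCP_sender_connect(int sockfd, const char *dst, int rate_mbps)
{
    if (rsvp_setup_circuit(dst, rate_mbps) != 0)
        return -1;                  /* circuit set-up failed */
    if (stub_connect(sockfd) != 0)
        return -1;                  /* TCP three-way handshake failed */
    return ctcp_set_params(sockfd, rate_mbps);
}
```

An application would call CTCP_sender_connect() where it previously called connect(); the receiver-side and accept()-side wrappers follow the same pattern.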
There may be some applications which cannot be modified to use the C-TCP API. To allow
such applications to use CHEETAH circuits, we use the following scheme. The bwrequestor is
used to set up a circuit. Because the C-TCP parameters useckt and ncap are actually fields in the
data structure associated with a TCP socket, they cannot be set until the TCP socket is created. We
have implemented a C-TCP Work Around Daemon (CTCP-WAD, the name is borrowed from a
similar daemon implemented in the Web100 project). The bwrequestor, after successfully setting
up a circuit, informs the CTCP-WAD about the circuit rate and the source and destination IP ad-
dresses of the circuit. The CTCP-WAD then periodically checks for TCP sockets which match this
source/destination IP address pair and when such a TCP socket is found, sets the C-TCP parameters
for that socket.
Chapter 6
CONCLUSIONS
In this chapter we summarize the key lessons learned from our efforts. A few ideas for future
work are also presented.
6.1 Conclusions
6.1.1 Transport Protocol Design for Dedicated Circuits
• The transport protocol can be split into control and data plane components. The control plane
functions are selecting the rate of the circuit to set up for a particular data transfer and setting
up the circuit.
• The data plane component of the transport protocol should be designed such that a data
transfer makes the best possible use of the reserved network resources. Ideally, this can be
achieved by matching the data transfer rate and the reserved circuit rate.
• In addition to the high utilization goal, the transport protocol should maximize the data trans-
fer rate in order to reduce the end-to-end delay.
• In practice, it is hard to match the data transfer rate and the reserved circuit rate exactly,
throughout the data transfer. Therefore, the protocol should try to avoid situations that cause
the data transfer rate to deviate from the circuit rate.
• Flow control is required because of the lack of resource reservation on the end hosts.
6.1.2 Transport Protocol Implementation
• We implemented Fixed Rate Transport Protocol (FRTP), a user-space, UDP-based protocol,
by modifying SABUL. The advantage of a user-space implementation is ease of deployment.
The shortcomings of FRTP were the need for busy waiting to maintain a fixed data transfer
rate, and the lack of any form of flow control.
• The need for more support from the kernel, especially for flow control, led to the implementa-
tion of Circuit-TCP (C-TCP). C-TCP has been implemented by modifying the TCP/IP stack
in Linux. It maintains a fixed amount of outstanding data in the network and uses TCP’s self
clocking property to maintain a fixed data transfer rate that is closely matched to the reserved
circuit rate.
• Results of experiments on the CHEETAH testbed show that C-TCP is better suited for use
over dedicated circuits compared to TCP. Data transfer time is reduced because C-TCP dis-
ables slow start. In addition, because C-TCP maintains a fixed amount of outstanding data
in the network, buffers in the network nodes do not fill up. Thus, C-TCP obtains a steady
throughput in sustained data transfers.
• There exists a mismatch between the network, in which resources can be reserved, and the
end hosts’ operating systems, which do not support resource reservation. This makes it difficult
to select a circuit rate such that both circuit utilization and data transfer delay are optimized.
6.2 Future Work
The control plane problem of determining an optimal circuit rate has to be addressed. It is difficult to
model all the factors that play a part in determining the optimal rate. Therefore, a possible approach
could be to use the throughput values of past data transfers to get better and better estimates of the
sustainable data transfer rate. Estimates would need to be maintained for each sending-receiving
end host pair, since end-host factors have the biggest impact on the sustainable data transfer rate
once a circuit is established.
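One way to realize such a history-based estimate is an exponentially weighted moving average of per-transfer throughput, kept per sender/receiver host pair. The sketch below is one possible scheme, not a design from this work; the 0.25 gain is an arbitrary illustrative choice.

```c
#include <assert.h>
#include <math.h>

/* Blend the previous rate estimate with the throughput achieved by the
 * latest transfer; `gain` in (0,1] weights the new sample. */
double ewma_update(double estimate_mbps, double sample_mbps, double gain)
{
    return (1.0 - gain) * estimate_mbps + gain * sample_mbps;
}
```

For example, after a transfer that achieved only 500 Mbps against a 700 Mbps estimate, a gain of 0.25 moves the per-host-pair estimate down to 650 Mbps, so outliers shift the requested circuit rate gradually rather than abruptly.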
Another direction for future work would be to study the impact of multitasking on C-TCP’s
ability to maintain a steady data transfer rate. It might be necessary to use an operating system
with real-time support. For instance, QNX is a commercial real-time operating system (RTOS).
Enhancements to the Linux kernel to provide real-time support, such as RTLinux and Real Time
Application Interface (RTAI) [31], are also available. The advantages of using an RTOS must be
weighed against the costs of requiring a specialized operating system.
Appendix A
DISK WRITE RATE ESTIMATION
High utilization of a dedicated circuit for file transfers requires that the sending end keep the circuit
full and the receiving end be able to receive data at the circuit rate. Since file transfers involve
writing data to disk, a likely bottleneck is the rate at which data can be written to the disk, thus
making it crucial to estimate the disk write rate accurately. To do this we used xdd, a disk I/O
benchmark program. Results for the zelda4 host were presented in Section 5.1 and we pointed out
the problems involved in using such results to get a reliable estimate for the disk write rate. In this
appendix we present some background on how disk writes are handled in Linux. Benchmark results
for end hosts other than zelda4 are presented to show the effect of other factors.
A.1 How Linux Handles Disk Writes
The storage space on hard disks is divided into 512-byte sectors [4]. A sector is the minimum
amount of data that can be transferred from or to a disk. Device drivers in the operating system
kernel request data transfer in the size of blocks. Blocks consist of a set of adjacent bytes on the
disk, with the constraints that the block size (in bytes) should be a power of two, a multiple of the
sector size and smaller than the memory page size. Since the page size is typically 4096 bytes,
block sizes of 1024 and 4096 bytes are commonly used. Each block on disk requires a buffer in
memory that is used by the kernel to store the block’s contents. There exists a buffer head data
structure, corresponding to each buffer, that contains information such as the memory address of
the buffer, the major and minor numbers of the device on which the block resides,1 the logical block
number etc.
Device drivers access blocks on the disk. Since physically moving the disk head to the area on
the disk where a block resides is expensive, the device driver tries to cluster together accesses to
adjacent blocks. To help with this clustering, device drivers are implemented with a high-level and
a low-level component. The high-level component forms requests for disk blocks. These requests
for individual blocks can be merged with existing requests for adjacent blocks. The kernel tries
to delay fulfilling disk access requests in order to allow more requests to be merged. Each device
driver has a list of request descriptors for requests that are pending. The low-level component of
a device driver goes through this list of requests and programs the disk controller to transfer each
block in each request.
Although every disk access has to go through the high-level device driver, and thus has to be
for disk blocks, it is much more convenient to access files in terms of virtual memory pages. Data
transfer for block devices can be classified into block I/O and page I/O. Block I/O operations transfer
a single block of data from disk that can be stored in a buffer in memory. Operations that access
filesystem metadata, such as inodes, use block I/O. Page I/O operations transfer a page’s worth of
blocks from the disk. These are invoked when accessing files (either regular files or device files like
/dev/hda). Although the data transferred in a page I/O operation is contiguous in the file of which it
is a part, the blocks comprising that page need not be contiguous on disk. For instance, with a page
size of 4096 bytes and a block size of 1024 bytes, a single page I/O operation might require that the
high-level device driver form four separate requests (recall that each request is for block(s) of data
that are contiguous on the disk).
To reduce disk accesses, and thus improve system performance, Linux uses disk caches. In [4],
a disk cache is defined as “a software mechanism that allows the system to keep in RAM some data
that is normally stored on disk, so that further accesses to that data can be satisfied quickly without
accessing the disk”. Two main disk caches are used in Linux, buffer cache and page cache. The
1 The major and minor numbers of a device are used to differentiate between devices. The major number identifies the device type, e.g., IDE disk, and the minor number identifies a particular instance of that device, e.g., the third partition on the second IDE disk.
buffer cache improves the performance of block I/O operations. The page cache consists of pages
that are involved in page I/O operations.
Consider a simplified version of how a write() system call to a file might proceed:
1. The arguments passed to the system call are filp, a pointer to the file to write to, buf, a pointer
to the user space memory from which to write, count, the number of bytes to write and ppos,
the offset in the file at which to write.
2. To quickly find out whether a particular portion of a file is in the page cache, a hash table is
maintained.
3. For each page of data to be written to the file on disk, the hash table for that file is consulted.
If the page is not in the page cache a new page is allocated.
4. If the page’s buffers do not already have corresponding buffer heads, they are allocated.
5. Data is copied from the user space buf to the page in the page cache.
6. The buffers which have been written to are marked as being dirty.
7. The responsibility of actually writing dirty buffers to the disk falls to a couple of kernel
threads: bdflush and kupdate. The bdflush thread is woken up when the proportion of page
buffers that are dirty exceeds a certain value. Each time a buffer is marked dirty the proportion
of dirty buffers is checked and if necessary bdflush is woken up. The kupdate thread helps
to ensure that a buffer that has been modified does not hang around in the disk cache for too
long. It is invoked periodically and flushes to disk all buffers that have been dirty for longer
than a threshold value.
These steps are applicable for a normal write to disk. Thus, if the amount of data written by
a process is small enough to fit entirely in the page cache without invoking bdflush or kupdate
immediately, then the writes can appear to be very fast. This is the common case scenario and
the page cache serves to improve its performance. A slightly different type of disk write is the
synchronous write. A process can ensure that all modified blocks are written to disk by using
system calls such as fsync() or sync() to synchronize the contents of the disk cache and the disk.
In addition, Linux also supports direct I/O that is useful for applications that do application-level
caching (e.g. database applications). For such applications, additional caching in the page cache
is not useful and might, in fact, slow things down because of the overhead of maintaining the page
cache data structures. Direct I/O can be requested by specifying a flag when a file is open()ed. For
direct I/O transfers, the kernel allocates special buffers that are visible from the user and kernel
space, so that no copying is necessary.
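The three write modes compared in Section 5.1 can be exercised directly with the standard POSIX/Linux calls. The sketch below is an illustration, not the xdd implementation: a 512-byte-aligned buffer is used because O_DIRECT requires the buffer, offset and length to be suitably aligned (512 bytes matches the sector size described above).

```c
#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `len` bytes to `path` with the given extra open(2) flags:
 * 0 for a normal write, O_SYNC for synchronous I/O, O_DIRECT to bypass
 * the page cache.  Returns the number of bytes written, or -1 on error. */
ssize_t write_block(const char *path, int extra_flags, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
    if (fd < 0)
        return -1;
    void *buf;
    if (posix_memalign(&buf, 512, len) != 0) {  /* aligned for O_DIRECT */
        close(fd);
        return -1;
    }
    memset(buf, 'x', len);
    ssize_t n = write(fd, buf, len);  /* normally only dirties the page cache */
    fsync(fd);     /* force dirty buffers to disk, like a synchronous write */
    free(buf);
    close(fd);
    unlink(path);  /* delete the test file afterwards, like xdd -deletefile */
    return n;
}
```

write_block(path, 0, 4096) exercises a normal write; passing O_SYNC makes the write synchronous, and O_DIRECT bypasses the page cache entirely (and may fail on filesystems that do not support it, such as tmpfs).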
A.2 Benchmark Results
In this section we present the results of running the xdd disk I/O benchmark. After obtaining the
source code, compiling and installing xdd, it can be invoked from the command-line in the follow-
ing manner:
xdd.linuxsmp -op write -deletefile -targets 1 <filename> -reqsize <request size> -mbytes <file size> -output <output file>
xdd is compiled for a Symmetric Multiprocessing (SMP) Linux machine.
-op specifies the operation to perform, read or write.
-deletefile instructs that the file written should be deleted once the transfer is over.
-targets 1 <filename> specifies the number and name(s) of the file(s) to be written.
-reqsize <request size> is the size of each I/O operation in KB.
-mbytes <file size> specifies the total size of the file to be written (MB).
-output <output file> specifies that xdd’s output statistics should be written to a file.
Four machines were used for the tests, in addition to zelda4, results for which were presented in
Section 5.1. The names and relevant configuration details for the four machines are shown in Ta-
ble A.1. Two different kernel versions were used on zelda3 to see the effect of the kernel on disk
write performance.
The average disk write rate and standard deviation of ten runs (in Mbps), for the five configura-
tions are shown in Table A.2. We can make the following observations from these results:
Host name   OS/kernel      File system   RAM (GB)
zelda1      Linux/2.4.21   ext3          2
zelda2      Linux/2.4.21   ext3          2
zelda3      Linux/2.4.21   ext3          2
zelda3      Linux/2.6.11   ext3          2
mvstu6      Linux/2.4.21   ext3          1
Table A.1: End host configurations
Request size (KB)   Normal I/O     Synchronous I/O   Direct I/O
zelda1
8                   1048 ± 85.7    794 ± 34.9        624 ± 2.5
16                  1034 ± 33.9    787 ± 58.7        882 ± 4.9
32                  1125 ± 82.0    788 ± 54.3        817 ± 49.1
64                  1079 ± 64.6    782 ± 27.7        884 ± 14.4
128                 1158 ± 92.5    800 ± 58.8        879 ± 86.9
zelda2
8                   979 ± 57.3     759 ± 42.1        623 ± 1.6
16                  1041 ± 41.5    759 ± 51.8        880 ± 5.5
32                  1109 ± 77.9    745 ± 22.7        805 ± 14.3
64                  1085 ± 76.4    747 ± 24.4        862 ± 42.8
128                 1121 ± 94.5    767 ± 15.1        846 ± 71.9
zelda3 (2.4.21)
8                   1115 ± 64.4    875 ± 69.0        626 ± 2.6
16                  1208 ± 46.0    925 ± 60.5        888 ± 3.4
32                  1233 ± 57.1    911 ± 44.4        984 ± 86.5
64                  1241 ± 50.3    918 ± 56.3        972 ± 91.0
128                 1239 ± 87.7    893 ± 50.0        987 ± 17.9
zelda3 (2.6.11)
8                   2062 ± 186.4   1000 ± 68.3       670 ± 11.9
16                  2341 ± 18.6    988 ± 90.5        965 ± 23.8
32                  2397 ± 46.6    956 ± 86.5        999 ± 89.4
64                  2451 ± 34.7    995 ± 77.6        978 ± 45.9
128                 2508 ± 51.8    984 ± 89.2        970 ± 98.5
mvstu6
8                   815 ± 89.6     614 ± 35.2        633 ± 2.2
16                  858 ± 76.4     649 ± 24.6        895 ± 7.9
32                  880 ± 117.4    637 ± 38.1        1112 ± 74.5
64                  883 ± 53.3     656 ± 36.8        1054 ± 83.8
128                 884 ± 94.8     647 ± 41.1        1018 ± 11.3
Table A.2: Disk write rate results using xdd
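Each entry in Table A.2 is the average and standard deviation over ten runs. A minimal sketch of that summarization step is shown below; the ten rates listed are hypothetical values for illustration, not the measured results.

```python
# Summarize repeated xdd throughput measurements as "mean +/- stdev",
# in the style of Table A.2. The ten rates below are illustrative only.
from statistics import mean, stdev

def summarize(rates_mbps):
    """Return the average and sample standard deviation of a list of rates."""
    return mean(rates_mbps), stdev(rates_mbps)

runs = [1048, 1102, 995, 1060, 1033, 1111, 980, 1072, 1049, 1025]
avg, sd = summarize(runs)
print(f"{avg:.0f} +/- {sd:.1f} Mbps")
```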
• The disk write rate achieved using normal I/O is higher than that achieved using synchronous
I/O. This is because a synchronous I/O write has the extra overhead of writing all dirty buffers
to disk.
• Direct I/O transfers seem to be the most affected by the request size. The general trend is that
the rate achieved using direct I/O increases as the request size increases from 8 to 32 KB, and
then either levels off or decreases slightly for request sizes of 64 and 128 KB. When using
direct I/O, data is written to the disk in request size chunks. As the request size increases,
each disk access transfers a larger amount of data. The dip in disk write rate for request sizes
above 32 KB is probably because, for larger requests, the likelihood that all blocks in the
request are close together on disk decreases. Thus, the disk head requires more seek time to
complete the transfer.
• Next, consider the standard deviation values. The standard deviations are high for most
combinations of disk I/O type and request size, usually greater than 40 Mbps. The exceptions
are the direct I/O transfers using small request sizes.
• For direct I/O transfers, the variability in achieved rate increases as the request size increases.
For request sizes of 8 and 16 KB, the standard deviation for direct I/O transfers is often
smaller than 5 Mbps. But for request sizes larger than 16 KB the standard deviation increases
dramatically. The increase in seeks for larger request sizes also explains the increased
variability of direct I/O, since the seek time depends on the position of the data on the disk.
• To see which type of I/O transfer is most affected by the request size, we can calculate the
standard deviation of the average disk write rates obtained using different request sizes. For
instance, the standard deviations of the average rates for zelda1 are 52, 7, and 111 Mbps for
normal, synchronous and direct I/O, respectively. In general, synchronous I/O appears to be
the least affected by request size value and direct I/O the most affected.
• Comparison of results for different kernel versions on zelda3 shows that normal I/O
performance improves almost two-fold, for all request sizes, on the 2.6.11 kernel. The disk
scheduling algorithm, which decides the order in which requests are sent to the disk controller,
has been improved in the new kernel. This is most likely the reason for the improvement in
performance of the normal I/O transfers.
• There is a slight improvement in the performance of synchronous I/O, but no noticeable
change in direct I/O performance when using the newer kernel. Synchronous and, especially,
direct I/O performance depends more on the capability of the disk itself than on the kernel's
disk scheduling algorithm.
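The I/O modes compared in this appendix correspond to flags passed at file-open time: normal I/O uses the default buffered path, while one common way to request synchronous writes is the O_SYNC open flag (whether xdd uses exactly this mechanism is an assumption here, not stated in the text). Direct I/O (O_DIRECT) additionally requires block-aligned buffers and filesystem support, so it is omitted from this minimal sketch; the paths and sizes are illustrative.

```python
# Contrast a normal buffered write with a synchronous (O_SYNC) write.
# With O_SYNC, each write() returns only after the data reaches the disk,
# which is the extra overhead discussed above.
import os
import tempfile

def write_file(path, data, sync=False):
    """Write data to path, optionally forcing synchronous I/O."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if sync:
        flags |= os.O_SYNC  # flush to disk on every write
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    payload = b"x" * (64 * 1024)  # one 64 KB request, as in Table A.2
    write_file(os.path.join(d, "normal.dat"), payload)
    write_file(os.path.join(d, "sync.dat"), payload, sync=True)
```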
Bibliography
[1] ALLMAN, M., PAXSON, V., AND STEVENS, W. TCP Congestion Control. RFC 2581 (Apr.
1999).
[2] ANDERSON, D. P., COBB, J., KORPELA, E., LEBOFSKY, M., AND WERTHIMER, D.
SETI@home: An Experiment in Public-Resource Computing. Communications of the ACM
(Nov. 2002).
[3] BERTSEKAS, D., AND GALLAGER, R. Data Networks (2nd ed.). Prentice-Hall, Inc., Upper
Saddle River, NJ, USA, 1992.
[4] BOVET, D. P., AND CESATI, M. Understanding the Linux Kernel, 2nd Edition. O’Reilly
Media, Inc., 2002.
[5] BRADEN, B., CLARK, D., CROWCROFT, J., DAVIE, B., DEERING, S., ESTRIN, D., FLOYD,
S., JACOBSON, V., MINSHALL, G., PARTRIDGE, C., PETERSON, L., RAMAKRISHNAN, K.,
SHENKER, S., WROCLAWSKI, J., AND ZHANG, L. Recommendations on Queue Manage-
ment and Congestion Avoidance in the Internet. RFC 2309 (Apr. 1998).
[6] CHASE, J. S., GALLATIN, A. J., AND YOCUM, K. G. End-System Optimizations for High-
Speed TCP. IEEE Communications Magazine (Apr. 2001).
[7] CHU, H. K. J. Zero-Copy TCP in Solaris. Proc. USENIX Annual Technical Conference, San
Diego, CA (Jan. 1996).
[8] CLARK, D. D. Window and Acknowledgement Strategy in TCP. RFC 813 (July 1982).
[9] CLARK, D. D., JACOBSON, V., ROMKEY, J., AND SALWEN, H. An Analysis of TCP Pro-
cessing Overhead. IEEE Communications Magazine (June 1989).
[10] CLARK, D. D., LAMBERT, M. L., AND ZHANG, L. NETBLT: A Bulk Data Transfer Proto-
col. RFC 998 (Mar. 1987).
[11] CSMA/CD Access Method. IEEE 802.3 standard, section 2, appendix 31B.
[12] DRAGON: Dynamic Resource Allocation via GMPLS Optical Networks.
http://dragon.east.isi.edu.
[13] FALL, K., AND FLOYD, S. Simulation-based Comparisons of Tahoe, Reno, and SACK TCP.
ACM Computer Communication Review (July 1996).
[14] FLOYD, S. HighSpeed TCP for Large Congestion Windows. RFC 3649 (Dec. 2003).
[15] FLOYD, S., AND JACOBSON, V. Random Early Detection Gateways for Congestion Avoid-
ance. IEEE/ACM Transactions on Networking (Aug. 1993).
[16] FLOYD, S., MAHDAVI, J., MATHIS, M., AND PODOLSKY, M. An Extension to the Selective
Acknowledgement (SACK) Option for TCP. RFC 2883 (July 2000).
[17] FOONG, A. P., HUFF, T. R., HUM, H. H., PATWARDHAN, J. P., AND REGNIER, G. J. TCP
Performance Re-Visited. Proc. International Symposium on Performance Analysis of Systems
and Software, Austin, TX (Mar. 2003).
[18] GU, Y., AND GROSSMAN, R. L. SABUL: A Transport Protocol for Grid Computing. Journal
of Grid Computing (Dec. 2003).
[19] GU, Y., AND GROSSMAN, R. L. UDT: An Application Level Transport Protocol for Grid
Computing. 2nd International Workshop on Protocols for Long-Distance Networks (PFLDNet)
(Feb. 2004).
[20] HE, E., LEIGH, J., YU, O., AND DEFANTI, T. A. Reliable Blast UDP: Predictable High
Performance Bulk Data Transfer. Proc. IEEE Cluster Computing (Sept. 2003).
[21] JACOBSON, V. Congestion Avoidance and Control. ACM Computer Communication Review
(Aug. 1988).
[22] JACOBSON, V., BRADEN, R., AND BORMAN, D. TCP Extensions for High Performance.
RFC 1323 (May 1992).
[23] JIN, C., WEI, D. X., AND LOW, S. H. FAST TCP: Motivation, Architecture, Algorithms,
Performance. IEEE Infocom (Mar. 2004).
[24] KATABI, D., HANDLEY, M., AND ROHRS, C. Congestion Control for High Bandwidth-
Delay Product Networks. ACM Sigcomm (Aug. 2002).
[25] KELLY, T. Scalable TCP: Improving Performance in Highspeed Wide Area Networks. ACM
SIGCOMM Computer Communication Review (Apr. 2003).
[26] MATHIS, M., HEFFNER, J., AND REDDY, R. Web100: Extended TCP Instrumentation. ACM
Communications Review (July 2003).
[27] MATHIS, M., MAHDAVI, J., FLOYD, S., AND ROMANOW, A. TCP Selective Acknowledge-
ment Options. RFC 2018 (Apr. 1996).
[28] PIETIKAINEN, P. Hardware-assisted Networking Using Scheduled Transfer Protocol on
Linux. Diploma thesis, Dept. of Electrical Engineering, University of Oulu, Oulu, Finland
(Sept. 2001).
[29] POSTEL, J. Transmission Control Protocol. RFC 793 (Sept. 1981).
[30] RAMAKRISHNAN, K., FLOYD, S., AND BLACK, D. The Addition of Explicit Congestion
Notification (ECN) to IP. RFC 3168 (Sept. 2001).
[31] RTAI: Real Time Application Interface. http://www.rtai.org.
[32] RUWART, T. M. Xdd User’s Guide. http://www.ioperformance.com/xddversions/xdd6.3.pdf,
Jan. 2005.
[33] STEENKISTE, P. A Systematic Approach to Host Interface Design for High-Speed Networks.
IEEE Computer (Mar. 1994).
[34] STEVENS, W. R. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley Professional,
Boston, MA, USA, 1993.
[35] TIRUMALA, A., QIN, F., DUGAN, J., FERGUSON, J., AND GIBBS, K. Iperf Version 1.7.0.
http://dast.nlanr.net/Projects/Iperf/, Mar. 2003.
[36] DOE UltraScience Net: Experimental Ultra-Scale Network Testbed for Large-Scale Science.
http://www.csm.ornl.gov/ultranet/.
[37] VEERARAGHAVAN, M., ZHENG, X., LEE, H., GARDNER, M., AND FENG, W. CHEETAH:
Circuit-switched High-speed End-to-End Transport Architecture. Proc. of Opticomm (Oct.
2003).
[38] WU, Q., AND RAO, N. S. V. Protocol for High-Speed Data Transfer Over Dedicated
Channels. 3rd International Workshop on Protocols for Long-Distance Networks (PFLDNet)
(Feb. 2005).
[39] XU, L., HARFOUSH, K., AND RHEE, I. Binary Increase Congestion Control for Fast Long-
Distance Networks. IEEE Infocom (Mar. 2004).
[40] ZHENG, X., MUDAMBI, A. P., AND VEERARAGHAVAN, M. FRTP: Fixed Rate Transport
Protocol, a Modified Version of SABUL for End-to-End Circuits. First IEEE Workshop on
Provisioning and Transport for Hybrid Networks (PATHNets), San Jose, CA (Sept. 2004).
[41] ZHU, X., ZHENG, X., VEERARAGHAVAN, M., LI, Z., SONG, Q., HABIB, I., AND RAO,
N. S. V. Implementation of a GMPLS-based Network with End Host Initiated Signaling.
submitted to IEEE ICC (June 2006).