Performance Modeling and Evaluation of
Network Processors
A Thesis
Submitted For the Degree of
Master of Science (Engineering)
in the Faculty of Engineering
by
Govind S.
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore – 560 012
DECEMBER 2006
To my mother and father
Acknowledgments
I express my sincere gratitude towards Prof. R. Govindarajan, my research supervisor, for his invaluable support and guidance. I sincerely thank him for being very patient and for being available despite his busy schedule. I am also extremely grateful to Prof. Joy Kuri for providing invaluable guidance and suggestions, especially in the networking part of this thesis. I am also thankful to Prof. W. M. Zuberek, Professor at the University of Newfoundland, Canada, for allowing me to use CNET. I also thank the Chairman of SERC for providing the excellent laboratory facilities and a wonderful atmosphere in the department.
I also thank:
Kaushik Rajan for being involved in a large number of interesting discussions ranging from network processors to controversial cricket and uncontroversial tennis greats. Shyam for providing subsidized, and in most cases free, weekend lunch and dinner. Gvsk for helping me get familiarized with CNET and for tolerating our eccentricities. Rajesh for allowing me to work on "my" machine. Manikantan for wonderful discussions ranging over cricket, tennis, politics, and football. The Soccerpulse community for sharing wonderful soccer videos. The organizers of the FIFA World Cup, Champions League, and EURO for providing the "Jogo Bonito" moments.
I would like to thank Him, The Almighty, for providing me with this rare opportunity. Last but not least, I thank all my family members for providing the moral support and for being patient through the course of this work.
Abstract
In recent years there has been an exponential growth in Internet traffic, resulting in increased network bandwidth requirements which, in turn, have led to stringent processing requirements on network layer devices like routers. Present backbone routers on OC-48 links (2.5 Gbps) have to process four million minimum-sized packets per second. Further, the functionality supported in network devices is also on the increase, leading to programmable processors such as Intel's IXP, Motorola's C-5, and IBM's PowerNP. These processors employ multiple processing engines and multiple threads to exploit the packet-level parallelism inherent in network workloads.
This thesis studies the performance of network processors. We develop a Petri net model for commercial network processors (Intel IXP 2400/2850) for three different applications, viz., IPv4 forwarding, Network Address Translation, and the IP security protocols. A salient feature of the Petri net model is its ability to model the application, the architecture, and their interaction in great detail. The model is validated using the Intel proprietary tool (SDK 3.51 for the IXP architecture) over a range of configurations. Our performance evaluation results indicate that

1. The IXP processor is able to support a throughput of 2.5 Gbps for all modeled applications.

2. Packet buffer memory (DRAM) is the bottleneck resource in a network processor, and even multithreading is ineffective beyond a total of 16 threads for header processing applications and beyond 32 threads for payload processing applications.
Since DRAM is the bottleneck resource, we explore the benefits of increasing the number of DRAM banks as well as software schemes like offloading the packet header to SRAM.
The second part of the thesis studies the impact of parallel processing in a network processor on packet reordering and retransmission. Our results indicate that the concurrent processing of packets in a network processor, together with the buffer allocation scheme in the TFIFO, leads to significant packet reordering (61%) on a 10-hop network (with packet sizes of 64 B), which in turn leads to 76% retransmission under the TCP fast-retransmission algorithm. We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global, which reduce packet retransmission to 24%. Our performance results also indicate that limiting the number of microengines can reduce the extent of packet reordering while providing the same throughput. We propose an alternative scheme, Packetsort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, we observe that Packetsort outperforms, by up to 35%, the built-in schemes in the IXP processor, namely Inter Thread Signaling (ITS) and Asynchronous Insert and Synchronous Remove (AISR).
The final part of this thesis investigates the performance of the network processor under bursty traffic. We model bursty traffic using a Pareto distribution. We consider parallel and pipelined buffering schemes and their impact on packet drop under bursty traffic. Our results indicate that the pipelined buffering scheme outperforms the parallel scheme.
Contents
Acknowledgments i
Abstract ii
1 Introduction 1
1.1 Network Processors . . . 1
1.2 Our Contribution . . . 3
1.2.1 Performance Evaluation and Architecture Exploration . . . 3
1.2.2 Impact of Packet Reordering . . . 4
1.2.3 Performance under Bursty Traffic . . . 6
1.3 Organization of the Thesis . . . 6

2 Background 8
2.1 Network Processors: An overview . . . 8
2.1.1 IXP Architecture . . . 9
2.1.2 Motorola C-5 Processor . . . 13
2.1.3 IBM PowerNP Network Processor . . . 14
2.2 Network Applications . . . 14
2.2.1 IP Forwarding . . . 15
2.2.2 Network Address Translation . . . 15
2.2.3 IP Security . . . 16
2.3 Petri Nets: An Introduction . . . 16

3 Performance Modeling and Evaluation 20
3.1 Introduction . . . 20
3.2 A Single Microengine Petri Net Model . . . 21
3.2.1 Multiple Microengine Petri Net Model . . . 24
3.2.2 Memory Modeling . . . 24
3.3 Performance Evaluation of IXP . . . 26
3.3.1 Simulation Methodology . . . 27
3.3.2 Validation Results . . . 28
3.3.3 Throughput . . . 31
3.3.4 Architecture Exploration . . . 35
3.3.5 Summary . . . 40

4 Packet Reordering in Network Processors 43
4.1 Introduction . . . 43
4.2 Packet Reordering . . . 44
4.2.1 Reordering in Network Processors . . . 45
4.2.2 Transmit Buffer Induced Reordering . . . 46
4.2.3 Packet Ordering Mechanisms in IXP . . . 47
4.2.4 Performance Metric . . . 50
4.3 Packet Reordering in IXP . . . 50
4.3.1 Petri Net Model . . . 50
4.3.2 Validation . . . 52
4.3.3 Performance Results . . . 53
4.4 Reducing Packet Reordering . . . 54
4.4.1 Buffer Allocation Schemes . . . 54
4.4.2 Tuning Architecture Parameters . . . 58
4.4.3 Packet Sort: An Alternative Scheme . . . 60
4.5 Summary . . . 64

5 Performance Analysis of Network Processor in Bursty Traffic 66
5.1 Motivation . . . 67
5.2 Generation of Bursty Traffic . . . 68
5.3 Petri net Model of the Traffic Generator . . . 70
5.4 Packet Buffering Schemes . . . 71
5.5 Results . . . 72
5.5.1 Impact of Packet Buffering . . . 72
5.6 Summary . . . 75

6 Related Work 77
6.1 Network Processor Performance Evaluation . . . 77
6.2 Packet Reordering in Network Processors . . . 79

7 Conclusions 82
7.1 Summary . . . 82
7.2 Future Directions . . . 85

Bibliography 86
List of Figures
2.1 Internal IXP 2400 Architecture . . . 9
2.2 Microengine - Memory Unit Interface in IXP 2400 . . . 12
2.3 Packet flow in the IXP 2400 . . . 13
2.4 Petri Net Example . . . 18
3.1 Petri Net Model for a Single Microengine in IXP 2400 Running IPv4 Application . . . 22
3.2 Petri Net model for Memory Access in DDR DRAM . . . 25
3.3 Petri Net model for Memory Access in Rambus DRAM . . . 26
3.4 Transmit Rates from PN and SDK Simulations . . . 30
3.5 Microengine Utilization from PN and SDK Simulations . . . 31
3.6 DRAM Utilization for Different Bank Probabilities . . . 32
3.7 Average Microengine Queue Length for Different Bank Probabilities . . . 34
3.8 Impact of Number of DRAM Banks . . . 36
3.9 Impact of Number of Hash Units . . . 37
3.10 Performance Enhancements from Storing Packet Header in SRAM for IPv4 . . . 38
3.11 Performance Enhancements from Storing Packet Header in SRAM for NAT . . . 39
3.12 Impact of Limiting Pending DRAM Accesses per Microengine . . . 40
4.1 Packet Reordering in Network Processors . . . 44
4.2 Transmit Buffer Reordering . . . 46
4.3 Inter Thread Signaling in the IXP . . . 48
4.4 Asynchronous Insert Synchronous Reset (AISR) in the IXP . . . 49
4.5 Simulated Network Topology . . . 51
4.6 Packet Reordering in NP . . . 53
4.7 Different Transmit Buffer Allocation Schemes . . . 54
4.8 Impact of Various Buffer Allocation Schemes (64B Packet Size) - CNET Result . . . 56
4.9 Impact of Various Buffer Allocation Schemes (512B Packet Size) - CNET Result . . . 56
4.10 Impact of Number of Microengines (64B Packet Size) - CNET Result . . . 58
4.11 Impact of Number of Microengines (512B Packet Size) - CNET Result . . . 59
4.12 Impact of Number of Threads (64B Packet Size) - CNET Result . . . 60
4.13 Impact of Number of Threads (512B Packet Size) - CNET Result . . . 61
4.14 Packet Sort Implementation in the IXP . . . 61
5.1 Packet Arrival in NP . . . 68
5.2 Bursty Traffic Generation . . . 69
5.3 Petri Net Model of Traffic Generator . . . 70
5.4 Pipelined Buffering Scheme . . . 71
5.5 Bursty Traffic Generated using 48 Sources . . . 73
List of Tables
3.1 Model Parameters used in the Petri Net Model . . . 28
3.2 Time Average DRAM Queue Length and Stall Percentage . . . 32
4.1 Petri Net Model Validation . . . 52
4.2 Impact of Buffer Allocation Schemes on Throughput . . . 57
4.3 Transmit Rates for Different Number of Threads . . . 59
4.4 Comparison of Various Schemes to Overcome Reordering . . . 62
5.1 Output Line Rates Supported with Input Rate of 1.7 Gbps . . . 74
5.2 Output Line Rates Supported with Input Rate of 3.14 Gbps . . . 75
5.3 Output Line Rates Supported with Input Rate of 6 Gbps . . . 75
5.4 Maximum Output Rates for Different Packet Buffering Schemes . . . 75
Chapter 1
Introduction
1.1 Network Processors
In recent years there has been an exponential growth in Internet traffic, leading to increasing network bandwidth requirements. For example, present backbone routers on OC-48 links (2.5 Gbps) have to process four million minimum-sized packets per second. Further, the applications are also changing, with VoIP (Voice over Internet Protocol) and P2P (Peer to Peer) applications gaining increasing popularity. Further, with the advent of IPSec [21] there is a need to support encryption/decryption at the network layer. These growing functionalities at the network layer and the increasing bandwidth requirements have led to application-specific devices being deployed at the network layer. Network processors [14, 15, 26, 13] are application-specific processors that are specialized to perform network layer functionalities like IPv4 forwarding [2] and NAT [34]. These processors perform key computational functions like encryption/decryption in hardware. However, these processors, unlike ASICs, are programmable and hence are easily adaptable to changing network standards and applications. This helps in reducing the Time to Market (TTM) compared to ASICs. As a result, a number of network processors have proliferated in the market recently [14, 15, 26, 13].
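The four-million figure can be checked with a little arithmetic. The sketch below assumes minimum-sized 64 B Ethernet frames plus the usual 8 B preamble and 12 B inter-frame gap; the exact framing overhead depends on the link layer, so the numbers are illustrative:

```python
LINE_RATE = 2.5e9      # OC-48 line rate, bits/s
MIN_FRAME = 64         # minimum Ethernet frame, bytes (assumed framing)
PREAMBLE, IFG = 8, 12  # per-frame overhead, bytes

wire_bits = (MIN_FRAME + PREAMBLE + IFG) * 8  # 672 bits per packet slot
pps = LINE_RATE / wire_bits                   # packets per second
budget_ns = 1e9 / pps                         # processing budget per packet
print(f"{pps / 1e6:.2f} Mpps, {budget_ns:.0f} ns per packet")
```

This works out to roughly 3.7 million packets per second, i.e. a processing budget of under 300 ns per minimum-sized packet, which is far less than a single DRAM round trip.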
Commercial network processors [13] [14] are store-and-forward architectures that buffer incoming packets in a buffer memory (usually the DRAM), process the packets, and forward them to the corresponding output port. Networking applications exhibit packet-level parallelism, where the processing of different packets is independent. Network processors exploit this characteristic of applications by employing multiple processing engines to process packets. Further, they employ hardware-level multithreading to mask latencies in accessing memory or other application-specific functional units. Hence network processors support multiple threads and have hardware support for low-overhead context switching. For example, the Intel IXP 2400 processor [14] [15] uses a total of 64 threads (8 threads per microengine) to process packets. This enables modern routers to support OC-48 and higher line rates.
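The benefit of hardware multithreading can be seen with a standard back-of-the-envelope latency-hiding model; the cycle counts below are illustrative, not IXP measurements. A microengine saturates once enough threads exist to cover the memory stall time:

```python
def me_utilization(n_threads, compute_cycles, stall_cycles):
    # While one thread waits on memory, up to n_threads - 1 others run;
    # the engine is fully busy once n_threads * compute >= compute + stall.
    return min(1.0, n_threads * compute_cycles /
                    (compute_cycles + stall_cycles))

# Illustrative numbers: 50 cycles of work per 300-cycle memory stall.
single = me_utilization(1, 50, 300)  # mostly idle
eight = me_utilization(8, 50, 300)   # latency fully hidden
```

The model also hints at the diminishing returns observed later in the thesis: once utilization reaches 1.0, adding further threads cannot help, and some other resource becomes the bottleneck.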
The performance evaluation of network processors is complex due to the interaction among multithreading, multiple processors, complex memory structures with varying access times, and application-specific functional units. Earlier work on performance evaluation of network processors uses either standard queuing models [9], other analytical models [1], or simulation-based approaches [29]. However, many of these works assume that the packets are already buffered in the DRAM and do not model the flow of packets in and out of DRAM. Since DRAM is a critical resource which can potentially affect the performance of NPs, due to the high latency involved in an access, not modeling the flow of packets in and out of DRAM affects the accuracy of these studies. We address this problem by developing a detailed Petri net model for network processing applications running on NPs.
Network processors use multiple threads and microengines to exploit packet-level parallelism in network applications. While this can improve the throughput or performance of NPs, it can also adversely affect the packet order at the output of the network processor. Earlier works have shown the adverse impact of packet reordering on TCP throughput. However, earlier work on performance evaluation of network processors [9, 29] does not study this issue. Similarly, earlier work on packet reordering [5, 24, 19] studies the impact of packet reordering/retransmission on TCP throughput. However, these works do not consider the impact of the network processor architecture on packet reordering/retransmission.
1.2 Our Contribution
1.2.1 Performance Evaluation and Architecture Exploration
In the first part of this thesis we develop a Petri net model of the IXP 2400/2850 processor running three different network applications. The model captures the packet flow in detail from the Receive FIFO to the Transmit FIFO. The main feature of the model, unlike other Petri net models [11] [33], is its ability to model the processor, the application, and their interaction in sufficient detail. The Petri net model is different for different applications. We consider header processing applications (HPA) such as IPv4 and NAT and payload processing applications (PPA) such as the IPSec protocols. The IXP processor is able to achieve a throughput of 2.96 Gbps for HPA and 3.6 Gbps for PPA applications. The Petri net model thus developed is validated using the Intel simulator for the IXP family of processors. Our performance results indicate that under a Poisson packet arrival process with minimum-sized packets, the DRAM memory used for packet buffering is the bottleneck. Our study also shows that multithreading is effective only up to a certain number of threads; beyond this threshold the packet buffer memory (DRAM) is fully utilized and multithreading is not beneficial. Since the transmit rate is limited by the packet buffer memory utilization, we investigate the following approaches to reduce the memory utilization.
• In the IXP processor, although the DRAM is 100% utilized, the SRAM is utilized only up to 27%; hence we explore placing the packet header in SRAM and the packet payload in DRAM. This scheme improves the throughput by up to 20%.
• Increasing the number of DRAM banks from 4 to 8 improves the throughput to up to 3.6 Gbps. However, when the number of banks is 8, the hash unit, a task-specific unit used for performing hardware lookup, becomes the bottleneck. Increasing the number of hash units from 1 to 2 gives an improvement in throughput of up to 60% compared to the base case. We further observe that even with fewer microengines (4 MEs) and two hash units a similar performance can be sustained.
• When the number of outstanding memory requests in the IXP processor exceeds a threshold, all microengines with memory requests at the head of their command FIFO are stalled. If instead the number of pending memory requests from each microengine is limited, a transmit rate of up to 4.1 Gbps can be achieved.
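The observations above amount to taking a min() over shared resources: whichever unit has the least per-packet service capacity caps the transmit rate. A toy calculation (all service times and access counts below are hypothetical, chosen only to illustrate the shape of the result) shows why adding DRAM banks eventually just shifts the bottleneck to the hash unit:

```python
def bottleneck_gbps(pkt_bits=512, dram_banks=4, dram_ns=160,
                    dram_accesses=4, hash_units=1, hash_ns=120):
    # Packets/s each resource can sustain; the slower one limits throughput.
    dram_pps = dram_banks / (dram_accesses * dram_ns * 1e-9)
    hash_pps = hash_units / (hash_ns * 1e-9)
    return min(dram_pps, hash_pps) * pkt_bits / 1e9

base = bottleneck_gbps()                 # DRAM-limited
banks8 = bottleneck_gbps(dram_banks=8)   # now hash-limited
banks16 = bottleneck_gbps(dram_banks=16) # extra banks buy nothing more
```

With the assumed parameters, doubling the banks raises throughput, but doubling them again changes nothing, because the hash unit has become the binding constraint, mirroring the bullet points above.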
1.2.2 Impact of Packet Reordering
In the second part of the thesis we analyze the impact of the network processor
architecture on packet reordering. Our study indicates that in addition to the con-
current processing in the network processor, the allocation in the transmit buffer
also adversely impacts packet ordering. Our results indicate that concurrent processing and naive buffer allocation can result in 31% packet reordering which, in turn, results in 6% retransmission of packets over a single hop for the IPv4 application. The reordering and retransmission rates are measured in terms of the potential number of ACK replies received by the sender. We observe that the reordering and retransmission rates increase with the number of hops, resulting in up to 61% retransmission over 10 hops.
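The mechanism is easy to reproduce in a toy event simulation (a sketch, not the Petri net model used in this thesis): packets arrive in sequence, are handed to whichever of several workers is free, and depart whenever their randomly varying service time finishes. All parameters are illustrative.

```python
import random

def reorder_fraction(n_pkts=2000, n_workers=8, seed=1):
    rng = random.Random(seed)
    free_at = [0.0] * n_workers
    departures = []
    t = 0.0
    for seq in range(n_pkts):
        t += 1.0  # unit inter-arrival time
        w = min(range(n_workers), key=free_at.__getitem__)  # idlest worker
        start = max(t, free_at[w])
        free_at[w] = start + rng.uniform(0.5, 8.0)  # variable service time
        departures.append((free_at[w], seq))
    departures.sort()  # order in which packets left the processor
    inversions = sum(departures[i][1] < departures[i - 1][1]
                     for i in range(1, n_pkts))
    return inversions / n_pkts
```

With a single worker the fraction is exactly zero, since service is serialized; with eight concurrent workers and service times larger than the inter-arrival gap, a non-trivial fraction of adjacent departures leave out of sequence.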
We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global. In the global buffer allocation, threads from different microengines compete for the same transmit buffer space. Hence, it involves a mutual exclusion operation across all the microengines and threads. This reduces the retransmission rate in a 10-hop network to 33%, but also drastically reduces the throughput to 1 Gbps, which is unacceptably low. Hence we explore a buffer allocation scheme where only threads from the same microengine compete for a common buffer space; this scheme is called local buffer allocation. Here the mutual exclusion operation is limited to threads within a microengine. This scheme results in a retransmission rate of 45% but with a transmit rate of 2.1 Gbps. In the strided buffer allocation, the transmit buffer space is allocated completely statically such that each microengine writes into transmit buffer locations that are apart by a constant stride. This eliminates the mutual exclusion, and a throughput of 2.96 Gbps is obtained. However, the retransmission remains high at 56% on a 10-hop network. Since a packet traverses 16 hops on average in the Internet [28], the observed retransmission rates can significantly affect network performance.
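As a sketch, the strided scheme reduces to a single indexing rule; `n_engines` and `buf_slots` below are illustrative parameters, not IXP constants. Because each microengine owns a disjoint set of slots, no mutex is required:

```python
def strided_slot(me_id, k, n_engines=8, buf_slots=64):
    # The k-th packet of microengine me_id goes to slot
    # me_id + k * n_engines, so two engines never share a slot.
    return (me_id + k * n_engines) % buf_slots

slots_me0 = {strided_slot(0, k) for k in range(8)}
slots_me1 = {strided_slot(1, k) for k in range(8)}
```

The trade-off described above is visible here: the slot assignment needs no synchronization at all, but it also pays no attention to arrival order, which is why strided allocation keeps the throughput high while leaving the reordering problem unsolved.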
Our results also indicate that the parallel architecture of the network processor can severely impact reordering and can cause up to 61% retransmission in a 10-hop scenario. Since our performance study in the first part of the thesis reveals that decreasing the number of microengines from 8 to 4 while keeping the total number of threads at 16 does not degrade the performance, we study the impact of packet reordering on a network processor with a fewer number of microengines. The retransmission rate reduces from 61%, for a network processor with 8 microengines and 8 threads each, to 19% for a network processor with 2 microengines and 8 threads or 4 microengines and 4 threads. This is achieved without sacrificing the throughput (2.96 Gbps), because the throughput of the network processor saturates beyond a total of 16 threads. Further, to reduce retransmission rates we propose a scheme, Packetsort, in which a few microengines/threads are dedicated to sorting the packets in order. We compare the performance of Packetsort with the existing in-order schemes in the IXP, namely Inter Thread Signaling (ITS) and Asynchronous Insert Synchronous Remove (AISR). Packetsort achieves a throughput of 2.3 Gbps and performs better than ITS and AISR, which achieve throughputs of 2.1 and 1.1 Gbps respectively.
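The idea behind Packetsort can be sketched as a small reorder buffer keyed by sequence number (a simplified model of the scheme, not the microengine implementation): packets that complete early are held until every earlier packet has been released.

```python
import heapq

class PacketSorter:
    def __init__(self):
        self.next_seq = 0  # next sequence number allowed to leave
        self.pending = []  # min-heap of (seq, packet) that finished early

    def push(self, seq, pkt):
        # A worker finished packet `seq`; return the packets now
        # releasable in strict sequence order (possibly none).
        heapq.heappush(self.pending, (seq, pkt))
        released = []
        while self.pending and self.pending[0][0] == self.next_seq:
            released.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return released
```

For example, if packet 1 finishes before packet 0, `push(1, …)` returns nothing; the subsequent `push(0, …)` releases both packets in order. The cost is the buffering and the dedicated sorter, which is why Packetsort's throughput (2.3 Gbps) sits slightly below the unordered strided scheme.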
1.2.3 Performance under Bursty Traffic
The final part of this thesis investigates the performance of the network processor under bursty traffic. Earlier works on performance evaluation of network processors [29] [9] evaluate the performance under a Poisson packet arrival process in a DoS attack scenario. However, earlier works [8] on traffic characterization indicate that, on average, only 10% of the traffic is due to DoS attacks. This work studies the performance of the network processor under bursty traffic, which we model using a Pareto distribution. Further, we explore various packet buffering schemes; in particular, we consider parallel and pipelined packet flow architectures. Our results indicate that the parallel scheme incurs considerable packet drop and results in a lower throughput. In contrast, the pipelined scheme results in a higher throughput and a lower packet drop.
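A Pareto-distributed ON/OFF source is straightforward to sample by inverse transform; the parameters below are illustrative, and the thesis's actual generator is the Petri net model described in Chapter 5:

```python
import random

def pareto_periods(n, alpha=1.5, xm=1.0, seed=7):
    # X = xm / U^(1/alpha) with U ~ Uniform(0, 1] is Pareto(alpha, xm);
    # alpha < 2 gives the heavy tail responsible for burstiness.
    rng = random.Random(seed)
    return [xm / (1.0 - rng.random()) ** (1.0 / alpha) for _ in range(n)]

on_periods = pareto_periods(1000)
```

Aggregating many such ON/OFF sources (48 of them in Chapter 5) produces the sustained bursts that a Poisson arrival process never shows, which is precisely what stresses the packet buffers.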
1.3 Organization of the Thesis
The rest of the thesis is organized as follows. The following chapter presents the IXP architecture in detail, discusses the different network applications used in our study, and provides an introduction to Petri nets. Chapter 3 presents the Petri net model of the network processor and the model for a memory access; it also presents the validation results of the Petri net model and the architecture exploration of the network processor. Chapter 4 studies packet reordering in network processors and presents different ways to reduce it. In Chapter 5 we evaluate the network processor under bursty traffic and also evaluate the performance of different packet buffering schemes. Chapter 6 discusses the related work in this area. We present our conclusions and directions for future work in Chapter 7.
Chapter 2
Background
In this chapter we provide the necessary background on:
1. the network processor in general and the Intel IXP processor in particular,
2. the network applications that are used in this study, and
3. Petri nets.
This chapter is organized as follows. Section 2.1 provides an overview of different network processors and describes the IXP 2400 architecture and the packet flow in the IXP processor. Section 2.2 presents the different applications that run on the IXP processor. Section 2.3 provides the necessary background on Petri nets.
2.1 Network Processors: An overview
Commercial network processors [13] [14] are store-and-forward architectures that buffer incoming packets in a buffer memory (usually the DRAM), process the packets, and forward them to the corresponding output port. This section provides an architectural overview of the IXP 2400 processor [14] and also gives a brief overview of the Motorola C-5 [26] and IBM PowerNP [13] processors.
2.1.1 IXP Architecture
IXP processors [15] are multithreaded multiprocessor architectures which are typi-
cally employed in backbone routers.
Figure 2.1 Internal IXP 2400 Architecture. [Block diagram: eight microengines (MEs), the Intel XScale core, and the hash unit connect to the Media Switch Fabric (MSF) with its RBUF/TBUF, the scratchpad memory, and the SRAM and DRAM channels leading to off-chip SRAM/DRAM and the network ports.]
The architecture of the IXP 2400 processor, depicted in Figure 2.1, consists of an XScale core, eight microengines, and application-specific hardware units like the hash and crypto units.
The XScale is a 32-bit RISC processor used to handle control and management plane functions [38], like routing table updates and loading the microengine instructions. The XScale initializes and manages the chip and also handles exceptions. The IXP 2400 contains eight microengines, each running at 600 MHz, that are specialized to perform network processing. Each microengine contains eight hardware contexts, for a total of 64 threads, and there is no context-switch overhead. There are 256 programmable general purpose registers in each microengine, equally shared between the eight threads. Further, there is a 4K instruction store associated with each microengine.
The memory architecture of the IXP processor consists of SRAM, DRAM, scratchpad memory, and local memory. Typically packets are buffered in DRAM while the SRAM stores state information like the routing table and NAT table. The IXP 2400 supports DRAM and SRAM sizes up to 512 MB and 8 MB respectively. The RAMs are off-chip and communicate with the processor over a high-speed data path with a bandwidth of 6.4 Gbps (for the IXP 2400). The scratchpad, a low-latency on-chip memory, is used for communication between the different microengines, such as for mutex variables. Additionally, each microengine contains 640 words of local memory, which is used for communication between hardware contexts. The IXP 2400 and 2800 also provide Next Neighbour registers, which are used for communication between adjacent microengines.
The off-chip memories, namely SRAM and DRAM, are accessed through memory controllers resident on the IXP chip; there are independent controllers for the SRAM and DRAM memories. A thread requesting a memory access enqueues its request in the corresponding memory controller. The controller sends the request (the memory address) to the SRAM/DRAM and sends/receives data through the external data bus, which has a bandwidth of 6.4 Gbps. The memory controllers form the interface between the microengines and the memory.
The IXP chip contains task-specific functional units, a hash unit (in the IXP 2400 and 2850) and crypto units (in the IXP 2850), accessible by all the microengines. The hash unit can be used for hash-based destination address lookup [29]. The IXP 2850 contains two crypto units which implement the 3DES, AES, and SHA-1 algorithms in hardware. When a thread in an ME requests a hash/crypto computation, a context switch occurs.
The IXP chip has a pair of FIFOs, the Receive FIFO (RFIFO) and Transmit FIFO (TFIFO), each of size 8 KB, used to receive/send packets from/to the network ports. A data path exists between the FIFOs and the microengines and DRAM. A packet, which first gets buffered in the RFIFO, is moved to the DRAM through this data path. Similarly, packets are moved out of the DRAM to the TFIFO through this data path.
Other processors in the IXP family include the IXP 1200, 2800, and 2850. These processors have a structure similar to the IXP 2400 but differ in the number of MEs and specialized hardware units. The IXP 1200 contains six MEs with four threads per ME. The IXP 2800 has sixteen MEs, with each ME supporting eight threads. The IXP 2850 is similar to the IXP 2800 but additionally has a crypto unit which implements cryptographic algorithms in hardware.
The IXP processor uses a unique mechanism to access memory (SRAM and DRAM). A detailed understanding of this interaction is needed because the memory latency, as will be discussed in the subsequent chapter, limits the network processor throughput.
2.1.1.1 Microengine-Memory Unit Interaction
When a thread in a microengine requests a memory access or an access to the hash or crypto unit, it places an appropriate request in the respective microengine command queue (MEQ) (refer to Figure 2.2). A maximum of four outstanding requests can be placed in a single microengine command queue. A common command bus arbiter moves requests from the microengine command queues to the respective queues of the task units or memory units.

Requests in the task units are processed in FIFO order. In the case of memory (DRAM), accesses to the same bank are processed in FIFO order. Each memory/task queue allows a maximum of sixteen outstanding requests. If any of these queues fills to a threshold level (a queue length of 10), the corresponding unit (memory/task unit) applies a back-pressure mechanism on the command bus arbiter [14]. This prevents further issue of requests from all microengines containing a request of this type at the head of their command queue, which consequently can fill the microengine command queue (MEQ). In this scenario
Figure 2.2 Microengine - Memory Unit Interface in IXP 2400. [Diagram: per-microengine command queues (MEQ) for ME0 through ME7 feed a common command bus arbiter, which forwards requests to the DRAM and SRAM queues in front of the DRAM and SRAM controllers.]
a thread in a microengine attempting to place a request in the command queue (MEQ) stalls, since the queue is full. Our performance results indicate that these stalls waste a significant number of microengine clock cycles, since other threads waiting to execute on the microengine are prevented from doing so.
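The queue-and-threshold rules described above can be captured in a few lines (a behavioural sketch of the mechanism, not the hardware): a thread stalls when its 4-entry MEQ is full, and the arbiter stops issuing once a memory queue reaches the back-pressure threshold of 10.

```python
from collections import deque

ME_Q_MAX = 4          # outstanding requests per microengine command queue
BACKPRESSURE_AT = 10  # memory queue length that triggers back-pressure

def thread_enqueue(me_q, request):
    # Returns False when the thread must stall: its MEQ is full.
    if len(me_q) >= ME_Q_MAX:
        return False
    me_q.append(request)
    return True

def arbiter_step(me_q, mem_q):
    # The command bus arbiter moves one request from the MEQ to the
    # memory queue, unless the memory unit is applying back-pressure.
    if not me_q or len(mem_q) >= BACKPRESSURE_AT:
        return False
    mem_q.append(me_q.popleft())
    return True
```

The sketch makes the cascade explicit: back-pressure halts the arbiter, the MEQ then fills, and finally the thread itself stalls, which is exactly the chain of events behind the wasted microengine cycles noted above.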
The following section describes the packet flow in the IXP processor. We assume that packets have already arrived at the input MAC.
2.1.1.2 Packet Flow in IXP Processors
Packets arrive from the external link at the input ports and get buffered in the input port buffers (refer to Figure 2.3). Packets are then transferred to the RFIFO of the NP through a high-speed media interface. When a thread in a microengine is available, it takes charge of the packet and transfers it from the RFIFO to the DRAM. The packet/packet header is read from the DRAM by the corresponding thread. The thread processes the packet/header, modifies it as necessary, and writes back the new packet/header to the DRAM. Next the thread places the packet in the TFIFO of the NP and writes the packet to the corresponding output port buffer
Figure 2.3 Packet flow in the IXP 2400. (The figure shows packets arriving from the OC line into the MAC, moving through the RFIFO into DRAM, being processed by a microengine (ME), and leaving through the TFIFO.)
through the media interface. Once the packet is transferred to the next hop, the
thread is freed. It should be noted that during the packet flow from the RFIFO
to the TFIFO, a single thread is responsible for moving the packet. However, during
each transfer, e.g., from RFIFO to DRAM or from DRAM to TFIFO, the thread
relinquishes the microengine (it is context switched out).
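The per-packet lifecycle above can be summarized as a short list of stages; the stage names paraphrase the flow described in the text, and the helper function is only an illustrative sketch, not IXP microcode:

```python
# Stages a single thread drives each packet through, per the text. The thread
# relinquishes the microengine (context switch) during every memory transfer.
PACKET_STAGES = [
    "RFIFO -> DRAM",         # buffer the arriving packet
    "DRAM read",             # fetch the packet/packet header
    "process + modify",      # the only compute-bound stage
    "DRAM write-back",       # store the modified packet/header
    "DRAM -> TFIFO",         # stage the packet for transmission
    "TFIFO -> output port",  # via the media interface; thread freed after this
]

def is_context_switch_point(stage: str) -> bool:
    """A thread swaps out whenever the stage involves a memory/FIFO transfer."""
    return stage != "process + modify"
```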
2.1.2 Motorola C-5 Processor
The Motorola C-5 network processor [26] comprises sixteen channel processors (CPs)
with four threads on each CP. Consecutive CPs can be connected in a pipelined
fashion using special registers. CPs can be assigned individually to a port or in an
aggregate mode. In addition to CPs there is an executive processor (XP) that serves
as a centralized computing resource for the C-5 and manages the system interfaces.
Three on-chip buses connect all these computational resources to external memories.
Three specialized units are part of the memory controllers: the table lookup unit
(TLU), which accelerates six different types of table lookup algorithms with 11
dedicated instructions; the buffer management unit, which accelerates creation and
destruction of variable-width buffers for payload data stored in SDRAM; and the
queue management unit, which accelerates creation and destruction of queues for
packet descriptor data stored in SRAM.
2.1.3 IBM PowerNP Network Processor
The IBM PowerNP [13] has the following main components: embedded processor
complex (EPC), data flow (DF), scheduler, MACs, and coprocessors. The EPC
processors work with coprocessors to execute application software and PowerNP-
related management software. The coprocessors provide hardware-assist functions
for performing common operations such as table searches and packet alterations.
The DF serves as the primary data path for receiving and transmitting network
traffic. It provides an interface to multiple large data memories for buffering data
traffic as it flows through the network processor. The traffic management scheduler
allows traffic flows to be scheduled individually per their assigned QoS class for
differentiated services.
2.2 Network Applications
Network applications can be broadly classified, depending on the type of processing,
into two types: Header Processing Applications (HPA) and Payload Processing
Applications (PPA) [9]. The processing in HPA is independent of the packet size
and the type of packet payload. These applications involve header field interrogation
and table lookup. Examples include IPv4 forwarding and NAT. PPA
represent applications that access the entire packet, where the amount of processing
depends on the size of the packet. These applications typically involve encryption/decryption
of the entire packet. Examples include the IP Security protocols.
We have selected two each from HPA and PPA in our study. The HPA pro-
grams chosen are IPv4 forwarding [2] and NAT [34], and the PPA programs used
are IP Security protocols: Authentication Header(AH) [21] and Encapsulation Se-
curity Payload(ESP) [22].
This section describes the different network applications used in our study. We
observe that all applications running on routers have similar flows. The application
buffers the incoming packets into DRAM, reads the packet/packet header depending
on the application, processes the packet, writes the packet/packet header back to
the DRAM, and transfers the modified packet to the transmit buffer.
2.2.1 IP Forwarding
IP forwarding is a fundamental operation performed by the router. We focus on
forwarding for IP version 4 packets [2]. IPv4 uses the IP header of the packet to
determine the destination address. A lookup is performed on the destination
address in the IP header to determine the destination port number and the next-hop
address. The routing table is stored in the SRAM. The packet header is modified
accordingly. This work uses a hash-based lookup [29]. The time-to-live field in the IP
header is decremented and the cyclic redundancy checksum (CRC) is recomputed. The
packet is then forwarded to the next hop.
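The per-packet sequence just described can be sketched as follows. The dict-based table, field names, and use of CRC32 are illustrative stand-ins (the thesis uses a hash-based lookup and the IXP's hardware units); this is not the actual microcode:

```python
import zlib

# Illustrative routing table: destination prefix -> (next hop, output port).
# The thesis uses a hash-based lookup [29]; a Python dict is the simplest analogue.
ROUTE_TABLE = {"10.1.0.0/16": ("10.1.0.1", 3)}

def forward(header: dict):
    """Process one IPv4 header: lookup, TTL decrement, checksum recompute."""
    next_hop, port = ROUTE_TABLE[header["dst_prefix"]]  # hash-based lookup
    header["ttl"] -= 1                                   # decrement time-to-live
    if header["ttl"] <= 0:
        return None                                      # drop expired packet
    # Recompute the checksum over the modified header fields (CRC32 stands
    # in here for the checksum computed by the hardware CRC unit).
    header["checksum"] = zlib.crc32(
        f'{header["dst_prefix"]}{header["ttl"]}'.encode())
    return next_hop, port
```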
2.2.2 Network Address Translation
Network Address Translation or NAT [34] is a method by which many network
addresses and their TCP/UDP ports are translated into a single network address
and its TCP/UDP ports. We focus on NAT for the TCP protocol. When a host in
the LAN, which is assigned a local IP address, initiates a TCP session through the
router to an external network, the router changes the source IP address field in
the IP header to the globally visible router IP address. In addition, a unique port
number is allocated by the router to the session. The port numbers assigned by the
router increase in steps of one and wrap around after 65535.
A tuple consisting of the protocol name (TCP or UDP), the source IP address, and
the source port number distinguishes a connection. The translation table stores the
tuple and the corresponding private IP address. The translation table is maintained by the
router. It is used to route packets from the external network to the corresponding
local node. The translation table is typically stored in the SRAM.
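The translation table described above can be sketched as follows. The router IP, port range, and method names are illustrative assumptions; on the NP the table itself lives in SRAM:

```python
class NatTable:
    """Sketch of a NAT translation table: (proto, IP, port) -> private node."""

    def __init__(self, router_ip="203.0.113.1", first_port=1024):
        self.router_ip = router_ip
        self.next_port = first_port
        self.table = {}   # (proto, router IP, assigned port) -> (private IP, port)

    def translate_outgoing(self, proto, src_ip, src_port):
        """Rewrite the source of an outgoing packet, recording the mapping."""
        port = self.next_port
        # Ports are handed out in steps of one and wrap around (per the text).
        self.next_port = self.next_port + 1 if self.next_port < 65535 else 1024
        self.table[(proto, self.router_ip, port)] = (src_ip, src_port)
        return self.router_ip, port

    def translate_incoming(self, proto, dst_ip, dst_port):
        """Route a packet from the external network back to the local node."""
        return self.table.get((proto, dst_ip, dst_port))
```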
2.2.3 IP Security
IPSec protocols [36] are used to provide privacy and authentication services at the IP
layer. The two protocols supported by IPSec are Authentication Header (AH) [21]
and Encapsulation Security Payload (ESP) [22]. IP Authentication Header (AH) is
used to provide connectionless integrity and data origin authentication for IP data-
grams while the IP Encapsulation Security Payload (ESP) encrypts the TCP/UDP
segment in addition to the AH features.
IPSec protocols use a network handshake mechanism between the source and
the destination to establish a security association (SA). The security association is a
3-tuple containing the security protocol (AH or ESP), the source IP address, and a
32-bit connection identifier referred to as the SPI (Security Parameter Index). The
SPI identifies the SA state, which includes the shared key used for encryption and
the encryption algorithm. After the handshake, the requisite computation is done
depending on the protocol. A digital signature is computed over the packet payload.
The key shared with the destination is used in the signature computation. The
AH/ESP header is placed after the IP header but before the higher-level protocols.
In the case of ESP, in addition to all the processing needed in AH, the higher-layer
protocol data is encrypted and placed after the ESP header.
We assume that the SPI data, in particular the shared key, is stored in SRAM.
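The AH signature computation over the payload can be sketched with a keyed MAC. The choice of HMAC-SHA1 here is an illustrative assumption (the SA negotiates the actual algorithm), and the key and payload values below are made up:

```python
import hmac
import hashlib

def ah_digest(shared_key: bytes, payload: bytes) -> bytes:
    """Compute an AH-style integrity check value over the packet payload
    using the key shared with the destination. HMAC-SHA1 is one algorithm
    an SA may specify; the crypto unit on the NP performs this in hardware."""
    return hmac.new(shared_key, payload, hashlib.sha1).digest()
```

ESP performs the same integrity computation and additionally encrypts the higher-layer protocol data, which is why its per-packet processing cost (Table 3.1) is higher.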
2.3 Petri Nets: An Introduction
A Petri net is a mathematical modeling tool commonly used to model concurrency
and conflicts in systems. A Petri net is a particular kind of directed graph,
together with an initial state called the initial marking. The underlying graph of a
Petri net is a directed, weighted, bipartite graph consisting of two kinds of nodes,
called places and transitions, where each arc goes either from a place to a transition
or from a transition to a place. Places of a Petri net usually represent conditions,
such as the availability of resources in the system, while transitions model the
activities in the system. A token in a place is interpreted as holding the truth of the
condition associated with the place. It can also indicate the availability of a resource.
A place can be associated with zero or more tokens. The number of tokens in each
place of a Petri net model constitutes the marking of the model. The marking
represents the state of the underlying system being modeled. A system normally
starts with an initial marking representing the initial state of the system. The
system moves from one state to another as the transitions in the model fire,
resulting in new markings.
A transition usually represents the occurrence of an event. A transition has
a certain number of input and output places representing the pre-conditions and
post-conditions of the event. An input place p of a transition t has an incoming
arc (p, t) into the transition, and an output place q has an outgoing arc (t, q) from
the transition. A transition can have zero or more input places and zero or more
output places. A transition fires (or an event is said to take place) only when all the
input places associated with the particular transition have at least one token each.1
The firing time of a transition represents the delay associated with the occurrence
of the event. A transition is classified as either an instantaneous or a timed transition.
Instantaneous transitions, represented by thin lines, model events that take
zero time. Timed transitions take a finite amount of time and are represented by
thick lines. A Petri net with both timed and instantaneous transitions is referred
to as a stochastic Petri net (SPN). For timed transitions, the firing time
takes values from a firing function. For example, for the timed transitions of
a generalized stochastic Petri net (GSPN) the firing function can either be a
constant or take exponentially distributed values.
1 Once the transition fires, the tokens that enabled the transition are removed from the respective input places, and the required number of tokens are deposited in all the output places of the transition.
Figure 2.4 Petri Net Example. (The figure shows places READY, CPU, DECIDE, and TERMINATE, the timed transition EXEC, and the instantaneous transitions CONTINUE and END with arc probabilities 0.9 and 0.1.)
Figure 2.4 shows a simple Petri net model of a simplistic round-robin
CPU scheduler.
The place READY represents the ready queue with a fixed number of ready
tasks in the initial state. The place CPU with a token represents the availability
of the processor. The transition EXEC models the execution of a ready process
for one scheduling quantum. In this example we assume that this
transition takes a fixed amount of time q. Thus each firing of EXEC removes a token
each from the places CPU and READY, takes q amount of time, and places a token
each in the output places DECIDE and CPU. The DECIDE place models a conflict.
The output arc from this place to the transition CONTINUE has a probability of
0.9 while that to the transition END has a probability of 0.1. Accordingly, one of
these transitions is enabled. These transitions are instantaneous and place a token
in one of the output places, CPU or TERMINATE.
Petri nets are capable of modeling concurrency and conflicts. A conflict occurs
when a place is an input place to multiple transitions. Such conflicts are resolved
by assigning probabilities to each of the output arcs of the place or, equivalently, to
the input arcs of the respective conflicting transitions. The probability associated
with an arc, and hence with the respective transition, is the probability with which
that transition is chosen from among the conflicting transitions, provided the
transition is otherwise ready. Concurrency is naturally modeled in Petri nets, as
multiple transitions that are ready can fire simultaneously.
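The enabling and firing rules described in this section can be written down directly. The place names below follow the scheduler example of Figure 2.4; the function names and dict-based marking are our own sketch:

```python
def enabled(marking, pre):
    """A transition is enabled when every input place holds at least as
    many tokens as the weight of its input arc."""
    return all(marking.get(place, 0) >= w for place, w in pre.items())

def fire(marking, pre, post):
    """Fire a transition: remove the enabling tokens from the input places
    and deposit tokens into the output places, yielding the new marking."""
    assert enabled(marking, pre)
    m = dict(marking)
    for place, w in pre.items():
        m[place] -= w
    for place, w in post.items():
        m[place] = m.get(place, 0) + w
    return m

# The EXEC transition of the Figure 2.4 scheduler: consume a token each
# from READY and CPU, deposit a token each into DECIDE and CPU.
EXEC_PRE = {"READY": 1, "CPU": 1}
EXEC_POST = {"DECIDE": 1, "CPU": 1}
```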
Colored Petri nets [20] have the same modeling power as classical
Petri nets, but are more concise from a graphical viewpoint. This conciseness is
achieved by merging analogous places of a model into a single place and associating
colors with tokens, places, and transitions to distinguish among the various elements.
A transition can fire with respect to each of its colors. When a transition fires, tokens
are updated in the normal way, except that a functional dependency is specified
between the color of the transition firing and the colors of the involved tokens.
Petri net models are generally used to analyze systems and establish properties
such as liveness, reachability, safety, and boundedness [27]. They can also be used
for performance evaluation, either through analysis or through simulation [11, 33].
In this thesis we use the latter approach.
Chapter 3
Performance Modeling and
Evaluation
3.1 Introduction
The architecture models in earlier works [7, 38, 9] do not study the impact of NP-specific
architectural features such as hash units, crypto units, CRC units, FIFOs, and
multiple processors. These features interact with the subsystems in complex ways,
and it is necessary to model them appropriately to obtain accurate performance
measures. Further, the DRAM is involved in all stages of the packet flow in the
network processor. Hence, it is important to model the packet flow end-to-end (i.e.,
including the transfer from the MAC to the RFIFO to DRAM and from DRAM to
the TFIFO to the MAC). The large number of DRAM accesses and the high latency
of a memory access suggest that the DRAM can be a potential bottleneck. However,
earlier works on network processors [29, 31, 9] assume that packets are already
buffered (resident) in memory.
In this chapter, we develop a Petri net model for both the network processor
and flow of packets in the network processor. Unlike some of the earlier works on
Petri net modeling for multithreaded processors [11, 33], which focused on modeling
the processor architecture, and performance models of network processors [31, 9],
our model captures the architecture, the application, and their interaction in great
detail. Hence each application-architecture pair is modeled as a separate Petri net. In the
following subsection we describe the Petri net model for a single microengine running
the IPv4 forwarding algorithm, and later extend it to multiple microengines.
The models for the other applications are developed in a similar manner. Our model is
validated against the Intel proprietary simulator [16] for different parameters and for
different applications. We use this model in the subsequent performance evaluation
and architecture exploration.
This chapter is organized as follows. The Petri net model of the IXP proces-
sor for different applications is presented in section 3.2. Section 3.3 describes the
simulation methodology. Section 3.4 provides a detailed performance analysis and
evaluation of the model.
3.2 A Single Microengine Petri Net Model
Figure 3.1 shows a part of the Petri net model for a single microengine running
the IPv4 application. For clarity, only the part of the model that captures the
flow of packets from the external link to the DRAM through the MAC is shown. The
firing time of a timed transition in our model takes either deterministic or exponentially
distributed values. In the following description, words in italics represent
places/transitions.
The place INPUT-LINE represents the external link. Packets arrive at IMAC,
the input MAC, from the external link at line speed.1 If an input port (IPORT)
1 The line speed corresponds to 2.5 Gbps or higher.
22 Performance Modeling and Evaluation
Figure 3.1 Petri Net Model for a Single Microengine in IXP 2400 Running the IPv4 Application. (The figure shows the flow from INPUT-LINE through IMAC, RMACMEM, RFIFO, and DRAM, with resource places THREAD, UE, UE-CMD-Q, DRAM-Q, and CMD-BUS, wait places such as WAITCMDBUS1 and MEM-R1, and transitions including LINE_RATE, MAC-FIFO, UE-RFIFO, UE-PROCESSING, SWAP-OUT, MV-DQ, RFIFO-DRAM, and DRAM-XFER.)
in the MAC is free and there is sufficient MAC memory, i.e., at least one token in
IMAC, the packet gets buffered in the MAC. A token in RMACMEM indicates that
a packet has been buffered in the MAC. If a thread is free, denoted by a token in
place THREAD, it takes control of the packet and transfers it to the receive
buffer (RFIFO). The initial marking of place THREAD denotes the total number of
threads in a microengine. If the microengine is free, represented by a token in place
UE, the thread executes for UE-PROCESSING clock cycles and moves
the packet from the RFIFO to DRAM. The thread swaps out, denoted by the arc from
SWAP-OUT to UE, after initiating a memory transaction by placing the request for
memory access in the microengine command queue (UE-CMD-Q). The availability
of a free entry in the command queue is denoted by a token in the place UE-CMD-Q.
The memory request is then moved from UE-CMD-Q to DRAM-Q through the
command bus arbiter (CMD-BUS). We defer the discussion on modeling memory
accesses to section 3.2.2. The memory request is processed by the DRAM and a token
is placed in DRAM-XFER, indicating the completion of the memory operation.
The places UE, DRAM, and CMD-BUS represent conflicts, i.e., two events competing
for a common resource. Conflicts are resolved by assigning probabilities to the
conflicting events. Our Petri net model assigns equal probabilities for accessing shared
resources.
The transitions MAC-FIFO and RFIFO-DRAM represent the packet flow from the
MAC to the RFIFO and from the RFIFO to DRAM, respectively. These transitions
thus capture a part of the packet flow in the network processor. The places UE, THREAD,
UE-CMD-Q, DRAM-Q, and CMD-BUS represent various resources in the architecture and
hence model the processor architecture. The timed transitions UE-PROCESSING
and RFIFO-DRAM represent specific tasks, and the time taken by these transitions
models the time taken by the corresponding tasks in the specific units. Thus
the Petri net model captures the processor architecture, the applications, and
their interaction in detail.
3.2.1 Multiple Microengine Petri Net Model
In network applications, each microengine processes packets that are independent
of the packets processed by other microengines [7]. The processing done by each
microengine (described in the previous subsection) is represented by a color. We use
colored Petri nets for modeling multiple microengines. The number of microengines
is represented by the number of initial tokens, of different colors, in the place UE.
The NP being a store-and-forward architecture, the processor-memory interaction
is critical. The following subsection models the memory accesses in the IXP
processor.
3.2.2 Memory Modeling
Figures 3.2 and 3.3 show the detailed Petri net models of DRAM accesses in the IXP
chip. The DRAM memory in the IXP has four banks. The memory architectures
of the 2400 and 28XX processors differ: the IXP 2400 supports DDR DRAM while
the 28XX series supports only Rambus DRAM. Rambus DRAMs differ from DDR
DRAMs in that they support pipelined memory accesses [18]. Our Petri net model
can model both these types of DRAM. The rest of this subsection describes the
Petri net modeling of memory accesses in DDR DRAM and Rambus DRAM.
Figure 3.2 shows the Petri net model for memory accesses to a DDR DRAM. A
token in DRAM indicates that the DRAM is free for a memory access. We give an
initial marking of 4 to the place DRAM to represent the four available DRAM banks.
We say that a bank conflict arises when two memory accesses attempt to access
the same bank. Figure 3.2 models bank conflicts as follows. A token
is placed in MEMR2 for accessing the DRAM. The token is either returned
Figure 3.2 Petri Net model for Memory Access in DDR DRAM. (The figure shows the place MEMR2 feeding either the BANK_CONFLICT retry path, taken with probability 1-p1, or the places WAITMEM1/WAITMEM2 and the UE-DRAM transition, with the DRAM place holding the bank tokens.)
to MEMR2 after BANK-CONFLICT clock cycles with probability (1 - p1), or the
token is placed in WAITMEM1 with probability p1. Thus, when a memory request
is not processed immediately, it is made to wait for BANK-CONFLICT clock cycles
before accessing the DRAM again. The BANK-CONFLICT time is chosen as the
average DRAM memory access time for a packet. In this model, memory accesses
therefore go to different memory banks with probability p1.
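The retry behavior of this model can be expressed as a small Monte Carlo sketch; p1 and the cycle counts are model parameters rather than measured values, and the function name is ours:

```python
import random

def dram_access_cycles(p1, bank_conflict_cycles, service_cycles, rng=None):
    """Cycles for one DDR-DRAM access under the model of Figure 3.2: with
    probability p1 the access goes to a free bank immediately; otherwise it
    waits BANK-CONFLICT cycles and retries."""
    rng = rng or random.Random()
    cycles = 0
    while rng.random() >= p1:            # bank conflict with probability 1 - p1
        cycles += bank_conflict_cycles   # wait before re-attempting the access
    return cycles + service_cycles       # finally serviced by a free bank
```

Under this model the expected latency is service_cycles + bank_conflict_cycles * (1 - p1) / p1, so a lower p1 (more bank conflicts) inflates the effective DRAM access time.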
Figure 3.3 shows the Petri net model for memory accesses to a Rambus DRAM.
Memory accesses to a Rambus DRAM are pipelined. We assume four pipe stages,
represented by the places PIPESTAGE1 through PIPESTAGE4. A bank conflict
in a pipe stage incurs a PIPE-CONFLICT penalty. The pipe conflict time is roughly
one fourth of the bank conflict time; hence bank conflicts in Rambus DRAMs incur
a smaller penalty.
Figure 3.3 Petri Net model for Memory Access in Rambus DRAM. (The figure shows a request from MEMR2 traversing four pipe stages, PIPESTAGE1 through PIPESTAGE4, each with an associated PIPE-CONFLICT transition and wait place, ending in MEMCOMP.)
3.3 Performance Evaluation of IXP
In this section we present the performance evaluation results of the NP running
different applications. We use the IPv4, NAT, and IPSec applications as benchmarks.
The applications are described in section 2.2.
3.3.1 Simulation Methodology
We have developed Petri net models for IPv4 and NAT running on the IXP 2400,
and for the IPSec protocols (AH and ESP) running on the IXP 2850.2 The PN model
for each application is simulated using CNET [40]. CNET is an event-driven simulator
that simulates our timed Petri net models. The simulator maintains a queue called the
event-queue.3 This list is ordered by the time at which the events
are scheduled to occur, with the event scheduled to occur in the nearest future at
the head of the list. The simulation time is advanced according to the events
to be executed in the event-queue. Each event can trigger another set of events
to be enqueued in the event-queue. The simulation stops either when there are no
events left in the event-queue or when the simulation time limit is exceeded. In our case,
the simulation is run for 10^8 microengine clock cycles. The simulator outputs the
following performance metrics: the total number of tokens in a place; the time-averaged
number of tokens in a place; and the minimum and maximum number of tokens
in a place at any given time instant. We use these metrics in the following performance analysis.
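The event-queue mechanism just described is the standard discrete-event loop; a minimal sketch follows (CNET's internals will of course differ, and the function signature here is our own):

```python
import heapq

def simulate(initial_events, horizon):
    """Run a discrete-event simulation: pop the earliest event, execute its
    handler, and enqueue any events it triggers, until the queue empties or
    the simulation time limit is exceeded. Returns the number of events fired."""
    event_q = list(initial_events)       # entries are (time, seq, handler)
    heapq.heapify(event_q)
    seq, fired = len(event_q), 0         # seq breaks ties so handlers never compare
    while event_q:
        now, _, handler = heapq.heappop(event_q)
        if now > horizon:
            break                        # simulation time exceeded
        fired += 1
        for delay, new_handler in handler(now):   # each event may trigger more
            heapq.heappush(event_q, (now + delay, seq, new_handler))
            seq += 1
    return fired
```

In CNET's case each event corresponds to a transition firing, and the handlers would update the marking and schedule the transitions the new marking enables.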
The simulations were performed for different microengine/thread configurations,
for a total of 16 configurations. We use the notation 2X8 to denote a configuration
with 2 microengines, each executing 8 threads. We model packet arrivals as a
Poisson process [6] with mean arrival rate λ. In our study we assume a line rate of
6 Gbps and a fixed packet size of 64 bytes.4 Further, we assume that accessing 8 B
of data takes 50 nanoseconds in DRAM and 8 nanoseconds in SRAM [18]. However,
to access larger chunks of data (such as 64 B) in DRAM that lie in contiguous memory
locations, only an additional 5 nanoseconds per 8 B is required [12].
2 IXP 2400 does not provide specific support for cryptographic applications; the crypto unit is present only in the IXP 2850.
3 The event in this description refers to a transition.
4 For these parameters λ is 0.24 micro-seconds.
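These timing parameters let us compute the latency of a contiguous DRAM access directly; for example, a 64 B minimum-size packet costs 50 + 7 × 5 = 85 ns rather than eight full 50 ns accesses. A small helper (the function name is ours, the constants come from the text):

```python
def dram_access_ns(num_bytes: int, first_8b_ns: int = 50, extra_8b_ns: int = 5) -> int:
    """Latency to read num_bytes of contiguous DRAM data: 50 ns for the
    first 8 B chunk and 5 ns for each additional 8 B chunk [12, 18]."""
    chunks = -(-num_bytes // 8)              # ceiling division into 8 B chunks
    return first_8b_ns + (chunks - 1) * extra_8b_ns
```

This gap between the per-chunk and per-burst costs is one reason the contiguous packet transfers (RFIFO to DRAM, DRAM to TFIFO) are far cheaper than the same traffic issued as independent 8 B accesses.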
Tasks                     IPv4    AH   NAT   ESP
Total thread proc          120   300   123   350
Hash proc                   85     -    85     -
Crypto proc                  -    75     -   160
Each FIFO MAC transfer      32    32    32    32

Table 3.1: Model parameters used in the Petri net model
Table 3.1 provides the model parameters used in our Petri net model. Note that
these parameters are on a per thread basis and are given in terms of the number of
processor clock cycles consumed.
We make the following assumptions in our simulation. The packet sizes are
assumed to be constant and of minimum size (64 B); this assumption corresponds to
the worst-case scenario, as in DoS attacks, and hence to the worst-case performance
of the IXP processors for the various applications. Other performance studies
[7] also evaluate network processors under similar conditions. In the case of NAT, we
assume a constant session size of 10 kilobytes. We also assume that packets from
the external link and packets from the local network arrive at the NP on mutually
exclusive ports.
In order to validate the Petri net results, we have implemented all the applications
in MicroengineC [17], a high-level programming language for Intel network processors,
and simulated them on the Intel SDK 3.51 [16], an instruction-level
simulator for the IXP chip developed by Intel Corporation.
3.3.2 Validation Results
In the following subsection, we first validate the Petri net simulation
results. In the subsequent subsections, we use the Petri net approach for the
performance study of the network applications on the base IXP architectures as well as
for architecture explorations.
The following performance parameters have been measured from the SDK simu-
lation and the PN simulation. We use these parameters to compare the results from
the PN model and the Intel SDK simulator.
• Throughput: The throughput of the NP, measured in gigabits per second,
represents the aggregate traffic transmitted from all ports.
• Microengine Utilization: This parameter gives the average utilization, where
the average is measured as a time average [3]. The utilization metric measured
from the SDK simulation includes the time the microengine is executing,
aborted, and stalled. Execution stalls when the microengine command
queue is full (4 entries) and the executing thread does not swap out.
• Microengine Command Queue Length: This parameter gives the time-averaged
command queue length of a single microengine. Note that the command queue
holds all requests for DRAM, SRAM, and hash from a microengine, as
described in section 2.1.1.
• DRAM Queue Length: This metric is the time-averaged queue length of the
DRAM queue. The DRAM queue stores the requests from all MEs waiting
to be serviced by the DRAM.
• Microengine Stall Percentage: This metric gives the percentage of time a
thread in a microengine is stalled.
In this subsection we first analyze the results for the header processing applications
(IPv4 and NAT) and then for the payload processing applications (AH and ESP). The
results presented are arranged in increasing order of the total number of
threads.
Figure 3.4 shows the transmit rates obtained from the Petri net and SDK simulations
for all applications. In the Petri net simulations, we used different bank conflict
probabilities. For all applications we observe that the transmit rates obtained from
the Petri net simulations follow a trend similar to the SDK simulation. In particular,
the throughput rates from the SDK simulation closely follow those of the Petri net
simulation for bank conflict probabilities 0.5 and 0.7. Even though the variation for
other bank conflict probabilities is somewhat higher, the Petri net simulation
predicts the general trend well.
Figure 3.4 Transmit Rates from PN and SDK Simulations. (Four panels, (a) IPv4, (b) NAT, (c) AH, and (d) ESP, plot the transmit rate in Gbps for the 16 microengine/thread configurations from 1x1 to 8x8, comparing the SDK result against the CNET results for bank conflict probabilities 0.3, 0.5, 0.7, and 0.9.)
In Figure 3.5, we compare the average utilization of the microengines as observed
from the Petri net and SDK simulations. Once again these values closely match and
follow the same trend. These results essentially validate our Petri net model and
Figure 3.5 Microengine Utilization from PN and SDK Simulations. (Four panels, (a) IPv4, (b) NAT, (c) AH, and (d) ESP, plot the per-ME utilization percentage for all 16 configurations, comparing the SDK result against the CNET results for bank conflict probabilities 0.3, 0.5, 0.7, and 0.9.)
the performance results obtained from it.
3.3.3 Throughput
First we report the impact that multithreading and multiple microengines have on
the transmit rates achieved by the various applications. The performance results are
obtained by simulating our detailed Petri net model.
Figure 3.4 shows the transmit rates for all applications. We observe that as
the total number of threads increases, the throughput increases and reaches 3 Gbps
for the header processing applications (IPv4 and NAT) and nearly 4 Gbps for the
payload processing applications (AH and ESP). These correspond to OC-48 and higher
line rates. The reason for the higher throughput of the PPA applications as compared to the
Figure 3.6 DRAM utilization for Different Bank Probabilities. (Four panels, (a) IPv4, (b) NAT, (c) AH, and (d) ESP, plot the DRAM utilization percentage for all 16 configurations from the CNET results for bank conflict probabilities 0.3, 0.5, 0.7, and 0.9.)
HPA applications is that the PPA applications are computationally more intensive
and hence result in higher microengine utilization (refer to Figure 3.5).
We also observe that the throughput saturates beyond a total of 16 threads,
due to high DRAM utilization.
Config        DRAM Q length                 Stall %
           IP4    NAT   AH    ESP        IP4    NAT   AH     ESP
4X8        10.9   9.7   4.7   5.08       29.5   11.5  13.8   22.4
8X8        10.8   8.3   9.2   9.63       76.8   65    31.96  61.5

Table 3.2: Time-averaged DRAM queue length and stall percentage.
We further observe that the throughput drops in the SDK simulations in the case of IPv4
for 4X4 configurations and beyond. This is due to the negative feedback mechanism,
as discussed in section 2.2.1, which arises when the average DRAM queue length is
greater than 10 (10.81 for IPv4). We observe that the DRAM queue length for IPv4
is higher than 10. This may cause the throughput to saturate beyond 16 threads.
This results in a higher percentage of stalls (which is much higher for IPv4) resulting
in a reduced throughput.
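The feedback can be illustrated with a toy fluid model of a single DRAM queue. This is a sketch, not the simulator's actual control loop: the threshold of 10 comes from Section 2.2.1, while the arrival and service rates below are arbitrary illustrative values.

```python
# Toy sketch of the negative-feedback mechanism: microengine command
# queues may issue to the DRAM queue only while the queue holds at most
# a threshold number of requests; cycles in which issue is blocked are
# counted as stall cycles. Rates here are illustrative assumptions.

DRAM_QUEUE_THRESHOLD = 10

def issue_allowed(dram_queue_len: int) -> bool:
    """Issue is gated once the DRAM queue exceeds the threshold."""
    return dram_queue_len <= DRAM_QUEUE_THRESHOLD

def simulate(arrivals_per_cycle, service_per_cycle, cycles):
    """Return the fraction of cycles in which issue was stalled."""
    qlen, stalled = 0, 0
    for _ in range(cycles):
        if issue_allowed(qlen):
            qlen += arrivals_per_cycle
        else:
            stalled += 1
        qlen = max(0, qlen - service_per_cycle)  # DRAM serves requests
    return stalled / cycles

# When requests arrive faster than DRAM can serve them, the queue hovers
# at the threshold and a large fraction of cycles become stall cycles.
print(simulate(arrivals_per_cycle=3, service_per_cycle=1, cycles=1000))
```

With arrivals below the service rate the stall fraction is zero; above it, the queue saturates at the threshold and most cycles stall, mirroring the high stall percentages in Table 3.2.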
Figure 3.6 plots the DRAM utilization for different applications and different
numbers of microengines and threads. The DRAM utilization in PPA applications
is lower (less than 60%) for 16 or more threads5. For certain memory accesses in
PPA applications, such as packet header accesses, only the first pipeline stage of
the DRAM is used; hence the DRAM utilization is lower. Recall that these
applications are executed on the IXP 2850, which has Rambus memory; as discussed
in Section 3.2.2, the Rambus DRAM is pipelined into 4 stages. Further, we note
that the throughput for payload processing applications is higher by 33% compared
to that for header processing applications. This is because the faster accesses in
Rambus DRAM support a higher number of memory requests and hence a higher
throughput.
Figures 3.5 and 3.7 show the microengine utilization and the average microengine
command queue length, respectively, on a per-microengine basis. (Recall that the
microengine command queue holds all requests for accesses to DRAM, SRAM, and
the hash unit.) Both parameters follow a triangular pattern for all the applications.
This can be explained as follows. In a 1X8 configuration all 8 threads execute on
the same microengine, whereas in an 8X1 configuration the eight threads execute
on different microengines. This leads to a higher microengine utilization for 1X8,
nearly 60% for IPv4; in comparison, in an 8X1 configuration the utilization is only
10% for IPv4.
5DRAM utilization for Rambus DRAM is calculated as the average utilization over all the four pipe stages.
Figure 3.7 Average Microengine Queue Length for Different Bank Probabilities.
[Panels (a) IP4, (b) NAT, (c) AH, (d) ESP: average ME queue length vs. microengine x thread configuration; SDK result and CNET results for bank probabilities 0.3, 0.5, 0.7, and 0.9.]
3.3.4 Architecture Exploration
The main advantage of the PN model over the Intel SDK simulator is the relative
ease with which new architectural features can be evaluated. Further, while an SDK
simulation takes several hours for a single configuration, a PN simulation takes only
1 hour. Having validated the Petri net approach against the SDK simulator, we can
now use the former to evaluate the performance of a few enhancements that we
propose to improve the throughput of the network processor.
We explore the memory architecture only for header processing applications, since
their performance is limited by the DRAM utilization.
3.3.4.1 Impact of DRAM Banks and Hash Units
The validation results in Section 3.4.3 indicate that DRAM limits the throughput
significantly in IPv4 and NAT. Hence a larger number of DRAM banks can be
beneficial. Since the number of banks is typically a power of 2, we consider
increasing the number of DRAM banks to 8. Since DRAM is off chip, pin count is a
constraint on increasing the number of banks.
To keep the pin count the same as in the IXP processor, we assume the width of
the DRAM channel to be the same as in the base IXP processor, and model the
channel accordingly. Note that DRAM banking can still be beneficial, as the
maximum number of parallel accesses to the DRAM is increased to 8. In Figure 3.8
we plot the impact of increasing the number of memory banks.
The performance results in Figure 3.8 indicate an improvement in throughput of
up to 20% (3.6 Gbps) with respect to the base case. In particular, the performance
improvement increases when the number of threads increases from 8 to 16 (for
configurations like 2X8 and 4X4). Further, we observe that the DRAM utilization
decreases from 90% to 60%. As the DRAM utilization reduces by up to 40%, the
utilization of the hash unit increases to more than 90% and it becomes the
bottleneck.
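Why more banks help can be seen with a small Monte Carlo sketch. This is an illustration, not the Petri net model: we assume accesses are independent and uniform over the banks, which only approximates the bank-probability parameter used in the model.

```python
# Illustrative sketch: requests that map to distinct DRAM banks can be
# serviced in parallel, so expected parallelism grows with the number of
# banks. Uniform-random bank selection is an assumption made here.
import random

def expected_parallelism(n_banks, n_requests, trials=2000, seed=7):
    """Average number of distinct banks hit by a burst of requests."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        banks = {rng.randrange(n_banks) for _ in range(n_requests)}
        total += len(banks)  # distinct banks = requests served this cycle
    return total / trials

# Doubling the banks from 4 to 8 raises the number of requests from a
# burst of 8 outstanding accesses that can proceed in the same cycle.
print(expected_parallelism(4, 8), expected_parallelism(8, 8))
```

Under these assumptions the expected parallelism rises from roughly 3.6 to roughly 5.2 distinct banks per burst of 8 requests, consistent with the observed drop in DRAM utilization when moving from 4 to 8 banks.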
Figure 3.8 Impact of Number of DRAM Banks.
[Panels (a) Transmit rate (Gbps), (b) DRAM utilization (%), (c) HASH utilization (%) vs. microengine x thread configuration; 4 banks vs. 8 banks at bank probabilities 0.5 and 0.9.]
Next we evaluate the impact of increasing the number of hash units. We consider
an NP with 2 hash units. We obtain a throughput of 4.8 Gbps (shown in Figure
3.9(a)), an improvement of up to 60% in comparison with the base IXP
architecture. Further, we observe that the transmit rate does not increase beyond
4 microengines, especially for configurations such as 8X2 and 8X4. Also note that
with 2 hash units the utilization of the hash units decreases to 60% and the DRAM
utilization also remains around 60%. So an IXP architecture with only 4
microengines and 2 hash units gives a significant throughput improvement (66%)
while consuming almost the same area as the base IXP processor. This is based on
the area estimates given in [9], where a hash unit consumes almost the same area
as four microengines. So we believe that future
network processor architectures will need to scale special processing units, like hash
units, to support higher line rates.

Figure 3.9 Impact of Number of Hash Units. [Panels (a) Transmit rate (Gbps), (b) DRAM utilization (%), (c) HASH utilization (%) vs. microengine x thread configuration; 8 banks vs. 8 banks + 2 hash units at bank probabilities 0.5 and 0.9.]
3.3.4.2 Better Utilization of SRAM
The performance results for HPA indicate that DRAM is saturated beyond 16
threads, whereas the SRAM utilization is only 27% (refer to Figures 3.10(c) and
3.11(c)). Further, the memory access time for DRAM is at least 5 times that of a
similar SRAM access. In order to better utilize the SRAM and improve the packet
throughput, we consider placing the packet header, a fixed length of 20 bytes, in
SRAM and the packet payload in DRAM.
The performance results for this scheme are shown in Figure 3.10 for IPv4 and
Figure 3.11 for NAT. This results in a performance improvement of up to 20%
(Figure 3.10) in case of IPv4 and 6% in case of NAT (Figure 3.11).

Figure 3.10 Performance Enhancements from Storing Packet Header in SRAM for IP4. [Panels (a) Transmit rate (Gbps), (b) DRAM utilization (%), (c) SRAM utilization (%) vs. microengine x thread configuration; 4 banks at bank probabilities 0.5 and 0.9, with and without the packet header in SRAM.]

The performance
improvement is due to the lower memory access time of SRAM as compared to
DRAM, and to the reduced contention for accessing the DRAM. However, the
performance saturates beyond 16 threads, as the SRAM utilization increases to
greater than 90%. It is interesting to note that while the IP4 forwarding application
gives a throughput improvement of 20%, NAT gives an improvement of only around
6% (refer to Figure 3.11). This is due to the larger number of SRAM accesses
involved in NAT, since the translation table is stored in SRAM. A question that
arises with this approach is the buffering space in SRAM, as the SRAM size is
typically limited to 8 MB or 16 MB and it also stores state information such as the
lookup table or the NAT table.
However, even with an 8 MB SRAM, and leaving 2 MB for the lookup table and
other state information, we can still store as many as 6 MB/20 B = 300,000 packet
headers in SRAM. Hence the buffering space in SRAM is not really a concern.

Figure 3.11 Performance Enhancements from Storing Packet Header in SRAM for NAT. [Panels (a) Transmit rate (Gbps), (b) DRAM utilization (%), (c) SRAM utilization (%) vs. microengine x thread configuration; 4 banks at bank probabilities 0.5 and 0.9, with and without the packet header in SRAM.]

This scheme is particularly attractive since a significant performance improvement can
be achieved without any additional hardware overhead. This also indicates that
alternative ways of buffering packet headers in existing on-chip memory, such as
the scratch pad and local memory, can give significant performance improvements
without any additional cost.
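The headroom claim is easy to check numerically (decimal megabytes, as in the text):

```python
# Worked check of the buffering-space argument: with an 8 MB SRAM and
# 2 MB reserved for the lookup table and other state, 20-byte packet
# headers fit comfortably. Decimal MB = 10**6 bytes, matching the text.
SRAM_BYTES = 8 * 10**6
RESERVED_BYTES = 2 * 10**6
HEADER_BYTES = 20

headers = (SRAM_BYTES - RESERVED_BYTES) // HEADER_BYTES
print(headers)  # 300000 packet headers
```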
3.3.4.3 Limiting the Number of Pending DRAM Requests
In IPv4, we observe the DRAM queue length to be greater than 10, and stalls
account for 75% of microengine utilization (refer to Table 3.2).
Whenever the DRAM queue length exceeds 10, the feedback mechanism (discussed
in Section 2.2.1) prevents further issue of DRAM accesses from the microengine
command queue to the DRAM command queue. This also blocks other requests,
such as SRAM or hash requests, queued behind them. To alleviate this, we limit
the number of pending DRAM requests from each microengine. This allows ready
threads to execute and prevents the blocking of other accesses in each microengine
command queue.
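The proposed per-microengine throttle can be sketched as follows. This is a simplification of the real command queues: held DRAM requests are modeled as simply waiting aside, and the limit COUNT = 2 is one of the values evaluated in Figure 3.12.

```python
# Sketch of the per-microengine limit on pending DRAM requests: at most
# COUNT DRAM requests may be outstanding; further DRAM requests wait,
# but SRAM/hash requests behind them are no longer blocked.
from collections import deque

COUNT = 2  # pending-DRAM limit per microengine (one evaluated value)

def drain(cmd_queue, pending_dram):
    """Issue commands from one ME command queue; DRAM commands are held
    back once the pending limit is reached, others always proceed."""
    issued, held = [], deque()
    for cmd in cmd_queue:
        if cmd == "dram":
            if pending_dram < COUNT:
                pending_dram += 1
                issued.append(cmd)
            else:
                held.append(cmd)  # waits without blocking the queue head
        else:
            issued.append(cmd)    # SRAM/hash requests proceed
    return issued, list(held)

issued, held = drain(["dram", "dram", "sram", "dram", "hash"], pending_dram=0)
print(issued, held)  # SRAM and hash requests are not stuck behind DRAM
```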
Figure 3.12 Impact of Limiting Pending DRAM Accesses per Microengine
[Panel (a) Transmit rate (Gbps) vs. microengine x thread configuration for IPv4 with the pending-request limit COUNT = 1, 2, 3, and with no counter.]
In Figure 3.12 we plot the transmit rate under various configurations when the
number of pending DRAM requests (COUNT) per microengine is limited to 1, 2 or
3. Limiting the pending DRAM requests to 2 or 3 increases the throughput by up
to 47% compared to the base case. Note that the throughput obtained in this case
is also the maximum throughput obtained for IPv4 (refer to Figure 3.4).
3.3.5 Summary
This chapter develops a Petri Net model for a commercial network processor (Intel
IXP 2400 and 2850) for different applications. The PN model is developed for three
different applications, viz. IPv4, NAT and IPSec, and validated using the Intel
proprietary SDK simulator. The model is validated for different processor
parameters like processor utilization, queue length, and transmit rate. This validation
is done across different thread configurations. The salient feature of our model is
its ability to capture the architecture, applications and their interaction in great
detail. Our performance results show that while multithreading helps to improve
the throughput, increasing the total number of threads beyond a certain point (16
threads for HPA and 32 threads for PPA) results in performance saturation. Since
the transmit rate is limited by the packet buffer memory utilization, we investigate
different approaches to reduce the memory utilization. Our performance results
indicate that:
• In the IXP processor, although the DRAM is utilized 100%, the SRAM is
utilized only up to 27%; hence we explore placing the packet header in SRAM
and the packet payload in DRAM. This improves the transmit rate by up to
20%. This scheme is particularly attractive since it does not involve any
additional hardware and there exists sufficient space in the SRAM to buffer
packet headers.
• Increasing the number of DRAM banks from 4 to 8 improves the throughput
by up to 20%. However, when the number of banks is 8, the hash unit, a
task-specific unit used for performing hardware lookup, becomes the bottleneck.
Increasing the number of hash units from 1 to 2 improves the throughput by
up to 60% as compared to the base case. We further observe that an identical
improvement is obtained by using two hash units with fewer microengines
(4 MEs). So, for a fixed die area, an NP architecture with fewer processors but
more task-specific units than the base IXP architecture gives better
performance.
• When the number of outstanding memory requests in the IXP processor
exceeds a threshold, all microengines with memory requests at the head of
their command FIFO are stalled. If, instead, the number of pending memory
requests from each microengine is limited, an improvement in transmit rate of
up to 47% compared to the base case can be achieved.
Chapter 4
Packet Reordering in Network
Processors
4.1 Introduction
NPs employ multiple parallel processors (microengines) to exploit the packet-level
parallelism inherent in network workloads in order to support OC-48 line rates
(using the IXP 2400), as reported in the previous chapter. Each microengine
processes packets independently of the other microengines. Since packets can get
allocated to threads in different microengines, packet order at the output of the NP
cannot be guaranteed. Earlier works [5] [24] study the impact of packet reordering
on TCP throughput in routers; however, they do not consider the impact of the NP
architecture on reordering. This chapter studies the impact of network processor
architecture on packet reordering and packet retransmission. We extend the Petri
net model developed in the previous chapter to evaluate the impact of reordering in
network processors.
This chapter is organized as follows. In the following section we describe packet
reordering in the IXP architecture. Section 4.3 presents the performance results.
Section 4.4 describes different ways to reduce packet reordering. Section 4.5
summarizes this chapter.
4.2 Packet Reordering
When packets belonging to a single flow, having the same source and destination
IP address and port number, arrive at the destination in an order different from the
sequence order, we say that the packets are reordered. Packet reordering is a well
known phenomenon in the Internet [4] [5].
Figure 4.1 Packet Reordering in Network Processors.
[Diagram: packets P1-P16 arrive in order at the RFIFO and are distributed across threads T1-T4 of microengines ME1-ME4 (e.g. P1, P5, P9, P13 to the threads of ME1), before leaving through the TFIFO.]
Studies on backbone traffic measurement [8] suggest that TCP accounts for
80% of the Internet traffic. When packets get reordered, the TCP receiver begins to
generate duplicate ACKs. On receiving duplicate ACKs, the TCP sender concludes
that packet drops have occurred due to congestion. The Congestion Avoidance
algorithm [30] now kicks in and reduces the congestion window to roughly half
its current value. We explain the effect of reordering with the following scenario.
Assume that packets P0, P1, P2, P3, P4, P5 are packets of the same flow being sent
by A (sender) to B (receiver). The sender transmits packets P0, P1, P2, P3, P4, and
P5 strictly in that order. But, due to network delays/router processing, B receives
packets in the order, P0, P3, P4, P5, P2, P1. When B receives P3, instead of P1, it
sends an ACK for the last in-order packet received, i.e., in this case, the ACK for P0.
B continues to send ACKs for P0 when it receives P4, P5 and P2. When the sender
(A) sees 3 duplicate ACKs for packet P0, it concludes that the network is congested,
and according to the Congestion Avoidance and Fast Retransmit algorithms [30], it
halves the transmit window size. As a result, the TCP sender transmits fewer packets
than what the network can actually accommodate. Thus the effect of reordering
is not only the retransmission of packets that are already transmitted but also an
unnecessary reduction of the sender’s congestion window leading to under-utilization
of the network resources. The following subsection explains the architecture impact
of network processor on packet reordering.
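The scenario above can be replayed in code with a toy cumulative-ACK receiver (a deliberate simplification of real TCP: every out-of-order arrival generates one duplicate ACK for the last in-order packet):

```python
# Toy cumulative-ACK receiver: deliver in-order packets, buffer the rest,
# and emit a duplicate ACK for every out-of-order arrival. Three or more
# duplicate ACKs would trigger a fast retransmit at the sender [30].

def receive(arrival_order):
    """Return the number of duplicate ACKs generated for this ordering."""
    expected, buffered, dup_acks = 0, set(), 0
    for pkt in arrival_order:
        if pkt == expected:
            expected += 1
            while expected in buffered:   # deliver buffered packets
                buffered.discard(expected)
                expected += 1
        else:
            buffered.add(pkt)
            dup_acks += 1                 # duplicate ACK for the gap
    return dup_acks

dup_acks = receive([0, 3, 4, 5, 2, 1])   # B's arrival order from the text
print(dup_acks, dup_acks >= 3)           # 4 duplicate ACKs -> fast retransmit
```

The ordering from the text produces 4 duplicate ACKs for P0, enough to trigger an unnecessary retransmission of P1.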
4.2.1 Reordering in Network Processors
A network processor, being a multithreaded multiprocessor, can process packets of
the same flow in different microengines and different threads. This may result in
packets getting forwarded in an order different from the transmitted order.
Consider the scenario shown in Figure 4.1. Assume packets P1, P2, P3, P4 of
the same flow arrive at the receive buffer (RFIFO) of the network processor in order.
Let the packets be allocated to threads in different microengines in the following way
: P1, P2, P3, P4 are allocated to ME1-T1 (Microengine1-Thread1), ME2-T1, ME3-
T1 and ME4-T1 respectively. Now packet P1, being processed by ME1-T1, can get
delayed with respect to P2, P3, P4. This can happen due to various reasons, e.g.,
processing of other threads in ME1, or pending memory requests in DRAM FIFO.
So the thread ME1-T1 completes the processing of P1 only after ME2-T1, ME3-T1,
and ME4-T1 have processed their respective packets. So packet P1 is delayed with
respect to P2, P3, P4 and is transmitted only after P2, P3, P4 have been forwarded.
This may result in a retransmission of P1 when multiple duplicate ACKs for P0
are received by the sender. This example explains how the concurrent processing
of packets can affect packet ordering. Note that multiple microengines are a
characteristic feature of network processors; a network processor such as the IXP
2400 [15] has 64 threads and hence can process up to 64 packets concurrently. This
potentially increases the chances of packet reordering.
4.2.2 Transmit Buffer Induced Reordering
In this subsection we explain the impact of the transmit buffer on packet reordering.
The transmit buffer is a shared resource in the IXP architecture, so all threads
compete for a common transmit buffer space.
Figure 4.2 Transmit Buffer Reordering.
[Diagram: threads ME1-T1 through ME8-T8 are assigned contiguous transmit buffer slots TFIFO1, TFIFO2, ...; packets P1 and P2 sit in their threads' slots, and the buffer is drained in order from HEAD to TAIL.]
Hence, to ensure proper access to the transmit buffer, all threads would have to
execute a mutual exclusion operation. This, as reported in Section 3.6, results in a
significant drop in throughput (a 61% drop in the transmit rate). So transmit buffer
locations are instead allocated a priori to different threads. However, the transmit
buffer dequeues packets in a strict FIFO order. This aggravates packet reordering,
as illustrated in the following example.
We consider a contiguous buffer allocation where different threads in different
microengines are allocated contiguous space in the transmit buffer. More specifically,
we will assume that ME1-T1 (Microengine1-Thread1) is allocated the first 64 bytes,
ME1-T2 the next 64 bytes, and so on (refer to Figure 4.2). Assume
packets P1, P2, P3, P4 from flow F1 arrive strictly in that order in the receive
buffer. Further, assume that P1, P2, P3, P4 are allocated to ME1-T1, ME2-T1,
ME3-T1, and ME4-T1 respectively. After processing by different microengines, the
packets P1, P2, P3, and P4 are stored in TFIFO1, TFIFO9, TFIFO17 and TFIFO25
respectively. However, as mentioned earlier, packets are dequeued in a strict order
of the transmit buffer location. Thus, before P2 is dequeued from TFIFO9 location,
other packets from TFIFO2 to TFIFO8 will be dequeued. If packets from the same
flow as P2 are allocated to threads in microengine 1, they will get forwarded before
P2, causing the packet reordering problem.
Note that in this example, even if packets P1, P2, P3, and P4 complete processing
in order, the dequeuing of packets from the transmit buffer still causes reordering;
that is, the transmit buffer can independently induce packet reordering. We explore
different transmit buffer schemes and study their effect on reordering.
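The contiguous allocation in the example maps each thread to a fixed slot. A minimal sketch of that mapping (assuming 8 threads per microengine, as in the IXP 2400, with slots numbered from 1):

```python
# Contiguous TFIFO allocation: ME1-T1 -> slot 1, ME1-T2 -> slot 2, ...,
# ME2-T1 -> slot 9, and so on. The transmit block drains slots strictly
# head-to-tail, ignoring the order in which packets completed.

def tfifo_slot(me, thread, threads_per_me=8):
    """Slot assigned to thread `thread` of microengine `me` (1-based)."""
    return (me - 1) * threads_per_me + thread

# P1..P4 handled by thread 1 of MEs 1..4, as in the example above.
slots = {f"P{i}": tfifo_slot(me=i, thread=1) for i in range(1, 5)}
print(slots)  # P1 -> 1, P2 -> 9, P3 -> 17, P4 -> 25
```

Because slots 2-8 sit between P1's slot and P2's slot, any packets parked there by ME1's other threads are forwarded before P2, exactly the reordering described above.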
4.2.3 Packet Ordering Mechanisms in IXP
The IXP processor supports the following two mechanisms to maintain packet order
in the network processor.
• Inter Thread Signaling (ITS): In this mechanism the start and finish tasks
of IPv4 forwarding are executed sequentially, while the packet processing
functions are performed in parallel and independently across all microengines
(refer to Figure 4.3). The task of writing packets from DRAM to the TFIFO
also takes place sequentially: each thread waits for a signal from the previous
thread before it can transfer its packet, and once the packet is transferred to
the TFIFO the next thread is signaled. Thus the sequential processing at the
beginning and at the end of IPv4 ensures that packets are allocated in the
transmit buffer and transmitted out in order.
In this scheme each thread is allocated a packet in sequential order. Assume
Figure 4.3 Inter Thread Signaling in the IXP.
[Diagram: over time t, packet allocation (PKTALLOC) by ME1-T1, ME1-T2, ... executes serially, packet processing (PKT PROC) runs in parallel across the threads, and packet transmission (PKT-Tx) is again serialized.]
that packets P1 and P2 arrive in the system in that order. An implicit logical
ordering of the threads across all microengines is assumed, specifically the order
ME1-T1, ME1-T2, ..., ME1-T8, ME2-T1, ..., ME8-T8. Further, ME1-T1 is
assigned P1 and ME1-T2 is assigned P2. This assignment occurs in sequential
order across all the threads in the processor. This ordering of threads is enforced
using Inter Thread Signaling (ITS): each thread waits for a signal to start the
sequential task, performs the allocation or transmission of a packet, and signals the
neighboring thread. For example, ME1-T1 signals ME1-T2 and ME1-T8 signals
ME2-T1, as depicted in Figure 4.3.
• Asynchronous Insert Synchronous Remove (AISR): In this scheme, packet
forwarding is divided into four stages, namely a packet buffering stage, a
packet processing stage, a reordering stage, and a transmit stage. In the
initial packet buffering stage, every packet is assigned a sequence number and
buffered in memory (DRAM).
Figure 4.4 Asynchronous Insert Synchronous Remove (AISR) in the IXP. [Diagram: over time t, a packet RX stage (one microengine) allocates packets and sequence numbers, a packet processing stage (four microengines) processes them in parallel, a reordering stage synchronously inserts them in sequence order, and a transmit stage (three microengines) moves them out.]
A sequence number is maintained for all the packets arriving in the system.
The packet sequencing is done by a single microengine, using all eight threads in
that microengine (refer to Figure 4.4). The sequence number of a newly arriving
packet is one greater than that of the previous packet. After the packets are
assigned sequence numbers, the packet processing stage processes packets
independently and passes the packet handles to the next stage, the reordering
stage. The packet processing stage is executed in parallel by 4 microengines
(32 threads). The reordering stage performs a counting sort over the packet
handles to restore packet ordering; here the packets are also assigned transmit
buffer addresses. A single microengine performs the counting sort. The transmit
buffer address is passed on to the last stage, the transmit stage. The transmit
block moves the packets out of the DRAM to the network interfaces; 3 microengines
(24 threads) are involved in this final stage.
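The ITS handshake in the first mechanism can be sketched as a chain of signals. This is a toy sketch: real microengines use hardware signals, not OS threads, and the "serial task" below stands in for packet allocation or transmission.

```python
# Sketch of Inter Thread Signaling: each worker performs its serial task
# only after the previous worker signals it, so the serial tasks execute
# in strict thread order even though the workers run concurrently.
import threading

def its_chain(n_threads):
    order = []
    signals = [threading.Event() for _ in range(n_threads + 1)]
    signals[0].set()              # the first thread may start immediately

    def worker(i):
        signals[i].wait()         # wait for the previous thread's signal
        order.append(i)           # serial task: allocate/transmit a packet
        signals[i + 1].set()      # signal the neighboring thread

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

print(its_chain(8))  # always [0, 1, ..., 7], regardless of scheduling
```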
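The AISR reordering stage amounts to a counting sort keyed on the sequence numbers assigned at the buffering stage; a minimal sketch:

```python
# Sketch of the AISR reordering stage: packets finish processing in an
# arbitrary order, and a counting sort over their sequence numbers
# restores arrival order before transmit-buffer addresses are handed out.

def reorder(done, total):
    """done: (sequence_number, packet) pairs in completion order."""
    slots = [None] * total        # one slot per sequence number
    for seq, pkt in done:
        slots[seq] = pkt          # counting sort: direct placement
    return slots                  # in-order hand-off to the transmit stage

completed = [(2, "P2"), (0, "P0"), (3, "P3"), (1, "P1")]
print(reorder(completed, total=4))  # ['P0', 'P1', 'P2', 'P3']
```

Because each sequence number maps to exactly one slot, the sort is O(n) and needs no comparisons, which suits a single dedicated microengine.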
4.2.4 Performance Metric
The extent of packet reordering is measured using two metrics, namely the
reordering rate and the retransmission rate. Reordering is measured as the number
of duplicate ACKs sent by the destination back to the source. Retransmission
corresponds to the number of retransmitted packets, where 3 or more duplicate
ACKs cause a retransmission. Both reordering and retransmission are reported as a
percentage of the total number of packets transmitted (refer to Equations 4.1
and 4.2).
Reordering Rate = (Number of Duplicate ACKs) / (Total Number of Packets Sent)    (4.1)

Retransmission Rate = (Number of Retransmitted Packets) / (Total Number of Packets Sent)    (4.2)
We use the packet forwarding throughput (Gbps) as a measure of network processor
performance. In the following section we study the extent of packet reordering
induced by the architectural features of the IXP processor discussed in Sections
4.2.1 and 4.2.2.
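Equations (4.1) and (4.2), together with the 3-duplicate-ACK retransmission rule, can be computed for a given arrival order. This uses the same toy cumulative-ACK model as before, not the simulator itself:

```python
# Equations (4.1)/(4.2) in code: duplicate ACKs per packet sent, and
# retransmissions (triggered by the third duplicate ACK for the same
# gap) per packet sent, for a toy cumulative-ACK receiver.

def reorder_metrics(arrival_order):
    """Return (reordering_rate, retransmission_rate)."""
    expected, buffered = 0, set()
    dup_acks = retransmissions = dups_for_gap = 0
    for pkt in arrival_order:
        if pkt == expected:
            expected += 1
            while expected in buffered:   # deliver buffered packets
                buffered.discard(expected)
                expected += 1
            dups_for_gap = 0              # the gap has been filled
        else:
            buffered.add(pkt)
            dup_acks += 1
            dups_for_gap += 1
            if dups_for_gap == 3:         # third dup ACK: fast retransmit
                retransmissions += 1
    n = len(arrival_order)
    return dup_acks / n, retransmissions / n

print(reorder_metrics([0, 3, 4, 5, 2, 1]))  # (4/6, 1/6)
```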
4.3 Packet Reordering in IXP
4.3.1 Petri Net Model
We extend the Petri net model introduced in Section 3.2 for IPv4 running on IXP
2400. In order to take into account the flow information, used in determining packet
sequence, each token is given two distinct attributes, a flow number and a sequence
number. The CNET Petri net simulator is modified to retain the flow information
associated with tokens. This is used in determining the packet order at the transmit
stage, and hence the reorder and retransmission rates.
4.3.1.1 Petri Net Model for Multiple Hops
In order to study the impact of retransmission on the TCP throughput, the entire
end to end packet flow across multiple routers needs to be modeled. Packet re-
ordering induced by each router can cumulatively add up, leading to a significant
degradation in the TCP throughput. Packets in the Internet traverse, on an aver-
age, 16 hops to reach the destination [28]. We have simulated a network topology
(depicted in Figure 4.5) with multiple routers. We assume that each router in the
Figure 4.5 Simulated Network Topology.
[Diagram: SOURCE 1 reaches DEST 1 through ROUTER 1 and ROUTER 2, each an IXP 2400; other sources and routers inject and remove cross traffic at every hop.]
above topology uses IXP2400 to forward packets. The multi hop environment is
incorporated by extending the IXP 2400 Petri net model, where the output of one
router (one Petri net model) is given as input to the next router.
We measure packet reordering for one flow, between SOURCE 1 and DEST 1.
Packets from other flows are used to simulate the network workload in the router.
To reduce the complexity of the simulator and the simulation time, we use the
traffic going out of one router itself as the traffic from other sources/routers. This is
reasonable since, in the steady state, the amount and characteristics of traffic leaving
a router are similar to those of the traffic entering the next router. Hence, in our
simulation, we model only multiple flows from a single source to a destination
through multiple routers, but we measure reorder/retransmit rates for 1 out of n
flows (we use n = 10), leaving the other (n-1) flows to model the network traffic
entering/exiting the routers along the multi-hop path. This models a realistic
network scenario.
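The compounding of reordering across hops can be illustrated with a toy chain of routers. This is a stand-in, not the Petri net model: each hop occasionally delays a packet by one position, and the swap probability of 0.2 is an arbitrary illustrative value.

```python
# Toy multi-hop sketch: the (possibly reordered) output of one hop is
# fed as the input of the next, so per-hop reordering compounds.
import random

def router_hop(packets, rng, p=0.2):
    """One toy hop: each packet may slip one position behind its
    successor, e.g. because its thread finished late."""
    out = list(packets)
    for i in range(len(out) - 1):
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

def displaced(packets):
    """Number of packets no longer at their in-order position."""
    return sum(1 for pos, pkt in enumerate(packets) if pkt != pos)

rng = random.Random(1)
flow = list(range(100))       # one in-order flow of 100 packets
per_hop = []
for _ in range(10):           # pass the flow through 10 routers
    flow = router_hop(flow, rng)
    per_hop.append(displaced(flow))
print(per_hop)                # displacement tends to grow with hop count
```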
4.3.2 Validation
The Petri net model is simulated using the CNET Petri net simulator [40]. We
simulated up to 100,000 packets in each simulation. As before, we validate the
Petri net results against the implementation in MicroengineC [17] executed on
SDK 3.51 [16]. The validation of the model is performed for different processor
parameters; in this section it is restricted to the reordering and retransmission
rates for a single hop. The validation is performed for different flow sizes: 640 B,
6.4 KB, and 64 KB. We assume a contiguous buffer allocation and a 64 B packet
size, and each flow contains a fixed number of packets. In our discussion we use
6.4 KB as the default flow size, which is also the average flow size reported in the
Internet [28].
We assume a network traffic of 3 Gbps, which is higher than the maximum line
rate currently supported by the IXP 2400 (2.5 Gbps for OC-48). We do not model
the flow of ACK packets from destination to source, nor do we assume a rate
reduction at the source on retransmission; this is done in order to simulate
worst-case scenarios, as in DoS attacks. Table 4.1 shows the comparison of the
reordering
         Flow Size=640B          Flow Size=6.4KB         Flow Size=64KB
         Reorder   Retrans       Reorder   Retrans       Reorder   Retrans
CNET     31.7%     5.8%          35.85%    8.35%         36.36%    9.05%
SDK      32.4%     4.7%          33.4%     7.1%          33%       8.2%

Table 4.1: Petri Net Model Validation
and retransmission rates obtained from the Petri net (CNET) and SDK simulations
for a single hop. We observe that the reordering and retransmission rates obtained
from the Petri net model closely match the SDK simulations for different flow sizes.
This essentially validates the Petri net model.
4.3.3 Performance Results
Next we study packet reordering over multiple hops using our Petri net model and
the CNET simulation. In this study we assume packet sizes of 64 B and 512 B and
a flow size of 6.4 KB. The 64 B packets represent a worst-case scenario, and the
512 B packets correspond to the average packet size in the Internet. Figure 4.6 shows
Figure 4.6 Packet Reordering in NP.
(a) Reordering (b) Retransmission
reordering and retransmission rates for various packet sizes and for different hops.
We observe that the reordering and retransmission rates increase with the number
of hops for all of the packet sizes. Further, for a 64 B packet size, the percentage of
retransmitted packets is as high as 61% for 10 hops.
However, for a packet size of 512 B, the average packet size in Internet [28],
the reordering and retransmission rates are much lower (46% and 14% respectively).
This occurs as only 8K/512=16 packets can be buffered at the receive buffer at any
given time in case of a 512 B packet size; whereas with a packet size of 64 B as
many as 128 packets can be buffered. So only 25% of the total number of threads,
i.e., 16 out of 64 threads are busy in the IXP processor. This reduces the extent of
concurrent processing and correspondingly the packet retransmission in the network
processor.
Although the retransmission rate is much lower for 512 B packets compared to
that of 64 B, a 14% retransmission rate is still very high [24]. Earlier studies [24]
indicate that a retransmission rate greater than 10% can result in a significant
reduction (up to 60%) in packet throughput. For a 16-hop network, the average
number of hops for packets in the Internet, the retransmission rate can be even worse.
In the following section we explore different architectural ways to reduce packet
reordering.
4.4 Reducing Packet Reordering
In order to reduce packet reordering and its impact, we explore a few transmit buffer
allocation schemes, as well as architectural parameter tuning in this section.
4.4.1 Buffer Allocation Schemes
The transmit buffer allocation, as observed in Section 3.2, can independently induce
packet reordering. Hence, we explore the following buffer allocation schemes to
reduce reordering.
Figure 4.7: Different Transmit Buffer Allocation Schemes. (a) Global (b) Local (c) Strided
• Global Buffer Allocation: In this scheme (depicted in Figure 4.7(a)) the com-
peting threads are allocated transmit buffer space as and when a thread is
ready to move the packet to the TFIFO. Since the transmit buffer is shared
across all the microengines the allocation has to be done using global syn-
chronization, a mutual exclusion operation. The mutual exclusion operation
is performed across all threads in all microengines. The mutex variable is
stored in the scratch pad as it is common to all the MEs. Since synchroniza-
tion is performed across all the microengines this can result in a drop in the
throughput.
• Local Buffer Allocation: In this scheme, shown in Figure 4.7(b), contiguous
sets of locations are allocated to different microengines. But threads within a
microengine compete for a common chunk allocated to that microengine and
access it through a mutual exclusion operation. The transmit buffer is allo-
cated by performing a mutual exclusion operation locally within a microengine.
There is one mutex variable for each microengine. Since only threads within
a microengine share a single mutex variable, the overheads are relatively low
compared to the global buffer allocation scheme.
• Strided Buffer Allocation: This scheme (refer to Figure 4.7(c)), allocates
buffers to microengines and threads a priori. However, unlike the contiguous
case, the buffer is allocated in a strided way. The stride is dependent on the
number of active microengines. So an NP running on 8 microengines will have
a stride of 8: the threads ME1-T1, ME2-T1, ..., ME8-T1, ME1-T2 place packets
in TFIFO1, TFIFO2, ..., TFIFO8, TFIFO9 respectively.
A disadvantage of contiguous and strided allocation, as compared to local or
global buffer allocation, is that they assume a fixed buffer size. In our study
we assume a packet size of 64 B (as in DoS attack, worst case scenario) [29],
or 512 B (average packet size) [28]. In a general situation, as the packet size
may vary from the minimum to the maximum, a buffer of size equal to the
maximum packet size (1.5 KB) needs to be allocated. This may result in
under-utilization of the transmit buffer when the packet sizes vary widely. On
the positive side, the contiguous and strided buffer allocation schemes enjoy
the benefit of not requiring any synchronization, which leads to better packet
throughput.
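The mapping from a thread (or a global packet counter) to a transmit-buffer slot under these schemes can be sketched as follows. This is an illustrative Python sketch, not the thesis's MicroengineC implementation; the 8x8 thread configuration, the 128-entry buffer, and all function names are assumptions made for illustration.

```python
# Hypothetical sketch of the three allocation policies discussed above.
NUM_ME, THREADS_PER_ME, NUM_TBUF = 8, 8, 128   # assumed configuration

def contiguous_slot(me, thread, pkt_seq):
    """Each thread owns a fixed contiguous region; pkt_seq cycles within it."""
    slots_per_thread = NUM_TBUF // (NUM_ME * THREADS_PER_ME)
    base = (me * THREADS_PER_ME + thread) * slots_per_thread
    return base + pkt_seq % slots_per_thread

def strided_slot(me, thread, pkt_seq):
    """Slots interleaved with a stride equal to the number of MEs, so
    ME1-T1, ME2-T1, ..., ME8-T1, ME1-T2 fill consecutive TFIFO entries."""
    return (thread * NUM_ME + me + pkt_seq * NUM_ME * THREADS_PER_ME) % NUM_TBUF

# Global allocation instead hands out the next free slot under one lock
# (on the IXP this would be a mutex variable in the scratch pad):
next_free = 0
def global_slot():
    global next_free
    slot, next_free = next_free, (next_free + 1) % NUM_TBUF
    return slot
```

In the contiguous and strided schemes the slot index is a pure function of the (ME, thread) pair, so no mutex is needed; the global scheme serializes every allocation on a single lock, which is the synchronization overhead discussed above.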
Figure 4.8: Impact of Various Buffer Allocation Schemes (64 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

Figure 4.9: Impact of Various Buffer Allocation Schemes (512 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission
4.4.1.1 Performance Evaluation of Buffer Allocation Schemes
We have implemented different buffer allocation schemes on the SDK 3.5 simulator.
Table 4.2 reports the throughput achieved for different buffer allocation schemes
Scheme        Throughput (Gbps)
              64 B      512 B
Contiguous    2.96      3.068
Strided       2.96      3.068
Local         2.1       2.3
Global        1.1       1.4
Table 4.2: Impact of Buffer Allocation schemes on Throughput.
in a single hop network for 64 B and 512 B packet size. We observe that the local
and global allocation schemes suffer significant reduction in throughput. As ex-
plained in the previous section the performance degradation is due to the MUTEX
and the synchronization overhead. On the other hand, the strided buffer allocation
performs as well as contiguous allocation maintaining high packet throughput. The
throughput remains the same in each scheme as the packets go through multiple
hops. Hence we do not report throughput results for multiple hops. The impact of
various schemes on reordering and retransmission is shown in Figures 4.8 and 4.9
for 1, 5, and 10 hops for 64 B and 512 B packets. First let us look at the perfor-
mance results of 64 B packets. While strided and contiguous allocation result in
significant retransmission rates (greater than 55%) for 10 hops, the local and global
schemes reduce the retransmission rates to 45% and 33% respectively. However,
the throughput achieved by the local and global schemes, 2.1 Gbps and 1.1 Gbps respectively, is
unacceptably low. Although the retransmission rate is alarmingly high for a 10 hop
network for 64 B packets, it may not be a cause of major concern as a major part
of 64 B traffic usually corresponds to a DoS attack. So, the sender may not react to
duplicate ACKs.
In a more realistic situation, when the packet size is 512 B, the retransmission
rates are 15%, 12%, 3%, and 2% for contiguous, strided, local, and global buffer
allocation respectively. While the local and global allocation schemes achieve very low retransmission
rate, their throughput is also very low. From this discussion we observe that there
exists a trade-off between the throughput and the retransmission rate achieved by
each scheme. The retransmission rate of the strided allocation scheme (12%) is
still high enough to cause significant degradation in TCP performance [24].
The global scheme completely eliminates packet reordering due to transmit buffer
allocation. The reordering/retransmission experienced in this scheme is entirely due
to the concurrent processing of packets by the MEs and threads.
We study the impact of the architecture parameters, such as number of mi-
croengines and number of threads on retransmission in the following subsection.
Further, since the strided buffer allocation reduces the retransmission rate relative
to the contiguous allocation without impacting the throughput, we use the
strided buffer allocation in the following sections.
4.4.2 Tuning Architecture Parameters
Performance studies in the earlier section indicate that the packet throughput sat-
urates beyond a total of 16 threads (refer to Section 3.3.2). The throughput results
for different numbers of threads for packet forwarding are reported in Table 4.3 for
easy reference.
Figure 4.10: Impact of Number of Microengines (64 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission
The performance saturation occurs because the memory (DRAM) saturates beyond 16
threads. The additional threads (beyond 16) then do not contribute to performance
improvement, while at the same time they can adversely impact packet reordering
and retransmission. Therefore we study the effects of the number of microengines
and the number of threads using our Petri net model in the following subsection.

Figure 4.11: Impact of Number of Microengines (512 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission
4.4.2.1 Impact of the Number of Microengines
A network processor with fewer microengines and/or fewer threads, while
giving the same throughput, can reduce the reordering due to concurrent processing.
Figures 4.10 and 4.11 show the impact of number of microengines (each microengine
running 8 threads) on packet reordering/retransmission. We observe that the packet
retransmission drastically reduces from 56% (64 B) and 12% (512 B), for 8 ME x
8 threads, to 19% (64 B) and 5% (512 B), for 2 ME x 8 threads. This reduction in
retransmission rates is achieved without any penalty on packet throughput.
Number of Threads       Transmit Rate (Gbps)
64 (8x8)                2.96
32 (4x8, 8x4)           2.96
16 (2x8, 4x4, 8x2)      2.96

Table 4.3: Transmit Rates for Different Numbers of Threads.
Thus, a network processor using 2 or 3 microengines can reduce retransmission
by up to 27% for 64 B packets and 5% for 512 B packets, while providing the transmit
rate of an 8-microengine configuration. Further, reducing the number of microengines reduces the
demand for VLSI area in the NP which could otherwise be used for accelerators or
functional units like the hash unit or crypto units.
4.4.2.2 Impact of the Number of Threads
Figures 4.12 and 4.13 compare the impact of number of active threads on reordering
and retransmission.
Figure 4.12: Impact of Number of Threads (64 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission
In the above figure, a 4x8 configuration refers to 4 microengines with each mi-
croengine running 8 threads. It is interesting to note that configurations running
the same total number of threads give different retransmission rates. For example,
a 4x8 configuration reduces the retransmission for 1, 5, and 10 hops by up to 21%
as compared to an 8x4 configuration. Both configurations give a throughput of 2.96
Gbps (refer to Table 4.3). This indicates that the impact of multiple microengines
on packet ordering is more severe compared to multiple threads for the proposed
strided allocation. A similar trend, as explained earlier, is observed for 512 B packet
size although the reduction in retransmission/reorder rates are lower. This is due
to the limited buffering possible with the 512 B packet size.
4.4.3 Packet Sort: An Alternative Scheme
Our study on buffer allocation schemes indicate that while global and local buffer
allocation schemes can reduce the retransmission rates, they also incur significant
performance penalty due to synchronization.

Figure 4.13: Impact of Number of Threads (512 B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

Hence, in this subsection we explore an algorithmic approach that eliminates
reordering while ensuring that the performance penalty in throughput is minimized.
We propose a packet forwarding scheme, Packet sort,
where the packet processing is pipelined (refer to Figure 4.14).
Figure 4.14: Packet Sort Implementation in the IXP (a packet processing stage, an ordering stage performing an insertion sort, and a transmit stage, pipelined over time across the MEs).
In this scheme the packet processing is partitioned into three stages. In the first
stage, the packet processing stage, 4 microengines concurrently move the packets
from RFIFO to DRAM and subsequently process them (based on the packet for-
warding application). Packets are placed in DRAM, by the threads from the first
stage, based on the flow information. In the second stage, the packet ordering stage,
a single microengine sorts the packets based on the flow information and stores the result in the
scratch pad. The overhead involved in the sorting is minimal as the microengine
utilization is low in the second stage. The sorted packet addresses and the corre-
sponding transmit buffer addresses are stored in the scratch pad and communicated
to the remaining 3 microengines which execute the third stage of Packet sort, the
transmit stage. In this stage the packet is moved from the DRAM to the TFIFO
and further to the MAC by the 3 microengines. The scratch pad is used for the
communication between the pipe stages. We have implemented the Packet sort ap-
proach for packet forwarding in MicroengineC and measured its performance using
SDK. We have also developed the Petri net model of Packet sort and compared the
performance obtained from SDK and the Petri net simulation.
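The ordering stage can be sketched as follows. This is a hypothetical Python illustration of the insertion-sort step, not the thesis's MicroengineC implementation; the descriptor layout (flow id, sequence number, packet address) is an assumption made for illustration.

```python
# Hypothetical sketch of the ordering stage of Packet sort: the processing
# stage emits (flow_id, seq, addr) descriptors in completion order; the
# ordering stage insertion-sorts them per flow so the transmit stage can
# drain packets strictly in sequence.
def insertion_sort_descriptors(descs):
    """Insertion sort by (flow_id, seq); cheap because the list is almost
    sorted -- concurrent processing only perturbs the order locally."""
    for i in range(1, len(descs)):
        d, j = descs[i], i - 1
        while j >= 0 and descs[j][:2] > d[:2]:
            descs[j + 1] = descs[j]   # shift larger descriptors right
            j -= 1
        descs[j + 1] = d
    return descs

# Completion order from 2 concurrent flows, slightly out of order:
done = [(1, 0, 'a'), (2, 0, 'd'), (1, 2, 'c'), (1, 1, 'b'), (2, 1, 'e')]
ordered = insertion_sort_descriptors(done)
```

Insertion sort is a reasonable choice here because the descriptor list is nearly sorted, so the expected work per packet is small, which is consistent with the low ordering-stage overhead reported above.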
Packet sort completely eliminates packet reordering and gives a throughput of
2.5 Gbps. This approach is attractive as the network processor is able to support
current line rates (2.5 Gbps) for 10 or more hops.
Scheme        Concurrent Flows    Throughput (Gbps)
                                  SDK       CNET
Packet sort   32                  2.56      2.3
              10                  2.5       2.3
              1                   1.7       1.6
ITS           NA                  2.3       2.1
AISR          NA                  1.1       0.96

Table 4.4: Comparison of Various Schemes to Overcome Reordering.
Table 4.4 reports the performance of Packet sort for different numbers of concurrent
flows. In this experiment, we assume a constant line rate (of 3 Gbps) and a
fixed packet size of 64 B. The number of concurrent flows determines the flow size
as well as the number of packets per flow. The throughput of Packet sort decreases
from 2.5 Gbps to 1.7 Gbps as the number of flows decreases from 32 to 1 (refer to
Table 4.4). Note that even with 10 concurrent flows, which corresponds to the average
flow size in the Internet, the throughput achieved by Packet sort is 2.5 Gbps. The Petri
net simulation results also exhibit a similar trend. The reason for the decrease in
the throughput with fewer concurrent flows is as follows. With a fixed line rate, the
number of packets per flow increases as the number of concurrent flows decreases,
resulting in a larger overhead in the sorting operation, as the microengine utilization
of the ordering stage shows:

Number of Concurrent Flows    ME Util. (%)
1                             76.24
10                            60
32                            40
Next we compare the performance of Packet sort with those of in-built schemes
namely AISR and ITS. For this purpose we implement AISR with 1 microengine
performing the buffering operation, 4 microengines performing the packet processing
block, 1 microengine executing the reordering block, and 2 microengines running
the transmit block. The ITS runs totally parallel, with threads in all microengines
performing the complete IPv4 forwarding. Note that the ITS and AISR schemes are
not affected by the number of concurrent flows, as these schemes maintain a strict
packet order. It is interesting to note that AISR performs poorly as compared to the
other schemes. In particular, the ITS is able to support a line rate of 2.3 Gbps, which
is close to the OC-48 line rate, but the AISR supports only a line rate of 1.1 Gbps.
This occurs as there are only 8 threads buffering the packets to the DRAM, the
first stage in AISR. This, coupled with the saturation of the DRAM, results in a lower
throughput. However, an increase in the number of threads to 16 for the first stage
resulted in a reduction in throughput to 0.9 Gbps, since global synchronization needs
to be done across all 16 threads. Our implementation of AISR may not be the most
efficient; hence, we estimate an upper bound for the AISR performance
taking into account all DRAM transactions, including RFIFO-DRAM and DRAM-
TFIFO transfers.1 To obtain the upper bound we consider only the receive block of AISR
to be running in the SDK. We observe that the maximum possible throughput is
2.1 Gbps. Hence the AISR throughput is limited to a maximum of 2.1 Gbps. So
packet sort gives a throughput improvement of at least 16% with respect to the
upper bound of AISR.
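The improvement figure can be checked from the two throughputs quoted above (Packet sort at 2.5 Gbps versus the 2.1 Gbps upper bound for AISR):

```python
# Sanity check of the "at least 16%" claim, using the throughput numbers
# from the text (Packet sort: 2.5 Gbps; AISR upper bound: 2.1 Gbps).
packet_sort_gbps, aisr_bound_gbps = 2.5, 2.1
improvement = (packet_sort_gbps - aisr_bound_gbps) / packet_sort_gbps * 100
# improvement is roughly 16%, consistent with the claim above
```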
4.5 Summary
This chapter studies the impact of parallel processing in network processor on packet
reordering and retransmission. We observe that in addition to the reordering due to
parallel processing the transmit buffer allocation adversely impacts reordering. We
summarize our contributions as follows:
• Our results reveal that the transmit buffer allocation significantly impacts
reordering and results in a packet retransmission rate of up to 61%. We explore
different transmit buffer allocation schemes, namely contiguous, strided, local,
and global. The strided buffer allocation reduces the packet
retransmission by up to 24% while retaining the packet throughput of 3 Gbps.
Global and local buffer allocation schemes reduce retransmission rates further
but at the expense of performance.
• We study the impact of architecture parameters, namely, number of micro-
engines and number of threads on packet reordering. A network processor with
fewer microengines (2 or 3) or fewer threads (4 threads per microengine) can
significantly reduce the retransmission rate while achieving the same throughput.
1 Earlier studies assume that the packets are already available in DRAM and do not account for RFIFO-DRAM or DRAM-TFIFO transfers. Our performance evaluation studies in Section 3.3 indicate that DRAM saturates the performance.
• We propose an alternative scheme, Packet sort, which dedicates a certain number
of threads to sorting the packets in order to eliminate retransmission. This
scheme provides a line rate of 2.5 Gbps, which matches the current line rate. We
observe that Packet sort outperforms, by up to 35%, the in-built schemes in
the IXP processor, namely Inter Thread Signaling (ITS) and Asynchronous
Insert and Synchronous Remove (AISR).
Chapter 5
Performance Analysis of Network
Processor in Bursty Traffic
The previous two chapters evaluated the performance of the network processor with
a Poisson packet arrival, where the packet size is constant and equal to 64B. In
this traffic the minimum packet size of 64 B is assumed to simulate the worst case
scenario encountered by a router. Earlier works on network processor performance
evaluation [29] [9] also consider only a similar scenario. However, earlier works on
traffic characterization [8] indicate that, on an average, only 10% of the traffic is due
to DoS attacks. Hence, the performance of the network processor in a more realistic
traffic needs to be evaluated. Earlier work on Internet traffic characterization observes
that the traffic is self-similar and bursty in nature and develops a mathematical model
for the traffic.
A bursty traffic possesses the property of self-similarity [25]. Self-similarity is
the property in which a stochastic process (in this case the packet arrival) has the
same statistical properties at every time scale. So, a self-similar traffic will be bursty
at all time scales, without any constant burst length. In effect, it is impossible
to predict (using mathematical models) whether a burst will occur and the burst
length. We develop a Petri net model of a realistic traffic based on the theoretical
model proposed for bursty traffic [23]. The performance of the network processor is
evaluated using this model. Further, this chapter evaluates the necessity of a store-
forward architecture for a network processor and explores various packet buffering
architectures.
The rest of the chapter is organized as follows. The following section provides
the motivation for this study. Section 5.2 describes the traffic model and Section
5.3 develops a Petri net model of the traffic generator used in the study. In Sec-
tion 5.4 we present the different packet buffering schemes. Section 5.5 presents the
detailed performance analysis and evaluation of the network processor. We provide
concluding remarks in Section 5.6.
5.1 Motivation
In this section we explain the need for a store-forward architecture in a network
processor [7]. The store-forward architecture is used in a network processor due to
the following reasons :
• Limited Buffering Space in RFIFO. The Receive FIFO size is only 8 KB. So
if packets of size 1536 B are streaming into the network processor, then the
RFIFO will have space to buffer only 5 packets. Even with 512 B packets
the RFIFO can only buffer 15 packets. An application such as IP forwarding
requires on an average 2666 nanoseconds for the processing of a single packet
[29] of size 512 B. However, the inter-arrival time between the packets is 1638
nanoseconds. Thus packets arrive in the system at a higher rate as compared
to the processing rate. The RFIFO can at best buffer 15 packets (of 512 B).
Hence there is a need to buffer packets in larger memories such as DRAMs.
This makes the network processor a store-forward architecture.
• Bursty Traffic. Consider the traffic arrival in a router as shown in Figure 5.1.
Packets arrive at the receive FIFO of R1 which uses a network processor to
forward the packets. The traffic as seen in the RFIFO of R1 may contain
peaks and troughs, as in a bursty arrival of packets.

Figure 5.1: Packet Arrival in NP (input traffic, in number of bytes, over time; t1 and t2 mark two time instants).

The network processor in R1 buffers the incoming packets in DRAM, processes these packets and
forwards the packets to the next hop. At time instant t1 the packet arrival
rate may be higher than the maximum supported line rate of the network
processor. However at time instant t2 the arrival rate drops. So if the network
processor buffers the packets at t1 temporarily then it can process the packets
with minimal packet drop. Otherwise a significant packet drop will occur at
time t1. So a store-forward architecture is used to minimize packet drops that
occur due to sudden bursts in the network traffic.
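The arithmetic behind the first argument can be checked with a short sketch. The figures (8 KB RFIFO, 2.5 Gbps line rate) come from the text; the function names are our own, and the capacity count below comes out at 16 packets of 512 B (the text conservatively quotes 15).

```python
# Back-of-the-envelope check of the buffering argument above.
RFIFO_BYTES = 8 * 1024        # 8 KB receive FIFO (from the text)
LINE_RATE_BPS = 2.5e9         # OC-48 line rate (from the text)

def interarrival_ns(pkt_bytes):
    """Time between back-to-back packets at the full line rate."""
    return pkt_bytes * 8 / LINE_RATE_BPS * 1e9

def rfifo_capacity(pkt_bytes):
    """Whole packets that fit in the receive FIFO."""
    return RFIFO_BYTES // pkt_bytes

# 512 B packets arrive roughly every 1638 ns but need about 2666 ns of
# processing each, so the RFIFO (at most 16 packets deep) fills quickly;
# larger DRAM buffering -- i.e. a store-forward design -- is needed.
```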
In this chapter we study various packet buffering schemes for a store-forward ar-
chitecture. The following section describes the traffic generator model used in the
simulation.
5.2 Generation of Bursty Traffic
Earlier works on traffic measurement indicate that the Internet traffic is bursty and
self-similar in nature [8]. We use a traffic model similar to [25] to simulate a bursty
traffic. The rest of the section describes the model of the traffic generator.
Figure 5.2 shows the traffic generation model [23] used in this study.

Figure 5.2: Bursty Traffic Generation. ON/OFF sub-streams of fixed packet sizes (64 B, 80 B, 96 B, ..., 1536 B) feed an aggregator that produces the synthetic self-similar traffic.

In this
model the traffic is assumed to be an aggregate of different sub-streams. Each sub-
stream is assumed to be of constant packet size with finite ON/OFF periods. In
the ON time each sub-stream generates packets of constant size and restarts the
packet generation after the OFF period. The packet inter-arrival time in the ON period within
a sub-stream is equal to (packet size)/(line rate). The ON and OFF times of each
sub-stream are Pareto distributed with a probability distribution function f(x) given
by
f(x) = \frac{\alpha \beta^{\alpha}}{x^{\alpha + 1}}    (5.1)
where α represents the shape parameter and β represents the scale parameter [32].
The shape parameter α takes values between 1 and 2, and the scale parameter
β is the minimum ON/OFF time. The resultant traffic generated is used as the input to the
network processor.
It has been shown that the traffic generated using the above methodology is
bursty and similar to the traffic commonly encountered by routers [25]. The above
traffic generator is modeled using Petri nets. The following subsection describes the
Petri net model of the traffic generator.
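The ON/OFF model above can also be sketched directly in Python. This is an illustrative sketch, not the thesis's CNET Petri net; the parameter values, seed, and function names are assumptions made for illustration.

```python
# Illustrative sketch of the Pareto ON/OFF traffic model: each sub-stream
# alternates Pareto-distributed ON and OFF periods and emits fixed-size
# packets back to back during the ON period.
import random

def pareto_sample(alpha, beta, rng):
    """Inverse-CDF sample of a Pareto(alpha, beta) variate: x >= beta."""
    u = 1.0 - rng.random()            # u in (0, 1]
    return beta / (u ** (1.0 / alpha))

def substream_bytes(pkt_size, alpha, beta, horizon, line_rate, rng):
    """Total bytes a sub-stream emits over `horizon` seconds."""
    t, sent = 0.0, 0
    while t < horizon:
        on = pareto_sample(alpha, beta, rng)           # ON period
        pkts = int(min(on, horizon - t) * line_rate / (8 * pkt_size))
        sent += pkts * pkt_size
        t += on + pareto_sample(alpha, beta, rng)      # then OFF period
    return sent

rng = random.Random(1)
# 48 sub-streams with packet sizes 64 B, 80 B, ..., in steps of 16 B:
total = sum(substream_bytes(64 + 16 * i, 1.5, 1e-4, 0.01, 2.5e9, rng)
            for i in range(48))
```

Aggregating many such heavy-tailed ON/OFF sources is the standard construction for synthetic self-similar traffic, which is the property the chapter relies on.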
5.3 Petri net Model of the Traffic Generator
Figure 5.3 shows the Petri net model of the traffic generator. The place NET-
WORK1 represents the external link. Packets arrive in the MAC at a constant rate
equal to LINERATE for BURST-TIME amount of time. This models the ON
period. These packets are buffered in the MAC which is represented by WAITMAC
place. After BURST-TIME time, a token is placed in WAIT1 which subsequently
results in the firing of the transition BURST1. This removes the token from NET-
WORK1 which temporarily stops the traffic generation. The traffic generation is
resumed after IDLE-TIME when a token is placed in NETWORK1 and NET1.
The IDLE-TIME corresponds to the OFF period.
Figure 5.3: Petri Net Model of Traffic Generator (places NETWORK1, NET1, WAIT1, and WAITMAC; transitions LINERATE, BURST1, and IDLE1, with delays BURST-TIME and IDLE-TIME).
In this model NETWORK1 and NET1 are initially assigned 1 token each to
start the packet generation. The multiple sub-streams (sources) are modeled using
colored Petri nets where each source is assigned a given color. The traffic generated
from each sub-stream is combined in WAITMAC . The packets generated by each
sub-stream can have different packet sizes. The attribute of the token is modified
to return the packet size. In our simulation experiments we varied the packet sizes
in steps of 16B starting from 64B. Further, a total of 48 different sub-streams are
used to generate the traffic.
5.4 Packet Buffering Schemes
We evaluate the performance of the following packet buffering schemes :
• Parallel Execution Model. In this scheme each thread in the network processor
buffers the packet in DRAM, processes the packet, and moves the packet to the
TFIFO. This packet flow in the network processor is same as that described
in Section 2.1.1.2.
Figure 5.4: Pipelined Buffering Scheme (packet Rx stage, packet processing stage, and packet Tx stage mapped across microengines ME1-ME8).

• Packet Buffering with a Pipelined Packet Flow. In this scheme the packet flow
is divided into three stages (refer to Figure 5.4), namely, the packet receive
stage, packet processing stage, and packet transmit stage. Each of the three
stages is assigned to a different set of microengines: 2 microengines receive and buffer
the packets, 4 microengines run the packet processing algorithm in parallel, and
the remaining 2 microengines run the transmit stage of the pipeline. Hence,
in this scheme each microengine is responsible for a subtask in the execution
of the packet.
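The intuition for why pipelined buffering drops fewer packets under a burst can be illustrated with a toy discrete-time sketch. This is our own simplification, not the thesis's CNET Petri net model; all parameter values and names are assumed.

```python
# Toy discrete-time sketch: in the parallel scheme a thread holds a packet
# for its full service time, so a burst exhausts the threads and the small
# RFIFO overflows; in the pipelined scheme the Rx threads only move packets
# to DRAM, so they recycle quickly and keep the RFIFO drained.
def simulate(burst, rfifo_slots=8, threads=4, service=10, rx_move=2,
             pipelined=False):
    """Return packets dropped for `burst` back-to-back arrivals."""
    rfifo, dropped = 0, 0
    busy = [0] * threads               # remaining busy cycles per thread
    hold = rx_move if pipelined else service
    for t in range(burst * 4):
        if t < burst:                  # one arrival per cycle during the burst
            if rfifo < rfifo_slots:
                rfifo += 1
            else:
                dropped += 1           # RFIFO full: packet is lost
        for i in range(threads):       # advance threads; free ones grab a packet
            if busy[i] > 0:
                busy[i] -= 1
            elif rfifo > 0:
                rfifo -= 1
                busy[i] = hold
    return dropped
```

The sketch ignores DRAM contention downstream; it only illustrates the thread-recycling effect that makes the Rx stage of the pipelined scheme resilient to bursts.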
We extend the Petri net model developed in Section 3.2 to simulate these packet
buffering schemes running the IPv4 forwarding application. The traffic generator,
described in the previous section, is used as the input to the validated Petri net model
of the IXP 2400 processor, developed in Section 3.3, and simulated using CNET [40].
5.5 Results
In our study we assume a maximum line rate of 20 Gbps. Further, we assume
that to access the initial 8B of data, the DRAMs and SRAMs take 50 nanoseconds
and 8 nanoseconds respectively [18]. However, to access larger chunks of data
(like 64 B) in DRAM which lie in contiguous memory locations, only an additional
5 nanoseconds per 8 B is required [12]. The traffic is generated using 48 sources
with each source generating a traffic of constant packet size. But the packet sizes
generated by different sources vary. In our simulation the packet size increases in
steps of 16 B for each source; i.e., the packets generated by source 1 are 64 B, the
packets generated by source 2 are 80 B, and so on (refer to Figure 5.2).
Figure 5.5 shows the traffic generated, in terms of the number of bytes, using
the Petri net model described in Section 5.3. This traffic is generated on a 2.5 Gbps
link (an OC-48 link). We observe that this traffic is characterized by alternate peaks
and troughs. The number of bytes in a burst, i.e., in a time frame of 1 millisecond,1
varies from 5x10^6 to 5x10^7 bytes. This traffic, shown in Figure 5.5, is used as an
input to the network processor and used in the performance evaluation.
The following subsection evaluates the performance of different packet buffering
schemes in a bursty traffic scenario.
5.5.1 Impact of Packet Buffering
Tables 5.1, 5.2 and 5.3 compare the performance of the different packet buffering
schemes explained in Section 5.4 for bursty traffic at different input line rates. We
observe that packets are dropped in case of the parallel buffering scheme even at low
1 1 millisecond = 6x10^5 IXP clock cycles
Figure 5.5: Bursty Traffic Generated using 48 sources (number of bytes per burst interval of 6x10^5 clock cycles versus time).
line rates of 1.7 Gbps. This occurs due to insufficient buffering space in the RFIFO
and/or the thread non-availability.
To justify this inference we measure the percentage utilization of RFIFO, per-
centage thread non-availability and percentage RFIFO non-availability.
• Thread Non-Availability : This parameter gives the percentage of time an in-
coming packet (to the RFIFO) is not processed due to the unavailability of
free threads. Note that in the pipelined buffering scheme the thread non-
availability is measured only for the receive stage of the pipeline (see Figure
5.4).
• RFIFO Non-Availability : This parameter gives the percentage of time the
RFIFO occupancy is full when a packet arrives at the RFIFO. A packet is
dropped either when the RFIFO is full or when there are no free threads
available.
• RFIFO Utilization: The time-averaged number of free bytes available in the
RFIFO. Expressed as a percentage, it is

RFIFO\ UTIL = \frac{\text{Number of Free Bytes Available in RFIFO}}{8192\ \text{B}} \times 100    (5.2)
74 Performance Analysis of Network Processor in Bursty Traffic
We observe that the RFIFO Non-Availability is 8% and 45% for the parallel
scheme for line rates of 1.7 Gbps and 6 Gbps respectively. In contrast the pipelined
buffering scheme has a non-availability of 0% and 15% for line rates of 1.7 Gbps
and 6 Gbps respectively. Packet drop happens in the parallel scheme even at the lower
line rate for bursty traffic, whereas in the pipelined buffering scheme no packets are dropped. This
trend of higher packet drop in parallel buffering scheme is even more pronounced at
higher line rates. This is because, under the parallel buffering scheme, a single thread
is responsible for the entire processing of a packet and remains occupied during
the entire period of packet processing. Thus, once all 64 threads in the IXP 2400
are busy, a burst of packets can be dropped due to the lack of buffering.
In the pipelined scheme, by contrast, a dedicated set of microengines and threads,
even though fewer in number, can keep moving packets from the RFIFO to DRAM,
since a thread in the Rx stage is responsible only for moving the packet from the
RFIFO to DRAM. As evidence,
we present the utilization of the RFIFO (refer to Table 5.1, 5.2 and 5.3). The
pipelined buffering scheme buffers packets in the Rx stage (refer to Figure 5.4) and
processes these packets when the line rate drops. However, in the parallel scheme
packets are either processed instantly or they are dropped.
Scheme                        Output Rate   Drop Rate   RFIFO    RFIFO        Thread
                              (Gbps)        (Gbps)      Util.    Non-Avail.   Non-Avail.
Parallel Buffering Scheme     1.6           0.1         14%      8%           7%
Pipelined Buffering Scheme    1.7           0           15%      0%           0%
Table 5.1: Output Line Rates supported with Input Rate of 1.7 Gbps
Table 5.4 compares the performance of the packet buffering schemes for an
exponential arrival with fixed packet size of 64B (DoS attack scenario). In contrast
to the bursty traffic scenario both these schemes support line rates of up to 2.96
Gbps without any packet drop. We observe that in both the schemes the line rate
is limited by DRAM bandwidth. Note that in this case, packet buffering does not
Scheme                       Output Rate (Gbps)  Drop Rate (Gbps)  RFIFO Util.  RFIFO Non-Avail.  Thread Non-Avail.
Parallel Buffering Scheme    2.29                0.85              47%          37%               11%
Pipelined Buffering Scheme   2.9                 0.24              23%          5%                7%

Table 5.2: Output Line Rates supported with Input Rate of 3.14 Gbps
Scheme                       Output Rate (Gbps)  Drop Rate (Gbps)  RFIFO Util.  RFIFO Non-Avail.  Thread Non-Avail.
Parallel Buffering Scheme    4.3                 1.7               65%          45%               21%
Pipelined Buffering Scheme   5.3                 0.7               27%          15%               11%

Table 5.3: Output Line Rates supported with Input Rate of 6 Gbps
provide any advantage as the network traffic streams in at a constant rate.
Scheme                       Transmit Rate (Gbps)
Parallel Buffering Scheme    2.96
Pipelined Buffering Scheme   2.98

Table 5.4: Maximum Output Rates for Different Packet Buffering Schemes.
5.6 Summary
This chapter evaluates the performance of a network processor under bursty traffic.
Further, it examines the need for a store-and-forward architecture in a network
processor and explores different packet buffering schemes. Bursty traffic is simulated
based on a mathematical model developed in earlier work; this model is integrated
into the Petri net model of the network processor and simulated using CNET. Our
results indicate that the parallel buffering scheme suffers a significant loss in packet
throughput (up to 30%) due to packet drops under bursty traffic even at a line rate
of 3.14 Gbps, whereas the pipelined buffering scheme is able to support a line rate
of 3.14 Gbps. For an input line rate of 6 Gbps the pipelined
buffering scheme achieves line rates of up to 5.3 Gbps. This scheme also incurs
fewer packet drops than the parallel scheme at lower line rates. However, for a
constant packet size with exponential arrivals the parallel and pipelined schemes
give similar transmit rates.
Chapter 6
Related Work
In this chapter we present a discussion on related work on network processors.
Section 6.1 discusses the related work in the performance evaluation of network
processors. Section 6.2 deals with the related work in the area of packet reordering
and bursty traffic evaluation.
6.1 Network Processor Performance Evaluation
Crowley et al. [7] evaluate the performance of different processor architectures
for network applications using a trace-driven simulation approach. This work
evaluates four architectures, namely superscalar, fine-grain multithreaded (FGMT),
simultaneous multithreaded (SMT), and chip multiprocessor (CMP), on three
different applications, viz., IPv4 forwarding, Message Digest 5, and data encryption.
It explores the impact of processor parameters such as the number of threads and
the processor clock rate across the architectures and applications, and concludes
that the SMT architecture is best suited for network processors. However, a major
drawback of this work is that it does not model the buffering of packets in the
network processor. Our study shows that packet buffering is the bottleneck and
that the processor-memory interaction needs to be modeled in
greater detail.
Spalink et al. [29] evaluate the performance of IPv4 forwarding running on an
IXP 1200 processor. This study evaluates the performance of the processor using
the Intel SDK 3.5 tool [16], and uses a DoS traffic model with incoming packets
of 64 B. It also studies the impact of the number of threads on the throughput of
the network processor. Their results indicate that the DRAM is the bottleneck
resource and limits the throughput to 1.377 Gbps.
Wolf et al. [9] use an analytical approach to model a network processor. This
model derives mathematical equations for throughput, processor utilization, and
memory access based on existing models for multithreaded architectures [1] and
standard queuing models. Further, this study evaluates the impact of the number of
threads, the processor clock rate, and the data and instruction cache sizes. This work
also evaluates the impact on the chip area due to these parameters and uses a
performance-area metric to evaluate the performance of the network processor. Fi-
nally, this work gives insights, based on their results, into future directions in network
processor trends. A major drawback of this work is that it does not model the
processor-memory interaction in detail. Our work, on the other hand, uses a Petri
net model for performance evaluation. A salient feature of our approach is that
the Petri net model captures the network processor architecture, the application,
and their interaction.
Thiele et al. [31] develop a genetic algorithm framework for the design space
exploration of network processors. Given an application, flow characterization, and
available resources, their framework schedules the tasks of the applications and
binds them to the resources. More specifically, their framework consists of a task
and resource usage model. The task model models the different packet processing
functions such as header or payload processing functions. The resource model
captures the utilization of various resources. Further, this work explores different
resource-task mappings and finds an optimal mapping under constraints such as
chip area.
Weng and Wolf [37] construct an annotated directed acyclic graph of the ap-
plications using run time traces and use this to perform design space exploration.
They use a randomized algorithm to perform mapping of nodes to processors and
memories. The system throughput is determined by modeling the processing time
for each processing element in the system based on its workload, the memory con-
tention on each memory interface, and the communication overhead between the
pipeline stages. Memory contention is modeled using a queuing network approach.
At the end of the mapping process the best overall mapping is reported. While the
work reported in [31, 37] deals with design space exploration of the application
program and its mapping to a given network processor architecture, our work deals
with a few architectural enhancements for performance improvement.
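To illustrate the flavor of such queuing-based contention modeling (our own sketch, not the model of [37]), contention at a single memory interface can be approximated by an M/M/1 queue, whose mean response time T = 1/(mu - lambda) blows up as the request rate lambda approaches the interface's service rate mu:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time a memory request spends at an M/M/1 memory
    interface (queuing + service). Requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: offered load >= capacity")
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical DRAM interface serving 10 requests per microsecond.
for lam in (2.0, 5.0, 9.0, 9.9):
    print(f"load {lam / 10.0:.0%}: {mm1_response_time(lam, 10.0):.2f} us")
```

The sharp growth near full load is why a mapping that balances memory traffic across interfaces can outperform one that merely balances computation.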
Wolf et al. [38] develop a benchmark suite, CommBench, to evaluate the
performance of network processors. This suite broadly classifies network applications
into header-processing and payload-processing applications, based on the amount
of processing the applications require. Based on simulations, this work contrasts
the performance of CommBench with the standard SPEC benchmarks, and
characterizes the CommBench applications with respect to instruction and data
locality and instruction mix (e.g., the number of loads/stores). A drawback of this
work is that it evaluates performance on a generic processor rather than a network
processor.
6.2 Packet Reordering in Network Processors
Bennett et al. [5] were among the first to report the problem of reordering and
its potential impact on network throughput. They study the impact of packet
reordering on a backbone link and observe that the parallelism existing in the
Internet, due to multiple parallel paths, leads to reordering of packets. They propose
a modification to the TCP protocol to take the impact of packet reordering into
account.
Laor et al. [24] artificially introduce packet reordering and study the impact
of reordering and retransmission on throughput. By simulating a backbone link,
this work evaluates the effect of artificially induced reordering on application
throughput for multiplexed flows, and also evaluates the impact of the number of
concurrent flows arriving at the router. This study indicates that retransmission
of 10% of packets can reduce the effective network bandwidth significantly, by
up to 60%.
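The mechanism by which reordering degrades throughput is TCP fast retransmit: three duplicate ACKs trigger a retransmission, so a sufficiently delayed packet is indistinguishable from a lost one. A simplified sketch of this effect (our own illustration; cumulative ACKs only, no timers or SACK):

```python
def spurious_fast_retransmits(arrival_order):
    """Count fast retransmits triggered purely by reordering.
    The receiver ACKs cumulatively; three duplicate ACKs for the
    same sequence number trigger one (spurious) retransmission."""
    received = set()
    next_expected = 0
    dup_acks = {}
    retransmits = 0
    for seq in arrival_order:
        received.add(seq)
        while next_expected in received:
            next_expected += 1
        ack = next_expected              # cumulative ACK value sent now
        if seq >= ack:                   # arrival did not advance the ACK
            dup_acks[ack] = dup_acks.get(ack, 0) + 1
            if dup_acks[ack] == 3:
                retransmits += 1
    return retransmits

# Packet 0 delayed behind four later packets: three duplicate ACKs
# for sequence 0 fire a fast retransmit although nothing was lost.
print(spurious_fast_retransmits([1, 2, 3, 4, 0]))
```

A packet displaced by fewer than three positions never accumulates three duplicate ACKs, which is why mild reordering is tolerated while deep reordering triggers retransmission.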
Jaiswal et al. [19] classify reordered/retransmitted packets as those arising
from routing loops, network duplication, packet loss, and parallelism in packet
transmission. They propose a classification methodology and evaluate it on a
backbone link. Their results indicate a retransmission rate of about 5%, most of
which is due to packet loss; the impact of network anomalies such as reordering is
smaller than that of packet loss. However, this work does not measure the impact
of these network anomalies on application throughput. While these works evaluate
packet reordering and its impact on packet retransmission, their studies are not
specific to network processors; hence they do not consider the effects of concurrency
and FIFO ordering on packet reordering.
Our work focuses on the impact of the network processor architecture on the packet
reordering/retransmission. Modern network processors have in-built mechanisms
(ITS and AISR) to overcome packet reordering. We compare the performance of
these schemes with our proposed scheme in Section 4.4.3.
There have been several attempts to characterize Internet traffic. Leland et
al. [25] were among the first to characterize Ethernet traffic. They observe that
Ethernet traffic is self-similar in nature and conclude that it exhibits fractal-like
behavior, based on a rigorous statistical analysis of Ethernet traffic collected
between 1989 and 1992. The mathematical model developed in this paper was
later extended to Internet traffic. Taqqu et al. [35] provide a unique way to
generate self-similar traffic, using a superposition of a number of ON/OFF sources
(referred to as packet trains), each of which generates constant-bit-rate traffic in
its ON period. This method is used in several traffic generators, such as NS.
Kramer [23] extends this traffic generation to simulate Internet traffic, using a
Poisson distribution for the ON time. This thesis uses a similar traffic
generation model.
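The ON/OFF construction can be sketched as follows (our own illustration with made-up parameters, not the exact generator used in this thesis): each source emits at a constant bit rate during a heavy-tailed (Pareto) ON period and is silent during an OFF period, and the loads of many such sources are superposed.

```python
import random

random.seed(42)

def pareto(alpha, x_min):
    """Heavy-tailed Pareto sample; 1 < alpha < 2 gives finite mean but
    infinite variance, the regime that yields self-similar aggregates."""
    return x_min / (random.random() ** (1.0 / alpha))

def on_intervals(horizon, alpha=1.5, t_min=1.0):
    """Yield the (start, end) ON intervals of one ON/OFF source."""
    t = 0.0
    while t < horizon:
        on_end = min(t + pareto(alpha, t_min), horizon)
        yield (t, on_end)
        t = on_end + pareto(alpha, t_min)    # Pareto OFF gap

def aggregate_load(n_sources, horizon, bins, rate=1.0):
    """Superpose n_sources; return traffic volume seen in each time bin."""
    width = horizon / bins
    load = [0.0] * bins
    for _ in range(n_sources):
        for start, end in on_intervals(horizon):
            b0, b1 = int(start / width), min(int(end / width), bins - 1)
            for b in range(b0, b1 + 1):
                overlap = min(end, (b + 1) * width) - max(start, b * width)
                load[b] += rate * overlap    # CBR emission during ON time
    return load

load = aggregate_load(n_sources=16, horizon=1000.0, bins=100)
print("peak-to-mean ratio:", max(load) / (sum(load) / len(load)))
```

Because the ON/OFF durations are heavy-tailed, the aggregate stays bursty across timescales instead of smoothing out the way superposed Poisson sources would.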
Chapter 7
Conclusions
7.1 Summary
This thesis deals with the performance evaluation of network processors. We develop
a Petri net model of commercial network processors (the Intel IXP 2400 and
IXP 2850) for three different applications, viz., IPv4 forwarding, Network Address
Translation (NAT), and the IP Security (IPSec) protocol. The performance results
are obtained by simulating the Petri nets using the CNET simulator. Further, this
model is validated against the Intel proprietary SDK simulator. A salient
feature of our model is its ability to capture the architecture, applications and their
interaction in great detail. Initially, we study the performance of network processors
using a Poisson arrival process. The IXP 2400 achieves a throughput of 2.96 Gbps
for IPv4 and NAT, and the IXP 2850 achieves 3.6 Gbps for the IPSec
application. Our performance results indicate that the DRAM memory used for
packet buffering is the bottleneck. Our study shows that multithreading is effective
only up to a certain number of threads. Beyond this threshold packet buffer mem-
ory (DRAM) is fully utilized and increasing the number of threads is not beneficial.
Further, we observe that increasing the number of microengines beyond 4 provides
no additional gain in the throughput. In order to reduce the utilization of DRAM
we store packet headers in SRAM. We obtain up to a 20% improvement in the
transmit rate at no additional hardware cost. Since DRAM is the bottleneck, we
explore increasing the number of DRAM banks. Our results indicate that a network
processor with 8 DRAM banks improves the throughput by up to 20%. However,
with 8 DRAM banks the hash unit, a task-specific unit used for performing
hardware lookup, becomes the bottleneck. Increasing the number of hash units from
1 to 2 raises the throughput to 4.8 Gbps, an improvement of 60% over the base
case. We further observe that an identical improvement is obtained with two hash
units but fewer microengines (4 MEs). So, for a fixed die area, an NP architecture
with fewer processors (supporting 16 or more threads) but more task-specific units
gives better performance than the base IXP architecture.
The second part of the thesis studies the impact of packet-level parallel processing
in a network processor on packet reordering and retransmission under the fast
retransmit model. The Petri net model developed in Chapter 3 is extended to take
into account packet attributes such as source and destination IP addresses and
sequence numbers. Further, the model is extended to study the impact of reordering
in a multi-hop network. We evaluate the impact on reordering and retransmission
in multi-hop environments of 1, 5, and 10 hops and for different packet sizes, 64 B
and 512 B. We observe that reordering/retransmission increases non-linearly with
the number of hops, reaching up to 60% for 10 hops. Our results indicate that the
parallel architecture of the network processor can severely impact reordering and
can cause up to 60% retransmission in a 10-hop scenario. Further, we observe that
in addition to reordering due to parallel processing, the transmit buffer allocation
for each thread in a microengine severely impacts packet reordering. This is due
to the strict FIFO-order dequeuing from the transmit buffer, explained in detail in
Section 2. Hence we explore the following buffer allocation schemes: global, local,
contiguous, and strided buffer allocation.
In the global buffer allocation, transmit buffer space for threads from different
microengines is allocated in a critical section, and the threads compete for a mutex
lock to enter the critical section. This reduces reordering to 14% but also drastically
reduces the throughput, to 1 Gbps. Hence we explore a local buffer allocation
scheme, where only threads from the same microengine compete for a common
buffer space and different microengines are allocated a fixed buffer space a priori.
This scheme results in a reordering of 18% but gives a throughput of 2 Gbps.
The global and local allocation schemes use mutual exclusion for transmit buffer
allocation. These schemes significantly reduce packet reordering, to 14% in global
and 18% in local, but at the cost of lower throughput, 1 Gbps in global and
2 Gbps in local buffer allocation. On the other hand, static buffer allocation,
contiguous or strided, without any mutual exclusion gives a transmit throughput
of 3 Gbps but with packet reordering of up to 33.4%. In strided buffer allocation,
threads within a microengine are allocated a fixed space, decided a priori, with
a fixed stride between them; threads from different microengines are allocated
successive locations in the transmit buffer.
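The two static layouts can be illustrated by how a transmit-buffer slot index might be computed for a (microengine, thread, packet) triple; this is our own sketch of the idea, with made-up sizes, not the actual IXP 2400 transmit buffer layout:

```python
def contiguous_slot(me, thread, pkt, threads_per_me, slots_per_thread):
    """Contiguous: each thread owns one contiguous block of slots,
    so threads of the same microengine occupy adjacent regions."""
    tid = me * threads_per_me + thread          # global thread id
    return tid * slots_per_thread + pkt

def strided_slot(me, thread, pkt, n_mes, slots_per_thread):
    """Strided: a thread's successive packets are n_mes slots apart,
    and the same thread id on different microengines fills the
    successive locations in between."""
    region = thread * (n_mes * slots_per_thread)   # fixed space per thread
    return region + pkt * n_mes + me               # stride of n_mes

n_mes, threads_per_me, slots_per_thread = 4, 8, 4
cont = {contiguous_slot(m, t, p, threads_per_me, slots_per_thread)
        for m in range(n_mes) for t in range(threads_per_me)
        for p in range(slots_per_thread)}
stri = {strided_slot(m, t, p, n_mes, slots_per_thread)
        for m in range(n_mes) for t in range(threads_per_me)
        for p in range(slots_per_thread)}
print(len(cont), len(stri))   # both layouts assign 128 distinct slots
```

Both layouts are collision-free without any locking; they differ only in which packets end up adjacent in the buffer, which is what determines how far FIFO dequeuing deviates from arrival order.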
Further, our results indicate that packet reordering reduces for a network processor
with fewer microengines, without significantly affecting the throughput. The
retransmission rate reduces from 61% (for 10 hops), for a network processor with
8 microengines and 8 threads, to 19% (for 10 hops) for a network processor with
2 microengines and 8 threads or 4 microengines and 4 threads. This is achieved
without sacrificing performance (2.96 Gbps), because the throughput of the network
processor saturates beyond a total of 16 threads, as observed in Section 3.4.3.
Based on this observation we propose a scheme, Packet sort, in which a few
microengines/threads are dedicated to sorting packets into order at the transmit
buffer side. Packet sort is able to support a line rate of up to 2.5 Gbps without any packet
reordering. Our results indicate that Packet sort achieves a significant throughput
improvement, of up to 35%, over the in-built schemes in the IXP, namely Inter
Thread Signaling (ITS) and Asynchronous Insert and Synchronous Remove (AISR).
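The idea behind Packet sort can be sketched as a small reorder stage at the transmit side (our own illustration of the concept, not the microengine implementation): packets that finish processing out of order are parked until every earlier sequence number has been released.

```python
import heapq

class ReorderStage:
    """Release packets to the transmit buffer strictly in sequence order."""
    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.parked = []                  # min-heap of (seq, packet)

    def push(self, seq, packet):
        """Accept one (possibly out-of-order) packet; return the list
        of packets that can now be transmitted in order."""
        heapq.heappush(self.parked, (seq, packet))
        released = []
        while self.parked and self.parked[0][0] == self.next_seq:
            released.append(heapq.heappop(self.parked)[1])
            self.next_seq += 1
        return released

stage = ReorderStage()
out = []
for seq in (2, 0, 1, 4, 3):              # packets finish processing out of order
    out += stage.push(seq, f"pkt{seq}")
print(out)                               # released strictly in sequence order
```

The cost of the scheme is the parking delay and the dedicated threads that run this stage, which is why Packet sort tops out below the raw line rate of the static allocation schemes.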
The final part of this thesis investigates the performance of the network processor
in a bursty traffic scenario. We model bursty traffic using a Pareto distribution.
Further, we explore various packet buffering schemes; in particular, we consider
parallel and pipelined packet-flow architectures. Our results indicate that the parallel
scheme supports line rates up to 4.3 Gbps and the pipelined scheme supports line
rates up to 5.3 Gbps.
7.2 Future Directions
Below we list a few possibilities for extending our work.
• The Petri net model developed in Chapter 3 executes a single application.
However, with increasing functionalities at the network layer there is a need
to support multiple concurrent applications in routers. For example, a modern
edge router may run a cryptographic application, a port-scan detector, and a
virus scanner in addition to forwarding, and the same router might also perform
network address translation. Hence it will be interesting to evaluate the performance
of the network processor running multiple concurrent applications.
• Our study on packet reordering does not model the network flow of ACK
packets (from destination to source), nor do we model the source reducing its
rate on retransmission under the fast retransmit algorithm. Our study can be
extended to take these into account. Further, the retransmission of a packet
due to reordering results in the packet being resent; these retransmitted
packets need no further processing if they are still available in the DRAM.
Heuristics can be developed to forward such packets without any additional
processing cost being incurred by the network processor.
Bibliography
[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transac-
tions on Parallel and Distributed Systems, 3(5):525-539, Sept 1992.
[2] F. Baker. RFC 1812 - Requirements for IP Version 4 Routers, June 1995.
[3] J. Banks, J. Carson, and B. Nelson. Discrete-Event System Simulation. Prentice
Hall International, 1998.
[4] J. Bellardo, S. Savage. Measuring packet reordering. Proceedings of the ACM
SIGCOMM IMW, Marseille, France, November 2002.
[5] J. Bennett, C. Partridge, and N. Shectman. Packet Reordering is not Patholog-
ical Network Behavior. IEEE/ACM Transactions on Networking, 7(6):789–798,
1999.
[6] J. Cao, William S. Cleveland, Dong Lin, Don X. Sun. On the Nonstationarity
of Internet Traffic. Proceedings of ACM SIGMETRICS, pp 102–112, 2001.
[7] P. Crowley, M. Fiuczynski, J.L. Baer, B. Bershad. Characterizing processor
architectures for programmable network interfaces. In Proceedings of Interna-
tional Conference on Supercomputing, Feb 2000.
[8] C. Fraleigh, S. Moon, C. Diot, B. Lyles, and F. Tobagi. Packet-level traffic
measurements from a tier-1 IP backbone. Technical Report TR01-ATL110101,
Sprint ATL Technical Report, November 2001.
[9] M. Franklin, T. Wolf. A Network Processor Performance and Design Model with
Benchmark Parameterization. First Workshop on Network Processors, Cam-
bridge, MA, February 2002.
[10] L. Garber. Denial of Service attacks rip the Internet. IEEE Computer, 33(4):12–
17, Apr. 2000.
[11] R. Govindarajan, F. Suciu, W. Zuberek. Timed Petri Net Models of Multi-
threaded Multiprocessor Architectures, Proc. of the 7th International Work-
shop on Petri Nets and Performance Models, pp.163-172, Saint Malo, France,
June 1997.
[12] J. Hasan, Satish Chandra, T. N. Vijaykumar. Efficient Use of Memory Band-
width to Improve Network Processor Throughput. In Proceedings of Interna-
tional Symposium on Computer Architecture, June 2003.
[13] IBM. The Network Processor: Enabling Technology for high performance Net-
working. IBM Microelectronics, 1999.
[14] Intel Corporation, Intel IXP 1200 Network Processor Hardware Reference Man-
ual. Revision 8, pp 27-29,102-104, August 2001.
[15] Intel Corporation, Intel IXP 2400 Network Processor Hardware Reference Man-
ual. Revision 7, November 2003.
[16] Intel IXP2400/IXP2800 Development Tools Users Guide. Revision 11, March
2004.
[17] Intel IXP2400/IXP2800 Network Processors Microengine C Language Support
Reference Manual. Revision 9, November 2003.
[18] B. Jacob, D. Wang. DRAM: Architectures, Interfaces, and Systems: A Tutorial.
(http://www.ee.umd.edu/~blj/talks/DRAM-Tutorial-isca2002-2.pdf).
[19] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose and D. Towsley. Measurement and
classification of out-of-sequence packets in a Tier-1 IP Backbone. Internet
Measurement Workshop (IMW), 2003.
[20] K. Jensen. A Brief Introduction to Coloured Petri Nets. In Proceeding of
the TACAS-1997 Workshop, Lecture Notes in Computer Science Vol. 1217,
Springer-Verlag 1997, 203-208.
[21] S. Kent, R. Atkinson. RFC 2402 - IP Authentication Header (AH), November
1998.
[22] S. Kent, R. Atkinson. RFC 2406 - IP Encapsulating Security Payload (ESP),
November 1998.
[23] G. Kramer. Generation of self-similar traffic using Traf Gen 3.
(http://wwwcsif.cs.ucdavis.edu/~kramer/code/trf_gen3.html)
[24] M. Laor, L. Gendel. The Effect of Packet Reordering in a Backbone Link on
Application Throughput. IEEE Network Sept/Oct 2002.
[25] W. Leland, M. S. Taqqu, W. Willinger and D. Wilson. On the Self Similar
Nature of Ethernet Traffic. Proc of SIGCOMM, Sept 1993.
[26] Motorola C-5 Network Processor Hardware Reference Manual, Revision 1.7,
October 2001.
[27] Y. Narahari and N. Viswanadham. Performance Modeling of Automated Man-
ufacturing Systems, Prentice Hall, 1992.
[28] National Laboratory for Applied Network Research (NLANR). Insights into
Current Internet Traffic Workloads.(http://www.nlanr.net/NA/tutorial.html)
[29] T. Spalink, Scott Karlin, Larry Peterson. Evaluating Network Processors for IP
Forwarding. Technical Report TR-626-00, Department of Computer Science,
Princeton University, Nov. 2000.
[30] R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley,
1994.
[31] L. Thiele, Samarjit Chakraborty, Matthias Gries, Simon Kunzli. Design space
exploration of network processor architectures. First Workshop on Network
Processors, Cambridge, MA, February 2002.
[32] K. Trivedi. Probability and Statistics, with Reliability, Queuing and Computer
Science Applications, Wiley Interscience, 2002.
[33] R. Saavedra-Barrera, D. Culler, and T. Von Eicken. Analysis of multithreaded
architectures for parallel computing. In Second Annual ACM Symposium on
Parallel Algorithms and Architectures, pages 169–178, July 1990.
[34] P. Saisuresh, K. Egevang. RFC 3022 - Traditional IP Network Address Trans-
lator (Traditional NAT), January 2001.
[35] M. S. Taqqu, Walter Willinger, and R. Sherman. Proof of a Fundamental Result
in Self-Similar Traffic Modeling. Proceedings of ACM SIGCOMM Computer
Communication Review, vol. 27, pp. 5-23, 1997.
[36] R. Thayer, N. Doraswamy, R. Glenn. RFC 2411 - IP Security Document
Roadmap, November 1998.
[37] N. Weng and T. Wolf. Pipelining vs multiprocessing: Choosing the right net-
work processor topology. Proceedings of the Advanced Networking and Commu-
nication Hardware Workshop (ANCHOR 2004), in conjunction with the 31st
Annual International Symposium on Computer Architecture (ISCA 2004), June
2004.
[38] T. Wolf and M. Franklin. CommBench: A Telecommunication Benchmark
for Network Processors. In Proceedings of the International Symposium on
Performance Analysis of Systems and Software, April 2000, pp. 154–162.
[39] Y. Zhang, L. Breslau, V. Paxson and S. Shenker. On the Characteristics and
Origins of Internet Flow Rates. In Proceedings of SIGCOMM, August 2002.
[40] W. M. Zuberek. Modeling using Timed Petri Nets - event-driven simulation,
Technical Report No. 9602, Dept. of Computer Science, Memorial Univ. of New-
foundland, St. John’s, Canada, 1996 (ftp://ftp.ca.mun.ca/pub/techreports/tr-
9602.ps.Z).