Improving the performance of passive network monitoring applications with memory locality enhancements

Antonis Papadogiannakis a,*, Giorgos Vasiliadis a, Demetres Antoniades a, Michalis Polychronakis b, Evangelos P. Markatos a

a Institute of Computer Science, Foundation for Research and Technology – Hellas, P.O. Box 1385, Heraklion, GR-711-10, Greece
b Computer Science Department, Columbia University, New York, USA

Article info

Article history:
Received 5 December 2010
Received in revised form 4 August 2011
Accepted 6 August 2011
Available online 17 August 2011

Keywords:
Passive network monitoring
Intrusion detection systems
Locality buffering
Packet capturing

0140-3664/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.comcom.2011.08.003

* Corresponding author. Tel.: +30 2810 391663; fax: +30 2810 391601.
E-mail addresses: [email protected] (A. Papadogiannakis), [email protected] (G. Vasiliadis), [email protected] (D. Antoniades), [email protected] (M. Polychronakis), [email protected] (E.P. Markatos).

Abstract

Passive network monitoring is the basis for a multitude of systems that support the robust, efficient, and secure operation of modern computer networks. Emerging network monitoring applications are more demanding in terms of memory and CPU resources due to the increasingly complex analysis operations that are performed on the inspected traffic. At the same time, as the traffic throughput in modern network links increases, the CPU time that can be devoted to processing each network packet decreases. This leads to a growing demand for more efficient passive network monitoring systems, in which runtime performance becomes a critical issue.

In this paper we present locality buffering, a novel approach for improving the runtime performance of a large class of CPU and memory intensive passive monitoring applications, such as intrusion detection systems, traffic characterization applications, and NetFlow export probes. Using locality buffering, captured packets are reordered by clustering packets with the same port number before they are delivered to the monitoring application. This results in improved code and data locality and, consequently, in an overall increase in packet processing throughput and a decrease in the packet loss rate. We have implemented locality buffering within the widely used libpcap packet capturing library, which allows existing monitoring applications to transparently benefit from the reordered packet stream without modifications. Our experimental evaluation shows that locality buffering significantly improves the performance of popular applications, such as the Snort IDS, which exhibits a 21% increase in packet processing throughput and is able to handle 67% higher traffic rates without dropping any packets.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Along with the phenomenal growth of the Internet, the volume and complexity of Internet traffic is constantly increasing, and faster networks are constantly being deployed. Emerging highly distributed applications, such as media streaming, cloud computing, and popular peer-to-peer file sharing systems, demand increased bandwidth. Moreover, the number of attacks against Internet-connected systems continues to grow at alarming rates.

As networks grow larger and more complicated, effective passive network monitoring is becoming an essential operation for understanding, managing, and improving the performance and security of computer networks. Passive network monitoring is increasingly important for a large set of Internet users and service providers, such as ISPs, NRNs, computer and telecommunication scientists, security administrators, and managers of high-performance computing infrastructures.

While passive monitoring has traditionally been used for relatively simple network traffic measurement and analysis applications, or just for gathering packet traces that are analyzed off-line, in recent years it has become vital for a wide class of more CPU and memory intensive applications, such as network intrusion detection systems (NIDS) [1], accurate traffic categorization [2], and NetFlow export probes [3]. Many of these applications need to inspect both the headers and the whole payloads of the captured packets, a process widely known as deep packet inspection [4]. For instance, measuring the distribution of traffic among different applications has become a difficult task. Several recent applications use dynamically allocated ports and, therefore, cannot be identified based on a well-known port number. Instead, protocol parsing and several other heuristics, such as searching for an application-specific string in the packet payload [2], are commonly used. Also, intrusion detection systems such as Snort [1] and Bro [5] need to be able to inspect the actual payload of network packets in order to detect malware and intrusion attempts. Threats are identified using attack "signatures" that are evaluated by advanced pattern matching algorithms.

The complex analysis operations of such demanding applications incur an increased number of CPU cycles spent on every captured packet. Consequently, this reduces the overall processing throughput that the application can sustain without dropping incoming packets. At the same time, as the speed of modern network links increases, there is a growing demand for more efficient packet processing using commodity hardware that is able to keep up with higher traffic loads.

A common characteristic that is often found in passive monitoring applications is that they usually perform different operations on different types of packets. For example, a NIDS applies a certain subset of attack signatures on packets with destination port 80, i.e., it applies the web-attack signatures on packets destined to web servers, while a different set of signatures is applied on packets destined to database servers, and so on. Furthermore, NetFlow probes, traffic categorization, as well as TCP stream reassembly, which has become a mandatory function of modern NIDS [6], all need to maintain a large data structure that holds the active network flows found in the monitored traffic at any given time. Thus, for packets belonging to the same network flow, the process accesses the same part of the data structure that corresponds to the particular flow.

In all the above cases, we can identify a locality of executed instructions and data references for packets of the same type. In this paper, we present a novel technique for improving packet processing performance by taking advantage of this locality property, which is commonly exhibited by many different passive monitoring applications. In practice, the captured packet stream is a mix of interleaved packets that correspond to hundreds or thousands of different packet types, depending on the monitored link. Our approach, called locality buffering, is based on reordering the packet stream that is delivered to the monitoring application in a way that enhances the locality of the application's code execution and data access, improving the overall packet processing performance.

We have implemented locality buffering in libpcap [7], the most widely used packet capturing library, which allows for improving the performance of a wide range of passive monitoring applications written on top of libpcap in a transparent way, without the need to modify them. Our implementation combines locality buffering with memory mapping, which optimizes the performance of packet capturing by mapping the buffer in which packets are stored by the kernel into user-level memory.

Our experimental evaluation using real-world applications and network traffic shows that locality buffering can significantly improve packet processing throughput and reduce the packet loss rate. For instance, the popular Snort IDS exhibits a 21% increase in packet processing throughput and is able to process 67% higher traffic rates with no packet loss.

The rest of this paper is organized as follows: Section 2 outlines several methods that can be used for improving packet capturing performance. In Section 3 we describe the overall approach of locality buffering, while in Section 4 we present in detail our implementation of locality buffering within the libpcap packet capturing library, combined with memory mapping. Section 5 presents the experimental evaluation of our prototype implementation using three popular passive monitoring tools. Finally, Section 6 summarizes related work, Section 7 discusses limitations of our approach and future work, and Section 8 concludes the paper.

2. Background

Passive monitoring applications analyze network traffic by capturing and examining the individual packets passing through the monitored link, using various techniques that range from simple flow-level accounting to fine-grained operations like deep packet inspection. Popular passive network monitoring applications, such as intrusion detection systems, per-application traffic categorization, and NetFlow export probes, are built on top of libraries for generic packet capturing. The most widely used library for packet capturing is libpcap [7].

2.1. Packet capturing in Linux

We briefly describe the path that packets take from the wire until they are delivered to the user application for processing. All packets captured by the network card are stored in memory by the kernel. In Linux [8], this is achieved by issuing an interrupt for each packet or an interrupt for a batch of packets [9,10]. Then, the kernel hands the packets over to every socket that matches the specified BPF filter [11]. If a socket buffer becomes full, the next incoming packets will be dropped from this socket. Thus, the size of the socket buffer affects the tolerance of a passive monitoring application to short-term traffic or processing bursts. Finally, each packet is copied to memory accessible by the user-level application.
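For reference, the following minimal libpcap program illustrates this path from the application's side: a capture handle is opened on an interface, a BPF filter is attached, and a callback is invoked for each packet copied up from the kernel. This is only a sketch; the interface name ("eth0") and the filter expression are placeholder values and error handling is kept minimal.

    #include <pcap/pcap.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Invoked by libpcap once for every packet copied up from the kernel. */
    static void handle_packet(u_char *user, const struct pcap_pkthdr *hdr,
                              const u_char *data)
    {
        printf("captured %u of %u bytes\n", hdr->caplen, hdr->len);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        struct bpf_program fp;

        /* "eth0" is a placeholder interface; 65535 is the snapshot length. */
        pcap_t *pc = pcap_open_live("eth0", 65535, 1, 100, errbuf);
        if (pc == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return EXIT_FAILURE;
        }

        /* The BPF filter is evaluated by the kernel for every packet. */
        if (pcap_compile(pc, &fp, "tcp or udp", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
            pcap_setfilter(pc, &fp) == -1) {
            fprintf(stderr, "filter: %s\n", pcap_geterr(pc));
            return EXIT_FAILURE;
        }

        pcap_loop(pc, -1, handle_packet, NULL);  /* -1: loop until error/break */
        pcap_close(pc);
        return EXIT_SUCCESS;
    }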

The main performance issues in the packet reception process in Linux that affect the packet capturing and processing throughput are the following:

• High interrupt service overhead (per-packet cost): Even a fast processor is overwhelmed by constantly servicing interrupts at high packet arrival rates, having no time to process the packets (receive livelock) [9]. NAPI [10] combines polling with interrupts to solve this problem.

• Kernel-to-user-space context switching (per-packet cost): It takes place when a packet crosses the kernel-to-user-space border, i.e., when a system call is issued for each packet reception. Thus, the user-level packet processing will start several milliseconds later. Using a memory-mapped buffer between kernel and user space for storing packets solves this problem efficiently, since packets are accessible directly from user space without any system calls.

• Data copy and memory allocation costs (per-byte cost): Copying the packet data from the NIC to kernel memory and from kernel-level to user-level memory consumes a significant amount of CPU time. Zero-copy approaches [12–14] have been proposed to reduce such costs.

Braun et al. [15] and Schneider et al. [16] compare the performance of several packet capturing solutions on different operating systems using the same hardware platforms, and provide guidelines for system configuration to achieve optimal performance.

2.2. Using memory mapping between kernel and user-level applications

In order to avoid kernel-to-user-space context switches and packet copies from kernel to user space for each captured packet, a memory-mapped ring buffer shared between kernel and user space is used to store the captured packets. The general principle of memory mapping is to allow access from both kernel and user space to the same memory segment. The user-level applications are then able to read the packets directly from the ring buffer, avoiding context switching to the kernel.

The ring buffer plays the same role as the socket buffer that we described earlier. The kernel is capable of inserting packets captured by the network interface into the ring buffer, while the user is able to read them directly from there. In order to prevent race conditions between the two different processes, an extra header is placed in front of each packet to ensure atomicity while reading and writing packets into the buffer. Whenever the processing of a packet is over, it is marked as read using this header, and the position in which the packet is stored is considered by the kernel as empty. The kernel uses an end pointer that points to the first available position to store the next arrived packet, while the user-level application uses a start pointer that points to the first non-read packet. These two pointers guarantee the proper operation of the circular buffer: the kernel simply iterates through the circular buffer, storing newly arrived packets in empty positions, and blocks whenever the end pointer reaches the last empty position, while the user application processes every packet in sequence as long as there are available packets in the buffer. The main advantage of the memory-mapped circular buffer is that it avoids the context switches from kernel to user level for copying each packet. The latest versions of the Linux kernel and libpcap support this memory mapping functionality through the PACKET_MMAP option that is available in the PACKET socket interface.
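The sketch below shows how an application would set up this interface directly, using the TPACKET_V1 PACKET_RX_RING socket option: the ring is mapped once, frames whose status flag is set to TP_STATUS_USER are processed, and each slot is handed back to the kernel by resetting the flag. The ring dimensions are arbitrary example values and error handling is omitted.

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <poll.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        /* Ring geometry (example values): 64 blocks of 4 KB, 2 KB per frame. */
        struct tpacket_req req = {
            .tp_block_size = 4096, .tp_block_nr = 64,
            .tp_frame_size = 2048, .tp_frame_nr = 128,
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* Map the ring once; the kernel writes frames directly into it. */
        uint8_t *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        for (unsigned i = 0; ; i = (i + 1) % req.tp_frame_nr) {
            struct tpacket_hdr *hdr =
                (struct tpacket_hdr *)(ring + (size_t)i * req.tp_frame_size);

            while (!(hdr->tp_status & TP_STATUS_USER)) {
                struct pollfd pfd = { .fd = fd, .events = POLLIN };
                poll(&pfd, 1, -1);             /* block until the kernel fills a frame */
            }

            const uint8_t *pkt = (const uint8_t *)hdr + hdr->tp_mac;
            /* ... process hdr->tp_snaplen bytes starting at pkt ... */

            hdr->tp_status = TP_STATUS_KERNEL; /* hand the slot back to the kernel */
        }
    }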

Packets are still being copied from the DMA memory allocated by the device driver to the memory-mapped ring buffer through a software interrupt handler. This copy leads to a performance degradation, so zero-copy techniques [13,14,17–19] have been proposed to avoid it. These approaches can avoid this data copy by sharing a memory buffer between all the different network stack layers within the kernel, and between kernel and user space.

2.3. Using specialized hardware

Another possible solution to accelerate packet capturing is to use specialized hardware optimized for high-speed packet capture. For instance, DAG monitoring cards [20] are capable of full packet capture at high speeds. Contrary to commodity network adapters, a DAG card is capable of retrieving and mapping network packets to user space through a zero-copy interface, which avoids costly interrupt processing. It can also stamp each packet with a high-precision timestamp. A large static circular buffer, which is memory-mapped to user space, is used to hold arriving packets and avoid costly packet copies. User applications can directly access this buffer without the invocation of the operating system kernel.

When using DAG cards, the performance problems that occur in Linux packet capturing can be eliminated, but at a price that is prohibitively high for many organizations. On the other hand, commodity hardware is always preferable and much easier to find and deploy for network monitoring. In addition, specialized hardware alone may not be enough for advanced monitoring tasks at high network speeds, e.g., intrusion detection.

3. Locality buffering

The starting point of our work is the observation that several widely used passive network monitoring applications, such as intrusion detection systems, perform almost identical operations for a certain class of packets. At the same time, different packet classes result in the execution of different code paths, and in data accesses to different memory locations. Such packet classes include the packets of a particular network flow, i.e., packets with the same protocol, source and destination IP addresses, and source and destination port numbers, or even wider classes, such as all packets of the same application-level protocol, e.g., all HTTP, FTP, or BitTorrent packets.

Consider for example a NIDS like Snort [1]. Each arriving packet is first decoded according to its Layer 2–4 protocols, then it passes through several preprocessors, which perform various types of processing according to the packet type, and finally it is delivered to the main inspection engine, which checks the packet protocol headers and payload against a set of attack signatures. According to the packet type, different preprocessors may be triggered. For instance, IP packets go through the IP defragmentation preprocessor, which merges fragmented IP packets, TCP packets go through the TCP stream reassembly preprocessor, which reconstructs the bi-directional application-level network stream, while HTTP packets go through the HTTP preprocessor, which decodes and normalizes HTTP protocol fields. Similarly, the inspection engine will check each packet only against a subset of the available attack signatures, according to its type. Thus, packets destined to a Web server will be checked against the subset of signatures tailored to Web attacks, FTP packets will be checked against FTP attack signatures, and so on.

When processing a newly arrived packet, the code of the corresponding preprocessors, the subset of applied signatures, and all other accessed data structures will be fetched into the CPU cache. Since packets of many different types will likely be highly interleaved in the monitored traffic mix, different data structures and code will be constantly alternating in the cache, resulting in cache misses and reduced performance. The same effect occurs in other monitoring applications, such as NetFlow collectors or traffic classification applications, in which arriving packets are classified according to the network flow to which they belong, which results in updates to a corresponding entry of a hash table. If many concurrent flows are active in the monitored link, their packets will arrive interleaved, and thus different portions of the hash table will be constantly transferred in and out of the cache, resulting in poor performance.

The above observations motivated us to explore whether changing the order in which packets are delivered from the OS to the monitoring application improves packet processing performance. Specifically, we speculated that rearranging the captured traffic stream so that packets of the same class are delivered to the application in "batches" would improve the locality of code and data accesses, and thus reduce the overall cache miss ratio. This rearrangement can be conceptually achieved by buffering arriving packets into separate "buckets", one for each packet class, and dispatching each bucket at once, either whenever it gets full, or after some predefined timeout since the arrival of the first packet in the bucket. For instance, if we assume that packets with the same destination port number correspond to the same class, then interleaved packets destined to different network services will be rearranged so that packets destined to the same network service are delivered back-to-back to the monitoring application, as depicted in Fig. 1.

Choosing the destination port number as a class identifier strikes a good balance between the number of required buckets and the achieved locality for commonly used network monitoring applications. Indeed, choosing a more fine-grained classification scheme, such as a combination of the destination IP address and port number, would require a tremendous number of buckets, and would probably just add overhead, since most of the applications of interest to this work perform (5-tuple) flow-based classification. At the same time, packets destined to the same port usually correspond to the same application-level protocol, so they will trigger the same Snort signatures and preprocessors, or will belong to the same or "neighboring" entries in a network flow hash table.

However, sorting the packets by destination port only would completely separate the two directions of each bi-directional flow, i.e., client requests from server responses. This would significantly increase the distance between request and response packets, and in the case of TCP flows, the distance between SYN and SYN/ACK packets. For traffic processing operations that require inspecting both directions of a connection, this would add a significant delay, and eventually decrease memory locality, due to the separation of each bi-directional flow into two parts. Moreover, TCP reassembly would suffer from extreme buffering until the reception of pending ACK packets, or even discard the entire flow. For example, this could happen if ACKs are not received within a timeout period, or if a packet is received before the SYN packet, i.e., before the TCP connection establishment. Furthermore, splitting the two directions of a flow would alter the order in which the packets are delivered to the application. This could cause problems for applications that expect the captured packets to be delivered with monotonically increasing timestamps.

Fig. 1. The effect of locality buffering on the incoming packet stream.

Table 1
Snort's performance using a sorted trace.

Performance metric           Original trace   Sorted trace   Improvement (%)
Throughput (Mbit/s)          473.97           596.15         25.78
Cache misses (per packet)    11.06            1.33           87.98
CPU cycles (per packet)      31,418.91        24,657.98      21.52

Based on the above, we need a sorting scheme that is able to keep the packets of both directions of a flow together, in the same order, and at the same time maintain the benefits of packet sorting based on destination port: good locality and a lightweight implementation. Our choice is based on the observation that the server port number, which commonly characterizes the class of the flow, is usually lower than the client port number, which is usually a high port number randomly chosen by the OS. Also, both directions of a flow have the same pair of port numbers, just in reverse order. Packets in the server-to-client direction have the server's port as source port number. Hence, in most cases, choosing the smaller of the source and destination port numbers of each packet will give us the server's port in both directions. In the case of known services, low ports are almost always used. In the case of peer-to-peer traffic or other applications that may use high server-side port numbers, connections between peers are established using high ports only. However, sorting based on either of these two ports has the same effect on the locality of the application's memory accesses. Sorting always based on the smaller of the two port numbers ensures that packets from both directions will be clustered together, and that their relative order will always be maintained. Thus, our choice is to sort the packets according to the smaller of the source and destination ports.
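As a sketch (assuming the transport-layer ports have already been extracted in host byte order; the function name is illustrative), the class identifier of a packet is simply:

    #include <stdint.h>

    /* Class identifier for locality buffering: the smaller of the two transport
     * ports, so that both directions of a flow fall into the same class and
     * keep their relative order. */
    static inline uint16_t lb_class(uint16_t src_port, uint16_t dst_port)
    {
        return (src_port < dst_port) ? src_port : dst_port;
    }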

3.1. Feasibility estimation

To get an estimate of the feasibility and the magnitude of improvement that locality buffering can offer, we performed a preliminary experiment whereby we sorted off-line the packets of a network trace based on the lowest of the source and destination port numbers, and fed it to a passive monitoring application. This corresponds to applying locality buffering using buckets of infinite size. Details about the trace and the experimental environment are discussed in Section 5. We ran Snort v2.9 [1] using both the sorted and the original trace, and measured the processing throughput (trace size divided by user time), L2 cache misses, and CPU cycles of the application. Snort was configured with all the default preprocessors enabled, as specified in its default configuration file, and used the latest official rule set [21] containing 19,009 rules. The Aho–Corasick algorithm was used for pattern matching [22]. The L2 cache misses and CPU clock cycles were measured using the PAPI library [23], which utilizes the hardware performance counters.
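A sketch of such a measurement with PAPI's preset events (PAPI_L2_TCM for L2 total cache misses and PAPI_TOT_CYC for total cycles) looks roughly as follows; the placement of the counted region is illustrative.

    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counters[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L2_TCM);   /* L2 total cache misses */
        PAPI_add_event(evset, PAPI_TOT_CYC);  /* total CPU cycles */

        PAPI_start(evset);
        /* ... run the packet processing loop being measured ... */
        PAPI_stop(evset, counters);

        printf("L2 misses: %lld  cycles: %lld\n", counters[0], counters[1]);
        return 0;
    }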

Table 1 summarizes the results of this experiment (each measurement was repeated 100 times, and we report the average values). We observe that sorting results in a significant improvement of more than 25% in Snort's packet processing throughput, L2 cache misses are reduced by more than 8 times, and 21% fewer CPU cycles are consumed.

From the above experiment, we see that there is a significant potential for improvement in packet processing throughput using locality buffering. However, in practice, rearranging the packets of a continuous packet stream can only be done over short intervals, since we cannot wait indefinitely to gather an arbitrarily large number of packets of the same class before delivering them to the monitoring application; the captured packets have to eventually be delivered to the application within a short time interval (in our implementation, on the order of milliseconds). Note that slightly relaxing the in-order delivery of the captured packets results in a delay between capturing a packet and actually delivering it to the monitoring application. However, such a sub-second delay does not affect the correct operation of the monitoring applications that we consider in this work (delivering an alert or reporting a flow record a few milliseconds later is totally acceptable). Furthermore, packet timestamps are computed before locality buffering and are not altered in any way, so any inter-packet time dependencies remain intact.

4. Implementation within libpcap

We have chosen to implement locality buffering within libpcap, the most widely used packet capturing library, which is the basis for a multitude of passive monitoring applications. Typically, applications read the captured packets through a call such as pcap_next or pcap_loop, one at a time, in the same order as they arrive at the network interface. By incorporating locality buffering within libpcap, monitoring applications continue to operate as before, taking advantage of locality buffering in a transparent way, without the need to alter their code or link them with extra libraries. Indeed, the only difference is that consecutive calls to pcap_next or similar functions will most of the time return packets with the same destination or source port number, depending on the availability and the time constraints, instead of highly interleaved packets with different port numbers.

4.1. Periodic packet stream sorting

In libpcap, whenever the application attempts to read a new packet, e.g., through a call to pcap_next, the library reads a packet from the kernel and delivers it to the application. Using pcap_loop, the application registers a callback function for packet processing that is called once for each captured packet read from the kernel by libpcap. In case memory mapping is not supported, the packet is copied through a recv call from kernel space to user space into a small buffer equal to the maximum packet size, and then pcap_next returns a pointer to the beginning of the new packet, or the callback function registered by pcap_loop is called. With memory mapping, the next packet stored by the kernel in the shared ring buffer is returned to the application or processed by the callback function. If no packets are stored, poll is called to wait for the next packet reception.


So far, we have conceptually described locality buffering as a set of buckets, with packets having the same source or destination port ending up in the same bucket. One straightforward implementation of this approach would be to actually maintain a separate buffer for each bucket, and copy each arriving packet to its corresponding buffer. However, this has the drawback that an extra copy is required for storing each packet to the corresponding bucket, right after it has been fetched from the kernel.

In order to avoid additional packet copy operations, which incur significant overhead, we have chosen an alternative approach. We distinguish between two different phases: the packet gathering phase and the packet delivery phase. In the case without memory mapping, we have modified the single-packet-sized buffer of libpcap to hold a large number of packets instead of just one. During the packet gathering phase, newly arrived packets are written sequentially into the buffer by increasing the buffer offset in the recv call until the buffer is full or a certain timeout has expired. For the libpcap implementation with memory mapping support, the shared buffer is split into two parts. The first part of the buffer is used for gathering packets in the gathering phase, and the second part for delivering packets based on the imposed sorting. The gathering phase lasts either until the buffer used for packet gathering gets full or until a timeout period expires.

Instead of arranging the packets into different buckets, which requires an extra copy operation for each packet, we maintain an index structure that specifies the order in which the packets in the buffer will be delivered to the application during the delivery phase, as illustrated in Fig. 2. The index consists of a table with 64 K entries, one for each port number. Each entry in the table points to the beginning of a linked list that holds references to all packets within the buffer with the particular port number. In the packet delivery phase, the packets are delivered to the application ordered according to their smaller port by traversing each list sequentially, starting from the first non-empty port number entry. In this way we achieve the desired packet sorting, while, at the same time, all packets remain in place, at the initial memory location in which they were written, avoiding extra costly copy operations. In the following, we discuss the two phases in more detail.

Fig. 2. Using an indexing table with a linked list for each port, the packets are delivered to the application sorted by their smaller port number.

In the beginning of each packet gathering phase, the indexing table is zeroed using memset(). For each arriving packet, we perform simple protocol decoding to determine whether it is a TCP or UDP packet, and extract its source and destination port numbers. Then, a new reference for the packet is added to the corresponding linked list. For non-TCP or non-UDP packets, a reference is added into a separate list. The information that we keep for every packet in each node of the linked lists includes the packet's length, the precise timestamp of the time when the packet was captured, and a pointer to the actual packet data in the buffer.

Instead of dynamically allocating memory for new nodes in the linked lists, which would be overkill, we pre-allocate a sufficiently large number of spare nodes, equal to the maximum number of packets that can be stored in the buffer. Whenever a new reference has to be added to a linked list, a spare node is picked. Also, for fast insertion of new nodes at the end of a list, we keep a table with 64 K pointers to the tail of each list.
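A sketch of this index, with illustrative names and sizes (the real implementation lives inside libpcap and sizes the node pool to the capacity of the gathering buffer), is shown below.

    #include <stdint.h>
    #include <string.h>
    #include <sys/time.h>

    #define LB_PORTS    65536   /* one class per port, plus one for non-TCP/UDP */
    #define LB_MAX_PKTS 8192    /* illustrative capacity of the gathering buffer */

    /* One pre-allocated node per packet slot: no allocations on the fast path. */
    struct lb_node {
        uint32_t        len;    /* captured length */
        struct timeval  ts;     /* capture timestamp */
        const uint8_t  *pkt;    /* pointer into the packet buffer (no copy) */
        struct lb_node *next;
    };

    struct lb_index {
        struct lb_node *head[LB_PORTS + 1];  /* first packet of each class */
        struct lb_node *tail[LB_PORTS + 1];  /* for O(1) appends */
        struct lb_node  pool[LB_MAX_PKTS];   /* spare nodes, reused every phase */
        unsigned        used;                /* nodes consumed in this phase */
    };

    /* Reset the index at the start of a packet gathering phase. */
    static void lb_reset(struct lb_index *ix)
    {
        memset(ix->head, 0, sizeof(ix->head));
        memset(ix->tail, 0, sizeof(ix->tail));
        ix->used = 0;
    }

    /* Append a reference to a packet of class cls (its smaller port number,
     * or LB_PORTS for non-TCP/UDP traffic). */
    static void lb_append(struct lb_index *ix, unsigned cls,
                          const uint8_t *pkt, uint32_t len, struct timeval ts)
    {
        struct lb_node *n = &ix->pool[ix->used++];
        n->pkt = pkt; n->len = len; n->ts = ts; n->next = NULL;
        if (ix->tail[cls] != NULL) ix->tail[cls]->next = n;
        else                       ix->head[cls]       = n;
        ix->tail[cls] = n;
    }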

The overhead of this indexing process is negligible. We measured it using a simple analysis application, which just receives packets in user space and then discards them, resulting in less than 6% overhead for any traffic rate. This is because most of the CPU time in this application is spent on capturing packets and delivering them to user space. The overhead of finding the port numbers and adding a node to our data structure for each packet is negligible compared to packet capturing and other per-packet overheads. The overhead of zeroing the indexing table (through a memset() call) is also negligible, since we do it once for a large group of packets. In this measurement we used a simple analysis application which does not benefit from improved cache memory locality. For real-world applications this overhead is even smaller, and, as we observe in our experimental evaluation (Section 5), the benefits from the memory locality enhancements far outweigh this overhead.

The system continues to gather packets until the buffer becomes full or a certain timeout has elapsed. The timeout ensures that if packets arrive at a low rate, the application will not wait too long for receiving the next batch of packets. The buffer size and the timeout are two significant parameters of our approach, since they influence the number of sorted packets that can be delivered to the application in each batch. Both the timeout and the buffer size can be defined by the application. Depending on the per-packet processing complexity of each application, the buffer size determines the benefit to its performance. In Section 5 we examine the effect that the number of packets in each batch has on the overall performance using three different passive monitoring applications. The timeout parameter is mostly related to the network's workload.
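The termination test of the gathering phase can be sketched as follows; the names and the gettimeofday-based clock are illustrative, and both parameters are assumed to be supplied by the application.

    #include <sys/time.h>

    /* Stop gathering when the buffer is full or the timeout has expired. */
    static int lb_gathering_done(unsigned npkts, unsigned capacity,
                                 struct timeval phase_start, long timeout_ms)
    {
        struct timeval now;
        gettimeofday(&now, NULL);
        long elapsed_ms = (now.tv_sec  - phase_start.tv_sec)  * 1000L +
                          (now.tv_usec - phase_start.tv_usec) / 1000L;
        return npkts >= capacity || elapsed_ms >= timeout_ms;
    }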



Upon the end of the packet gathering phase, packets can be delivered to the application following the order imposed by the indexing structure. For that purpose, we keep a pointer to the list node of the most recently delivered packet. Starting from the beginning of the index table, whenever the application requests a new packet, e.g., through pcap_next, we return the packet pointed to either by the next node in the list, or, if we have reached the end of the list, by the first node of the next non-empty list. The latter happens when all the packets of the same port have been delivered (i.e., the bucket has been emptied), so conceptually the system continues with the next non-empty group.
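Continuing the index sketch above, the delivery phase reduces to walking the per-class lists in port order with a small cursor; each call returns the reference handed to pcap_next or to the pcap_loop callback.

    /* Cursor over the lb_index sketched earlier. */
    struct lb_cursor {
        struct lb_index *ix;
        unsigned         cls;    /* class currently being drained */
        struct lb_node  *node;   /* next node to deliver, or NULL */
    };

    /* Return the next packet reference in sorted order, or NULL when the
     * whole batch has been delivered. */
    static const struct lb_node *lb_next(struct lb_cursor *c)
    {
        while (c->node == NULL) {               /* current class exhausted */
            if (c->cls > LB_PORTS)
                return NULL;                    /* all classes delivered */
            c->node = c->ix->head[c->cls++];    /* move to the next non-empty class */
        }
        struct lb_node *n = c->node;
        c->node = n->next;
        return n;
    }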

4.2. Using a separate thread for packet gathering

In case memory mapping is not supported by the system, a single buffer will be used for both packet gathering and delivery. A drawback of the above implementation is that during the packet gathering phase, the CPU remains idle most of the time, since no packets are delivered to the application for processing in the meanwhile. Conversely, during the processing of the packets that were captured in the previous packet gathering period, no packets are stored in the buffer. If the kernel's socket buffer is small and the processing time for the current batch of packets is increased, a significant number of packets may be lost by the application under high traffic load.

Although in practice this effect does not degrade performance when short timeouts are used, we can further improve the performance of locality buffering in this case by employing a separate thread for the packet gathering phase, combined with the use of two buffers instead of a single one. The separate packet gathering thread receives the packets from the kernel and stores them in the write buffer, and also updates its index. In parallel, the application receives packets for processing from the main thread of libpcap, which returns the already sorted packets of the second, read buffer. Each buffer has its own indexing table.

Upon the completion of both the packet gathering phase, i.e., after the timeout expires or when the write buffer becomes full, and the parallel packet delivery phase, the two buffers are swapped. The write buffer, which is now full of packets, becomes a read buffer, while the now empty read buffer becomes a write buffer. The whole swapping process is as simple as swapping two pointers, while semaphore operations ensure the thread-safe exchange of the two buffers.
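A sketch of this exchange with POSIX semaphores is shown below (names are illustrative; buffer filling and index building are elided): the gathering thread, started with pthread_create, fills bufs[w] while the delivery path drains the other buffer, and the two semaphores serialize the pointer swap.

    #include <pthread.h>
    #include <semaphore.h>

    struct lb_buffer { unsigned char *pkts; struct lb_index ix; };  /* see earlier sketch */

    static struct lb_buffer bufs[2];
    static int   w = 0;            /* index of the buffer currently being written */
    static sem_t reader_done;      /* init to 1: the read buffer starts out drained */
    static sem_t writer_done;      /* init to 0: no sorted buffer is ready yet */

    /* Gathering thread: fill the write buffer, then swap. */
    static void *gatherer(void *arg)
    {
        for (;;) {
            /* ... fill bufs[w] and its index until full or timeout ... */
            sem_wait(&reader_done);   /* wait until the other buffer is drained */
            w ^= 1;                   /* pointer swap: the filled buffer becomes readable */
            sem_post(&writer_done);   /* announce a sorted buffer */
        }
        return NULL;
    }

    /* Delivery path inside libpcap: obtain the next sorted buffer. */
    static struct lb_buffer *next_read_buffer(void)
    {
        sem_wait(&writer_done);       /* block until the gatherer hands one over */
        return &bufs[w ^ 1];          /* the buffer the gatherer just released */
    }

    /* Called when the read buffer has been fully delivered to the application. */
    static void read_buffer_drained(void)
    {
        sem_post(&reader_done);
    }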

4.3. Combining locality buffering and memory mapping

A step beyond is to combine locality buffering with memory mapping to further increase the performance of each individual technique. While memory mapping improves the performance of packet capturing, locality buffering aims to improve the performance of the user application that processes the captured packets.

As we described in Section 2.2, the buffer where the network packets are stored in libpcap with memory mapping support is accessible from both the kernel and the libpcap library. The packets are stored sequentially into this buffer by the kernel as they arrive, while the libpcap library allows a monitoring application to process them by returning a pointer to the next packet through pcap_next or by calling the callback function registered through pcap_loop for each packet that arrives. In case the buffer is empty, libpcap blocks, calling poll, waiting for new packets to arrive.

After finishing with the processing of each packet, through the callback function or when the next pcap_next is called, libpcap marks the packet as read so that the kernel can later overwrite the packet with a new one. Otherwise, if a packet is marked as unread, the kernel is not allowed to copy a new packet into this position of the buffer. In this way, any possible data corruption that could happen through the parallel execution of the two processes (kernel and monitoring application) is avoided.

The implementation of locality buffering in the memory-mapped version of libpcap does not require maintaining a separate buffer for sorting the arriving packets, since we have direct access to the shared memory-mapped buffer in which they are stored. To deliver the packets to the application sorted based on the source or destination port number, we process a small portion of the shared buffer at a time, as a batch: instead of executing the handler function every time a new packet is pushed into the buffer, we wait until a certain number of packets has been gathered or a certain amount of time has elapsed. The batch of packets is then ordered based on the smaller of the source and destination port numbers.

The sorting of the packets is performed as described in Section 4.1. The same indexing structure, as depicted in Fig. 2, is built to support the sorting. The structure contains pointers directly to the packets in the shared buffer. Then, the handler function is applied iteratively on each indexed packet based on the order imposed by the indexing structure. After the completion of the handler function, the packet is marked for deletion as before, in order to avoid any race conditions between the kernel process and the user-level library.
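Reusing the cursor from the sketch in Section 4.1, and assuming each index node additionally records a pointer to its frame's tpacket_hdr (called "frame" here, an assumed extra field not shown in the earlier struct), the delivery loop for the memory-mapped case would look roughly like this:

    #include <linux/if_packet.h>
    #include <pcap/pcap.h>

    /* Deliver one sorted batch to the application's handler, then release each
     * ring slot back to the kernel.  cb/user are the pcap_loop callback and its
     * argument; "frame" is the assumed extra field in struct lb_node. */
    static void lb_deliver_batch(struct lb_index *ix, pcap_handler cb, u_char *user)
    {
        struct lb_cursor cur = { .ix = ix, .cls = 0, .node = NULL };
        const struct lb_node *n;

        while ((n = lb_next(&cur)) != NULL) {
            /* snaplen is used for both lengths in this sketch */
            struct pcap_pkthdr ph = { .ts = n->ts, .caplen = n->len, .len = n->len };
            cb(user, &ph, n->pkt);                   /* application processing */
            n->frame->tp_status = TP_STATUS_KERNEL;  /* slot may now be overwritten */
        }
    }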

A possible weakness of not using an extra buffer, as described in Section 4.2, is that if the batch of packets is large in comparison to the shared buffer, a significant number of packets may be lost during the sorting phase under high traffic load. However, as discussed in Section 5, the fraction of the packets that we need to sort is very small compared to the size of the shared buffer. Therefore, it does not affect the insertion of new packets in the meanwhile.

In the case of memory mapping, a separate thread for the packet gathering phase is not required. New incoming packets are captured and stored into the shared buffer by the kernel in parallel with the packet delivery and processing phase, since the kernel and the user-level application (including the libpcap library) are two different processes. Packets that have been previously stored in the buffer by the kernel are sorted in batches during the gathering phase, and then each sorted batch of packets is delivered one-by-one to the application for further processing.

5. Experimental evaluation

5.1. Experimental environment

Our experimental environment consists of two PCs interconnected through a 10 Gbit switch. The first PC is used for traffic generation, which is achieved by replaying real network traffic traces at different rates using tcpreplay [24]. The traffic generation PC is equipped with two dual-core Intel Xeon 2.66 GHz CPUs with 4 MB L2 cache, 6 GB RAM, and a 10 Gbit network interface (SMC 10G adapter with XFP). This setup allowed us to replay traffic traces at speeds up to 2 Gbit/s. Achieving higher speeds was not possible using large network traces, because the trace usually could not be effectively cached in main memory.

By rewriting the source and destination MAC addresses of all packets, the generated traffic is sent to the second PC, the passive monitoring sensor, which captures the traffic and processes it using different monitoring applications. The passive monitoring sensor is equipped with two quad-core Intel Xeon 2.00 GHz CPUs with 6 MB L2 cache, 6 GB RAM, and a 10 Gbit network interface (SMC 10G adapter with XFP). The size of the memory-mapped buffer was set to 60,000 frames in all cases, in order to minimize packet drops due to short packet bursts. Indeed, we observe that when packets are dropped by the kernel, at higher traffic rates, the CPU utilization in the passive monitoring sensor is always 100%. Thus, in our experiments, packets are lost due to the high CPU load. Both PCs run 64-bit Ubuntu Linux (kernel version 2.6.32).

Fig. 3. Snort's CPU cycles as a function of the buffer size for 250 Mbit/s traffic.

Fig. 4. Snort's L2 cache misses as a function of the buffer size for 250 Mbit/s traffic.

Fig. 5. Snort's packet loss ratio as a function of the buffer size for 2 Gbit/s traffic.

For the evaluation we use a one-hour full-payload trace captured at the access link that connects an educational network with thousands of hosts to the Internet. The trace contains 58,714,906 packets, corresponding to 1,493,032 different flows, totalling more than 40 GB in size. To achieve high speeds, up to 2 Gbit/s, we split the trace into a few smaller parts, which can be effectively cached in the 6 GB of main memory, and we replay each part of the trace 10 times in each experiment.

We measure the performance of the monitoring applications on top of the original version of libpcap-1.1.1 and our modified version with locality buffering. The latter combines locality buffering with memory mapping. For each setting, we measure the L2 cache misses and the CPU clock cycles by reading the CPU performance counters through the PAPI library [23]. Another important metric we measure is the percentage of packets being dropped by libpcap, which occurs when replaying traffic at high rates due to high CPU utilization.

Traffic generation begins after the application has been initiated. The application is terminated immediately after capturing the last packet of the replayed trace. All measurements were repeated 10 times and we report the average values. We focus mostly on the discussion of our experiments with the Snort IDS, which is the most resource-intensive among the tested applications. However, we also briefly report on our experience with the Appmon and Fprobe monitoring applications.

5.2. Snort

As in the experiments of Section 3.1, we ran Snort v2.9 using its default configuration, in which all the default preprocessors were enabled, and we used the latest official rule set [21] containing 19,009 rules. Initially, we examine the effect that the size of the buffer in which the packets are sorted has on the overall application performance. We vary the size of the buffer from 100 to 32,000 packets while replaying the network trace at a constant rate of 250 Mbit/s. We send traffic at several rates, but we first present results for a constant rate of 250 Mbit/s, at which no packets were dropped, to examine the effect of the buffer size on the CPU cycles spent and the L2 cache misses when no packets are lost. We do not use any timeout for packet gathering in these experiments. As long as we send traffic at a constant rate, the buffer size determines how long the packet gathering phase will last. Respectively, a timeout value corresponds to a specific buffer size.

Figs. 3 and 4 show the per-packet CPU cycles and L2 cache misses, respectively, when Snort processes the replayed traffic using the original and modified versions of libpcap. Both libpcap versions use the memory mapping support, with the same size for the shared packet buffer (60,000 frames) for fairness. Fig. 5 presents the percentage of packets that are dropped by Snort when replaying the traffic at 2 Gbit/s, for each version of libpcap.

We observe that increasing the size of the buffer results in fewer cache misses, fewer clock cycles, fewer dropped packets, and generally in an overall performance improvement for the locality buffering implementations. This is because a larger packet buffer offers better possibilities for effective packet sorting, and thus better memory locality. However, increasing the size from 8,000 to 32,000 packets gives only a slight improvement. Based on this result, we consider 8,000 packets to be the optimal buffer size in our experiments. When sending traffic at a constant rate of 250 Mbit/s, with no timeout specified, a buffer size of 8,000 packets roughly corresponds to a 128 millisecond period on average.

Fig. 7. CPU utilization in the passive monitoring sensor when running Snort, as a function of the traffic speed for an 8,000-packet locality buffer.

We also notice that using locality buffering we achieve a significant reduction in L2 cache misses, from 13.4 per packet to 4.1 when using an 8,000-packet buffer, which is an improvement of 3.3 times over Snort with the original libpcap library. Therefore, Snort's user time and clock cycles are significantly reduced using locality buffering, making it faster by more than 20%. Moreover, when replaying the traffic at 2 Gbit/s, the packet loss ratio is reduced by 33%. Thus, Snort with locality buffering and memory-mapped libpcap performs significantly better than with the original libpcap with memory mapping support. When replaying the trace at low traffic rates, with no packet loss, Snort outputs the same set of alerts with and without locality buffering, so the packet reordering does not affect the correct operation of Snort's detection process.

We repeated the experiment by replaying the trace at different rates ranging from 100 to 2,000 Mbit/s, and in every case we observed similar behavior. At all rates, 8,000 packets was found to be the optimal buffer size. Using this buffer size, locality buffering results at all rates in a significant reduction in Snort's cache misses and CPU cycles, similar to the improvement observed for 250 Mbit/s traffic against the original libpcap. The optimal buffer size depends mainly on the nature of the traffic in the monitored network and on the processing performed by the network monitoring application.

An important metric for evaluating the performance of our implementations is the percentage of packets that are dropped at high traffic rates by the kernel due to high CPU load, and the maximum processing throughput that Snort can sustain without dropping packets. In Fig. 6 we plot the average percentage of packets that are lost while replaying the trace at speeds ranging from 100 to 2,000 Mbit/s, in steps of 100 Mbit/s. The 2,000 Mbit/s limitation is due to caching the trace file parts from disk to main memory in the traffic generator machine, in order to generate real network traffic. We used an 8,000-packet locality buffer, which was found to be the optimal size for Snort when replaying our trace file at any rate.

Using the unmodified libpcap with memory mapping, Snort cannot process all packets at rates higher than 600 Mbit/s, so a significant percentage of packets is lost. On the other hand, using locality buffering, packet processing is accelerated and the system is able to process more packets in the same time interval. As shown in Fig. 6, using locality buffering Snort becomes much more resistant to packet loss, and starts to lose packets at 1 Gbit/s instead of 600 Mbit/s. Moreover, at 2 Gbit/s, our implementation drops 33% fewer packets than the original libpcap.

Fig. 6. Packet loss ratio of the passive monitoring sensor when running Snort, as a function of the traffic speed for an 8,000-packet locality buffer.

Fig. 7 shows Snort's CPU utilization as a function of the traffic rate, for rates varying from 100 Mbit/s to 2 Gbit/s with a 100 Mbit/s step, and a locality buffer size of 8,000 packets. We observe that for low speeds, locality buffering reduces the number of CPU cycles spent on packet processing due to the improved memory locality. For instance, at 500 Mbit/s, Snort's CPU utilization with locality buffering is reduced by 20%. The CPU utilization when Snort does not use locality buffering exceeds 90% for rates higher than 600 Mbit/s, and reaches up to 98.3% at 1 Gbit/s. On the other hand, using locality buffering, Snort's CPU utilization exceeds 90% only for rates higher than 1 Gbit/s, and reaches about 97% at a rate of 1.5 Gbit/s. We also observe that packet loss events occur due to high CPU utilization, when it approaches 100%. Without locality buffering, 92.7% CPU utilization at 700 Mbit/s results in a 1.4% packet loss rate, while 98% utilization at 1.1 Gbit/s results in 47.2% packet loss.

In Fig. 8 we plot the average percentage of dropped packets while replaying traffic with its normal timing behavior, instead of sending traffic at a constant rate. We replay the traffic of our trace based on its normal traffic patterns at the time the trace was captured, using the multiplier option of the tcpreplay [24] tool. Thus, we are able to replay the trace at the speed at which it was recorded, which is 88 Mbit/s on average, or at a multiple of this speed. In this experiment we examine the performance of Snort using the original and our modified libpcap in the case of normal traffic patterns with packet bursts, instead of constant traffic rates. We send the trace using multiples from 1 up to 16, and we plot the percentage of dropped packets as a function of this multiplication factor. We use a buffer size of 8,000 packets and a 100 ms timeout.

Fig. 8. Packet loss ratio of the passive monitoring sensor when running Snort, as a function of the actual trace's speed multiple, with an 8,000-packet locality buffer and 100 ms timeout.

We observe that locality buffering reduces the percentage of dropped packets at higher traffic rates, i.e., when using larger multiplication factors. Snort with locality buffering starts dropping packets when sending traffic eight times faster than the actual speed of the trace, while Snort with the original libpcap drops packets already at four times the actual speed. When replaying traffic 16 times faster than the recorded speed, which corresponds to 1,408 Mbit/s on average, Snort with locality buffering drops 34% fewer packets.

Since both versions of libpcap use a ring buffer of the same size for storing packets, they are resistant to packet bursts to a similar degree. However, libpcap with locality buffering is faster, due to the improved memory locality, and is therefore more resistant to packet drops in cases of overload and traffic bursts.

Fig. 10. Appmon's L2 cache misses as a function of the buffer size for 500 Mbit/s traffic.

5.3. Appmon

Appmon [2] is a passive network monitoring application for accurate per-application traffic identification and categorization. It uses deep packet inspection and packet filtering for attributing flows to the applications that generate them. We ran Appmon on top of our modified version of libpcap to examine its performance using different buffer sizes, varying from 100 to 32,000 packets, and compared it with the original libpcap. Fig. 9 presents Appmon's CPU cycles and Fig. 10 its L2 cache misses, measured while replaying the trace at a constant rate of 500 Mbit/s. At this rate no packet loss occurred.

The results show that Appmon's performance improves with the locality buffering implementation, which reduces the CPU cycles by about 16% compared to Appmon using the original libpcap. Cache misses are reduced by up to 31% for a 32,000-packet buffer and by 28% for 16,000 packets. We notice that in the case of Appmon the optimum buffer size is around 16,000 packets, whereas for Snort a buffer of 8,000 packets is enough to optimize performance. This happens because Appmon is not as CPU-intensive as Snort, so a larger number of packets needs to be sorted in order to achieve a significant performance improvement.

We ran Appmon with traffic rates varying from 250 to 2,000 Mbit/s and observed similar results. Since Appmon performs less processing than Snort, fewer packets are dropped at high rates. The output of Appmon remains identical in all cases, which means that the periodic packet stream sorting does not affect the correct operation of Appmon's classification process.


Fig. 9. Appmon’s CPU cycles as a function of the buffer size for 500 Mbit/s traffic.

5.4. Fprobe

Fprobe [3] is a passive monitoring application that collects traffic statistics for each active flow and exports the corresponding NetFlow records. We ran Fprobe with the original and our modified version of libpcap and performed the same measurements as with Appmon.

Fig. 11 plots the CPU cycles and Fig. 12 the L2 cache misses of the Fprobe variants for buffer sizes from 100 up to 32,000 packets, while replaying the trace at a rate of 500 Mbit/s.

We notice a speedup of about 11% in Fprobe when locality buffering is enabled with a 4,000-packet buffer, while cache misses are reduced by 19%. The buffer size that optimizes the overall performance of Fprobe in this setup is around 4,000 packets. No packet loss occurred for Fprobe at any traffic rate. Fprobe is even less CPU-intensive than Appmon and Snort, since it performs only a few operations per packet. The time spent in the kernel for packet capturing is significantly larger than the time spent in user space. Thus, Fprobe benefits less from the locality buffering enhancements. For passive monitoring applications in general, the performance improvement due to locality buffering increases as the time spent in user space increases, and also depends on memory access patterns. Similar results were observed for all rates of the replayed traffic.

Fig. 11. Fprobe's CPU cycles as a function of the buffer size for 500 Mbit/s traffic.

Fig. 12. Fprobe's L2 cache misses as a function of the buffer size for 500 Mbit/s traffic.

6. Related work

The concept of locality buffering for improving passive network monitoring applications, and, in particular, intrusion detection and prevention systems, was first introduced by Xinidis et al. [25], as part of a load balancing traffic splitter for multiple network intrusion detection sensors that operate in parallel. In that work, the load balancer splits the traffic to multiple intrusion detection sensors, so that similar packets (e.g., packets destined to the same port) are processed by the same sensor. However, in that approach the splitter uses a limited number of locality buffers and copies each packet to the appropriate buffer based on a hash of its destination port number. Our approach differs in two major aspects. First, we have implemented locality buffering within a packet capturing library, instead of a separate network element. To the best of our knowledge, our prototype implementation within the libpcap library is the first attempt to provide memory locality enhancements that accelerate packet processing in a generic and transparent way for existing passive monitoring applications. Second, a major improvement of our approach is that packets are not actually copied into separate locality buffers. Instead, we maintain a separate index, which allows the number of locality buffers to scale up to 64K.
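To illustrate the indexing scheme, the following is a minimal sketch in C of how per-port index lists over a shared packet buffer could be organized; the structure names, the 8,000-packet capacity, and the deliver callback are illustrative assumptions rather than the actual data structures used in our libpcap implementation.

```c
/*
 * Minimal sketch of an index-based locality buffer: instead of copying
 * packets into per-port buffers, we keep one index list per port that
 * stores offsets into a shared packet buffer.  Names and sizes are
 * illustrative assumptions.  The structure is large, so a real
 * implementation would heap-allocate it once at startup.
 */
#include <stdint.h>
#include <string.h>

#define NUM_PORTS   65536        /* one locality buffer (index list) per port */
#define MAX_PACKETS 8000         /* assumed locality buffer size in packets   */

struct pkt_ref {
    uint32_t offset;             /* offset of the packet in the shared buffer */
    uint32_t caplen;             /* captured length */
    int32_t  next;               /* next packet with the same port, -1 = end  */
};

struct locality_index {
    struct pkt_ref refs[MAX_PACKETS];
    int32_t head[NUM_PORTS];     /* first packet indexed under each port */
    int32_t tail[NUM_PORTS];     /* last packet indexed under each port  */
    uint32_t count;
};

static void index_init(struct locality_index *idx)
{
    memset(idx->head, 0xff, sizeof(idx->head));  /* all-ones bytes == -1 */
    memset(idx->tail, 0xff, sizeof(idx->tail));
    idx->count = 0;
}

/* Append a captured packet to the index list of its "indexing" port. */
static int index_add(struct locality_index *idx, uint16_t port,
                     uint32_t offset, uint32_t caplen)
{
    if (idx->count >= MAX_PACKETS)
        return -1;                               /* buffer full: flush first */

    int32_t i = (int32_t)idx->count++;
    idx->refs[i] = (struct pkt_ref){ offset, caplen, -1 };

    if (idx->tail[port] < 0)
        idx->head[port] = i;                     /* first packet for this port */
    else
        idx->refs[idx->tail[port]].next = i;     /* link after the previous one */
    idx->tail[port] = i;
    return 0;
}

/* Deliver all indexed packets grouped by port, then reset the index. */
static void index_flush(struct locality_index *idx,
                        void (*deliver)(uint32_t offset, uint32_t caplen))
{
    for (int port = 0; port < NUM_PORTS; port++)
        for (int32_t i = idx->head[port]; i >= 0; i = idx->refs[i].next)
            deliver(idx->refs[i].offset, idx->refs[i].caplen);
    index_init(idx);
}
```

Flushing the index walks the ports in increasing order, so packets with the same port are delivered back-to-back without ever having been copied out of the shared buffer; the deliver callback stands in for handing each packet to the application's pcap callback.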

Locality enhancing techniques for improving server performance have been widely studied. For instance, Markatos et al. [26] present techniques for improving request locality in a Web cache, which result in significant improvements in file system performance.

Braun et al. [15] and Schneider et al. [27] compare the performance of packet capturing libraries on different operating systems using the same hardware platforms and provide guidelines for system configuration to achieve optimal performance. Several research efforts [12–14,17] have focused on improving the performance of packet capturing through kernel and library modifications. These approaches reduce the time spent in the kernel and the number of memory copies required for delivering each packet to the application. In contrast, our locality buffering approach aims to improve the packet processing performance of the monitoring application itself, by exploiting the inherent locality of the in-memory workload of the application. Moreover, such approaches can be integrated with our locality buffering technique to achieve the maximum possible improvement for CPU-intensive passive monitoring applications running on commodity hardware.

Schneider et al. [16] show that commodity hardware and software may be able to capture traffic at rates up to 1 Gbit/s, the main limitations being bus bandwidth and CPU load. To cope with this limitation, the authors propose a monitoring architecture for higher speed interfaces that splits the traffic across a set of nodes with lower speed interfaces, using a feature of current Ethernet switches. The detailed configuration for load balancing is left to the application. Vallentin et al. [28] present a NIDS cluster based on commodity PCs. A number of front-end nodes are responsible for distributing the traffic across the cluster's back-end nodes. Several traffic distribution schemes are discussed, focused on minimizing the communication between the sensors and keeping them simple enough to be implemented effectively in the front-end nodes; hashing a flow identifier is proposed as the right choice. A locality-aware approach in such architectures would cluster similar flows onto the same node, by sorting and splitting packets based on source and destination port numbers, to improve memory locality and cache usage in each individual node.

Fusco and Deri [29] utilize the RSS feature of modern NICs [30], which splits the network traffic into multiple RX queues, usually equal in number to the CPU cores, to parallelize packet capturing across all cores. Moreover, packets are copied directly from each hardware queue to a corresponding ring buffer, which is exposed at user level as a virtual network interface. Thus, applications can easily and efficiently split the load across multiple threads or processes without contention. The load distribution across the queues is based on a hash function, so it is not optimized in terms of memory locality and L2 cache usage. A locality-aware load distribution algorithm, like our approach, would assign packets to queues based on their source and destination port numbers. Thus, both directions of each flow, as well as similar flows, would fall into the same queue and would be processed by the same thread on the same CPU core. Locality-aware and flow-based load balancing approaches could be combined to handle cases where most of the flows share a common port number, so that these flows can be split across more than one CPU core.
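As a concrete illustration of such a locality-aware distribution policy, the sketch below shows one way the queue-selection logic could look in C; the function names, the "hot port" fallback, and the symmetric hash are assumptions for illustration, not the hash function implemented by RSS-capable NICs or by our library.

```c
#include <stdint.h>

/* Illustrative sketch: choose an RX queue (or thread) so that both
 * directions of a flow and similar flows end up on the same queue,
 * which is the locality-aware policy discussed above. */
static unsigned locality_queue(uint16_t src_port, uint16_t dst_port,
                               unsigned n_queues)
{
    /* Treat the smaller port as the application's well-known port, so
     * that e.g. both 80->52344 and 52344->80 map to the same queue. */
    uint16_t app_port = src_port < dst_port ? src_port : dst_port;
    return app_port % n_queues;
}

/* Hypothetical combination with flow-based balancing: when one port
 * dominates the traffic, fall back to hashing the whole flow tuple so
 * that its flows can spread over several cores. */
static unsigned combined_queue(uint32_t src_ip, uint32_t dst_ip,
                               uint16_t src_port, uint16_t dst_port,
                               unsigned n_queues, int port_is_hot)
{
    if (!port_is_hot)
        return locality_queue(src_port, dst_port, n_queues);
    /* Symmetric hash so both directions still hit the same queue. */
    uint32_t h = (src_ip ^ dst_ip) ^ ((uint32_t)(src_port ^ dst_port) << 16);
    return h % n_queues;
}
```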

Sommer et al. [31] utilize multicore processors to parallelize event-based network intrusion prevention systems, using multiple event queues that collect semantically related events for in-order execution. Since the events are related, keeping them within a single queue localizes the memory accesses to shared state made by the same thread. Our approach, on the other hand, clusters together similar packets, i.e., packets belonging to the same or similar flows, with the same or nearby port numbers.

7. Discussion

We consider two main limitations of our locality buffering approach. The first limitation is that the packet reordering we impose results in non-monotonically increasing timestamps across different flows (monotonically increasing timestamps are guaranteed only within each bi-directional flow). Therefore, applications that require this property, e.g., for handling connection timeouts, may have problems when dealing with non-monotonically increasing timestamps. Such applications should either be modified to handle this issue, or they cannot use our approach.
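As a hypothetical example of such a modification, an application that expires idle connections could advance its notion of time using the maximum timestamp seen so far, instead of assuming that packet timestamps arrive in globally increasing order; the sketch below illustrates this idea, with all names being illustrative assumptions.

```c
#include <stdint.h>

/* Hypothetical adaptation for applications with connection timeouts:
 * advance a clock only forward (maximum timestamp seen) and compare
 * per-flow activity against it, which tolerates the bounded reordering
 * introduced by locality buffering.  The flow table itself is omitted. */
struct flow_state {
    uint64_t last_seen_us;      /* timestamp of this flow's latest packet */
    /* ... application-specific per-flow state ... */
};

static uint64_t clock_us;       /* maximum packet timestamp seen so far */

static void on_packet(struct flow_state *f, uint64_t pkt_ts_us)
{
    if (pkt_ts_us > clock_us)
        clock_us = pkt_ts_us;   /* timestamps may go backwards across flows */
    if (pkt_ts_us > f->last_seen_us)
        f->last_seen_us = pkt_ts_us;
}

static int flow_expired(const struct flow_state *f, uint64_t timeout_us)
{
    return clock_us - f->last_seen_us > timeout_us;
}
```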

The second limitation of our approach is that our generic implementation within libpcap, which sorts packets based on source or destination port numbers, may not be suitable for applications that require a custom packet sorting or scheduling approach, e.g., based on application semantics. Monitoring applications may perform similar processing for packets with specific sets of port numbers, which cannot be known to the packet capturing library. Such applications should not use our modified libpcap version, but should instead implement a custom scheduling scheme for the packet processing order. Our implementation's goal is to transparently improve the performance of the large class of existing applications whose packet processing tasks depend mainly on the packets' port numbers.

In particular, the locality buffering technique is intended for applications that perform similar processing and similar memory accesses for the same class of packets, e.g., packets of the same flow or packets belonging to the same higher level protocol or application. For instance, signature-based intrusion detection systems can benefit from locality buffering, because a different set of signatures is matched against each class of packets. Other types of monitoring applications may not gain the same performance improvement from locality buffering.

Finally, recent trends favor processors with multiple CPU cores rather than faster single cores. Thus, applications should utilize all available CPU cores to take full advantage of modern hardware and improve their performance. Although L2 caches become larger in newer processors, more cores tend to share them, so locality enhancements can still benefit overall performance. Our approach can be extended to exploit memory locality enhancements for improving the performance of multithreaded applications running on multicore processors. Improving memory locality for each thread, which usually runs on a single core, is an important factor that can significantly improve packet processing performance. Each thread should process similar packets to improve its memory access locality, in the same spirit as our approach, which is intended for single-threaded applications.

Applications usually choose to split packets among multiple threads (one thread per core) based on flow identifiers. A generic locality-aware approach for efficiently splitting packets across multiple threads, in order to optimize cache usage on each CPU core, would sort packets based on their port numbers and then divide them among the threads, as sketched below. This leads to improved code and data locality on each CPU core, similarly to our locality buffering approach. However, in some applications the best splitting of packets across threads can be determined only by the application itself, based on custom application semantics, e.g., custom sets of ports with similar processing. In these cases, a generic library for improved memory locality cannot be used.
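A minimal sketch of such a generic splitting policy, under the assumption of hypothetical names and a simple contiguous partitioning, could look as follows; note that a chunk boundary may still split one port group across two threads, which a real implementation would probably adjust for.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of the generic splitting policy described above:
 * sort a batch of captured packets by port and hand each thread a
 * contiguous chunk, so that packets with the same port are mostly
 * processed by the same thread/core. */
struct pkt_desc {
    uint16_t port;      /* indexing port (e.g. the smaller of src/dst) */
    uint32_t index;     /* position of the packet in the capture buffer */
};

static int cmp_port(const void *a, const void *b)
{
    const struct pkt_desc *x = a, *y = b;
    return (int)x->port - (int)y->port;
}

/* Assign batch[0..n) to n_threads workers: thread t processes the
 * packets in [start[t], start[t+1]); start must hold n_threads+1 slots. */
static void split_batch(struct pkt_desc *batch, size_t n,
                        size_t n_threads, size_t *start)
{
    qsort(batch, n, sizeof(*batch), cmp_port);
    for (size_t t = 0; t <= n_threads; t++)
        start[t] = t * n / n_threads;   /* contiguous, nearly equal chunks */
}
```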

8. Conclusion

In this paper, we present a technique for improving the packet processing performance of a wide range of passive network monitoring applications by enhancing the locality of code and data accesses. Our approach is based on reordering the captured packets before delivering them to the monitoring application, by grouping together packets with the same source or destination port number. This results in improved locality in application code and data accesses, and consequently in an overall increase in packet processing throughput and a significant decrease in the packet loss rate.

To maximize the improvement in processing throughput, we combine locality buffering with memory mapping, an existing technique in libpcap that optimizes the performance of packet capturing. By mapping a buffer into shared memory, this technique reduces the time spent in context switching for delivering packets from kernel to user space.

We describe in detail the design and implementation of locality buffering within libpcap. Our experimental evaluation using three representative passive monitoring applications shows that all of them gain a significant performance improvement when using the locality buffering implementations, while the system can keep up with higher traffic speeds without dropping packets. Specifically, locality buffering results in a 25% increase in the processing throughput of the Snort IDS and allows it to process two times higher traffic rates without packet drops.

Using the original libpcap implementation, the Snort sensor starts to drop packets when the monitored traffic speed reaches 600 Mbit/s, while with locality buffering packet loss appears only beyond 1 Gbit/s. Fprobe, a NetFlow export probe, and Appmon, an accurate traffic classification application, also exhibit significant throughput improvements, up to 12% and 18% respectively, although their processing is not as CPU-intensive as Snort's.

Overall, we believe that implementing locality buffering within libpcap is an attractive performance optimization, since it offers significant performance improvements to a wide range of passive monitoring applications, while at the same time its operation is completely transparent, without the need to modify existing applications. Our implementation of locality buffering in the memory-mapped version of libpcap offers even better performance, since it combines optimizations in both the packet capturing and the packet processing phases.

Acknowledgments

This work was supported in part by the FP7-PEOPLE-2009-IOF project MALCODE and the FP7 project SysSec, funded by the European Commission under Grant Agreements Nos. 254116 and 257007. We thank the anonymous reviewers for their valuable feedback on earlier versions of this paper. Antonis Papadogiannakis, Giorgos Vasiliadis, Demetres Antoniades, and Evangelos Markatos are also with the University of Crete.

References

[1] M. Roesch, Snort: lightweight intrusion detection for networks, in: Proceedings of the USENIX Systems Administration Conference (LISA), 1999.

[2] D. Antoniades, M. Polychronakis, S. Antonatos, E.P. Markatos, S. Ubik, A. Oslebo, Appmon: an application for accurate per application traffic characterization, in: Proceedings of the IST Broadband Europe Conference, 2006.

[3] Fprobe: NetFlow probes, <http://fprobe.sourceforge.net/>.

[4] M. Grossglauser, J. Rexford, Passive traffic measurement for IP operations, in: The Internet as a Large-Scale Complex System, 2005, pp. 91–120.

[5] V. Paxson, Bro: a system for detecting network intruders in real-time, in: Proceedings of the 7th USENIX Security Symposium, 1998.

[6] M. Handley, V. Paxson, C. Kreibich, Network intrusion detection: evasion, traffic normalization, and end-to-end protocol semantics, in: Proceedings of the 10th USENIX Security Symposium, 2001.

[7] S. McCanne, C. Leres, V. Jacobson, libpcap, Lawrence Berkeley Laboratory, Berkeley, CA (software available from <http://www.tcpdump.org/>).

[8] G. Insolvibile, Inside the Linux packet filter, Linux Journal 94 (2002).

[9] J.C. Mogul, K.K. Ramakrishnan, Eliminating receive livelock in an interrupt-driven kernel, ACM Transactions on Computer Systems 15 (3) (1997) 217–252.

[10] J.H. Salim, R. Olsson, A. Kuznetsov, Beyond softnet, in: ALS '01: Proceedings of the 5th Annual Conference on Linux Showcase & Conference, 2001.

[11] S. McCanne, V. Jacobson, The BSD packet filter: a new architecture for user-level packet capture, in: Proceedings of the Winter 1993 USENIX Conference, 1993, pp. 259–270.

[12] L. Deri, Improving passive packet capture: beyond device polling, in: Proceedings of the 4th International System Administration and Network Engineering Conference (SANE), 2004.

[13] L. Deri, ncap: wire-speed packet capture and transmission, in: Proceedings of the IEEE/IFIP Workshop on End-to-End Monitoring Techniques and Services (E2EMON), 2005.

[14] A. Biswas, P. Sinha, On improving performance of network intrusion detection systems by efficient packet capturing, in: Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium (NOMS), Vancouver, Canada, 2006.

[15] L. Braun, A. Didebulidze, N. Kammenhuber, G. Carle, Comparing and improving current packet capturing solutions based on commodity hardware, in: Proceedings of the 10th Annual Conference on Internet Measurement (IMC), 2010, pp. 206–217.

[16] F. Schneider, J. Wallerich, A. Feldmann, Packet capture in 10-gigabit ethernet environments using contemporary commodity hardware, in: Proceedings of the 8th International Conference on Passive and Active Network Measurement (PAM), 2007, pp. 207–217.

[17] H. Bos, W. de Bruijn, M. Cristea, T. Nguyen, G. Portokalidis, FFPF: fairly fast packet filters, in: Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), 2004.

[18] R. Watson, C. Peron, Zero-copy BPF buffers, FreeBSD Developer Summit.

[19] A. Fiveg, Ringmap capturing stack for high performance packet capturing in FreeBSD, FreeBSD Developer Summit.

[20] DAG Ethernet network monitoring cards, Endace Measurement Systems, <http://www.endace.com/>.

[21] Sourcefire Vulnerability Research Team (VRT), <http://www.snort.org/vrt/>.

[22] A.V. Aho, M.J. Corasick, Efficient string matching: an aid to bibliographic search, Communications of the ACM 18 (1975) 333–340.

[23] Performance Application Programming Interface, <http://icl.cs.utk.edu/papi/>.

[24] Tcpreplay, <http://tcpreplay.synfin.net/trac/>.

[25] K. Xinidis, I. Charitakis, S. Antonatos, K.G. Anagnostakis, E.P. Markatos, An active splitter architecture for intrusion detection and prevention, IEEE Transactions on Dependable and Secure Computing 3 (1) (2006) 31–44.

[26] E.P. Markatos, D.N. Pnevmatikatos, M.D. Flouris, M.G.H. Katevenis, Web-conscious storage management for web proxies, IEEE/ACM Transactions on Networking 10 (6) (2002) 735–748.

[27] F. Schneider, J. Wallerich, Performance evaluation of packet capturing systems for high-speed networks, in: Proceedings of the 2005 ACM Conference on Emerging Network Experiment and Technology (CoNEXT), 2005, pp. 284–285.

[28] M. Vallentin, R. Sommer, J. Lee, C. Leres, V. Paxson, B. Tierney, The NIDS cluster: scalable, stateful network intrusion detection on commodity hardware, in: Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection (RAID), 2007, pp. 107–126.

[29] F. Fusco, L. Deri, High speed network traffic analysis with commodity multi-core systems, in: Proceedings of the 10th Annual Conference on Internet Measurement (IMC), 2010, pp. 218–224.

[30] Intel Server Adapters, Receive side scaling on Intel network adapters, <http://www.intel.com/support/network/adapter/pro100/sb/cs-027574.htm>.

[31] R. Sommer, V. Paxson, N. Weaver, An architecture for exploiting multi-core processors to parallelize network intrusion prevention, Concurrency and Computation: Practice and Experience 21 (10) (2009) 1255–1279.