TRANSCRIPT
University of Genoa, Department of Communication, Computer and System Sciences
“Analyzing and Optimizing the Linux Networking Stack”
Raffaele Bolla, Roberto Bruschi, Andrea Ranieri, Gioele Traverso
Santa Margherita, Genoa, Italy
Speaker: Roberto Bruschi
INGRID 2007: Instrumenting the Grid
Telecommunication Networks and Telematics Research Group InfoCom Genova S.R.L.
Outline
• Scenario and objectives;
• System architecture:
— Linux SW architecture;
— System refinement and parameter tuning;
— HW architecture;
• Performance evaluation tools and testbed;
• Numerical results;
• Conclusions and future work.
These activities have been carried out during the Bora-Bora project.
Scenario
• Today, many Open Source Operating Systems (OSs) include networking capabilities so sophisticated and complete that they are often employed wherever intensive and critical network usage is expected.
• In fact, owing to the flexibility of the Open Source approach, these OSs are becoming increasingly widespread in many application fields,
— for example, in network firewalls, web servers, software routers and Grid architectures.
• Notwithstanding their growing popularity and their continuous development, there is no clear indication about:
— the maximum performance level that these architectures can achieve;
— the computational weight that they introduce.
• In fact, knowledge of these performance indexes can help to:
— better dimension all such systems with intensive network usage;
— design new task allocators for Grid clusters that take into account the performance limits and the resources needed by networking operations.
Scenario & Objectives
• With this aim in mind, in this work we analyze and optimize the performance of the Linux networking stack, trying to evaluate the impact of networking functionalities on the system.
• We have not taken the transport layer (TCP or UDP) into consideration, to better highlight the effects and features of low-level network I/O operations (packet reception and transmission).
• Linux can boast one of the most complete networking stacks (networking source code is about 40% of the whole kernel), and it offers support for a wide range of hardware components.
• Besides, we have to underline that, although Linux is strongly oriented toward networking usage, it is a general-purpose architecture, and so it presents a large number of aspects that can be tuned or optimized to enhance its networking performance.
The Linux Networking Overview
• While all the packet processing functions are realized inside the Linux kernel, most daemons/applications requiring network functionalities run in user mode.
• The critical element for the network functionalities is the kernel, where all the link, network and transport layer operations are realized.
• Over the last years, the networking support integrated in the Linux kernel has undergone many structural and refining developments.
• For these reasons, we have chosen to use a latest-generation Linux kernel, more specifically a 2.6 version.
The Linux Networking Overview
• The Linux networking architecture is fundamentally based on an interrupt mechanism:
— network boards signal the kernel upon packet reception or transmission through HW interrupts;
— each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU;
— until it completes, the handling routine cannot be interrupted by anything, not even by other interrupt handlers;
— with the clear purpose of keeping the system reactive, the interrupt handlers are designed to be very short, while all the time-consuming tasks are deferred to the so-called "Software Interrupts" (SoftIRQs), executed at a later time.
• This is the well-known "top half – bottom half" IRQ routine division implemented in the Linux kernel.
• SoftIRQs are not real interrupts, but rather a form of kernel activity that can be scheduled for later execution:
— they differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by a kernel activity, such as a HW IRQ routine, and has to wait until it is called by the scheduler;
— SoftIRQs can be interrupted only by HW IRQ routines.
The Linux Networking Overview
• The packet processing path is fundamentally composed of four different modules:
— a "reception API", which handles packet reception (NAPI for the 2.6 kernels);
— a "transmission API", which manages the forwarding operations to egress network interfaces;
— a module that carries out the IP layer processing (both in reception and in transmission);
— a transport layer module (TCP and UDP) (not discussed here).
• In particular, the reception and transmission APIs are the lowest-level modules, and are composed of both HW IRQ routines and SoftIRQs:
— they work by managing the network interfaces and performing some layer 2 functionalities.
The Linux Networking Overview
• The NAPI has been explicitly created to increase the scalability of the packet reception process.
• It handles network interface requests with an interrupt moderation mechanism, which allows the kernel to switch adaptively from a classical interrupt-driven management of the network interfaces to a polling one.
• This is done by:
— inserting, during the HW IRQ routine, the identifier of the board generating the IRQ into a special list, called the "poll list", scheduling a reception SoftIRQ, and disabling the HW IRQs for that device;
— when the SoftIRQ is activated, the kernel polls all the devices whose identifiers are included in the poll list, and a maximum of quota packets are served per device;
— if the buffer (Rx Ring) of a device is emptied, its identifier is removed from the poll list and its HW IRQs are re-enabled; otherwise, its HW IRQs are left disabled, the identifier is kept in the poll list and a further SoftIRQ is scheduled.
The Linux Networking Overview
[Figure: NAPI data path for eth0: DMA buffer, Rx ring buffer, poll list, packet processing, Tx ring buffer.]
The Networking 2.6 Kernel
• NAPI implements an adaptive mechanism that, by using interrupt mitigation, behaves like the classical SoftNet API at low input rates, while, at higher rates, it works like a polling mechanism.
[Figure: kernel-space/user-space split: the transport layer (TCP/UDP processing) sits in kernel space, exchanging packets to and from applications in user space.]
System refinement
• The whole networking kernel architecture is quite complex and presents some aspects and many parameters that can be tuned for system optimization:
— tuning of the driver parameters (interrupt rates, ring buffers, ...);
— tuning of the Tx and Rx parameters (quota, qdisc dimensioning);
— optimization of some specific 2.6 kernel parameters (scheduler frequency);
— rationalization of memory management (skb descriptor recycling patch);
— monitoring of kernel dynamics (counter patch).
HW Architectures
• To benchmark the performance of the Linux networking stack, we have used two hardware architectures:
— Gelso: based on a SuperMicro X5DL8-GG mainboard, equipped with a PCI-X bus and with a 32-bit 2.4 GHz Intel Xeon;
— Magnolia: based on a SuperMicro X7DBE mainboard, equipped with both PCI Express and PCI-X buses, and with an Intel Xeon 5050, which is a dual-core 64-bit processor.
• The presence of more than one hardware architecture gives us the possibility to better understand which performance bottlenecks depend on the selected hardware and how, and to evaluate how networking performance scales with hardware capabilities.
• As for the network interface cards (NICs), we have used Intel PRO/1000 XT server adapters for the PCI-X bus and Intel PRO/1000 PT dual-port adapters for PCI Express.
HW Architecture
• The PC architecture is a general-purpose one, and it is not specifically optimized for network operations.
• The PC internal data path has to use a centralized, shared-memory I/O mechanism.
• The bandwidth of the internal busses and the PC computational capacity are the two most critical hardware elements in determining the maximum performance.
Performance evaluation tools
• The OR performance can be analyzed by using both internal and external measurement methods:
— External measurements: performed by using professional equipment, namely an Agilent N2X Router Tester.
– It allows obtaining throughput and latency measurements with very high availability and accuracy levels (i.e., the minimum guaranteed timestamp resolution is 10 ns).
— Internal measurements: performed by using OProfile, an open-source tool that realizes a continuous monitoring of system dynamics with a frequent and quite regular sampling of CPU hardware registers.
– OProfile allows evaluating, in a very effective and deep way, the CPU utilization of both each software application and each single kernel function running in the system, with a very low computational overhead.
Numerical Results: Throughput and Latencies
[Figure: maximum throughput (pkt/s) vs. packet size (64 to 1500 B) for Gelso and Magnolia, with (Recy) and without (NoRecy) the skbuff recycling patch, on PCI-X and PCI Express, compared against the maximum theoretical rate.]
[Figure: throughput (pkt/s) vs. offered load (5% to 100%) for the same configurations, with 64-byte packets.]
[Figure: latency (µs) vs. offered load (5% to 100%) for the same configurations, with 64-byte packets.]
Max throughput vs. packet size; throughput and latency vs. offered load with 64-byte packets.
[Figure: CPU utilization (%) of the kernel networking functionalities vs. offered load (64-byte packets) for Gelso with the standard kernel; breakdown: idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Eth processing, oprofile.]
[Figure: the same breakdown for Gelso with the skbuff recycling patch.]
Numerical Results: OProfile Reports
[Figure: CPU utilization (%) of the kernel networking functionalities vs. offered load (1500-byte packets) for Gelso with the skbuff recycling patch; breakdown: idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Eth processing, oprofile.]
Numerical Results: OProfile Reports
[Figure: number of IRQ routines, of polls and of Rx SoftIRQs (second y-axis) per packet for the RX board, for the skbuff-recycling-patched kernel, in the presence of an incoming traffic flow with 64-byte packets.]
[Figure: number of IRQ routines for the TX board, and of Tx Ring clearings by the Tx SoftIRQ ("func") and by the Rx SoftIRQ ("wake", second y-axis), for the skbuff-recycling-patched kernel, in the presence of an incoming traffic flow with 64-byte packets.]
Numerical Results: Counter Patch Reports
Numerical Results: User Plane Impact
CPU utilization of the kernel networking functionalities according to different traffic offered loads (64-byte packets) for Gelso with the standard kernel. Dummy processes included.
Conclusions and Future Works
• The main contribution of this work has been reporting the results of a deep optimization and testing activity carried out on the Linux networking stack, and, more in particular, on layers 2 and 3.
• The main objective has been the performance evaluation (with respect to IPv4 packet forwarding) of both a standard kernel and its optimized version on two different hardware architectures.
• The benchmarking, carried out with both external (i.e., throughput and latency) and internal (i.e., kernel profiling and internal counters) measurements, has shown that:
— packet processing operations are CPU-intensive in nature;
— they generally cause a high HW IRQ rate, which may lower system reactivity;
— ad hoc kernel optimizations appreciably improve the overall performance;
— hardware improvements (CPU frequency and I/O bus speed) enhance the packet processing performance.