TRANSCRIPT
University of Genoa, Department of Communication, Computer and System Sciences
“Analyzing and Optimizing the Linux Networking Stack”
Raffaele Bolla, Roberto Bruschi, Andrea Ranieri, Gioele Traverso
Santa Margherita, Genoa, Italy
Speaker: Roberto Bruschi
INGRID 2007: Instrumenting the Grid
Telecommunication Networks and Telematics Research Group InfoCom Genova S.R.L.
Outline
• Scenario and objectives;
• System architecture:
— Linux SW architecture;
— System refinement and parameter tuning;
— HW architecture;
• Performance evaluation tools and testbed;
• Numerical results;
• Conclusions and future work.
These activities have been carried out during the Bora-Bora project.
Scenario
• Today, many Open Source Operating Systems (OSs) include networking capabilities so sophisticated and complete that they are often employed wherever intensive and critical network usage is expected.
• In fact, owing to the flexibility of the Open Source approach, these OSs are becoming increasingly widespread in many application fields,
— for example, in network firewalls, web servers, software routers and Grid architectures.
• Notwithstanding their growing popularity and their continuous development, there is no clear indication about:
— the maximum performance level that these architectures can achieve;
— the computational weight that they introduce.
• In fact, knowledge of these performance indexes can help to:
— better dimension all such systems with intensive network usage;
— design new task allocators for Grid clusters that take into account the performance limits and the resources needed by networking operations.
Scenario & Objectives
• With this aim in mind, in this work we analyze and optimize the performance of the Linux networking stack, trying to evaluate the impact of networking functionalities on the system.
• We have not taken the transport layer (TCP or UDP) into consideration, to better highlight the effects and features of low-level network I/O operations (packet reception and transmission).
• Linux can boast one of the most complete networking stacks (networking source code is about 40% of the whole kernel), and it offers support for a wide range of hardware components.
• Besides, we have to underline that, although Linux is strongly oriented toward networking usage, it is a general-purpose architecture, and so it presents a large number of aspects that can be tuned or optimized to enhance its networking performance.
The Linux Networking Overview
• While all the packet processing functions are realized inside the Linux kernel, most daemons/applications requiring network functionalities run in user mode.
• The critical element for the network functionalities is the kernel, where all the link, network and transport layer operations are realized.
• Over the last years, the networking support integrated in the Linux kernel has undergone many structural and refining developments.
• For these reasons, we have chosen to use a latest-generation Linux kernel, more specifically a 2.6 version.
The Linux Networking Overview
• The Linux networking architecture is fundamentally based on an interrupt mechanism:
— network boards signal the kernel upon packet reception or transmission through HW interrupts;
— each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU;
— until it completes, the handling routine cannot be interrupted by anything, not even by other interrupt handlers;
— with the clear purpose of keeping the system reactive, the interrupt handlers are designed to be very short, while all the time-consuming tasks are deferred to the so-called "Software Interrupts" (SoftIRQs), executed at a later time.
• This is the well-known "top half – bottom half" IRQ routine division implemented in the Linux kernel.
• SoftIRQs are not real interrupts, but rather a form of kernel activity that can be scheduled for later execution:
— they differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by a kernel activity, such as a HW IRQ routine, and has to wait until it is called by the scheduler;
— SoftIRQs can be interrupted only by HW IRQ routines.
The Linux Networking Overview
• The packet processing path is fundamentally composed of four different modules:
— a "reception API", which handles packet reception (NAPI for the 2.6 kernels);
— a "transmission API", which manages the forwarding operations to egress network interfaces;
— a module that carries out the IP layer processing (both in reception and in transmission);
— a transport layer module (TCP and UDP) (not discussed here).
• In particular, the reception and transmission APIs are the lowest-level modules, and are composed of both HW IRQ routines and SoftIRQs:
— they work by managing the network interfaces and performing some layer 2 functionalities.
The Linux Networking Overview
• The NAPI has been explicitly created to increase the scalability of the packet reception process.
• It handles network interface requests with an interrupt moderation mechanism, which allows the kernel to switch adaptively from a classical interrupt-driven management of the network interfaces to a polling one.
• This is done by:
— inserting, during the HW IRQ routine, the identifier of the board generating the IRQ into a special list, called the "poll list", scheduling a reception SoftIRQ, and disabling the HW IRQs for that device;
— when the SoftIRQ is activated, the kernel polls all the devices whose identifiers are included in the poll list, and a maximum of quota packets are served per device;
— if the buffer (Rx Ring) of a device is emptied, its identifier is removed from the poll list and its HW IRQs are re-enabled; otherwise, its HW IRQs are left disabled, the identifier is kept in the poll list and a further SoftIRQ is scheduled.
The Linux Networking Overview
[Figure: NAPI data path for eth0: DMA buffer, Rx ring buffer, poll list, packet processing, Tx ring buffer.]
The Networking 2.6 Kernel
• NAPI implements an adaptive mechanism that, by using interrupt mitigation, behaves like the classical SoftNet API at low input rates, while, at higher rates, it works like a polling mechanism.
[Figure: kernel-space/user-space split: the transport layer (TCP/UDP processing) sits in kernel space, exchanging packets to and from applications in user space.]
System refinement
• The whole networking kernel architecture is quite complex and presents some aspects and many parameters that can be tuned for system optimization:
— tuning of the driver parameters (interrupt rates, ring buffers, ...);
— tuning of the Tx and Rx parameters (quota, qdisc dimensioning);
— optimization of some specific 2.6 kernel parameters (scheduler frequency);
— rationalization of memory management (skb descriptor recycling patch);
— monitoring of kernel dynamics (counter patch).
HW Architectures
• To benchmark the performance of the Linux networking stack, we have used two hardware architectures:
— Gelso: based on a SuperMicro X5DL8-GG mainboard, equipped with a PCI-X bus and with a 32-bit 2.4 GHz Intel Xeon;
— Magnolia: based on a SuperMicro X7DBE mainboard, equipped with both PCI Express and PCI-X buses, and with an Intel Xeon 5050, which is a dual-core 64-bit processor.
• The presence of more than one hardware architecture gives us the possibility to better understand which performance bottlenecks depend on the selected hardware and how, and to evaluate how networking performance scales with hardware capabilities.
• As for the network interface cards (NICs), we have used Intel PRO/1000 XT server adapters for the PCI-X bus and Intel PRO/1000 PT dual-port adapters for PCI Express.
HW Architecture
• The PC architecture is a general-purpose one, and it is not specifically optimized for network operations.
• The PC internal data path has to use a centralized, shared-memory I/O mechanism.
• The bandwidth of the internal busses and the PC computational capacity are the two most critical hardware elements in determining the maximum performance.
Performance evaluation tools
• The OR performance can be analyzed by using both internal and external measurement methods:
— External measurements: performed by using professional equipment, namely an Agilent N2X Router Tester.
– It allows obtaining throughput and latency measurements with very high availability and accuracy levels (i.e., the minimum guaranteed timestamp resolution is 10 ns).
— Internal measurements: performed by using OProfile, an open-source tool that realizes a continuous monitoring of system dynamics with a frequent and quite regular sampling of CPU hardware registers.
– OProfile allows evaluating, in a very effective and deep way, the CPU utilization of both each software application and each single kernel function running in the system, with a very low computational overhead.
Numerical Results: Throughput and Latencies
[Figure: maximum throughput (pkt/s) vs. packet size (64 to 1500 B) for Gelso and Magnolia, with (Recy) and without (NoRecy) the skbuff recycling patch, on PCI-X and PCI Express, compared against the maximum theoretical rate.]
[Figure: throughput (pkt/s) vs. offered load (5% to 100%) for the same configurations, with 64-byte packets.]
[Figure: latency (µs) vs. offered load (5% to 100%) for the same configurations, with 64-byte packets.]
Max throughput vs. packet size; throughput and latency vs. offered load with 64-byte packets.
[Figure: CPU utilization (%) of the kernel networking functionalities vs. offered load (64-byte packets) for Gelso with the standard kernel; breakdown: idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Eth processing, oprofile.]
[Figure: the same breakdown for Gelso with the skbuff recycling patch.]
Numerical Results: OProfile Reports
[Figure: CPU utilization (%) of the kernel networking functionalities vs. offered load (1500-byte packets) for Gelso with the skbuff recycling patch; breakdown: idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Eth processing, oprofile.]
Numerical Results: OProfile Reports
[Figure: number of IRQ routines, of polls and of Rx SoftIRQs (second y-axis) per packet for the RX board, for the skbuff-recycling-patched kernel, in the presence of an incoming traffic flow with 64-byte packets.]
[Figure: number of IRQ routines for the TX board, and of Tx Ring clearings by the Tx SoftIRQ ("func") and by the Rx SoftIRQ ("wake", second y-axis), for the skbuff-recycling-patched kernel, in the presence of an incoming traffic flow with 64-byte packets.]
Numerical Results: Counter Patch Reports
Numerical Results: User Plane Impact
CPU utilization of the kernel networking functionalities according to different traffic offered loads (64-byte packets) for Gelso with the standard kernel. Dummy processes included.
Conclusions and Future Works
• The main contribution of this work has been reporting the results of a deep optimization and testing activity carried out on the Linux networking stack, and, more in particular, on layers 2 and 3.
• The main objective has been the performance evaluation (with respect to IPv4 packet forwarding) of both a standard kernel and its optimized version on two different hardware architectures.
• The benchmarking, carried out with both external (i.e., throughput and latency) and internal (i.e., kernel profiling and internal counters) measurements, has shown that:
— packet processing operations are CPU-intensive in nature;
— they generally cause a high HW IRQ rate, which may lower system reactivity;
— ad hoc kernel optimizations appreciably improve the overall performance;
— hardware improvements (CPU frequency and I/O bus speed) enhance the packet processing performance.