Journey to the Center of the Linux Kernel


  • 7/24/2019 Journey to the Center of the Linux Kernel


    Journey to the Center of the Linux Kernel:

    Traffic Control, Shaping and QoS

Julien Vehent (see revisions)

    1 Introduction

This document describes the Traffic Control subsystem of the Linux Kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind Traffic Control and its usage, and demonstrate how one can gain complete control over the packets passing through a system.

    a QoS graph

The initial target of this paper was to gain better control over a small DSL uplink, and it grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc.


The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real credit goes to the developers behind the net/ directory of the kernel, and all of the researchers who created and improved all of these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spotted an inconsistency somewhere. Please email julien(at)linuxwall.info; your messages are always most appreciated.

For the technical discussion, since the LARTC mailing list doesn't exist anymore, try these two: the Netfilter users mailing list, for general discussions, and the Netdev mailing list, where the magic happens (developers' ML).

2 Motivation

This article was initially published in the French Gnu/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that releases the content of the article under Creative Commons after some time. I extended the initial article


In the Internet world, everything is packets. Managing a network means managing packets: how they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on packets leaving the system. It doesn't, initially, have as an objective to manipulate packets entering the system (although you could do that, if you really wanted to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver.

It means that the TC module, the packet scheduler, is permanently activated in the kernel. Even when you do not explicitly want to use it, it's there scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type


Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff, contains a field called __u32 nfmark that we are going to modify. TC will then read that value to select the destination class of a packet.

The following iptables rule will apply the mark 80 to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).

    # iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80

We can control the application of this rule via the netfilter statistics:

# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out  source   destination
 7296  109M MARK   tcp  --  any eth0 anywhere anywhere   tcp spt:www MARK set 0x50

You probably noticed that the rule is located in the mangle table. We will go back to that a little bit later.

3.3 Two classes in a tree

To manipulate TC policies, we need the /sbin/tc binary from the iproute package (aptitude install iproute).

The iproute package must match your kernel version. Your distribution's package manager will normally take care of that.

We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.

# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1: classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1: classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500

The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class will be allowed to increase its flow rate up to the ceil value. Otherwise, the rate value is applied. It means that the sum of the rate values must correspond to the total bandwidth available.

In the previous example, we consider the total upload bandwidth to be 1024kbits/s, so class 10 (web server) gets 200kbits/s and class 20 (everything else) gets 824kbits/s.

TC can use both kbit and kbps notations, but they don't have the same meaning. kbit is the rate in kilo-bits per second, and kbps is in kilo-bytes per second. In this article, I will use the kbit notation only.
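To make the distinction concrete, here is a minimal conversion sketch (an illustration, not from the original article; the function names are my own):

```python
def kbit_to_kbyte_per_s(kbit: float) -> float:
    """Convert kilo-bits per second (tc's 'kbit') to kilo-bytes per second ('kbps')."""
    return kbit / 8.0

def kbyte_per_s_to_kbit(kbps: float) -> float:
    """Convert kilo-bytes per second (tc's 'kbps') to kilo-bits per second."""
    return kbps * 8.0

# The 200kbit rate of class 1:10 is only 25 kilo-bytes per second on the wire.
print(kbit_to_kbyte_per_s(200))  # 25.0
print(kbyte_per_s_to_kbit(25))   # 200.0
```

Mixing up the two notations in a tc rule silently gives you a policy eight times faster or slower than intended, which is why the article sticks to kbit.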


3.4 Connecting the marks to the tree

We now have, on one side, a traffic shaping policy, and on the other side, packet marking. To connect the two, we need a filter.

A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy; otherwise, it won't be applied.

# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10

We can test the policy using a simple client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:

# nc -l -p 80 < /dev/zero

And connect to it from another machine using:

# nc 192.168.1.1 80 > /dev/null

The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, as fast as it can.

Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).

The value is 199.20kbits/s, which is close enough to the 200kbits/s target. The precision of the scheduler depends on a few parameters that we will discuss later on.

Any other connection from the server that uses a source port different from TCP/80 will have a flow rate between 824kbits/s and 1024kbits/s (depending on the presence of other connections in parallel).

4 Twenty Thousand Leagues Under the Code

Now that we enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.

The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.


The illustration below shows the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP, UDP, …) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, …). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress


    This command means attach a root


Example Internet Datagram Header, from RFC 791

This algorithm is defined in net/sched/sch_generic.c and represented in the diagram below.

(dia source)

The length of a band, representing the number of packets it can contain, is set to 1000 by default and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:

# cat /sys/class/net/eth0/tx_queue_len
1000

Once the default value of 1000 is passed, TC will start dropping packets. This should very rarely happen, because TCP makes sure to adapt its sending speed to the capacity of both systems participating in the communication (that's the role of the TCP slow start). But experiments showed that increasing that limit to 10,000, or even 100,000, in some very specific cases of gigabit networks can improve the performance. I wouldn't recommend touching this value unless you really know what you are doing. Increasing a buffer size to a too-large value can have very negative side effects on the


SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets, and defaults to 1024.

SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets.

#define SFQ_DEPTH 128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR 1024

These two values determine the maximum number of packets that can be
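Together, these two constants bound the queue. A sketch of the arithmetic (an illustration, not kernel code):

```python
SFQ_DEPTH = 128                  # max number of packets per flow (per bucket)
SFQ_DEFAULT_HASH_DIVISOR = 1024  # number of buckets

# Theoretical upper bound if every bucket were filled to its depth.
max_packets = SFQ_DEPTH * SFQ_DEFAULT_HASH_DIVISOR
print(max_packets)  # 131072
```

In practice the qdisc's own limit caps the total far below this theoretical product, but the two constants show how the hash table is dimensioned.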


destination IP: 175.112.129.215

destination port: 2146

/* IP source address in hexadecimal */
h1 = 7efffef0

/* IP destination address in hexadecimal */
h2 = af7081d7

/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header).
   We perform a XOR between the variable h2 obtained in the previous step
   and the TCP protocol number */
h2 = h2 XOR 06

/* if the IP packet is not fragmented, we include the TCP ports in the hash. */
/* 1f900862 is the hexadecimal representation of the source and destination ports.
   We perform another XOR with this value and the h2 variable */
h2 = h2 XOR 1f900862

/* And finally, we use the Jenkins algorithm with some additional golden numbers.
   This jhash function is defined somewhere else in the kernel source code */
h = jhash(h1, h2, perturbation)

The result obtained is a hash value of 32 bits that will be used by SFQ to select the destination bucket of the packet. Because the perturb value is regenerated every 10 seconds, the packets from a reasonably long connection will be directed to different buckets over time.
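The hashing steps above can be condensed into a runnable sketch. The jhash_stand_in mixer below is NOT the kernel's Jenkins jhash (only a toy 32-bit stand-in), and the sfq_bucket name is mine; the field values are the ones from the example:

```python
GOLDEN_RATIO = 0x9e3779b9  # golden-ratio constant, as used by the kernel's jhash

def jhash_stand_in(a: int, b: int, initval: int) -> int:
    """Toy 32-bit mixer standing in for the kernel's Jenkins jhash."""
    h = ((a ^ b ^ initval) * GOLDEN_RATIO) & 0xFFFFFFFF
    return (h ^ (h >> 16)) & 0xFFFFFFFF

def sfq_bucket(src_ip: int, dst_ip: int, proto: int,
               ports: int, perturb: int, buckets: int = 1024) -> int:
    """Fold the header fields as described above, then hash into a bucket."""
    h1 = src_ip
    h2 = dst_ip ^ proto   # XOR in the protocol number
    h2 ^= ports           # XOR in the ports when the packet is not fragmented
    return jhash_stand_in(h1, h2, perturb) % buckets

# The example packet from the text; when perturb changes (every 10 seconds),
# the same flow usually lands in a different bucket.
print(sfq_bucket(0x7efffef0, 0xaf7081d7, 0x06, 0x1f900862, perturb=42))
```

The point of the sketch is the structure (fold, then hash with a perturbed seed), not the exact bit-mixing, which differs in the real jhash.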

But this also means that SFQ might break the se


SFQ scheduler that works with 10 buckets only and considers the IP addresses of the packets in the hash.

This discipline is classless as well, which means we cannot direct packets to another scheduler when they leave SFQ. Packets are transmitted to the network interface only.

4.2 Classful Disciplines

4.2.1 TBF - Token Bucket Filter

Until now, we looked at algorithms that do not allow controlling the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smooth the traffic, and even to prioritize it a bit, but not to control its throughput.

In fact, the main problem when controlling the bandwidth is to find an efficient accounting method. Because counting in memory is extremely difficult and costly to do in real time, computer scientists took a different approach here.

Instead of counting the packets (or the bits transmitted by the packets, it's the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. Now this is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the packet waits.

Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look into the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1KB) wishes to be transmitted. It enters the TBF scheduler and TBF controls the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass. Otherwise, the packet waits until the bucket has enough tokens.
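The principle can be sketched in a few lines. This is an illustration of the idea, not the kernel's implementation (which accounts for tokens in time units rather than via an explicit refill call):

```python
class TokenBucket:
    """Toy token bucket: one token per bit, refilled at a constant rate."""

    def __init__(self, rate_bits_per_s: int, burst_bits: int):
        self.rate = rate_bits_per_s
        self.capacity = burst_bits   # bucket size bounds the burst
        self.tokens = burst_bits     # the bucket starts full

    def refill(self, elapsed_s: float) -> None:
        """Add rate * elapsed tokens, capped at the bucket capacity."""
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed_s)

    def try_send(self, packet_bits: int) -> bool:
        """Destroy tokens and pass the packet, or make it wait."""
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True
        return False

tb = TokenBucket(rate_bits_per_s=200_000, burst_bits=12_000)
print(tb.try_send(8000))   # True: the bucket starts full
print(tb.try_send(8000))   # False: only 4000 tokens left, the packet waits
tb.refill(0.1)             # 0.1s at 200kbit/s adds 20000 tokens, capped at 12000
print(tb.try_send(8000))   # True
```

Note how the capacity of the bucket, not the rate, decides how large a burst can pass at once; that is exactly the burst problem discussed next.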

    The fre


So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.

To overcome this problem, and provide better control over the bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store a large amount of tokens, but its replenishing rate will be a lot faster than the one of the big bucket. This second rate is called peakrate, and it will determine the maximum speed of a burst.

Let's take a step back and look at those parameters again. We have:

peakrate > rate: the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist. Packets would be de


just 53 bytes each. And of those 53 bytes, only 48 are from the original packet; the rest is occupied by the ATM headers.

So where is the problem? Consider the following network topology.

The QoS box is in charge of performing the packet scheduling before transmitting to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet packet is split into 32 ATM cells, for a total size of 32 cells × 5 bytes of headers per cell + 1500 bytes of data = (32×5)+1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used, we lose 10% of bandwidth compared to an ethernet network (this is an estimate that depends on the average packet size, etc.).
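The arithmetic above can be checked with a short sketch. Like the text's estimate, it counts only the per-cell header overhead and ignores AAL5 padding and trailer, which make the real overhead slightly worse; the function name is mine:

```python
import math

ATM_CELL_HEADER = 5    # bytes of header in each 53-byte cell
ATM_CELL_PAYLOAD = 48  # bytes of data carried per cell

def atm_on_wire(packet_bytes: int) -> tuple:
    """Return (number of cells, total bytes) for an IP packet carried over ATM."""
    cells = math.ceil(packet_bytes / ATM_CELL_PAYLOAD)
    return cells, packet_bytes + cells * ATM_CELL_HEADER

cells, total = atm_on_wire(1500)
print(cells, total)  # 32 1660  -> 32 cells, 1660 bytes on the wire
```

Running it for a range of packet sizes shows why the loss is only an estimate: small packets waste proportionally more bytes in headers than full-MTU packets do.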

If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start


TBF gives a pretty accurate control over the bandwidth assigned to a


quantum is similar to the


For very small or very large bandwidths, it is important to tune r2q properly. If r2q is too large, too many packets will leave a


But in most cases, this optimization is simply deactivated, as shown below:

# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

4.2.3 CoDel



Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answers general needs. Those are:

Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the


echo "#---ssh - id 300 - rate 160 kbit ceil 1120 kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority

echo "#--- sub ssh sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: sfq perturb 10 limit 32

echo "#--- ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 3 handle 300 fw flowid 1:300

echo "#--- netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp --tcp-flags SYN SYN --dport 22 -j CONNMARK --set-mark 300

The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines a rate of 160kbit/s, and can use up to 1120kbit/s by borrowing the difference from other leaves.

The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.
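That figure is easy to sanity-check. A quick sketch, assuming tc parses the "k" suffix as 1024 bytes:

```python
burst_bytes = 15 * 1024  # tc's "15k"
mtu = 1500               # bytes per full packet

full_packets = burst_bytes // mtu
print(full_packets)  # 10
```

So a 15k burst lets roughly ten full-MTU packets through back-to-back before the rate limit bites.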

The second rule defines a SFQ


Let us now load the script on our gateway, and visualise the result:

eth0 TRAFIC CONTROL RULES FOR ramiel
#-cleanup
RTNETLINK answers No such file or directory
#-define a HTB root qdisc
#--uplink - rate 1600 kbit ceil 1600 kbit
#---interactive - id 100 - rate 160 kbit ceil 1600 kbit
#--- sub interactive pfifo
#--- interactive filter
#--- netfilter rule - all UDP traffic at 100
#---tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
#--- sub tcp acks pfifo
#--- filtre tcp acks
#--- netfilter rule for TCP ACKs will be loaded at the end
#---ssh - id 300 - rate 160 kbit ceil 1120 kbit
#--- sub ssh sfq
#--- ssh filter
#--- netfilter rule - SSH at 300
#---http branch - id 400 - rate 800 kbit ceil 1600 kbit
#--- sub http branch sfq
#--- http branch filter
#--- netfilter rule - http/s
#---default - id 999 - rate 160kbit ceil 1600kbit
#--- sub default sfq
#--- filtre default
#--- propagating marks on connections
#--- Mark TCP ACKs flags at 200

Traffic Control is up and running

# /etc/network/if-up.d/lnw_gateway_tc.sh show

---- qdiscs details -----
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec

---- qdiscs statistics --
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 12521 pkt (dropped 481, overlimits 26190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 481, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec


 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 119887 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

These are just two of the types of output tc can generate. You might find the class statistics helpful to diagnose leaf consumption:

# tc -s class show dev eth0

[...truncated...]

class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 30Kb cburst 1600b
 Sent 1029005 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
 rate 2624bit 5pps backlog 0b 0p requeues 0
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791250 ctokens: 120625

Above are the detailed statistics for the HTTP leaf: you can see the accumulated rate and packets-per-second statistics, but also the tokens accumulated, lended, borrowed, etc. This is the most helpful output to diagnose your policy in depth.

5 A Word about "Bufferbloat"

We mentioned that too-large buffers can have a negative impact on the performance of a connection. But how bad is it exactly?

The answer to that


latency (speed) meets our needs. More of what you don't need is useless. Bufferbloat destroys the speed we really need.

More information on Gettys's page, and in this paper from 1996: It's the Latency, Stupid.

Long story short: if you have bad latency but large bandwidth, you will be able to transfer very large files efficiently, but a simple DNS


-n is the number of buffers of 4096 bytes given to the socket.

# nttcp -t -D -n2048000 192.168.1.220

    "nd at the same time, on the laptop, launch a ping of the desktop.

64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.86 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.62 ms

The first two pings are launched before nttcp. When nttcp starts, the latency increases, but this is still acceptable.

Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The command is:

# ethtool -s eth0 speed 100 duplex full

    # ethtool eth0


collisions:0 txqueuelen:1000

# ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX: 511

We start by changing the tx


But while the TCP stack was filling up the TX buffers, all the other packets that our system wanted to send got either stuck somewhere in the