freebsd kernel - university of cambridge · 9/28/17 6 fine-grained scheduling 11 true kernel...

14
9/28/17 1 Concurrent systems Lecture 8: Case study - FreeBSD kernel concurrency Dr Robert N. M. Watson 1 FreeBSD kernel Open-source OS kernel Large: millions of LoC Complex: thousands of subsystems, drivers, ... Very concurrent: dozens or hundreds of CPU cores / hyperthreads Widely used: NetApp, EMC, Dell, Apple, Juniper, Netflix, Sony, Panasonic, Cisco, Yahoo!, … Why a case study? Extensively employs C&DS principles Concurrency performance and composability at scale Consider design and evolution 2 In the library: Marshall Kirk McKusick, George V. Neville-Neil, and Robert N. M. Watson. The Design and Implementation of the FreeBSD Operating System (2nd Edition), Pearson Education, 2014.

Upload: letuyen

Post on 08-Apr-2019

217 views

Category:

Documents


0 download

TRANSCRIPT

9/28/17

1

ConcurrentsystemsLecture8:Casestudy- FreeBSDkernelconcurrency

Dr RobertN.M.Watson

1

FreeBSDkernel• Open-sourceOSkernel

– Large:millionsofLoC– Complex:thousandsofsubsystems,

drivers,...– Veryconcurrent:dozensorhundreds

ofCPUcores/hyperthreads– Widelyused:NetApp,EMC,Dell,

Apple,Juniper,Netflix,Sony,Panasonic,Cisco,Yahoo!,…

• Whyacasestudy?– ExtensivelyemploysC&DSprinciples– Concurrencyperformanceand

composabilityatscale• Considerdesignandevolution

2Inthelibrary:MarshallKirkMcKusick,GeorgeV.Neville-Neil,andRobertN.M.Watson.TheDesignandImplementationoftheFreeBSDOperatingSystem (2ndEdition),PearsonEducation,2014.

9/28/17

2

BSD+FreeBSD:abriefhistory• 1980sBerkeleyStandardDistribution(BSD)– ‘BSD’-styleopen-sourcelicense(MIT,ISC,CMU,…)– UNIXFastFileSystem(UFS/FFS),socketsAPI,DNS,usedTCP/IPstack,FTP,sendmail,BIND,cron,vi,…

• Open-sourceFreeBSDoperatingsystem1993:FreeBSD1.0withoutsupportformultiprocessing1998:FreeBSD3.0with“giant-lock”multiprocessing2003:FreeBSD5.0withfine-grainedlocking2005:FreeBSD6.0withmaturefine-grainedlocking2012:FreeBSD9.0withTCPscalabilitybeyond32cores

3

FreeBSD:beforemultiprocessing(1)

• ConcurrencymodelinheritedfromUNIX• Userspace– Preemptivemultitaskingbetween processes– Later,preemptivemultithreadingwithin processes

• Kernel– ‘Just’aCprogramrunning‘baremetal’– Internallymultithreaded

• Userthreadsoperating‘inkernel’(e.g.,insystemcalls)• Kernelservices(e.g.,asynchronousworkforVM,etc.)

4

9/28/17

3

FreeBSD:beforemultiprocessing(2)• Cooperativemultitaskingwithinkernel– Mutualexclusionaslongasyoudon’tsleep()– Impliedgloballockmeanslocallocksrarelyrequired– Exceptforinterrupthandlers,non-preemptivekernel– Criticalsectionscontrolinterrupt-handlerexecution

• Waitchannels:impliedconditionvariableperaddresssleep(&x, …); // Wait for event on &xwakeup(&x); // Signal an event on &x

– Mustleaveglobalstateconsistentwhencallingsleep()– Mustreloadanycachedlocalstateaftersleep()returns

• Usetobuildhigher-levelsynchronizationprimitives– E.g.,lockmgr()reader-writerlockcanbeheldoverI/O(sleep),usedinfilesystems

5

Pre-multiprocessorscheduling

6

Lotsofunexploitedparallelism!

9/28/17

4

Hardwareparallelism,synchronization

• Late1990s:multi-CPUbeginstomovedownmarket– In2000s:2-processorabigdeal– In2010s:64-coreisincreasinglycommon

• Coherent,symmetric,sharedmemorysystems– Instructionsforatomicmemoryaccess

• Compare-and-swap,test-and-set,loadlinked/storeconditional• SignalingviaInter-ProcessorInterrupts(IPIs)– CPUscantriggeraninterrupthandleroneachanother

• Vendorextensionsforperformance,programmability– MIPSinter-threadmessagepassing– IntelTMsupport:TSX (Whoops:HSW136!)

7

Giantlockingthekernel• FreeBSDfollowsfootstepsofCray,Sun,…• First,allowuserprogramstoruninparallel– Oneinstanceofkernelcode/datasharedbyallCPUs– Differentuserprocesses/threadsondifferentCPUs

• Giantspinlockaroundkernel– Acquireonsyscall/traptokernel;droponreturn– Ineffect:kernelrunsonatmostonceCPUatatime;‘migrates’betweenCPUsondemand

• Interrupts– IfinterruptdeliveredonCPUXwhilekernelisonCPUY,forwardinterrupttoYusinganIPI

8

9/28/17

5

Giant-lockedscheduling

9

Serialkernelexecution;parallelismopportunitymissed

Kernelgiant-lockcontention

Kernel-userparallelism

User-userparallelism

Fine-grainedlocking• Giant lockingisOKforuser-programparallelism• Kernel-centeredworkloadstriggerGiantcontention– Scheduler,IPC-intensiveworkloads– TCP/buffercacheonhigh-loadwebservers– Process-modelcontentionwithmultithreading(VM,…)

• Motivatesmigrationtofine-grainedlocking– Greatergranularity(may)affordgreaterparallelism

• Mutexes +conditionvariablesratherthansemaphores– Increasingconsensusonpthreads-likesynchronization– Explicitlocksareeasiertodebugthansemaphores– Supportforpriorityinheritance +prioritypropagation– E.g.,Linuxhasalsonowmigratedawayfromsemaphores

10

9/28/17

6

Fine-grainedscheduling

11

Truekernelparallelism

Howdoesthisworkinpractice?• Kernelisheavilymulti-threaded• Eachuserthreadhasacorrespondingkernelthread– Representsuserthreadwheninsyscall,pagefault,etc.

• Kernelsservicesoftenexecuteinasynchronousthreads– Interrupts,timers,I/O,networking,etc.

➡ Thereforeextensivesynchronization– Lockingmodelisalmostalwaysdata-oriented– Think‘monitors’ratherthan‘criticalsections’– Referencecountingorreader-writerlocksusedforstability– Higher-levelpatterns(producer-consumer,activeobjects,etc.)usedfrequently

• Avoidingdeadlockisanessentialaspectofthedesign12

9/28/17

7

Kernelthreadsinaction

13

robert@lemongrass-freebsd64:~> procstat –atPID TID COMM TDNAME CPU PRI STATE WCHAN 0 100000 kernel swapper 1 84 sleep sched0 100009 kernel firmware taskq 0 108 sleep -0 100014 kernel kqueue taskq 0 108 sleep -0 100016 kernel thread taskq 0 108 sleep -0 100020 kernel acpi_task_0 1 108 sleep -0 100021 kernel acpi_task_1 1 108 sleep -0 100022 kernel acpi_task_2 1 108 sleep -0 100023 kernel ffs_trim taskq 1 108 sleep -0 100033 kernel em0 taskq 1 8 sleep -1 100002 init - 0 152 sleep wait 2 100027 mpt_recovery0 - 0 84 sleep idle 3 100039 fdc0 - 1 84 sleep -4 100040 ctl_thrd - 0 84 sleep ctl_work5 100041 sctp_iterator - 0 84 sleep waiting_ 6 100042 xpt_thrd - 0 84 sleep ccb_scan7 100043 pagedaemon - 1 84 sleep psleep8 100044 vmdaemon - 1 84 sleep psleep9 100045 pagezero - 1 255 sleep pgzero10 100001 audit - 0 84 sleep audit_wo11 100003 idle idle: cpu0 0 255 run -11 100004 idle idle: cpu1 1 255 run -12 100005 intr swi4: clock 1 40 wait -12 100006 intr swi4: clock 0 40 wait -12 100007 intr swi3: vm 0 36 wait -12 100008 intr swi1: netisr 0 1 28 wait -12 100015 intr swi5: + 0 44 wait -12 100017 intr swi6: Giant task 0 48 wait -12 100018 intr swi6: task queue 0 48 wait -12 100019 intr swi2: cambio 1 32 wait -12 100024 intr irq14: ata0 0 12 wait -12 100025 intr irq15: ata1 1 12 wait -12 100026 intr irq17: mpt0 1 12 wait -12 100028 intr irq18: uhci0 0 12 wait -12 100034 intr irq16: pcm0 0 4 wait -12 100035 intr irq1: atkbd0 1 16 wait -12 100036 intr irq12: psm0 0 16 wait -

12 100037 intr irq7: ppc0 0 16 wait -12 100038 intr swi0: uart uart 0 24 wait -13 100010 geom g_event 0 92 sleep -13 100011 geom g_up 1 92 sleep -13 100012 geom g_down 1 92 sleep -14 100013 yarrow - 1 84 sleep -15 100029 usb usbus0 0 32 sleep -15 100030 usb usbus0 0 28 sleep -15 100031 usb usbus0 0 32 sleep USBWAIT 15 100032 usb usbus0 0 32 sleep -16 100046 bufdaemon - 0 84 sleep psleep17 100047 syncer - 1 116 sleep syncer18 100048 vnlru - 1 84 sleep vlruwt19 100049 softdepflush - 1 84 sleep sdflush104 100055 adjkerntz - 1 152 sleep pause615 100056 dhclient - 0 139 sleep select 667 100075 dhclient - 1 120 sleep select 685 100068 devd - 1 120 sleep wait798 100065 syslogd - 0 120 sleep select 895 100076 sshd - 0 120 sleep select 934 100052 login - 1 120 sleep wait935 100070 getty - 0 152 sleep ttyin936 100060 getty - 0 152 sleep ttyin937 100064 getty - 0 152 sleep ttyin938 100077 getty - 1 152 sleep ttyin939 100067 getty - 1 152 sleep ttyin940 100072 getty - 1 152 sleep ttyin941 100073 getty - 0 152 sleep ttyin9074 100138 csh - 0 120 sleep ttyin3023 100207 ssh-agent - 1 120 sleep select 3556 100231 sh - 0 123 sleep piperd3558 100216 sh - 1 124 sleep wait3559 100145 sh - 0 122 sleep vmo_de3560 100058 sh - 0 123 sleep piperd3588 100176 sshd - 0 122 sleep select 3590 101853 sshd - 1 122 run -3591 100069 tcsh - 0 152 sleep pause3596 100172 procstat - 0 172 run -

Kernel-internalconcurrencyisrepresentedusingafamiliarsharedmemorythreadingmodel

PID TID COMM TDNAME CPU PRI STATE WCHAN 11 100003 idle idle: cpu0 0 255 run -12 100024 intr irq14: ata0 0 12 wait -12 100025 intr irq15: ata1 1 12 wait -12 100008 intr swi1: netisr 0 1 28 wait -

3588 100176 sshd - 0 122 sleep select

Vasthoardsofthreadsrepresentconcurrentactivities

Device-driverinterruptsexecuteinkernelithreads

IdleCPUsareoccupiedbyanidlethread…why?

Asynchronouspacketprocessingoccursinanetisr ‘soft’ithread

Familiaruserspacethread:sshd,blockedinnetworkI/O(‘inkernel’)

WITNESSlock-orderchecker• Kernelreliesonpartiallockordertopreventdeadlock(Recalldiningphilosophers)– In-fieldlock-relateddeadlocksare(very)rare

• WITNESSisalock-orderdebuggingtool– Warnswhenlockcycles(could)arisebytrackingedges– Onlyindebuggingkernelsduetooverhead(15%+)

• Tracksbothstaticallydeclared,dynamiclockorders– Staticordersmostcommonlyintra-module– Dynamicordersmostcommonlyinter-module

• Deadlocksforconditionvariablesremainhardtodebug– WhatthreadshouldhavewokenupaCVbeingwaitedon?– Similartosemaphoreproblem

14

9/28/17

8

WITNESS:globallock-ordergraph*

15*Turnsoutthatthegloballock-ordergraphisprettycomplicated.

16*CommentaryonWITNESSfull-systemlock-ordergraphcomplexity;courtesyScottLong,Netflix

*

9/28/17

9

Excerptfromgloballock-ordergraph*

17*Thelocallock-ordergraphisalso complicated.

Thisbitmostlyhastodowithnetworking

Localclusters:e.g.,relatedlocksfromthefirewall:twoleafnodes;oneisheldovercallstoothersubsystems

Networkinterfacelocks:“transmit”occursatthe

bottomofcallstacksviamanylayersholdinglocks

Memoryallocatorlocksfollowmostotherlocks,sincemostkernelcomponentsrequirememoryallocation

WITNESSdebugoutput

18

1st 0xffffff80025207f0 run0_node_lock (run0_node_lock) @ /usr/src/sys/net80211/ieee80211_ioctl.c:13412nd 0xffffff80025142a8 run0 (network driver) @ /usr/src/sys/modules/usb/run/../../../dev/usb/wlan/if_run.c:3368

KDB: stack backtrace:db_trace_self_wrapper() at db_trace_self_wrapper+0x2akdb_backtrace() at kdb_backtrace+0x37_witness_debugger() at _witness_debugger+0x2cwitness_checkorder() at witness_checkorder+0x853_mtx_lock_flags() at _mtx_lock_flags+0x85run_raw_xmit() at run_raw_xmit+0x58ieee80211_send_mgmt() at ieee80211_send_mgmt+0x4d5domlme() at domlme+0x95setmlme_common() at setmlme_common+0x2f0ieee80211_ioctl_setmlme() at ieee80211_ioctl_setmlme+0x7eieee80211_ioctl_set80211() at ieee80211_ioctl_set80211+0x46fin_control() at in_control+0xadifioctl() at ifioctl+0xecekern_ioctl() at kern_ioctl+0xcdsys_ioctl() at sys_ioctl+0xf0amd64_syscall() at amd64_syscall+0x380Xfast_syscall() at Xfast_syscall+0xf7--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800de7aec, rsp = 0x7fffffffd848, rbp = 0x2a ---

Locknamesandsourcecodelocationsof

acquisitionsaddingtheoffendinggraphedge

Stacktracetoacquisitionthattriggeredcycle:802.11calledUSB;

previously,perhapsUSBcalled802.11?

9/28/17

10

Casestudy:thenetworkstack(1)• Whatisanetworkstack?– Kernel-residentlibraryofnetworkingroutines– Sockets,TCP/IP,UDP/IP,Ethernet,…

• Implementsuserabstractions,network-interfaceabstraction,protocolstatemachines,sockets,etc.– Systemcalls:socket(),connect(),send(),recv(),listen(),…

• Highlycomplexandconcurrentsubsystem– Composedfrommany(pluggable)elements– Socketlayer,networkdevicedrivers,protocols,…

• Typicalpaths‘up’and‘down’:packetscomein,goout

19

Network-stackworkflows

20

Applicationssend,receive,awaitdata

onsockets

Data/packetsprocessed;

dispatchedviaproducer-consumer

relationships

Packetsgoinandoutofnetwork

interfaces

Thework:adding/removingheaders,calculatingchecksums,fragmentation/defragmentation,segmentreassembly,reordering,flowcontrol,etc.

9/28/17

11

Casestudy:thenetworkstack(2)

• First,makeitsafe withouttheGiantlock– Lotsofdatastructuresrequirelocks– Conditionsignalingalreadyexistsbutwillbeaddedto– Establishkeyworkflows,lockorders

• Then,makeitfast– Especiallylockingprimitivesthemselves– Increaselockinggranularitywherethereiscontention

• Ashardwarebecomesmoreparallel,identifyandexploitfurtherconcurrencyopportunities– Addmorethreads,distributemorework

21

Whattolockandhow?• Fine-grainedlockingoverhead vs.contention– Somecontentionisinherent:necessarycommunication– Somecontentionisfalsesharing:sideeffectofstructures

• Principle:lockdata,notcode(i.e.,notcriticalsections)– Keystructures:NICs,sockets,workqueues,…– Independentstructureinstancesoftenhaveownlocks

• Horizontalvs.verticalparallelism– H:Differentlocksacrossconnections(e.g.,TCP1vs.TCP2)– H:Differentlockswithinalayer(e.g.,recv.vs.sendbuffers)– V:Differentlocksatdifferentlayers(e.g.,socketvs.TCP)

• Thingsnottolock:packetsinflight- mbufs (‘work’)

22

9/28/17

12

Example:UniversalMemoryAllocator(UMA)

• Keykernelservice• Slaballocator

– (Bonwick 1994)• Per-CPUcaches

– Individuallylocked– Amortise (oravoid)global

lockcontention• Someallocationpatterns

useonlyper-CPUcaches• Othersrequiredipping

intotheglobalpool

23

🔒🔒

🔒

Workdistribution• Packets(mbufs)areunitsofwork• Parallelworkrequiresdistributiontothreads• Mustkeeppacketsordered– orTCPgetscranky!• Implication:strongper-flowserialization– I.e.,nogeneralizedproducer-consumer/roundrobin– Variousstrategiestokeepworkordered;e.g.:

• Processinasinglethread• Multiplethreadsina‘pipeline’linkedbyaqueue

– Misordering OKbetweenflows,justnotwithinthem• Establishflow-CPUaffinity canbothorderprocessingandutilizecacheswell

24

9/28/17

13

Scalability

25

?

Performanceincreasemayreduceduetocontention,whichwastesresources

Keyidea:speedup

Asweaddmoreparallelism,wewouldlikethesystemtogetfaster.

Keyidea:performancecollapse

Sometimesparallelismhurtsperformancemorethanithelpsduetowork-distributionoverheads,

contention.

Longer-termstrategies

• Hardwarechangemotivatescontinuingwork– Optimizeinevitablecontention– Locklessprimitives– Read-mostlylocks,read-copy-update(RCU)– Per-CPUdatastructures– Betterdistributeworktomorethreadstoutilisegrowingcore/hyperthread count

• Optimise forlocality,notjustcontention:cache,NUMA,andI/Oaffinity– Ifcommunicationisessential,contentionisinevitable

26

9/28/17

14

Conclusions• FreeBSDemploysmanyofC&DStechniques– Multithreadingwithin(andover)thekernel– Mutualexclusion,conditionsynchronization– Partiallockorderwithdynamicchecking– Producer-consumer,locklessprimitives– AlsoWrite-AheadLogging(WAL)infilesystems,…

• Real-worldsystemsarereallycomplicated– Compositionisnotstraightforward– Parallelismperformancewinsarealotofwork– Hardwarecontinuestoevolve,placingpressureonsoftwaresystemstoutilise newparallelism

• Next:DistributedSystems!27