concurrent systems - university of cambridge · 10/23/16 1 concurrent systems case study: freebsd...

14
10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency Dr Robert N. M. Watson 1 FreeBSD kernel Open-source OS kernel Large: millions of LoC Complex: thousands of subsystems, drivers, ... Very concurrent: dozens or hundreds of CPU cores/threads Widely used: NetApp, EMC, Dell, Apple, Juniper, Netflix, Sony, Cisco, Yahoo!, … Why a case study? Employs C&DS principles Concurrency performance and composability at scale 2 In the library: Marshall Kirk McKusick, George V. Neville-Neil, and Robert N. M. Watson. The Design and Implementation of the FreeBSD Operating System (2nd Edition), Pearson Education, 2014.

Upload: lybao

Post on 04-Jan-2019

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

1

ConcurrentsystemsCasestudy:FreeBSDkernelconcurrency

Dr RobertN.M.Watson

1

FreeBSDkernel• Open-sourceOSkernel

– Large:millionsofLoC– Complex:thousandsof

subsystems,drivers,...– Veryconcurrent:dozensor

hundredsofCPUcores/threads

– Widelyused:NetApp,EMC,Dell,Apple,Juniper,Netflix,Sony,Cisco,Yahoo!,…

• Whyacasestudy?– EmploysC&DSprinciples– Concurrencyperformanceand

composability atscale

2Inthelibrary:MarshallKirkMcKusick,GeorgeV.Neville-Neil,andRobertN.M.Watson.TheDesignandImplementationoftheFreeBSDOperatingSystem (2ndEdition),PearsonEducation,2014.

Page 2: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

2

BSD+FreeBSD:abriefhistory• 1980sBerkeleyStandardDistribution(BSD)– ‘BSD’-styleopen-sourcelicense(MIT,ISC,CMU,…)– UNIXFastFileSystem(UFS/FFS),socketsAPI,DNS,usedTCP/IPstack,FTP,sendmail,BIND,cron,vi,…

• Open-sourceFreeBSDoperatingsystem1993:FreeBSD1.0withoutsupportformultiprocessing1998:FreeBSD3.0with“giant-lock”multiprocessing

2003:FreeBSD5.0withfine-grainedlocking2005:FreeBSD6.0withmaturefine-grainedlocking

2012:FreeBSD9.0withTCPscalabilitybeyond32cores

3

FreeBSD:beforemultiprocessing(1)

• ConcurrencymodelinheritedfromUNIX• Userspace– Preemptivemultitaskingbetween processes– Later,preemptivemultithreadingwithin processes

• Kernel– ‘Just’aCprogramrunning‘baremetal’– Internallymultithreaded– Userthreads‘inkernel’(e.g.,insystemcalls)– Kernelservices(e.g.,async.workforVM,etc.)

4

Page 3: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

3

FreeBSD:beforemultiprocessing(2)

• Cooperativemultitaskingwithinkernel– Mutualexclusionaslongasyoudon’tsleep()– Impliedgloballockmeanslocallocksrarelyrequired– Exceptforinterrupthandlers,non-preemptivekernel– Criticalsectionscontrolinterrupt-handlerexecution

• Waitchannels:impliedconditionvariableforeveryaddresssleep(&x, …); // Wait for event on &xwakeup(&x); // Signal an event on &x

– Mustleaveglobalstateconsistentwhencallingsleep()– Mustreloadanycachedlocalstateaftersleep()returns

• Usetobuildhigher-levelsynchronizationprimitives– E.g.,lockmgr()reader-writerlockcanbeheldoverI/O(sleep)

5

Pre-multiprocessorscheduling

6

Lotsofunexploitedparallelism!

Page 4: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

4

Hardwareparallelism,synchronization

• Late1990s:multi-CPUbeginstomovedownmarket– In2000s:2-processorabigdeal– In2010s:64-coreisincreasinglycommon

• Coherent,symmetric,sharedmemorysystems– Instructionsforatomicmemoryaccess

• Compare-and-swap,test-and-set,loadlinked/storeconditional• SignalingviaInter-ProcessorInterrupts(IPIs)– CPUscantriggeraninterrupthandleroneachanother

• Vendorextensionsforperformance,programmability– MIPSinter-threadmessagepassing– IntelTMsupport:TSX (Whoops:HSW136!)

7

Giantlockingthekernel• FreeBSDfollowsfootstepsofCray,Sun,…• First,allowuserprogramstoruninparallel– Oneinstanceofkernelcode/datasharedbyallCPUs– Differentuserprocesses/threadsondifferentCPUs

• Giantspinlockaroundkernel– Acquireonsyscall/traptokernel;droponreturn– Ineffect:kernelrunsonatmostonceCPUatatime;‘migrates’betweenCPUsondemand

• Interrupts– IfinterruptdeliveredonCPUXwhilekernelisonCPUY,forwardinterrupttoYusinganIPI

8

Page 5: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

5

Giant-lockedscheduling

9

Serialkernelexecution;parallelismopportunitymissed

Kernelgiant-lockcontention

Kernel-userparallelism

User-userparallelism

Fine-grainedlocking• Giant lockingisOKforuser-programparallelism• Kernel-centeredworkloadstriggerGiantcontention

– Scheduler,IPC-intensiveworkloads– TCP/buffercacheonhigh-loadwebservers– Process-modelcontentionwithmultithreading(VM,…)

• Motivatesmigrationtofine-grainedlocking– Greatergranularity(may)affordgreaterparallelism

• Mutexes/conditionvariablesratherthansemaphores– Increasingconsensusonpthreads-likesynchronization– Explicitlocksareeasiertodebugthansemaphores– Supportforpriorityinheritance +prioritypropagation– E.g.,Linuxisalsonowmigratingawayfromsemaphores

10

Page 6: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

6

Fine-grainedscheduling

11

Truekernelparallelism

Kernelsynchronizationprimitives• Spinlocks– Usedtoimplementthescheduler,interrupthandlers

• Mutexes,reader-writer,read-mostlylocks– Mostheavilyused– differentoptimizationtradeoffs– Canbeheldonlyover“bounded”computations– Adaptive:oncontention,sleepisexpensive;spinfirst– Sleepingdependsonscheduler,andhenceonspinlocks…

• Shared-eXclusive (SX)locks,conditionvariables– CanbeheldoverI/Oandotherunboundedwaits

• Conditionvariablesusablewithanylocktype• Mostprimitivessupportprioritypropagation

12

Page 7: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

7

WITNESSlock-orderchecker• Kernelreliesonpartiallockordertopreventdeadlock(Recalldiningphilosophers)– In-fieldlock-relateddeadlocksare(very)rare

• WITNESSisalock-orderdebuggingtool– Warnswhenlockcycles(could)arisebytrackingedges– Onlyindebuggingkernelsduetooverhead(15%+)

• Tracksbothstaticallydeclared,dynamiclockorders– Staticordersmostcommonlyintra-module– Dynamicordersmostcommonlyinter-module

• Deadlocksforconditionvariablesremainhardtodebug– WhatthreadshouldhavewokenupaCVbeingwaitedon?– Similartosemaphoreproblem

13

WITNESS:globallock-ordergraph*

14

*Turnsoutthatthegloballock-ordergraphisprettycomplicated.

Page 8: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

8

15

*CommentaryonWITNESSfull-systemlock-ordergraphcomplexity;courtesyScottLong,Netflix

*

Excerptfromgloballock-ordergraph*

16*Thelocallock-ordergraphisalso complicated.

Thisbitmostlyhastodowithnetworking

Localclusters:e.g.,relatedlocksfromthefirewall:twoleafnodes;oneisheldovercallstoothersubsystems

Networkinterfacelocks:“transmit”occursatthe

bottomofcallstacksviamanylayersholdinglocks

Memoryallocatorlocksfollowmostotherlocks,sincemostkernelcomponentsrequirememoryallocation

Page 9: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

9

WITNESSdebugoutput

17

1st 0xffffff80025207f0 run0_node_lock (run0_node_lock) @ /usr/src/sys/net80211/ieee80211_ioctl.c:13412nd 0xffffff80025142a8 run0 (network driver) @ /usr/src/sys/modules/usb/run/../../../dev/usb/wlan/if_run.c:3368

KDB: stack backtrace:db_trace_self_wrapper() at db_trace_self_wrapper+0x2akdb_backtrace() at kdb_backtrace+0x37_witness_debugger() at _witness_debugger+0x2cwitness_checkorder() at witness_checkorder+0x853_mtx_lock_flags() at _mtx_lock_flags+0x85run_raw_xmit() at run_raw_xmit+0x58ieee80211_send_mgmt() at ieee80211_send_mgmt+0x4d5domlme() at domlme+0x95setmlme_common() at setmlme_common+0x2f0ieee80211_ioctl_setmlme() at ieee80211_ioctl_setmlme+0x7eieee80211_ioctl_set80211() at ieee80211_ioctl_set80211+0x46fin_control() at in_control+0xadifioctl() at ifioctl+0xecekern_ioctl() at kern_ioctl+0xcdsys_ioctl() at sys_ioctl+0xf0amd64_syscall() at amd64_syscall+0x380Xfast_syscall() at Xfast_syscall+0xf7--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800de7aec, rsp = 0x7fffffffd848, rbp = 0x2a ---

Locknamesandsourcecodelocationsof

acquisitionsaddingtheoffendinggraphedge

Stacktracetoacquisitionthattriggeredcycle:802.11calledUSB;

previously,perhapsUSBcalled802.11?

Howdoesthisworkinpractice?• Kernelisheavilymulti-threaded• Eachuserthreadhasacorrespondingkernelthread– Representsuserthreadwheninsyscall,pagefault,etc.

• Kernelsservicesoftenexecuteinasynchronousthreads– Interrupts,timers,I/O,networking,etc.

• Thereforeextensivesynchronization– Lockingmodelisalmostalwaysdata-oriented– Think‘monitors’ratherthan‘criticalsections’– Referencecountingorreader-writerlocksusedforstability– Higher-levelpatterns(producer-consumer,activeobjects,etc.)usedfrequently

18

Page 10: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

10

Kernelthreadsinaction

19

robert@lemongrass-freebsd64:~> procstat –atPID TID COMM TDNAME CPU PRI STATE WCHAN 0 100000 kernel swapper 1 84 sleep sched0 100009 kernel firmware taskq 0 108 sleep -0 100014 kernel kqueue taskq 0 108 sleep -0 100016 kernel thread taskq 0 108 sleep -0 100020 kernel acpi_task_0 1 108 sleep -0 100021 kernel acpi_task_1 1 108 sleep -0 100022 kernel acpi_task_2 1 108 sleep -0 100023 kernel ffs_trim taskq 1 108 sleep -0 100033 kernel em0 taskq 1 8 sleep -1 100002 init - 0 152 sleep wait 2 100027 mpt_recovery0 - 0 84 sleep idle 3 100039 fdc0 - 1 84 sleep -4 100040 ctl_thrd - 0 84 sleep ctl_work5 100041 sctp_iterator - 0 84 sleep waiting_ 6 100042 xpt_thrd - 0 84 sleep ccb_scan7 100043 pagedaemon - 1 84 sleep psleep8 100044 vmdaemon - 1 84 sleep psleep9 100045 pagezero - 1 255 sleep pgzero10 100001 audit - 0 84 sleep audit_wo11 100003 idle idle: cpu0 0 255 run -11 100004 idle idle: cpu1 1 255 run -12 100005 intr swi4: clock 1 40 wait -12 100006 intr swi4: clock 0 40 wait -12 100007 intr swi3: vm 0 36 wait -12 100008 intr swi1: netisr 0 1 28 wait -12 100015 intr swi5: + 0 44 wait -12 100017 intr swi6: Giant task 0 48 wait -12 100018 intr swi6: task queue 0 48 wait -12 100019 intr swi2: cambio 1 32 wait -12 100024 intr irq14: ata0 0 12 wait -12 100025 intr irq15: ata1 1 12 wait -12 100026 intr irq17: mpt0 1 12 wait -12 100028 intr irq18: uhci0 0 12 wait -12 100034 intr irq16: pcm0 0 4 wait -12 100035 intr irq1: atkbd0 1 16 wait -12 100036 intr irq12: psm0 0 16 wait -

12 100037 intr irq7: ppc0 0 16 wait -12 100038 intr swi0: uart uart 0 24 wait -13 100010 geom g_event 0 92 sleep -13 100011 geom g_up 1 92 sleep -13 100012 geom g_down 1 92 sleep -14 100013 yarrow - 1 84 sleep -15 100029 usb usbus0 0 32 sleep -15 100030 usb usbus0 0 28 sleep -15 100031 usb usbus0 0 32 sleep USBWAIT 15 100032 usb usbus0 0 32 sleep -16 100046 bufdaemon - 0 84 sleep psleep17 100047 syncer - 1 116 sleep syncer18 100048 vnlru - 1 84 sleep vlruwt19 100049 softdepflush - 1 84 sleep sdflush104 100055 adjkerntz - 1 152 sleep pause615 100056 dhclient - 0 139 sleep select 667 100075 dhclient - 1 120 sleep select 685 100068 devd - 1 120 sleep wait798 100065 syslogd - 0 120 sleep select 895 100076 sshd - 0 120 sleep select 934 100052 login - 1 120 sleep wait935 100070 getty - 0 152 sleep ttyin936 100060 getty - 0 152 sleep ttyin937 100064 getty - 0 152 sleep ttyin938 100077 getty - 1 152 sleep ttyin939 100067 getty - 1 152 sleep ttyin940 100072 getty - 1 152 sleep ttyin941 100073 getty - 0 152 sleep ttyin9074 100138 csh - 0 120 sleep ttyin3023 100207 ssh-agent - 1 120 sleep select 3556 100231 sh - 0 123 sleep piperd3558 100216 sh - 1 124 sleep wait3559 100145 sh - 0 122 sleep vmo_de3560 100058 sh - 0 123 sleep piperd3588 100176 sshd - 0 122 sleep select 3590 101853 sshd - 1 122 run -3591 100069 tcsh - 0 152 sleep pause3596 100172 procstat - 0 172 run -

Kernel-internalconcurrencyisrepresentedusingafamiliarsharedmemorythreadingmodel

PID TID COMM TDNAME CPU PRI STATE WCHAN 11 100003 idle idle: cpu0 0 255 run -12 100024 intr irq14: ata0 0 12 wait -12 100025 intr irq15: ata1 1 12 wait -12 100008 intr swi1: netisr 0 1 28 wait -

3588 100176 sshd - 0 122 sleep select

Vasthoardsofthreadsrepresentconcurrentactivities

Device-driverinterruptsexecuteinkernelithreads

IdleCPUsareoccupiedbyanidlethread…why?

Asynchronouspacketprocessingoccursinanetisr ‘soft’ithread

Familiaruserspacethread:sshd,blockedinnetworkI/O(‘inkernel’)

Casestudy:thenetworkstack(1)• Whatisanetworkstack?– Kernel-residentlibraryofnetworkingroutines– Sockets,TCP/IP,UDP/IP,Ethernet,…

• Implementsuserabstractions,network-interfaceabstraction,protocolstatemachines,sockets,etc.– Systemcalls:socket(),connect(),send(),recv(),listen(),…

• Highlycomplexandconcurrentsubsystem– Composedfrommany(pluggable)elements– Socketlayer,networkdevicedrivers,protocols,…

• Typicalpaths‘up’and‘down’:packetscomein,goout

20

Page 11: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

11

Network-stackworkflows

21

Applicationssend,receive,awaitdata

onsockets

Data/packetsprocessed;

dispatchedviaproducer-consumer

relationships

Packetsgoinandoutofnetwork

interfaces

Thework:adding/removingheaders,calculatingchecksums,fragmentation/defragmentation,segmentreassembly,reordering,flowcontrol,etc.

Casestudy:thenetworkstack(2)• First,makeitsafe withouttheGiantlock– Lotsofdatastructuresrequirelocks– Conditionsignalingalreadyexistsbutwillbeaddedto– Establishkeyworkflows,lockorders

• Then,makeitfast– Especiallylockingprimitivesthemselves– Increaselockinggranularitywherethereiscontention

• Ashardwarebecomesmoreparallel,identifyandexploitfurtherconcurrencyopportunities– Addmorethreads,distributemorework

22

Page 12: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

12

Whattolockandhow?• Fine-grainedlockingoverhead vs.contention

– Somecontentionisinherent:reflectsnecessarycommunication– Somecontentionisfalsesharing:sideeffectofstructurechoices

• Principle:lockdata,notcode(i.e.,notcriticalsections)– Keystructures:networkinterfaces,sockets,workqueues– Independentstructureinstancesoftenhavetheirownlocks

• Horizontalvs.verticalparallelism– H:Differentlocksfordifferentconnections(e.g.,TCP1vs.TCP2)– H:Differentlockswithinalayer(e.g.,receivevs.sendbuffers)– V:Differentlocksatdifferentlayers(e.g.,socketvs.TCPstate)

• Thingsnottolock:packetsinflight- mbufs (‘work’)

23

Example:UniversalMemoryAllocator(UMA)

• Keykernelservice• Slaballocator

– (Bonwick 1994)• Per-CPUcaches

– Individuallylocked– Amortise (oravoid)global

lockcontention• Someallocationpatterns

useonlyper-CPUcaches• Othersrequiredipping

intotheglobalpool

24

🔒🔒

🔒

Page 13: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

13

Workdistribution• Packets(mbufs)areunitsofwork• Parallelworkrequiresdistributiontothreads– Mustkeeppacketsordered– orTCPgetscranky!

• Implication:strongper-flowserialization– I.e.,nogeneralizedproducer-consumer/roundrobin– Variousstrategiestokeepworkordered;e.g.:

• Processinasinglethread• Multiplethreadsina‘pipeline’linkedbyaqueue

– Misordering allowedbetweenflows,justnotwithinthem• Establishflow-CPUaffinitycanbothorderprocessingandutilizecacheswell

25

Scalability

26

?

Whatmightweexpectifwedidn’thitcontention?

Keyidea:speedup

Asweaddmoreparallelism,wewouldlikethesystemtogetfaster.

Keyidea:performancecollapse

Sometimesparallelismhurtsperformancemorethanithelpsduetowork-distributionoverheads,

contention.

Page 14: Concurrent systems - University of Cambridge · 10/23/16 1 Concurrent systems Case study: FreeBSD kernel concurrency DrRobert N. M. Watson 1 FreeBSD kernel • Open-source OS kernel

10/23/16

14

Longer-termstrategies

• Hardwarechangemotivatescontinuingwork– Optimizeinevitablecontention– Locklessprimitives– rmlocks,read-copy-update(RCU)– Per-CPUdatastructures– Distributeworktomorethreads..toutilisegrowingcorecount

• Optimise forlocality,notjustcontention:cache,NUMA,andI/Oaffinity

27

Conclusions• FreeBSDemploysmanyofC&DStechniques– Mutualexclusion,conditionsynchronization– Producer-consumer,locklessprimitives– AlsoWrite-AheadLogging(WAL)infilesystems

• Real-worldsystemsarereallycomplicated– Hopefully,youwillmostlyconsume,ratherthanproduce,concurrencyprimitiveslikethese

– Compositionisnotstraightforward– Parallelismperformancewinsarealotofwork– Hardwarecontinuestoevolve

• SeeyouinDistributedSystems!

28