
High Performance Data Center Operating Systems. Tom Anderson, with Antoine Kaufmann, Youngjin Kwon, Naveen Kr. Sharma, Arvind Krishnamurthy, Simon Peter, Mothy Roscoe, and Emmett Witchel.


Page 1: High Performance Data Center Operating Systems (homes.cs.washington.edu/~tom/talks/os.pdf)

High Performance Data Center Operating Systems

Tom Anderson, with Antoine Kaufmann, Youngjin Kwon, Naveen Kr. Sharma, Arvind Krishnamurthy, Simon Peter, Mothy Roscoe, and Emmett Witchel

Page 2:

An OS for the Data Center

• Server I/O performance matters
  • Key-value stores, web & file servers, databases, mail servers, machine learning, …
• Can we deliver performance close to hardware?
  • Example system: Dell PowerEdge R520
    • Intel X520 10G NIC (2 us / 1KB packet)
    • Intel RS3 RAID, 1GB flash-backed cache (25 us / 1KB write)
    • Sandy Bridge CPU, 6 cores, 2.2 GHz

Today's I/O devices are fast and getting faster:
40G NIC: 500 ns / 1KB packet; NVDIMM: 500 ns / 1KB write

Page 3:

Can’t we just use Linux?

Page 4:

Linux I/O Performance

% of 1KB Redis request time spent:

         SET    GET
HW       13%    18%
Kernel   84%    62%
App       3%    20%

[Figure: the kernel data path between app and device: API, multiplexing, naming, resource limits, access control, I/O scheduling, I/O processing, copying, protection. The 10G NIC takes 2 us per 1KB packet, but the path costs 9 us; RAID storage takes 25 us per 1KB write, but the path costs 163 us.]

Kernel mediation is too heavyweight

Page 5:

Arrakis: Separate the OS control and data plane

Page 6:

How to skip the kernel?

[Figure: Redis accesses the I/O devices directly. The kernel keeps naming, resource limits, and access control; the data path (API, multiplexing, I/O scheduling, I/O processing, copying, protection) is taken out of the kernel.]

Page 7:

Arrakis I/O Architecture

[Figure: control plane vs. data plane. The kernel control plane keeps naming, resource limits, and access control; the data plane (API, multiplexing, I/O scheduling, I/O processing, protection) runs between Redis and the I/O devices.]

Page 8:

Design Goals

• Streamline network and storage I/O
  • Eliminate OS mediation in the common case
  • Application-specific customization vs. kernel one-size-fits-all
• Keep OS functionality
  • Process (container) isolation and protection
  • Resource arbitration, enforceable resource limits
  • Global naming, sharing semantics
• POSIX compatibility at the application level
  • Additional performance gains from rewriting the API

Page 9:

This Talk

Arrakis (OSDI 14)
• OS architecture that separates the control and data plane, for both networking and storage

Strata (SOSP 17)
• File system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)

TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)
• OS, NIC, and app library support for fast, agile, secure protocol processing

Page 10:

Storage diversification

         Latency   Throughput   $/GB
DRAM     80 ns     200 GB/s     10.8
NVDIMM   200 ns    20 GB/s      2
SSD      10 us     2.4 GB/s     0.26
HDD      10 ms     0.25 GB/s    0.02

NVM is byte-addressable: cache-line granularity I/O, direct access with load/store instructions.
SSDs have large erasure blocks: hardware GC overhead; random writes cause a 5-6x slowdown from GC.
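The $/GB column makes the cost argument on the following slides easy to check. A small sketch (assuming 1 TB = 1000 GB; the device names and prices are taken from the table above):

```python
# Check the deck's cost claim: at $2/GB, 100 TB of NVDIMM costs about
# $200K (the figure quoted a few slides later), vs. roughly $26K for
# SSD and $2K for HDD.

PRICE_PER_GB = {"NVDIMM": 2.0, "SSD": 0.26, "HDD": 0.02}  # from the table
CAPACITY_GB = 100 * 1000  # 100 TB, counting 1 TB as 1000 GB

cost = {dev: price * CAPACITY_GB for dev, price in PRICE_PER_GB.items()}
```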

Page 11:

Let's Build a Fast Server

Key-value store, database, file server, mail server, …

Requirements:
• Small updates dominate
• Data set scales up to many terabytes
• Updates must be crash consistent

Page 12:

A fast server on today's file system

• Small updates (1 Kbytes) dominate
• Data set scales up to 100TB
• Updates must be crash consistent

Small, random I/O is slow! Even with an optimized NVM kernel file system (NOVA [FAST 16, SOSP 17]), NVM is too fast and the kernel is the bottleneck: 91% of the 1KB I/O latency is kernel code, not the write to the device.

[Chart: 1KB I/O latency (us), split into kernel code vs. write to device.]

Page 13:

A fast server on today's file system

• Small updates (1 Kbytes) dominate
• Data set scales up to 100TB
• Updates must be crash consistent

Using only NVM is too expensive: $200K for 100TB. To save cost, we need a way to use multiple device types: NVM, SSD, HDD.

Page 14:

A fast server on today's file system

• Small updates (1 Kbytes) dominate
• Data set scales up to 10TB
• Updates must be crash consistent

Block-level caching (NVM/SSD/HDD) manages data in blocks, but NVM is byte-addressable!

[Chart: 1KB I/O latency (us), block-level caching vs. byte-addressed NVM; lower is better.]

For low-cost capacity with high performance, we must leverage multiple device types.

Page 15:

A fast server on today's file system

• Small updates (1 Kbytes) dominate
• Data set scales up to 10TB
• Updates must be crash consistent

Applications struggle for crash consistency.

[Chart: crash vulnerabilities found in SQLite, ZooKeeper, HSQLDB, and Git; Pillai et al., OSDI 2014.]

Page 16:

Today's file systems: limited by old design assumptions

• Kernel mediates every operation, but NVM is so fast that the kernel is the bottleneck
• Tied to a single type of device, but low-cost capacity with high performance requires multiple device types (NVM, SSD, HDD)
• Aggressive caching in DRAM, only writing to the device when you must (fsync), so applications struggle for crash consistency

Page 17:

Strata: A Cross Media File System

Performance: especially small, random I/O
• Fast user-level device access

Capacity: leverage NVM, SSD & HDD for low cost
• Transparent data migration across different media
• Efficiently handle device I/O properties

Simplicity: intuitive crash consistency model
• In-order, synchronous I/O
• No fsync() required

Page 18:

Strata: main design principle

Performance, simplicity (LibFS): log operations to NVM at user level
• Intuitive crash consistency; kernel bypass, but private

Capacity (KernelFS): digest and migrate data in kernel
• Apply log operations to shared data; coordinate multi-process accesses

Page 19:

Strata

• LibFS: log operations to NVM at user level
  • Fast user-level access
  • In-order, synchronous I/O
• KernelFS: digest and migrate data in kernel
  • Asynchronous digest
  • Transparent data migration
  • Shared file access

Page 20:

Log operations to NVM at user-level

• Fast writes: directly access fast NVM
• Sequentially append data: cache-line granularity, blind writes
• Crash consistency: on crash, the kernel replays the log

File operations (data & metadata) made by a single system call (creat, rename, …) are appended by LibFS to a private operation log in NVM, bypassing the kernel.
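The logging idea can be sketched in a few lines (illustrative Python, not Strata's actual code): the user-level library appends whole operations to an append-only log, and the kernel can replay them in order after a crash or at digest time.

```python
# Sketch: an append-only operation log written at user level, replayed
# by the kernel into a shared area. The list stands in for a persistent
# NVM region; op names mirror the slide (create, write, rename).

class OpLog:
    def __init__(self):
        self.entries = []  # stand-in for the private NVM log

    def append(self, op, *args):
        # In Strata this append is the durability point: once the cache
        # lines reach NVM, the operation survives a crash.
        self.entries.append((op, *args))

    def replay(self, shared):
        # Apply logged operations, in order, to the shared area.
        for op, *args in self.entries:
            if op == "create":
                shared[args[0]] = b""
            elif op == "write":
                name, data = args
                shared[name] = shared.get(name, b"") + data
            elif op == "rename":
                old, new = args
                shared[new] = shared.pop(old)
        self.entries.clear()
        return shared

log = OpLog()
log.append("create", "a.txt")
log.append("write", "a.txt", b"hello")
log.append("rename", "a.txt", "b.txt")
shared = log.replay({})
```

Each `append` is synchronous and in order, which is what makes the crash consistency model on the next slide intuitive.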

Page 21:

Intuitive crash consistency

Synchronous I/O: when each system call returns,
• data/metadata is durable;
• updates are in order;
• the write is atomic (up to a limited size: the log size).

Fast synchronous I/O comes from NVM and kernel bypass. fsync() is a no-op.

Page 22:

Digest data in kernel

• Visibility: make the private log visible to other applications
• Data layout: turn write-optimized format into read-optimized format
• Large, batched I/O
• Coalesce the log

[Figure: the application's LibFS writes to its private operation log in NVM; KernelFS digests the log into the shared area.]

Page 23:

Digest optimization: log coalescing

SQLite and mail servers make crash-consistent updates using write-ahead logging:
1. Create journal file
2. Write data to journal file
3. Write data to database file
4. Delete journal file

Digest eliminates unneeded work: the temporary durable writes are removed, leaving only "write data to database file". Throughput optimization: log coalescing saves I/O while digesting.
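A minimal sketch of the coalescing rule (illustrative, not Strata's implementation): any operation on a file that a later record in the same log deletes can be dropped before digesting, so the journal's create/write/delete disappear and only the database write is digested.

```python
# Coalesce a digest batch: drop every operation on a file that is
# deleted later in the same log. Log records are (op, name[, data]).

def coalesce(log):
    deleted = {name for op, name, *_ in log if op == "delete"}
    out = []
    for op, name, *rest in log:
        if name in deleted:        # journal file: created then deleted,
            continue               # so all its operations are unneeded
        out.append((op, name, *rest))
    return out

wal = [
    ("create", "journal"),
    ("write",  "journal", b"new row"),
    ("write",  "db",      b"new row"),
    ("delete", "journal"),
]
digested = coalesce(wal)
```

Here four logged operations digest down to one, which is the effect the Varmail numbers later in the talk quantify.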

Page 24:

Digest and migrate data in kernel

[Figure: Application → Strata LibFS (private operation log in NVM) → Strata KernelFS (shared area in NVM).]

Page 25:

Digest and migrate data in kernel

Low-cost capacity: KernelFS migrates data to lower layers (NVM shared area, then SSD shared area, then HDD shared area); the design resembles a log-structured merge (LSM) tree.

Handle device I/O properties: write 1GB sequentially from NVM to SSD to avoid SSD garbage collection overhead.

Page 26:

Device management overhead

[Chart: SSD throughput (MB/s) vs. SSD utilization (0.1 to 1.0) for sequential writes of 64 MB to 1024 MB.]

SSDs prefer large sequential I/O; use the NVM layer as a persistent write buffer.

Page 27:

Read: hierarchical search

LibFS searches for data in order: (1) the private operation log, (2) the NVM shared area, (3) the SSD shared area, (4) the HDD shared area.
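The search order above can be sketched directly (illustrative Python; plain dicts stand in for the log and the per-device shared areas):

```python
# Hierarchical read: check the private log first, then each shared
# tier from fastest (NVM) to slowest (HDD); first hit wins.

def read(name, log, nvm, ssd, hdd):
    for tier in (log, nvm, ssd, hdd):   # search order 1..4
        if name in tier:
            return tier[name]
    raise FileNotFoundError(name)

log = {"hot.txt": b"just written"}
nvm = {"warm.txt": b"digested"}
ssd = {}
hdd = {"cold.txt": b"migrated"}
```

A freshly written file is served from the log before it is ever digested; a cold file is served from the tier it migrated to.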

Page 28:

Shared file access

• Leases grant access rights to applications [SOSP '89]
• Required for files and directories
• Function like locks, but revocable
• Exclusive writer, shared readers

On revocation, LibFS digests the leased data; leases serialize concurrent updates.

Page 29:

Shared file access

Leases grant access rights to applications, applied to a directory or a file; exclusive writer, shared readers.

Example: concurrent writes to the same file A.
1. Application 1's LibFS requests a write lease for file A from KernelFS, then writes file A's data to its operation log (OP log 1).
2. Application 2's LibFS requests a write lease for file A; KernelFS revokes Application 1's lease, and Application 1's logged data is digested to the shared area.
3. Application 2's LibFS writes file A's data to its own log (OP log 2).
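The revocation protocol in the example can be sketched as follows (illustrative Python; class and method names are made up for the sketch, not Strata's API):

```python
# Revocable write lease: one exclusive writer at a time. When a new
# application requests the lease, the current holder must digest its
# logged updates to the shared area before the lease moves.

class LeaseManager:
    def __init__(self):
        self.holder = None

    def acquire_write(self, app):
        if self.holder is not None and self.holder is not app:
            self.holder.digest()      # revoke: flush the holder's log
        self.holder = app

class App:
    def __init__(self, shared):
        self.log, self.shared = [], shared

    def write(self, data):
        self.log.append(data)         # user-level, kernel-bypass write

    def digest(self):
        self.shared.extend(self.log)  # make updates visible to others
        self.log.clear()

shared = []
mgr = LeaseManager()
app1, app2 = App(shared), App(shared)
mgr.acquire_write(app1)
app1.write("A1 v1")
mgr.acquire_write(app2)               # revokes app1, digesting its log
app2.write("A2 v1")
```

Because the digest happens before the lease changes hands, Application 2 always sees Application 1's updates, which is how leases serialize concurrent writers.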

Page 30:

Experimental setup

• 2x Intel Xeon E5-2640 CPU, 64GB DRAM
• 400GB NVMe SSD, 1TB HDD
• Ubuntu 16.04 LTS, Linux kernel 4.8.12

• Emulated NVM
  • Uses 40GB of DRAM
  • Performance model [Y. Zhang et al., MSST 2015]
  • Throttles latency & throughput in software

• Compare Strata vs. PMFS, NOVA, ext4-DAX (NVM kernel file systems)
  • NOVA: atomic update, in-order synchronous I/O
  • PMFS, ext4-DAX: no atomic write

Page 31:

Latency: LevelDB (NVM)

[Chart: latency (us) for synchronous write, random write, and random delete; Strata vs. PMFS, NOVA, EXT4-DAX; lower is better. Off-scale bars: 35.2, 49.2, and 37.7 us.]

Workload: 16B keys, 1KB values, 300,000 objects. Level compaction causes asynchronous digests. Fast user-level logging: random write is 25% better than PMFS; overwrite is 17% better than PMFS.

Page 32:

Throughput: Varmail

Mail server workload from Filebench: NVM only, 10,000 files, 1:1 read/write ratio, write-ahead logging.

[Chart: throughput (op/s); Strata is 29% better than NOVA.]

Log coalescing eliminates 86% of log records, saving 14GB of I/O: the digest drops the temporary durable journal writes (create journal, write data to journal, delete journal), leaving only the write to the database file.

Page 33:

This Talk

Arrakis (OSDI 14)
• OS architecture that separates the control and data plane, for both networking and storage

Strata (SOSP 17)
• File system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)

TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)
• OS, NIC, and app library support for fast, agile, secure protocol processing

Page 34:

Let's Build a Fast Server

Key-value store, database, mail server, ML, …

Requirements:
• Mostly small RPCs over TCP
• 40Gbps network links (100+ Gbps soon)
• Enforceable resource sharing (multi-tenant)
• Agile protocol development: kernel and app
• Tail latency, cost-efficient hardware

Page 35:

Let's Build a Fast Server

• Small RPCs dominate
• Enforceable resource sharing
• Agile protocol development
• Cost-efficient hardware

Kernel mediation is too slow: with the Linux network stack between the application and a 40Gbps NIC, overhead is 97%.

Page 36:

Hardware I/O Virtualization

• Direct access to the device at user level
• Multiplexing
  • SR-IOV: virtual PCI devices with their own registers, queues, interrupts
• Protection
  • IOMMU: DMA to/from app virtual memory
  • Packet filters, e.g., legal source IP header
• mTCP: 2-3x faster than Linux
  • But who enforces congestion control?

[Figure: SR-IOV NIC with packet filters and rate limiters multiplexing user-level VNIC 1 and VNIC 2 onto the network.]

Page 37:

Remote DMA (RDMA)

Programming model: read/write to a (limited) region of remote server memory.
• Model dates to the 80's (Alfred Spector)
• HPC community revived it for communication within a rack
• Extended to the data center over Ethernet (RoCE)
• Commercially available 100G NICs

No CPU involvement on the remote node
• Fast if the app can use the programming model

Limitations:
• What if you need remote application computation (RPC)?
• The lossless model is performance-fragile

Page 38:

Smart NICs (Cavium, …)

NIC with an array of low-end CPU cores. If we compute on the NIC, maybe we don't need to go to the CPU?
• Applications in high-speed trading

We've been here before: the "wheel of reinvention"
• Hardware is relatively expensive
• Apps often run slower on the NIC than on the CPU (cf. Floem)

Page 39:

Step 1

Build a faster kernel TCP in software, with no change to isolation, resource allocation, or the API.

Q: Why is RPC over Linux TCP so slow?

Page 40:

OS - Hardware Interface

• Highly optimized code path
• Buffer descriptor queues
• No interrupt in the common case
• Maximize concurrency

[Figure: per-core Tx/Rx descriptor queue pairs connecting the operating system (cores 1-4) to the network interface card.]

Page 41:

OS Transmit Packet Processing

• TCP layer: move from socket buffer to IP queue
  • Lock socket
  • Congestion/flow control limit
  • Fill in TCP header, calculate checksum
  • Copy data
  • Arm re-transmission timeout
• IP layer: firewall, routing, ARP, traffic shaping
• Driver: move from IP queue to NIC queue; allocate and free packet buffers

Page 42:

Sidebar: Tail Latency

On Linux with a 40Gbps link, 400 outbound TCP flows sending RPCs, and no congestion, what is the minimum rate across all flows?

Page 43:

Kernel and Socket Overhead

events = poll(...)
for e in events:
    if e.socket != listen_sock:
        receive(e.socket, ...)
        ...
        send(e.socket, ...)

Multiple synchronous kernel transitions:
• Parameter checks and copies
• Cache pollution, pipeline stalls

[Figure: application and kernel exchanging synchronous system calls.]
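The loop on this slide can be made runnable to show the kernel transitions concretely (a sketch using Python's `selectors` over a socketpair; `serve_once` is an illustrative name, not from the talk):

```python
# Each serviced request costs several synchronous kernel transitions:
# poll/select, recv, and send, each with parameter checks and copies.
import selectors
import socket

def serve_once(sel):
    events = sel.select(timeout=1)       # kernel transition 1: poll
    for key, _ in events:
        data = key.fileobj.recv(4096)    # kernel transition 2: receive
        key.fileobj.send(data.upper())   # kernel transition 3: send

client, server = socket.socketpair()
sel = selectors.DefaultSelector()
sel.register(server, selectors.EVENT_READ)

client.send(b"ping")
serve_once(sel)                          # server echoes, uppercased
reply = client.recv(4096)
```

Three crossings for a few bytes of payload is the overhead that TaS, on the next slide, moves off the application's cores.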

Page 44:

TCP Acceleration as a Service (TaS)

• TCP as a user-level OS service
  • SR-IOV to dedicated cores
  • Scale the number of cores up/down to match demand
  • Optimized data plane for common-case operations
• Application uses its own dedicated cores
  • Avoid polluting the application-level cache
  • Fewer cores => better performance scaling
• To the application: per-socket tx/rx queues with doorbells
  • Analogous to hardware device tx/rx queues

Page 45:

Streamline the common-case data path

• Remove unneeded computation from the data path
  • Congestion control and timeouts per RTT (not per packet)
• Minimize per-flow TCP state
  • Prefetch 2 cache lines on packet arrival
• Linearized code
  • Better branch prediction
  • Super-scalar execution
• Enforce IP-level access control on the control plane at connection setup

Page 46:

Small RPC Microbenchmark

[Chart: throughput for core configurations App:1/FP:1, App:4/FP:7, App:8/FP:11; up to 4.9x.]

• IX: fast kernel TCP with syscall batching, non-socket API
• Latency: TaS is 7.3x better than Linux and 2.9x better than IX

Page 47:

TAS is workload proportional

• Setup: 4 clients starting every 10 seconds, then stopping incrementally

[Chart: fast-path cores (#, 1-9) and latency (us) over time (s); TAS adds and removes fast-path cores as the offered load changes.]

Page 48:

Step 2

TCP as a Service can saturate a 40Gbps link with small RPCs, but what about 100Gbps or 400Gbps links? Network link speeds are scaling up faster than cores.

What NIC hardware do we need for fast data center communication? The TCP as a Service data plane can be efficiently built in hardware.

Page 49:

FlexNIC Design Principles

• RPCs are the common case
  • Kernel bypass to application logic
• Enforceable per-flow resource sharing
  • Data plane in hardware, policy in kernel
• Agile protocol development
  • Protocol agnostic (e.g., Timely, DCTCP, and RDMA)
  • Offload both kernel and app packet handling
• Cost-efficient
  • Minimal instruction set for packet processing

Page 50:

FlexNIC Hardware Model

A programmable parser over the packet stream (Eth / IPv4 / TCP / UDP / RPC headers), feeding match-action stages:
• TCAM for arbitrary wildcard matches
• SRAM for exact/LPM lookups
• Stateful memory for counters
• ALUs for modifying headers and registers

Example match-action program:
1. p = lookup(eth.dst_mac)
2. pkt.egress_port = p
3. counter[ipv4.src_ip]++

Barefoot Networks switch RMT: ~6 Tbps (with parallel streams)

Page 51:

FlexNIC Hardware Model

Transform packets for efficient processing in software:
• DMA directly into and out of application data structures
• Send acknowledgements on the NIC
• Queue manager implements rate limits
• Improve locality by steering to cores based on app criteria

[Figure: Rx, Tx, DMA, and doorbell pipelines plus a queue manager, sitting between the network and PCIe DMA.]

Page 52:

FlexTCP: H/W Accelerated TCP

• Fast path is simple enough for the FlexNIC model
• Applications directly access the NIC for RX/TX
  • Similar interface to TCP as a Service: in-memory queues
  • Software slow path manages NIC state
• Streamlines NIC processing
  • ACKs consumed/generated in the NIC, reducing PCIe traffic
  • No descriptor queues with dependent DMA reads
• Evaluation: software FlexNIC emulator

Page 53:

FlexTCP Performance

• Latency: 7.8x better vs. Linux

[Chart: throughput 10.7x vs. Linux / 4.1x vs. mTCP, and 7.2x vs. Linux / 2.2x vs. mTCP.]

Page 54:

Streamlining App RPC Processing

• How do we further reduce CPU overhead?
  • Integration of app logic with network code
  • Remove the socket abstraction, streamline app code
• But fundamental app bottlenecks remain
  • Data structure synchronization, cache utilization
  • Copies between app data structures and RPC buffers
• Idea: leverage FlexNIC to streamline the app
  • FlexNIC is protocol agnostic, so it can parse the app protocol

Page 55:

Example: Key-Value Store

Receive-side scaling: core = hash(connection) % N

[Figure: clients 1-3 request keys (K=3,4; K=1,4,7; K=1,7,8) over separate connections; the NIC spreads connections across cores 1-2, so the same hash-table entries are touched by multiple cores.]

• Lock contention
• Poor cache utilization

Page 56:

Optimizing Reads: Key-based Steering

Match:  IF udp.port == kvs
Action: core = HASH(kvs.key) % 2
        DMA hash, kvs TO Cores[core]

[Figure: the NIC hashes the key in each request and steers it, so each core serves a disjoint partition of the hash table.]

• No locks needed
• Better cache utilization
• 40% higher CPU efficiency
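The contrast between the two steering rules is easy to see in a sketch (illustrative Python; `HASH` from the match-action rule is modeled with Python's built-in `hash()`):

```python
# Key-based steering vs. connection-based receive-side scaling: hash
# the application-level key (which the NIC parser extracts) instead of
# the connection, so every request for a key lands on the same core.

NUM_CORES = 2

def rss_core(connection):
    # RSS: same connection -> same core, but the same key arriving on
    # different connections can land on different cores.
    return hash(connection) % NUM_CORES

def key_steered_core(key):
    # Key steering: same key -> same core, regardless of the client.
    return hash(key) % NUM_CORES

# Three clients all read key 7; key steering delivers all three
# requests to one core, so its hash-table partition needs no locks.
cores = {key_steered_core(7) for _ in range(3)}
```

With one core owning each key, the partitioned hash table needs no locks and stays hot in that core's cache, which is where the 40% efficiency gain comes from.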

Page 57:

Summary

Arrakis (OSDI 14)
• OS architecture that separates the control and data plane, for both networking and storage

Strata (SOSP 17)
• File system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)

TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)
• OS, NIC, and app library support for fast, agile, secure protocol processing

Page 58:

Biography

• College: physics -> psychology -> philosophy
  • Took three CS classes as a senior
• After college: developed an OS for a Z80
  • After the project shipped, it got cancelled
  • Applied to grad school: seven out of eight turned me down
• Grad school
  • Learned a lot
  • Dissertation had zero commercial impact for decades
• As faculty
  • I learn a lot from the people I work with
  • Try to pick topics that matter, and where I get to learn

Page 59:

What about the CPU overhead?

[Chart: 12x.]

Page 60:

Example: Key-Value Store

Receive-side scaling: core = hash(connection) % N

[Figure: clients 1-3 request keys (K=3,4; K=1,4,7; K=1,7,8) over separate connections; the NIC spreads connections across cores 1-2.]

• Lock contention
• Coherence traffic
• Poor cache utilization

Page 61:

Optimizing Reads: Key-based Steering

Match:  IF udp.port == kvs
Action: core = HASH(kvs.key) % 2
        DMA hash, kvs TO Cores[core]

• No locks needed
• No coherence traffic
• Better cache utilization

Page 62:

Where is hardware going?

Page 63:

The Example of Bitcoin

Bitcoin is a protocol for maintaining a Byzantine fault tolerant public ledger
• A cryptographically signed ledger of a sequence of transfers of digital currency, but it could be anything

Provides an incentive to participate in the fault tolerant protocol
• Proof of work: solve a cryptographic puzzle, get a reward
• Hard to monopolize

Extremely low bandwidth: a few transactions per second
• Energy cost roughly the same as Ireland's

Page 64:

Bitcoin Hardware Progression

CPU -> GPU -> FPGA -> low-tech VLSI -> high-tech VLSI
Market price: 1 petahash of SHA-256 = $0.01

Page 65:

Price and Difficulty

Page 66:

FPGAs, GPUs, and >55nm processes are out of business.