
This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), November 2–4, 2016, Savannah, GA, USA. ISBN 978-1-931971-33-1. Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

EbbRT: A Framework for Building Per-Application Library Operating Systems

Dan Schatzberg, James Cadden, Han Dong, Orran Krieger, and Jonathan Appavoo, Boston University

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/schatzberg


Abstract

General purpose operating systems sacrifice per-application performance in order to preserve generality. On the other hand, substantial effort is required to customize or construct an operating system to meet the needs of an application. This paper describes the design and implementation of the Elastic Building Block Runtime (EbbRT), a framework for building per-application library operating systems. EbbRT reduces the effort required to construct and maintain library operating systems without hindering the degree of specialization required for high performance. We combine several techniques in order to achieve this, including a distributed OS architecture, a low-overhead component model, a lightweight event-driven runtime, and many language-level primitives. EbbRT is able to simultaneously enable performance specialization, support for a broad range of applications, and ease the burden of systems development.

An EbbRT prototype demonstrates the degree of customization made possible by our framework approach. In an evaluation of memcached, EbbRT is able to attain 2.08× higher throughput than Linux. The node.js runtime, ported to EbbRT, demonstrates the broad applicability and ease of development enabled by our approach.

1 Introduction

Performance is a key concern for modern cloud applications. Even relatively small performance gains can result in significant cost savings at scale. The end of Dennard scaling and increasingly high-speed I/O devices have shifted the emphasis of performance to the CPU and, in turn, the software stack on which an application is deployed.

Existing general purpose operating systems sacrifice per-application performance in favor of generality. In response, there has been a renewed interest in library operating systems [28, 38], hardware virtualization [5, 44], and kernel bypass techniques [25, 48]. Common to these approaches is the desire to enable applications to interact with devices with minimal operating system involvement. This allows developers to customize the entire software stack to meet the needs of their application.

The problem with this approach is that it still requires significant engineering effort to implement the required system functionality. As a consequence, past systems have either targeted a narrow class of applications (e.g., packet processing) or, in order to be broadly applicable, constructed general purpose software stacks.

We believe that general purpose software stacks are subject to the same trade-off between performance and generality as existing commodity operating systems. In order to achieve high performance for a broad set of applications, we must bridge the gap between general purpose software, on which application development is easy, and the performance advantages obtained using customized, per-application software stacks, where the development burden is high.

Our work addresses this gap with the Elastic Building Block Runtime (EbbRT), a framework for constructing per-application library operating systems. EbbRT reduces the effort required to construct and maintain library operating systems without restricting the degree of specialization required for high performance. We combine several techniques in order to achieve this.

1. EbbRT is comprised of a set of components, called Elastic Building Blocks (Ebbs), that developers can extend, replace, or discard in order to construct and deploy a particular application. This enables a greater degree of customization than a general purpose system and promotes the reuse of non-performance-critical components.

2. EbbRT uses a lightweight execution environment that allows application logic to directly interact with hardware resources, such as memory and devices.

3. EbbRT applications are distributed across both specialized and general purpose operating systems. This allows functionality to be offloaded, which reduces the engineering effort required to port applications.

In this paper we describe the design and implementation of the EbbRT framework, along with several of its system components (e.g., a scalable network stack and a high-performance memory allocator). We demonstrate that library operating systems constructed using EbbRT outperform Linux on a series of compute and network intensive workloads. For example, a memcached port to EbbRT, run on a commodity hypervisor, is able to attain 2.08× greater throughput than memcached run on Linux. Additionally, we show that, with modest developer effort, EbbRT is able to support large, complex applications. In particular, node.js, a managed runtime environment, was ported in two weeks by a single developer.

The remainder of the paper is structured as follows: Section 2 outlines the objectives of EbbRT, Section 3 presents the high-level architecture and design of our framework, Section 4 describes the implementation, Section 5 presents the evaluation of EbbRT, Section 6 discusses related work, and Section 7 concludes.

2 Objectives

The following three objectives guide our design and implementation:

Performance Specialization: To achieve high performance we seek to allow applications to efficiently specialize the system at every level. To facilitate this, we provide an event-driven execution environment with minimal abstraction over the hardware; EbbRT applications can provide event handlers to directly serve hardware interrupts. Also, our Ebb component model has low enough overhead to be used throughout performance-sensitive paths, while also enabling compiler optimizations, such as aggressive inlining.

Broad Applicability: To ensure high utility, a framework should support a broad set of applications. EbbRT is designed and implemented to support the rich set of existing libraries and complex managed runtimes on which applications depend. We adopt a heterogeneous distributed architecture, called the MultiLibOS [49] model, wherein EbbRT library operating systems run alongside general purpose operating systems and offload functionality transparently. EbbRT library operating systems can be integrated with a process of the general purpose OS.

Ease of Development: We strive to make the development of application-specific systems easy. EbbRT exploits modern language techniques to simplify the task of writing new system software, while the Ebb model provides an abstraction to encapsulate existing system components. The barrier to porting existing applications is lowered through the use of function offloading between an EbbRT library OS and a general purpose OS.

Attaining all three of these objectives simultaneously is a challenging but critical step towards bridging the gap between general purpose and highly specialized software stacks.

3 System Design

This section describes the high-level design of EbbRT. In particular, the three elements of the design discussed are: 1. a heterogeneous distributed structure, 2. a modular system structure, and 3. a non-preemptive event-driven execution environment.

Figure 1: High-level EbbRT architecture. Each VM has its own cores, RAM, and NICs; a frontend OS process on a commodity OS is linked to the hosted library, backend privileged protection domains are linked to the native library OS, and an Ebb instance spans them with per-VM, per-core representatives.

3.1 Heterogeneous Distributed Structure

Our design is motivated in part by the common deployment strategies of cloud applications. Infrastructure as a Service enables a single application to be deployed across multiple machines within an isolated network. In this context, it is not necessary to run general purpose operating systems on all the machines of a deployment. Rather, an application can be deployed across a heterogeneous mix of specialized library OSs and general purpose operating systems, as illustrated in Figure 1.

To facilitate this deployment model, EbbRT is implemented as both a lightweight bootable runtime and a user-level library that can be linked into a process of a general purpose OS [49]. We refer to the bootable library OS as the native runtime and the user-level library as the hosted runtime.

The native runtime allows application software to be written directly to hardware interfaces, uninhibited by the legacy interfaces and protection mechanisms of a general purpose operating system. The native runtime sets up a single address space and basic system functionality (e.g., timers, networking, memory allocation) and invokes an application entry point, all while running at the highest privilege level. The EbbRT design depends on application isolation at the network layer, either through switch programming or virtualization, making it amenable to both virtualized and bare-metal environments.

The hosted user-space library allows EbbRT applications to integrate with legacy software. This frees the native library OSs from the burden of providing compatibility with legacy interfaces. Rather, functionality can be offloaded via communication with the hosted environment.

A common deployment of an EbbRT application consists of a hosted process and one or more native runtime instances communicating across a local network. A user is able to interact with the EbbRT application through the hosted runtime, as they would any other process of the general purpose OS, while the native runtime supports the performance-critical portion of the application.

3.2 Modular System Structure

To provide a high degree of customization, EbbRT enables application developers to modify or extend all levels of the software stack. To support this, EbbRT applications are almost entirely comprised of objects we call Elastic Building Blocks (Ebbs). As with objects in many programming languages, Ebbs encapsulate implementation details behind a well-defined interface.

Ebbs are distributed, multi-core fragmented objects [9, 39, 51], where the namespace of Ebbs is shared across both the native and hosted runtimes. Figure 1 illustrates an Ebb spanning both hosted and native runtimes of an application. An EbbRT application typically consists of multiple Ebb instances. The framework is composed of base Ebb types that a developer can use to construct an EbbRT application.

When an Ebb is invoked, a local representative handles the call. Representatives may communicate with each other to satisfy the invocation. For example, an object providing file access might have representatives on a native instance that simply function-ship requests to a hosted representative, which translates these requests into requests on the local file system. By encapsulating the distributed nature of the object, optimizations such as RDMA, caching, or using local storage would all be hidden from clients of the filesystem Ebb.

Ebb reuse is critical to easing development effort. Exploiting modularity promotes reuse and evolution of the EbbRT framework. Developers can build upon the Ebb structure to provide additional libraries of components that target specific application use cases.

3.3 Execution Model

Execution in EbbRT is non-preemptive and event-driven. In the native runtime there is one event loop per core, which dispatches both external (e.g., timer completions, device interrupts) and software generated events to registered handlers. This model is in contrast to a more standard threaded environment where preemptable threads are multiplexed across one or more cores. Our non-preemptive event-driven execution model provides a low overhead abstraction over the hardware. This allows our implementation to directly map application software to device interrupts, avoiding the typical costs of scheduling decisions or protection domain switches.

EbbRT provides an analogous environment within the hosted library by providing an event loop using underlying OS functionality such as poll or select. While the hosted environment cannot achieve the same efficiency as our native runtime, we provide a compatible environment to allow software libraries to be reused across both runtimes.

Many cloud applications are driven by external requests such as network traffic, so the event-driven programming environment provides a natural way to structure the application. Indeed, many cloud applications use a user-level library (e.g., libevent [46], libuv [34], Boost ASIO [29]) to provide such an environment.

However, asynchronous, event-driven programming may not be a good fit for all applications. To this end, we provide a simple cooperative threading model on top of events. This allows for blocking semantics and a concurrency model similar to the Go programming language. We discuss support for long running events further in Section 4.2.

The non-preemptive event execution, along with support for cooperative threading, allows the native runtime to be lightweight yet provides sufficient flexibility for a wide range of applications. Such qualities are critical in enabling performance specialization without sacrificing applicability.

4 Implementation

In this section we provide an overview of the system software and then describe details of the implementation.


Memory: PageAllocator (power-of-two physical page frame allocator); VMemAllocator (allocates virtual address space); SlabAllocator (allocates fixed-size objects); GeneralPurposeAllocator (general purpose memory allocator).

Ebb objects: EbbAllocator (allocates EbbIds); LocalIdMap (local data store for Ebb data and fault resolution); GlobalIdMap (application-wide data store for Ebb data).

Events: EventManager (creates events and manages hardware interrupts); Timer (delay-based scheduling of events).

I/O: NetworkManager (implements the TCP/IP stack); SharedPoolAllocator (allocates network ports); NodeAllocator (allocates, configures, and releases IaaS resources); Messenger (cross-node Ebb-to-Ebb communication); VirtioNet (VirtIO network device driver).

Table 1: The core Ebbs that make up EbbRT, grouped by subsystem. Each Ebb is built from a combination of EbbRT primitives (futures, lambdas, IOBufs) and external libraries (the C++ standard library, Boost, Intel TBB, Cap'n Proto); some Ebbs have a multi-core implementation (one representative per core) while the others use a single shared representative.

4.1 Software Structure Overview

EbbRT is comprised of an x86_64 library OS and toolchain as well as a Linux userspace library. Both runtimes are written predominantly in C++14, totaling 14,577 lines of new code [59]. The native library is packaged with a GNU toolchain (gcc, binutils, libstdc++) and libc (newlib) modified to support an x86_64-ebbrt build target. Application code compiled with the toolchain will produce a bootable ELF binary linked with the library OS. We provide C and C++ standard library implementations, which make it straightforward to use many third party software libraries, as shown in Table 1. The support and use of standard software libraries to implement system-level functionality makes it much easier for library and application developers to understand and modify system-level Ebbs.

We chose not to strive for complete Linux or POSIX compatibility. We feel that enforcing compatibility with existing OS interfaces would be restrictive and, given the function offloading enabled by our heterogeneous distributed structure, unnecessary. Rather, we provide minimalist interfaces above the hardware, which allows for a broad set of software to be developed on top.

EbbRT provides the necessary functionality for events to execute and Ebbs to be constructed and used. This entails functionality such as memory management, networking, timers, and I/O. This functionality is provided by the core system Ebbs shown in Table 1.

4.2 Events

Both the hosted and native environments provide an event-driven execution model. Within the hosted environment we use the Boost ASIO library [29] in order to interface with the system APIs. Within the native environment, our event-driven API is implemented directly on top of the hardware interfaces. Here, we focus our description on the implementation of events within the native environment.

When the native environment boots, an event loop per core is initialized. Drivers can allocate a hardware interrupt from the EventManager and then bind a handler to that interrupt. When an event completes and the next hardware interrupt fires, a corresponding exception handler is invoked. Each exception handler execution begins on the top frame of a per-core stack. The exception handler checks for an event handler bound to the corresponding interrupt and then invokes it. When the event handler returns, interrupts are enabled and more events can be processed. Therefore events are non-preemptive and typically generated by a hardware interrupt.

Applications can invoke synthetic events on any core in the system. The Spawn method of the EventManager receives an event handler, which is later invoked from the event loop. Events invoked with Spawn are only executed once. A handler for a recurring event can be installed as an IdleHandler.

In order to prevent interrupt starvation, when an event completes the EventManager: 1. enables then disables interrupts, handling any pending interrupts, 2. dispatches a single synthetic event, 3. invokes all IdleHandlers, and then 4. enables interrupts and halts. If any of these steps results in an event handler being invoked, then the process starts again at the beginning. This way, hardware interrupts and synthetic events are given priority over repeatedly invoked IdleHandlers.
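A minimal sketch of this dispatch order is shown below. It is illustrative only: the class and method names are hypothetical stand-ins for the EventManager internals, and the interrupt operations are stubbed out.

#include <deque>
#include <functional>
#include <vector>

// Illustrative per-core dispatch loop; names and interrupt primitives are
// hypothetical stand-ins, not EbbRT's actual implementation.
class EventLoop {
 public:
  void Spawn(std::function<void()> f) { synthetic_.push_back(std::move(f)); }
  void AddIdleHandler(std::function<void()> f) { idle_.push_back(std::move(f)); }

  void Run() {
    for (;;) {
      bool handler_ran = false;
      // 1. Enable then disable interrupts, handling any pending interrupts.
      handler_ran |= DrainPendingInterrupts();
      // 2. Dispatch a single synthetic event (installed via Spawn).
      if (!synthetic_.empty()) {
        auto f = std::move(synthetic_.front());
        synthetic_.pop_front();
        f();
        handler_ran = true;
      }
      // 3. Invoke all IdleHandlers (e.g., a driver polling its device).
      for (auto& h : idle_) h();
      handler_ran |= !idle_.empty();
      // 4. If any handler was invoked, restart from step 1 so that hardware
      //    interrupts and synthetic events keep priority over repeatedly
      //    invoked idle handlers; otherwise enable interrupts and halt.
      if (!handler_ran) EnableInterruptsAndHalt();  // sti; hlt on x86
    }
  }

 private:
  bool DrainPendingInterrupts() { return false; }  // stub for the sketch
  void EnableInterruptsAndHalt() {}                // stub for the sketch
  std::deque<std::function<void()>> synthetic_;
  std::vector<std::function<void()>> idle_;
};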

While our EventManager implementation is simple, it provides sufficient functionality to achieve interesting dynamic behavior. For example, our network card driver implements adaptive polling in the following way: an interrupt is allocated from the EventManager and the device is programmed to fire that interrupt when packets are received. The event handler will process each received packet to completion before returning control to the event loop. If the interrupt rate exceeds a configurable threshold, the driver disables the receive interrupt and installs an IdleHandler to process received packets. The EventManager will repeatedly call the idle handler from within the event loop, effectively polling the device for more data. When the packet arrival rate drops below a configurable threshold, the driver re-enables the receive interrupt and disables the idle handler, returning to interrupt-driven packet processing.
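The driver logic reduces to a small state machine. The sketch below is hypothetical: the names and thresholds are ours, and it approximates the interrupt-rate check with a per-event packet count.

#include <cstddef>

// Illustrative adaptive-polling state machine for a receive path.
class RxPath {
 public:
  void OnReceiveInterrupt() {
    std::size_t pkts = ProcessReceivedPackets();  // run each packet to completion
    if (pkts > kPollThreshold) {
      DisableReceiveInterrupt();
      polling_ = true;  // from here the event loop calls Poll() as an IdleHandler
    }
  }

  void Poll() {  // installed as an IdleHandler while polling_ is true
    std::size_t pkts = ProcessReceivedPackets();
    if (pkts < kInterruptThreshold) {
      polling_ = false;  // arrival rate dropped: uninstall the idle handler
      EnableReceiveInterrupt();
    }
  }

 private:
  static constexpr std::size_t kPollThreshold = 64;  // illustrative values
  static constexpr std::size_t kInterruptThreshold = 4;
  std::size_t ProcessReceivedPackets() { return 0; }  // stubs for the sketch
  void EnableReceiveInterrupt() {}
  void DisableReceiveInterrupt() {}
  bool polling_ = false;
};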

Given our desire to enable reuse of existing software, we adopt a cooperative threading model which allows events to explicitly save and restore control state (e.g., stack and volatile register state). At the point where the block would occur, the current event saves its state and relinquishes control back to the event loop, where the processing of pending events is resumed. The original event state can be restored, and its execution resumed, when the asynchronous work completes. The save and restore event mechanisms enable explicit cooperative scheduling between events, facilitating familiar blocking semantics. This has allowed us to quickly port software libraries that require blocking system calls.

A limitation of non-preemptive execution is the difficulty of mapping long-running threads with no I/O to an event-driven model. If the processor is not yielded periodically, event starvation can occur. At present we do not provide a completely satisfactory solution. Building a preemptive scheduler on top of events would be possible, though we fear it would fragment the set of Ebbs into those that depend on non-preemptive execution and those that don't. Alternatively, we have discussed dedicating processors to executing these long-running threads and therefore avoiding any starvation issues, similar to IX [5]. Nonetheless, we have not run into this problem in practice; most cloud applications rely heavily on I/O, and concern for starvation is reduced as we only support the execution of a single process.

4.3 Elastic Building Blocks

Nearly all software in EbbRT is written as elastic building blocks, which encapsulate both the data and function of a software component. Ebbs hide from clients the distributed or parallel nature of objects and can be extended or replaced for customization. An Ebb provides an interface using a standard C++ class definition. Every instance of an Ebb has a system-wide unique EbbId (32 bits in our current implementation). Software invokes the Ebb by converting the EbbId into an EbbRef, which can be dereferenced to a per-core representative: a reference to an instance of the underlying C++ class. We use C++ templates to implement the EbbRef generically for all Ebb classes.

Ebbs may be invoked on any machine or core within the application. Therefore, it is necessary for initialization of the per-core representatives to happen on demand, to mitigate initialization overheads for short-lived Ebbs. An EbbId provides an offset into a virtual memory region, backed by distinct per-core pages, which holds a pointer to the per-core representative (or NULL if it does not exist). When a function is called on an EbbRef, it checks the per-core representative pointer; in the common case where it is non-null, it is dereferenced and the call is made on the per-core representative. If the pointer is null, then a type specific fault handler is invoked, which must return a reference to a representative to be called or throw a language-level exception. Typically, a fault handler will construct a representative and store it in the per-core virtual memory region so future invocations will take the fast path. Our hosted implementation of Ebb dereferencing uses per-thread hash tables to store representative pointers.
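The fast path can be expressed in a few lines of C++. The sketch below is a simplification: a flat per-core array stands in for the per-core virtual memory region, and T::HandleFault is a hypothetical hook for the type-specific miss handler; EbbRT's actual EbbRef differs in detail.

#include <cstddef>
#include <cstdint>

using EbbId = uint32_t;

// Simplified EbbRef: a per-core representative cache with an on-demand
// fault path.
template <typename T>
class EbbRef {
 public:
  constexpr explicit EbbRef(EbbId id) : id_(id) {}

  T* operator->() const {
    T*& rep = RepSlot(id_);
    if (rep == nullptr)            // one predictable branch on the fast path
      rep = &T::HandleFault(id_);  // slow path: construct and cache a rep
    return rep;                    // static dispatch: calls can be inlined
  }

 private:
  static constexpr std::size_t kMaxEbbs = 1 << 16;
  static T*& RepSlot(EbbId id) {
    // In EbbRT this slot lives in virtual memory backed by distinct
    // per-core pages, so the same address resolves to a different pointer
    // on each core; a single array suffices for the sketch.
    static T* reps[kMaxEbbs];
    return reps[id];
  }
  EbbId id_;
};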

The construction of a representative may require communication with other representatives, either within the machine or on other machines. EbbRT provides core Ebbs that support distributed data storage and messaging services. These facilities span and enable communication between the EbbRT native and hosted instances and utilize network communication as needed.

Ebb modularity is both flexible and efficient, making Ebbs suitable for high-performance components. Previous systems providing a partitioned object model either used relatively heavyweight invocation across a distributed system [56] or more efficient techniques constrained to a shared memory system [17, 30]. Ebbs are unique in their ability to accommodate both use cases. The fast-path cost of an Ebb invocation is one predictable conditional branch and one unconditional branch more than a normal C++ object dereference. Additionally, our use of static dispatch (EbbRefs are templated by the representative's type) enables compiler optimizations such as function inlining.

We intentionally avoided using interface definition languages such as COM [60], CORBA [57], or Protocol Buffers [21]. Our concern was that these often require serialization and deserialization at all interface boundaries, which would promote much coarser grained objects than we desire. Our ability to use native C++ interfaces allows Ebbs to pass complex data structures amongst each other. This also necessitates that all Ebb invocations be local. Ebb representative communication is encapsulated and may internally serialize data structures as needed.

Figure 2: Memory management Ebbs. malloc() and free() calls enter the GeneralPurposeAllocator, which is backed by SlabAllocators and, below them, the PageAllocator (identity-mapped virtual memory over physical memory) and the VMemAllocator (user-allocatable virtual memory). Small allocations are served from identity-mapped memory; large allocations reserve virtual memory and map physical pages.

4.4 Memory Management

Memory allocation is a performance critical facet of many cloud applications, and our focus on short-lived events puts increased pressure on the memory allocator to perform well. Here we present our default native memory allocator and highlight aspects of it which demonstrate the synergy of the elements in the EbbRT design. Figure 2 illustrates EbbRT's memory management infrastructure and the virtual and physical address space relationships.

The EbbRT memory allocation subsystem is similar to that of the Linux kernel. The lowest-level allocator is the PageAllocator, which allocates power-of-two sized pages of memory. Our default PageAllocator implementation uses buddy allocators, one per NUMA node. On top of the PageAllocator are the SlabAllocator Ebbs, which are used to allocate fixed size objects. Our default SlabAllocator implementation uses per-core and per-NUMA node representatives to store object free-lists and partially allocated pages. This design is based on the Linux kernel's SLQB allocator [12]. The GeneralPurposeAllocator, invoked via malloc, is implemented using multiple SlabAllocator instances, each responsible for allocating objects of a different fixed size. To serve a request, the GeneralPurposeAllocator invokes the SlabAllocator with the closest size greater than or equal to the requested size. For requests exceeding the largest SlabAllocator size, the GeneralPurposeAllocator will allocate a virtual memory region mapped in from the VMemAllocator and backed by pages mapped in from the PageAllocator.

By defining the memory allocators as Ebbs, we allow any one of the components to be replaced or modified without impacting the others. In addition, because our implementation uses C++ templates for static dispatch, the compiler is able to optimize calls across Ebb interfaces. For example, calls to malloc that pass a size known at compile time are optimized to directly invoke the correct SlabAllocator within the GeneralPurposeAllocator.
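The sketch below illustrates the effect with hypothetical names (SlabIndexFor, kMaxSlabSize, and AllocLarge are ours, not EbbRT's): because the size-class computation is constexpr and the call chain is statically dispatched, a constant-size malloc collapses to a direct call into a single slab allocator.

#include <cstddef>

// Hypothetical stand-ins for the allocator pieces described above.
struct SlabAllocator { void* Alloc() { return nullptr; } };  // stub
constexpr std::size_t kNumSlabClasses = 10;
constexpr std::size_t kMaxSlabSize = 8u << (kNumSlabClasses - 1);  // 4096
SlabAllocator slabs[kNumSlabClasses];
void* AllocLarge(std::size_t) { return nullptr; }  // stub: VMemAllocator path

// Map a request to the smallest size class >= size (8, 16, 32, ...).
constexpr std::size_t SlabIndexFor(std::size_t size) {
  std::size_t index = 0;
  for (std::size_t cls = 8; cls < size; cls <<= 1) ++index;
  return index;
}

inline void* Malloc(std::size_t size) {
  if (size <= kMaxSlabSize)
    return slabs[SlabIndexFor(size)].Alloc();
  return AllocLarge(size);  // reserve a virtual region, map physical pages
}

// For a call such as Malloc(sizeof(SomeType)), SlabIndexFor is evaluated at
// compile time, so the compiler can inline Malloc down to a free-list pop
// from the one slab allocator that will ever serve the call.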

A key property of memory allocations in EbbRT is that most allocations are serviced from identity mapped physical memory. This applies to all allocations made by the GeneralPurposeAllocator that do not exceed the largest SlabAllocator size (virtual mappings are used for larger allocations). Identity mapped memory allows application software to perform zero-copy I/O with standard allocations, rather than needing to allocate memory specifically for DMA.

Another benefit of the EbbRT design is that, due to the lack of preemption, most allocations can be serviced from a per-core cache without any synchronization. Avoiding atomic operations is so important that high performance allocators like TCMalloc [19] and jemalloc [14] use per-thread caches to do so. These allocators then require complicated algorithms to balance the caching across a potentially dynamic set of threads. In contrast, the number of cores is typically static and generally not too large, simplifying EbbRT's balancing algorithm.

While a portion of the virtual address space is reserved to identity map physical memory, and some virtual memory is used to provide per-core regions for Ebb invocation, the vast majority of the virtual address space is available for application use. Applications can allocate virtual regions by invoking the VMemAllocator and passing in a handler to be invoked on faults to the allocated region. This allows applications to implement arbitrary paging policies.

 1 // Sends out an IPv4 packet over Ethernet
 2 Future<void> EthIpv4Send(uint16_t eth_proto, const Ipv4Header& ip_hdr, IOBuf buf) {
 3   Ipv4Address local_dest = Route(ip_hdr.dst);
 4   Future<EthAddr> future_macaddr = ArpFind(local_dest); /* asynchronous call */
 5   return future_macaddr.Then(
 6       // continuation is passed in as an argument
 7       [buf = move(buf), eth_proto](Future<EthAddr> f) { /* lambda definition */
 8         auto& eth_hdr = buf->Get<EthernetHeader>();
 9         eth_hdr.dst = f.Get();
10         eth_hdr.src = Address();
11         eth_hdr.type = htons(eth_proto);
12         Send(move(buf));
13       }); /* end of Then() call */
14 }

Figure 3: Network code path to route and send an Ethernet frame.

Our memory management demonstrates some of the advantages provided by EbbRT's design. First, we use Ebbs to create per-core representatives for multi-core scalability and to provide encapsulation that enables the different allocators to be replaced. Second, the lack of preemption enables us to use the per-core representatives without synchronization. Third, the library OS design enables tighter collaboration between system components and application components, as exemplified by the application's ability to directly manage virtual memory regions and achieve zero-copy interactions with device code.

4.5 Lambdas and Futures

One of the core objectives of our design is mitigating complexity to ease development. Critics of event-driven programming point out several properties which place increased burden on the developer.

One concern is that event-driven programming tends to obfuscate the control flow of the application [58]. For example, a call path that requires the completion of an asynchronous event will often pass along a callback function to be invoked when the event completes. The callback is invoked within a context different from that of the original call path, so it falls on the programmer to construct continuations, i.e., control mechanisms used to save and restore state across invocations. C++ has recently added support for anonymous inline functions called lambdas. Lambdas can capture local state that can be referred to when the lambda is invoked. This removes the burden of manually saving and restoring state, and makes code easier to follow. We use lambdas in EbbRT to alleviate the burden of constructing continuations.

Another concern with event-driven programming is that error handling is much more complicated. The predominant mechanism for error handling in C++ is exceptions. When an error is encountered, an exception is thrown and the stack unwound to the most recent try/catch block, which will handle the error. Because event-driven programming splits one logical flow of control across multiple stacks, exceptions must be handled at every event boundary. This puts the burden on the developer to catch exceptions at additional points in the code and either handle them or forward them to an error handling callback.

Our solution to these problems is our implementation of monadic futures. Futures are a data type for asynchronously produced values, originally developed for use in the construction of distributed systems [37]. Figure 3 illustrates a code path in the EbbRT network stack that utilizes lambdas and futures to route and send an Ethernet frame. The ArpFind function (line 4) translates an IP address to the corresponding MAC address, either through a lookup in the ARP cache or by sending out an ARP request to be processed asynchronously. In either case, ArpFind returns a Future<EthAddr>, which represents the future result of the ARP translation. A future cannot be directly operated on. Instead, a lambda can be applied to it using the Then method (line 5). This lambda is invoked once the future is fulfilled. When invoked, the lambda receives the fulfilled future as a parameter and can use the Get method to retrieve its underlying value (line 9). In the event that the future is fulfilled before the Then method is invoked (for example, if ArpFind retrieves the translation directly from the ARP cache), the lambda is invoked synchronously.

The Then method of a future returns a new future representing the value to be returned by the applied function, hence the term monadic. This allows other software components to chain further functions to be invoked on completion. In this example, the EthIpv4Send method returns a Future<void>, which merely represents the completion of some action and provides no data.

Futures also aid in error processing. Each time Get is invoked, the future may throw an exception representing a failure to produce the value. If not explicitly handled, the future returned by Then will hold this exception instead of a value. The only invocation of Then that must handle the error is the final one; any intermediate exceptions will naturally flow to the first function which attempts to catch the exception. This behavior mirrors the behavior of exceptions in synchronous code. In this example, any error in ARP resolution will be propagated to the future returned by EthIpv4Send and handled by higher-level code.

C++ has an implementation of futures in the standard library. Unlike our implementation, it provides no Then method, which is necessary for chaining callbacks. Instead, users are expected to block on a future (using Get). Other languages such as C# and JavaScript provide monadic futures similar to ours.
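To make the chaining mechanics concrete, the sketch below shows a greatly simplified, single-threaded monadic future. It is not EbbRT's implementation: it passes the produced value directly to the continuation rather than the fulfilled future, and it omits the exception plumbing and the synchronization needed when fulfillment and chaining race across cores.

#include <functional>
#include <memory>
#include <utility>

// Minimal single-threaded monadic future (illustrative only).
template <typename T>
class Future {
  struct State {
    bool ready = false;
    T value{};
    std::function<void(T)> callback;
    void Fulfill(T v) {
      if (callback) callback(std::move(v));         // a continuation waits
      else { value = std::move(v); ready = true; }  // store until Then()
    }
  };
  std::shared_ptr<State> state_ = std::make_shared<State>();
  template <typename> friend class Future;

 public:
  void Fulfill(T v) { state_->Fulfill(std::move(v)); }

  // Chain f to run on the value; returns a future for f's result (monadic).
  template <typename F, typename R = decltype(std::declval<F>()(std::declval<T>()))>
  Future<R> Then(F f) {
    Future<R> next;
    auto next_state = next.state_;
    auto run = [f = std::move(f), next_state](T v) mutable {
      next_state->Fulfill(f(std::move(v)));
    };
    if (state_->ready) run(std::move(state_->value));  // already fulfilled: run now
    else state_->callback = std::move(run);            // otherwise run on Fulfill()
    return next;
  }
};

// Usage: Future<int> f; f.Then([](int x) { return x + 1; }); f.Fulfill(41);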

As seen in Table 1, futures are used pervasively in interface definitions for Ebbs, and lambdas are used in place of more manual continuation construction. Our experience using lambdas and futures has been positive. Initially, some members of our group had reservations about using these unfamiliar primitives, as they hide a fair amount of potentially performance sensitive behavior. As we have gained more experience with these primitives, it has become clear that the behavior they encapsulate is common to many cases. Futures in particular encapsulate sometimes subtle synchronization code around installing a callback and providing a value (potentially concurrently). While this code has not been without bugs, we have more confidence in its correctness based on its use across EbbRT.

4.6 Network Stack

We originally looked into porting an existing network stack to EbbRT. However, we eventually implemented a new network stack for the native environment, providing IPv4, UDP/TCP, and DHCP functionality, in order to provide an event-driven interface to applications, minimize multi-core synchronization, and enable pervasive zero-copy. The network stack does not provide a standard BSD socket interface, but rather enables tighter integration with the application to manage the resources of a network connection.

During the development of EbbRT we found it necessary to create a common primitive for managing data that could be received from or sent to hardware devices. To support the development of zero-copy software, we created the IOBuf primitive. An IOBuf is a descriptor which manages ownership of a region of memory, as well as a view of a portion of that memory. Rather than having applications explicitly invoke read with a buffer to be populated, they install a handler which is passed an IOBuf containing network data for their connection. This IOBuf is passed synchronously from the device driver through the network stack. The network stack does not provide any buffering; it will invoke the application as long as data arrives. Likewise, the interface to send data accepts a chain of IOBufs which can use scatter/gather interfaces.
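A simplified picture of the descriptor is sketched below; it is illustrative only. EbbRT's IOBuf additionally supports chaining for scatter/gather I/O, and Get<T>() here mirrors the typed header access used in Figure 3.

#include <cstddef>
#include <cstdint>
#include <memory>

// Simplified IOBuf: unique ownership of a buffer plus a mutable view
// (offset and length) into it.
class IOBuf {
 public:
  IOBuf(std::unique_ptr<uint8_t[]> data, std::size_t capacity)
      : data_(std::move(data)), capacity_(capacity), offset_(0),
        length_(capacity) {}

  const uint8_t* Data() const { return data_.get() + offset_; }
  std::size_t Length() const { return length_; }

  // Shrink the view from the front, e.g. after a protocol layer has
  // consumed its header, so the next layer sees only its own payload.
  void Advance(std::size_t n) { offset_ += n; length_ -= n; }

  // Typed access to the front of the view, as in buf->Get<EthernetHeader>().
  template <typename T>
  T& Get() { return *reinterpret_cast<T*>(data_.get() + offset_); }

 private:
  std::unique_ptr<uint8_t[]> data_;  // ownership of the underlying region
  std::size_t capacity_;
  std::size_t offset_;  // view start
  std::size_t length_;  // view length
};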

Most systems have fixed size buffers in the kernel which are used to pace connections (e.g., manage TCP window size, cause UDP drops). In contrast, EbbRT allows the application to directly manage its own buffering. In the case of UDP, an overwhelmed application may have to drop datagrams. For a TCP connection, an application can explicitly set the window size to prevent further sends from the remote host. Applications must also check that outgoing TCP data fits within the currently advertised sender window before telling the network stack to send it, or buffer it otherwise. This allows the application to decide whether or not to delay sending to aggregate multiple sends into a single TCP segment. Other systems typically accomplish this using Nagle's algorithm, which is often associated with poor latency [41]. An advantage of EbbRT's approach to networking is the degree to which an application can tune the behavior of its connections at runtime. We provide default behaviors which can be inherited from for those applications which do not require this degree of customization.

One challenge with high-performance networking is the need to synchronize when accessing connection state [47]. EbbRT stores connection state in an RCU [40] hash table, which allows common connection lookup operations to proceed without any atomic operations. Due to the event-driven execution model of EbbRT, RCU is a natural primitive to provide. Because we lack preemption, entering and exiting RCU critical sections are free. Connection state is only manipulated on a single core, which is chosen by the application when the connection is established. Therefore, common case network operations require no synchronization.
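The sketch below illustrates why the read side is free; it uses hypothetical names and a single published pointer rather than a hash table. With run-to-completion events, a reader needs no lock or reference count, and reclamation only has to wait until every core has returned to its event loop.

#include <atomic>
#include <functional>

struct Connection { /* per-connection TCP state */ };

// Hypothetical: runs reclaim after every core has passed back through its
// event loop (a grace period). Stubbed here only to keep the sketch
// self-contained; running it immediately is not real RCU behavior.
void CallRcu(std::function<void()> reclaim) { reclaim(); }

std::atomic<Connection*> published{nullptr};  // one entry, for brevity

Connection* Lookup() {
  // Read side: no lock or unlock instructions at all. The current event
  // runs to completion, so the loaded pointer stays valid until it returns.
  return published.load(std::memory_order_acquire);
}

void Replace(Connection* next) {
  // Write side: atomically publish the new version, then defer freeing the
  // old one until after a grace period.
  Connection* old = published.exchange(next, std::memory_order_acq_rel);
  if (old != nullptr) CallRcu([old] { delete old; });
}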

The EbbRT network stack is an example of the degree of performance specialization our design enables. By involving the application in network resource management, the networking stack avoids significant complexity. Historically, network stack buffering and queuing has been a significant factor in network performance. EbbRT's design does not solve these problems, but instead enables applications to more directly control these properties and customize the system to their characteristics. The zero-copy optimization illustrates the value of having all physical memory identity mapped, unpaged, and within a single address space.


5 Evaluation

Through evaluating EbbRT we aim to affirm that our implementation fulfills the following three objectives discussed in Section 2: 1. supports high-performance specialization, 2. provides support for a broad set of applications, and 3. simplifies the development of application-specific systems software.

We run our evaluations on a cluster of servers connected via a 10GbE network and commodity switch. Each machine contains two 6-core Xeon E5-2630L processors (run at 2.4 GHz), 120 GB of RAM, and an Intel X520 network card (82599 chipset). The machines have been configured to disable Turbo Boost, hyper-threads, and dynamic frequency scaling. Additionally, we disable IRQ balancing and explicitly assign NIC IRQ affinity. For the evaluation, we pin each application thread to a dedicated physical core.

Each machine boots Ubuntu 14.04 (trusty) with Linux kernel version 3.13. The EbbRT native library OSs are run as virtual machines, which are deployed using QEMU (2.5.0) and the KVM kernel module. In addition, the VMs use a virtio-net paravirtualized network card with support of the vhost kernel module. We enable multiqueue receive flow steering for multicore experiments. Unless otherwise stated, all Linux applications are run within a similarly configured VM and on the same OS and kernel version as the host.

The evaluations are broken down as follows: 1. micro-benchmarks designed to quantify the base overheads of the primitives in our native environment, and 2. macro-benchmarks that exercise EbbRT in the context of real applications. While the EbbRT hosted library is a primary component of our design, it is not intended for high performance, but rather to facilitate the integration of functionality between a general purpose OS process and native instances of EbbRT. Therefore, we focus our evaluation on the EbbRT native environment.

5.1 Microbenchmarks

The first micro-benchmark evaluates the memory allocator and aims to establish that the overheads of our Ebb mechanism do not preclude the construction of high-performance components. The second set of micro-benchmarks evaluates the latencies and throughput of our network stack and exercises several of the system features we've discussed, including idle event processing, lambdas, and the IOBuf mechanism.

5.1.1 Memory Allocation

In K42 [30], we did not define its memory allocator as a fragmented object because the invocation overheads (e.g., virtual function dispatch) were thought to be too expensive. A goal for the design of our Ebb mechanism is to provide near-zero overhead so that all components of the system can be defined as Ebbs.

The costs of managing memory is critical to the overallperformance of an application. Indeed, custom memoryallocators have shown substantial improvements in appli-cation performance [7]. We’ve ported threadtest from theHoard [6] benchmark suite to EbbRT in order to comparethe performance of the default EbbRT memory allocatorto that of the glibc 2.2.5 and jemalloc 4.2.1 allocators.

Figure 4: Hoard threadtest: billions of cycles to complete the workload versus thread count (1–8) for EbbRT, glibc, and jemalloc. I.) N=100,000, i=1,000; II.) N=100, i=1,000,000.

In threadtest, each thread t allocates and frees N_t 8-byte objects. This task is repeated for i iterations. Figure 4 shows the cycles required to complete the workload across varying numbers of threads. We run threadtest in two configurations. In configuration I., the number of objects, N, is large, while the number of iterations is small. In configuration II., the number of objects is smaller and the iteration count is increased. The total number of memory operations is the same across both configurations.

In the figure we see EbbRT's memory allocator scales competitively with the production allocators. Our scalability advantage is in part due to the locality enabled by the per-core Ebb representatives of the memory allocator and our lack of preemption, which remove any synchronization requirements between representatives. The jemalloc allocator achieves similar scalability benefits by avoiding synchronization through the use of per-thread caches.

This comparison is not intended to establish the EbbRT memory allocator to be the best in all situations, nor is it an exhaustive memory allocator study. Rather, we aim to demonstrate that the overheads of the Ebb mechanism do not preclude us from the construction of high-performance components.


5.1.2 Network Stack

To evaluate the performance of our network stack we ported the NetPIPE [52] and iPerf [54] benchmarks to EbbRT. NetPIPE is a popular ping-pong benchmark where a client sends a fixed-size message to the server, which is then echoed back after being completely received. In the iPerf benchmark, a client opens a TCP stream and sends fixed-size messages which the server receives and discards. With small message sizes, the NetPIPE benchmark illustrates the latency of sending and receiving data over TCP. The iPerf benchmark confirms that our run-to-completion network stack doesn't preclude high throughput applications. An EbbRT iPerf server was shown to saturate our 10GbE network with a stream of 1 kB messages.

Figure 5 shows NetPIPE goodput achieved as a function of message size. Two EbbRT servers achieve a one-way latency of 24.53 µs for 64 B message sizes and are able to attain 4 Gbps of goodput with messages as small as 100 kB. In contrast, two Linux VMs achieve a one-way latency of 34.27 µs for 64 B message sizes and require 200 kB sized messages to achieve equivalent goodput.

Figure 5: NetPIPE goodput (Mbps) as a function of message size (B) for EbbRT and a Linux VM. Inset shows small message sizes.

With small messages, both systems suffer some additional latency due to hypervisor processing involved in implementing the paravirtualized NIC. However, EbbRT's short path from (virtual) hardware to application achieves a 40% improvement in latency with NetPIPE. This result illustrates the benefits of a non-preemptive event-driven execution model and zero-copy instruction path. With large messages, both systems must suffer a copy on packet reception due to the hypervisor, but EbbRT does no further copies, whereas Linux must copy to user-space and then again on transmission.

Figure 6: Memcached single-core performance: 99th percentile latency (µs) versus throughput (requests per second) for OSv, Linux VM, Linux, and EbbRT.

This explains the difference in NetPIPE goodput before the network becomes the bottleneck.

5.2 Memcached

We evaluate memcached [15], an in-memory key-value store that has become a common benchmark in the examination and optimization of networked systems. Previous work has shown that memcached incurs significant OS overhead [27], and hence it is a natural target for OS customization. Rather than port the existing memcached and associated event-driven libraries to EbbRT, we re-implemented memcached, writing it directly to the EbbRT interfaces.

Our memcached implementation is a multi-core application that supports the standard memcached binary protocol. In our implementation, TCP data is received synchronously from the network card and passed up to the application. The application parses the client request and constructs a reply, which is sent out synchronously. The entire execution path, up to the application and back again, is run without preemption. Key-value pairs are stored in an RCU hash table to alleviate lock contention, a common cause of poor scalability in memcached. Our implementation of memcached totals 361 lines of code. We lack some features of the standard memcached (namely authentication and some per-key commands such as queue operations), but are otherwise protocol compatible. Functionality support has been added incrementally as needed by our workloads.

We compare our EbbRT implementation of memcached, run within a VM, to the standard implementation (v1.4.22) run within a Linux VM, and as a Linux process run natively on our machine. We use the mutilate [31] benchmarking tool to place a particular load on the server and measure response latency.

            Requests/sec   Inst/cycle   Inst/request   LLC ref/cycle   I-cache miss/cycle
EbbRT       379,387        0.81         5,557          0.0081          0.0079
Linux VM    137,194        0.71         13,604         0.0098          0.0339

Table 2: Memcached CPU-efficiency metrics

Figure 7: Memcached multicore performance: 99th percentile latency (µs) versus throughput (requests per second) for Linux VM, Linux (thread), Linux (process), and EbbRT.

We configure mutilate to generate load representative of the Facebook ETC workload [2], which has 20 B–70 B keys and most values sized between 1 B and 1024 B. All requests are issued as separate memcached requests (no multiget) over TCP. The client is configured to pipeline up to four requests per TCP connection. We dedicate 7 machines to act as load-generating clients, for a total of 664 connections per server.

Figure 6 presents the 99th percentile latency as a function of throughput for single core memcached servers. At a 500 µs 99th percentile Service Level Agreement (SLA), single core EbbRT is able to attain 1.88× higher throughput than Linux within a VM. EbbRT outperforms Linux running natively by 1.15×, even with the hypervisor overheads incurred. Additionally, we evaluated the performance of OSv [28], a general purpose library OS that similarly targets cloud applications run in a virtualized environment. OSv differs from EbbRT by providing a Linux ABI compatible environment, rather than supporting a high degree of specialization. We found that the performance of memcached on OSv was not competitive with either Linux or EbbRT with a single core. Additionally, OSv's performance degrades when scaled up to six cores (omitted from Figure 7) due to a lack of multiqueue support in their virtio-net device driver.

Figure 7 presents the evaluation of memcached running across six cores. At a 500 µs 99th percentile SLA, six core EbbRT is able to attain 2.08× higher throughput than Linux within a VM and 1.50× higher than Linux native. To eliminate the performance impact of application-level contention, we also evaluated memcached run natively as six separate processes, rather than a single multithreaded process ("Linux (process)" in Figure 7). EbbRT outperforms the multiprocess memcached by 1.30× at a 500 µs 99th percentile SLA.

To gain insight into the source of EbbRT's performance advantages, we examine the CPU efficiency of the memcached servers. We use the Linux kernel perf utility to gather data across a 10 second duration of a fully-loaded single core memcached server run within a VM. Table 2 presents these statistics. We see that the EbbRT server is processing requests at 2.75× the rate of Linux. This can be largely attributed to our shorter non-preemptive instruction path for processing requests. Observe that the Linux rate of instructions per request is 2.44× that of EbbRT. The instructions per cycle rate in EbbRT, a 12.6% increase over Linux, shows that we are running more efficiently overall. This can again be observed through our decreased per-cycle rates of last level cache (LLC) references and icache misses, which, on Linux, increase by 1.21× and 4.27×, respectively.

The above efficiency results suggest that our performance advantages are largely achieved through the construction of specialized system software that takes advantage of properties of the memcached workload. We illustrate this in greater detail by examining the per-request latency for EbbRT and Linux (native), broken down into time spent processing network ingress, application logic, and network egress. For Linux, we used the perf tool to gather stacktrace samples over 30 seconds of a fully loaded, single core memcached instance and categorized each trace. For EbbRT, we instrumented the source code with timestamp counters. Table 3 presents this result. It should be noted that, for Linux, the "Application" category includes time spent scheduling, context switching, and handling event notification (e.g., epoll). The latency breakdown demonstrates that the performance advantage comes from specialization across the entire software stack, and not just one component.

By writing to our interfaces, memcached is implemented to directly handle memory filled by the device, and can likewise send replies without copying. A request is handled synchronously from the device driver without preemption, which enables a significant performance advantage. EbbRT primitives, such as IOBufs and RCU data structures, are used throughout the application to simplify the development of the zero-copy, lock-free code.

         Ingress    Application   Egress     Total
EbbRT    0.89 µs    0.86 µs       0.83 µs    2.59 µs
Linux    1.05 µs    1.30 µs       1.46 µs    3.81 µs

Table 3: Memcached per-request latency

In the past, significant effort has gone into improving the performance of memcached and similar key-value stores. However, many of these optimizations require client modifications [35, 43] or the use of custom hardware [26, 36]. By writing memcached as an EbbRT application, we are able to achieve significant performance improvements while maintaining compatibility with standard clients, protocols, and hardware.

5.3 Node.js

It is often the case that specialized systems can demonstrate high performance for a particular workload, such as packet processing, but fail to provide similar benefits to more full-featured applications. A key objective of EbbRT is to provide an efficient base set of primitives on top of which a broad set of applications can be constructed.

We evaluate node.js, a popular JavaScript execution environment for server-side applications. In comparison to memcached, node.js uses many more features of an operating system, including virtual memory mapping, file I/O, periodic timers, etc. Node.js links with several C/C++ libraries to provide its event-driven environment. In particular, the two libraries which involved the most effort to port were V8 [23], Google's JavaScript engine, and libuv [34], which abstracts OS functionality and callback based event-driven execution.

Porting V8 was relatively straightforward, as EbbRT supports the C++ standard library, on which V8 depends. Additional OS functionality required, such as clocks, timers, and virtual memory, is provided by the core Ebbs of the system. Porting libuv required significantly more effort, as there are over 100 functions of the libuv interface which require OS specific implementations. In the end, our approach enables the libuv callbacks to be invoked directly from a hardware interrupt, in the same way that our memcached implementation receives incoming requests.

The effort to port node.js was significantly simplified by exploiting EbbRT's model of function offloading. For example, the port included the construction of an application-specific FileSystem Ebb. Rather than implement a file system and hard disk driver within the EbbRT library OS, the Ebb calls are offloaded to a (hosted) representative running in a Linux process. Our default implementation of the FileSystem Ebb is naïve, sending messages and incurring round trip costs for every access, rather than caching data on local representatives. For evaluation purposes we use a modified version of the FileSystem Ebb which performs no communication and serves a single static node.js script as stdin. This implementation allows us to evaluate the following workloads (which perform no file access) without also involving a hosted library.

One key observation of the node.js port is the modest development effort required to get a large piece of software functional, and, more importantly, the ability to reuse many of the software mechanisms used in our memcached application. The port was largely completed by a single developer in two weeks. Concretely, node.js and its dependencies total over one million lines of code, the majority of which is the V8 JavaScript engine. We wrote about 3,000 lines of new code in order to support node.js on EbbRT. A significant factor in simplifying the port is the fact that EbbRT is distributed with a custom toolchain. Rather than needing to modify the existing node.js build system, we specified EbbRT as a target and built it as we would any other cross-compiled binary. This illustrates EbbRT's support for a broad class of software, as well as the manner in which we reduce the developer burden required to develop specialized systems.

5.3.1 V8 JavaScript Benchmark

To compare the performance of our port to that of Linux, we launch node.js running version 7 of the V8 JavaScript benchmark suite [22]. This collection of purely compute-bound benchmarks stresses the core performance of the V8 JavaScript engine. Figure 8 shows the benchmark scores. Scores are computed by inverting the running time of the benchmark and scaling it by the score of a reference implementation (higher is better). The overall score is the geometric mean of the 8 individual scores. The figure normalizes each score to the Linux result.
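For concreteness, a benchmark's score is proportional to the inverse of its running time, normalized against a reference run, and the overall score combines the individual scores by geometric mean. A minimal sketch of the combining step (the suite's own scoring constants are not reproduced here):

#include <cmath>
#include <vector>

// Overall score = geometric mean of the individual benchmark scores.
double OverallScore(const std::vector<double>& scores) {
  double log_sum = 0.0;
  for (double s : scores) log_sum += std::log(s);
  return std::exp(log_sum / scores.size());
}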

EbbRT outperforms Linux run within a VM on each benchmark, with a 5.1% improvement in overall score. Most prominently, EbbRT is able to attain a 30.3% improvement in the memory intensive Splay benchmark. As we've made no modification to the V8 software, just running it on EbbRT accounts for the improved performance.

We further investigate the sources of the performanceadvantage by running the Linux perf utility to measureseveral CPU efficiency metrics. Table 4 displays theseresults. Several interesting aspects of this table deservehighlighting. First, EbbRT has a slightly better IPC ef-ficiency (3.76%), which can in part be attributed to itsperformance advantage. One reason for decreased effi-


            Inst/cycle   LLC ref/cycle   TLB miss/cycle   VM exits   Hypervisor time   Guest kernel time
EbbRT       2.48         0.0021          1.18e-5          5950       0.33%             N/A
Linux VM    2.39         0.0028          9.92e-5          66851      0.74%             1.08%

Table 4: V8 JavaScript Benchmark CPU-efficiency metrics

One reason for the decreased efficiency of the Linux VM is simply having to execute more instructions, such as additional VM exits and extraneous kernel functionality (e.g., scheduling). Second, the additional interactions with the hypervisor and kernel on Linux increase the working-set size and cause a 33% increase in LLC accesses. Third, Linux suffers nearly 9× more TLB misses than EbbRT. We attribute our TLB efficiency to our use of large pages throughout the system.

Figure 8: V8 JavaScript Benchmark. [Bar chart: normalized score (0.9–1.3, higher is better) for Crypto, DeltaBlue, EarleyBoyer, NavierStokes, RayTrace, RegExp, Richards, Splay, and Overall, comparing EbbRT against the Linux VM.]

5.3.2 Node.js Webserver

Lastly, we evaluate a trivial webserver written for node.js, which uses the builtin http module and responds to each GET request with a small static message totaling 148 bytes. We use the wrk [20] benchmark to place moderate load on the webserver and measure mean and 99th-percentile response-time latencies. EbbRT achieved a 91.1 µs mean and a 100.0 µs 99th-percentile latency; Linux achieved a 103.5 µs mean and a 120.6 µs 99th-percentile latency. The node.js webserver running on Linux thus has a 13.61% higher mean latency than the same webserver run on EbbRT, and its 99th-percentile latency is 20.65% higher.

These results suggest that an entire class of server-side applications written for node.js can achieve immediate performance advantages simply by running on top of EbbRT. As in our memcached evaluation, the ability of node.js to serve requests directly from hardware interrupts, without context switching or preemption, enables greater network performance. The non-preemptive run-to-completion execution model particularly improves tail latency. Our V8 benchmark results show that the use of large pages and simplified execution paths increases the efficiency of CPU- and memory-intensive workloads.

Finally, our approach opens the application up to further optimization opportunities. For example, one could modify V8 to directly access the page tables to improve garbage collection [4]. We expect that greater performance can be achieved through continued system specialization.

6 Related Work

The Exokernel [13] introduced the library operating system structure, in which system functionality is directly linked into the application and executes in the same address space and protection domain. Library operating systems have been shown to provide many useful properties, such as portability [45, 55], security [3, 38], and efficiency [28]. OSv [28] and Mirage [38] are similar to EbbRT in that they target virtual machines deployed in IaaS clouds. OSv constructs a general-purpose library OS and supports the Linux ABI. Mirage uses the OCaml programming language to construct minimal systems for security. EbbRT takes a middle ground, supporting source-level portability for existing applications through rich C++ functionality and standard libraries, while avoiding general-purpose OS interfaces.

CNK [42], Libra [1], Azul [53], and, more recently, Arrakis [44] and IX [5] have pursued architectures that enable specialized execution environments for performance-sensitive data flow. While their approaches vary, these systems must each trade off between targeting a narrow class of applications (e.g., HPC, Java web applications, or packet processing) and targeting a broad class of applications. Rather than supporting a single specialized execution environment, EbbRT provides a framework that enables the construction of many application-specific library operating systems.

Considerable work has been done on system software customization [8, 11, 17, 30, 50]. Much of this work focuses on developing general-purpose operating systems that are customizable, while EbbRT is focused on the construction of specialized systems.

Choices [10] and OSKit [16] provide operating system frameworks: in the case of Choices, for maintainability and extensibility, and in the case of OSKit, to simplify the construction of new operating systems. EbbRT differs most significantly in its performance objectives and its focus on enabling application developers to extend and customize system software. For example, by providing EbbRT as a modified toolchain, application-level software libraries (e.g., boost) can be used in a systems context with little to no modification.

Others have considered object-oriented frameworks with the goal of enabling high performance. CHAOSarc [18] and TinyOS [32] have both explored the use of a fine-grained object framework in the resource-limited setting of embedded systems. Like TinyOS, EbbRT combines language support for event-driven execution with method dispatch mechanisms that are friendly to compiler optimization. In a recent retrospective [33], the authors recognized that their development and use of the nesC programming language limited TinyOS from even broader, long-term use. EbbRT, by contrast, focuses on integration with existing software tooling, development patterns, and libraries to encourage continued applicability.

7 Concluding Remarks

We have presented EbbRT, a framework for constructing specialized systems for cloud applications. Through our evaluation we have established that EbbRT applications achieve their performance advantages through system-wide specialization rather than any one particular technique. In addition, we have shown that existing applications can be ported to EbbRT with modest effort and achieve a noticeable performance gain. Throughout this paper we have conveyed how our primary design elements, i.e., elastic building blocks, an event-driven execution environment, and a heterogeneous deployment model, working alongside advanced language constructs and new system primitives, have proven to be a novel and effective approach to achieving our stated objectives.

By encapsulating system components with minimal dispatch overhead, we enable application-specific performance specialization throughout all parts of our system. Furthermore, we have shown that our default Ebb implementations provide a foundation for achieving performance advantages. For example, our results illustrate the combined benefits of using a non-preemptive event-driven execution model, identity-mapped memory, and zero-copy paths.

As we gained experience with the system, the issue of easing development effort arose naturally. An early focus on enabling the use of standard libraries, including the addition of blocking primitives, greatly simplified development. Our monadic futures implementation addressed concrete concerns we had with event-driven programming; futures are now used throughout our implementation. IOBufs came about as a solution enabling pervasive zero-copy with little added complexity or overhead.

EbbRT's long-term utility hinges on its ability to be used for a broad range of applications while continuing to enable a high degree of per-application specialization. Our previous work on fragmented objects gives us confidence that various specialized implementations of our existing Ebbs can be introduced, as needed, without diminishing the overall value and integrity of the framework.

Future work involves further exploration of system specialization. Our focus has primarily revolved around networked applications; however, data storage applications should benefit equally from specialization. Additionally, EbbRT can be used to accelerate existing applications in a fine-grained fashion. We believe the hosted library can be used not just for compatibility for new applications, but as a way to offload performance-critical functionality to one or more library operating systems.

The EbbRT framework is open source and actively used in ongoing systems research. We invite developers and researchers alike to visit our online codebase at https://github.com/sesa/ebbrt/

Acknowledgments: We owe a great deal of gratitude to the Massachusetts Open Cloud (MOC) team for their help and support in getting EbbRT running and debugged on HIL [24]. We would like to thank Michael Stumm, Larry Rudolph, and Frank Bellosa for feedback on early drafts of this paper. Thanks to Tommy Unger, Kyle Hogan, and David Yeung for helping us get the nits out. We would also like to thank our shepherd, Andrew Baumann, for his help in preparing the final version. This work was supported by the National Science Foundation under award IDs CNS-1012798, CNS-1254029, CCF-1533663, and CNS-1414119. Early work was supported by the Department of Energy under Award Number DE-SC0005365.


References

[1] Glenn Ammons, Jonathan Appavoo, Maria Butrico, Dilma Da Silva, David Grove, Kiyokuni Kawachiya, Orran Krieger, Bryan Rosenburg, Eric Van Hensbergen, and Robert W. Wisniewski. Libra: A Library Operating System for a JVM in a Virtualized Execution Environment. In Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE '07, pages 44–54. ACM, 2007.

[2] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-scale Key-value Store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 53–64. ACM, 2012.

[3] Andrew Baumann, Marcus Peinado, and Galen Hunt. Shielding Applications from an Untrusted Cloud with Haven. ACM Trans. Comput. Syst., 33(3):8:1–8:26, August 2015.

[4] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 335–348. USENIX Association, 2012.

[5] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, Broomfield, CO, October 2014. USENIX Association.

[6] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pages 117–128. ACM, 2000.

[7] Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. Composing High-performance Memory Allocators. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI '01, pages 114–124. ACM, 2001.

[8] Brian N. Bershad, Craig Chambers, Susan Eggers, Chris Maeda, Dylan McNamee, Przemysław Pardyak, Stefan Savage, and Emin Gün Sirer. SPIN - An Extensible Microkernel for Application-specific Operating System Services. SIGOPS Oper. Syst. Rev., 29(1):74–77, January 1995.

[9] Georges Brun-Cottan and Mesaac Makpangou. Adaptable Replicated Objects in Distributed Environments. Research Report RR-2593, 1995. Project SOR.

[10] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices, Frameworks and Refinement. Computing Systems, 5(3):217–257, 1992.

[11] David R. Cheriton and Kenneth J. Duda. A Caching Model of Operating System Kernel Functionality. In Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, OSDI '94. USENIX Association, 1994.

[12] Jonathan Corbet. SLQB - and then there were four. http://lwn.net/Articles/311502, December 2008.

[13] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP '95, pages 251–266. ACM, 1995.

[14] Jason Evans. Scalable memory allocation using jemalloc. http://www.canonware.com/jemalloc/, 2011.

[15] Brad Fitzpatrick. Distributed Caching with Memcached. Linux Journal, 2004(124):5, August 2004.

[16] Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, and Olin Shivers. The Flux OSKit: A Substrate for Kernel and Language Research. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP '97, pages 38–51. ACM, 1997.

[17] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 87–100. USENIX Association, 1999.


[18] Ahmed Gheith and Karsten Schwan. CHAOSarc: Kernel Support for Multiweight Objects, Invocations, and Atomicity in Real-time Multiprocessor Applications. ACM Trans. Comput. Syst., 11(1):33–72, February 1993.

[19] Sanjay Ghemawat and Paul Menage. TCMalloc: Thread-Caching Malloc. http://goog-perftools.sourceforge.net/doc/tcmalloc.html, 2009.

[20] Will Glozer. wrk: Modern HTTP benchmarking tool. https://github.com/wg/wrk, 2014.

[21] Google. Protocol Buffers: Google's Data Interchange Format. https://developers.google.com/protocol-buffers.

[22] Google. V8 Benchmark Suite - Version 7. https://v8.googlecode.com/svn/data/benchmarks/v7/.

[23] Google. V8 JavaScript Engine. http://code.google.com/p/v8/.

[24] Jason Hennessey, Sahil Tikale, Ata Turk, Emine Ugur Kaynar, Chris Hill, Peter Desnoyers, and Orran Krieger. HIL: Designing an Exokernel for the Data Center. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 155–168. ACM, 2016.

[25] Intel Corporation. Intel DPDK: Data Plane Development Kit. http://dpdk.org.

[26] Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur, and Dhabaleswar K. Panda. Memcached Design on High Performance RDMA Capable Interconnects. In Proceedings of the 2011 International Conference on Parallel Processing, ICPP '11, pages 743–752. IEEE Computer Society, 2011.

[27] Rishi Kapoor, George Porter, Malveeka Tewari, Geoffrey M. Voelker, and Amin Vahdat. Chronos: Predictable Low Latency for Data Center Applications. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 9:1–9:14. ACM, 2012.

[28] Avi Kivity, Dor Laor, Glauber Costa, Pekka Enberg, Nadav Har'El, Don Marti, and Vlad Zolotarov. OSv—Optimizing the Operating System for Virtual Machines. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 61–72. USENIX Association, June 2014.

[29] Christopher Kohlhoff. Boost.Asio. http://www.boost.org/doc/libs/1_55_0/doc/html/boost_asio.html.

[30] Orran Krieger, Marc Auslander, Bryan Rosenburg, Robert W. Wisniewski, Jimi Xenidis, Dilma Da Silva, Michal Ostrowski, Jonathan Appavoo, Maria Butrico, Mark Mergen, Amos Waterland, and Volkmar Uhlig. K42: Building a Complete Operating System. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys '06, pages 133–145. ACM, 2006.

[31] Jacob Leverich. Mutilate: High-Performance Memcached Load Generator. https://github.com/leverich/mutilate, 2014.

[32] P. Levis, S. Madden, J. Polastre, R. Szewczyk, K. Whitehouse, A. Woo, D. Gay, J. Hill, M. Welsh, E. Brewer, and D. Culler. TinyOS: An Operating System for Sensor Networks. In Ambient Intelligence, pages 115–148. Springer Berlin Heidelberg, 2005.

[33] Philip Levis. Experiences from a Decade of TinyOS Development. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 207–220. USENIX Association, 2012.

[34] libuv. http://libuv.org.

[35] Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 429–444. USENIX Association, April 2014.

[36] Kevin Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, and Thomas F. Wenisch. Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 36–47. ACM, 2013.

[37] Barbara Liskov and Liuba Shrira. Promises: Linguistic Support for Efficient Asynchronous Procedure Calls in Distributed Systems. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI '88, pages 260–267. ACM, 1988.


[38] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. Unikernels: Library Operating Systems for the Cloud. SIGPLAN Not., 48(4):461–472, March 2013.

[39] Mesaac Makpangou, Yvon Gourhant, and Jean-Pierre Le Narzul. Fragmented Objects for Distributed Abstractions. In Readings in Distributed Computing Systems, pages 170–186. IEEE Computer Society Press, 1992.

[40] Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger, Rusty Russell, Dipankar Sarma, and Maneesh Soni. Read-Copy Update. In Ottawa Linux Symposium, July 2001.

[41] Greg Minshall, Yasushi Saito, Jeffrey C. Mogul, and Ben Verghese. Application Performance Pitfalls and TCP's Nagle Algorithm. SIGMETRICS Perform. Eval. Rev., 27(4):36–44, March 2000.

[42] José Moreira, Michael Brutman, José Castaños, Thomas Engelsiepen, Mark Giampapa, Tom Gooding, Roger Haskin, Todd Inglett, Derek Lieber, Pat McCarthy, Mike Mundy, Jeff Parker, and Brian Wallenfelt. Designing a Highly-Scalable Operating System: The Blue Gene/L Story. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06. ACM, 2006.

[43] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI '13, pages 385–398. USENIX Association, 2013.

[44] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Operating System is the Control Plane. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 1–16. USENIX Association, October 2014.

[45] Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, and Galen C. Hunt. Rethinking the Library OS from the Top Down. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 291–304. ACM, 2011.

[46] Niels Provos and Nick Mathewson. libevent - an event notification library. http://libevent.org/, 2003.

[47] Injong Rhee, Nallathambi Balaguru, and George N. Rouskas. MTCP: Scalable TCP-like Congestion Control for Reliable Multicast. Comput. Netw., 38(5):553–575, April 2002.

[48] Luigi Rizzo. Netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC '12, page 9. USENIX Association, 2012.

[49] Dan Schatzberg, James Cadden, Orran Krieger, and Jonathan Appavoo. A Way Forward: Enabling Operating System Innovation in the Cloud. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14). USENIX Association, June 2014.

[50] Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. Dealing with Disaster: Surviving Misbehaved Kernel Extensions. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, OSDI '96, pages 213–227, New York, NY, USA, 1996. ACM.

[51] Marc Shapiro, Yvon Gourhant, Sabine Habert, Laurence Mosseri, Michel Ruffin, and Céline Valot. SOS: An Object-Oriented Operating System - Assessment and Perspectives. Computing Systems, 2:287–337, 1991.

[52] Quinn O. Snell, Armin R. Mikler, and John L. Gustafson. NetPIPE: A Network Protocol Independent Performance Evaluator. In IASTED International Conference on Intelligent Information Management and Systems, 1996.

[53] Gil Tene, Balaji Iyengar, and Michael Wolf. C4: The Continuously Concurrent Compacting Collector. In Proceedings of the International Symposium on Memory Management, ISMM '11, pages 79–88. ACM, 2011.

[54] Ajay Tirumala, Feng Qin, Jon Dugan, Jim Ferguson, and Kevin Gibbs. Iperf: The TCP/UDP bandwidth measurement tool. https://iperf.fr/.

[55] Chia-Che Tsai, Kumar Saurabh Arora, Nehal Bandi, Bhushan Jain, William Jannen, Jitin John, Harry A. Kalodner, Vrushali Kulkarni, Daniela Oliveira, and Donald E. Porter. Cooperation and Security Isolation of Library OSes for Multi-process Applications. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 9:1–9:14. ACM, 2014.

[56] Maarten van Steen, Philip Homburg, and Andrew S. Tanenbaum. Globe: A Wide-Area Distributed System. IEEE Concurrency, 7(1):70–78, January 1999.

[57] S. Vinoski. CORBA: Integrating Diverse Applications Within Distributed Heterogeneous Environments. Comm. Mag., 35(2):46–55, February 1997.

[58] Rob von Behren, Jeremy Condit, and Eric Brewer. Why Events Are a Bad Idea (for High-concurrency Servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9, HOTOS '03, page 4. USENIX Association, 2003.

[59] David A. Wheeler. SLOCCount. http://www.dwheeler.com/sloccount/.

[60] Sara Williams and Charlie Kindel. The Component Object Model: A Technical Overview. Dr. Dobb's Journal, 356:356–375, 1994.
