
Arachne: Core-Aware Thread Management

Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout
{hq6,qianli,jspeiser,kraftp,ouster}@cs.stanford.edu
Stanford University

This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), October 8–10, 2018, Carlsbad, CA, USA. ISBN 978-1-939133-08-3. Open access to the Proceedings is sponsored by USENIX.
https://www.usenix.org/conference/osdi18/presentation/qin

Abstract
Arachne is a new user-level implementation of threads that provides both low latency and high throughput for applications with extremely short-lived threads (only a few microseconds). Arachne is core-aware: each application determines how many cores it needs, based on its load; it always knows exactly which cores it has been allocated, and it controls the placement of its threads on those cores. A central core arbiter allocates cores between applications. Adding Arachne to memcached improved SLO-compliant throughput by 37%, reduced tail latency by more than 10x, and allowed memcached to coexist with background applications with almost no performance impact. Adding Arachne to the RAMCloud storage system increased its write throughput by more than 2.5x. The Arachne threading library is optimized to minimize cache misses; it can initiate a new user thread on a different core (with load balancing) in 320 ns. Arachne is implemented entirely at user level on Linux; no kernel modifications are needed.

1 Introduction
Advances in networking and storage technologies have made it possible for datacenter services to operate at exceptionally low latencies [5]. As a result, a variety of low-latency services have been developed in recent years, including FaRM [11], Memcached [23], MICA [20], RAMCloud [30], and Redis [34]. They offer end-to-end response times as low as 5 µs for clients within the same datacenter and they have internal request service times as low as 1–2 µs. These systems employ a variety of new techniques to achieve their low latency, including polling instead of interrupts, kernel bypass, and run to completion [6, 31].

However, it is difficult to construct services that provide both low latency and high throughput. Techniques for achieving low latency, such as reserving cores for peak throughput or using polling instead of interrupts, waste resources. Multi-level services, in which servicing one request may require nested requests to other servers (such as for replication), create additional opportunities for resource underutilization, particularly if they use polling to reduce latency. Background activities within a service, such as garbage collection, either require additional reserved (and hence underutilized) resources, or risk interference with foreground request servicing. Ideally, it should be possible to colocate throughput-oriented services such as MapReduce [10] or video processing [22] with low-latency services, such that resources are fully occupied by the throughput-oriented services when not needed by the low-latency services. However, this is rarely attempted in practice because it impacts the performance of the latency-sensitive services.

One of the reasons it is difficult to combine low latency and high throughput is that applications must manage their parallelism with a virtual resource (threads); they cannot tell the operating system how many physical resources (cores) they need, and they do not know which cores have been allocated for their use. As a result, applications cannot adjust their internal parallelism to match the resources available to them, and they cannot use application-specific knowledge to optimize their use of resources. This can lead to both under-utilization and over-commitment of cores, which results in poor resource utilization and/or suboptimal performance. The only recourse for applications is to pin threads to cores; this results in under-utilization of cores within the application and does not prevent other applications from being scheduled onto the same cores.

Arachne is a thread management system that solves these problems by giving applications visibility into the physical resources they are using. We call this approach core-aware thread management. In Arachne, application threads are managed entirely at user level; they are not visible to the operating system. Applications negotiate with the system over cores, not threads. Cores are allocated for the exclusive use of individual applications and remain allocated to an application for long intervals (tens of milliseconds). Each application always knows exactly which cores it has been allocated and it decides how to schedule application threads on cores. A core arbiter decides how many cores to allocate to each application, and adjusts the allocations in response to changing application requirements.

User-level thread management systems have been implemented many times in the past [39, 14, 4] and the basic features of Arachne were prototyped in the early 1990s in the form of scheduler activations [2]. Arachne is novel in the following ways:
• Arachne contains mechanisms to estimate the number of cores needed by an application as it runs.
• Arachne allows each application to define a core policy, which determines at runtime how many cores the application needs and how threads are placed on the available cores.
• The Arachne runtime was designed to minimize cache misses. It uses a novel representation of scheduling information with no ready queues, which enables low-latency and scalable mechanisms for thread creation, scheduling, and synchronization.

• Arachne provides a simpler formulation than scheduler activations, based on the use of one kernel thread per core.
• Arachne runs entirely outside the kernel and needs no kernel modifications; the core arbiter is implemented at user level using the Linux cpuset mechanism. Arachne applications can coexist with traditional applications that do not use Arachne.

We have implemented the Arachne runtime and core arbiter in C++ and evaluated them using both synthetic benchmarks and the memcached and RAMCloud storage systems. Arachne can initiate a new thread on a different core (with load balancing) in about 320 ns, and an application can obtain an additional core from the core arbiter in 20–30 µs. When Arachne was added to memcached, it reduced tail latency by more than 10x and allowed 37% higher throughput at low latency. Arachne also improved performance isolation; a background video processing application could be colocated with memcached with almost no impact on memcached's latency. When Arachne was added to the RAMCloud storage system, it improved write throughput by more than 2.5x.

2 The Threading Problem
Arachne was motivated by the challenges in creating services that process very large numbers of very small requests. These services can be optimized for low latency or for high throughput, but it is difficult to achieve both with traditional threads implemented by the operating system.

As an example, consider memcached [23], a widely used in-memory key-value store. Memcached processes requests in about 10 µs. Creating a new kernel thread for each incoming request would be too expensive, so memcached uses a fixed-size pool of worker threads. New connections are assigned statically to worker threads in a round-robin fashion by a separate dispatch thread.

The number of worker threads is fixed when memcached starts, which results in several inefficiencies. If the number of cores available to memcached is smaller than the number of workers, the operating system will multiplex workers on a single core, resulting in long delays for requests sent to descheduled workers. For best performance, one core must be reserved for each worker thread. If background tasks are run on the machine during periods of low load, they are likely to interfere with the memcached workers, due to the large number of distinct worker threads. Furthermore, during periods of low load, each worker thread will be lightly loaded, increasing the risk that its core will enter power-saving states with high-latency wakeups. Memcached would perform better if it could scale back during periods of low load to use a smaller number of kernel threads (and cores) more intensively.

In addition, memcached's static assignment of connections to workers can result in load imbalances under skewed workloads, with some worker threads overloaded and others idle. This can impact both latency and throughput.

The RAMCloud storage system provides another example [30]. RAMCloud is an in-memory key-value store that processes small read requests in about 2 µs. Like memcached, it is based on kernel threads. A dispatch thread handles all network communication and polls the NIC for incoming packets using kernel bypass. When a request arrives, the dispatch thread delegates it to one of a collection of worker threads for execution; this approach avoids problems with skewed workloads. The number of active worker threads varies based on load. The maximum number of workers is determined at startup, which creates issues similar to memcached.

RAMCloud implements nested requests, which result in poor resource utilization because of microsecond-scale idle periods that cannot be used. When a worker thread receives a write request, it sends copies of the new value to backup servers and waits for those requests to return before responding to the original request. All of the replication requests complete within 7–8 µs, so the worker busy-waits for them. If the worker were to sleep, it would take several microseconds to wake it up again; in addition, context-switching overheads are too high to get much useful work done in such a short time. As a result, the worker thread's core is wasted for 70–80% of the time to process a write request; write throughput for a server is only about 150 kops/sec for small writes, compared with about 1 Mops/sec for small reads.

The goal for Arachne is to provide a thread management system that allows a better combination of low latency and high throughput. For example, each application should match its workload to available cores, taking only as many cores as needed and dynamically adjusting its internal parallelism to reflect the number of cores allocated to it. In addition, Arachne should provide an implementation of user-level threads that is efficient enough to be used for very short-lived threads, and that allows useful work to be done during brief blockages such as those for nested requests.

Although some existing applications will benefit from Arachne, we expect Arachne to be used primarily for new granular applications whose threads have lifetimes of only a few microseconds. These applications are difficult or impossible to build today because of inefficiencies in current threading infrastructure.

3 Arachne Overview
Figure 1 shows the overall architecture of Arachne. Three components work together to implement Arachne threads. The core arbiter consists of a stand-alone user process plus a small library linked into each application.

[Figure 1: The Arachne architecture. Each application contains the arbiter library, the Arachne runtime (or a custom thread library), and a core policy; the core arbiter is a separate process. The core arbiter communicates with each application using one socket for each kernel thread in the application, plus one page of shared memory.]

The Arachne runtime and core policies are libraries linked into applications. Different applications can use different core policies. An application can also substitute its own threading library for the Arachne runtime and core policy, while still using the core arbiter.

The core arbiter is a user-level process that manages cores and allocates them to applications. It collects information from each application about how many cores it needs and uses a simple priority mechanism to divide the available cores among competing applications. The core arbiter adjusts the core allocations as application requirements change. Section 4 describes the core arbiter in detail.

The Arachne runtime creates several kernel threads and uses them to implement user threads, which are used by Arachne applications. The Arachne user thread abstraction contains facilities similar to thread packages based on kernel threads, including thread creation and deletion, locks, and condition variables. However, all operations on user threads are carried out entirely at user level without making kernel calls, so they are an order of magnitude faster than operations on kernel threads. Section 5 describes the implementation of the Arachne runtime in more detail.

The Arachne runtime works together with a core policy, which determines how cores are used by that application. The core policy computes the application's core requirements, using performance information gathered by the Arachne runtime. It also determines which user threads run on which cores. Each application chooses its core policy. Core policies are discussed in Section 6.

Arachne uses kernel threads as a proxy for cores. Each kernel thread created by the runtime executes on a separate core and has exclusive access to that core while it is running. When the arbiter allocates a core to an application, it unblocks one of the application's kernel threads on that core; when the core is removed from an application, the kernel thread running on that core blocks. The Arachne runtime runs a simple dispatcher in each kernel thread, which multiplexes several user threads on the associated core.

Arachne uses a cooperative multithreading model for user threads: the runtime does not preempt a user thread once it has begun executing. If a user thread needs to execute for a long time without blocking, it must occasionally invoke a yield method, which allows other threads to run before the calling thread continues. We expect most threads to either block or complete quickly, so it should rarely be necessary to invoke yield.

One potential problem with a user-level implementation of threads is that a user thread might cause the underlying kernel thread to block. This could happen, for example, if the user thread invokes a blocking kernel call or incurs a page fault. This prevents the kernel thread from running other user threads until the kernel call or page fault completes. Previous implementations of user-level threads have attempted to work around this inefficiency in a variety of ways, often involving complex kernel modifications.

Arachne does not take any special steps to handle blocking kernel calls or page faults. Most modern operating systems support asynchronous I/O, so I/O can be implemented without blocking the kernel thread. Paging is almost never cost-effective today, given the low cost of memory and the large sizes of memories. Modern servers rarely incur page faults except for initial loading, such as when an application starts or a file is mapped into virtual memory. Thus, for simplicity, Arachne does not attempt to make use of the time when a kernel thread is blocked for a page fault or kernel call.

Note: we use the term core to refer to any hardware mechanism that can support an independent thread of computation. In processors with hyperthreading, we think of each hyperthread as a separate logical core, even though some of them share a single physical core.

4 The Core Arbiter
This section describes how the core arbiter claims control over (most of) the system's cores and allocates them among applications. The core arbiter has three interesting features. First, it implements core management entirely at user level using existing Linux mechanisms; it does not require any kernel changes. Second, it coexists with existing applications that don't use Arachne. And third, it takes a cooperative approach to core management, both in its priority mechanism and in the way it preempts cores from applications.

The core arbiter runs as a user process with root privilege and uses the Linux cpuset mechanism to manage cores. A cpuset is a collection of one or more cores and one or more banks of memory. At any given time, each kernel thread is assigned to exactly one cpuset, and the Linux scheduler ensures that the thread executes only on cores in that cpuset. By default, all threads run in a cpuset containing all cores and all memory banks. The core arbiter uses cpusets to allocate specific cores to specific applications.

The core arbiter divides cores into two groups: managed cores and unmanaged cores. Managed cores are allocated by the core arbiter; only the kernel threads created by Arachne run on these cores. Unmanaged cores continue to be scheduled by Linux. They are used by processes that do not use Arachne, and also by the core arbiter itself. In addition, if an Arachne application creates new kernel threads outside Arachne, for example, using std::thread, these threads will run on the unmanaged cores.

When the core arbiter starts up, it creates one cpuset for unmanaged cores (the unmanaged cpuset) and places all of the system's cores into that set. It then assigns every existing kernel thread (including itself) to the unmanaged cpuset; any new threads spawned by these threads will also run on this cpuset. The core arbiter also creates one managed cpuset corresponding to each core, which contains that single core but initially has no threads assigned to it. To allocate a core to an Arachne application, the arbiter removes that core from the unmanaged cpuset and assigns an Arachne kernel thread to the managed cpuset for that core. When a core is no longer needed by any Arachne application, the core arbiter adds the core back to the unmanaged cpuset.
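To make the cpuset manipulation concrete, the following sketch shows how a managed cpuset for one core could be created and how a kernel thread could be moved onto it through the cgroup-v1 cpuset filesystem. It assumes the controller is mounted at /sys/fs/cgroup/cpuset and that there is a single memory node; the directory names and helper functions are illustrative, not the arbiter's actual code.

#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Root of the cpuset hierarchy (assumed mount point; requires root).
static const std::string kCpusetRoot = "/sys/fs/cgroup/cpuset";

static void writeFile(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    f << value;
}

// Create a managed cpuset containing exactly one core.
void createManagedCpuset(int coreId) {
    std::string dir = kCpusetRoot + "/arachne_core_" + std::to_string(coreId);
    mkdir(dir.c_str(), 0755);
    writeFile(dir + "/cpuset.cpus", std::to_string(coreId));
    writeFile(dir + "/cpuset.mems", "0");   // single memory node assumed
}

// Move a kernel thread (identified by its Linux thread id) onto that core.
void assignThreadToCore(pid_t tid, int coreId) {
    std::string dir = kCpusetRoot + "/arachne_core_" + std::to_string(coreId);
    writeFile(dir + "/tasks", std::to_string(tid));
}

Returning a core to the unmanaged pool would correspondingly move the thread back into the unmanaged cpuset's tasks file.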

This scheme allows Arachne applications to coexist with traditional applications whose threads are managed by the Linux kernel. Arachne applications receive preferential access to cores, except that the core arbiter reserves at least one core for the unmanaged cpuset.

The Arachne runtime communicates with the core arbiter using three methods in the arbiter's library package:
• setRequestedCores: invoked by the runtime whenever its core needs change; indicates the total number of cores needed by the application at various priority levels (see below for details).
• blockUntilCoreAvailable: invoked by a kernel thread to identify itself to the core arbiter and put the kernel thread to sleep until it is assigned a core. At that point the kernel thread wakes up and this method returns the identifier of the assigned core.
• mustReleaseCore: invoked periodically by the runtime; a true return value means that the calling kernel thread should invoke blockUntilCoreAvailable to return its core to the arbiter.

Normally, the Arachne runtime handles all communication with the core arbiter, so these methods are invisible to applications. However, an application can implement its own thread and core management by calling the arbiter library package directly.
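The sketch below restates these three methods as a C++ interface, roughly as an application calling the arbiter library directly would see them. The method names come from the text above; the class name, parameter types, and return types are assumptions rather than the library's actual declarations.

#include <cstdint>
#include <vector>

// Hypothetical view of the arbiter library interface described above.
class CoreArbiterClient {
  public:
    // Tell the arbiter how many cores this application wants at each
    // priority level (index 0 = highest priority).
    void setRequestedCores(const std::vector<uint32_t>& coresPerPriority);

    // Register the calling kernel thread with the arbiter and block until
    // a core is assigned; returns the identifier of the assigned core.
    int blockUntilCoreAvailable();

    // True if the arbiter has asked the calling kernel thread to give up
    // its core; the thread should clean up and then call
    // blockUntilCoreAvailable() again.
    bool mustReleaseCore();
};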

The methods described above communicate with the core arbiter using a collection of Unix domain sockets and a shared memory page (see Figure 1). The arbiter library opens one socket for each kernel thread. This socket is used to send requests to the core arbiter, and it is also used to put the kernel thread to sleep when it has no assigned core. The shared memory page is used by the core arbiter to pass information to the arbiter library; it is written by the core arbiter and is read-only to the arbiter library.

When the Arachne runtime starts up, it invokes setRequestedCores to specify the application's initial core requirements; setRequestedCores sends a message to the core arbiter over a socket. Then the runtime creates one kernel thread for each core on the machine; all of these threads invoke blockUntilCoreAvailable. blockUntilCoreAvailable sends a request to the core arbiter over the socket belonging to that kernel thread and then attempts to read a response from the socket. This has two effects: first, it notifies the core arbiter that the kernel thread is available for it to manage (the request includes the Linux identifier for the thread); second, the socket read puts the kernel thread to sleep.

At this point the core arbiter knows about the application's core requirements and all of its kernel threads, and the kernel threads are all blocked. When the core arbiter decides to allocate a core to the application, it chooses one of the application's blocked kernel threads to run on that core. It assigns that thread to the cpuset corresponding to the allocated core and then sends a response message back over the thread's socket. This causes the thread to wake up, and Linux will schedule the thread on the given core; the blockUntilCoreAvailable method returns, with the core identifier as its return value. The kernel thread then invokes the Arachne dispatcher to run user threads.

If the core arbiter wishes to reclaim a core from an application, it asks the application to release the core. The core arbiter does not unilaterally preempt cores, since the core's kernel thread might be in an inconvenient state (e.g., it might have acquired an important spin lock); abruptly stopping it could have significant performance consequences for the application. So, the core arbiter sets a variable in the shared memory page, indicating which core(s) should be released. Then it waits for the application to respond.

Each kernel thread is responsible for testing the information in shared memory at regular intervals by invoking mustReleaseCore. The Arachne runtime does this in its dispatcher. If mustReleaseCore returns true, then the kernel thread cleans up as described in Section 5.4 and invokes blockUntilCoreAvailable. This notifies the core arbiter and puts the kernel thread to sleep. At this point, the core arbiter can reallocate the core to a different application.

The communication mechanism between the core arbiter and applications is intentionally asymmetric: requests from applications to the core arbiter use sockets, while requests from the core arbiter to applications use shared memory. The sockets are convenient because they allow the core arbiter to sleep while waiting for requests; they also allow application kernel threads to sleep while waiting for cores to be assigned. Socket communication is relatively expensive (several microseconds in each direction), but it only occurs when application core requirements change, which we expect to be infrequent. The shared memory page is convenient because it allows the Arachne runtime to test efficiently for incoming requests from the core arbiter; these tests are made frequently (every pass through the user thread dispatcher), so it is important that they are fast and do not involve kernel calls.

Applications can delay releasing cores for a short time in order to reach a convenient stopping point, such as a time when no locks are held. The Arachne runtime will not release a core until the dispatcher is invoked on that core, which happens when a user thread blocks, yields, or exits.

If an application fails to release a core within a timeout period (currently 10 ms), then the core arbiter will forcibly reclaim the core. It does this by reassigning the core's kernel thread to the unmanaged cpuset. The kernel thread will be able to continue executing, but it will probably experience degraded performance due to interference from other threads in the unmanaged cpuset.

The core arbiter uses a simple priority mechanism for allocating cores to applications. Arachne applications can request cores on different priority levels (the current implementation supports eight). The core arbiter allocates cores from highest priority to lowest, so low-priority applications may receive no cores. If there are not enough cores for all of the requests at a particular level, the core arbiter divides the cores evenly among the requesting applications. The core arbiter repeats this computation whenever application requests change. The arbiter allocates all of the hyperthreads of a particular hardware core to the same application whenever possible. The core arbiter also attempts to keep all of an application's cores on the same socket.
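The division of cores by priority can be sketched as follows, assuming the eight priority levels mentioned above and ignoring the hyperthread and socket placement rules; the data structures and the handling of leftover cores are illustrative only.

#include <cstddef>
#include <vector>

constexpr int kNumPriorities = 8;

struct AppRequest {
    int requested[kNumPriorities];   // cores wanted at each priority level
};

// Returns how many cores each application receives out of freeCores.
std::vector<int> allocateCores(const std::vector<AppRequest>& apps, int freeCores) {
    std::vector<int> granted(apps.size(), 0);
    for (int p = 0; p < kNumPriorities && freeCores > 0; p++) {
        int demand = 0, requesters = 0;
        for (const AppRequest& a : apps) {
            demand += a.requested[p];
            if (a.requested[p] > 0) requesters++;
        }
        if (demand == 0) continue;
        if (demand <= freeCores) {
            // Enough cores: everyone gets what they asked for at this level.
            for (size_t i = 0; i < apps.size(); i++) granted[i] += apps[i].requested[p];
            freeCores -= demand;
        } else {
            // Too few cores: divide them evenly among the requesters
            // (leftover cores from the integer division are dropped here).
            for (size_t i = 0; i < apps.size(); i++) {
                if (apps[i].requested[p] == 0) continue;
                int share = freeCores / requesters;
                if (share > apps[i].requested[p]) share = apps[i].requested[p];
                granted[i] += share;
            }
            freeCores = 0;
        }
    }
    return granted;
}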

This policy for core allocation assumes that applications (and their users) will cooperate in their choice of priority levels: a misbehaving application could starve other applications by requesting all of its cores at the highest priority level. Anti-social behavior could be prevented by requiring applications to authenticate with the core arbiter when they first connect, and allowing system administrators to set limits for each application or user. We leave such a mechanism to future work.

5 The Arachne Runtime
This section discusses how the Arachne runtime implements user threads. The most important goal for the runtime is to provide a fast and scalable implementation of user threads for modern multi-core hardware. We want Arachne to support granular computations, which consist of large numbers of extremely short-lived threads. For example, a low latency server might create a new thread for each incoming request, and the request might take only a microsecond or two to process; the server might process millions of these requests per second.

5.1 Cache-optimized design
The performance of the Arachne runtime is dominated by cache misses. Most threading operations, such as creating a thread, acquiring a lock, or waking a blocked thread, are relatively simple, but they involve communication between cores. Cross-core communication requires cache misses. For example, to transfer a value from one core to another, it must be written on the source core and read on the destination core. This takes about three cache miss times: the write will probably incur a cache miss to first read the data; the write will then invalidate the copy of the data in the destination cache, which takes about the same time as a cache miss; finally, the read will incur a cache miss to fetch the new value of the data. Cache misses can take from 50–200 cycles, so even if an operation requires only a single cache miss, the miss is likely to cost more than all of the other computation for the operation. On our servers, the cache misses to transfer a value from one core to another in the same socket take 7–8x as long as a context switch between user threads on the same core. Transfers between sockets are even more expensive. Thus, our most important goal in implementing user threads was to minimize cache misses.

The effective cost of a cache miss can be reduced by performing other operations concurrently with the miss. For example, if several cache misses occur within a few instructions of each other, they can all be completed for the cost of a single miss (modern processors have out-of-order execution engines that can continue executing instructions while waiting for cache misses, and each core has multiple memory channels). Thus, additional cache misses are essentially free. However, modern processors have an out-of-order execution limit of about 100 instructions, so code must be designed to concentrate likely cache misses near each other.

Similarly, a computation that takes tens of nanoseconds in isolation may actually have zero marginal cost if it occurs in the vicinity of a cache miss; it will simply fill the time while the cache miss is being processed. Section 5.3 will show how the Arachne dispatcher uses this technique to hide the cost of seemingly expensive code.

5.2 Thread creation
Many user-level thread packages, such as the one in Go [14], create new threads on the same core as the parent; they use work stealing to balance load across cores. This avoids cache misses at thread creation time. However, work stealing is an expensive operation (it requires cache misses), which is particularly noticeable for short-lived threads. Work stealing also introduces a time lag before a thread is stolen to an unloaded core, which impacts service latency. For Arachne we decided to perform load-balancing at thread creation time; our goal is to get a new thread on an unloaded core as quickly as possible. By optimizing this mechanism based on cache misses, we were able to achieve thread creation times competitive with systems that create child threads on the parent's core.

Cache misses can occur during thread creation for the following reasons:
• Load balancing: Arachne must choose a core for the new thread in a way that balances load across available cores; cache misses are likely to occur while fetching shared state describing current loads.
• State transfer: the address and arguments for the thread's top-level method must be transferred from the parent's core to the child's core.
• Scheduling: the parent must indicate to the child's core that the child thread is runnable.
• Thread context: the context for a thread consists of its call stack, plus metadata used by the Arachne runtime, such as scheduling state and saved execution state when the thread is not running. Depending on how this information is managed, it can result in additional cache misses.

We describe below how Arachne can create a new user thread in four cache miss times.

In order to minimize cache misses for thread contexts, Arachne binds each thread context to a single core (the context is only used by a single kernel thread). Each user thread is assigned to a thread context when it is created, and the thread executes only on the context's associated core. Most threads live their entire life on a single core. A thread moves to a different core only as part of an explicit migration. This happens only in rare situations such as when the core arbiter reclaims a core. A thread context remains bound to its core after its thread completes, and Arachne reuses recently-used contexts when creating new threads. If threads have short lifetimes, it is likely that the context for a new thread will already be cached.

To create a new user thread, the Arachne runtime must choose a core for the thread and allocate one of the thread contexts associated with that core. Each of these operations will probably result in cache misses, since they manipulate shared state. In order to minimize cache misses, Arachne uses the same shared state to perform both operations simultaneously. The state consists of a 64-bit maskAndCount value for each active core. 56 bits of the value are a bit mask indicating which of the core's thread contexts are currently in use, and the remaining 8 bits are a count of the number of ones in the mask.

When creating new threads, Arachne uses the "power of two choices" approach for load balancing [26]. It selects two cores at random, reads their maskAndCount values, and selects the core with the fewest active thread contexts. This will likely result in a cache miss for each maskAndCount, but they will be handled concurrently so the total delay is that of a single miss. Arachne then scans the mask bits for the chosen core to find an available thread context and uses an atomic compare-and-swap operation to update the maskAndCount for the chosen core. If the compare-and-swap fails because of a concurrent update, Arachne rereads the maskAndCount for the chosen core and repeats the process of allocating a thread context. This creation mechanism is scalable: with a large number of cores, multiple threads can be created simultaneously on different cores.
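A sketch of this allocation step is shown below, assuming the 56-bit mask sits in the low bits of the word and the 8-bit count in the high bits (the actual field layout is not specified in the text); the helper names are illustrative.

#include <atomic>
#include <cstdint>
#include <random>

constexpr int kMaxThreadsPerCore = 56;

struct MaskAndCount {
    std::atomic<uint64_t> value{0};   // low 56 bits: mask; high 8 bits: count
};

// Try to claim a free thread context slot on core; returns the slot index,
// or -1 if all 56 contexts are occupied.
int allocateContext(MaskAndCount& core) {
    while (true) {
        uint64_t old = core.value.load(std::memory_order_acquire);
        uint64_t mask = old & ((1ULL << kMaxThreadsPerCore) - 1);
        uint64_t count = old >> kMaxThreadsPerCore;
        if (count >= kMaxThreadsPerCore) return -1;
        int slot = __builtin_ctzll(~mask);                    // first free context
        uint64_t updated = (mask | (1ULL << slot)) |
                           ((count + 1) << kMaxThreadsPerCore);
        // Retry if another creator updated the word concurrently.
        if (core.value.compare_exchange_weak(old, updated,
                                             std::memory_order_acq_rel)) {
            return slot;
        }
    }
}

// "Power of two choices": pick the less loaded of two random cores.
int chooseCore(MaskAndCount* cores, int numCores, std::mt19937& rng) {
    int a = rng() % numCores;
    int b = rng() % numCores;
    uint64_t loadA = cores[a].value.load() >> kMaxThreadsPerCore;
    uint64_t loadB = cores[b].value.load() >> kMaxThreadsPerCore;
    return loadA <= loadB ? a : b;
}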

Once a thread context has been allocated, Arachne copies the address and arguments for the thread's top-level method into the context and schedules the thread for execution by setting a word-sized variable in the context to indicate that the thread is runnable. In order to minimize cache misses, Arachne uses a single cache line to hold all of this information. This limits argument lists to 6 one-word parameters on machines with 64-byte cache lines; larger parameter lists must be passed by reference, which will result in additional cache misses.
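The packing into one cache line can be sketched as a single aligned structure; the field names below are illustrative, but the sizes follow directly from the constraints in the text (a 64-byte line, a method address, six one-word arguments, and a word-sized scheduling flag).

#include <cstdint>

struct alignas(64) ThreadInvocation {
    void (*func)(uintptr_t*);       // address of the thread's top-level method
    uintptr_t args[6];              // up to six one-word arguments
    volatile uint64_t scheduleFlag; // word-sized variable marking the thread runnable
};
static_assert(sizeof(ThreadInvocation) == 64, "must fit in one cache line");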

With this mechanism, a new thread can be invoked on a different core in four cache miss times. One cache miss is required to read the maskAndCount and three cache miss times are required to transfer the line containing the method address and arguments and the scheduling flag, as described in Section 5.1.

The maskAndCount variable limits Arachne to 56 threads per core at a given time. As a result, programming models that rely on large numbers of blocked threads may be unsuitable for use with Arachne.

5.3 Thread scheduling
The traditional approach to thread scheduling uses one or more ready queues to identify runnable threads (typically one queue per core, to reduce contention), plus a scheduling state variable for each thread, which indicates whether that thread is runnable or blocked. This representation is problematic from the standpoint of cache misses. Adding or removing an entry to/from a ready queue requires updates to multiple variables. Even if the queue is lockless, this is likely to result in multiple cache misses when the queue is shared across cores. Furthermore, we expect sharing to be common: a thread must be added to the ready queue for its core when it is awakened, but the wakeup typically comes from a thread on a different core.

In addition, the scheduling state variable is subject to races. For example, if a thread blocks on a condition variable, but another thread notifies the condition variable before the blocking thread has gone to sleep, a race over the scheduling state variable could cause the wakeup to be lost. This race is typically eliminated with a lock that controls access to the state variable. However, the lock results in additional cache misses, since it is shared across cores.

In order to minimize cache misses, Arachne does not use ready queues. Instead of checking a ready queue, the Arachne dispatcher repeatedly scans all of the active user thread contexts associated with the current core until it finds one that is runnable. This approach turns out to be relatively efficient, for two reasons. First, we expect only a few thread contexts to be occupied for a core at a given time (there is no need to keep around blocked threads for intermittent tasks; a new thread can be created for each task). Second, the cost of scanning the active thread contexts is largely hidden by an unavoidable cache miss on the scheduling state variable for the thread that woke up. This variable is typically modified by a different core to wake up the thread, which means the dispatcher will have to take a cache miss to observe the new value. 100 or more cycles elapse between when the previous value of the variable is invalidated in the dispatcher's cache and the new value can be fetched; a large number of thread contexts can be scanned during this time. Section 7.4 evaluates the cost of this approach.

Arachne also uses a new lockless mechanism for scheduling state. The scheduling state of a thread is represented with a 64-bit wakeupTime variable in its thread context. The dispatcher considers a thread runnable if its wakeupTime is less than or equal to the processor's fine-grain cycle counter. Before transferring control to a thread, the dispatcher sets its wakeupTime to the largest possible value. wakeupTime doesn't need to be modified when the thread blocks: the large value will prevent the thread from running again. To wake up the thread, wakeupTime is set to 0. This approach eliminates the race condition described previously, since wakeupTime is not modified when the thread blocks; thus, no synchronization is needed for access to the variable.

The wakeupTime variable also supports timer-based wakeups. If a thread wishes to sleep for a given time period, or if it wishes to add a timeout to some other blocking operation such as a condition wait, it can set wakeupTime to the desired wakeup time before blocking. A single test in the Arachne dispatcher detects both normal unblocks and timer-based unblocks.
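The dispatcher's runnability test can therefore be a single comparison per context, as in the sketch below; the context layout and constants are illustrative, and the real dispatcher also performs the mustReleaseCore check and statistics gathering mentioned elsewhere in the paper.

#include <cstdint>
#include <x86intrin.h>   // __rdtsc

constexpr uint64_t kBlockedForever = ~0ULL;
constexpr int kContextsPerCore = 56;

struct ThreadContext {
    volatile uint64_t wakeupTime;   // runnable if <= current cycle count
    // ... saved registers, stack pointer, and other metadata
};

// Scan this core's contexts until one becomes runnable, then return it.
ThreadContext* findRunnableThread(ThreadContext contexts[kContextsPerCore]) {
    while (true) {
        uint64_t now = __rdtsc();
        for (int i = 0; i < kContextsPerCore; i++) {
            if (contexts[i].wakeupTime <= now) return &contexts[i];
        }
    }
}

// Before switching to a thread, mark it so that it will not run again
// until it is explicitly woken (or its timeout expires).
void prepareToRun(ThreadContext* t) { t->wakeupTime = kBlockedForever; }

// Waking a thread from any core is a single word write.
void wake(ThreadContext* t) { t->wakeupTime = 0; }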

Arachne exports the wakeupTime mechanism to applications with two methods:
• block(time) will block the current user thread. The time argument is optional; if it is specified, wakeupTime is set to this value (using compare-and-swap to detect concurrent wakeups).
• signal(thread) will set the given user thread's wakeupTime to 0.
These methods make it easy to construct higher-level synchronization and scheduling operators. For example, the yield method, which is used in cooperative multithreading to allow other user threads to run, simply invokes block(0).
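For example, a cooperative yield and a timed sleep could be layered on these two methods roughly as follows; the wrapper signatures and the cycle-counter helpers are assumptions, since the text only specifies the behavior of block and signal.

#include <cstdint>

uint64_t currentCycles();                     // processor cycle counter (assumed helper)
uint64_t cyclesFromNanos(uint64_t ns);        // time conversion (assumed helper)
void block(uint64_t wakeupTime);              // block the current user thread

// Let other runnable threads on this core run first; with wakeupTime = 0
// the calling thread is immediately runnable again.
void yield() { block(0); }

// Sleep for a given duration via a timer-based wakeupTime.
void sleepNanos(uint64_t ns) { block(currentCycles() + cyclesFromNanos(ns)); }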

5.4 Adding and releasing cores
When the core arbiter allocates a new core to an application, it wakes up one of the kernel threads that was blocked in blockUntilCoreAvailable. The kernel thread notifies the core policy of the new core as described in Section 6 below, then it enters the Arachne dispatcher loop.

When the core arbiter decides to reclaim a core from an application, mustReleaseCore will return true in the Arachne dispatcher running on the core. The kernel thread modifies its maskAndCount to prevent any new threads from being placed on it, then it notifies the core policy of the reclamation. If any user threads exist on the core, the Arachne runtime migrates them to other cores. To migrate a thread, Arachne selects a new core (the destination core) and reserves an unoccupied thread context on that core using the same mechanism as for thread creation. Arachne then exchanges the context of the thread being migrated with the unoccupied context, so that the thread's context is rebound to the destination core and the unused context from the destination core is rebound to the source core. Once all threads have been migrated away, the kernel thread on the reclaimed core invokes blockUntilCoreAvailable. This notifies the core arbiter that the core is no longer in use and puts the kernel thread to sleep.

6 Core Policies
One of our goals for Arachne is to give applications precise control over their usage of cores. For example, in RAMCloud the central dispatch thread is usually the performance bottleneck. Thus, it makes sense for the dispatch thread to have exclusive use of a core. Furthermore, the other hyperthread on the same physical core should be idle (if both hyperthreads are used simultaneously, they each run about 30% slower than if only one hyperthread is in use). In other applications it might be desirable to colocate particular threads on hyperthreads of the same core or socket, or to force all low-priority background threads to execute on a single core in order to maximize the resources available for foreground request processing.

The Arachne runtime does not implement the policies for core usage. These are provided in a separate core policy module. Each application selects a particular core policy at startup. Writing high-performance core policies is likely to be challenging, particularly for policies that deal with NUMA issues and hyperthreads. We hope that a small collection of reusable policies can be developed to meet the needs of most applications, so that it will rarely be necessary for an application developer to implement a custom core policy.

In order to manage core usage, the core policy must know which cores have been assigned to the application. The Arachne runtime provides this information by invoking a method in the core policy whenever the application gains or loses cores.

When an application creates a new user thread, it specifies an integer thread class for the thread. Thread classes are used by core policies to manage user threads; each thread class corresponds to a particular level of service, such as "foreground thread" or "background thread." Each core policy defines its own set of valid thread classes. The Arachne runtime stores thread classes with threads, but has no knowledge of how they are used.

The core policy uses thread classes to manage the placement of new threads. When a new thread is created, Arachne invokes a method getCores in the core policy, passing it the thread's class. The getCores method uses the thread class to select one or more cores that are acceptable for the thread. The Arachne runtime places the new thread on one of those cores using the "power of two choices" mechanism described in Section 5. If the core policy wishes to place the new thread on a specific core, getCores can return a list with a single entry. Arachne also invokes getCores to find a new home for a thread when it must be migrated as part of releasing a core.
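A core policy therefore needs only a small interface, sketched below; getCores matches the method named in the text, while the notification methods and types are assumptions about how the runtime might report core changes.

#include <vector>

using ThreadClass = int;

class CorePolicy {
  public:
    virtual ~CorePolicy() = default;

    // Invoked by the runtime when the application gains or loses a core.
    virtual void coreAvailable(int coreId) = 0;
    virtual void coreUnavailable(int coreId) = 0;

    // Return the cores on which a new thread of the given class may be
    // placed; the runtime picks one using "power of two choices".
    virtual std::vector<int> getCores(ThreadClass threadClass) = 0;
};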

One of the unusual features of Arachne is that each application is responsible for determining how many cores it needs; we call this core estimation, and it is handled by the core policy. The Arachne runtime measures two statistics for each core, which it makes available to the core policy for its use in core estimation. The first statistic is utilization, which is the average fraction of time that each Arachne kernel thread spends executing user threads. The second statistic is load factor, which is an estimate of the average number of runnable user threads on that core. Both of these are computed with a few simple operations in the Arachne dispatching loop.

6.1 Default core policy
Arachne currently includes one core policy; we used the default policy for all of the performance measurements in Section 7. The default policy supports two thread classes: exclusive and normal. Each exclusive thread runs on a separate core reserved for that particular thread; when an exclusive thread is blocked, its core is idle. Exclusive threads are useful for long-running dispatch threads that poll. Normal threads share a pool of cores that is disjoint from the cores used for exclusive threads; there can be multiple normal threads assigned to a core at the same time.

6.2 Core estimation
The default core policy requests one core for each exclusive thread, plus additional cores for normal threads. Estimating the cores required for the normal threads requires making a tradeoff between core utilization and fast response time. If we attempt to keep cores busy 100% of the time, fluctuations in load will create a backlog of pending threads, resulting in delays for new threads. On the other hand, we could optimize for fast response time, but this would result in low utilization of cores. The more bursty a workload, the more resources it must waste in order to get fast response.

CloudLab m510 [36]
CPU       Xeon D-1548 (8 x 2.0 GHz cores)
RAM       64 GB DDR4-2133 at 2400 MHz
Disk      Toshiba THNSN5256GPU7 (256 GB)
NIC       Dual-port Mellanox ConnectX-3 10 Gb
Switches  HPE Moonshot-45XGc

Table 1: Hardware configuration used for benchmarking. All nodes ran Linux 4.4.0. C-States were enabled and Meltdown mitigations were disabled. Hyperthreads were enabled (2 hyperthreads per core). Machines were not configured to perform packet steering such as RSS or XPS.

The default policy uses different mechanisms for scaling up and scaling down. The decision to scale up is based on load factor: when the average load factor across all cores running normal threads reaches a threshold value, the core policy increases its requested number of cores by 1. We chose this approach because load factor is a fairly intuitive proxy for response time; this makes it easier for users to specify a non-default value if needed. In addition, performance measurements showed that load factor works better than utilization for scaling up: a single load factor threshold works for a variety of workloads, whereas the best utilization for scaling up depends on the burstiness and overall volume of the workload.

On the other hand, scaling down is based on utilization. Load factor is hard to use for scaling down because the metric of interest is not the current load factor, but rather the load factor that will occur with one fewer core; this is hard to estimate. Instead, the default core policy records the total utilization (sum of the utilizations of all cores running normal threads) each time it increases its requested number of cores. When the utilization returns to a level slightly less than this, the runtime reduces its requested number of cores by 1 (the "slightly less" factor provides hysteresis to prevent oscillations). A separate scale-down utilization is recorded for each distinct number of requested cores.

Overall, three parameters control the core estimation mechanism: the load factor for scaling up, the interval over which statistics are averaged for core estimation, and the hysteresis factor for scaling down. The default core policy currently uses a load factor threshold of 1.5, an averaging interval of 50 ms, and a hysteresis factor of 9% utilization.
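The following sketch shows one way the scale-up/scale-down rules above could be expressed; the constants match the quoted defaults, but the structure, the interpretation of the 9% hysteresis as an absolute utilization offset, and the lower bound on cores are assumptions.

#include <vector>

struct CoreStats {
    double utilization;   // fraction of time spent running user threads
    double loadFactor;    // average number of runnable user threads
};

constexpr double kLoadFactorThreshold = 1.5;   // scale up above this
constexpr double kHysteresis = 0.09;           // scale-down hysteresis

class CoreEstimator {
  public:
    // Called once per averaging interval (50 ms by default) with statistics
    // for the cores currently running "normal" threads; returns the new
    // number of cores to request for them.
    int estimate(const std::vector<CoreStats>& cores) {
        int n = static_cast<int>(cores.size());
        double totalUtil = 0, totalLoad = 0;
        for (const CoreStats& c : cores) {
            totalUtil += c.utilization;
            totalLoad += c.loadFactor;
        }
        double avgLoad = (n > 0) ? totalLoad / n : 0;

        if (avgLoad >= kLoadFactorThreshold) {
            // Record the utilization at which we scaled up to n+1 cores so
            // that we can scale back down once utilization drops below it.
            if ((int)utilAtScaleUp_.size() <= n + 1) utilAtScaleUp_.resize(n + 2, 0);
            utilAtScaleUp_[n + 1] = totalUtil;
            return n + 1;
        }
        if (n > 1 && n < (int)utilAtScaleUp_.size() &&
            totalUtil < utilAtScaleUp_[n] - kHysteresis) {
            return n - 1;
        }
        return n;
    }

  private:
    std::vector<double> utilAtScaleUp_;   // indexed by requested core count
};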

7 Evaluation
We implemented Arachne in C++ on Linux; source code is available on GitHub [33]. The core arbiter contains 4500 lines of code, the runtime contains 3400 lines, and the default core policy contains 270 lines.

Our evaluation of Arachne addresses the following questions:
• How efficient are the Arachne threading primitives, and how does Arachne compare to other threading systems?
• Does Arachne's core-aware approach to threading produce significant benefits for low-latency applications?
• How do Arachne's internal mechanisms, such as its queueless approach to thread scheduling and its mechanisms for core estimation and core allocation, contribute to performance?
Table 1 describes the configuration of the machines used for benchmarking.

7.1 Threading Primitives
Table 2 compares the cost of basic thread operations in Arachne with C++ std::thread, Go, and uThreads [4]. std::thread is based on kernel threads; Go implements threads at user level in the language runtime; and uThreads uses kernel threads to multiplex user threads, like Arachne. uThreads is a highly rated C++ user threading library on GitHub and claims high performance. The measurements use microbenchmarks, so they represent best-case performance.

Benchmark               Arachne          Arachne RQ       std::thread  Go       uThreads
                        No HT    HT      No HT    HT
Thread Creation         275 ns   320 ns  524 ns   520 ns   13329 ns    444 ns   6132 ns
One-Way Yield           83 ns    149 ns  137 ns   199 ns   N/A         N/A      79 ns
Null Yield              14 ns    23 ns   13 ns    24 ns    N/A         N/A      6 ns
Condition Notify        251 ns   272 ns  459 ns   471 ns   4962 ns     483 ns   4976 ns
Signal                  233 ns   254 ns  N/A      N/A      N/A         N/A      N/A
Thread Exit Turnaround  328 ns   449 ns  408 ns   484 ns   N/A         N/A      N/A

Table 2: Median cost of scheduling primitives. Creation, notification, and signaling are measured end-to-end, from initiation in one thread until the target thread wakes up and begins execution on a different core. Arachne creates all threads on a different core from the parent. Go always creates Goroutines on the parent's core. uThreads uses a round-robin approach to assign threads to cores; when it chooses the parent's core, the median cost drops to 250 ns. In "One-Way Yield", control passes from the yielding thread to another runnable thread on the same core. In "Null Yield", there are no other runnable threads, so control returns immediately to the yielding thread. "Thread Exit Turnaround" measures the time from the last instruction of one thread to the first instruction of the next thread to run on a core. N/A indicates that the threading system does not expose the measured primitive. "Arachne RQ" means that Arachne was modified to use a ready queue instead of the queueless dispatch mechanism described in Section 5.3. "No HT" means that each thread ran on a separate core using one hyperthread; the other hyperthread of each core was inactive. "HT" means the other hyperthread of each core was active, running the Arachne dispatcher.

Arachne's thread operations are considerably faster than any of the other systems, except that yields are faster in uThreads. Arachne's cache-optimized design performs thread creation twice as fast as Go, even though Arachne places new threads on a different core from the parent while Go creates new threads on the parent's core.

To evaluate Arachne's queueless approach, we modified Arachne to use a wait-free multiple-producer-single-consumer queue [7] on each core to identify runnable threads instead of scanning over all contexts. We selected this implementation for its speed and simplicity from several candidates on GitHub. Table 2 shows that the queueless approach is 28–40% faster than one using a ready queue (we counted three additional cache misses for thread creation in the ready queue variant of Arachne).

[Figure 2: Thread creation throughput as a function of available cores. In (a) a single thread creates new threads as quickly as possible; each child consumes 1 µs of execution time and then exits. In (b) 3 initial threads are created for each core; each thread creates one child and then exits. Curves are shown for Arachne, Arachne-RQ, Go, uThreads, and std::thread.]

We designed Arachne's thread creation mechanism not just to minimize latency, but also to provide high throughput. We ran two experiments to measure thread creation throughput.

In the first experiment (Figure 2(a)), a single "dispatch" thread creates new threads as quickly as possible (this situation might occur, for example, if a single thread is polling a network interface for incoming requests). A single Arachne thread can spawn more than 5 million new threads per second, which is 2.5x the rate of Go and at least 10x the rate of std::thread or uThreads. This experiment demonstrates the benefits of performing load balancing at thread creation time. Go's work stealing approach creates significant additional overhead, especially when threads are short-lived, and the parent's work queue can become a bottleneck. At low core counts, Go exhibits higher throughput than Arachne because it runs threads on the creator's core in addition to other available cores, while Arachne only uses the creator's core for dispatching.


Experiment   Program        Keys  Values  Items  PUTs  Clients  Threads  Conns    Pipeline  IR Dist
Realistic    Mutilate [27]  ETC   ETC     1M     .03   20+1     16+8     1280+8   1+1       GPareto
Colocation   Memtier [24]   30B   200B    8M     0     1+1      16+8     320+8    10+1      Poisson
Skew         Memtier        30B   200B    8M     0     1        16       512      100       Poisson

Table 3: Configurations of memcached experiments. Program is the benchmark program used to generate the workload (our version of Memtier is modified from the original). Keys and Values give sizes of keys and values in the dataset (ETC recreates the Facebook ETC workload [3], which models actual usage of memcached). Items is the total number of objects in the dataset. PUTs is the fraction of all requests that were PUTs (the others were GETs). Clients is the total number of clients (20+1 means 20 clients generated an intensive workload, and 1 additional client measured latency using a lighter workload). Threads is the number of threads per client. Conns is the total number of connections per client. Pipeline is the maximum number of outstanding requests allowed per connection before shedding workload. IR Dist is the inter-request time distribution. Unless otherwise indicated, memcached was configured with 16 worker threads and memcached-A scaled automatically between 2 and 15 cores.

[Figure 3: Median and 99th-percentile request latency as a function of achieved throughput for both memcached and memcached-A, under the Realistic benchmark. Panel (a): memcached with 16 worker threads on 16 cores. Panel (b): memcached with 16 workers and both systems limited to 8 cores. Curves show 50th- and 99th-percentile latency for each system. Each measurement ran for 30 seconds after a 5-second warmup. Y-axes use a log scale.]

The second experiment measures thread creation throughput using a distributed approach, where each of many existing threads creates one child thread and then exits (Figure 2(b)). In this experiment both Arachne and Go scaled in their throughput as the number of available cores increased. Neither uThreads nor std::thread had scalable throughput; uThreads had 10x less throughput than Arachne or Go and std::thread had 100x less throughput. Go's approach to thread creation worked well in this experiment; each core created and executed threads locally and there was no need for work stealing since the load naturally balanced itself. As a result, Go's throughput was 1.5–2.5x that of Arachne. Arachne's performance reflects the costs of thread creation and exit turnaround from Table 2, as well as occasional conflicts between concurrent thread creations.

Figure 2 also includes measurements of the ready queue variant of Arachne. Arachne's queueless approach provided higher throughput than the ready queue variant for both experiments.

7.2 Arachne's Benefits for memcached

We modified memcached [23] version 1.5.6 to use Arachne; the source is available on GitHub [19]. In the modified version ("memcached-A"), the pool of worker threads is replaced by a single dispatch thread, which waits for incoming requests on all connections. When a request arrives, the dispatch thread creates a new Arachne thread, which lives only long enough to handle all available requests on the connection. Memcached-A uses the default core policy; the dispatch thread is "exclusive" and workers are "normal" threads.
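The structure of this request path can be illustrated with a rough sketch (not memcached-A's actual code): a single dispatch thread waits for activity on all connections and spawns one short-lived worker per ready connection. The epoll-based wait and the handleConnection helper are assumptions made for the example, and std::thread stands in for Arachne's much cheaper user-level thread creation.

    #include <sys/epoll.h>
    #include <thread>
    #include <vector>

    // Assumed helper: read, parse, and reply to every request currently
    // buffered on this connection, then return (so the worker thread exits).
    void handleConnection(int fd) { (void)fd; }

    void dispatchLoop(int epollFd) {
        std::vector<epoll_event> events(128);
        for (;;) {
            int n = epoll_wait(epollFd, events.data(),
                               static_cast<int>(events.size()), -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                // One short-lived worker per ready connection; std::thread is
                // only a stand-in for the user-level thread creation call.
                std::thread(handleConnection, fd).detach();
            }
        }
    }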

Memcached-A provides two benefits over the original memcached. First, it reduces performance interference, both between kernel threads (there is no multiplexing) and between applications (cores are dedicated to applications). Second, memcached-A provides finer-grain load balancing (at the level of individual requests rather than connections).

We performed three experiments with memcached; their configurations are summarized in Table 3. The first experiment, Realistic, measures latency as a function of load under realistic conditions; it uses the Mutilate benchmark [17, 27] to recreate the Facebook ETC workload [3]. Figure 3(a) shows the results. The maximum throughput of memcached is 20% higher than that of memcached-A. This is because memcached-A uses two fewer cores (one core is reserved for unmanaged threads and one for the dispatcher); in addition, memcached-A incurs overhead to spawn a thread for each request. However, memcached-A has significantly lower latency than memcached. Thus, if an application has service-level requirements, memcached-A provides higher usable throughput. For example, if an application requires a median latency less than 100 µs, memcached-A can support 37.5% higher throughput than memcached (1.1 Mops/sec vs. 800 Kops/sec). At the 99th percentile, memcached-A's latency ranges from 3–40x lower than memcached's. We found that Linux migrates memcached threads between cores frequently: at high load, each thread migrates about 10 times per second; at low load, threads migrate about every third request. Migration adds overhead and increases the likelihood of multiplexing.
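For reference, the 37.5% figure follows directly from the two quoted throughputs:

\[ \frac{1.1\,\text{Mops/sec} - 0.8\,\text{Mops/sec}}{0.8\,\text{Mops/sec}} = 0.375 = 37.5\%. \]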

One of our goals for Arachne is to adapt automatically to application load and the number of available cores, so administrators do not need to specify configuration options or reserve cores. Figure 3(b) shows memcached's behavior when it is given fewer cores than it would like. For memcached, the 16 worker threads were multiplexed on only 8 cores; memcached-A was limited to at most 8 cores. Maximum throughput dropped for both systems, as expected. Arachne continued to operate efficiently: latency was about the same as in Figure 3(a). In contrast, memcached experienced significant increases in both median and tail latency, presumably due to additional multiplexing; with a median latency limit of 100 µs, memcached could only handle 300 Kops/sec, whereas memcached-A handled 780 Kops/sec.

The second experiment, Colocation, varied the load dynamically to evaluate Arachne's core estimator. It also measured memcached and memcached-A performance when colocated with a compute-intensive application (the x264 video encoder [25]). The results are in Figure 4. Figure 4(a) shows that memcached-A used only 2 cores at low load (dispatch and one worker) and ramped up to use all available cores as the load increased. Memcached-A maintained near-constant median and tail latency as the load increased, which indicates that the core estimator chose good points at which to change its core requests. Memcached's latency was higher than memcached-A's and varied more with load; even when running without the background application, 99th-percentile latency was 10x higher for memcached than for memcached-A. Tail latency for memcached-A was actually better at high load than at low load, since there were more cores available to absorb bursts of requests.

When a background video application was colocated with memcached, memcached's latency doubled, both at the median and at the 99th percentile, even though the background application ran at lower priority. In contrast, memcached-A was almost completely unaffected by the video application. This indicates that Arachne's core-aware approach improves performance isolation between applications. Figure 4(a) shows that memcached-A ramped up its core usage more quickly when colocated with the video application. This suggests that there was some performance interference from the video application, but that the core estimator detected this and allocated cores more aggressively to compensate.

Figure 4: Memcached performance in the Colocation experiment. The request rate increased gradually from 10 Kops/sec to 1 Mops/sec and then decreased back to 10 Kops/sec. In some experiments the x264 video encoder [25] ran concurrently, using a raw video file (sintel-1280.y4m) from Xiph.org [21]. When memcached-A ran with x264, the core arbiter gave memcached-A as many cores as it requested; x264 was not managed by Arachne, so Linux scheduled it on the cores not used by memcached-A. When memcached ran with x264, the Linux scheduler determined how many cores each application received. x264 sets a "nice" value of 10 for itself by default; we did not change this behavior in these experiments. (a) shows the number of cores allocated to memcached-A; (b) shows 99th-percentile tail latency for memcached and memcached-A; (c) shows median latency, plus the rate of requests; (d) shows the throughput of the video application (averaged over the trailing 4 seconds) when running by itself or with memcached or memcached-A.

Figure 4(d) shows the throughput of the video application. At high load, its throughput when colocated with memcached-A was less than half its throughput when colocated with memcached. This is because memcached-A confined the video application to a single unmanaged core. With memcached, Linux allowed the video application to consume more resources, which degraded the performance of memcached.


Figure 5: Median and tail latency in the Colocation experiment with the core arbiter ("Memcached-A", same as Figure 4) and without the core arbiter ("NoArbiter").

Figure 5 shows that dedicated cores are fundamental to Arachne's performance. For this figure, we ran the Colocation experiment using a variant of memcached-A that did not have dedicated cores: instead of using the core arbiter, Arachne scaled by blocking and unblocking kernel threads on semaphores, and the Linux kernel scheduled the unblocked threads. As shown in Figure 5, this resulted in significantly higher latency both with and without the background application. Additional measurements showed that latency spikes occurred when Linux descheduled a kernel thread but Arachne continued to assign new user threads to that kernel thread; during bursts of high load, numerous user threads could be stranded on a descheduled kernel thread for many milliseconds. Without the dedicated cores provided by the core arbiter, memcached-A performed significantly worse than unmodified memcached.
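The NoArbiter variant's scaling mechanism amounts to parking and unparking kernel threads, as in the sketch below (an illustration under assumed names, not the actual code). Because no core is dedicated, Linux remains free to deschedule even the unparked kernel threads, which is what strands user threads during load bursts.

    #include <semaphore.h>

    // The semaphore starts at 0; each post wakes one parked kernel thread.
    static sem_t wakeups;

    void initScaling() {
        sem_init(&wakeups, 0, 0);
    }

    // Scale down: this kernel thread parks itself instead of returning its
    // core to an arbiter.
    void parkKernelThread() {
        sem_wait(&wakeups);
    }

    // Scale up: wake one parked kernel thread; where and when it runs is
    // entirely up to the Linux scheduler, since no core is dedicated to it.
    void unparkKernelThread() {
        sem_post(&wakeups);
    }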

Figures 4 and 5 used the default configuration of the background video application, in which it lowered its execution priority to "nice" level 10. We also ran the experiments with the video application running at normal priority; median and tail latencies for memcached increased by about 2x, while those for memcached-A were almost completely unaffected. We omit the details due to space limitations.

The final experiment for memcached is Skew, shown in Figure 6. This experiment evaluates memcached performance when the load is not balanced uniformly across client connections. Since memcached statically partitions client connections among worker threads, hotspots can develop, where some workers are overloaded while others are idle; this can result in poor overall throughput. In contrast, memcached-A performs load balancing on each request, so performance is not impacted by the distribution of load across client connections.

Figure 6: The impact of workload skew on memcached performance with a target load of 1.5 Mops/sec. Initially, the load was evenly distributed over 512 connections (each memcached worker handled 512/16 = 32 connections); over time, an increasing fraction of total load was directed to one specific "hot" worker thread by increasing the request rate on the hot worker's connections and decreasing the request rate on all other connections. The top graphs show the fraction of total load directed to the hot worker and to an average worker; the bottom graph shows the overall throughput, as well as the throughput of the overloaded worker thread in memcached.

Figure 7: Throughput of a single RAMCloud server when many clients perform continuous back-to-back write RPCs of 100-byte objects. Throughput is measured as the number of completed writes per second.

7.3 Arachne's Benefits for RAMCloud

We also modified RAMCloud [30] to use Arachne. In the modified version ("RAMCloud-A"), the long-running pool of worker threads is eliminated, and the dispatch thread creates a new worker thread for each request. Threads that are busy-waiting on nested RPCs yield after each iteration of their polling loop. This allows other requests to be processed during the waiting time, so that the core isn't wasted. Figure 7 shows that RAMCloud-A has 2.5x higher write throughput than RAMCloud. On the YCSB benchmark [9] (Figure 8), RAMCloud-A provided 54% higher throughput than RAMCloud for the write-heavy YCSB-A workload. On the read-only YCSB-C workload, RAMCloud-A's throughput was 15% less than RAMCloud's, due to the overhead of Arachne's thread invocation and thread exit. These experiments demonstrate that Arachne makes it practical to schedule other work during blockages as short as a few microseconds.
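The yielding behavior can be sketched as follows (an illustration, not RAMCloud-A's actual code): a worker that has issued a nested RPC polls for the reply, but yields to the user-level scheduler after every unsuccessful poll so that other request threads on the same core can run in the meantime. The Rpc type and the threadYield stand-in are assumptions made for the example; Arachne's real yield primitive is not shown here.

    #include <atomic>

    // Assumed stand-ins (not real RAMCloud/Arachne APIs), defined trivially so
    // the sketch is self-contained.
    struct Rpc {
        std::atomic<bool> replyArrived{false};
        bool isReady() const { return replyArrived.load(std::memory_order_acquire); }
    };

    void threadYield() { /* stand-in for the user-level scheduler's yield */ }

    // Busy-wait on a nested RPC, but yield between polls so other short-lived
    // request threads on this core can run while the reply is outstanding.
    void waitForNestedRpc(Rpc* rpc) {
        while (!rpc->isReady()) {
            threadYield();
        }
    }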


Figure 8: Comparison between RAMCloud and RAMCloud-A on a modified YCSB benchmark [9] using 100-byte objects (workloads A, B, C, D, and F). Both were run with 12 storage servers. Y-values represent aggregate throughput (Mops/sec) across 30 independent client machines, each running with 8 threads.

Figure 9: Cost of signaling a blocked thread as the number of occupied thread contexts on the target core increases. Latency is measured from immediately before signaling on one core until the target thread resumes execution on a different core; median and 99th-percentile latencies are shown.

7.4 Arachne Internal Mechanisms

This section evaluates several of the internal mechanisms that are key to Arachne's performance. As mentioned in Section 5.3, Arachne forgoes the use of ready queues as part of its cache-optimized design; instead, the dispatcher scans the wakeupTime variables for occupied thread contexts until it finds a runnable thread. Consequently, as a core fills with threads, its dispatcher must iterate over more and more contexts. To evaluate the cost of scanning these flags, we measured the cost of signaling a particular blocked thread while varying the number of additional blocked threads on the target core; Figure 9 shows the results. Even in the worst case where all 56 thread contexts are occupied, the average cost of waking up a thread increased by less than 100 ns, which is equivalent to about one cache coherency miss. This means that an alternative implementation that avoids scanning all the active contexts must do so without introducing any new cache misses; otherwise its performance will be worse than Arachne's. Arachne's worst-case performance in Figure 9 is still better than the ready queue variant of Arachne in Table 2.
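A minimal sketch of this queueless dispatch, assuming a per-core array of contexts that each hold a wakeupTime word (the names and layout are assumptions, not Arachne's exact data structures), looks like this:

    #include <cstdint>

    constexpr int kMaxContextsPerCore = 56;   // matches the worst case measured above

    struct ThreadContext {
        volatile uint64_t wakeupTime;         // thread is runnable once now >= wakeupTime
        bool occupied;                        // context currently holds a live thread
    };

    // Scan occupied contexts until a runnable one is found. Reading one word
    // per context keeps the scan cheap: in the common case the words stay in
    // this core's cache, and waking a thread costs roughly one coherency miss
    // on the word that changed.
    ThreadContext* findRunnable(ThreadContext contexts[kMaxContextsPerCore],
                                uint64_t now) {
        for (int i = 0; i < kMaxContextsPerCore; i++) {
            if (contexts[i].occupied && contexts[i].wakeupTime <= now) {
                return &contexts[i];
            }
        }
        return nullptr;                       // nothing runnable; caller keeps polling
    }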

Figure 10: Cumulative distribution of latency from core request to core acquisition (a) when the core arbiter has a free core available ("Idle") and (b) when it must reclaim a core from a competing application ("Reclaimed").

Figures 10 and 11 show the performance of Arachne's core allocation mechanism. Figure 10 shows the distribution of allocation times, measured from when a thread calls setRequestedCores until a kernel thread wakes up on the newly allocated core. In the first scenario, there is an idle core available to the core arbiter, and the cost is merely that of moving a kernel thread to the core and unblocking it. In the second scenario, a core must be reclaimed from a lower-priority application, so the cost includes signaling another process and waiting for it to release a core. Figure 10 shows that Arachne can reallocate cores in about 30 µs, even if the core must be reclaimed from another application. This makes it practical for Arachne to adapt to changes in load at the granularity of milliseconds.
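From the application's perspective the flow is simple: something periodically recomputes how many cores the application needs and publishes that number, while kernel threads block until the arbiter grants them a core. The sketch below captures this flow with stub stand-ins; setRequestedCores and blockUntilCoreAvailable are named after the calls mentioned in this section and in Figure 11, but their signatures, bodies, and the estimator thresholds are assumptions for illustration only.

    #include <cstdint>
    #include <cstdio>

    // Stand-in stubs, not Arachne's actual interface.
    static void setRequestedCores(uint32_t numCores) {
        std::printf("requesting %u cores\n", numCores);   // real code notifies the arbiter over a socket
    }
    static int blockUntilCoreAvailable() {
        return 0;   // real code sleeps until the arbiter grants a core, then returns its id
    }

    // Estimator-side sketch: ask for one more core under high load, one fewer
    // when cores sit mostly idle. Thresholds are made up for this example.
    void adjustCores(double loadFactor, uint32_t current) {
        if (loadFactor > 0.9) {
            setRequestedCores(current + 1);
        } else if (loadFactor < 0.5 && current > 1) {
            setRequestedCores(current - 1);
        }
    }

    // Kernel-thread side: park until the arbiter grants a core (possibly after
    // reclaiming it from another application), then run the user-level
    // dispatcher on that core (omitted here).
    void kernelThreadMain() {
        int coreId = blockUntilCoreAvailable();
        (void)coreId;
    }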

Figure 11 shows the timing of each step of a core request that requires the preemption of another process's core. About 80% of the time is spent in socket communication.

8 Related Work

Numerous user-level threading packages have been developed over the last several decades. We have already compared Arachne with Go [14] and uThreads [4]. Boost fibers [1], Folly [13], and Seastar [37] implement user-level threads but do not multiplex user threads across multiple cores. Capriccio [39] solved the problem of blocking system calls by replacing them with asynchronous system calls, but it does not scale to multiple cores. Cilk [8] is a compiler and runtime for scheduling tasks over kernel threads, but does not handle blocking and is not core-aware. Carbon [16] proposes the use of hardware queues to dispatch hundred-instruction-granularity tasks, but it requires changes to hardware and is limited to a fork-join model of parallelism. Wikipedia [40] lists 21 C++ threading libraries as of this writing. Of these, 10 offer only kernel threads, 3 offer compiler-based automatic parallelization, 3 are commercial packages without any published performance numbers, and 5 appear to be defunct. None of the systems listed above supports load balancing at thread creation time, the ability to compute core requirements and conform to core allocations, or a mechanism for implementing application-specific core policies.


Figure 11: Timeline of a core request to the core arbiter. There are two applications. Both applications begin with a single dedicated core, and the top application also begins with a thread waiting to be placed on a core. The top application has higher priority than the bottom application, so when the top application requests an additional core, the bottom application is asked to release its core. The labeled steps and their measured times are: notify arbiter of new core requirements (socket), 6.6 µs; recompute core allocations, 2.7 µs; request release (shared memory), 0.5 µs; prepare to release core, 0.4 µs; tell arbiter this thread is blocking (socket), 6.5 µs; recompute core allocations, 1.9 µs; wake up blocked thread (socket), 7.9 µs; change blocked thread's cpuset, 2.6 µs; time to wake up on core, 29 µs.


Scheduler activations [2] are similar to Arachne in that they allocate processors to applications to implement user-level threads efficiently. A major focus of the scheduler activations work was allowing processor preemption during blocking kernel calls; this resulted in significant kernel modifications. Arachne focuses on other issues, such as minimizing cache misses, estimating core requirements, and enabling application-specific core policies.

Akaros [35] and Parlib [15] follow in the tradition of scheduler activations. Akaros is an operating system that allocates dedicated cores to applications and makes all blocking system calls asynchronous; Parlib is a framework for building user schedulers on dedicated cores. Akaros offers functionality analogous to the Arachne core arbiter, but it does not appear to have reached a level of maturity that can support meaningful performance measurements.

The core arbiter's approach of controlling process scheduling policy in user space while leaving mechanism to the kernel resembles policy modules in Hydra [18].

The traditional approach for managing multi-threaded applications on multi-core machines has been gang scheduling [12, 29]. In gang scheduling, each application unilaterally determines its threading requirements; the operating system then attempts to schedule all of an application's threads simultaneously on different cores. Tucker and Gupta pointed out that gang scheduling results in inefficient multiplexing when the system is overloaded [38]. They argued that it is more efficient to divide the cores so that each application has exclusive use of a few cores; an application can then adjust its degree of parallelism to match the available cores. Arachne implements this approach.

Event-based applications such as Redis [34] and nginx [28] represent an alternative to user threads for achieving high throughput and low latency. Von Behren et al. [39] argued that event-based approaches are a form of application-specific optimization, one that arises from the lack of efficient thread runtimes; Arachne offers efficient threading as a more convenient alternative to events.

Several recent systems, such as IX [6] and ZygOS [32], have combined thread schedulers with high-performance network stacks. These systems share Arachne's goal of combining low latency with efficient resource usage, but they take a more special-purpose approach than Arachne by coupling the threading mechanism to the network stack. Arachne is a general-purpose mechanism; it can be used with high-performance network stacks, such as in RAMCloud, but also in other situations.

9 Future Work

We believe that Arachne's core-aware approach to scheduling would be beneficial in other domains. For example, virtual machines could use a multi-level core-aware approach, where applications use Arachne to negotiate with their guest OS over cores, and the guest OSes use a similar approach to negotiate with the hypervisor. This would provide a more flexible and efficient way of managing cores than today's approaches, since the hypervisor would know how many cores each virtual machine needs.

Core-aware scheduling would also be beneficial in cluster schedulers for datacenter-scale applications. The cluster scheduler could collect information about core requirements from the core arbiters on each of the cluster machines and use this information to place applications and move services among machines. This would allow decisions to be made based on actual core needs rather than statically declared maximum requirements. Arachne's performance isolation would allow cluster schedulers to run background applications more aggressively without fear of impacting the response time of foreground applications.

There are a few aspects of Arachne that we have not fully explored. We have only preliminary experience implementing core policies, and our current core policies do not address issues related to NUMA machines, such as how to allocate cores in an application that spans multiple sockets. We hope that a variety of reusable core policies will be created, so that application developers can achieve high threading performance without having to write a custom policy for each application.


In addition, our experience with the parameters for core estimation is limited. We chose the current values based on a few experiments with our benchmark applications. The current parameters provide a good trade-off between latency and utilization for our benchmarks, but we don't know whether these will be the best values for all applications.

10 Conclusion

One of the most fundamental principles in operating systems is virtualization, in which the system uses a set of physical resources to implement a larger and more diverse set of virtual entities. However, virtualization only works if there is a balance between the use of virtual objects and the available physical resources. For example, if the usage of virtual memory exceeds available physical memory, the system will collapse under page thrashing.

Arachne provides a mechanism to balance the usage of virtual threads against the availability of physical cores. Each application computes its core requirements dynamically and conveys them to a central core arbiter, which then allocates cores among competing applications. The core arbiter dedicates cores to applications and tells each application which cores it has received. The application can then use that information to manage its threads. Arachne also provides an exceptionally fast implementation of threads at user level, which makes it practical to use threads even for very short-lived tasks. Overall, Arachne's core-aware approach to thread management enables granular applications that combine both low latency and high throughput.

11 Acknowledgements

We would like to thank our shepherd, Michael Swift, and the anonymous reviewers for helping us improve this paper. Thanks to Collin Lee for giving feedback on design, and to Yilong Li for help with debugging the RAMCloud networking stack. This work was supported by C-FAR (one of six centers of STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA) and by the industrial affiliates of the Stanford Platform Lab.

References

[1] Boost fibers. http://www.boost.org/doc/libs/1_64_0/libs/fiber/doc/html/fiber/overview.html.

[2] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems (TOCS), 10(1):53–79, 1992.

[3] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 53–64, 2012.

[4] S. Barghi. uThreads: Concurrent user threads in C++. https://github.com/samanbarghi/uThreads.

[5] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan. Attack of the Killer Microseconds. Communications of the ACM, 60(4):48–54, Mar. 2017.

[6] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, Oct. 2014.

[7] D. Bittman. Mpscq - multiple producer, single consumer wait-free queue. https://github.com/dbittman/waitfree-mpsc-queue.

[8] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '95, pages 207–216, 1995.

[9] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154, 2010.

[10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51:107–113, January 2008.

[11] A. Dragojevic, D. Narayanan, M. Castro, and O. Hodson. FaRM: Fast Remote Memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 401–414, 2014.

[12] D. G. Feitelson and L. Rudolph. Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing, 16(4):306–318, 1992.

[13] Folly: Facebook open-source library. https://github.com/facebook/folly.

[14] The Go Programming Language. https://golang.org/.

[15] K. A. Klues. OS and Runtime Support for Efficiently Managing Cores in Parallel Applications. PhD thesis, University of California, Berkeley, 2015.

[16] S. Kumar, C. J. Hughes, and A. Nguyen. Carbon: Architectural support for fine-grained parallelism on chip multiprocessors. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 162–173, 2007.

[17] J. Leverich and C. Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. Ninth European Conference on Computer Systems, EuroSys '14, pages 4:1–4:14, 2014.

[18] R. Levin, E. Cohen, W. Corwin, F. Pollack, and W. Wulf. Policy/mechanism separation in Hydra. In ACM SIGOPS Operating Systems Review, volume 9, pages 132–140. ACM, 1975.

[19] Q. Li. memcached-A. https://github.com/PlatformLab/memcached-A.

[20] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In Proc. 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 429–444, Apr. 2014.

[21] M. Lora. Xiph.org::test media. https://media.xiph.org/.

[22] A. Lottarini, A. Ramirez, J. Coburn, M. A. Kim, P. Ranganathan, D. Stodolsky, and M. Wachsler. vbench: Benchmarking video transcoding in the cloud. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 797–809, 2018.

[23] memcached: a Distributed Memory Object Caching System. http://www.memcached.org/.

[24] Memtier benchmark. https://github.com/RedisLabs/memtier_benchmark.

[25] L. Merritt and R. Vanam. x264: A high performance H.264/AVC encoder. http://neuron2.net/library/avc/overview_x264_v8_5.pdf.

[26] M. Mitzenmacher. The Power of Two Choices in Randomized Load Balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.

[27] Mutilate: high-performance memcached load generator. https://github.com/leverich/mutilate.

[28] Nginx. https://nginx.org/en/.

[29] J. Ousterhout. Scheduling Techniques for Concurrent Systems. In Proc. 3rd International Conference on Distributed Computing Systems, pages 22–30, 1982.

[30] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro, S. J. Park, H. Qin, M. Rosenblum, et al. The RAMCloud Storage System. ACM Transactions on Computer Systems (TOCS), 33(3):7, 2015.

[31] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe. Arrakis: The Operating System is the Control Plane. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 1–16, 2014.

[32] G. Prekas, M. Kogias, and E. Bugnion. ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks. In Proc. of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 325–341, 2017.

[33] H. Qin. Arachne. https://github.com/PlatformLab/Arachne.

[34] Redis. http://redis.io.

[35] B. Rhoden, K. Klues, D. Zhu, and E. Brewer. Improving per-node efficiency in the datacenter with new OS abstractions. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC '11, pages 25:1–25:8, 2011.

[36] R. Ricci, E. Eide, and The CloudLab Team. Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications. USENIX ;login:, 39(6), December 2014.

[37] Seastar. http://www.seastar-project.org/.

[38] A. Tucker and A. Gupta. Process Control and Scheduling Issues for Multiprogrammed Shared-memory Multiprocessors. In Proc. of the Twelfth ACM Symposium on Operating Systems Principles, SOSP '89, pages 159–166, 1989.

[39] R. Von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer. Capriccio: Scalable threads for internet services. In ACM SIGOPS Operating Systems Review, volume 37, pages 268–281, 2003.

[40] Wikipedia. List of C++ multi-threading libraries — Wikipedia, the free encyclopedia, 2017.
