
This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC ’17).

July 12–14, 2017 • Santa Clara, CA, USA

ISBN 978-1-931971-38-6

Open access to the Proceedings of the 2017 USENIX Annual Technical Conference

is sponsored by USENIX.

Optimizing the TLB Shootdown Algorithm with Page Access Tracking

Nadav Amit, VMware Research

https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit


Optimizing the TLB Shootdown Algorithm with Page Access Tracking

Nadav Amit, VMware Research

Abstract

The operating system is tasked with maintaining the coherency of per-core TLBs, necessitating costly synchronization operations, notably to invalidate stale mappings. As core counts increase, the overhead of TLB synchronization likewise increases and hinders scalability, whereas existing software optimizations that attempt to alleviate the problem (like batching) are lacking.

We address this problem by revising the TLB synchronization subsystem. We introduce several techniques that detect cases whereby soon-to-be invalidated mappings are cached by only one TLB or not cached at all, allowing us to entirely avoid the cost of synchronization. In contrast to existing optimizations, our approach leverages hardware page access tracking. We implement our techniques in Linux and find that they reduce the number of TLB invalidations by up to 98% on average and thus improve performance by up to 78%. Evaluations show that while our techniques may introduce overheads of up to 9% when memory mappings are never removed, these overheads can be avoided by simple hardware enhancements.

1. Introduction

Translation lookaside buffers (TLBs) are perhaps the most frequently accessed caches whose coherency is not maintained by modern CPUs. The TLB is tasked with caching virtual-to-physical translations ("mappings") of memory addresses, and so it is accessed upon every memory read or write operation. Maintaining TLB coherency in hardware hampers performance [33], so CPU vendors require OSes to maintain coherency in software. But it is difficult for OSes to efficiently achieve this goal [27, 38, 39, 41, 48].

To maintain TLB coherency, OSes employ the TLB shootdown protocol [8]. If a mapping m that possibly resides in the TLB becomes stale (due to memory mapping changes), the OS flushes m from the local TLB to restore coherency. Concurrently, the OS directs remote cores that might house m in their TLB to do the same, by sending them an inter-processor interrupt (IPI). The remote cores flush their TLBs according to the information supplied by the initiator core, and they report back when they are done. TLB shootdown can take microseconds, causing a notable slowdown [48]. Performing TLB shootdown in hardware, as certain CPUs do, is faster but still incurs considerable overheads [22].

In addition to reducing performance, shootdown overheads can negatively affect the way applications are constructed. Notably, to avoid shootdown latency, programmers are advised against using memory mappings, against unmapping them, and even against building multithreaded applications [28, 42]. But memory mappings are the efficient way to use persistent memory [18, 47], and avoiding unmappings might cause corruption of persistent data [12].

OSes try to cope with shootdown overheads by batching them [21, 43], avoiding them on idle cores, or, when possible, performing them faster [5]. But the potential of these existing solutions is inherently limited to certain specific scenarios. To have a generally applicable, efficient solution, OSes need to know which mappings are cached by which cores. Such information can in principle be obtained by replicating the translation data structures for each core [11], but this approach might result in significantly degraded performance and wasted memory.

We propose to avoid unwarranted TLB shootdowns in a different manner: by monitoring access bits. While TLB coherency is not maintained by the CPU, CPU architectures can maintain the consistency of access bits, which are set when a mapping is cached. We contend that these bits can therefore be used to reveal which mappings are cached by which cores. To our knowledge, we are the first to use access bits in this way.

In the x86 architecture, which we study in this paper, access bit consistency is maintained by the memory subsystem. Exploiting it, we propose techniques to identify two types of common mappings whose shootdown can be avoided: (1) short-lived private mappings, which are only cached by a single core; and (2) long-lived idle mappings, which are reclaimed after the corresponding pages have not been used for a while and are not cached at all.



Using these techniques, we implement a fully functional prototype in Linux 4.5. Our evaluation shows that our proposal can eliminate more than 90% of TLB shootdowns and improve the performance of memory migration by 78%, of copy-on-write events by 18-25%, and of multithreaded applications (Apache and parallel bzip2) by up to 12%.

Our system introduces a worst-case slowdown of up to 9% when mappings are only set and never removed or changed, which means no shootdown activity is conducted. According to our measurements, this slowdown is caused by the overhead of our software TLB manipulation techniques. To eliminate it, we propose a CPU extension that would allow OSes to write entries directly into the TLB, resembling the functionality provided by CPUs that employ software-managed TLBs.

2. Background and Motivation

2.1 Memory Management Hardware

Virtual memory is supported by most modern CPUs and used by all the major OSes [9, 32]. Using virtual memory allows the OS to utilize the physical memory more efficiently and to isolate the address space of each process. The CPU translates virtual addresses to physical addresses before memory accesses are performed. The OS sets the virtual address translations (also called "mappings") according to its policies and considerations.

The memory mappings of each address space are kept in a memory-resident data structure, which is defined by the CPU architecture. The most common data structure, used by the x86 architecture, is a radix-tree, also known as a page-table hierarchy. The leaves of the tree, called the page-table entries (PTEs), hold the translations of fixed-size virtual memory pages to physical frames. To translate a virtual address into a physical address, the CPU incorporates a memory management unit (MMU), which performs a "page-table walk" on the page-table hierarchy, checking access permissions at every level. During a page-walk, the MMU updates the status bits in each PTE, indicating whether the page was read from and/or written to (dirtied).
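To make the radix-tree structure concrete, the following minimal sketch (ours, not from the paper) shows how a 48-bit x86-64 virtual address decomposes into the four per-level indexes and a page offset that a walk consumes; the helper name is illustrative:

    /* Illustrative decomposition of a 48-bit x86-64 virtual address:
     * 9 index bits per page-table level, 12 offset bits for 4KB pages. */
    #include <stdint.h>

    #define LEVEL_BITS 9
    #define PAGE_SHIFT 12

    static inline unsigned pt_index(uint64_t va, int level /* 4=PGD .. 1=PT */)
    {
        return (va >> (PAGE_SHIFT + (level - 1) * LEVEL_BITS))
               & ((1u << LEVEL_BITS) - 1);
    }
    /* A walk reads PGD[pt_index(va, 4)], then PUD[..., 3], PMD[..., 2],
     * and finally the PTE at PT[pt_index(va, 1)], which maps the 4KB frame;
     * the low 12 bits (va & 0xfff) select the byte within the frame. */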

To avoid frequent page-table walks and their associated latency, the MMU caches translations of recently used pages in a translation lookaside buffer (TLB). In the x86 architecture, these caches are maintained by the hardware, which brings translations into the cache after page walks and evicts them according to an implementation-specific cache replacement policy. Each x86 core holds a logically private TLB.

Unlike memory caches, TLBs of different CPUs are not kept coherent by hardware. Specifically, x86 CPUs do not maintain coherence between the TLB and the page-tables, nor among the TLBs of different cores. As a result, page-table changes may leave stale entries in the TLBs until coherence is restored by the OS. The instruction set enables the OS to do so by flushing ("invalidating") individual PTEs or the entire TLB. Global and individual TLB flushes can only be performed locally, on the TLB of the core that executes the flush instruction.

Although the TLB is essential to attain reasonable translation latency, some workloads experience frequent TLB cache-misses [4]. Recently, new features were introduced into the x86 architecture to reduce the number and latency of TLB misses. A new instruction set extension allows each page-table hierarchy to be associated with an address-space ID (ASID), avoiding TLB flushes during address-space switching and thus reducing the number of TLB misses. Micro-architectural enhancements introduced page-walk caches that enable the hardware to cache internal nodes of the page-table hierarchy, thereby reducing TLB-miss latencies [3].

2.2 TLB Software Challenges

The x86 architecture leaves maintaining TLB coherency to the OSes, which often requires frequent TLB invalidations after PTE changes. OS kernels can make such PTE changes independently of the running processes, upon memory migration across NUMA nodes [2], memory deduplication [49], memory reclamation, and memory compaction for accommodating huge pages [14]. Processes can also trigger PTE changes by using system calls, for example mprotect, which changes protection on a memory range, or by writing to copy-on-write (COW) pages.

These PTE changes can require a TLB flush to avoid caching of stale PTEs in the TLB. We distinguish between two types of flushes, local and remote, in accordance with the core that initiated the PTE change. Remote TLB flushes are significantly more expensive, since most CPUs cannot flush remote TLBs directly. OSes therefore perform a TLB shootdown: the initiating core sends an inter-processor interrupt (IPI) to the remote cores and waits for their interrupt handlers to invalidate their TLBs and acknowledge that they are done.
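As a minimal sketch of this protocol, assuming hypothetical helpers local_flush_tlb_page() and send_shootdown_ipi() (real kernels batch requests, filter idle cores, and handle locking), the two sides look roughly as follows:

    struct flush_req {
        unsigned long va;        /* address whose translation went stale */
        atomic_t pending;        /* remote cores yet to acknowledge      */
    };

    /* Initiator side: flush locally, interrupt the remote cores, and
     * spin until every one of them has flushed and acknowledged. */
    static void tlb_shootdown(const struct cpumask *targets, unsigned long va)
    {
        struct flush_req req = { .va = va };

        atomic_set(&req.pending, cpumask_weight(targets));
        local_flush_tlb_page(va);           /* flush the initiator's TLB  */
        send_shootdown_ipi(targets, &req);  /* interrupt the remote cores */
        while (atomic_read(&req.pending))
            cpu_relax();
    }

    /* Remote side, run in the IPI handler of each target core. */
    static void shootdown_ipi_handler(struct flush_req *req)
    {
        local_flush_tlb_page(req->va);      /* invalidate the stale entry   */
        atomic_dec(&req->pending);          /* report back to the initiator */
    }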

TLB shootdowns introduce a variety of overheads. IPI delivery can take several hundreds of cycles [5]. Then, the IPI may be kept pending if the remote core has interrupts disabled, for instance while running a device driver [13]. The x86 architecture does not allow OSes to flush multiple PTEs efficiently, requiring the OS to either incur the overhead of multiple flushes or flush the entire TLB and increase the TLB miss rate.



In addition, TLB flushes may indirectly cause lock contention, since they are often performed while the OS holds a lock [11, 15]. It is noteworthy that while some CPU architectures (e.g., ARM) enable remote TLB shootdowns to be performed without IPIs, remote shootdowns still incur higher performance overhead than local ones [22].

2.3 OS Solutions and Shortcomings

To reduce TLB-related overheads, OSes employ several techniques to avoid unnecessary shootdowns, reduce their time, and avoid TLB misses.

A TLB shootdown can be avoided if the OS can ensure that the modified PTE is either not cached in remote TLBs or can be flushed at a later time, but before it can be used for an address translation. In practice, OSes can only avoid remote shootdowns in certain cases. In Linux, for example, each userspace PTE is only set in a single address-space page-table hierarchy, allowing the OS to track which address space is active on each core and flush only the TLBs of cores that currently use this address space. The TLB can be flushed during a context switch, before any stale entry would be used.

A common method to reduce shootdown time is to batch TLB invalidations if they can be deferred [21, 47]. Batching, however, cannot be used in many cases, for example when a multithreaded application changes the access permissions of a single page. Another way to reduce shootdown overhead is to acknowledge its IPI immediately, even before the invalidation is performed [5, 43].

Flush time can be reduced by lowering the number of TLB flushes. Flushing multiple individual PTEs is expensive, and therefore OSes can prefer to flush the entire TLB if the number of PTEs exceeds a certain threshold. This is a delicate trade-off, as such a flush increases the number of TLB misses [23].

Linux tries to balance the overheads of TLB flushes and TLB misses when a core becomes idle, using a lazy TLB invalidation scheme. Since the process that ran before the core became idle may be scheduled to run again, the OS does not switch its address space, in order to avoid potential future TLB misses. However, when the first TLB shootdown is delivered to the idle core, the OS performs a full TLB invalidation and indicates to the other cores not to send it further shootdown IPIs while it is idle.

Despite all of these techniques, shootdowns can induce high overheads in real systems. Arguably, this overhead is one of the reasons people refrain from using multithreading, in which mapping changes need to propagate to all threads. Moreover, application writers often prefer copying data over memory remapping, which requires TLB shootdown [42].

2.4 Per-Core Page Tables

Currently, the state-of-the-art software solution for TLB shootdowns is to set per-core page tables and, based on the page-faults each core experiences, track which cores used each PTE [11, 19]. When a PTE invalidation is needed, a shootdown is sent only to cores whose page tables hold the invalidated PTE.

Maintaining per-core page tables, however, can introduce substantial overheads when some PTEs are accessed by multiple cores. In such a case, OS memory management operations become more expensive, as mapping modifications require changing the PTEs in multiple page-tables. The overhead of PTE changes is not negligible, as some require atomic operations. RadixVM [11] reduces this overhead by changing PTEs in parallel: sending IPIs to cores that hold the PTE and changing them locally. This scheme is efficient when shootdowns are needed, as one IPI triggers both the PTE change and its invalidation. Yet, if a shootdown is not needed, for example when the other cores run a different process, this solution may increase the overhead due to the additional IPIs.

Holding per-core page tables can also introduce high memory overheads if memory is accessed by multiple cores. For example, on recent 288-core CPUs [24], if half of the memory is accessed by all cores, the page tables will consume 18% of the memory, or more if memory is overcommitted or mappings are sparse.

While studies showed substantial performance gains when per-core page tables are used, the limitations of this approach may not have been studied well enough. For example, in an experiment we conducted, memory migration between NUMA nodes was 5 times slower when memory was mapped in 48 page-table hierarchies (of 48 running Linux processes in our experiment) instead of one. Previous studies may not have shown these overheads as they considered a teaching OS, which lacks basic memory management features [11]. In addition, previous studies experienced shootdown latencies of over 500k cycles, which is over 24x the latency that we measured. Presumably, the high overhead of shootdowns could overshadow other overheads.

3. The Idea

The challenge in reducing TLB shootdown overhead is determining which cores, if any, might be caching a given PTE. Although architectural paging structures do not generally provide this information, we contend that the OS can nevertheless deduce it by carefully tracking and manipulating PTE access-bits. The proclaimed goal of access bits is to indicate whether memory pages have been accessed.



This functionality is declared by architectural manuals and is used by OSes to make informed swapping decisions. Our insight is that access bits can additionally be used for a different purpose: to indicate whether PTEs are cached in TLBs, as explained next.

Let us assume: (1) that a PTE e might be cached by a set of cores S at time t0; (2) that e's access bit is clear at t0 (because it was never set, or because the OS explicitly cleared it); and (3) that this bit is still clear at some later time t1. Since access bits are set by hardware whenever it caches the corresponding translations in the TLB [25], we can safely conclude that e is not cached by any core c ∉ S at t1.
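This deduction can be stated as a small predicate; the sketch below is ours (pte_young() is the usual Linux accessor for the access-bit, the rest is hypothetical):

    /* With e's access-bit clear since t0, a core c outside S cannot hold
     * e in its TLB: a TLB fill on c would have walked the page table and
     * set the bit. If the bit is set, no conclusion can be drawn. */
    static bool pte_may_be_cached_by(pte_t e, const struct cpumask *S, int c)
    {
        if (pte_young(e))                 /* access-bit set: assume shared */
            return true;
        return cpumask_test_cpu(c, S);    /* bit clear: only cores in S    */
    }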

We note that our reasoning rests on the fact that last-level TLBs are private per core [6, 27, 29], and so translations are not transferred between them. Linux, for example, relies on this fact when shooting down a PTE of some address space α while avoiding the shootdown at remote cores whose current address spaces are different than α (§2.3). This optimization would have been erroneous if TLBs were shared, because Linux permits the said remote cores to freely load α while the shootdown takes place, which would have allowed them to cache stale mappings from a shared last-level TLB, thereby creating an inconsistency bug.

We identify two types of mappings that can help us optimize TLB shootdown by leveraging access-bit information. The first is short-lived private mappings of pages that are accessed exclusively by a single thread and then removed shortly after; this access pattern may be exhibited, for example, by multithreaded applications that use memory-mapped files to read data. The second type is long-lived idle mappings of pages that are reclaimed by the OS after they have not been accessed for a while; this pattern is typical for pages that cease to be part of the working set of a process, prompting the OS to unmap them, flush their PTEs, and reuse their frames elsewhere.

4. The System

Using the above reasoning (§3), we next describe the Linux enhancements we deploy on an x86 Intel machine to optimize TLB shootdown of short-lived private mappings (§4.1) and long-lived idle mappings (§4.2). We then describe "software-PTEs", the data structures we use when implementing our mechanisms (§4.3). To distinguish our enhancements from the baseline OS, we collectively denote them as ABIS—access-based invalidation system.

4.1 Private PTE Detection

To avoid TLB shootdown due to a private mapping, we must (1) identify the core that initially uses this mapping and (2) make sure that other cores have not used it too at a later time. As previously shown [27], the first item is achievable via demand paging, the standard memory management technique that OSes employ, which traps upon the first access to a memory page and only then sets a valid mapping [9]. The second item, however, is more challenging, as existing approaches to detect PTE sharing can introduce overheads that are much higher than those we set out to eliminate (§6).

Direct TLB Insertion  Our goal is therefore to find a low-overhead way to detect PTE sharing. As a first step, we note that this goal would have been easily achievable if it were possible to conduct direct TLB insertion—inserting a mapping m directly into the TLB of a core c without setting the access bit of the corresponding PTE e. Given such a capability, as long as m resides in the TLB, subsequent uses of m by c would not set the access-bit of e, as no page-table walks are needed. In contrast, if some other core c̄ ends up using m as well, the hardware will walk the page table when inserting m into the TLB of c̄, and it will therefore set e's access bit, thereby indicating that m is not private.

Direct TLB insertion would have thus allowed us to use turned-off access bits as identifiers of private mappings. We remark that this method is best-effort and might lead to false-positive indications of sharing in cases where m is evicted from the TLB and reinserted later. This issue does not affect correctness, however. It simply implies that some useless shootdown activity is possible. The approach is thus more suitable for short-lived PTEs.

Alas, current x86 processors do not support direct TLB insertion. One objective of this study is to motivate such support. When proposing a new hardware feature, architects typically resort to simulation, since it is unrealistic to fabricate chips to test research features. We do not employ simulation for two reasons. First, because we suspect that it might yield questionable results, as the OS memory management subsystems that are involved are complex to realistically simulate. Second, because TLB insertion is possible on existing hardware even without hardware support, and can benefit workloads that are sensitive to shootdown overheads, shortening runtimes by 0.56x (= 1/1.78; see Figure 5) at best. Although runtimes might be 1.09x longer in the worst case, our results indicate that real hardware support will eliminate this overhead (§5.1).

Note that although direct TLB insertion is not supported in the x86 architecture, it is supported in CPUs that employ software-managed TLBs.



For example, Power CPUs support the tlbwe instruction, which can insert a PTE directly into the TLB. We therefore consider this enhancement achievable with a reasonable effort.

Approximation  Let us first rule out the naive approach of approximating direct TLB insertion by: (1) setting a PTE e; (2) accessing the page, thus prompting the hardware to load the corresponding mapping m into the TLB and to set e's access bit; and then (3) having the OS clear e's access bit. This approach is buggy due to the time window between the second and third items, which allows other cores to cache m in their TLBs before the bit is cleared, resulting in a false indication that the page is private. Shootdown would then be erroneously skipped.

We resolve this problem and avoid the above race by using Intel's address-space IDs, known as process-context identifiers (PCIDs) [25]. PCIDs enable TLBs to hold mappings of multiple address spaces by associating every cached PTE with the PCID of its address space. The PCID of the current address space is stored in the same register as the pointer to the root of the page-table hierarchy (CR3), and TLB entries are associated with this PCID when they are cached. The CPU uses for address translation only PTEs whose PCID matches the current one. This feature is intended to allow OSes to avoid global TLB invalidations during context switches and reduce the number of TLB misses.

PCID is not currently used by Linux due to the limited number of supported address spaces and questionable performance gains from TLB miss reduction. We exploit this feature in a different manner. Nevertheless, our use does not prevent or limit future PCID support in the OS.

The technique ABIS employs to provide direct TLB insertion is depicted in Figure 1. Upon initialization, ABIS preallocates for each core a "secondary" page-table hierarchy, which consists of four pages, one for each level of the hierarchy. The uppermost level of the page-table (PGD) is then set to point to the kernel mappings (like all other address spaces). The other three pages are not connected to the hierarchy at this stage, but are wired dynamically later according to the address of the PTE that is inserted into the TLB.

While executing, the currently running thread T occasionally experiences page faults, notably due to demand paging. When a page fault fires, the OS handler is invoked and locks the PT that holds the faulting PTE—no other core will simultaneously handle the same fault.

At this point, ABIS loads the secondary space into CR3 along with a PCID equal to that of T (Step 1 in Figure 1). After, ABIS wires the virtual-to-physical mapping of the target page in both the primary and secondary spaces, leaving the corresponding access bit in the primary hierarchy clear (Step 2).

[Figure 1: Direct TLB insertion using a secondary hierarchy. Steps: (1) change CR3 to the secondary hierarchy (same PCID, different address); (2) wire both hierarchies (the primary access-bit is 0); (3) read the page (the translation is loaded into the TLB from the secondary CR3; the primary access-bit remains 0); (4) change CR3 back to the primary hierarchy; (5) zero the secondary PTE.]

Then, ABIS reads from the page. Because the associated mapping is currently missing from the TLB (a page fault fired), and because CR3 currently points to the secondary space, reading the page prompts the hardware to walk the secondary hierarchy and to insert the appropriate translation into the TLB, leaving the primary bit clear (Step 3). Importantly, the inserted translation is valid and usable within the primary space, because both spaces have the same PCID and point to the same physical page using the same virtual address. This approach eliminates the aforementioned race: no other core is able to access the secondary space, as it is private to the core.

After reading the page, ABIS loads the primary hierarchy back into CR3, to allow the thread to continue as usual (Step 4). It then clears the PTE from the secondary space, thereby preventing further use of translation data from the secondary hierarchy that may have been cached in the hardware page-walk cache (PWC). If the secondary tables are used by the CPU for translation, no valid PTE will be found and the CPU will restart a page-walk from the root entry.

Finally, using our "software-PTE" (SPTE) data structure (§4.3), ABIS associates the faulting PTE e with the current core c that has just resolved e. When the time comes to flush e, if ABIS determines that e is still private to c, it will invalidate e on c only, thus avoiding the shootdown overhead.
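Condensed into code, the five steps look roughly as follows. This is a sketch, not ABIS's actual implementation: write_cr3(), set_pte(), pte_mkold(), READ_ONCE(), and smp_processor_id() are standard Linux kernel helpers, while secondary_cr3(), primary_cr3(), secondary_pte_for(), and spte_mark_private() are hypothetical names for ABIS-internal functionality. The code runs in the page-fault handler with the PT lock held.

    static void abis_direct_tlb_insert(struct mm_struct *mm, unsigned long va,
                                       pte_t *ppte, pte_t val)
    {
        write_cr3(secondary_cr3(mm));             /* (1) same PCID, secondary  */
        set_pte(ppte, pte_mkold(val));            /* (2) primary PTE, A-bit 0  */
        set_pte(secondary_pte_for(va), val);      /*     wire secondary PTE    */
        READ_ONCE(*(char *)va);                   /* (3) TLB fill via secondary*/
        write_cr3(primary_cr3(mm));               /* (4) back to primary space */
        set_pte(secondary_pte_for(va), __pte(0)); /* (5) clear secondary PTE   */
        spte_mark_private(ppte, smp_processor_id()); /* record the caching core*/
    }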

Coexisting with Linux  Linux reads and clears architectural access bits (hwA-s) via a small API, allowing us to easily mask these bits while making sure



that both Linux and ABIS simultaneously operate correctly. Notably, when Linux attempts to clear an hwA, ABIS (1) checks whether the bit is turned on, in which case it (2) clears the bit and (3) records in the SPTE the fact that the associated PTE is not private (using the value ALL_CPUS discussed further below). Note, however, that Linux and ABIS can use the access bit in a conflicting manner. For example, after a page fault, Linux could expect to see the access bit turned on, whereas ABIS's direct TLB insertion makes sure that the opposite happens. To avoid any such conflicts, we maintain in the SPTE a new per-PTE "software access bit" (swA) for Linux, which reflects Linux's expectations. The swA bits are governed by the following rules: upon a page fault, we set the swA; when Linux clears the bit, we clear the swA; and when Linux queries the bit, we return the OR of swA and hwA. These rules ensure that Linux always observes the values it would have observed in an ABIS-less system.
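The rules can be sketched as follows, assuming the spte_t layout given in §4.3 (the helper names are ours; pte_young()/pte_mkold() are the usual Linux access-bit accessors):

    void abis_fault_done(spte_t *s)
    {
        s->swA = 1;                   /* Linux expects A set after a fault */
    }

    void abis_clear_young(pte_t *e, spte_t *s)
    {
        if (pte_young(*e)) {          /* hwA on: the PTE was re-walked     */
            set_pte(e, pte_mkold(*e));/* clear the architectural bit       */
            s->core = ALL_CPUS;       /* e can no longer be deemed private */
        }
        s->swA = 0;                   /* Linux asked to clear the bit      */
    }

    int abis_test_young(pte_t e, const spte_t *s)
    {
        return s->swA || pte_young(e);   /* Linux sees swA OR'd with hwA   */
    }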

ABIS attempts to reduce false indications of PTE sharing when possible. We find that Linux performs excessive full flushes to reduce the number of IPIs sent to idle cores as part of the shootdown procedure (§2.3). In Linux, this behavior is beneficial, as it reduces the number of TLB shootdowns at the cost of more TLB misses, whose impact is relatively small. In our system, however, this behavior can result in more shootdowns, as it increases the number of false indications. ABIS therefore relaxes this behavior, allowing idle cores to service a few individual PTE flushes before resorting to a full TLB flush.

Overhead  Overall, the overhead of direct TLB insertions in our system is ≈550 cycles per PTE (responsible for the worst-case 9% slowdown mentioned earlier). This overhead is amortized when multiple PTEs are mapped together, for example via one mmap system-call invocation, or when Linux serves a page-fault on a file-backed page and maps adjacent PTEs to avoid future page-faults [36].

4.2 TLB Version Tracking

Based on our observations from §3, we build a TLB version tracking mechanism to avoid flushes of long-lived idle mappings. Let us assume that a PTE e might be cached by a set of cores S at time t0, and that each core c ∈ S performed a full TLB flush during the time period (t0, t1). If at time t1 the access bit of e remains clear (i.e., was not cleared by software), then we know for a fact that e is not cached by any TLB. If the OS obtained the latter information by atomically reading and zeroing e, then all TLB flushes associated with e (local and remote) can be avoided.

[Figure 2: A finite state machine that describes the various states of a PTE: "uncached" (ver = uncached), "private" (CPU = current core, ver = AS.ver), and "potentially shared" (CPU = all, ver = AS.ver). A PTE moves from uncached to private when it is faulted in, toward potentially shared on a PTE change when the access-bit is set, and back to uncached on a PTE flush. In each state, the assignment of the caching core and version are denoted. On each transition the access-bit is cleared.]

To detect such cases, we first need to maintain a "full-flush version number" for S, such that the version is incremented whenever all cores c ∈ S perform a full TLB flush. Recording this version for each e at the time e is updated would then allow us to employ the optimization.

TLB version tracking  The most accurate way to track full flushes is by maintaining a version for each core, advancing it after each local full flush, and storing a vector of the versions for every PTE. Then, if a certain core's version differs from the corresponding vector coordinate (and the access-bit is clear), a flush on that core is not required. Despite its accuracy, this scheme is impractical, as it consumes excessive memory and requires multiple memory accesses to update version vectors. We therefore trade off accuracy in order to reduce the memory consumption of versions and the overheads of updating them.

ABIS therefore tracks versions for each address space (AS, corresponding to the above S) and not for each core. To this end, for every AS, we save a version number and a bitmask that marks which cores have not performed a full TLB flush in the current version. The last core to perform a full TLB flush in a certain version advances the version. At the same time, it marks in the bitmask which cores currently use this AS and can therefore cache PTEs in the next version. To mitigate cache-line bouncing, the core that initiates a TLB shootdown updates the version on behalf of the target cores.
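A minimal sketch of this per-address-space state follows; the field and function names are ours, not ABIS's, and locking is elided (mm_cpumask() is the standard Linux mask of cores that have used an address space):

    struct as_version {
        atomic64_t ver;            /* full-flush version number           */
        cpumask_t  pending;        /* cores that have not fully flushed   */
    };

    /* Called when a core performs a full local TLB flush: the last core
     * to flush advances the version and re-arms the mask with the cores
     * that currently use this address space (and so may cache its PTEs
     * in the next version). */
    static void as_note_full_flush(struct mm_struct *mm,
                                   struct as_version *as, int cpu)
    {
        cpumask_clear_cpu(cpu, &as->pending);
        if (cpumask_empty(&as->pending)) {
            atomic64_inc(&as->ver);
            cpumask_copy(&as->pending, mm_cpumask(mm));
        }
    }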

Avoiding flushes  After a PTE access-bit is cleared, ABIS stores the current AS version as the PTE version. Determining later whether a shootdown is needed requires some attention, as even if the PTE and the AS versions differ, a flush may be necessary.



[Figure 3: Flush type decision algorithm. If PTE.A = 1, flush all CPUs; otherwise, if SPTE.ver = uncached, no flush is needed; otherwise, if SPTE.ver < AS.ver − 1, no flush is needed; otherwise, flush all CPUs if SPTE.CPU = all, and only SPTE.CPU if not.]

Consider a situation in which the access-bit is cleared, and the PTE version is updated to hold the AS version. At this time, some of the cores may have already flushed their TLB for the current AS version, and their respective bit in the bitmask is clear. The AS version may therefore advance before these cores flush their TLB again, and these cores can hold stale PTEs even when the versions differ. Thus, our system avoids shootdown only if there is a gap of at least one version between the AS and the PTE versions, which indicates a flush was performed on all cores.

Since flushes cannot be avoided when the access-bit is set, this bit should be cleared and the PTE version updated as frequently as possible, assuming this introduces negligible overheads. In practice, ABIS clears the bit and updates the version whenever the OS already accesses a PTE for other purposes, for example during an mprotect system-call or when the OS considers a page for reclamation.

Uncached PTEs  The version tracking mechanism can also prevent unwarranted multiple flushes of the same PTE. Such flushes may occur, for example, when a user first calls the msync system call, which performs writeback of a memory-mapped file, and then unmaps the file. Both operations require flushing the TLB, since the first clears the PTEs' dirty-bit and the second sets a non-present PTE. However, if the PTE was not accessed after the first flush, the second flush is unnecessary, regardless of whether a full TLB flush happened in between. To avoid this scenario, we set a special version value, UNCACHED, as the PTE version when it is flushed. This value indicates the PTE is not cached in any TLB if the access-bit is cleared, regardless of the current AS version.

Coexisting with Private PTE Detection  Version tracking coexists with private PTE detection. The interaction between the two can be described in a state machine, as shown in Figure 2.

[Figure 4: Software PTE (SPTE) and its association to the page table through the meta-data of the page-table frame. The SPTE holds a generation (bits 15-9), a swA bit (bit 8), and the caching core (bits 7-0). The page-table frame's meta-data (struct page) holds the PT spin-lock and the SPT pointer, alongside the page table (PT) whose entries hold the PFN and status/protection bits.]

In the "uncached" state, a TLB flush is unnecessary; in the "private" state, at most one CPU needs to perform a TLB flush; and in the "potentially shared" state, all the CPUs perform a TLB flush.¹ In the latter two states, a TLB flush may still be avoided if the access-bit is clear and the current address-space version is at least two versions ahead of the PTE version. Figure 3 shows the ABIS flush decision algorithm.
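Transcribed into code, the decision of Figure 3 reads as follows (a sketch: spte_t is the layout sketched in §4.3 below, UNCACHED is the reserved version value from above, flush_all_cpus() and flush_one_cpu() are hypothetical helpers, and version wraparound handling is elided):

    static void abis_flush(pte_t e, const spte_t *s, struct as_version *as)
    {
        if (pte_young(e)) {               /* A-bit set: cannot tell who    */
            flush_all_cpus(e);            /* cached e since it was cleared */
            return;
        }
        if (s->version == UNCACHED)
            return;                       /* cached nowhere: no flush      */
        if (s->version < atomic64_read(&as->ver) - 1)
            return;                       /* all cores fully flushed since */
        if (s->core == ALL_CPUS)
            flush_all_cpus(e);            /* potentially shared            */
        else
            flush_one_cpu(s->core - 1, e);/* private: one TLB suffices     */
    }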

4.3 Software PTEs

As we noted before, for our system to make informed TLB invalidation decisions, additional information must be saved for each PTE: the PTE version, the CPU which caches the PTE, and a software access-bit. Although we are capable of squeezing this information into two bytes, the architectural PTE only accommodates three bits for software use. We therefore allocate a separate "software page-table" (SPT) for each PT, which holds the corresponding "software-PTEs" (SPTEs). The SPTE is not used by the CPU during page-walks and therefore causes little cache pollution and overhead.

An SPTE is depicted in Figure 4. We use 7 bits for the version, 1 bit for the software access-bit, and another byte to track the core that caches the PTE if the access-bit is cleared. We want to define the SPTE in a manner that ensures a zeroed SPTE would behave in the legacy manner, allowing us to make fewer code changes. To do so, we reserve the zero value of the "caching core" field to indicate that the PTE may be cached by all CPUs (ALL_CPUS) and instead store the core number plus one.
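As a sketch, the two-byte layout of Figure 4 can be expressed with bitfields (the exact definition is ABIS-internal; note that a zeroed SPTE means "version 0, swA clear, cached by all CPUs", preserving the legacy behavior):

    #include <stdint.h>

    #define ALL_CPUS 0               /* reserved "caching core" value      */

    typedef struct {
        uint16_t version : 7;   /* PTE version at the last A-bit clearing  */
        uint16_t swA     : 1;   /* software access-bit exposed to Linux    */
        uint16_t core    : 8;   /* caching core + 1; ALL_CPUS (0) = any    */
    } spte_t;

    static inline void spte_set_core(spte_t *s, unsigned int cpu)
    {
        s->core = cpu + 1;      /* store core+1 so 0 keeps legacy meaning  */
    }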

When the OS wishes to access the SPTE of a certain PTE, it should be able to do so easily. Yet the PTE cannot accommodate a pointer to its SPTE. A possible solution is to allocate two page-frames for each page-table, one holding the CPU architectural PTEs and the second holding the corresponding SPTEs, each at a fixed offset from its PTE. While this scheme is simple, it wastes memory, as it requires the SPTE to be the same size as a PTE (8B), when in fact an SPTE only occupies two bytes.

1 A TLB flush is not required on CPUs that currently use a different page-table hierarchy, as explained in §2.



[Figure 5: Normalized runtime and number of TLB shootdowns in ABIS when running vm-scalability benchmarks (migrate, cow-seq-mt, cow-rand-mt, mmap-read, msync-mt, anon-r-seq). The numbers above the bars indicate the baseline runtime in seconds (left) and the rate of TLB shootdowns in thousands/second (right).]


We therefore allocate an SPT separately during the PT construction, and set a pointer to the SPT in the PT page-frame meta-data (page struct). Linux can quickly retrieve this meta-data, allowing us to access the SPTE of a certain PTE with small overhead. The SPTE pointer does not increase the page-frame meta-data, as it is set in an unused PT meta-data field (second quadword). The SPT therefore increases page-table memory consumption by 25%. ABIS prevents races during SPT changes by protecting the SPT with the same lock that is used to protect PT changes. It is noteworthy that although SPT management introduces an overhead, it is negligible relative to other overheads in the workloads we evaluated.

5. Evaluation

We implemented a fully functional prototype of the system, ABIS, which is based on Linux 4.5. As a baseline system for comparison we use the same version of Linux, which includes recent TLB shootdown optimizations. We run each test 5 times and report the average result. Our testbed consists of a two-socket Dell PowerEdge R630 with 24-core Intel Haswell EP CPUs. We enable x2APIC cluster-mode, which speeds up IPI delivery.

In our system we disable transparent huge pages (THP), which may cause frequent full TLB flushes, increase the TLB miss-rate [4], and introduce additional overheads [26]. In practice, when THP is enabled, ABIS still shows benefit when small pages are used (e.g., in the Apache benchmark shown later) and no impact when huge pages are used (e.g., PBZIP2).

As a fast block device for our experiments we use ZRAM, a compressed RAM block device, which is used by Google Chrome OS and Ubuntu. This device's latency is similar to that of emerging non-volatile memory modules. In our tests, we disable memory deduplication and deep sleep states, which may increase the variance of the results.

5.1 VM-Scalability

We use the vm-scalability test suite [34], which is used by Linux kernel developers to exercise the kernel memory management mechanisms, test their correctness, and measure their performance.

We measure ABIS performance by running benchmarks that experience a high number of TLB shootdowns.² To run the benchmarks in a reasonable time, we limit the amount of memory each test consumes to 32GB. Figure 5 presents the measured speedup, the runtime, the relative number of sent TLB shootdowns, and their rate. We now discuss these results.

Migrate. This benchmark reads a memory-mapped file and waits while the OS is instructed to migrate the process memory between NUMA nodes. During migration, we set the benchmark to perform a busy-wait loop to exercise TLB flushes. We present the time that a 1TB memory migration would take. ABIS reduces runtime by 44% and shootdowns by 92%.

Multithreaded copy-on-write (cow-mt). Multiple threads read and write a private memory-mapped file. Each write causes the kernel to copy the original page, update the PTE to point to the copy, and flush the TLB. ABIS prevents over 97% of the shootdowns, reducing runtime by 20% for sequential memory accesses and 15% for random ones.

Memory-mapped reads (mmap-read). Multiple processes read a large, sparse memory-mapped file. As a result, memory pressure builds up, and memory is reclaimed. While almost all the shootdowns are eliminated, the runtime is not affected, as apparently there are more significant overheads, specifically those of the page frame reclamation algorithm.

Multithreaded msync (msync-mt). Multiple threads access a memory-mapped file and call the msync system-call to flush the memory changes to the file. msync can cause an overwhelming number of flushes, as the OS clears the dirty-bit. ABIS eliminates 98% of the shootdowns but does not reduce the runtime, as file system overhead appears to be the main performance bottleneck.

2 We find that some of these benchmarks exercise unrealistic scenarios. Our revised tests are released with the ABIS code.



[Figure 6: Execution of an Apache web server which serves the Wrk workload generator. (a) Throughput: requests/sec (thousands) and ABIS speedup vs. number of cores, for baseline and ABIS. (b) TLB shootdowns: sent and received shootdowns (thousands/sec) vs. number of cores, for baseline and ABIS.]

Anonymous memory read (anon-r-seq). To evaluate ABIS overheads we run a benchmark that performs sequential anonymous memory reads and does not cause TLB shootdowns. This benchmark's runtime is 9% longer using ABIS. Profiling the benchmark shows that the software TLB manipulations consume 9% of the runtime, suggesting that hardware enhancements to manipulate the TLB can eliminate most of the overheads.

5.2 Apache Web Server

Apache is the most widely used web server software. In our tests, we use Apache v2.4.18 and enable buffered server logging for more efficient disk accesses. We use the multithreaded Wrk workload generator to create web requests [50], and set it to repeatedly request the default Apache web page for 30 seconds, using 400 connections and 6 threads. We use the same server for both the generator and Apache, and isolate each one on a set of cores. We ensure that the generator is unaffected by ABIS.

Apache provides several multi-processing modules. We use the default "mpm_event" module, which spawns multiple processes, each of which runs multiple threads. Apache serves each request by creating a memory mapping of the requested file, sending its content, and unmapping it. This behavior effectively causes frequent invalidations of short-lived mappings. In the baseline system, the invalidation also requires expensive TLB shootdowns to the cores that run other threads of the Apache process. Effectively, when Apache serves concurrent requests using multiple threads, it triggers a TLB shootdown for each request that it serves.

Figure 6a depicts the number of requests per second that are served when the server runs on different numbers of cores.

ABIS improves performance by 12% when all cores are used. Executing the benchmark reveals that the effect of ABIS on performance is inconsistent when the number of cores is low, as ABIS causes slowdowns of up to 8% and speedups of up to 42%. Figure 6b presents the number of TLB shootdowns that are sent and received in the baseline system and ABIS. As shown, in the baseline system, as more cores are used, the number of sent TLB shootdowns becomes almost identical to the number of requests that Apache serves. ABIS reduces the number of both sent and received shootdowns by up to 90%, as it identifies that PTEs are private and that local invalidation suffices.

5.3 PBZIP2

Parallel bzip2 (PBZIP2) is a multithreaded implementation of the bzip2 file compressor [20]. In this benchmark we evaluate the effect of reclamation due to memory pressure on PBZIP2, which in itself does not cause many TLB flushes. We use PBZIP2 to compress the Linux 4.4 tar file. We configured the benchmark to read the input file into RAM and split it between processors using a 500k block size. We run PBZIP2 in a container and limit its memory to 300MB to induce swap activity. This activity causes the invalidation of long-lived idle mappings as inactive memory is reclaimed.

The time of compression is shown in Figure 7a. ABIS outperforms Linux by up to 12%, and the speedup grows with the number of cores. Figure 7b presents the number of TLB shootdowns per second when this benchmark runs. The baseline Linux system sends nearly 200k shootdowns regardless of the number of threads, and the different shootdown send rate is merely due to the shorter runtime when the number of cores is higher.



[Figure 7: Execution of PBZIP2 when compressing the Linux kernel; the process memory is limited to 300MB to exercise page reclamation. (a) Runtime: runtime (seconds) and ABIS speedup vs. number of threads. (b) Remote shootdowns: sent and received shootdowns (thousands/sec) vs. number of threads, for baseline and ABIS.]

The number of received shootdowns in the baseline system is proportional to the number of cores, as the OS cannot determine which TLBs cache the entry, and broadcasts the shootdown messages to all the cores that run the process threads. In contrast, ABIS can usually determine that a single TLB needs to be flushed. When 48 threads are spawned, a shootdown is sent on average to 10 remote cores in ABIS, and to 18 cores using baseline Linux.

5.4 PARSEC Benchmark Suite

We run the PARSEC 3.0 benchmark suite [7], which is composed of multithreaded applications that are intended to represent emerging shared-memory programs. We set up the benchmark suite to use the native dataset and spawn 32 threads. The measured speedup, the runtime, the normalized number of TLB shootdowns, and their rate in the baseline system are presented in Figure 8. As shown, ABIS can improve performance by over 3% but can also induce overheads of up to 2.5%. ABIS reduces the number of TLB shootdowns by 96% on average.

The benefit of ABIS appears to be limited by the overhead of the software technique it uses to insert PTEs into the TLB. As this overhead is incurred after each page fault, workloads which trigger considerably more page faults than TLB shootdowns experience slowdown. For example, the "canneal" benchmark causes 1.5k TLB shootdowns per second in the baseline system, and ABIS prevents 91% of them. However, since the benchmark triggers over 55k page-faults per second, ABIS reduces performance by 2.5%. In contrast, "dedup" triggers 33k shootdowns and 370k page faults per second. ABIS saves 39% of the shootdowns and improves performance by 3%. Hardware enhancements or selective enabling of ABIS could prevent the overheads.

5.5 Limitations

ABIS is not free of limitations. The additional operations and data introduce performance and memory overheads, specifically the insertions of PTEs into the TLB without setting the access-bit. However, relatively simple hardware enhancements could have eliminated most of the overhead (§7). In addition, the CPU incurs an overhead of roughly 600 cycles when it sets the access-bit of shared PTEs [37].

To detect short-lived private mappings, our system requires that the TLB be able to accommodate them during their lifetime. New CPUs include rather large TLBs of up to 1536 entries, which may map 6MB of memory. However, non-contiguous or very large working sets may cause TLB pressure, induce evictions, and cause false indications that PTEs are shared. In addition, frequent full TLB flushes, for instance during address-space switching or when the OS sets the CPU to enter a deep sleep-state, have similar implications. Process migration between cores is also damaging, as it causes PTEs to be shared between cores and requires shootdowns. These limitations are often irrelevant to a well-tuned system [30, 31].

Finally, our system relies on micro-architectural behavior of the TLBs. We assume the MMU does not perform involuntary flushes and that the same PTE is not marked as "accessed" multiple times when it is already cached. Experimentally, this is not always the case. We further discuss these limitations in §7.

6. Related Work

Hardware Solutions. The easiest solution from a software point of view is to maintain TLB coherency in hardware.



[Figure 8: Normalized runtime and number of TLB shootdowns in ABIS when running PARSEC benchmarks (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips). The numbers above the bars indicate the baseline runtime in seconds (left) and the rate of TLB shootdowns in thousands/second (right).]

DiDi uses a shared second-level TLB directory that tracks which PTEs are cached by which core and performs TLB flushes accordingly [48]. Teller et al. proposed that OSes save a version count for each PTE, to be used by hardware to perform TLB invalidations only when memory is addressed via a stale TLB entry [39]. Li et al. eliminate unwarranted shootdowns of PTEs that are only used by a single core by extending PTEs to accommodate the core that first accessed a page, enhancing the CPU to track whether a PTE is private and avoiding shootdowns accordingly [27].

These studies present compelling evaluation results; however, they require intrusive micro-architecture changes, which CPU vendors are apparently reluctant to introduce, presumably due to a history of TLB bugs [1, 16, 17, 35, 46].

Software Solutions. To avoid unnecessary recurring TLB flushes of invalidated PTEs, Uhlig tracks TLB versions and avoids shootdowns when the remote cores have already performed full TLB flushes after the PTE changed [43, 44]. However, the potential of this approach is limited, since even when TLB invalidations are batched, the TLB is flushed shortly after the last PTE is modified.

An alternative approach to reducing TLB flushes is to require applications to inform the OS how memory is used or to control TLB flushes explicitly. Corey OS avoids TLB shootdowns of private PTEs by requiring that user applications define which memory ranges are private and which are shared [10]. C4 uses an enhanced Linux version that allows applications to control TLB invalidations [40]. These systems, however, place an additional burden on application writers.

Finally, we should note that reducing the number of memory mapping changes, for example by improving the memory reclamation policy, can reduce the number of TLB flushes. However, such solutions are often workload dependent [45].

7. Hardware Support

Although our system avoids most TLB shootdowns, it does introduce some overheads. Hardware support that would allow privileged OSes to insert PTEs directly into the TLB without setting the access-bit would eliminate most of ABIS's overhead. Such an enhancement should be easy to implement, as we achieve an equivalent behavior in software.

ABIS would be able to avoid even more TLB flushes if CPUs avoided setting the PTE access-bit after the PTE is cached in the TLBs. We encountered, however, situations where such events occur. It appears that when Intel CPUs set the PTE dirty-bit due to a write access, they also set the access-bit, even if the PTE is already cached in the TLB. Similarly, before a CPU triggers a page-fault, it performs a page-walk to retrieve the updated PTE from memory and may set the access-bit even if the PTE disallows access. Since x86 CPUs invalidate the PTE immediately after, before invoking the page-fault exception handler, setting the access-bit is unnecessary.

CPUs should not invalidate the TLB unnecessarily, as such invalidations hurt performance regardless of ABIS. ABIS is further affected, as these invalidations cause the access-bit to be set again when the CPU re-caches the PTE. We found that Intel CPUs (unlike AMD CPUs) may perform full TLB flushes when virtual machines invalidate huge pages that are backed by small host pages.

8. Conclusion

We have presented two new software techniques that prevent TLB shootdowns in common cases, without replicating the mapping structures and without incurring more page-faults. We have shown their benefits in a variety of workloads. While our system introduces overheads in certain cases, these can be reduced by minor CPU enhancements. Our study suggests that providing OSes better control over TLBs may be an efficient and simple way to reduce TLB coherency overheads.

Availability

The source code is publicly available at: http://nadav.amit.to/publications/tlb.

Acknowledgment

This work could not have been done without the continued support of Dan Tsafrir and Assaf Schuster. I also thank the paper reviewers and the shepherd Jean-Pierre Lozi.



References[1] Lukasz Anaczkowski. Linux VM workaround for

Knights Landing A/D leak. Linux Kernel MailingList, lkml.org/lkml/2016/6/14/505, 2016.

[2] Manu Awasthi, David W Nellans, Kshitij Sudan, Ra-jeev Balasubramonian, and Al Davis. Handling theproblems and opportunities posed by multiple on-chip memory controllers. In ACM/IEEE InternationalConference on Parallel Architecture & Compilation Tech-niques (PACT), pages 319–330, 2010.

[3] Thomas W Barr, Alan L Cox, and Scott Rixner. Trans-lation caching: skip, don’t walk (the page table). InACM/IEEE International Symposium on Computer Ar-chitecture (ISCA), pages 48–59, 2010.

[4] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang,Mark D Hill, and Michael M Swift. Efficient virtualmemory for big memory servers. In ACM/IEEE Inter-national Symposium on Computer Architecture (ISCA),pages 237–248, 2013.

[5] Andrew Baumann, Paul Barham, Pierre-Evariste Da-gand, Tim Harris, Rebecca Isaacs, Simon Peter, Timo-thy Roscoe, Adrian Schüpbach, and Akhilesh Sing-hania. The multikernel: a new OS architecture forscalable multicore systems. In ACM Symposium on Op-erating Systems Principles (SOSP), pages 29–44, 2009.

[6] Abhishek Bhattacharjee, Daniel Lustig, and MargaretMartonosi. Shared last-level TLBs for chip multi-processors. In IEEE International Symposium on HighPerformance Computer Architecture (HPCA), pages 62–63, 2011.

[7] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh,and Kai Li. The PARSEC benchmark suite: Character-ization and architectural implications. In ACM/IEEEInternational Conference on Parallel Architecture & Com-pilation Techniques (PACT), pages 72–81, 2008.

[8] David L Black, Richard F Rashid, David B Golub,Charles R Hill, and Robert V Baron. Translationlookaside buffer consistency: a software approach. InACM Architectural Support for Programming Languages& Operating Systems (ASPLOS), pages 113–122, 1989.

[9] Daniel Bovet and Marco Cesati. Understanding theLinux Kernel, Third Edition. O’Reilly & Associates, Inc.,2005.

[10] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yan-dong Mao, M Frans Kaashoek, Robert Morris, Alek-sey Pesterev, Lex Stein, Ming Wu, Yue-hua Dai, et al.Corey: An operating system for many cores. InUSENIX Symposium on Operating Systems Design &Implementation (OSDI), pages 43–57, 2008.

[11] Austin T Clements, M Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In ACM SIGOPS European Conference on Computer Systems (EuroSys), pages 211–224, 2013.

[12] Joel Coburn, Adrian M Caulfield, Ameen Akel, Laura M Grupp, Rajesh K Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In ACM Architectural Support for Programming Languages & Operating Systems (ASPLOS), pages 105–118, 2011.

[13] Jonathan Corbet. Realtime and interrupt latency. LWN.net, https://lwn.net/Articles/139784/, 2005.

[14] Jonathan Corbet. Memory compaction. LWN.net, https://lwn.net/Articles/368869/, 2010.

[15] Jonathan Corbet. Memory management locking. LWN.net, https://lwn.net/Articles/591978/, 2014.

[16] Christopher Covington. arm64: Work around Falkor erratum 1003. Linux Kernel Mailing List, https://lkml.org/lkml/2016/12/29/267, 2016.

[17] Linux Kernel Driver DataBase. CONFIG_ARM_ERRATA_720789. http://cateee.net/lkddb/web-lkddb/ARM_ERRATA_720789.html, 2012.

[18] Jake Edge. Persistent memory. LWN.net, https://lwn.net/Articles/591779/, 2014.

[19] Balazs Gerofi, Akira Shimada, Atsushi Hori, and Yozo Ishikawa. Partially separated page tables for efficient operating system assisted hierarchical memory management on heterogeneous architectures. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 360–368, 2013.

[20] Jeff Gilchrist. Parallel data compression with bzip2. In IASTED International Conference on Parallel and Distributed Computing and Systems (ICPDCS), volume 16, pages 559–564, 2004.

[21] Mel Gorman. TLB flush multiple pages per IPI v4. Linux Kernel Mailing List, https://lkml.org/lkml/2015/4/25/125, 2015.

[22] Julien Grall. Force broadcast of TLB and instruction cache maintenance instructions. Xen development mailing list, https://patchwork.kernel.org/patch/8955801/, 2016.

[23] Dave Hansen. Patch: x86: set TLB flush tunable to sane value. https://patchwork.kernel.org/patch/4460841/, 2014.

[24] Intel Corporation. Xeon Phi processor. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html.

[25] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual. Reference number: 325462-057US, 2015. https://software.intel.com/en-us/articles/intel-sdm.

[26] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), pages 705–721, 2016.


[27] Yong Li, Rami Melhem, and Alex K Jones. PS-TLB: Leveraging page classification information for fast, scalable and efficient translation for future CMPs. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):28, 2013.

[28] Likai Liu. Parallel computing and the cost of TLB shoot-down. http://lifecs.likai.org/2010/06/parallel-computing-and-cost-of-tlb.html, 2010.

[29] Daniel Lustig, Abhishek Bhattacharjee, and Margaret Martonosi. TLB improvements for chip multiprocessors: Inter-core cooperative prefetchers and shared last-level TLBs. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), 2013.

[30] Ophir Maor. What is CPU affinity? https://community.mellanox.com/docs/DOC-1924, 2014.

[31] Ophir Maor. Mellanox BIOS performance tuning example. https://community.mellanox.com/docs/DOC-2297, 2015.

[32] Marshall Kirk McKusick, George V Neville-Neil, and Robert NM Watson. The design and implementation of the FreeBSD operating system. Pearson Education, 2014.

[33] Jinzhan Peng, Guei-Yuan Lueh, Gansha Wu, Xiaogang Gou, and Ryan Rakvic. A comprehensive study of hardware/software approaches to improve TLB performance for Java applications on embedded systems. In ACM Workshop on Memory System Performance and Correctness (MSPC), pages 102–111, 2006.

[34] Aristeu Rozanski. VM-scalability benchmark suite. https://github.com/aristeu/vm-scalability, 2010.

[35] Anand Lal Shimpi. AMD’s B3 stepping Phenom previewed, TLB hardware fix tested. AnandTech, http://www.anandtech.com/show/2477/2, 2008.

[36] Kirill A. Shutemov. mm: map few pages around fault address if they are in page cache. Linux Kernel Mailing List, https://lwn.net/Articles/588802, 2014.

[37] Kirill A. Shutemov. unixbench.score -6.3% regression. Linux Kernel Mailing List, http://lkml.kernel.org/r/[email protected], 2016.

[38] Patricia J Teller. Translation-lookaside buffer consistency. IEEE Computer, 23(6):26–36, June 1990.

[39] Patricia J Teller, Richard Kenner, and Marc Snir. TLB consistency on highly-parallel shared-memory multiprocessors. Courant Inst. of Math. Sci, 1987.

[40] Gil Tene, Balaji Iyengar, and Michael Wolf. C4: The continuously concurrent compacting collector. In ACM International Symposium on Memory Management (ISMM), pages 79–88, 2011.

[41] Michael Y Thompson, JM Barton, TA Jermoluk, and JC Wagner. Translation lookaside buffer synchronization in a multiprocessor system. In USENIX Winter, pages 297–302, 1988.

[42] Linus Torvalds. Splice: fix race with page invalidation. http://yarchive.net/comp/linux/zero-copy.html, 2008.

[43] Volkmar Uhlig. Scalability of microkernel-based systems. PhD thesis, TH Karlsruhe, 2005. https://os.itec.kit.edu/downloads/publ_2005_uhlig_scalability_phd-thesis.pdf.

[44] Volkmar Uhlig. The mechanics of in-kernel synchronization for a scalable microkernel. ACM SIGOPS Operating Systems Review (OSR), 41(4):49–58, 2007.

[45] Ahsen J Uppal and Mitesh R Meswani. Towards workload-aware page cache replacement policies for hybrid memories. In International Symposium on Memory Systems (MEMSYS), pages 206–219, 2015.

[46] Theo Valich. Intel explains the Core 2 CPU errata. The Inquirer, http://www.theinquirer.net/inquirer/news/1031406/intel-explains-core-cpu-errata, 2007.

[47] Brian Van Essen, Henry Hsieh, Sasha Ames, and Maya Gokhale. DI-MMAP: A high performance memory-map runtime for data-intensive applications. In IEEE International Workshop on Data-Intensive Scalable Computing Systems (SCC), pages 731–735, 2012.

[48] Carlos Villavieja, Vasileios Karakostas, Lluis Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrian Cristal, and Osman S Unsal. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In ACM/IEEE International Conference on Parallel Architecture & Compilation Techniques (PACT), pages 340–349, 2011.

[49] Carl A. Waldspurger. Memory resource management in VMware ESX server. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), volume 36, pages 181–194, 2002.

[50] wrk. HTTP benchmarking tool. https://github.com/wg/wrk, 2015.
