

This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16).
November 2–4, 2016 • Savannah, GA, USA

ISBN 978-1-931971-33-1

Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/curtis-maury


To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni
NetApp, Inc.

{mcm,vdevadas,vania,adityak}@netapp.com

Abstract

In order to achieve higher I/O throughput and better overall system performance, it is necessary for commercial storage systems to fully exploit the increasing core counts on modern systems. At the same time, legacy systems with millions of lines of code cannot simply be rewritten for improved scalability. In this paper, we describe the evolution of the multiprocessor software architecture (MP model) employed by the NetApp® Data ONTAP® WAFL® file system as a case study in incrementally scaling a production storage system.

The initial model is based on small-scale data partitioning, whereby user-file reads and writes to disjoint file regions are parallelized. This model is then extended with hierarchical data partitioning to manage concurrent accesses to important file system objects, thus benefiting additional workloads. Finally, we discuss a fine-grained lock-based MP model within the existing data-partitioned architecture to support workloads where data accesses do not map neatly to the predefined partitions. In these data partitioning and lock-based MP models, we have facilitated incremental advances in parallelism without a large-scale code rewrite, a major advantage in the multi-million line WAFL codebase. Our results show that we are able to increase CPU utilization by as much as 104% on a 20-core system, resulting in throughput gains of up to 130%. These results demonstrate the success of the proposed MP models in delivering scalable performance while balancing time-to-market requirements. The models presented can also inform scalable system redesign in other domains.

1 Introduction

To maintain a competitive advantage in the storage market, it is imperative for companies to provide cutting-edge platforms and software to maximize the returns from such systems. Recent technological trends have made this prospect more difficult, as increases in CPU clock speed have been abandoned in favor of increasing core counts. Thus, to achieve continuing performance gains, it has become necessary for storage systems to scale to an ever-higher number of cores. As one of the primary computational bottlenecks in storage systems, the file system itself must be designed for scalable processing.

Due to time-to-market objectives, it is simply not feasible to rewrite the entire code base of a production file system to use a new multiprocessor (MP) model. Reimplementing such a system to use explicit fine-grained locking would require massive code inspection and changes and would also carry with it the potential for introducing races, performance issues caused by locking overhead and contention, and the risk of deadlocks. In this paper, we present a series of techniques that have allowed us to simultaneously meet scalability and schedule requirements in the WAFL file system over the course of a decade and to minimize the code changes required for parallelization. In particular, all of our approaches emphasize incremental parallelization whereby common code paths can be optimized without having to make changes in less critical code paths. Although we evaluate these techniques in a file system, the approaches are not inherently limited to that context.

The first technique we discuss—referred to as Classical Waffinity—applied data partitioning to fixed-size regions of user files. This approach provided a mechanism to allow read and write operations to different ranges of user files to occur in parallel without requiring substantial code rewrite, because the use of data partitioning minimized the need for explicit locking. Extending this model, Hierarchical Waffinity further parallelized operations that modify systemwide data structures or metafiles, such as the creation and deletion of files, by implementing a hierarchical data partitioning infrastructure. Compared to Classical Waffinity, Hierarchical Waffinity improves core usage by up to 38% and achieves 95% average utilization across a range of critical workloads.

Finally, we have extended Hierarchical Waffinity to handle workloads that do not map neatly to these partitions by adding a fine-grained lock-based MP model within the existing data-partitioned architecture. This innovative model—called Hybrid Waffinity—provides significant parallelism benefits for previously problematic access patterns while not requiring any code changes for workloads where Hierarchical Waffinity already excelled. That is, the hybrid model introduces minimal explicit locking in narrowly defined cases to overcome specific scalability limitations in Hierarchical Waffinity to increase core usage by up to 104% and improve throughput by as much as 130%. Using these techniques has allowed WAFL to scale to the highest-end Data ONTAP platforms of their time (up to 20 cores) while constraining modifications to the code base.

The primary contributions of this paper are:

• We present a set of multiprocessor software architectures that facilitate incremental parallelization of large legacy code bases.

• We discuss the application of these techniques within a high-performance, commercial file system.

• We evaluate each of our approaches in the context of a real production storage system running a variety of realistic benchmarks.

Next, we present a short background on WAFL. Sections 3 through 5 present the evolution of the WAFL multiprocessor model through the various steps outlined above, and Section 6 evaluates each of the new models. Section 7 presents related work, and we conclude in Section 8.

2 Background on the WAFL File System

WAFL implements the core file system functionality within the Data ONTAP operating system. WAFL houses and exports multiple file systems called NetApp FlexVol® volumes from within a shared pool of storage called an aggregate and handles file system operations to them. In WAFL, all metadata and user data (including logical units in SAN protocols) is stored in files, called metafiles and user files, respectively. The file system is organized as a tree rooted at the super block. File system operations are dispatched to the WAFL subsystem in the form of messages, with payloads containing pertinent parameters. For detailed descriptions of WAFL, see Hitz et al. [17] and Edwards et al. [13].

Data ONTAP itself was first parallelized by dividing each subsystem into a private domain, where only a single thread from a given domain could execute at a time. For example, domains were created for RAID, networking, storage, file system (WAFL), and the protocols. Communication between domains used message passing. Domains were intentionally defined such that data sharing was rare between the threads of different domains, so this approach allowed scaling to multiple cores with minimal code rewrite, because little locking was required. The file system module executed on a dedicated set of threads in a single domain, such that only a single thread could run at a time. This simplistic model provided sufficient performance because systems at the time had very few cores (e.g., four), so parallelism within the file system was not important. Over time, each of these domains has become parallel, but in this paper we focus on the approaches used to parallelize WAFL.

3 Classical Waffinity

As core counts increased, serialized execution of file system operations became a scalability bottleneck because such operations represented a large fraction of the computational requirements in our system. To provide the initial parallelism in the file system, we implemented a multiprocessor model called Waffinity (for WAFL affinity), the first version of which was called Classical Waffinity and shipped with Data ONTAP 7.2 in 2006.

In Classical Waffinity, the file system message scheduler defined message execution contexts called affinities. User files were then partitioned into file stripes that corresponded to a contiguous range of blocks in the file (approximately 2MB in size), and these were rotated over a set of Stripe affinities. This model ensured that messages operating in different Stripe affinities were guaranteed to be working on different partitions of user files, so they could be safely executed in parallel by threads executing on different cores. In contrast, any two messages that were operating on the same region of a file would be enqueued within the same affinity and would therefore execute sequentially. This data partitioning provided an implicit coarse-grained synchronization that eliminated the need for explicit locking on partitioned objects, thereby greatly reducing the complexity of the programming model and the development cost of parallelizing the file system. Some locking was still required to protect shared global data structures that could be accessed by multiple affinities.
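To make the partitioning concrete, the following sketch shows one way a (file, offset) pair could be mapped to a Stripe affinity so that all messages touching the same stripe serialize on the same queue. The constants, names, and hash function are illustrative assumptions based on the description in this section (roughly 2MB stripes rotated over a small, fixed set of Stripe affinities); they are not WAFL's actual code.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE            4096u   /* 4KB file system blocks                     */
#define STRIPE_BLOCKS          512u   /* ~2MB stripe, expressed in blocks (assumed) */
#define NUM_STRIPE_AFFINITIES   12u   /* tuned value reported later in this section */

/* Index of the stripe containing a given byte offset within a file. */
static uint64_t stripe_index(uint64_t byte_offset)
{
    return (byte_offset / BLOCK_SIZE) / STRIPE_BLOCKS;
}

/*
 * Rotate stripes over the Stripe affinities: the same stripe always maps to
 * the same affinity, so conflicting messages serialize, while different
 * stripes of one file spread over several affinities and can run in parallel.
 */
static unsigned stripe_affinity(uint64_t file_id, uint64_t byte_offset)
{
    return (unsigned)((file_id + stripe_index(byte_offset)) % NUM_STRIPE_AFFINITIES);
}

/* A request confined to one stripe may run in that Stripe affinity;
 * in the Classical model anything else falls back to the Serial affinity. */
static int fits_in_one_stripe(uint64_t offset, uint64_t length)
{
    return length > 0 &&
           stripe_index(offset) == stripe_index(offset + length - 1);
}

int main(void)
{
    uint64_t file = 42, off = 3u * 1024 * 1024, len = 64 * 1024;
    if (fits_in_one_stripe(off, len))
        printf("run in Stripe affinity %u\n", stripe_affinity(file, off));
    else
        printf("cross-stripe request: route to the Serial affinity\n");
    return 0;
}
```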

In Waffinity, we introduced a set of threads to execute the messages in each affinity, and we allowed the thread scheduler to simultaneously run multiple Waffinity threads. The Waffinity scheduler maintained a FIFO list of affinities with work; that is, affinities that had been sent messages that operated within that partition. Any running thread dynamically called into the Waffinity scheduler to be assigned an affinity from which to execute messages. The number of Stripe affinities was empirically tuned to be 12 and the number of Waffinity threads was defined per platform to scale linearly with the number of cores. NetApp storage systems at the time maxed out at 8 cores, which ensured more affinities than threads. Having more affinities than threads and creating a dynamic association between them decreased the likelihood of any thread being unable to find work.

The benefit of this model comes from the fact that most performance-critical messages at the time, such as user file reads and writes, could be executed in Stripe affinities, because they operated within a single user-file stripe. However, reads and writes across file stripes or operations such as Open and Close that touch file system metadata could not be performed in parallel from the Stripe affinities. To handle such cases, we provided a single-threaded execution context—called the Serial affinity—such that when it was scheduled, no Stripe affinities were scheduled and vice versa. This approach is analogous to a Reader-Writer Lock, where the Serial affinity behaves like a writer with exclusive file system access and the Stripe affinities behave like readers with shared access. Therefore, any messages requiring access to data that was unsafe from the Stripe affinities could be assured exclusive access to all file system data structures by executing in the Serial affinity. Use of the Serial affinity excessively serialized many operations; however, it allowed us to incrementally optimize the file system by parallelizing only those messages found to be performance critical. This approach also provided an option whereby unsafe code paths could be dynamically re-sent to the Serial affinity (called a slowpath).
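The slowpath mechanism can be pictured as a handler that bails out and re-enqueues itself when it discovers, mid-execution, that its data accesses are not covered by the affinity it was routed to. The types and the resend routine below are hypothetical stand-ins for the real message infrastructure; only the control flow is the point.

```c
#include <stdbool.h>

/* Illustrative affinity identifiers; the real system has many more. */
typedef enum { AFF_STRIPE, AFF_SERIAL } affinity_t;

struct msg {
    affinity_t affinity;        /* affinity this message is currently queued in */
    bool       needs_metadata;  /* discovered only while executing              */
};

/* Stand-in for re-enqueueing the message to a different affinity's queue. */
static void resend_to_affinity(struct msg *m, affinity_t target)
{
    m->affinity = target;       /* the real code would put m on target's queue */
}

/*
 * Runs in a Stripe affinity.  If the operation turns out to touch data that
 * the Stripe partition does not protect (e.g., file system metadata), it
 * takes the slowpath: stop, re-send to the Serial affinity, and rerun there
 * with exclusive access to the whole file system.
 */
static bool handle_user_io(struct msg *m)
{
    if (m->affinity == AFF_STRIPE && m->needs_metadata) {
        resend_to_affinity(m, AFF_SERIAL);
        return false;           /* not completed; will be re-executed later */
    }
    /* ... all data touched from here on maps to the current affinity ... */
    return true;
}
```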

Classical Waffinity exposed sufficient concurrency to exploit high-end NetApp platform core counts at the time, i.e., 8 cores. That is, further parallelizing file system operations would not have had a major impact on overall performance because the cores were already well used. However, as the number of cores increased, limitations in the partitioning provided by the model led to scalability bottlenecks. For example, an SFS2008 benchmark running on 12 cores saw the Serial affinity in use 48% of the time, as a result of the metadata operations such as Setattr, Create, and Remove. This serialization resulted in considerable core idle time, as evaluated in Section 6.1.1, which reduced potential performance. Further, most of the work remaining could not be moved to a Stripe affinity due to the strict rules of what can run there (i.e., operating within a single user file stripe). With higher core counts, serialized execution had a larger impact on performance, because the speedup achieved through parallelism was limited by the serial time, in accordance with Amdahl's law. Thus, although sufficient at the time, Classical Waffinity as first designed was simply unable to provide the required performance going forward.

Figure 1: The structure of Classical Waffinity with the Serial affinity on top and the twelve Stripe affinities below.

Figure 2: The Hierarchical Waffinity hierarchy rooted at the Serial affinity.

4 Hierarchical Waffinity

Hierarchical Waffinity builds on top of the Classical Waffinity model to enable increased levels of parallelism. This model, which first shipped with Data ONTAP 8.1 in 2011, greatly reduces the increasingly critical single-threaded path by providing a way to parallelize additional work. Further, the model offers insight into protecting hierarchically structured systems in other domains [22] by using data partitioning.

4.1 The Affinity Hierarchy

An alternative view of the Serial and Stripe affinities of Classical Waffinity is as a hierarchy, as shown in Figure 1, where a node is mutually exclusive with its children but can run in parallel with other nodes at the same level. Hierarchical Waffinity facilitates additional parallelism by extending this simple 2-level hierarchy to efficiently coordinate concurrent accesses to other fundamental file system objects beyond data blocks of user files.

Each affinity is associated with certain permissions, such as access to metadata files, that serialized execution in the file system in Classical Waffinity (Figure 2 and Table 1). The new affinity scheduler enforces execution exclusivity between a given affinity and its children, so Hierarchical Waffinity only restricts the execution of an affinity's ancestors and descendants (hierarchy parents and children, respectively); all other affinities can safely run in parallel. For example, if the Volume Logical affinity is running, then its Stripe affinities are excluded along with its parent Volume, Aggregate, and Serial affinities.
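This exclusion rule can be captured with a parent pointer, a running flag, and a count of running descendants per node, as in the sketch below. This is an illustrative data structure rather than the WAFL scheduler; in particular, the real scheduler performs this bookkeeping under its own synchronization, which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* One node in the affinity hierarchy (Serial at the root). */
struct affinity {
    const char      *name;
    struct affinity *parent;
    int              running;        /* is this affinity currently scheduled? */
    int              running_below;  /* count of running descendants          */
};

/* An affinity is runnable only if no ancestor and no descendant is running. */
static bool affinity_can_run(const struct affinity *a)
{
    if (a->running || a->running_below)
        return false;
    for (const struct affinity *p = a->parent; p != NULL; p = p->parent)
        if (p->running)
            return false;
    return true;
}

/* Bookkeeping when an affinity starts or stops execution. */
static void affinity_start(struct affinity *a)
{
    a->running = 1;
    for (struct affinity *p = a->parent; p != NULL; p = p->parent)
        p->running_below++;
}

static void affinity_stop(struct affinity *a)
{
    a->running = 0;
    for (struct affinity *p = a->parent; p != NULL; p = p->parent)
        p->running_below--;
}
```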


This design ensures that no two messages with conflicting data accesses run concurrently, because they would run in affinities that exclude one another. Hierarchical Waffinity is analogous to a hierarchy of Reader-Writer Locks, where running in any affinity acquires the lock as a writer and thus prevents any readers (i.e., descendants) from running concurrently, and vice versa.

The WAFL file system is itself hierarchical (i.e., buffers within inodes within FlexVol volumes within aggregates), making Hierarchical Waffinity a natural fit. Knowledge of the specific data access patterns that are most common in WAFL informed the decision of affinity layout in the hierarchy and the mapping of specific object accesses to those affinities.

As a general rule, client-facing data, such as user files, logical units, and directories, is mapped to the Volume Logical branch of the hierarchy and internal metadata is mapped to Volume VBN—so named because it is typically indexed by volume block number (VBN). This allows parallelism between client-facing and internal operations within a single volume. The Aggregate, Volume, Stripe, and Range affinity types have multiple instances, which allows parallel execution of messages operating on disjoint data, such as any two operations in different aggregates, FlexVol volumes, or regions of blocks in a file. Each file system object is mapped to a particular instance in the affinity hierarchy, based on its location in the file system. The number of instances of each affinity as well as the mapping of objects to instances can be adapted to the observed workload to maximize performance. Further, new affinity types can be (and have been) added to the hierarchy over time in response to new workloads and data access patterns.

4.2 Affinity Access Rules

The affinity permissions required by a message are determined by the type of objects being accessed and the access types. We used knowledge of the system to derive affinity permissions that allowed performance-critical operations to run with the most parallelism. In WAFL, any access is associated with a specific affinity, using the rules shown in Table 1. Exclusive access to an object ensures that no concurrently running affinities can access that object. The fundamental objects that are protected by data partitioning in Waffinity are buffers, files, FlexVol volumes, and aggregates. Accesses to other object types are infrequent but are protected with fine-grained locking, and deadlock is prevented through the use of lock hierarchies.

Each FlexVol volume and aggregate is mapped to a Volume and Aggregate affinity when it comes online. In the WAFL file system, files are represented by inodes, which are mapped to an affinity within the hierarchy of the volume or aggregate in which they reside. Inodes can be accessed in either exclusive or shared mode. In exclusive mode, only one message has access to the inode and consequently can change any inode property or free the inode. In shared mode, the inode's fundamental properties remain read-only, but the majority of an inode's fields can be modified and are protected from concurrent accesses by fine-grained locking. Similar distinctions are in place for FlexVol volumes and aggregates.

Blocks are represented in memory by a buffer header and a 4KB payload. Details of the WAFL buffer cache architecture have previously been published [11]. Accesses to buffers fall into the following four categories:

• Insert: A new buffer object is allocated and the payload is read in from disk. The buffer becomes associated with the corresponding inode.

• Read: The payload of an in-memory buffer is read. No disk I/O is required in this case.

• Write: The payload of the in-memory buffer is modified in memory. The buffer header is also modified to indicate that it is dirty.

• Eject: The buffer is evicted from memory.

Each of these access modes is mapped to a specific affinity instance for a given buffer. For Write, Insert, and Eject accesses, our data partitioning model requires that only a single operation perform any of these accesses at a time. Thus, operations must run in the designated affinity for a given access mode or its ancestor. On the other hand, read accesses are safe in parallel with each other, but it is necessary to ensure that the buffer will not be ejected underneath the reader, so a read can run in any ancestor or descendant of the Eject affinity. Typically, Write and Eject access are equivalent, which similarly prevents concurrent reads and writes. The affinity mappings for buffers are chosen to allow maximum parallelism, while ensuring access to the buffers from all necessary affinities. For example, user file reads and writes run in the Stripe affinities because the relevant buffers map to these affinities.
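Stated as code, the buffer rules look roughly like the check below: Insert, Write, and Eject must come from the buffer's designated affinity for that mode or an ancestor of it, while a Read only needs to be mutually exclusive with the Eject affinity, i.e., run in an ancestor or a descendant of it. The types and the per-buffer mapping structure are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACC_INSERT, ACC_READ, ACC_WRITE, ACC_EJECT } buf_access_t;

struct affinity {
    struct affinity *parent;
};

/* Hypothetical per-buffer mapping of each access mode to an affinity. */
struct buf_affinity_map {
    struct affinity *insert_aff;
    struct affinity *write_aff;
    struct affinity *eject_aff;
};

/* True if 'maybe_anc' is 'a' itself or one of its ancestors. */
static bool is_ancestor_or_self(const struct affinity *maybe_anc,
                                const struct affinity *a)
{
    for (; a != NULL; a = a->parent)
        if (a == maybe_anc)
            return true;
    return false;
}

/*
 * May a message running in 'cur' perform 'mode' on a buffer?  Insert, Write,
 * and Eject require running in the designated affinity or an ancestor of it;
 * Read only requires that the buffer cannot be ejected underneath us, i.e.,
 * 'cur' is an ancestor or descendant of the Eject affinity (the scheduler
 * keeps the other one from running concurrently).
 */
static bool access_allowed(const struct affinity *cur,
                           const struct buf_affinity_map *m,
                           buf_access_t mode)
{
    switch (mode) {
    case ACC_INSERT: return is_ancestor_or_self(cur, m->insert_aff);
    case ACC_WRITE:  return is_ancestor_or_self(cur, m->write_aff);
    case ACC_EJECT:  return is_ancestor_or_self(cur, m->eject_aff);
    case ACC_READ:   return is_ancestor_or_self(cur, m->eject_aff) ||
                            is_ancestor_or_self(m->eject_aff, cur);
    }
    return false;
}
```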

4.3 Mapping Operations to Affinities

Messages are sent to a predetermined affinity based on the required permissions of the operation, as identified by the software developer. If a message requires the permissions assigned to multiple affinities, then it is directed to an affinity that is an ancestor of each.


Stripe: Provides exclusive access to a predefined, distinct set of blocks. Enables concurrent access to sub-file-level user data.
Volume Logical: Provides exclusive access to most client-visible files (and directories) within a volume and to certain file system internal metadata.
Volume VBN: Provides exclusive access to most file system metadata, such as those files that track block usage. Such files are typically indexed by volume block number (VBN).
Volume VBN Range: Provides access to specific ranges of blocks within files belonging to Volume VBN.
Volume: Provides exclusive access to all files within a volume as well as per-volume metadata that lives in the containing aggregate.
Aggregate VBN: Similarly to Volume VBN, provides access to most file system metadata in an aggregate.
Aggregate VBN Range: Provides access to specific ranges of blocks within files belonging to Aggregate VBN.
Aggregate: Provides exclusive access to files in an aggregate.
Serial: Inherits the access rights of all other affinities and provides access to other global data.

Table 1: The affinities and the access rights that they provide.

For example, an operation that requires privileges assigned to a particular Stripe affinity and to a Volume VBN affinity could safely execute in the Volume affinity. As in Classical Waffinity, if a message is routed to a particular affinity and later determines that it requires additional permissions, it may be dynamically re-sent (i.e., slowpath) to a coarser (i.e., less parallel) affinity that provides the necessary access rights. Object accesses are achieved through a limited set of APIs that we have updated to enforce the required affinity rules, including slowpathing if necessary. Thus, the programming model helps ensure code correctness. The more access rights required by a message, the higher in the affinity hierarchy it must run and the more it limits parallelism by excluding a larger number of affinities from executing. Thus, it is preferable to run in as fine an affinity as possible. For this reason, we have selected data mappings that allow the most common operations to be mapped to fine affinities, and objects that are frequently accessed together are given similar mappings.
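Routing a message that needs several permissions therefore reduces to finding the lowest common ancestor of the required affinities in the hierarchy; that is the finest affinity that excludes all of them. A minimal sketch, assuming each affinity node stores its parent and depth:

```c
#include <stddef.h>

struct affinity {
    struct affinity *parent;
    int              depth;   /* root (Serial) has depth 0 */
};

/*
 * The lowest common ancestor of two affinities is the finest affinity whose
 * exclusion covers both of them.  Climb the deeper node until the two meet.
 */
static struct affinity *lowest_common_ancestor(struct affinity *a,
                                               struct affinity *b)
{
    while (a != b) {
        if (a->depth >= b->depth)
            a = a->parent;
        else
            b = b->parent;
    }
    return a;
}

/* Fold over all required permissions (n >= 1); the result is the destination. */
static struct affinity *destination_affinity(struct affinity **required, int n)
{
    struct affinity *dst = required[0];
    for (int i = 1; i < n; i++)
        dst = lowest_common_ancestor(dst, required[i]);
    return dst;
}
```

For the example above, folding a Stripe affinity with a Volume VBN affinity of the same volume yields their common ancestor, the Volume affinity.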

Figure 3 shows several example affinity mappings of operations under a single Volume affinity. For simplicity, we assume a file stripe size of 100 contiguous user blocks, or 10 blocks in metafiles. Reads and writes within a file stripe map to a Stripe affinity. However, a user file deletion (“Remove: A”) runs in the Volume Logical affinity, because it requires exclusive access to that user file, as does a write operation that spans multiple file stripes (“W: A[100..200]”). Running in this affinity prevents the execution of any reads or writes to blocks in that file. However, reads and writes to files in other volumes are not impacted, nor are operations on file system metadata within the same volume. Note that “W: A, MD” must run in Volume affinity to write to both a user file and metafile.

Figure 3: Example affinity mappings of Read (“R”), Write (“W”), Remove, and Create operations to user files A and B, metadata file MD, and FlexVol V. A block offset of 100 within file A is denoted as A[100].

As noted earlier, key message parameters are tracked in the message payload. When a message is sent into WAFL, its payload is inspected to determine the type of the message and other data from which the destination affinity is calculated. Effectively, each message exposes the details of its data accesses to the scheduler so that MP safety can be enforced at this level, similar to some language constructs for task-based systems [3, 34]. For example, a write message exposes the offsets being written and thus the required affinity for execution.

4.4 Development Experience with Hierarchical Waffinity

Hierarchical Waffinity allows parallelization at the granularity of a message type, running all other message types in the Serial affinity, allowing incremental optimization over time. Thus, parallelization effort scales with the size of the message, rather than requiring the entire code base to be updated at once. A typical message handler is on the order of hundreds or thousands of lines of code, whereas the file system is on the order of millions. Further, a message can first be parallelized into one affinity and later moved to a finer affinity when need arises and the extra development cost can be justified.

Messages often require only minimal code changes during parallelization because data access guarantees are provided by the model. In such cases, after a detailed line-by-line code inspection to evaluate multiprocessor safety, messages can be moved into fine affinities just by changing the routing logic that computes the target affinity. For example, parallelizing the Link operation to run in the Volume Logical affinity required fewer than 20 lines of code to be written, none of which were explicit synchronization. In other cases, major changes are required to safely operate within a single affinity, for example by restructuring data accesses, thus requiring potentially thousands of lines of code changes. In WAFL, the underlying infrastructure required to implement the affinities, scheduling, and rule enforcement amounted to approximately 22K lines of code. Compared to the alternative of migrating the whole file system to fine-grained locking, which would involve inspecting and updating a large fraction of the millions of lines of code, these costs are relatively small.

Software systems in many domains employ hierarchical data structures (such as linear algebra [12] and computational electromagnetics [14]), and a variety of techniques have been developed to provide multiprocessor safety in such cases [15, 22]. Hierarchical Waffinity offers an alternative architecture that is capable of incremental parallelization. In applying this approach to other systems, the types of affinities to create, the number of instances of each type, and the mapping of objects to affinities would be based on domain-specific knowledge of the data access patterns. In practice, this approach applies most naturally to message passing systems where a subset of message handlers could be parallelized while leaving others serialized, or alternatively to task-based systems such as Cilk++ [26].

4.5 Hierarchical Scheduler

The Hierarchical Waffinity scheduler is an extension of the Classical Waffinity scheduler that maps the now greater set of runnable affinities to the Waffinity threads for execution while enforcing the hierarchical exclusion rules. As before, Waffinity threads are exposed to the general CPU scheduler, and when they are selected for execution, they begin by calling into the Waffinity scheduler for work. A running thread then begins processing the messages queued up to that affinity for the duration of an assigned quantum, after which the thread calls back into the scheduler to request another affinity to run.

As messages are sent into WAFL and processed by the Waffinity threads, the Waffinity scheduler tracks the sets of affinities that are runnable (i.e., with work and not excluded), running, or excluded by other executing affinities. When threads request work, the scheduler selects an affinity from the runnable list, assigns it to the thread, and updates the scheduler state to reflect the newly running affinity. In particular, the scheduler maintains a queue of affinities that is walked in FIFO order to find an affinity that is not excluded. An excess of work in coarse affinities manifests in the scheduler as a shortage of runnable affinities, resulting in situations where available threads cannot find work to do and must sit idle, which in turn results in wasted CPU cycles. We also track the number of runnable affinities and ensure that the optimal number of threads are in flight any time an affinity begins or ends execution. To prevent the starvation of coarse affinities, we periodically drain all running affinities and ensure that all affinities run with some regularity. Figure 4 shows a sample scheduler state. In this example, there are seven affinities currently running and five more that can be selected for execution by the affinity scheduler.
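The scheduling decision itself can be sketched as a walk over a FIFO of affinities that have pending messages, skipping any affinity excluded by a running ancestor or descendant. The structures below are illustrative; the real scheduler additionally handles quanta, thread counts, and the periodic drain described above, all of which are omitted.

```c
#include <stddef.h>

/* Affinity node with the exclusion bookkeeping from the earlier sketch. */
struct affinity {
    struct affinity *parent;
    int              running;
    int              running_below;     /* running descendants             */
    struct affinity *next_runnable;     /* FIFO link: affinities with work */
};

struct waffinity_sched {
    struct affinity *runnable_head;     /* FIFO of affinities with queued messages */
};

static int excluded(const struct affinity *a)
{
    if (a->running || a->running_below)
        return 1;
    for (const struct affinity *p = a->parent; p; p = p->parent)
        if (p->running)
            return 1;
    return 0;
}

/*
 * Called by a Waffinity thread asking for work; the scheduler's own lock is
 * assumed to be held by the caller.  Returns the first runnable affinity in
 * FIFO order, marks it running, and unlinks it; the thread then processes
 * that affinity's messages for a quantum and calls back in.
 */
static struct affinity *pick_affinity(struct waffinity_sched *s)
{
    for (struct affinity **link = &s->runnable_head; *link;
         link = &(*link)->next_runnable) {
        struct affinity *a = *link;
        if (excluded(a))
            continue;
        *link = a->next_runnable;       /* unlink from the runnable FIFO */
        a->next_runnable = NULL;
        a->running = 1;
        for (struct affinity *p = a->parent; p; p = p->parent)
            p->running_below++;         /* now excludes its ancestors    */
        return a;
    }
    return NULL;                        /* nothing runnable: thread goes idle */
}
```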

4.6 Waffinity Limitations and Alternatives

Fundamental in the Waffinity architecture is a mapping of file system objects to a finite set of affinities, often causing independent operations to unnecessarily become serialized. For example, any two operations that require exclusive access to two user files in the same volume (such as deletion) are serialized. As long as sufficient parallelism is found to exploit available cores on the target platforms, this limitation is acceptable in the sense that increasing available parallelism will not result in increased performance. Two other scenarios that can result in significant performance loss are 1) when frequently accessed objects directly map to a coarse affinity; and 2) when two objects mapped to different affinities must be accessed by the same operation. These scenarios are not well handled in Hierarchical Waffinity and result in poor scaling in several important workloads, as shown in Section 6.

An alternative to the Waffinity architecture would be to use fine-grained locking to provide MP safety. In such an approach, all file system objects would be protected by using traditional locking, and no limits need to be imposed on which operations can be executed in parallel.


Figure 4: Sample hierarchical affinity scheduler state.

This would provide additional flexibility in the programming model; however, it carries many drawbacks that led to the decision to implement Classical and Hierarchical Waffinity. Reimplementing the file system to exclusively use fine-grained locks would require a massive code rewrite and would carry with it the potential for introducing races, performance issues caused by locking overhead and contention, and the risk of deadlocks. Instead, our approach reduces the overall amount of locking required, and even allows messages to run in coarse affinities without locking.

5 Hybrid Waffinity

The next step in the multiprocessor evolution of WAFL is Hybrid Waffinity, which shipped with Data ONTAP 9.0 in 2016. This model is designed to scale workloads for which Hierarchical Waffinity is not well suited, due to a poor mapping of data accesses to affinities, as discussed in Section 4.6. This model supports fine-grained locking within the existing hierarchical data partitioned architecture to protect particular objects when accesses do not map neatly to fine partitions. At the same time, we continue to use partitioning where it already excels, such as for user file reads and writes. Overall, Hybrid Waffinity leverages both fine-grained locking and data partitioning in cases where each approach excels. Although this may seem like an about-face from the data partitioned models, in fact it is merely an acknowledgement that in some cases fine-grained locking is required for effective scaling. Retaining partition-based protection in most cases is critical so that only the code that scales poorly with data partitioning needs to be updated.

In Hybrid Waffinity, we allow buffers in a few select metafiles to be protected by locking, while the vast majority of buffers, as well as all other file system objects, continue to use data partitioning for protection. The result is a hybrid model of MP-safety where different buffer types have different protection mechanisms. The use of locking allows these buffers to be accessed from finer affinities; however, we continue to allow lock-free access from a coarser affinity. That is, because all object accesses occur within some affinity subtree, a message running in the root of that subtree can safely access the object without locking.

5.1 Hybrid-Insert

With Hierarchical Waffinity, each buffer is associated with a specific Insert affinity that protects the steps involved in inserting that buffer. This mandates that all messages working on inserting the buffer will run in the same affinity to be serialized. This design is effective in the common case where a single buffer is accessed or multiple buffers with similar affinity mappings are accessed. However, in cases where a message accesses multiple buffers in different partitions, the message must run in a coarser affinity that provides all of the necessary permissions. Figure 5 illustrates a scenario in which both User file buffers and Metafile buffers must be accessed, which happens when replicating a FlexVol volume (discussed in Section 6.2.3). This operation must run in the AGGR1 affinity, rather than the more parallel S1 or AVR1 affinity. In such cases, parallelism in the system is reduced, because time spent running in coarse affinities limits the available affinities that can be run concurrently, potentially starving threads and cores of work.

We overcome these shortcomings with Insert by allowing Hybrid-Insert access to certain buffers from multiple fine affinities. Only a few buffer types are frequently accessed in tandem with other buffers, and we apply Hybrid-Insert only in such cases. Thus, a message accessing two buffers now runs in the traditional (fine) Insert affinity of one buffer and protects the second buffer by using Hybrid-Insert, rather than in a coarse affinity with Insert access to both buffers. Allowing multiple affinities to insert a buffer means that two messages can simultaneously insert the same buffer, but Hybrid-Insert resolves such races and synchronizes all callers of any insert code path.


Figure 5: Hierarchical Waffinity model with User data access in S1 and Metadata access in AVR1.

Figure 6: Hybrid Waffinity model where User data access continues to be in S1, but Metadata access is permitted from any descendant of AGGR1.

Referring back to Figure 5, the User data can continue to be protected with partitioning in S1 and the Metadata can now be accessed from the S1 affinity (for example) with explicit synchronization, as shown in Figure 6. Further, noncritical messages that operate on Metadata can run in AGGR1 without being rewritten, because the scheduler will not run any other messages that access this data.

For explicit synchronization, Hybrid-Insert uses what we refer to as MP-barriers. MP-barriers employ a set of spin-locks that are hashed based on buffer properties. The insert process consists of a series of critical sections of varying lengths. Short critical sections can simply hold the spin-lock for their duration. However, longer critical sections avoid holding the lock for long periods by instead stamping the buffer with an in-progress flag under the lock at the beginning of the critical section and clearing it at the end. Other messages that encounter the in-progress flag can then block, knowing that a message is already moving this buffer toward insertion.
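A minimal sketch of an MP-barrier is shown below: a small table of spin-locks hashed by buffer identity, with an in-progress flag covering the long-running portion of the insert. Everything here (names, table size, hash, the yield-based wait) is an assumption for illustration; the production code uses its own blocking primitives.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_BARRIER_BUCKETS 64               /* illustrative table size */

/* Hash table of spin-locks keyed by buffer identity ("MP-barriers"). */
static pthread_spinlock_t mp_barrier[MP_BARRIER_BUCKETS];

static void mp_barrier_init(void)
{
    for (int i = 0; i < MP_BARRIER_BUCKETS; i++)
        pthread_spin_init(&mp_barrier[i], PTHREAD_PROCESS_PRIVATE);
}

struct wafl_buf {
    uint64_t file_id;
    uint64_t block_no;
    bool     insert_in_progress;            /* set and cleared under the MP-barrier */
};

static pthread_spinlock_t *buf_barrier(const struct wafl_buf *b)
{
    uint64_t h = b->file_id * 1000003u + b->block_no;
    return &mp_barrier[h % MP_BARRIER_BUCKETS];
}

/*
 * Start the long-running portion of an insert.  Rather than holding the
 * spin-lock across the disk read, the winner stamps the buffer in-progress
 * under the lock; a loser sees the flag, backs off, and waits.
 */
static bool insert_begin(struct wafl_buf *b)
{
    pthread_spinlock_t *l = buf_barrier(b);
    pthread_spin_lock(l);
    if (b->insert_in_progress) {            /* another affinity is already inserting */
        pthread_spin_unlock(l);
        return false;
    }
    b->insert_in_progress = true;
    pthread_spin_unlock(l);
    return true;                            /* caller owns the insert */
}

static void insert_end(struct wafl_buf *b)
{
    pthread_spinlock_t *l = buf_barrier(b);
    pthread_spin_lock(l);
    b->insert_in_progress = false;
    pthread_spin_unlock(l);
}

/* Losers block until the winner finishes; yielding stands in for the real
 * blocking primitive used by the message scheduler. */
static void insert_wait(const struct wafl_buf *b)
{
    pthread_spinlock_t *l = buf_barrier(b);
    for (;;) {
        pthread_spin_lock(l);
        bool busy = b->insert_in_progress;
        pthread_spin_unlock(l);
        if (!busy)
            return;
        sched_yield();
    }
}
```

A caller that gets false from insert_begin() would call insert_wait() and then find the buffer already inserted by the winner of the race.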

5.2 Hybrid-Write

The next buffer access mode we consider is Write, which involves changing the contents of a buffer and updating associated metadata. Hierarchical Waffinity serializes all readers and writers to the same buffer by mapping all such operations to the same affinity. Thus, Write access made it safe for writers to modify the buffer without locking and for readers to know that no writes were happening concurrently to the buffer. However, as with Insert access, messages requiring access to multiple buffers with different affinity mappings needed to run in a coarse affinity, thereby limiting parallelism.

Hybrid-Write allows writes to certain types of buffers from multiple affinities so that messages routed to an affinity for Write access to one buffer will also be able to access a different buffer by using Hybrid-Write, again as in Figure 6. Thus, readers and writers must synchronize by using fine-grained locking to ensure MP-safety and data consistency. In the new model, we have retained the traditional Write affinity in which an operation can run without locking. Read access has been redefined for Hybrid-Write buffer types such that it now maps to the traditional Write affinity, thereby providing (slow) read access to the buffer without any locking. New access modes called Shared-Write and Shared-Read have been added to provide access from finer affinities, and only by explicitly using these access modes—and thus implicitly agreeing to add the necessary locking—is any additional parallelism achieved. Thus, Hybrid-Write uses an opt-in model wherein legacy code remains correct by default until it is manually optimized.

Hybrid-Write uses spin-locks to protect buffer data and metadata. Spin-locks are sufficient here because the critical sections for reads and writes are typically small. The introduction of fine-grained locking increases complexity; however, it can be done incrementally as required by specific messages without a large-scale code rewrite. As an example of the increased complexity, buffer state observed at any time in Hierarchical Waffinity could always be trusted because the entire message execution was under the implicit locking of the scheduler. In contrast, buffer state observed inside an explicit critical section can no longer be trusted once the lock is released.
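The opt-in nature of Hybrid-Write can be illustrated as two code paths over the same buffer: the legacy path, which runs in the buffer's traditional Write affinity and therefore needs no locking, and the Shared-Write/Shared-Read paths, which run in finer affinities and must take a per-buffer spin-lock. The structure and function names are hypothetical; the point is that the traditional Write affinity is the root of the subtree from which Shared accesses are permitted, so the scheduler still keeps the lock-free path from racing with the locked ones.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* A cached 4KB block with a per-buffer lock for hybrid access (illustrative). */
struct wafl_buf {
    pthread_spinlock_t lock;   /* protects data and dirty for Shared-* access */
    int                dirty;
    uint8_t            data[4096];
};

static void buf_init(struct wafl_buf *b)
{
    pthread_spin_init(&b->lock, PTHREAD_PROCESS_PRIVATE);
    b->dirty = 0;
}

/*
 * Legacy path: the message runs in the buffer's traditional Write affinity,
 * so the scheduler already excludes every other reader and writer and no
 * locking is needed.  Unmodified code keeps using this path and stays correct.
 */
static void write_exclusive(struct wafl_buf *b, size_t off, const void *src, size_t len)
{
    memcpy(&b->data[off], src, len);
    b->dirty = 1;
}

/*
 * Opt-in Shared-Write path: the message runs in a finer affinity, so it must
 * take the per-buffer lock; Shared-Read does the same so readers never see a
 * half-applied update.
 */
static void shared_write(struct wafl_buf *b, size_t off, const void *src, size_t len)
{
    pthread_spin_lock(&b->lock);
    memcpy(&b->data[off], src, len);
    b->dirty = 1;
    pthread_spin_unlock(&b->lock);
}

static void shared_read(struct wafl_buf *b, size_t off, void *dst, size_t len)
{
    pthread_spin_lock(&b->lock);
    memcpy(dst, &b->data[off], len);
    pthread_spin_unlock(&b->lock);
}
```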

5.3 Hybrid-Eject

The final buffer access mode is Eject, which provides exclusive access and allows arbitrary updates to the buffer, including evicting the buffer from memory. Thus, Eject access maps to an affinity that excludes all affinities with access to this buffer. Hybrid-Eject instead uses fine-grained locking to provide exclusive access from a finer affinity. Unlike Hybrid-Insert and Hybrid-Write, we compute a single Hybrid-Eject affinity for each buffer to serialize all code paths. While Hybrid-Eject could always be used in place of Hybrid-Write, the semantics of Hybrid-Eject are more restrictive and would limit performance.


Figure 7: Waffinity model where traditional Eject access maps to AGGR1, but Hybrid-Eject maps to a specific fine affinity S1.

We have also retained the traditional Eject affinity to minimize required code changes. Figure 7 shows a sample hierarchy with Eject affinity in AGGR1 and Hybrid-Eject in S1.

Simply protecting each buffer with a spin-lock would not be feasible for Hybrid-Eject because messages can read many buffers, each of which must be protected from ejection. Instead, we track a global serialization count that is incremented at periodic serialization points within the file system. Whenever a reference to a buffer is taken, it is stamped with the current serialization count under a spin-lock and is said to have an active stamp. Buffers with active stamps cannot be evicted and are implicitly unlocked when the serialization count is next incremented. Because message execution cannot span serialization points, buffers with stale stamps can be safely ejected. Preventing the ejection of a buffer for this duration is excessive but practical, since only 0.002% of buffers considered for ejection had an active stamp during an experiment with heavy load on a high-end platform.

To extend this infrastructure beyond buffer ejection, we also define an exclusive stamp that prevents any concurrent access at all. A message requiring exclusive access can simply put this value on the buffer and all subsequent accesses to the buffer spin until the exclusive stamp has been cleared. Spinning is acceptable in practice, since only 0.007% of exclusive stamp attempts encountered an active stamp in an experiment with heavy load.
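The stamping scheme might look like the sketch below: a global serialization count, a per-buffer stamp taken under a spin-lock, an eject check that only admits stale stamps, and a sentinel value for the exclusive stamp that makes later accessors spin. The names and the sentinel encoding are assumptions; the real system advances the count at file system serialization points where no messages are in flight, which is not shown.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

#define STAMP_EXCLUSIVE UINT64_MAX        /* sentinel: exclusive access held */

/* Advanced at periodic file system serialization points (not shown). */
static uint64_t serialization_count = 1;

struct wafl_buf {
    pthread_spinlock_t lock;
    uint64_t           stamp;             /* count at the last reference, 0 = none */
};

static void buf_init(struct wafl_buf *b)
{
    pthread_spin_init(&b->lock, PTHREAD_PROCESS_PRIVATE);
    b->stamp = 0;
}

/* Take a reference that pins the buffer until the next serialization point. */
static void buf_stamp_active(struct wafl_buf *b)
{
    for (;;) {
        pthread_spin_lock(&b->lock);
        if (b->stamp != STAMP_EXCLUSIVE) {
            b->stamp = serialization_count;     /* active stamp */
            pthread_spin_unlock(&b->lock);
            return;
        }
        pthread_spin_unlock(&b->lock);          /* exclusive stamp present: spin */
        sched_yield();
    }
}

/* Eviction is allowed only when the stamp is stale (taken before the current
 * serialization point) and no exclusive stamp is present. */
static bool buf_can_eject(struct wafl_buf *b)
{
    pthread_spin_lock(&b->lock);
    bool ok = b->stamp != STAMP_EXCLUSIVE && b->stamp < serialization_count;
    pthread_spin_unlock(&b->lock);
    return ok;
}

/* Exclusive stamp: later accessors spin in buf_stamp_active() until cleared. */
static void buf_stamp_exclusive(struct wafl_buf *b)
{
    for (;;) {
        pthread_spin_lock(&b->lock);
        if (b->stamp < serialization_count) {   /* no active reference */
            b->stamp = STAMP_EXCLUSIVE;
            pthread_spin_unlock(&b->lock);
            return;
        }
        pthread_spin_unlock(&b->lock);
        sched_yield();
    }
}

static void buf_clear_exclusive(struct wafl_buf *b)
{
    pthread_spin_lock(&b->lock);
    b->stamp = 0;                               /* stale again: accessible, ejectable */
    pthread_spin_unlock(&b->lock);
}
```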

5.4 Development Experience with Hybrid Waffinity

Adopting Hybrid Waffinity within WAFL involved creating the underlying infrastructure and then parallelizing individual messages to take advantage of it. For each of the access modes, we required approximately 3K lines of code changes, which included extensive rule enforcement and checking. As in the case of Hierarchical Waffinity, the effort involved in each specific message parallelization varied widely. Leveraging Hybrid-Insert and Hybrid-Eject requires few code changes because the infrastructure is primarily embedded within existing APIs. Hybrid-Write, on the other hand, requires more code changes due to the addition of fine-grained locking throughout the message handler. For example, all three messages that we optimized using Hybrid-Eject required fewer than 20 lines of code changes. In contrast, two messages parallelized using Hybrid-Insert and Hybrid-Write required a few thousand lines of changes.

We have already begun to apply this technique to other objects within WAFL. In particular, a project is under way to further parallelize access to certain inodes in WAFL by using Hybrid Waffinity, and we have found the code changes to be relatively modest. We are optimistic that the ease with which this technique was applied to inodes will translate to other software systems. The techniques of Hybrid Waffinity can potentially be applied in any data partitioned system, not only those that are hierarchically arranged. Prior work has discussed the difficulty in operating on data from different partitions [21, 38], and our approaches can be used in such cases to improve parallelism. For example, consider a scientific code operating on two arrays. Tasks could be divided up based on a partitioning of one array, and access to the other array could be protected through fine-grained locking.
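As a sketch of that last suggestion, the loop below partitions one array statically across threads (so its elements need no locking, mirroring Waffinity's partitions) and protects scattered updates to a second array with a small table of hashed locks (the Hybrid-style escape hatch). All names and sizes are made up for illustration.

```c
#include <pthread.h>

#define N_ELEMS   (1 << 20)
#define N_THREADS 4
#define N_LOCKS   64                  /* striped locks protecting array B */

static double A[N_ELEMS];             /* partitioned: each thread owns a slice  */
static double B[N_LOCKS * 16];        /* shared: updated at data-dependent bins */
static pthread_mutex_t b_lock[N_LOCKS];

struct slice { int begin, end; };

static void *worker(void *arg)
{
    struct slice *s = arg;
    for (int i = s->begin; i < s->end; i++) {
        A[i] *= 2.0;                          /* partition-private: no lock   */
        int bin = i % (N_LOCKS * 16);         /* data-dependent target in B   */
        pthread_mutex_lock(&b_lock[bin % N_LOCKS]);
        B[bin] += A[i];                       /* cross-partition: take a lock */
        pthread_mutex_unlock(&b_lock[bin % N_LOCKS]);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    struct slice sl[N_THREADS];
    for (int i = 0; i < N_LOCKS; i++)
        pthread_mutex_init(&b_lock[i], NULL);
    for (int t = 0; t < N_THREADS; t++) {
        sl[t].begin = t * (N_ELEMS / N_THREADS);
        sl[t].end   = (t + 1) * (N_ELEMS / N_THREADS);
        pthread_create(&tid[t], NULL, worker, &sl[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```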

6 Performance Analysis

6.1 Hierarchical Waffinity Evaluation

To highlight the improvements provided by Hierarchical Waffinity, we chose two performance benchmarks that emphasize the limitations of Classical Waffinity. Hierarchical Waffinity was released in 2011 as part of Data ONTAP 8.1, and we used this software to evaluate its benefits. In this section, the benchmarks were run on a 12-core experimental platform that was the highest-end Data ONTAP platform available at the time this feature shipped. Many changes were made in Data ONTAP between releases, so we cannot directly compare the approaches. Instead, we used an instrumented kernel that runs messages in the same affinities as would Classical Waffinity for our baseline to isolate the impact of our changes. We used multiple FlexVol volumes in these experiments because this is representative of the majority of customer setups.


Figure 8: Throughput and core usage improvements with Hierarchical Waffinity compared to Classical Waffinity on a Spec SFS2008 workload.

6.1.1 Spec SFS2008

We first measured performance using the Spec SFS2008 benchmark [35] using NFSv3 on 64 FlexVol volumes. The benchmark generates a mixed workload that simulates a “typical” file server, including Read, Write, Getattr, Lookup, Readdir, Create, Remove, and Setattr operations. Several of these operations involve modification of file attributes, so the corresponding messages were forced to run in the Serial affinity in the Classical Waffinity model. Hierarchical Waffinity allowed us to move Create and Setattr into the Volume Logical affinity and Remove to the Volume affinity. Since there are eight Volume affinities, up to eight Create/Setattr/Remove operations can now run in parallel with each other and can also run in parallel with the remaining client operations in Stripe affinities.

Figure 8 shows the throughput and core usage improvements of Hierarchical Waffinity over Classical Waffinity for SFS2008. As noted, in Classical Waffinity many operations ran in the Serial affinity, thereby serializing all file system operations and resulting in idle cores. In the baseline, the Serial affinity was busy 48% of the time, thus limiting parallel execution to only 52% of the time. In contrast, by providing additional levels of parallelism, the hierarchical model was able to reduce Serial affinity usage to 9%. Alleviating this significant scalability bottleneck increased the system-wide core usage by 2.59 out of 12 cores, showing that parallelism is significantly improved through the ability to process operations on metadata in parallel with each other and with reads and writes. Most importantly, the additional core usage successfully translated into a 23% increase in throughput. That is, Hierarchical Waffinity effectively exploits the additional processing bandwidth to improve performance.

6.1.2 Random Overwrite

We next evaluate the benefit of Hierarchical Waffinity on a 64 FlexVol volume random overwrite workload.

Figure 9: Throughput and core usage improvements with Hierarchical Waffinity compared to Classical Waffinity on a random overwrite workload.

Block overwrites in the file system are particularly interesting because WAFL always writes data to new blocks on disk. Thus, for each block overwritten, the previously used block on disk must be freed and the corresponding file system metadata tracking block usage must be updated. It is not the write operation itself that is of interest in this experiment, because that was parallelized even in Classical Waffinity. This benchmark instead demonstrates the gains from parallelizing block free operations that used to run in the Serial affinity, because they involve updating file system metadata. With Hierarchical Waffinity, these messages can now run in the Volume VBN and Aggregate VBN affinities (when freeing in a FlexVol volume and aggregate, respectively) because these affinities provide access to all of the required metafiles. Thus, the changes provided in Hierarchical Waffinity 1) allow block free messages to run concurrently with each other on different FlexVol volumes because each volume has its own Volume VBN affinity; and 2) allow block free messages to run in parallel with client operations in the Stripe affinities.

Here again, Hierarchical Waffinity demonstrates a significant reduction in Serial affinity usage, from 27% to 7%, as a result of parallelizing the block free operations. This reduction in serialization made it possible for the same workload to scale to an additional 2.88 cores compared to Classical Waffinity, as shown in Figure 9. This extra core usage allowed an overall improvement in benchmark throughput of 28%. The random write workload is interesting because it demonstrates the benefits of running internally generated metadata operations in Volume VBN affinity in parallel with front-end client traffic in the Stripe affinities, which was not possible in Classical Waffinity.

6.1.3 Overall CPU Scaling

The above analysis of Hierarchical Waffinity showed a substantial improvement in the number of cores used, but 1.5 cores were still idle.


Figure 10: Core usage with Hierarchical Waffinity on a variety of workloads under increasing levels of load.

These benchmarks were selected to illustrate the benefits of Hierarchical Waffinity compared to Classical Waffinity, not maximum utilization. Idle cores could have been driven down still further by parallelizing even the remaining 9% of Serial execution in SFS2008 and 7% in random write, but this was not required to meet performance and scaling objectives at the time.

In subsequent releases, we have further parallelized many operations, resulting in the improved CPU scalability shown in Figure 10. These parallelization efforts have included both reducing Serial affinity usage and moving already parallelized work into still finer affinities. In particular, the graph shows the achieved core usage at increasing levels of load (i.e., load points) for a variety of workloads on 64 FlexVol volumes on a 20-core storage server running Data ONTAP 9.0. The key takeaways are the low core usage that occurs at low load and the high core usage achieved in the presence of high load. In particular, four of the six benchmarks achieve a utilization of 19+ cores, with all benchmarks reaching at least 18 cores. This data demonstrates that Hierarchical Waffinity is able to take advantage of computational bandwidth up to 20 cores in a broad spectrum of important workloads, and cores are not starved for work by an excess of computation in coarse affinities. In Section 6.3, we discuss the issue of continued scaling on future platforms.

6.2 Hybrid Waffinity Evaluation

Although the previous section demonstrated the success of Hierarchical Waffinity across a wide range of workloads, there are also cases where its scalability is limited. Thus, we next evaluate the benefits of the Hybrid Waffinity model by considering benchmarks that emphasize cases where the hierarchical model falls short. We focus on single FlexVol volume scenarios because any coarse affinity utilization significantly limits parallelism; however, we also consider the scalability with multiple FlexVol volumes where Hierarchical Waffinity itself is typically very effective already. Single-volume configurations are less common, but they are still a very important customer setup. This section describes analysis done with Data ONTAP 9.0 on a 20-core platform, the highest-end system available in 2016.

6.2.1 Sequential Overwrite

We first look at the benefits of Hybrid Waffinity on a sequential overwrite workload. As discussed above, block overwrites are interesting in WAFL because they result in block frees. In Hierarchical Waffinity, block free work runs in the Volume VBN affinity, because it requires updating various file system metadata files, so it already runs in parallel with front-end traffic in the Stripe affinities. However, if the system is unable to keep up with the block free work being created, then client operations must perform part of the block free work, which hurts performance. This problem is exacerbated on single-volume configurations where all block free work in the system must go through the single active Volume VBN affinity, which can become a major bottleneck.

Increasing the parallelism of this workload requires running block free operations in Volume VBN Range (or simply Range) affinities. Certain metafile buffers required for tracking free blocks can be mapped to specific Range affinities, but other metafiles need to be updated from any Range affinity, so the operation must run in an affinity that excludes both relevant affinities. Fortunately, this is an ideal scenario for Hybrid-Insert and Hybrid-Write, in that buffers from certain metafiles can be mapped to a specific Range affinity and others can be protected by using locking from any Range affinity. Using a combination of partitioning and fine-grained locking to protect its buffer accesses, Hybrid Waffinity allows the block free operation to run in 1) a finer affinity and 2) an affinity of which there are multiple instances per FlexVol volume.

We evaluate a single-volume sequential overwrite benchmark on a 20-core platform with all flash drives. Figure 11 shows the throughput achieved and core usage at increasing levels of sustained load from a set of clients (i.e., the load point). Comparing peak load points shows a 62.8% improvement in throughput from the use of 4.8 additional cores. Under sufficient load, the Hierarchical Waffinity performance falls off, because block free work is unable to keep pace with the client traffic generating the frees. Hybrid Waffinity prevents this from happening by increasing the computational bandwidth that can be applied to block free work on a single FlexVol volume. This experiment demonstrates scalability up to 13.2 cores.


Figure 11: Throughput and core usage improvements at various levels of server load with Hybrid Waffinity compared to Hierarchical Waffinity for a sequential overwrite workload.

However, repeating the test with 64 volumes shows the system scaling further to 16.6 cores due to the activation of additional Volume affinity hierarchies. Hybrid Waffinity benefits throughput by only 7.5% and core usage by 1.6 cores in the multi-volume case because Hierarchical Waffinity is already so effective.

6.2.2 NetApp SnapMirror

The next workload we consider is NetApp SnapMirror®, a technology that replicates the contents of a FlexVol volume to a remote Data ONTAP storage server for data protection [33]. During the “init” phase, the entire contents of the volume are transferred to the destination and subsequent “update” transfers replicate data that has changed since the last transfer. In particular, SnapMirror operates by loading the blocks of a metadata file representing the entire contents of the volume on the source, sending the data to a remote storage server, and writing to the same metafile on the destination volume. Exclusive access to this file’s buffers belongs to the Volume affinity, which results in substantial serialization, because such work serializes all processing within the volume. Hybrid-Eject allows these buffers to instead be processed in the Stripe affinities.

Figure 12 shows the benefits of Hybrid-Eject on a single-FlexVol SnapMirror transfer. The transfer is destination-limited, so we evaluate core usage on the destination. During the init phase, core usage goes up by 1.1 cores and the throughput is improved by 24.2%. Similarly, the update phase uses an additional 0.7 cores, resulting in a 32.4% gain in throughput. Despite the benefit, core usage is low in the single-volume case; however, an experiment replicating 24 volumes scales to 12.5 cores (init) and 9.8 cores (update), at which point the workload becomes bottlenecked elsewhere and Hybrid Waffinity provides no benefit beyond the hierarchical model alone.

Figure 12: Throughput and core usage improvements with Hybrid Waffinity compared to Hierarchical Waffinity for a SnapMirror workload. Core usage is out of 20 available cores.

Figure 13: Throughput and core usage improvements with Hybrid Waffinity compared to Hierarchical Waffinity for a SnapVault workload. Core usage is out of 20 available cores.

6.2.3 NetApp SnapVault

We conclude our analysis by looking at the performance benefits of all three hybrid access modes together. Another Data ONTAP technology for replicating FlexVol volumes, called SnapVault®, writes to both user files and metafiles on the destination server. In Hierarchical Waffinity, these operations run in a coarse affinity (such as Volume or Volume Logical) with access to both types of buffers. With Hybrid Waffinity, user file buffers remain mapped to Stripe affinities, but Hybrid-Write and Hybrid-Insert allow the metafile buffers to be accessed by any child of the Volume affinity. Thus, SnapVault operations can run in the Stripe affinity of their user file accesses and use locking to protect metadata accesses. Further, the metafile buffers map to Volume Logical for Eject access, which can be optimized by using Hybrid-Eject to facilitate processing in the Stripe affinities.
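One way to picture the combined effect is as an affinity-selection rule: only the partition-mapped accesses constrain where a SnapVault message may run, while accesses covered by Hybrid-Insert, Hybrid-Write, or Hybrid-Eject impose no constraint because they are lock-protected. The sketch below is a hypothetical illustration of that rule, not WAFL's actual scheduler code; the names (buf_access, choose_affinity) are invented.

    /* Hypothetical sketch, not WAFL source.  Lock-protected (hybrid) buffer
     * accesses are ignored when choosing the affinity; only partition-mapped
     * accesses matter.  If all of those fall in one Stripe, the operation can
     * run there; conflicting partitions still force a coarser affinity. */

    enum access_mode { PARTITION_MAPPED, LOCK_PROTECTED };

    struct buf_access {
        enum access_mode mode;
        int stripe;              /* meaningful only when PARTITION_MAPPED */
    };

    #define ANY_STRIPE   (-1)    /* no partition-mapped access at all       */
    #define NEED_COARSER (-2)    /* accesses span stripes: escalate upward  */

    int choose_affinity(const struct buf_access *acc, int n)
    {
        int required = ANY_STRIPE;
        for (int i = 0; i < n; i++) {
            if (acc[i].mode == LOCK_PROTECTED)
                continue;                  /* a lock, not placement, protects it */
            if (required == ANY_STRIPE)
                required = acc[i].stripe;
            else if (required != acc[i].stripe)
                return NEED_COARSER;
        }
        return required;
    }

For the SnapVault destination writes described above, the user file accesses all map to a single Stripe, so choose_affinity would return that Stripe and the metafile updates would proceed from there under locks; under the purely hierarchical model, the same mix of accesses forces the operation up to Volume or Volume Logical.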

Figure 13 presents the improvements in throughput and core usage of Hybrid Waffinity on a single-volume SnapVault transfer. The new model facilitated a throughput gain of 130% in the init phase using an extra 4.29 cores out of the 20 available. Update performance is similarly improved by 112% with a core usage increase of 1.83 cores. Increasing the transfer to 8 volumes increases the total core usage to 17.7 cores (init) and 10.6 cores (update), at which point the primary bottlenecks move to other subsystems. Even here, the hybrid model improves scalability by 0.5 cores and 3.1 cores and throughput by 20.2% and 18.1%, for init and update, respectively.

In summary, Hybrid Waffinity greatly improves performance in the single-volume scenarios where Hierarchical Waffinity struggles most, and it even improves scaling in certain multi-volume workloads.

6.3 Discussion of Future Scaling

The analysis described in this paper was conducted on the highest-end platforms available at the time each feature was released. These platforms drive the scalability investment that is made, because scaling beyond the available cores does not add customer value. New platforms with higher core counts will continue to enter the market, and our scalability requirements will continue to increase as a result. We expect the infrastructure now in place to continue to pay rich dividends as future parallelization investments apply the techniques discussed in this paper to greater degrees rather than inventing new ones; that is, the bottlenecks on the horizon are not limitations of the architecture itself. Internal systems with more cores are currently undergoing extensive tuning, and the techniques discussed in this paper have already allowed scaling well beyond 30 cores.

7 Related Work

Operating system scalability for multicore systems has been the subject of extensive research. Recent work has emphasized minimizing the use of shared memory in the operating system in favor of message passing between cores that are dedicated to specific functionality [2, 4, 18, 27, 40]. Such designs allow scaling to many-core systems; however, they cannot easily be adopted in legacy systems, because they require re-architecting major kernel components, and they are probably best suited to new operating systems. So although such research is crucial to the OS community, approaches for incremental scaling like ours are also required in practice. A recent study [5] investigated the scalability of the Linux operating system and found that traditional OS kernel designs, such as that of Data ONTAP, can scale effectively for near-term core counts, despite the presence of specific scalability bugs in the CPU scheduler [29]. In contrast, Min et al. [31] analyze the scalability of five production file systems and find many bottlenecks, including some that may require core design changes.

Recently, many file system and operating system designs have been proposed to improve scalability. NOVA [41] is a log-structured file system designed to exploit non-volatile memories that allows synchronization-free concurrency on different files. Hare [19] implements a scalable file system designed for non-cache-coherent multicore processors. The work most similar to our own mitigates contention for shared data structures by running multiple OS instances within virtual machines [7, 36]. In a similar way, MultiLanes [23] and SpanFS [24] create independent virtualized storage devices to eliminate contention for shared resources in the storage stack. Other approaches to OS scaling on multicore systems include reducing OS overhead by collectively managing "many-core" processes [25], tuning the scheduler to optimize use of on-chip memory [6], and even exposing vector interfaces within the OS to use parallel hardware more efficiently [39].

One obvious alternative to data partitioning is fine-grained synchronization, to which many optimizations have been applied. Read-copy update improves the performance of shared access to data structures, in particular within the Linux kernel [30]. Both flat combining [16] and remote core locking [28] improve the efficiency of synchronization by assigning particular threads the role of executing all critical sections for a given lock. Specifically in the context of hierarchical data structures, intention locks [15] synchronize access to one branch of a hierarchy, and DomLock [22] makes such locking more efficient. Our approach provides lock-free access to hierarchical structures in the common case, although the locking introduced by Hybrid Waffinity certainly stands to benefit from some of these optimizations. Parallel execution can also be provided by runtime systems that infer task data-independence without explicit data partitions [3, 34].

The database community has long used data partitioning to facilitate parallel and distributed processing of transactions. Many algorithms exist for deploying a scalable database partitioning [1]. The process of defining partitions for optimal performance can also be done on the fly while monitoring workload patterns [20]. The Dora [32] and H-Store [37] models provide data partitions such that operations in a partition can be performed without requiring fine-grained locking. Other work [21, 38] addresses the problem of accesses to multiple partitions in a partitioned database. In these proposals, operations within a partition are serialized (and therefore lock-free), but transactions applied to multiple partitions are facilitated through a two-phase commit protocol (or similar). In our Hybrid Waffinity model, we instead superimpose a locking model on top of the partitioned data so that the operation can run safely from within a single partition.

Hierarchical data partitioning has also been explored. In most cases, the partitioning model is optimized around the presence of a hierarchical computational substrate [8, 9, 10, 12]. In contrast, our work focuses on the hierarchy inherent in the data itself in order to provide ample lock-free parallelism on traditional multicore systems. Similar work has been done to optimize scientific applications by using a hierarchical partitioning of the input data [14].

8 Conclusion

In this paper, we have presented the evolution of the multiprocessor model in WAFL, a high-performance production file system. At each step along the way we have enabled continued multiprocessor scaling without requiring significant changes to the massive and complicated code base. Through this work we have 1) provided a simple data partitioning model to parallelize the majority of file system operations; 2) removed excessive serialization constraints imposed by Classical Waffinity on certain workloads by using hierarchical data partitioning; and 3) implemented a hybrid model based on the targeted use of fine-grained locking within a larger data-partitioned architecture. Our work has resulted in substantial scalability and performance improvements on a variety of critical workloads while meeting aggressive product release deadlines, and it offers an avenue for continued scaling in the future. We also believe that the techniques discussed in this paper can influence other systems, because the hierarchical model is relevant to any hierarchically structured system and the hybrid model can be employed in insufficiently scalable systems based on partitioning.

References

[1] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the International Conference on Management of Data (SIGMOD), 2004.
[2] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the Symposium on Operating Systems Principles (SOSP), 2009.
[3] Micah J. Best, Shane Mottishaw, Craig Mustard, Mark Roth, Alexandra Fedorova, and Andrew Brownsword. Synchronization via scheduling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011.
[4] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, M. Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: An operating system for many cores. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2008.
[5] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[6] Silas Boyd-Wickizer, Robert Morris, and M. Frans Kaashoek. Reinventing scheduling for multicore systems. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2012.
[7] Edouard Bugnion, Scott Devine, Kinshuk Govil, and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems, 15(4), 1997.
[8] M. Chu, K. Fan, and S. Mahlke. Region-based hierarchical operation partitioning for multicluster processors. In ACM SIGPLAN Notices, 2003.
[9] M. Chu, R. Ravindran, and S. Mahlke. Data access partitioning for fine-grain parallelism on multicore architectures. In Proceedings of the International Symposium on Microarchitecture (MICRO), 2007.
[10] D. Clarke, A. Ilic, A. Lastovetsky, and L. Sousa. Hierarchical partitioning algorithm for scientific computing on highly heterogeneous CPU+GPU clusters. In Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par), 2012.
[11] Peter Denz, Matthew Curtis-Maury, and Vinay Devadas. Think global, act local: A buffer cache design for global ordering and parallel processing in the WAFL file system. In Proceedings of the International Conference on Parallel Processing (ICPP), 2016.
[12] H. Dutta, F. Hannig, and J. Teich. Hierarchical partitioning for piecewise linear algorithms. In Parallel Computing in Electrical Engineering, 2006.
[13] John K. Edwards, Daniel Ellard, Craig Everhart, Robert Fair, Eric Hamilton, Andy Kahn, Arkady Kanevsky, James Lentini, Ashish Prakash, Keith A. Smith, and Edward Zayas. FlexVol: Flexible, efficient file volume virtualization in WAFL. In USENIX Annual Technical Conference (ATC), 2008.
[14] O. Ergul and L. Gurel. A hierarchical partitioning strategy for an efficient parallelization of the multilevel fast multipole algorithm. IEEE Transactions on Antennas and Propagation, 2009.
[15] J. N. Gray, R. A. Lorie, and G. R. Putzolu. Granularity of locks in a shared data base. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1975.
[16] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010.
[17] Dave Hitz, James Lau, and Michael Malcolm. File system design for an NFS file server appliance. In USENIX Winter Technical Conference, 1994.
[18] David A. Holland and Margo I. Seltzer. Multicore OSes: Looking forward from 1991, er, 2011. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2011.
[19] Charles Gruenwald III, Filippo Sironi, M. Frans Kaashoek, and Nickolai Zeldovich. Hare: A file system for non-cache-coherent multicores. In Proceedings of the European Conference on Computer Systems (EuroSys), 2015.
[20] A. Jindal and J. Dittrich. Relax and let the database do the partitioning online. In Enabling Real-Time Business Intelligence, 2012.
[21] Evan P. C. Jones, Daniel J. Abadi, and Samuel Madden. Low overhead concurrency control for partitioned main memory databases. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010.
[22] Saurabh Kalikar and Rupesh Nasre. DomLock: A new multi-granularity locking technique for hierarchies. In Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), 2016.
[23] Junbin Kang, Benlong Zhang, Tianyu Wo, Chunming Hu, and Jinpeng Huai. MultiLanes: Providing virtualized storage for OS-level virtualization on many cores. In Proceedings of the Conference on File and Storage Technologies (FAST), 2014.
[24] Junbin Kang, Benlong Zhang, Tianyu Wo, Weiren Yun, Lian Du, Shuai Ma, and Jinpeng Huai. SpanFS: A scalable file system on fast storage devices. In Proceedings of the USENIX Annual Technical Conference (ATC), 2015.
[25] Kevin Klues, Barret Rhoden, Andrew Waterman, David Zhu, and Eric Brewer. Processes and resource management in a scalable many-core OS. In Proceedings of the Workshop on Hot Topics in Parallelism (HotPar), 2010.
[26] Charles E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the Design Automation Conference (DAC), 2009.
[27] Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional partitioning to optimize end-to-end performance on many-core architectures. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2010.
[28] Jean-Pierre Lozi, Florian David, Gael Thomas, Julia Lawall, and Gilles Muller. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In Proceedings of the USENIX Annual Technical Conference (ATC), 2012.
[29] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quema, and Alexandra Fedorova. The Linux scheduler: A decade of wasted cores. In Proceedings of the European Conference on Computer Systems (EuroSys), 2016.
[30] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran Krieger, and Rusty Russell. Read copy update. In Proceedings of the Ottawa Linux Symposium, 2002.
[31] Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. Understanding manycore scalability of file systems. In Proceedings of the USENIX Annual Technical Conference (ATC), 2016.
[32] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. In Proceedings of the VLDB Endowment (PVLDB), 2010.
[33] Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara. SnapMirror: File system based asynchronous mirroring for disaster recovery. In Proceedings of the Conference on File and Storage Technologies (FAST), 2002.
[34] Martin C. Rinard and Monica S. Lam. The design, implementation, and evaluation of Jade. ACM Transactions on Programming Languages and Systems, 20(1), 1998.
[35] SPEC SFS (System File Server) benchmark. www.spec.org/sfs2008, 2014.
[36] Xiang Song, Haibo Chen, Rong Chen, Yuanxuan Wang, and Binyu Zang. A case for scaling applications to many-core with OS clustering. In Proceedings of the European Conference on Computer Systems (EuroSys), 2011.
[37] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (It's time for a complete rewrite). In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2007.
[38] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the International Conference on Management of Data (SIGMOD), 2012.
[39] Vijay Vasudevan, David G. Andersen, and Michael Kaminsky. The case for VOS: The vector operating system. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2011.
[40] David Wentzlaff and Anant Agarwal. Factored operating systems (fos): The case for a scalable operating system for multicores. Operating Systems Review, 43(2), 2009.
[41] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the Conference on File and Storage Technologies (FAST), 2016.

Copyright notice

NetApp, the NetApp logo, Data ONTAP, FlexVol, SnapMirror, SnapVault, and WAFL are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. A current list of NetApp trademarks is available on the web at http://www.netapp.com/us/legal/netapptmlist.aspx.
