CS510 – Concurrent Systems 1

Performance of memory reclamation for lockless synchronization

By Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, Jonathan Walpole

Handwaved about by Jim Cotillier

The Problem

Why not just stick with classical locks?

o Performance issues (blocking)
o CAS-class instruction overhead
o Susceptible to:

• Deadlock
• Priority Inversion
• Convoying

Lockless synchronization addresses this, but is exposed to Read/Reclaim Races

o Reclamation of shared data elements without coordination with all contenders leads to an inconsistent global state

• Such ex post facto references to deleted data yield unpredictable results

Uncoordinated reclamation…

Some Approaches to Solutions

QSBR -- Quiescent State Based Reclamation

EBR/NEBR -- Epoch Based Reclamation

HPBR -- Hazard Pointers Based Reclamation

LFRC -- Lock-free Reference Counting

Functionality provided by a client/library interface

o But no single, invariant set of interface semantics exists across all schemes

QSBR

Permits the reclamation of data only after a time interval elapses, called a Grace Period

QSBR defines a Grace Period to be a time interval [a, b] such that any data element removed before a can safely be reclaimed after b

A Quiescent State is a state of a thread, T, in which T holds no references to shared elements, active or deleted (zombie)

Any interval in which each thread passes through a Quiescent State is a QSBR Grace Period
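The grace-period rule above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions, not the implementation measured in the paper: the class `QSBR` and the method `retire()` are hypothetical names, and real QSBR needs per-thread state and memory ordering that this model ignores.

```python
class QSBR:
    """Single-threaded model of QSBR bookkeeping (no real concurrency)."""

    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.pending = []      # (element, threads that must still go quiescent)
        self.reclaimed = []    # elements whose grace period has ended

    def retire(self, element):
        # A removed element waits until every thread, including the one
        # that removed it, has passed through a quiescent state.
        self.pending.append((element, set(range(self.n_threads))))

    def quiescent_state(self, tid):
        # Thread `tid` declares that it holds no references to shared elements.
        still_waiting = []
        for element, waiting in self.pending:
            waiting.discard(tid)
            if waiting:
                still_waiting.append((element, waiting))
            else:
                self.reclaimed.append(element)   # grace period has elapsed
        self.pending = still_waiting


qsbr = QSBR(n_threads=3)
qsbr.retire("node A")
qsbr.quiescent_state(0)
qsbr.quiescent_state(1)
print(qsbr.reclaimed)   # [] : thread 2 has not been quiescent yet
qsbr.quiescent_state(2)
print(qsbr.reclaimed)   # ['node A']
```

In this model an element stays pending for as long as even one thread has not declared a quiescent state.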

Three-thread QSBR example…

QSBR Fuzzy Barriers

Protect access to “protected” code which no thread should execute before all other threads reach a specified point

Do not absolutely block, a la hard barriers, only prevent execution of “protected” code until barrier opens

Thus, can be used to synchronize reclamation

Using QSBR

Client explicitly declares Quiescent State:

… and thereby enters a fuzzy barrier

Problem: thread failure

A dead thread cannot call quiescent_state() and thus can force QSBR to block…

EBR (Fraser)

Uses Grace Periods, like QSBR

o But does not rely upon explicit client Quiescent State declarations, as QSBR does

Encapsulates lockless operations within Critical Sections

o …which the client explicitly declares, via the functions critical_enter() and critical_exit()

Counts the number of Critical Region invocations, and then attempts to enter a fuzzy barrier to reclaim memory

Linked list search using EBR

EBR Epochs

Epochs are modeled after [3], the group of equivalence classes modulo 3

Epochs are hierarchical: Global (GE) and Local (LE)

Each epoch has an associated zombie element list

Fuzzy barrier for reclamation is entered upon entry to each new epoch

A thread entering a Critical Region (CR) updates its Local Epoch to match the Global Epoch

After M (magic number) LE updates, a thread will attempt to increment the GE

EBR Epochs Cont’d.

A GE update attempt only succeeds if the LE of each thread in a CR matches the GE

Since threads update their LE only at the start of a CR, whenever a thread T advances its LE to match a new GE, all lockless operations of other threads that were in progress the last time T was in that epoch have completed

Thankfully, a grace period has expired!
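The epoch rules on the last two slides can be sketched as a single-threaded model. This is a simplified illustration, not Fraser's code: the class names, `retire()`, `try_advance()` and the threshold `m` (standing in for the magic number M) are assumptions, the model counts critical-region entries rather than LE updates, and no fences or atomics are modeled.

```python
class EpochThread:
    def __init__(self):
        self.local_epoch = 0
        self.in_critical = False
        self.entries = 0                # entries since the last LE change
        self.limbo = [[], [], []]       # one zombie list per epoch


class EBR:
    def __init__(self, n_threads, m=2):
        self.global_epoch = 0
        self.m = m                      # assumed stand-in for the magic number M
        self.threads = [EpochThread() for _ in range(n_threads)]
        self.reclaimed = []

    def try_advance(self):
        # The GE may move only if every thread inside a CR has caught up.
        if all(t.local_epoch == self.global_epoch
               for t in self.threads if t.in_critical):
            self.global_epoch = (self.global_epoch + 1) % 3
            return True
        return False

    def critical_enter(self, tid):
        t = self.threads[tid]
        t.in_critical = True
        if t.local_epoch != self.global_epoch:
            # Entering a new epoch: zombies parked the last time this
            # thread was in that epoch are now safe to reclaim.
            t.local_epoch = self.global_epoch
            t.entries = 0
            self.reclaimed.extend(t.limbo[t.local_epoch])
            t.limbo[t.local_epoch] = []
        else:
            t.entries += 1
            if t.entries >= self.m:
                t.entries = 0
                self.try_advance()

    def critical_exit(self, tid):
        self.threads[tid].in_critical = False

    def retire(self, tid, element):
        t = self.threads[tid]
        t.limbo[t.local_epoch].append(element)


ebr = EBR(n_threads=2, m=1)
ebr.critical_enter(0)
ebr.retire(0, "node X")
ebr.critical_exit(0)
for _ in range(4):
    ebr.critical_enter(0)
    ebr.critical_exit(0)
print(ebr.reclaimed)    # [] : the epoch has not yet come full circle
ebr.critical_enter(0)
ebr.critical_exit(0)
print(ebr.reclaimed)    # ['node X']
```

With three epochs, an element retired in epoch e is reclaimed only when its thread re-enters e, by which point the GE has advanced three times.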

EBR Epoch Cycle

NEBR – a Modest EBR Improvement

EBR must pay for the expensive fences at the beginning and end of a CS

Modeled a little after QSBR: have the application set/reset a “critical section(s) may be in here” flag

o NEBR then does not “automatically” do this in each CS
o “Application independence” dies in favor of performance

Reduces EBR’s overhead modestly, bringing it closer to QSBR

o NEBR is attractive as the programmer’s responsibilities are limited to marking sections that might contain lockless operations

HPBR/SMR (Michael)

Each thread T has (magic) K Hazard Pointers used to protect elements from reclamation by other threads

o Thus, for N threads, H = NK HPs exist in toto
o K is small, often 2 (queues and lists); 1 (stacks)

T caches removed elements privately in a list P of size (magic) R

After R removals, T reclaims each element in P that does not have a corresponding HP

If T fails, a maximum of K+R removed elements can be leaked
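The retire-and-scan step can be sketched as follows. This is a single-threaded model under assumptions, not Michael's implementation: `HPThread`, `retire()` and `scan()` are hypothetical names, the values chosen for K and R are arbitrary, and elements are stood in for by strings.

```python
K = 2   # hazard pointers per thread (assumed, as for queues and lists)
R = 4   # size of the private retired list P before a scan (assumed)


class HPThread:
    def __init__(self):
        self.hazard = [None] * K    # this thread's hazard pointers
        self.retired = []           # private list P of removed elements


def scan(me, all_threads, reclaimed):
    # Collect every element currently protected by any thread's HP.
    protected = {hp for t in all_threads for hp in t.hazard if hp is not None}
    survivors = []
    for element in me.retired:
        if element in protected:
            survivors.append(element)    # still hazardous, keep for later
        else:
            reclaimed.append(element)    # no HP refers to it, reclaim
    me.retired = survivors


def retire(me, all_threads, element, reclaimed):
    me.retired.append(element)
    if len(me.retired) >= R:             # after R removals, scan
        scan(me, all_threads, reclaimed)


threads = [HPThread(), HPThread()]
reclaimed = []
threads[1].hazard[0] = "n2"              # thread 1 is still using n2
for name in ["n1", "n2", "n3", "n4"]:
    retire(threads[0], threads, name, reclaimed)
print(reclaimed)             # ['n1', 'n3', 'n4']
print(threads[0].retired)    # ['n2']
```

The element named by a hazard pointer survives the scan and is retried on a later one, which is how the scheme bounds the number of unreclaimed elements per thread.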

HPBR Paradigm

HPBR Paradigm Cont’d.

Hazardous References—references to shared elements that may now be zombies or ABA situations

o Algorithms using HPBR must identify a Hazardous Reference, set a Hazard Pointer, then check for element removal

o If an element has not been removed, it continues to be referentially safe
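The identify, set, then check sequence can be sketched as a retry loop. This is a hedged model only: `protect()` and `scripted_reader()` are hypothetical helpers, the scripted reads simulate another thread swapping the shared pointer between two loads, and a real implementation needs a memory fence between publishing the HP and the re-check.

```python
def scripted_reader(values):
    """Return a function that yields successive values of a shared pointer."""
    it = iter(values)
    return lambda: next(it)


def protect(read_shared, hazard, slot):
    # Loop until the hazard pointer is known to cover a live element.
    while True:
        candidate = read_shared()      # hazardous reference
        hazard[slot] = candidate       # publish it (fence needed on real HW)
        if read_shared() == candidate: # re-check: was it removed meanwhile?
            return candidate           # unchanged, so safe to dereference


# The pointer changes from "A" to "B" between the first two loads,
# so the first attempt fails its re-check and the loop retries.
read = scripted_reader(["A", "B", "B", "B"])
hazard = [None]
node = protect(read, hazard, 0)
print(node, hazard)    # B ['B']
```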

LFRC (Valois, Detlefs, et al.)

Threads track the instantaneous count of references to elements

o When count = 0, element can be reclaimed

Many variations on this scheme may or may not allow element types to change upon reclamation

o May require type invariance (Valois); type independence requires DCAS (Detlefs, et al.)
o Zombies may consume unbounded memory

Performance may be worse than lock-based

o CAS, FAA (Intel: LOCK XADD) very expensive
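To make the per-element cost concrete, here is a minimal model that counts the atomic operations a reference-counted traversal would need. It is a sketch under assumptions: `Node`, `LFRC`, `acquire()` and `release()` are hypothetical names, and plain integer arithmetic stands in for the CAS or fetch-and-add that a real implementation would issue.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.refs = 1          # the reference held by the data structure


class LFRC:
    def __init__(self):
        self.atomic_ops = 0    # how many CAS/FAA-class operations were needed
        self.reclaimed = []

    def acquire(self, node):
        node.refs += 1         # would be an atomic increment
        self.atomic_ops += 1

    def release(self, node):
        node.refs -= 1         # would be an atomic decrement
        self.atomic_ops += 1
        if node.refs == 0:
            self.reclaimed.append(node.name)


rc = LFRC()
nodes = [Node("a"), Node("b"), Node("c")]

# A read-only traversal still pays two atomic operations per element.
for n in nodes:
    rc.acquire(n)
    rc.release(n)
print(rc.atomic_ops)     # 6

reader_ref = nodes[1]
rc.acquire(reader_ref)   # a reader holds b
rc.release(nodes[1])     # a writer unlinks b and drops the structure's reference
print(rc.reclaimed)      # [] : the reader still holds b
rc.release(reader_ref)   # last reference dropped
print(rc.reclaimed)      # ['b']
```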

Summary of Schemes

QSBR -- Detects grace periods using application-specified quiescent states

EBR -- Detects grace periods using application-independent epochs

HPBR -- Uses per-thread Hazard Pointers to synchronize reclamation

LFRC -- Uses per-element reference counts to synchronize reclamation

Performance Factors…

Depends on many factors

o Memory consistency and constraints
o Workload, contention and thread scheduling

Sequentially consistent memory model is still generally assumed by the lock-free literature

o But the hardware trends are toward weaker models

• Coder needs to rely on fences (MBarriers), which artificially add overhead
• HPBR, EBR and LFRC require per-operation fences, but not QSBR; this is shown to be a distinct advantage

Performance Factors Cont’d.

Thread preemption

o Can start when number of threads > number of CPUs

Descheduled threads are blocked threads, as far as reclamation schemes are concerned

o Anything that prevents a Grace Period from closing is bad

Threads may sometimes need to borrow memory from a locked, global pool

o A thread may be preempted while holding such a lock, setting up a thread convoy on memory

o HPBR bounds memory stress and has an advantage here

The μBenchmark

The μBenchmark Cont’d.

Master thread flow logic

o Create N children
o Start a timer
o When timer expires, stop children

Average execution time per measured operation = test duration / number of operations

Net CPU time = execution time * number of threads

o If thread count > CPU count, report execution time; otherwise report CPU time.
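The reporting rule above works out as follows. This is a sketch that follows the slide's wording literally; the function name, its parameters and the sample numbers are made up for illustration and are not results from the paper.

```python
def per_operation_times(duration_ns, operations, n_threads, n_cpus):
    """Return (execution time, CPU time, reported value), all in ns per operation."""
    exec_time = duration_ns / operations      # test duration / number of operations
    cpu_time = exec_time * n_threads          # net CPU time
    # Report execution time once threads outnumber CPUs, CPU time otherwise.
    reported = exec_time if n_threads > n_cpus else cpu_time
    return exec_time, cpu_time, reported


# Hypothetical run: 1 second, 4 million operations, 8 CPUs.
print(per_operation_times(1_000_000_000, 4_000_000, n_threads=4, n_cpus=8))
# (250.0, 1000.0, 1000.0)
print(per_operation_times(1_000_000_000, 4_000_000, n_threads=16, n_cpus=8))
# (250.0, 4000.0, 250.0)
```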

Driver parameters were selected not to be biased toward any particular reclamation scheme

The μBenchmark Cont’d.

CAS implemented on POWER via larx/stcx (LL/SC)

Fences implemented via eieio (“Enforce In-order Execution of I/O”)

Spin locks implemented via CAS and fences

Statically allocated HPBR Hazard Pointers

o Some algorithms may require unbounded HP counts

Choice of placement of QSBR QS declarations may not be obvious in some algorithms

Performance Measurement Guidelines

Measure the base costs first

o Single-threaded execution, small data structures

• No contention, no preemption, no traversal of long lists
• Non-blocking queues, single-element linked lists…

Then move toward complexity

o Pedagogical approach: try to change only one factor at a time
o Consider the R/O, the W/O and the R/W cases in each of the examined reclamation schemes

Base Performance Costs

Scalability with Fractional Workload

Scalability with Traversal Length

Scalability of LFRC

No Preemption; R/O Workload

No Preemption; W/O Workload

Preemption; R/O Workload

Preemption; W/O Workload

Memory Stress Busy Wait

Hash Tables; Update Fraction Workload

No Preemption; R/O Workload with NEBR

Case Study—RCU API in Linux

RCU concepts: “Read/Copy/Update”

o Lockless concurrent reads with deferred destruction of zombie elements
o Writers may not prevent readers from accessing shared data
o Writers must coordinate with each other in some way

• RCU does not specify what way

o RCU neither blocks nor fails for readers
o Preemptable kernels necessitate the use of rcu_read_lock() and rcu_read_unlock() to toggle kernel preemption

• …so that context switches do not occur at intolerable times
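The read, copy, update and deferred-destruction steps can be modeled in a few lines. This is a toy single-threaded illustration and not the Linux RCU API: `RCUModel` and its methods are hypothetical names, and the grace period is signalled by hand rather than detected.

```python
import copy


class RCUModel:
    def __init__(self, data):
        self.current = data     # the published version
        self.deferred = []      # old versions awaiting a grace period

    def read(self):
        # Readers just pick up the published reference: no lock, no retry.
        return self.current

    def update(self, mutate):
        new_version = copy.deepcopy(self.current)   # copy
        mutate(new_version)                         # update the private copy
        old = self.current
        self.current = new_version                  # publish with one reference store
        self.deferred.append(old)                   # destruction is deferred

    def grace_period_elapsed(self):
        # Called once no reader can still hold an old version.
        self.deferred.clear()


table = RCUModel({"limit": 10})
snapshot = table.read()                       # a reader starts before the update
table.update(lambda d: d.update(limit=20))
print(snapshot["limit"], table.read()["limit"])   # 10 20
table.grace_period_elapsed()
print(len(table.deferred))                    # 0
```

The reader that started before the update keeps a consistent old version, and that version is only dropped after the grace period.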

Case Study—RCU Cont’d.

QSBR is a natural choice for memory reclamation

o EBR could be used as well, but would not offer any advantages over QSBR

RCU is best targeted to read-mostly data structures

o Rare updates imply rare reclamation

Case Study—RCU Cont’d.

SysV IPC subsystem implemented in Linux via CR-QSBR

o Implements semaphores, message queues and shared memory

o Apps use an integer Accessor ID to access in-kernel data structures (essentially a “resource handle”)

o The dynamic, mostly-read (AID/resource) array, formerly spinlocked in stock Linux, was protected here through CR-QSBR instead, and benchmarked

Case Study—RCU Cont’d.

Semopbench, 8-CPU, 700 MHz Intel P-III

Case Study—RCU Cont’d.

DBT1 Database Benchmark Raw Results

Case Study—RCU Cont’d.

DBT1 database benchmark results (TPS)

Conclusions

Reclamation has a huge effect on lockless algorithm performance

o So one must tune to the design of the application

Both QSBR and EBR can suffer in the face of memory exhaustion

HPBR and EBR have higher base costs than QSBR due to fences

The NEBR enhancements modestly improve EBR

LFRC has the highest overhead due to the per-element atomic instruction requirement

Conclusions Cont’d.

HPBR scales poorly as the traversal length increases

QSBR is, overall, the best performing reclamation scheme

o …and best suited to an OS kernel environment
o Lockless approaches using QSBR can outperform locking approaches by a large margin

Rantings -- STAE

STAE – Specified Thread Abnormal Exit

o User provides Exit code to be run on condition of thread error trap
o Exit is driven by the etrap interrupt logic; Exit is called immediately after etrap is detected, e.g., SEGV
o Exit has full access to environment of failing thread; may modify any data, etc.
o Exit may:

• Allow failing thread to die (the status quo)
• Resuscitate failing thread by telling the dispatcher to restart the thread at an Exit-specified point in its code
• Call a completely new program to run in place of failing thread (with all of the failing thread’s credentials and context)

Rantings -- PLO

PLO – Perform Locked Operation (IBM z Platform)

o Meta instruction that atomically encapsulates all of CAL, CAS, DCAS, CASAS, CASADS, CASATS into single-instruction global atomicity
o 32, 64, or 128-bit operands are supported
o Acquires a global hardware interlock unique to PLO
o Is very powerful and flexible, but is so complex that it may require a pre-built parameter list just to “program” it!
o Usually needs to be coded with a zillion operands
o Its proprietary μalgorithm has to be huge, but whether its utility outstrips its cost enough to yield a net gain in performance has not yet been answered (afaik)

Questions/Musings

Suppose DCAS was “improved” so that it uses an order of magnitude fewer clocks than today.

o To what extent could macroscopically faster hardware atomicity affect the utility of these lockless schemes?

Could the STAE formalism provably solve the failed thread blocking problem in QSBR?

o If you believe the answer is yes, based on the empirical data in this paper, would the paradigm (QSBR+STAE) satisfy Ockham’s Razor and thus become the overall best solution to the lockless reclamation problem?
