A Dynamic Elimination-Combining Stack Algorithm
Gal Bar-Nissan, Danny Hendler and Adi Suissa
Department of Computer Science, BGU, January 2011
Presented by: Ilya Mirsky, 28.03.2011
Outline Concurrent programming terms Motivation Introduction DECS: The Algorithm DECS Performance evaluation NB-DECS Summary
Concurrent Programming Terms
- Locks (coarse- and fine-grained)
- Non-blocking algorithms: wait-freedom, lock-freedom, obstruction-freedom
- Linearizability
- Memory contention
- Latency
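These terms can be made concrete with a small example. Below is a minimal Treiber-style lock-free stack sketch (illustrative only, not code from the presentation): each push and pop takes effect at a single successful CAS on the top pointer, which is what makes the operations linearizable and the algorithm lock-free, since a stalled thread can never block the others.

```cpp
#include <atomic>
#include <optional>

// Minimal Treiber-style lock-free stack (illustrative sketch).
// Each operation linearizes at one successful compare-exchange on `top`.
template <typename T>
class TreiberStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> top{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, top.load()};
        // Retry until the CAS installs our node atop the current head.
        while (!top.compare_exchange_weak(n->next, n)) {}
    }
    std::optional<T> pop() {
        Node* n = top.load();
        // Retry until we detach the current head (or observe emptiness).
        while (n && !top.compare_exchange_weak(n, n->next)) {}
        if (!n) return std::nullopt;   // empty stack
        T v = n->value;
        delete n;   // note: a production version needs safe memory
                    // reclamation and ABA protection, omitted here
        return v;
    }
};
```

Under contention every thread still hammers the single `top` pointer, which is exactly the "hot spot" problem the rest of the talk addresses.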
Motivation
Concurrent stacks are widely used in parallel applications and operating systems. A simple implementation using a coarse-grained locking mechanism creates a "hot spot" at the central stack object and poses a sequential bottleneck.
There is a need for a scalable concurrent stack that performs well under low, medium, and high workloads, regardless of the ratio of operation types (push/pop).
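For contrast, here is a sketch of the coarse-grained approach criticized above (assumed code, not from the paper): one global mutex guards every operation, so all threads serialize on the same hot spot no matter how many cores are available.

```cpp
#include <mutex>
#include <stack>
#include <optional>

// Coarse-grained locked stack: a single mutex guards every operation,
// so all threads serialize on one "hot spot" (illustrative sketch).
template <typename T>
class CoarseLockedStack {
    std::stack<T> items;
    std::mutex m;
public:
    void push(T v) {
        std::lock_guard<std::mutex> g(m);   // every pusher contends here
        items.push(std::move(v));
    }
    std::optional<T> pop() {
        std::lock_guard<std::mutex> g(m);   // ...and every popper too
        if (items.empty()) return std::nullopt;
        T v = std::move(items.top());
        items.pop();
        return v;
    }
};
```

Correct and simple, but throughput cannot exceed that of one thread holding the lock, which is the sequential bottleneck the slide describes.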
Introduction
Two key synchronization paradigms for the construction of scalable concurrent data structures are software combining and elimination.
The most highly scalable concurrent stack algorithm previously known is the lock-free elimination-backoff stack (Hendler, Shavit, Yerushalmi).
The HSY stack is highly efficient under low contention, as well as under high contention when the workload is symmetric. Unfortunately, when workloads are asymmetric, the performance of the HSY stack deteriorates to that of a sequential stack.
Flat combining (Hendler et al.) significantly outperforms HSY at low and medium contention levels, but it does not scale, and even deteriorates, at high contention levels.
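The elimination idea can be sketched with a toy single-slot exchanger (an assumed simplification of the HSY collision array, not the paper's code): a concurrent push/pop pair can exchange a value through a side slot and complete without ever touching the central stack.

```cpp
#include <atomic>
#include <optional>

// Toy single-slot elimination: a pusher parks its value in the slot; a
// popper that arrives while the slot is full takes the value directly,
// and neither thread touches the central stack (illustrative sketch,
// ignoring timing/backoff details of the real collision array).
struct EliminationSlot {
    std::atomic<int*> slot{nullptr};

    // Pusher: try to park `v`; returns true if a popper collected it.
    bool tryEliminatePush(int* v) {
        int* expected = nullptr;
        if (!slot.compare_exchange_strong(expected, v))
            return false;        // slot busy: fall back to the stack
        // The real algorithm would spin/back off here; we just check
        // whether a popper has already emptied the slot.
        expected = v;
        if (slot.compare_exchange_strong(expected, nullptr))
            return false;        // nobody came: withdraw the offer
        return true;             // value was taken: eliminated
    }

    // Popper: try to grab a parked value.
    std::optional<int> tryEliminatePop() {
        int* v = slot.exchange(nullptr);
        if (!v) return std::nullopt;   // no pusher waiting
        return *v;
    }
};
```

A matched pair finishes entirely in the side slot, which is why elimination scales under symmetric workloads; with asymmetric workloads most offers find no partner, which is the weakness DECS targets.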
Introduction - DECS
DECS employs both the combining and the elimination mechanisms. It scales well for all workload types and outperforms other stack implementations, while maintaining the simplicity and low overhead of the HSY stack. It uses a contention-reduction layer, the elimination-combining layer, as a backoff scheme for a central stack.
A non-blocking implementation, NB-DECS, is also presented: a lock-free variant of DECS in which threads that have waited for too long may cancel their "combining contract" and retry their operation on the central stack.
Introduction - DECS

[Slides 9-17: animation. Arriving threads first try to collide in the elimination-combining layer, which sits in front of the CentralStack as a backoff layer. Threads that delegate their operations fall asleep ("zzz…") while another thread works on their behalf, and are told to "Wake up!" once their operations have been completed for them.]
DECS - The Algorithm: The Data Structures

Each thread publishes a MultiOp record in the locations array and announces its id in the collision array (shown holding thread ids 1, 6, 4):

MultiOp {
    int id;
    int op;
    int length;
    int cStatus;
    Cell cell;
    MultiOp next;
    MultiOp last;
}

Cell {
    Data data;
    Cell next;
}

The CentralStack is a linked list of such cells, fronted by the elimination-combining layer.
DECS - The Algorithm

[Slide: three threads head for the CentralStack with push(data1), push(data2) and pop(), each wishing "there was someone in a similar situation" to collide with.]
DECS - The Algorithm

Each operation begins by initializing a MultiOp record:

    pop():      multiOp tInfo = initMultiOp();
    push(data): multiOp tInfo = initMultiOp(data);
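Based on the fields shown on the data-structures slide, initMultiOp plausibly builds the record a thread publishes before trying to collide. The exact code is not preserved in this transcript, so the sketch below is an assumption: field names follow the slide, while the constructor signature and types are guessed.

```cpp
// Sketch of the MultiOp record and its initialization; field names are
// taken from the data-structures slide, everything else is assumed.
enum Op { PUSH, POP };
enum CStatus { INIT, FINISHED };

struct Cell { int data; Cell* next; };

struct MultiOp {
    int id;            // owning thread's id
    Op op;             // PUSH or POP
    int length;        // number of operations folded into this record
    CStatus cStatus;   // collision status: INIT until resolved
    Cell cell;         // holds the pushed value (unused for POP)
    MultiOp* next;     // link to further delegated operations
    MultiOp* last;     // tail of the delegated-operation list
};

// initMultiOp(): build the record a thread publishes before colliding.
// A push passes its data; a pop leaves the cell empty.
MultiOp initMultiOp(int threadId, Op op, int data = 0) {
    return MultiOp{threadId, op, 1, INIT, {data, nullptr},
                   nullptr, nullptr};
}
```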
DECS - The Algorithm

[Slide: thread 2 publishes a MultiOp record (id = 2, op = POP, length = 1, cStatus = INIT, next = NULL) and thread 6 publishes one (id = 6, op = PUSH, length = 1, cStatus = INIT, next = NULL, cell holding data1) in the locations array. Thread 6 announces itself in the collision array and waits ("I'll wait, maybe someone will arrive…"), becoming the passive collider; thread 2 finds its entry ("Yay, I can collide with thread 6!") and becomes the active collider.]
DECS - The Algorithm: Central Stack Functions

[Slides 23-25: central stack function pseudocode; the figures are not preserved in this transcript.]
DECS - The Algorithm

[Slide: thread 6 sleeps ("zzz…") on its published PUSH MultiOp while active collider thread 2 inspects the records: "I see that T. 6 got PUSH, and I got POP - we can eliminate!"]
DECS - The Algorithm: Elimination-Combining Layer Functions
DECS - The Algorithm

[Slides 27-28: while thread 6 sleeps, active collider thread 2 performs both operations ("Working… Done!"): its POP is matched against thread 6's PUSH of data1, and both MultiOp records end up with length = 0 and cStatus = FINISHED.]
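The elimination step animated in these slides can be sketched as follows. This is a simplification with assumed names, not the paper's code; the real algorithm also handles combining of same-type operations by linking MultiOp lists via next/last, which is omitted here.

```cpp
// Sketch of the elimination step: the active collider (the popper,
// thread 2) matches a PUSH against a POP, hands the pushed cell to the
// pop, zeroes both lengths, and marks both records FINISHED so the
// sleeping passive collider can return. Field names follow the slide;
// the code itself is an assumption.
enum Op { PUSH, POP };
enum CStatus { INIT, FINISHED };

struct Cell { int data; };
struct MultiOp {
    int id; Op op; int length; CStatus cStatus; Cell cell;
};

// Returns the value the POP obtains from the matched PUSH.
int eliminate(MultiOp& pusher, MultiOp& popper) {
    popper.cell = pusher.cell;   // pop takes the pushed data directly
    pusher.length = 0;           // both single operations are satisfied
    popper.length = 0;
    pusher.cStatus = FINISHED;   // the passive collider sees this and wakes
    popper.cStatus = FINISHED;
    return popper.cell.data;
}
```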
DECS - The Algorithm

[Slide: thread 2 wakes the sleeping thread 6: "Wake up man, I've done your job!" "Thank you T. 2, let's go have a beer; I'm buying!"]
DECS Performance Evaluation: Hardware
A 128-way UltraSPARC T2 Plus (T5140) server: a two-chip system in which each chip contains 8 cores, and each core multiplexes 8 hardware threads. Runs the Solaris 10 OS; the cores in each CPU share the same L2 cache. C++ code compiled with GCC with the -O3 flag.
Compared against: the Treiber stack, the HSY elimination-backoff stack, and the flat-combining stack.
DECS Performance Evaluation: Course of Experiments
Threads repeatedly apply operations on the stack for a fixed duration of 1 second, and the resulting throughput is measured, varying the level of concurrency from 1 to 128 threads. Throughput is measured on both symmetric and asymmetric workloads. Stacks are pre-populated with enough cells so that pop operations never operate on an empty stack. Each data point is the average of 3 runs.
DECS Performance Evaluation

[Three throughput graphs, x-axis: number of threads; one each for the symmetric, moderately-asymmetric, and fully-asymmetric workloads.]
NB-DECS
DECS is blocking. For some applications a non-blocking implementation may be preferable, because it is more robust to thread failures.
NB-DECS is a lock-free variant of DECS that allows threads that delegated their operations to another thread, and have waited for too long, to cancel their "combining contracts" and retry their operations.
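The "cancel the combining contract" idea can be sketched as a CAS race on the record's status word (the names below are assumptions, not the paper's code): the waiting thread tries to move the status from INIT to CANCELLED, the active collider tries to move it from INIT to TAKEN, and exactly one of the two CASes can succeed, so the delegated operation is either executed or withdrawn, never both.

```cpp
#include <atomic>

// Sketch of NB-DECS-style cancellation: the contract is resolved by a
// single CAS race on a shared status word (names are assumed).
enum Status : int { INIT = 0, TAKEN = 1, CANCELLED = 2 };

struct Contract {
    std::atomic<int> status{INIT};

    // Waiter: give up on the collision after waiting too long;
    // true means the operation may be retried on the central stack.
    bool cancel() {
        int expected = INIT;
        return status.compare_exchange_strong(expected, CANCELLED);
    }

    // Active collider: claim the delegated operation before executing it.
    bool claim() {
        int expected = INIT;
        return status.compare_exchange_strong(expected, TAKEN);
    }
};
```

Because both transitions start from INIT, the loser's CAS fails and it simply observes the winner's decision, which is what keeps the variant lock-free.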
Summary
DECS comprises a combining-elimination layer and therefore benefits from collisions of operations with reverse as well as identical semantics.
Empirical evaluation showed that DECS outperforms the best known stack algorithms for all workload types.
NB-DECS is a lock-free variant for applications that require non-blocking behavior.
The idea of a combining-elimination layer could be used to efficiently implement other concurrent data structures.