Behavior of Synchronization Methods in Commonly Used Languages and Systems

Yiannis Nikolakopoulos ([email protected])
Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden

TRANSCRIPT
Developing a multithreaded application…

• The boss wants .NET
• The client wants speed… (C++?)
• Java is nice
• Multicores everywhere
Developing a multithreaded application…

• The worker threads need to access shared data: Concurrent Data Structures
• Then we need synchronization.
Implementing Concurrent Data Structures

Implementation choices:
• Coarse-grain locking
• Fine-grain locking
• Test-and-set (TAS) locks
• Array locks
• Lock-free designs
• And more!

The data structure can become a performance bottleneck, and its behavior also depends on the runtime system and the hardware platform. Which is the fastest/most scalable?
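Two of the spinning techniques above can be made concrete with a minimal Java sketch of TAS and TTAS locks (class and method names are ours for illustration, not taken from the study's code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set lock: every spin iteration performs an atomic write,
// generating cache-coherence traffic on each attempt.
class TASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    void lock()   { while (locked.getAndSet(true)) { /* spin */ } }
    void unlock() { locked.set(false); }
}

// Test-and-test-and-set lock: spins on a local read of the lock word
// and only attempts the atomic write when the lock looks free,
// reducing coherence traffic under contention.
class TTASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    void lock() {
        while (true) {
            while (locked.get()) { /* spin on cached value */ }
            if (!locked.getAndSet(true)) return;
        }
    }
    void unlock() { locked.set(false); }
}
```

Both protect a critical section by bracketing it with lock()/unlock(); the difference only shows up as contention grows, which is exactly what the measurements below probe.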
Problem Statement

• How does the interplay of the above parameters and the different synchronization methods affect the performance and behavior of concurrent data structures?
Outline

• Introduction
• Experiment Setup
• Highlights of Study and Results
• Conclusion
Which data structures to study?

They represent different levels of contention:
• Queue - 1 or 2 contention points
• Hash table - multiple contention points
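The queue's "1 or 2 contention points" can be illustrated with a sketch in the spirit of Michael and Scott's classic two-lock queue (names are ours): enqueuers contend only on the tail lock, dequeuers only on the head lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Two-lock linked-list queue: enqueue and dequeue each touch exactly
// one contention point (the tail lock and the head lock, respectively).
class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T v) { value = v; }
    }

    private Node<T> head, tail;                       // head is a dummy node
    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();

    TwoLockQueue() { head = tail = new Node<>(null); }

    void enqueue(T v) {
        Node<T> n = new Node<>(v);
        tailLock.lock();                              // producers serialize here
        try { tail.next = n; tail = n; } finally { tailLock.unlock(); }
    }

    T dequeue() {                                     // returns null if empty
        headLock.lock();                              // consumers serialize here
        try {
            Node<T> first = head.next;
            if (first == null) return null;
            head = first;
            return first.value;
        } finally { headLock.unlock(); }
    }
}
```

A hash table, by contrast, spreads operations over many buckets, giving multiple independent contention points.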
How do we choose an implementation?

Possible criteria:
• Framework dependencies
• Programmability
• “Good” performance
Interpreting “good”

• Throughput: the more operations completed per time unit, the better.
• Is this enough?
What to measure?

• Throughput: data structure operations completed per time unit.
• Non-fairness: throughput alone can hide that some threads complete far fewer operations than others.
• Fairness: compares the operations completed by each thread i against the average operations per thread over a measurement interval.
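A sketch of such a fairness measure, assuming the min-over-average formulation the slide's fragments suggest (the study's exact definition may differ):

```java
// Fairness of a measurement interval: the fewest operations completed
// by any thread, divided by the average operations per thread.
// 1.0 means perfectly even progress; values near 0 mean some thread
// made almost no progress (starvation).
class FairnessMetric {
    static double fairness(long[] opsPerThread) {
        long min = Long.MAX_VALUE, sum = 0;
        for (long ops : opsPerThread) {
            min = Math.min(min, ops);
            sum += ops;
        }
        double avg = (double) sum / opsPerThread.length;
        return avg == 0 ? 1.0 : min / avg;
    }
}
```

For example, four threads completing 100 operations each score 1.0, while one starved thread drags the score toward 0 even if total throughput is unchanged.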
Implementation Parameters

• Programming environments: C++, Java, C# (.NET, Mono)
• Synchronization methods (all environments): TAS, TTAS, Lock-free, Array lock
  - C++: PMutex, lock-free memory management
  - Java: Reentrant lock, synchronized
  - C#: lock construct, Mutex
• NUMA architectures:
  - Intel Nehalem, 2 x 6 cores (24 HW threads)
  - AMD Bulldozer, 4 x 12 cores (48 HW threads)

Do they influence fairness?
Experiment Parameters

• Different levels of contention
• Number of threads
• Measured time intervals
Outline

• Introduction
• Experiment Setup
• Highlights of Study and Results
  - Queue: fairness, Intel vs AMD, throughput vs fairness
  - Hash table: Intel vs AMD, scalability
• Conclusion
Observations: Queue

Fairness can change along different measurement intervals (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), C# (.NET); series: Intel Lock-free, AMD Lock-free, Intel TAS, AMD TAS]
Observations: Queue

Significantly different fairness behavior on different architectures (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), Java; series: Intel TAS, TTAS, Synchronized, Lock-free]
Observations: Queue

Significantly different fairness behavior on different architectures (24 threads, high contention); lock-free is less affected in this case.
[Chart: fairness vs measurement interval (400-10000 ms), Java; series: Intel and AMD TAS, TTAS, Synchronized, Lock-free]
Queue: Throughput vs Fairness

[Charts: fairness (0.6 s interval) and throughput (operations per ms, thousands) vs number of threads (2-48), C++ on Intel; series: TTAS, Lock-free, PMutex]
Observations: Hash table

• Operations are distributed over different buckets
• Things get interesting when #threads > #buckets
• Tradeoff between throughput and fairness
  - Different winners and losers
  - Contention is lowered in the linked-list components
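The #threads vs #buckets effect can be made concrete with a minimal per-bucket (fine-grain) locking hash set sketch (an assumed structure for illustration, not the study's code): threads hashing to different buckets proceed in parallel, and contention concentrates once threads outnumber buckets.

```java
import java.util.LinkedList;
import java.util.List;

// Hash set with one lock per bucket: multiple contention points.
// With more buckets than threads, operations rarely collide; once
// #threads > #buckets, several threads must queue on the same lock.
class StripedHashSet<T> {
    private final List<T>[] buckets;
    private final Object[] locks;

    @SuppressWarnings("unchecked")
    StripedHashSet(int nBuckets) {
        buckets = new List[nBuckets];
        locks = new Object[nBuckets];
        for (int i = 0; i < nBuckets; i++) {
            buckets[i] = new LinkedList<>();
            locks[i] = new Object();
        }
    }

    private int index(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    boolean add(T item) {
        int i = index(item);
        synchronized (locks[i]) {          // contention limited to this bucket
            if (buckets[i].contains(item)) return false;
            buckets[i].add(item);
            return true;
        }
    }

    boolean contains(T item) {
        int i = index(item);
        synchronized (locks[i]) { return buckets[i].contains(item); }
    }
}
```

Within a bucket, the linked list keeps critical sections short, which is one reason contention is lower in the list components.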
Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), C# (Mono); series: Intel TAS, TTAS, Lock-free]
Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention); lock-free is again not affected.
[Chart: fairness vs measurement interval (400-10000 ms), C# (Mono); series: Intel and AMD TAS, TTAS, Lock-free]
Observations: Hash table

In C++, custom memory management and lock-free implementations excel in scalability and performance.
[Charts: successful operations per ms (thousands) vs number of threads (2-48); C++ series: TAS, TTAS, Lock-free, Array Lock, PMutex, Lock-free with MM; Java series: TAS, TTAS, Lock-free, Array Lock, Reentrant, Reentrant Fair, Synchronized]
Conclusion

• Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots
• Scalability comes via more complex, inherently parallel designs and implementations
• Tradeoff between throughput and fairness
  - LF hash table
  - Reentrant lock vs Array lock vs LF queue
• Fairness can be heavily influenced by hardware
  - Interesting exceptions

Which is the fastest/most scalable? Is fairness influenced by NUMA?