Behavior of Synchronization Methods in Commonly Used Languages and Systems

Yiannis Nikolakopoulos ([email protected])
Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden

TRANSCRIPT
Developing a multithreaded application…

• The boss wants .NET
• The client wants speed… (C++?)
• Java is nice
• Multicores everywhere
Developing a multithreaded application…

• The worker threads need to access shared data: Concurrent Data Structures
• Then we need synchronization.
Implementing Concurrent Data Structures

Implementation choices:
• Coarse-grain locking
• Fine-grain locking
• Test-and-set (TAS) locks
• Array locks
• Lock-free designs
• And more!

The data structure can become a performance bottleneck, and its behavior also depends on the runtime system and the hardware platform. Which is the fastest/most scalable?
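Two of the spinning techniques above can be made concrete with a minimal Java sketch of TAS and TTAS locks (class and method names are ours for illustration, not taken from the study's code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set lock: every spin iteration performs an atomic write,
// generating cache-coherence traffic on each attempt.
class TASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    void lock()   { while (locked.getAndSet(true)) { /* spin */ } }
    void unlock() { locked.set(false); }
}

// Test-and-test-and-set lock: spins on a local read of the lock word
// and only attempts the atomic write when the lock looks free,
// reducing coherence traffic under contention.
class TTASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    void lock() {
        while (true) {
            while (locked.get()) { /* spin on cached value */ }
            if (!locked.getAndSet(true)) return;
        }
    }
    void unlock() { locked.set(false); }
}
```

Both protect a critical section by bracketing it with lock()/unlock(); the difference only shows up as contention grows, which is exactly what the measurements below probe.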
Problem Statement

• How does the interplay of the above parameters and the different synchronization methods affect the performance and behavior of concurrent data structures?
Outline

• Introduction
• Experiment Setup
• Highlights of Study and Results
• Conclusion
Which data structures to study?

They represent different levels of contention:
• Queue - 1 or 2 contention points
• Hash table - multiple contention points
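The queue's "1 or 2 contention points" can be illustrated with a sketch in the spirit of Michael and Scott's classic two-lock queue (names are ours): enqueuers contend only on the tail lock, dequeuers only on the head lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Two-lock linked-list queue: enqueue and dequeue each touch exactly
// one contention point (the tail lock and the head lock, respectively).
class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T v) { value = v; }
    }

    private Node<T> head, tail;                       // head is a dummy node
    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();

    TwoLockQueue() { head = tail = new Node<>(null); }

    void enqueue(T v) {
        Node<T> n = new Node<>(v);
        tailLock.lock();                              // producers serialize here
        try { tail.next = n; tail = n; } finally { tailLock.unlock(); }
    }

    T dequeue() {                                     // returns null if empty
        headLock.lock();                              // consumers serialize here
        try {
            Node<T> first = head.next;
            if (first == null) return null;
            head = first;
            return first.value;
        } finally { headLock.unlock(); }
    }
}
```

A hash table, by contrast, spreads operations over many buckets, giving multiple independent contention points.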
How do we choose an implementation?

Possible criteria:
• Framework dependencies
• Programmability
• “Good” performance
Interpreting “good”

• Throughput: the more operations completed per time unit, the better.
• Is this enough?
What to measure?

• Throughput: data structure operations completed per time unit.
• Non-fairness: throughput alone can hide that some threads complete far fewer operations than others.
• Fairness: compares the operations completed by each thread i against the average operations per thread over a measurement interval.
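A sketch of such a fairness measure, assuming the min-over-average formulation the slide's fragments suggest (the study's exact definition may differ):

```java
// Fairness of a measurement interval: the fewest operations completed
// by any thread, divided by the average operations per thread.
// 1.0 means perfectly even progress; values near 0 mean some thread
// made almost no progress (starvation).
class FairnessMetric {
    static double fairness(long[] opsPerThread) {
        long min = Long.MAX_VALUE, sum = 0;
        for (long ops : opsPerThread) {
            min = Math.min(min, ops);
            sum += ops;
        }
        double avg = (double) sum / opsPerThread.length;
        return avg == 0 ? 1.0 : min / avg;
    }
}
```

For example, four threads completing 100 operations each score 1.0, while one starved thread drags the score toward 0 even if total throughput is unchanged.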
Implementation Parameters

• Programming environments: C++, Java, C# (.NET, Mono)
• Synchronization methods (all environments): TAS, TTAS, Lock-free, Array lock
  - C++: PMutex, lock-free memory management
  - Java: Reentrant lock, synchronized
  - C#: lock construct, Mutex
• NUMA architectures:
  - Intel Nehalem, 2 x 6 cores (24 HW threads)
  - AMD Bulldozer, 4 x 12 cores (48 HW threads)

Do they influence fairness?
Experiment Parameters

• Different levels of contention
• Number of threads
• Measured time intervals
Outline

• Introduction
• Experiment Setup
• Highlights of Study and Results
  - Queue: fairness, Intel vs AMD, throughput vs fairness
  - Hash table: Intel vs AMD, scalability
• Conclusion
Observations: Queue

Fairness can change along different measurement intervals (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), C# (.NET); series: Intel Lock-free, AMD Lock-free, Intel TAS, AMD TAS]
Observations: Queue

Significantly different fairness behavior on different architectures (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), Java; series: Intel TAS, TTAS, Synchronized, Lock-free]
Observations: Queue

Significantly different fairness behavior on different architectures (24 threads, high contention); lock-free is less affected in this case.
[Chart: fairness vs measurement interval (400-10000 ms), Java; series: Intel and AMD TAS, TTAS, Synchronized, Lock-free]
Queue: Throughput vs Fairness

[Charts: fairness (0.6 s interval) and throughput (operations per ms, thousands) vs number of threads (2-48), C++ on Intel; series: TTAS, Lock-free, PMutex]
Observations: Hash table

• Operations are distributed over different buckets
• Things get interesting when #threads > #buckets
• Tradeoff between throughput and fairness
  - Different winners and losers
  - Contention is lowered in the linked-list components
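The #threads vs #buckets effect can be made concrete with a minimal per-bucket (fine-grain) locking hash set sketch (an assumed structure for illustration, not the study's code): threads hashing to different buckets proceed in parallel, and contention concentrates once threads outnumber buckets.

```java
import java.util.LinkedList;
import java.util.List;

// Hash set with one lock per bucket: multiple contention points.
// With more buckets than threads, operations rarely collide; once
// #threads > #buckets, several threads must queue on the same lock.
class StripedHashSet<T> {
    private final List<T>[] buckets;
    private final Object[] locks;

    @SuppressWarnings("unchecked")
    StripedHashSet(int nBuckets) {
        buckets = new List[nBuckets];
        locks = new Object[nBuckets];
        for (int i = 0; i < nBuckets; i++) {
            buckets[i] = new LinkedList<>();
            locks[i] = new Object();
        }
    }

    private int index(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    boolean add(T item) {
        int i = index(item);
        synchronized (locks[i]) {          // contention limited to this bucket
            if (buckets[i].contains(item)) return false;
            buckets[i].add(item);
            return true;
        }
    }

    boolean contains(T item) {
        int i = index(item);
        synchronized (locks[i]) { return buckets[i].contains(item); }
    }
}
```

Within a bucket, the linked list keeps critical sections short, which is one reason contention is lower in the list components.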
Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention).
[Chart: fairness vs measurement interval (400-10000 ms), C# (Mono); series: Intel TAS, TTAS, Lock-free]
Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention); lock-free is again not affected.
[Chart: fairness vs measurement interval (400-10000 ms), C# (Mono); series: Intel and AMD TAS, TTAS, Lock-free]
Observations: Hash table

In C++, custom memory management and lock-free implementations excel in scalability and performance.
[Charts: successful operations per ms (thousands) vs number of threads (2-48); C++ series: TAS, TTAS, Lock-free, Array Lock, PMutex, Lock-free with MM; Java series: TAS, TTAS, Lock-free, Array Lock, Reentrant, Reentrant Fair, Synchronized]
Conclusion

• Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots
• Scalability comes via more complex, inherently parallel designs and implementations
• Tradeoff between throughput and fairness
  - LF hash table
  - Reentrant lock vs Array lock vs LF queue
• Fairness can be heavily influenced by hardware
  - Interesting exceptions

Which is the fastest/most scalable? Is fairness influenced by NUMA?