Multithreading Inside Out

Uploaded by alinpandichi, posted 17-Feb-2017

Multithreading

Inside Out

The Black Magic

• 2009 – Product development

– Poor performance 30 / 200 CAPS

– Removed synchronization here and there

– Removed “+” between strings

– Expert advice – too few threads

– Increased heavily the number of threads

– Reached 300 CAPS

The Black Magic

• 2012 – Customer UAT

– Solaris SPARC deployment

– Same platform as before

– 300 CAPS in testing (Solaris Intel)

– 40 CAPS in production elsewhere (Linux Intel)

– Merely reached 5 CAPS

– Crisis; Task force onsite

– Expert advice – too many threads

– Decreased from 600–700 to 100

– The actual advice was… 5

– Reached 15 CAPS

– Applause!

The Black Magic

• 2013 – Support

– Dreadful deadlock in C application

– Identified

– Never really solved

– Solution – timed lock on a mutex

– The customer still fears it

The Black Magic

• 2015 – Support

– 3 JVM-reported deadlocks in 2 months

– Other logical deadlocks in Java applications

– Untouched software

– Known to be running well for years

– Digital madness?

– The rise of a new design pattern

– The deadlocking connection pool

» A connection pool

» A connection with a timer

» A cross monitor locking scheme

• Application: Pool → Connection

• Timer: Connection → Pool

The Black Magic

• All the time

– Events from the future

– Strange behaviors

– Strange statistics

– Memory corruption

Rumors

• Applications run faster without synchronization

• Context switch is very expensive on certain hardware; the number of threads must be decreased

• There is no way that two threads reach this point at the same time

• Operations (like ++) on 32-bit types are atomic. They are executed in a single assembler instruction by the CPU

• Unprotected operations on fundamental types never lead to data inconsistency

• These are statistics counters, they do not need to be precise, hence they do not need to be protected (by synchronization)

Rumors

• I have always seen only one CPU actively running this application; therefore it is like a single thread multiplexing (faking) multiple threads

• It is ok to have a little inconsistency given the big gain in performance

• How to fix deadlocks? – maybe by removing synchronization, at best timing it

• How to fix inconsistencies? – Put some locks here and there and see if things are better

• How to fix performance issues? – Remove some pointless synchronization and, perhaps, add more threads

• The reverse – let’s synchronize this, this and this, maybe some other thread will reach here

• The Questions

• The Research

• The Conclusions

The Engineer Way

• About threads

– What are they?

– How are they working?

– Why/when do we need them?

– What is a context switch?

• About synchronization

– What is it?

– How is it working?

– Why/when do we need it?

– What are the instruments?

– How to avoid deadlocks?

– How do we diagnose and handle a deadlock?

• About performance

– What is the impact of multithreading?

• Slower? How much?

• What to do and what not to?

– What is atomic?

– What is non-blocking?

– How to diagnose and handle bottlenecks?

• About correctness

– What is the right way?

– Can we cheat?

• At which cost?

• What do we gain?

• How safe is it?

The Questions… a few of them

• PThreads Primer – Lewis, Berg, 1996

• POSIX.1-2008, The Open Group Base Specification Issue 7, IEEE Std 1003.1, 2013 Edition

• The Java Language Specification, Java SE 8 Edition

• The Java Virtual Machine Specification, Java SE 8 Edition

• Benchmarks

– C

– Java

The Research… reading & playing

Multithreading… a short introduction

• SMP – Shared-memory, symmetric multiprocessor

– Two or more identical CPUs / cores

– Single, shared memory

Multithreading – hardware at first

[Diagram: SMP hardware – CPU 0 … CPU N, each with a Store Buffer (SB), Internal Cache (I$) and External Cache (E$), connected through the Memory Bus (with a Bus Snooper) to the shared Main Memory and other devices]

Multithreading – hardware at first

[Diagram: a single CPU's memory path – Memory Interface, Store Buffer, Internal Cache, External Cache, Memory Bus with Bus Snooper, Main Memory; a read (#1) and a write (#2) are marked, with #3 showing the Write Order differing from the Execution Order]

• #1 – Mutual exclusion

– Synchronization

• #2, #3 – Store barrier

– [CPU] Flush and store buffer (SPARC stbar)

– Just before critical section ends

Multithreading – hardware at first

• An instance of a computer program being executed

• User space

– User code & Program Counter (PC)

– Stack & Stack Pointer (SP)

– Global data

• Kernel space

– Process Structure

• PID, UID, GID, etc.

• Signal Dispatch Table

• File Descriptors

• Memory Map

• LWPs

– LWP (Lightweight Process)

• ID, Priority, Signal Mask

• Registers, Kernel Stack, etc.

Multithreading – processes

[Diagram: a process – the kernel holds the Process Structure; the user space holds the Data, the Program Counter (PC) and the Stack Pointer (SP)]

• System calls

• State (of computation)

– Program Counter

– Stack Pointer

– General registers

– Memory Management Unit (MMU) paging table

• Context switch: LWP1 → LWP2

– Trap in kernel mode

– Save state in LWP1 structure

– Load state from LWP2 structure

– Return to user mode

Multithreading – processes

• Different flows of work that overlap in time

• Defined by

– Stack & Stack Pointer

– Program Counter

– Thread information

• ID, priority, signal mask, etc.

– CPU registers

Multithreading – threads

[Diagram: user threads – the kernel holds the Process Structure and the LWPs; the user space holds the User Code, the Threads Library Data and the thread structures]

• Threads vs. LWPs

– A LWP can be regarded as a “virtual processor”

– Multiple threads may be scheduled on the same LWP

– The kernel is not aware of the user threads; it is aware only of the LWPs

• Context switch

– Similar to an LWP context switch, but

– Lighter than for a process / LWP

– Usually done in user space

Multithreading – threads

• Concurrency

– On the same CPU

• Parallelism

– At the same time on different CPUs

• Synchronization

– Limit the concurrent access on data

• Critical section

• Scheduling

– Placing threads on CPUs

• Contention

– The fight for a resource (lock)

• Race condition

• TSD – Thread specific data

Multithreading – concepts

• Hardware support

– Atomic instructions

• No matter how many threads

• No matter how many CPUs

• No interrupts (signals, context switch, etc.)

– [CPU] Test And Set (SPARC ldstub) – TAS

• Reads something

• Writes 1

– [CPU] Flush And Store Buffer (SPARC stbar) – FASB

• The Store Buffer content is written to the Main Memory

• Optional

– [CPU] Compare And Swap – CAS

• Two values: expected, update

• Reads something

• If the value is [as] expected, writes update

Multithreading – synchronization
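In Java, CAS surfaces through the Atomic* classes. A minimal sketch of the spinning-CAS pattern (illustration only, not the deck's benchmark code): read the value, compute the update, and retry until no other thread changed it in between.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasIncrement {
    // Spinning CAS: retry the swap until the expected value still holds.
    static int increment(AtomicInteger v) {
        int expected;
        do {
            expected = v.get();
        } while (!v.compareAndSet(expected, expected + 1));
        return expected + 1;
    }

    // Two threads, 100k increments each: CAS loses no updates.
    static int demo() {
        AtomicInteger counter = new AtomicInteger();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) increment(counter); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // always 200000
    }
}
```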

• Goals

– Protect shared data

– Prevent threads from running for nothing

Multithreading – synchronization

• Mutex – Mutual exclusion lock

– TAS to lock

– FASB just before unlock

– trylock

• Semaphore

– post – increase

– wait – attempt to decrease; sleep if already 0

– [POSIX] Async Signal Safe

• Condition variable

– Paired with a mutex

• Locked before calling wait

• Unlocked when sleeping

• Locked again when awakening

– Optionally timed

Multithreading – synchronization variables
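Java's ReentrantLock/Condition pair mirrors the POSIX mutex/condition pairing described above. A sketch (hypothetical CondVar class) showing the three rules: locked before wait, unlocked while sleeping, locked again when awakening, with the optional timeout.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class CondVar {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition nonEmpty = lock.newCondition();
    private int items = 0;

    // The lock must be held before await(); await() releases it while
    // sleeping and re-acquires it before returning. Timed variant.
    boolean takeOne(long timeoutMs) throws InterruptedException {
        lock.lock();
        try {
            while (items == 0) {                      // re-test: wakeups may be spurious
                if (!nonEmpty.await(timeoutMs, TimeUnit.MILLISECONDS))
                    return false;                     // timed out
            }
            items--;
            return true;
        } finally {
            lock.unlock();
        }
    }

    void putOne() {
        lock.lock();
        try {
            items++;
            nonEmpty.signal();                        // wake one sleeping consumer
        } finally {
            lock.unlock();
        }
    }
}
```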

• Read/write lock

• Recursive mutex

• Non-blocking synchronization

– try* variants

• Monitor

– Encapsulate the shared data with the lock

• Spin lock (!)

– Try lock in a loop

– Short critical sections, multiple CPUs

– Needed in very rare cases

• Adaptive spin lock

– Used in kernels

– Spin only while the owner is known to be running

Multithreading – synchronization variables

• Barrier

– Defines a synchronization point

– Blocks until N threads reach it

• Join

• Inter-process synchronization variables

Multithreading – synchronization variables

• Thread sleep is NOT a synchronization variable

• Sleep does not affect any synchronization variable

• Must NOT be used to “avoid” race conditions

• One possible good use:

– Have a non-blocking queue

– Consumers do a spinning sleep on an empty queue

– High latency, excellent throughput

Multithreading – thread sleep
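A sketch of that one good use (hypothetical names): the consumer polls a non-blocking queue and sleeps briefly only when it is empty, trading latency for throughput.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class SpinningSleepConsumer {
    // Spinning sleep: poll the non-blocking queue; when empty,
    // back off with a short sleep instead of blocking on a lock.
    static int drain(ConcurrentLinkedQueue<Integer> queue, int expected)
            throws InterruptedException {
        int sum = 0, taken = 0;
        while (taken < expected) {
            Integer item = queue.poll();       // non-blocking take
            if (item == null) {
                Thread.sleep(1);               // empty: sleep, then retry
            } else {
                sum += item;
                taken++;
            }
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 5; i++) queue.add(i);
        });
        producer.start();
        System.out.println(drain(queue, 5)); // prints 15
        producer.join();
    }
}
```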

• Types

– Asynchronous

• Inconsistent application state

• Async Cancel Safe

– Deferred [POSIX]

• Cancellation points

– NO – mutex

– YES – conditions, semaphores, unexpected

• Cleanup handlers

• Difficulties

– Joining another thread

– Sleeping – mutex or condition waiting

• Not bounded CPU time – “soon”

• Avoid cancellation if at all possible

– Extremely difficult to make it right

– Simple polling

Multithreading – thread cancellation

• A wakeup signal on a condition variable is lost

• Various sources, various scenarios

• Possible scenario

– Producer / Consumer with a shared queue

– [1] C sleeps waiting for something in the queue

– [2] P puts En in the queue and sends wakeup

– [3] C is awakened

– [4] C takes En from the queue

– [5] C goes to sleep

– Loop to [2]

• Background information

– Put to and take from the queue are protected by a mutex

– C waits on a condition associated with the above mutex

– P sends wakeup on the above condition

Multithreading – synchronization problems – the lost wakeup
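The standard defense against the lost wakeup, sketched with Java's built-in monitor: the consumer re-tests the condition in a loop after every wakeup, and the producer signals while still holding the lock, so a wakeup cannot fall between the consumer's test and its sleep.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SafeQueue<T> {
    private final Queue<T> queue = new ArrayDeque<>();

    // [P] put under the lock; signal while still holding it, so C is
    // either already waiting or has not yet tested the queue.
    public synchronized void put(T item) {
        queue.add(item);
        notify();
    }

    // [C] re-test the condition in a loop after every wakeup.
    public synchronized T take() throws InterruptedException {
        while (queue.isEmpty())
            wait();               // releases the monitor while sleeping
        return queue.remove();
    }
}
```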

• Definition

– Two or more competing actions are each waiting for the other to finish, and thus neither ever does

• Livelock

– The states of the processes involved constantly change with regard to one another, none progressing

Multithreading – synchronization problems – deadlocks

• Coffman conditions (Edward G. Coffman, Jr, 1971)

1. Mutual exclusion condition

2. Hold and wait condition

3. No-preemption condition

4. Circular wait condition

• Avoiding

– [4] Use the same locking order

– [1] Non-blocking synchronization

• Special situations

– Thread death while holding locks

– Attempting to acquire an already held lock (non-reentrant)

• Recovery is very, very tricky

– Should only be attempted in multi-process synchronization

Multithreading – synchronization problems – deadlocks
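Avoidance rule [4] sketched in Java, with a hypothetical Account class: both monitors are always taken in the same global order (here, lowest id first), so a circular wait cannot form even when two threads transfer in opposite directions.

```java
public class OrderedTransfer {
    static class Account {
        final long id;            // global ordering key
        long balance;
        Account(long id, long balance) { this.id = id; this.balance = balance; }
    }

    // Breaks Coffman condition [4]: locks are acquired lowest id first,
    // regardless of the transfer direction.
    static void transfer(Account from, Account to, long amount) {
        Account first  = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}
```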

• Indeterminacy in deterministic programs

• Typical – missing locking protection

• Out-of-order locking (example?)

• Result depending on the execution order

– Java volatile ++

• volatile int i = 0

• 2 threads attempting to do i++ (2 operations = one read + one write)

• Possible scenarios

– T1 read, T1 write, T2 read, T2 write → 2

– T1 read, T2 read, T1 write, T2 write → 1

– …

• ABA problem

– T1 reads A

– T2 changes A to B, then B to A

– T1 reads A – “nothing has changed”

Multithreading – synchronization problems – race conditions
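The volatile ++ race above can be reproduced directly (a minimal sketch; the exact final value varies from run to run, but updates are typically lost):

```java
public class VolatileRace {
    static volatile int i = 0;

    // volatile makes each read and write visible, but i++ is still
    // two operations (read + write), so concurrent increments collide.
    static int race() {
        i = 0;
        Runnable inc = () -> { for (int k = 0; k < 100_000; k++) i++; };
        Thread t1 = new Thread(inc), t2 = new Thread(inc);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return i;                 // usually well below 200000
    }

    public static void main(String[] args) {
        System.out.println(race());
    }
}
```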

• Throughput vs. Latency

• Speedup Limits (on increasing no. of CPUs)

– Synchronization time

– Main memory access

– I/O subsystem

• Amdahl’s Law (Gene Amdahl, 1967)

– “If a program has one section which is parallelizable, and another section which must run serially, then the program execution time will asymptotically approach the time for the serial section as more CPUs are added.”

• Gustafson’s Law (John L. Gustafson, 1988)

– “The problem size scales with the number of processors.”

Multithreading – performance

• Algorithm

• Correct compiler / VM configuration

• Enough RAM

• Minimize I/O

• Minimize cache misses

• Loop optimizations

• Thread specific optimizations

Multithreading – performance optimizations

• Reduce contention

– Fine-grained locking (divide data)

– Use read/write locks

– Use non-blocking synchronization

– Use spin locks (!)

• Minimize MT overhead

– Choose the right granularity

• Fine-grained locking increases the overhead

– Avoid cancellation and signal handling

• Right number of threads

– Avoid unnecessary context switches

• Avoid short-lived threads

• Optimize critical sections

– Get out of them as soon as possible

Multithreading – performance – thread-specific optimizations

• We may sometimes skip locking shared data

• Global variables used as constants

– Initialized once by the main thread

– Before any other thread is created

– Never-ever changed

• Test a variable on a probably-correct basis

– Only when there is no need for correctness

– May improve approaching a critical section

• Spin lock improvement

Multithreading – cheating
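The probably-correct test before a critical section is the classic test-and-test-and-set spin lock. A sketch in Java (for illustration only; as noted above, spin locks are needed in very rare cases): spin on a cheap, cache-local read and attempt the atomic TAS only when the lock looks free.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TtasSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        for (;;) {
            // Probably-correct test: plain read, no atomic bus traffic.
            while (locked.get())
                Thread.onSpinWait();
            // Only now attempt the expensive atomic test-and-set.
            if (locked.compareAndSet(false, true))
                return;
        }
    }

    public void unlock() {
        locked.set(false);   // volatile write publishes the release
    }
}
```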

Multithreading – Java memory model

[Diagram: Java memory model – each Thread has its own Local Memory; reads and writes move data between the Local Memory and the Shared Memory]

• Synchronization actions

– Volatile read

– Volatile write

– Lock a monitor

– Unlock a monitor

– First and last action of a thread

– Start a thread and detect a thread termination

• Monitor

– Associated with each object

– Both lock and condition

– Mutex – synchronized statement

• Recursive

– Condition – wait and notify

• The local memory

– Acts like the CPU external cache

– Synchronization actions act like FASB

• Notable:

– final is like the “cheating” constant

– Word Tearing

• byte[] on processors unable to write single bytes

– Non-atomic treatment of long and double

• Two separate writes (each 32-bit half)

Multithreading – Java memory model

• Unsafe class

– Offers at Java level:

• Several types of CAS and

• Kind of FASB (without monitors)

– The base of all “concurrent” classes

• Atomic* classes

– Spinning CAS on a volatile variable

– Long/Double Accumulator/Adder

• Locks

– Read/Write Lock

– Lock with try lock, interruptible lock, etc.

• Semaphore

– Release == put

– Acquire == wait

Multithreading – Java advanced synchronization
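A sketch contrasting the two counter styles from the Atomic* bullet (assumed workload: two threads, 100k increments each): AtomicLong does a spinning CAS on a single hot variable, while LongAdder stripes contended increments over cells and sums them on read.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class Counters {
    static long[] demo() {
        AtomicLong atomic = new AtomicLong();   // one hot variable, spinning CAS
        LongAdder adder = new LongAdder();      // striped cells, summed on read
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                atomic.incrementAndGet();
                adder.increment();              // contended threads hit different cells
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return new long[] { atomic.get(), adder.sum() };
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println(r[0] + " " + r[1]); // both 200000
    }
}
```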

• Concurrent structures

– ConcurrentHashMap

• Non-blocking reads

– ConcurrentLinkedQueue

• Non-blocking reads and writes

– …

• CountDownLatch

– A simple kind of barrier

• CyclicBarrier

– A barrier that can be reused

• Phaser

– A versatile kind of barrier

– Can be reused

– Flexible (registration, tiering, etc.)

Multithreading – Java advanced synchronization
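A minimal CountDownLatch sketch (hypothetical runWorkers helper): the waiter proceeds only after all N workers reach the synchronization point, and countDown()/await() also give the happens-before edge that makes the workers' writes visible.

```java
import java.util.concurrent.CountDownLatch;

public class LatchDemo {
    // N workers each count down once; the waiter blocks until the count hits 0.
    static int runWorkers(int n) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(n);
        int[] results = new int[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            new Thread(() -> {
                results[id] = id * id;   // some work
                done.countDown();        // this worker reached the barrier point
            }).start();
        }
        done.await();                    // also makes results[] writes visible
        int sum = 0;
        for (int r : results) sum += r;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(4)); // 0 + 1 + 4 + 9 = 14
    }
}
```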

Benchmarks – testing environment

• Machine

– CPU: Intel Core 2 Duo P8700, 2 x 2.53GHz

– Memory: 4G

– OS: Linux x86-64, Kernel 4.2.0, SMP

– C: GCC 5.2.1, POSIX threads

– Java: 1.8.0_60, Oracle, 64-Bit Server VM

• Tests

– Languages: C, Java

– Iterations: 100M / thread

– Threads: 1, 2, 3, 4, 8 and 100

– Parallel: threads operating on distinct objects

Benchmarks – testing summary

• T1 – 64bit unprotected increment

– Typical “it’s ok” approach

• T1P – T1 parallel

• T2 – 64bit protected increment

– Old school mutex protection

• T2P – T2 parallel

• T3 – 64bit simple increment, protected read & write

– Typical mistake – volatile increment

• T3P – T3 parallel

• T4 – 64bit spinning CAS increment, protected read

• T4P – T4 parallel

• T5 – java queues – blocking vs. spinning sleep
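The four increment variants could be sketched in Java as follows (an illustration of what T1–T4 measure; the deck does not show the actual benchmark code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class IncrementVariants {
    static long t1;                                // T1: unprotected
    static long t2;                                // T2: monitor (mutex) protected
    static volatile long t3;                       // T3: volatile read & write only
    static final AtomicLong t4 = new AtomicLong(); // T4: spinning CAS

    static void incT1() { t1++; }                  // racy: updates may be lost
    static synchronized void incT2() { t2++; }     // correct, blocking
    static void incT3() { t3++; }                  // the typical mistake:
                                                   // read + write are not atomic
    static void incT4() { t4.incrementAndGet(); }  // correct, non-blocking
}
```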

Benchmarks – C: T1 – 64bit unprotected increment

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T1P – 64bit unprotected increment, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T2 – 64bit protected increment

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T2 – 64bit protected increment, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T3 – 64bit simple increment, protected read & write

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T3 – 64bit increment, protected read & write, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T4 – 64bit spinning CAS increment, protected read

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T4 – 64bit spinning CAS increment, protected read, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – C: T2 – synchronization vs. T4 – CAS spinning

[Chart: T2 Performance (ME/s) vs. T4 Performance (ME/s) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T1 – 64bit unprotected increment

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T1P – 64bit unprotected increment, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T2 – 64bit protected increment

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T2 – 64bit protected increment, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T3 – 64bit simple increment, protected read & write

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T3 – 64bit increment, protected read & write, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T4 – 64bit spinning CAS increment, protected read

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T4 – 64bit spinning CAS increment, protected read, parallel

[Chart: Performance (ME/s, millions) and Error (%) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T2 – synchronization vs. T4 – CAS spinning

[Chart: T2 Performance (ME/s) vs. T4 Performance (ME/s) over thread counts 1, 2, 3, 4, 8, 100]

Benchmarks – Java: T5 – Java queues – blocking vs. spinning sleep

[Chart: Linked Blocking Queue (MO/s) vs. Spinning Sleep Queue (MO/s) over thread counts 1, 2, 3, 4, 8, 100]

• Lock shared data – ALWAYS

– Correctness is not negotiable

• Lock acquisition – always use the SAME locking order

• Define data model and threading model:

– Avoid contention

– Avoid losing control of threads (model, count, etc.)

• Beware of the logical flow

– Lock acquisition order

– Execution order

• Remember – volatile ++

• Unless they are TRULY needed, avoid

– Optimization

– Synchronization

• Do NOT cancel / forcefully stop threads

The Conclusion… to do or not to do

• Multithreading is a wonderful thing

• I dare you to READ about it, to THINK about it

• Easy it is not… but try to DO IT RIGHT

• Then TEACH the others about doing it right

• And, please, remember:

• Multithreading can get you to the sky, but, as well, it can get you on the ground.

The Conclusion… sort of

Computaris – www.computaris.com

Copyright © Computaris

Thank you!