speed up synchronization locks: how and why?

Speed Up Synchronization Locks: A Scaleform Case Study

Abhishek Agrawal

Software Solutions Group

Agenda
- Common Locking Issues
- Windows* Locking Methodologies and Associated Performance
- User-Level Atomic Locks with Scaleform* Case Study
- Hot Locks and Lock Contention with Flight Simulator* Case Study
- Locks in Intel TBB®
- Summary & Call to Action

Why Care About Locking?

Locking code can be the most frequently run code in a multi-threaded application.

Determining which locking methodology to use can be as critical as identifying parallelism within an application.

Improper use of locking mechanisms can lead to lock stuttering, very high contention, and new types of programming bugs.

Proper use of locks is crucial for multi-threaded applications.

Common Lock Pathologies

Locks can introduce performance and correctness problems.

Some potential problems:
- Deadlock: happens when tasks are trying to acquire more than one lock and each holds some of the locks the other tasks need in order to proceed.
- Convoying: occurs when the operating system interrupts a task that is holding a lock, so every other task that needs the lock queues up behind it until it runs again.
- Priority inversion: refers to the scenario where a lower-priority task holds a shared resource that is required by a higher-priority task.

How to Avoid Lock Pathologies

Deadlocks
- Avoid needing to hold two locks at the same time.
- Always acquire locks in the same order (e.g. outer container and inner container mutexes).
- Use atomic operations.

Convoying & Priority Inversion
- Use atomic operations instead of locks where possible.

Use atomic operations and user-level locks.
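The "same order" rule can be sketched in portable C++. This is an illustration only, not code from the talk: Account, transfer and run_opposing_transfers are hypothetical names, and address order is just one possible fixed order.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical example type: a balance guarded by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves `amount` between two accounts. Both mutexes are always taken in
// address order, so two opposing transfers can never each hold one lock
// while waiting for the other.
void transfer(Account& from, Account& to, long amount) {
    assert(amount >= 0);
    if (&from == &to) return;  // one mutex must never be locked twice
    Account* first = &from;
    Account* second = &to;
    if (std::less<Account*>{}(second, first)) std::swap(first, second);
    std::lock_guard<std::mutex> outer(first->m);
    std::lock_guard<std::mutex> inner(second->m);
    from.balance -= amount;
    to.balance += amount;
}

// Runs a->b and b->a transfers on two threads and returns the combined balance.
long run_opposing_transfers(int iterations) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&a, &b, iterations] {
        for (int i = 0; i < iterations; ++i) transfer(a, b, 1);
    });
    std::thread t2([&a, &b, iterations] {
        for (int i = 0; i < iterations; ++i) transfer(b, a, 1);
    });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If transfer locked from.m before to.m unconditionally, the two threads could each take one mutex and wait forever for the other. With the fixed order, both threads contend for the same first mutex.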

Windows* Locking Methodologies

Interlocked Functions
- Located in kernel32.dll
- Essentially just use atomic instructions

TryEnterCriticalSection (non-blocking)
- Attempts to get a lock N times in ring 3

EnterCriticalSection (blocking)
- Attempts to get the lock one time in ring 3 and then jumps into ring 0

WaitForSingleObject
- Jumps into ring 0 100% of the time, whether the lock is acquired or not
- Mutex and semaphore APIs follow the same path

WaitForSingleObject vs. EnterCriticalSection

WaitForSingleObject
- An overloaded Microsoft API that can be used to check and modify the state of a number of different objects, such as events, jobs, etc.
- Advantage: it can be processed globally, which enables it to be used for synchronization between processes.
- Major disadvantage: it always obtains a kernel lock, so it enters privileged mode (ring 0) whether the lock is acquired or not.

EnterCriticalSection
- Used by surrounding the critical section code with EnterCriticalSection and LeaveCriticalSection API calls.
- Advantage over WaitForSingleObject: it does not enter the kernel unless there is contention on the lock.
- Disadvantages: it is a blocking call, it cannot be processed globally, and there is no guarantee on the order in which threads obtain the lock.

EnterCriticalSection vs. WaitForSingleObject

EnterCriticalSection is much faster with 1 thread (no contention), since it does not jump into the kernel if the lock is acquired.

WaitForSingleObject and EnterCriticalSection have similar costs under high-contention scenarios.

[Figure: Timings for the sample memory management kernel for 1 and 2 threads.]

[Figure: Timings for the sample memory management kernel for 1 to 64 threads.]

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations (http://www.intel.com/performance/resources/limits.htm)

Where Is the Performance Hit?

Windows locking APIs can jump into the operating system kernel.

Both EnterCriticalSection and WaitForSingleObject enter the kernel if there is contention on the lock. The transition from user mode to privileged mode can be costly if it happens too often.

The performance impact is greatest for granular locking, where the lock is acquired and released within hundreds of cycles.

User-level locks should be used for granular operations and in high-contention scenarios.

User-Level Atomic Locks

These use the processor's atomic instructions to atomically update a memory location.

The atomic instructions use a lock prefix on the instruction and have the destination operand assigned to a memory address.

Some of the instructions that can run atomically with a lock prefix on current Intel processors are: ADD, ADC, AND, BTC, BTR, CMPXCHG, DEC, INC, SUB, XOR, XADD, XCHG, etc.
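As a rough illustration of an atomic read-modify-write without any lock, here is a minimal portable sketch. It assumes std::atomic as a stand-in for a hand-written lock-prefixed instruction (on x86 a fetch_add like this is typically compiled to a locked XADD or ADD). The name atomic_increment_demo is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Several threads bump one shared counter with an atomic add.
// No mutex is involved, so no thread can be descheduled while "holding" anything.
long atomic_increment_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Atomic read-modify-write; relaxed ordering is enough for a pure count.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain long and ++counter, the same loop would lose updates, because the read, add and write of different threads could interleave.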

A Sample User-Level Atomic Lock

[Figure: assembly of a simple mutex lock that uses an atomic instruction with a lock prefix to obtain the lock.]

Is it necessary to write assembly to take advantage of user-land locks that use the lock prefix?

Windows Interlocked Functions

Windows provides access to the most frequently used atomic instructions for synchronization through the "interlocked" APIs: InterlockedExchange, InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, InterlockedExchangeAdd, etc.

The APIs reside in kernel32.dll.

The interlocked functions never jump into the Windows kernel.
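A user-level lock built on an atomic exchange, with no assembly, might look like the sketch below. It is an assumption-laden illustration: std::atomic stands in for the interlocked APIs so the code is portable, and SpinLock and count_with_spinlock are hypothetical names, not part of any Windows or Scaleform API.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spin lock: 0 means free, 1 means held.
class SpinLock {
public:
    void lock() {
        // exchange() atomically writes 1 and returns the previous value;
        // the lock is ours only if the previous value was 0.
        while (state_.exchange(1, std::memory_order_acquire) != 0) {
            // Wait on plain loads until the lock looks free, then retry the exchange.
            while (state_.load(std::memory_order_relaxed) != 0) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // A release store is enough to hand the lock back.
        state_.store(0, std::memory_order_release);
    }
private:
    std::atomic<int> state_{0};
};

// Protects a plain (non-atomic) counter with the spin lock.
long count_with_spinlock(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // very short critical section: the case user-level locks suit
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A lock like this never sleeps, so it only makes sense when the protected region is a handful of instructions.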

Atomic Lock (Performance Comparison)

The figure compares the cost of a user-level atomic lock vs. WaitForSingleObject.

Under both high- and low-contention scenarios, the user-level atomic lock is several orders of magnitude cheaper. For this reason, a user-level lock is preferable for frequently called granular locking.

[Figure: Cost of user-level atomic lock vs. WaitForSingleObject for the memory management locking kernel example.]

Scaleform*

Scaleform GFx: The #1 Video Game UI Solution
- GFx is a rich media player that supports Flash
- Licensed for Crysis, Mass Effect, and 150+ games
- Available on all leading PC and console platforms
- Used for menus, HUDs, and animated textures

Recently introduced thread support into GFx for simultaneous playback, optimized loading, ActionScript processing, and other tasks.

Why Is Threaded UI Important?

The future of animated Flash and video textures!

Scaleform* Case Study Summary

Background loading, vector tessellation, Flash playback, and ActionScript execution may require many allocations, which reduce performance.

Solution: an innovative allocator that uses about 35 cycles for allocate/free requests. That optimization is meaningless, however, if the allocator has to be synchronized with a critical section.

In allocation-heavy examples, a system lock can reduce performance by 10-30%.

GLock gives about a 50% locking performance improvement.

Based on the "Fast Critical Sections" post by Vladislav Gelfer on Code Project.

Using Fast Locks in Scaleform*

    volatile DWORD LockedThreadId = 0;

    void GLock::Lock()
    {
        DWORD threadId = GetCurrentThreadId();

        if (threadId != LockedThreadId)
        {
            if ((LockedThreadId == 0) &&
                (InterlockedCompareExchange((long*)&LockedThreadId, threadId, 0) == 0))
            {
                // Single instruction atomic quick-lock was successful.
            }
            else
            {
                // Potentially locked elsewhere, so do a more expensive
                // lock with system wait on semaphore.
                PerfLock(threadId);
            }
        }
        RecursiveLockCount++;
    }

    void GLock::Unlock()
    {
        if (--RecursiveLockCount == 0)
        {
            // Release lock does not need atomic op on Intel Architecture!
            LockedThreadId = 0;

            // Release other system semaphore waiters, if any.
        }
    }
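The same idea (owner check, one compare-and-swap on the fast path, recursion count) can be sketched portably. This is not the Scaleform implementation: RecursiveFastLock and count_with_recursive_lock are hypothetical names, std::atomic replaces InterlockedCompareExchange, and the slow path simply yields and retries where the slide's PerfLock waits on a system semaphore.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical recursive lock with a compare-and-swap fast path.
class RecursiveFastLock {
public:
    void lock() {
        const std::thread::id self = std::this_thread::get_id();
        // Only the owner can ever observe its own id here, so re-entry needs no atomic RMW.
        if (owner_.load(std::memory_order_relaxed) != self) {
            std::thread::id expected{};  // a default-constructed id means "unowned"
            while (!owner_.compare_exchange_weak(expected, self,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
                expected = std::thread::id{};  // a failed CAS overwrote it with the current owner
                std::this_thread::yield();     // stand-in for a blocking wait
            }
        }
        ++depth_;  // touched only while the lock is held
    }
    void unlock() {
        assert(owner_.load(std::memory_order_relaxed) == std::this_thread::get_id());
        if (--depth_ == 0) {
            // Releasing is a plain release store, not a read-modify-write.
            owner_.store(std::thread::id{}, std::memory_order_release);
        }
    }
private:
    std::atomic<std::thread::id> owner_{std::thread::id{}};
    int depth_ = 0;
};

// Each iteration takes the lock twice to exercise re-entry.
long count_with_recursive_lock(int threads, int iterations) {
    RecursiveFastLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                lock.lock();  // nested acquire by the owner
                ++counter;
                lock.unlock();
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A production lock would block waiters instead of yielding in a loop; the sketch only shows why the uncontended path costs a single atomic instruction.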

Scaleform GFx* Multi-threaded Demo
- Play back multiple files at once on separate threads
- ActionScript-intensive Flash file

Finding Lock Contention Using Intel Tools

Lock contention is another major issue that limits scalability and adds complexity.

Intel tools can help in finding high-contention scenarios:
- VTune™: collecting the clock ticks event via event-based sampling with the Intel VTune Analyzer can help determine how much contention is occurring.
- Thread Profiler™: provides an API for users to instrument user synchronization. Spin waits appear as a hashed color in the Thread Profiler GUI.

Please refer to the Intel session on "Comparative Analysis of Game Parallelization" for more details on Thread Profiler.

Contention Using VTune™ (Where to Look)

EnterCriticalSection
- Ring 0 (ntoskrnl.exe) becomes hotter
- In very high contention scenarios, ring 0 becomes hot and the number of context switches becomes very high

TryEnterCriticalSection
- ntdll.dll becomes hotter as you add threads

WaitForSingleObject
- Similar behavior to EnterCriticalSection

Interlocked Functions
- kernel32.dll gets hot

Contention in WaitForSingleObject Using VTune™

[Figure: hot functions within the Windows OS kernel, ntdll.dll, and hal.dll under no contention and high contention for the WaitForSingleObject call.]

Possible Ways to Reduce Lock Contention

Lock striping
- Does your whole array really need to be protected by the same lock, or can you give each element its own lock?

Protect data, not code
- A common technique is to put a lock around the whole function call. Remember that only the data needs to be protected, not the code.

Use reader-writer locks where applicable
- For cases where many threads read a memory location that is rarely changed.
- Ensures that multiple readers can enter the lock at the same time.
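The reader-writer case can be sketched with the C++17 standard library. This is an illustration under assumptions: std::shared_mutex stands in for whichever reader-writer lock a project actually uses, and SharedSetting and rw_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical rarely-written, frequently-read value.
class SharedSetting {
public:
    int read() const {
        // Shared mode: any number of readers may hold the lock together.
        std::shared_lock<std::shared_mutex> guard(m_);
        return value_;
    }
    void write(int v) {
        // Exclusive mode: waits until no reader or writer holds the lock.
        std::unique_lock<std::shared_mutex> guard(m_);
        value_ = v;
    }
private:
    mutable std::shared_mutex m_;
    int value_ = 0;
};

// One writer publishes 1..writes while several readers poll.
// Returns true if no reader saw the value go backwards and the last write is visible.
bool rw_demo(int readers, int reads_per_reader, int writes) {
    assert(readers > 0 && reads_per_reader >= 0 && writes >= 0);
    SharedSetting setting;
    std::atomic<bool> monotonic{true};
    std::vector<std::thread> pool;
    for (int r = 0; r < readers; ++r) {
        pool.emplace_back([&setting, &monotonic, reads_per_reader] {
            int last = 0;
            for (int i = 0; i < reads_per_reader; ++i) {
                const int v = setting.read();
                if (v < last) monotonic.store(false);
                last = v;
            }
        });
    }
    for (int v = 1; v <= writes; ++v) setting.write(v);
    for (auto& th : pool) th.join();
    return monotonic.load() && setting.read() == writes;
}
```

The benefit over a plain mutex shows up only when reads heavily outnumber writes; with frequent writes the extra bookkeeping of a reader-writer lock can cost more than it saves.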

Microsoft Flight Simulator* Case Study

Multi-threading goals
- Separate terrain processing from rendering
  - The game is loaded once at the beginning
  - The engine keeps loading content in the background while playing
  - The main thread runs D3D, physics, etc.
  - All other threads load and pre-process the terrain textures and other content
- Load and process textures without slowing down the frame rate
  - Expected to scale by processing more content as more processors become available

Locking Problem

Symptoms and thread profiling
- Occasional stuttering
- Doesn't scale well from 2 to 4 cores because of very high contention

[Figure: two panels, each showing a Main Thread and a BKG Thread timeline.]

Locking Root Cause

Both cases lead to global hash map access.
- Only 1 thread can access the hash map while all other threads are blocked
- The entire hash map was protected by a critical section (probably the worst choice)

Solution
- Protect each bucket in the hash map instead of the whole hash map
  - As long as multiple threads are accessing different buckets, they are safe and don't block each other
- Use a lock-free library
  - Microsoft* internal tools
  - The concept is to have a single thread write, while multiple threads can read at the same time as long as the data is not being written
  - TBB provides a similar locking mechanism
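Per-bucket locking can be sketched as follows. This is not the Flight Simulator code: StripedCounterMap and striped_map_demo are hypothetical names, the bucket count of 16 is arbitrary, and std::mutex stands in for whatever lock each bucket would really use.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical hash map in which every bucket carries its own mutex.
class StripedCounterMap {
public:
    void add(const std::string& key, int delta) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.m);  // locks one bucket, not the whole map
        b.items[key] += delta;
    }
    int get(const std::string& key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.m);
        const auto it = b.items.find(key);
        return it == b.items.end() ? 0 : it->second;
    }
private:
    struct Bucket {
        std::mutex m;
        std::unordered_map<std::string, int> items;
    };
    Bucket& bucket_for(const std::string& key) {
        return buckets_[std::hash<std::string>{}(key) % buckets_.size()];
    }
    std::array<Bucket, 16> buckets_;
};

// Each thread updates its own key plus one key shared by all threads.
// Returns the sum of all counters.
int striped_map_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    StripedCounterMap map;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&map, t, iterations] {
            const std::string own = "texture" + std::to_string(t);
            for (int i = 0; i < iterations; ++i) {
                map.add(own, 1);
                map.add("shared", 1);
            }
        });
    }
    for (auto& th : pool) th.join();
    int total = map.get("shared");
    for (int t = 0; t < threads; ++t) total += map.get("texture" + std::to_string(t));
    return total;
}
```

Threads that hash to different buckets never wait on each other; only updates to keys in the same bucket (such as "shared" here) are serialized.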

Flight Simulator* Result

Reduced stuttering, lower latency in terrain loading, and better visuals without sacrificing frame rates.

Synchronization Primitives in Intel TBB®

Atomic Operations
- High-level abstraction for atomic instructions
- OS/compiler portable
- Supports processors with weak memory consistency (such as Itanium)

Exception-safe Locks

                    Scalable      Fair          Reentrant  Sleeps
    mutex             OS dependent  OS dependent  No         Yes
    spin_mutex        No            No            No         No
    queuing_mutex     Yes           Yes           No         No
    spin_rw_mutex     No            No            No         No
    queuing_rw_mutex  Yes           Yes           No         No

Example TBB® Reader-Writer Lock

If an exception occurs within the protected code block, the destructor automatically releases the lock if it is acquired, avoiding a deadlock.

Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it could upgrade.

    #include "tbb/spin_rw_mutex.h"
    using namespace tbb;

    spin_rw_mutex MyMutex;

    int foo()
    {
        /* Construction of 'lock' acquires 'MyMutex' */
        spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer*/ false);
        …
        if (!lock.upgrade_to_writer()) {
            /* data may have been modified since the last read */
        }
        else {
            /* data was not modified by other thread */
        }
        return 0;
        /* Destructor of 'lock' releases 'MyMutex' */
    }

General Recommendations for TBB® Locks

spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions.

Use queuing_rw_mutex when scalability and fairness are important.

Use a reader-writer mutex to allow non-blocking reads for multiple threads.

Please refer to the Intel session on "Comparative Analysis of Game Parallelization" for more details on TBB.

Summary & Call to Action

An inefficient synchronization strategy can have a big impact on the performance of your multi-threaded application: if it doesn't hit you today, it surely will tomorrow.

Try using user-level atomic locks instead of very expensive kernel locks.

Use Intel tools (VTune™ and Thread Profiler™) to help identify potential lock problems.

Use locks properly to avoid high-contention scenarios and make your code more scalable.

Contact Info

For more info, see our Graphics, Game Development and Threading resources at: http://softwarecommunity.intel.com/

Feel free to contact me directly: [email protected]
