Speed Up Synchronization Locks: How and Why?
DESCRIPTION
A brief introduction to synchronization primitives used on gaming consoles and Windows platforms, and ways to identify potential problems with locks using Intel tools. The talk discusses an alternative, optimized implementation of the Windows critical section, with Scaleform as a case study highlighting the importance of using optimized locks.
TRANSCRIPT
Speed Up Synchronization Locks: A Scaleform Case Study
Abhishek Agrawal
Software Solutions Group

Agenda
Common Locking Issues
Windows* Locking Methodologies and Associated Performance
User-Level Atomic Locks with Scaleform* Case Study
Hot Locks and Lock Contention with Flight Simulator* Case Study
Locks in Intel® TBB
Summary & Call to Action

Why Care About Locking?
Locking code can be the most frequently run code in a multi-threaded application.
Choosing which locking methodology to use can be as critical as identifying parallelism within an application.
Improper use of locking mechanisms can lead to lock stuttering, very high contention, and new types of programming bugs.
Proper use of locks is crucial for multi-threaded applications.

Common Lock Pathologies
Locks can introduce performance and correctness problems.
Some potential problems:
– Deadlock
  Happens when tasks try to acquire more than one lock and each holds some of the locks the other tasks need in order to proceed.
– Convoying
  Occurs when the operating system interrupts a task that is holding a lock; other tasks that need the lock must then wait until the interrupted task runs again.
– Priority Inversion
  Refers to the scenario where a lower-priority task holds a shared resource that is required by a higher-priority task.

How to Avoid Lock Pathologies
Deadlocks
– Avoid needing to hold two locks at the same time
– Always acquire locks in the same order (e.g. outer container and inner container mutexes)
– Use atomic operations
Convoying & Priority Inversion
– Use atomic operations instead of locks where possible
Use Atomic Operations and User-Level Locks
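
The "always acquire locks in the same order" rule can be sketched in portable C++11. This is only an illustration under stated assumptions: the `Account` type and the `transfer` and `run_opposing_transfers` functions are hypothetical names, and `std::mutex` stands in for whatever lock the application actually uses. `std::lock` acquires both mutexes with a deadlock-avoidance algorithm, so two threads transferring in opposite directions cannot each hold one lock while waiting for the other.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical example type: each account is protected by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves `amount` between two accounts. std::lock acquires both mutexes
// without deadlocking, regardless of the order the arguments are given in.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // locking the same mutex twice would be an error
    std::lock(from.m, to.m);
    // Adopt the already-held locks so they are released on scope exit.
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Runs opposing transfers on two threads and returns the combined balance,
// which must be unchanged if every transfer was applied atomically.
long run_opposing_transfers(int iterations) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < iterations; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < iterations; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If each thread instead locked `from.m` and then `to.m` by hand, the two threads above would take the mutexes in opposite orders, which is exactly the deadlock pattern the slide warns about.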
Windows* Locking Methodologies
Interlocked Functions
– Located in kernel32.dll
– Essentially just utilize atomic instructions
TryEnterCriticalSection (Non-Blocking)
– Attempts to get a lock N times in ring 3
EnterCriticalSection (Blocking)
– Attempts to get the lock one time in ring 3 and then jumps into ring 0
WaitForSingleObject
– Jumps into ring 0 100% of the time, whether the lock is acquired or not
– Mutex and Semaphore APIs follow the same path

WaitForSingleObject vs. EnterCriticalSection
WaitForSingleObject
– An overloaded Microsoft API that can be used to check and modify the state of a number of different objects, such as events, jobs, etc.
– Advantage: it can be processed globally, which enables it to be used for synchronization between processes
– Major disadvantage: it always obtains a kernel lock, so it enters privileged mode (ring 0) whether the lock is acquired or not
EnterCriticalSection
– Used by placing EnterCriticalSection and LeaveCriticalSection API calls around the critical section code
– Advantage over WaitForSingleObject: it will not enter the kernel unless there is contention on the lock
– Disadvantages: it is a blocking call, it cannot be processed globally, and there is no guarantee on the order in which threads obtain the lock

EnterCriticalSection vs. WaitForSingleObject
EnterCriticalSection is much faster with 1 thread (no contention), since it will not jump into the kernel if the lock is acquired.
WaitForSingleObject and EnterCriticalSection have similar costs under high-contention scenarios.
Timings for the sample memory management kernel for 1 and 2 threads.
Timings for the sample memory management kernel for 1 to 64 threads.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations (http://www.intel.com/performance/resources/limits.htm)

Where Is the Performance Hit?
Windows locking APIs can jump into the operating system kernel.
Both EnterCriticalSection and WaitForSingleObject will enter the kernel if there is contention on the lock. The transition from user mode to privileged mode can be costly if performed excessively.
The performance impact is greatest with granular locking, where the lock is acquired and released within hundreds of cycles.
User-Level Locks should be used for Granular Operations and in High-Contention Scenarios
User-Level Atomic Locks
Involve using the processor's atomic instructions to atomically update a memory location.
The atomic instructions use a lock prefix on the instruction and have the destination operand assigned to a memory address.
Some of the instructions that can run atomically with a lock prefix on current Intel processors are: ADD, ADC, AND, BTC, BTR, CMPXCHG, DEC, INC, SUB, XOR, XADD, XCHG, etc.

A Sample User-Level Atomic Lock
The figure shows the assembly of a simple mutex lock that uses an atomic instruction with a lock prefix to obtain the lock.
Is it necessary to write assembly to take advantage of user-land locks that use the lock prefix?

Windows Interlocked Functions
Windows provides access to the most frequently used atomic instructions for synchronization through the "interlocked" APIs: InterlockedExchange, InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, InterlockedExchangeAdd, etc.
The APIs reside in kernel32.dll.
The interlocked functions have no possibility of jumping into the Windows kernel.
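
A user-level lock of this kind does not require hand-written assembly. The sketch below uses C++11 `std::atomic` as a portable stand-in for the Interlocked functions, so it is not the Windows API itself; on Windows, `InterlockedExchange` could play the role that `exchange` plays here. `SpinLock` and `count_with_spinlock` are hypothetical names used only for illustration, and yielding while waiting is an assumption of this sketch rather than something the talk prescribes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal user-level spin lock: acquiring an uncontended lock is a single
// atomic exchange with no system call.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomically set the flag; if it was already set, another thread
        // holds the lock, so keep trying.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Wait on a plain load so the cache line is not hammered with
            // writes, and give up the time slice while the lock is held.
            while (locked.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // A release store is enough to hand the lock over.
        locked.store(false, std::memory_order_release);
    }
};

// Increments a shared counter from several threads under the spin lock and
// returns the final count.
long count_with_spinlock(int threads, int increments) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.lock();
                ++counter;  // the protected, very short critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A lock like this only pays off when the protected region is a handful of instructions; for long critical sections, spinning wastes the cycles that a blocking lock would give back to other threads.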

Atomic Lock (Performance Comparison)
The figure compares the cost of a user-level atomic lock vs. WaitForSingleObject.
Under both high- and low-contention scenarios, the user-level atomic lock is several orders of magnitude cheaper. For this reason, a user-level lock is preferable for frequently called granular locking.
Cost of user-level atomic lock vs. WaitForSingleObject for the memory management locking kernel example
Scaleform*
Scaleform GFx: The #1 Video Game UI Solution
GFx is a rich media player that supports Flash
Licensed for Crysis, Mass Effect, and 150+ games
Available on all leading PC and console platforms
Used for menus, HUDs, and animated textures
Recently introduced thread support into GFx for simultaneous playback, optimized loading, ActionScript processing and other tasks

Why Is Threaded UI Important?
The Future of Animated Flash and Video Textures!

Scaleform* Case Study Summary
Background loading, vector tessellation, Flash playback and ActionScript execution may require many allocations, which reduce performance.
Solution: an innovative allocator that uses about 35 cycles for allocate/free requests. That optimization is meaningless, however, if the allocator needs to be synchronized with a critical section.
In allocation-heavy examples, the system lock can reduce performance by 10-30%.
GLock gives about a 50% improvement in locking performance.
Based on the "Fast Critical Sections" post by Vladislav Gelfer on Code Project.

Using Fast Locks in Scaleform*

volatile DWORD LockedThreadId = 0;

void GLock::Lock()
{
    DWORD threadId = GetCurrentThreadId();
    if (threadId != LockedThreadId)
    {
        if ((LockedThreadId == 0) &&
            (InterlockedCompareExchange((long*)&LockedThreadId, threadId, 0) == 0))
        {
            // Single instruction atomic quick-lock was successful.
        }
        else
        {
            // Potentially locked elsewhere, so do a more expensive
            // lock with system wait on semaphore.
            PerfLock(threadId);
        }
    }
    RecursiveLockCount++;
}

void GLock::Unlock()
{
    if (--RecursiveLockCount == 0)
    {
        // Release lock does not need atomic op on Intel Architecture!
        LockedThreadId = 0;
        // Release other system semaphore waiters, if any.
    }
}

Scaleform GFx* Multi-threaded Demo
Play back multiple files at once on separate threads
ActionScript-intensive Flash file
Finding Lock Contention Using Intel Tools
Lock contention is another major issue that limits scalability and adds complexity.
Intel tools can help in finding high-contention scenarios:
– VTune™
  Collecting the clock ticks event via event-based sampling with the Intel VTune Analyzer can help determine how much contention is occurring
– Thread Profiler™
  Provides an API for users to instrument user synchronization
  Spin waits appear as a hashed color in the Thread Profiler GUI
Please refer to the Intel session on "Comparative Analysis of Game Parallelization" for more details on Thread Profiler.

Contention Using VTune™ (Where to Look)
EnterCriticalSection
– Ring 0 ntoskrnl.exe becomes hotter
– In very high contention scenarios, ring 0 becomes hot and the number of context switches becomes very high
TryEnterCriticalSection
– ntdll.dll becomes hotter as you add threads
WaitForSingleObject
– Similar behavior to EnterCriticalSection
Interlocked Functions
– kernel32.dll becomes hot

Contention in WaitForSingleObject Using VTune™
Example shows the hot functions within the Windows OS kernel, ntdll.dll, and hal.dll under no contention and high contention for WaitForSingleObject call

Possible Ways to Reduce Lock Contention
Lock Striping
– Does your whole array really need to be protected by the same lock, or can you give each element its own lock?
Protect data, not code
– A common technique is to put a lock around the whole function call. Remember that only the data needs to be protected, not the code.
Use Reader-Writer Locks where applicable
– For cases where many threads read a memory location that is rarely changed
– Ensures that multiple readers can enter the lock at the same time
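
The reader-writer idea can be sketched with the C++14 standard library. This is a minimal illustration, not the locks used in the talk: `std::shared_timed_mutex` is swapped in as a portable reader-writer lock, and `SharedSetting` and `readers_seeing_value` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical rarely-changed setting that many threads read.
class SharedSetting {
    mutable std::shared_timed_mutex m;
    int value = 0;
public:
    int read() const {
        // Shared ownership: concurrent readers do not block each other.
        std::shared_lock<std::shared_timed_mutex> lock(m);
        return value;
    }
    void write(int v) {
        // Exclusive ownership: waits until no reader or writer holds the lock.
        std::unique_lock<std::shared_timed_mutex> lock(m);
        value = v;
    }
};

// Writes one value, then lets several threads read it concurrently.
// Returns the number of readers that observed the written value.
int readers_seeing_value(int readers, int expected) {
    SharedSetting s;
    s.write(expected);
    std::vector<int> seen(readers, 0);
    std::vector<std::thread> pool;
    for (int r = 0; r < readers; ++r)
        pool.emplace_back([&s, &seen, r] { seen[r] = s.read(); });
    for (auto& t : pool) t.join();
    int matches = 0;
    for (int v : seen)
        if (v == expected) ++matches;
    return matches;
}
```

The benefit depends on the read-to-write ratio: with frequent writers, a reader-writer lock behaves much like an ordinary mutex while costing more per acquisition.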

Microsoft Flight Simulator* Case Study
Multi-Threading Goal
– Separate terrain processing from rendering
  Loading the game once at the beginning
  The engine keeps loading content in the background while playing
  The main thread runs D3D, physics, etc.
  All other threads load and pre-process the terrain textures and other content
– Load and process textures without slowing down the frame rate
  Expected to scale, processing more content as more processors become available

Locking Problem
Symptoms and Thread Profiling
– Occasional stuttering
– Doesn't scale well from 2 to 4 cores because of very high contention
[Figure: two Thread Profiler timelines, each showing the Main Thread and a BKG Thread]

Locking Root Cause
Both cases lead to global hash map access.
– Only 1 thread can access the hash map while all other threads are blocked
– The entire hash map was protected by a critical section (probably the worst choice)
Solution
– Protect each bucket in the hash map instead of the whole hash map
  As long as multiple threads are accessing different buckets, they are safe and don't block each other
– Use of a lock-free library
  Microsoft* internal tools
  The concept is to have a single thread write, while multiple threads can read at the same time as long as the data is not being written
  TBB provides a similar locking mechanism
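
The per-bucket locking idea can be sketched as follows. This is not the Flight Simulator code, which the talk does not show; `StripedMap`, its fixed bucket count of 64, and `fill_and_count` are all assumptions made for illustration, with `std::mutex` standing in for a critical section.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size hash map in which every bucket has its own mutex,
// so threads touching different buckets never block each other.
class StripedMap {
    static const std::size_t kBuckets = 64;
    struct Bucket {
        std::mutex m;
        std::list<std::pair<int, int>> items;
    };
    std::array<Bucket, kBuckets> buckets;

    Bucket& bucket_for(int key) {
        return buckets[std::hash<int>()(key) % kBuckets];
    }
public:
    void put(int key, int value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m);  // only this bucket is locked
        for (auto& kv : b.items)
            if (kv.first == key) { kv.second = value; return; }
        b.items.emplace_back(key, value);
    }
    bool get(int key, int& out) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m);
        for (auto& kv : b.items)
            if (kv.first == key) { out = kv.second; return true; }
        return false;
    }
};

// Inserts disjoint key ranges from several threads, then returns how many
// keys can be read back with the expected value.
int fill_and_count(int threads, int keys_per_thread) {
    StripedMap map;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&map, t, keys_per_thread] {
            for (int i = 0; i < keys_per_thread; ++i) {
                int key = t * keys_per_thread + i;
                map.put(key, key * 2);
            }
        });
    for (auto& th : pool) th.join();
    int found = 0, value = 0;
    for (int k = 0; k < threads * keys_per_thread; ++k)
        if (map.get(k, value) && value == k * 2) ++found;
    return found;
}
```

The trade-off is memory and complexity: one mutex per bucket costs more than a single lock, and operations that span buckets (resizing, iteration) need a separate strategy.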

Flight Simulator* Result
Reduced stuttering, lower latency in terrain loading, and better visuals without sacrificing frame rates

Synchronization Primitives in Intel® TBB
Atomic Operations
– High-level abstraction for atomic instructions
– OS/compiler portable
– Supports processors (like Itanium) that have weak memory consistency
Exception-Safe Locks

                   Scalable      Fair          Reentrant  Sleeps
mutex              OS dependent  OS dependent  No         Yes
spin_mutex         No            No            No         No
queuing_mutex      Yes           Yes           No         No
spin_rw_mutex      No            No            No         No
queuing_rw_mutex   Yes           Yes           No         No

Example TBB® Reader-Writer Lock
If an exception occurs within the protected code block, the destructor automatically releases the lock if it was acquired, avoiding a deadlock.
Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it could upgrade.

#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo()
{
    /* Construction of 'lock' acquires 'MyMutex' */
    spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer*/ false);
    …
    if (!lock.upgrade_to_writer()) {
        /* data may have been modified since the last read */
    }
    else {
        /* data was not modified by other thread */
    }
    return 0;
    /* Destructor of 'lock' releases 'MyMutex' */
}

General Recommendations for TBB® Locks
spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions.
Use queuing_rw_mutex when scalability and fairness are important.
Use a reader-writer mutex to allow non-blocking reads for multiple threads.
Please refer to the Intel session on "Comparative Analysis of Game Parallelization" for more details on TBB.

Summary & Call to Action
The use of an inefficient synchronization strategy can have a big impact on the performance of your multi-threaded application: if it doesn't hit you today, it surely will tomorrow.
Try using user-level atomic locks instead of very expensive kernel locks.
Use Intel tools (VTune™ and Thread Profiler™) to help identify potential lock problems.
Use locks properly to avoid high-contention scenarios and make your code more scalable.

Contact Info
For more info, see our Graphics, Game Development and Threading resources at: http://softwarecommunity.intel.com/
Feel free to contact me directly: [email protected]