QuakeTM: Parallelizing a Complex Serial Application
Using Transactional Memory
Vladimir Gajinov (1,2), Ferad Zyulkyarov (1,2), Osman S. Unsal (1), Adrián Cristal (1), Eduard Ayguadé (1,2), Tim Harris (3), Mateo Valero (1,2)
(1) Barcelona Supercomputing Center
(2) Universitat Politècnica de Catalunya
(3) Microsoft Research
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Introduction
• Topic of this work: parallelization of the Quake server.
• What is Quake?
– A first-person shooter game.
– A sequential application.
• Requirements of a sequential game server:
– Close to instantaneous control of player actions.
– High degree of interaction among players in a detailed 3D virtual world.
– CPU processing is the bottleneck.
• Method: OpenMP + Transactional Memory.
Background
• OpenMP:
– API for writing shared-memory parallel programs in C/C++ and Fortran.
– Compiler directives and library routines.
– Fork-join parallelism.
• Transactional Memory (TM):
– Concurrency control mechanism.
– A series of reads and writes to shared memory is handled atomically.
– When successful, the transaction commits; otherwise it aborts.
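The fork-join model and the idea of an atomic block can be sketched in a few lines of C. This is a minimal illustration, not code from QuakeTM: the function name `parallel_sum` is hypothetical, and an OpenMP `critical` section is used as a stand-in for a transaction because standard C has no TM construct.

```c
#include <stdio.h>

/* Hypothetical example: sum 1..n with a fork-join parallel loop.
 * The critical section stands in for a transaction: the read-modify-write
 * of the shared total is performed by one thread at a time. */
long parallel_sum(int n) {
    long total = 0;  /* shared between all threads */

    /* Fork: the loop iterations are split across the thread team. */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        /* Stand-in for an atomic TM block around the shared update. */
        #pragma omp critical
        {
            total += i;
        }
    }
    /* Join: the implicit barrier at the end of the loop. */
    return total;
}
```

Compiled without OpenMP support the pragmas are ignored and the loop runs serially with the same result, which is one reason the directive-based style suits incremental parallelization of a sequential code base.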
Motivation
• Just a few TM applications available:
– STAMP, Haskell STM benchmark, RMS-TM …
– Clear need for more complex applications.
• Contribution: parallelization of a complex sequential application using TM.
• Question: is it possible to achieve fine-grained locking performance with coarse-grained parallelization effort?
• Test TM programmability:
– Start with a coarse-grained approach.
– Test the performance.
– Determine the problems.
– Compare with a fine-grained approach.
Quake Organization
Typical client-server architecture
• Server
– Maintains the consistency of the game world.
– Handles the coordination among clients.
• Clients
– Update graphics.
– Implement user-interface operations.
The Server
• The main server task: computing a new frame.
[Figure: frame execution diagram. Labels: Read (Rx), Process (Request Processing), Physics Update, Reply (Tx), SELECT (Yes/No).]
[Figure: execution breakdown of the sequential server with 8 connected clients. Chart labels: 2.1%, 87.8%, 3.1%.]
• We concentrate on the request processing stage.
Quake Map
• 3D volume in a 3D coordinate space.
• Represented as a binary space partition tree.
• Fine grained and inefficient.
Areanode tree:
– Balanced binary tree.
– Each 3D point in the map must either be in an areanode that is a leaf or in a division plane.
– Areanodes maintain a list of game objects (entities).
[Figure: areanode tree (levels 1 to 5) shown next to a top view of the map.]
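A balanced areanode tree of this kind can be built by recursively halving the map along its longer side. The sketch below is a simplified illustration under stated assumptions, not the server's actual code: the names `areanode_t`, `create_areanode` and `build_map_tree` are hypothetical, the split is done in two dimensions (top view), a depth of 4 is assumed to match the five levels in the figure, and a plain counter stands in for the entity list.

```c
#include <stdlib.h>

#define AREA_DEPTH 4  /* assumed: root at depth 0, leaves at depth 4 (5 levels) */

typedef struct areanode {
    int   axis;                    /* 0 = split on x, 1 = split on y, -1 = leaf */
    float dist;                    /* position of the division plane */
    struct areanode *children[2];  /* both NULL for a leaf */
    int   num_entities;            /* stand-in for the list of game objects */
} areanode_t;

/* Recursively split the box [mins, maxs] in half along its longer side. */
areanode_t *create_areanode(int depth, const float mins[2], const float maxs[2]) {
    areanode_t *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;
    if (depth == AREA_DEPTH) {
        node->axis = -1;           /* leaf: no division plane */
        return node;
    }
    node->axis = (maxs[0] - mins[0] > maxs[1] - mins[1]) ? 0 : 1;
    node->dist = 0.5f * (mins[node->axis] + maxs[node->axis]);

    /* The two halves share the division plane as a boundary. */
    float lo_maxs[2] = { maxs[0], maxs[1] };
    float hi_mins[2] = { mins[0], mins[1] };
    lo_maxs[node->axis] = node->dist;
    hi_mins[node->axis] = node->dist;

    node->children[0] = create_areanode(depth + 1, mins, lo_maxs);
    node->children[1] = create_areanode(depth + 1, hi_mins, maxs);
    return node;
}

/* Build the tree for a map of the given (hypothetical) width and height. */
areanode_t *build_map_tree(float width, float height) {
    const float mins[2] = { 0.0f, 0.0f };
    const float maxs[2] = { width, height };
    return create_areanode(0, mins, maxs);
}

int count_nodes(const areanode_t *n) {
    if (n == NULL)
        return 0;
    return 1 + count_nodes(n->children[0]) + count_nodes(n->children[1]);
}

int count_leaves(const areanode_t *n) {
    if (n == NULL)
        return 0;
    if (n->axis == -1)
        return 1;
    return count_leaves(n->children[0]) + count_leaves(n->children[1]);
}
```

With five levels the tree has 31 nodes, 16 of them leaves. An entity that straddles a division plane cannot be pushed further down, so it is kept at the inner node that owns the plane, which is why inner nodes also carry entity lists.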
Parallelization
• Only the request processing stage is parallelized.
• OpenMP to start parallel execution.
• Transactions for synchronization.
• Coarse-grained approach.
• Comparison with the fine-grained implementation of Atomic Quake [PPoPP2009].
• Application characteristics:
– Coarse-grained: 8 TM blocks, big read & write sets, long transactions, abort rate 35.3%.
– Fine-grained: 58 TM blocks, abort rate 4.1%.
Shared Data
• Three types of shared data structures:
– Areanode tree
– Game objects
– Message buffers
  • Common global state buffer
  • Per-player reply buffers
• Most intensive sharing inside the request processing stage.
Client Requests
Two types of requests:
• Connection related messages
– Associated with the connection or disconnection protocols, used when the client wants to join or leave the server game session, or with other facilities that do not affect gameplay.
• Gameplay messages
– Most important type of requests.
– Model the player's interaction with the game world.
– The most used: MOVE COMMAND.
Pseudocode for the request processing stage
Sequential:

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        // game play packets
        for (i=0 ; i<MAX_CLIENTS ; i++) {
            // Do some checking here
            SV_ExecuteClientMessage ();
        }
    }

Parallel:

    // Phase 1 (serial): receive packets and queue the game play ones
    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        AddPacketToList();
        CopyBuffer();
    }

    // Phase 2 (parallel): one task per queued packet
    #pragma intel omp parallel taskq shared(packetlist, ...)
    {
        while (packetlist != NULL) {
            #pragma intel omp task captureprivate(packetlist)
            {
                NET_Message_Init(..);
                // check for packets from connected clients
                for (i=0, cl=svs.clients ; i<MAX_CLIENTS ; i++,cl++) {
                    // Do some checking here
                    SV_ExecuteClientMessage (cl);
                }
            }
            packetlist = packetlist->next;
        }
    }
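The same one-task-per-packet pattern can be expressed with standard OpenMP tasks instead of the Intel-specific `taskq` extension used on the slide. The sketch below swaps in that technique and is only an illustration: `packet_t`, `execute_client_message` and `process_packet_list` are hypothetical names, and the per-packet work is a trivial placeholder for `SV_ExecuteClientMessage`.

```c
#include <stddef.h>

/* Hypothetical packet list node; the real server queues network buffers. */
typedef struct packet {
    int client_id;
    struct packet *next;
} packet_t;

/* Hypothetical stand-in for SV_ExecuteClientMessage(): returns a work result. */
static int execute_client_message(const packet_t *p) {
    return p->client_id * 2;
}

/* Walk the list on one thread and spawn one task per packet. */
int process_packet_list(packet_t *packetlist) {
    int total = 0;  /* shared result, updated atomically */

    #pragma omp parallel
    {
        /* Only one thread traverses the list; the others execute tasks. */
        #pragma omp single
        {
            for (packet_t *p = packetlist; p != NULL; p = p->next) {
                /* firstprivate(p) plays the role of captureprivate(packetlist):
                 * each task keeps its own copy of the current list pointer. */
                #pragma omp task firstprivate(p) shared(total)
                {
                    int r = execute_client_message(p);
                    #pragma omp atomic
                    total += r;
                }
            }
        }
        /* All tasks are complete at the barrier ending the parallel region. */
    }
    return total;
}
```

Capturing the list pointer per task matters because the traversal keeps advancing while earlier tasks are still running. In the server the body of each task is where the transactions guard the shared game state.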
The Move Command
Execution
• Construct the bounding box.
• Traverse the areanode tree.
• Find objects contained in the bounding box.
• Associate them with the command.
• Simulate the move.
• Remove the player from the old position.
• Add the player to the new position.
Parameters
• Player's origin
• View angles
• Motion indicators
• Time to run
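The first execution steps (build a bounding box around the player, then find the objects it contains) can be sketched with axis-aligned boxes. This is a simplified illustration, not the server's code: `bbox_t`, `move_bounds`, `boxes_overlap` and `find_touched` are hypothetical names, the reach of a move is passed in as an assumed `range` parameter, and a flat array of entity boxes stands in for the areanode tree traversal.

```c
/* Axis-aligned bounding box in 3D. */
typedef struct {
    float mins[3];
    float maxs[3];
} bbox_t;

/* Box covering every position within `range` units of the player's origin. */
bbox_t move_bounds(const float origin[3], float range) {
    bbox_t b;
    for (int i = 0; i < 3; i++) {
        b.mins[i] = origin[i] - range;
        b.maxs[i] = origin[i] + range;
    }
    return b;
}

/* Two boxes overlap unless they are separated along at least one axis. */
int boxes_overlap(const bbox_t *a, const bbox_t *b) {
    for (int i = 0; i < 3; i++) {
        if (a->maxs[i] < b->mins[i] || a->mins[i] > b->maxs[i])
            return 0;
    }
    return 1;
}

/* Store the indices of all entities whose boxes overlap the move box.
 * Returns how many were found; `out` must have room for n indices. */
int find_touched(const bbox_t *move, const bbox_t *ents, int n, int *out) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (boxes_overlap(move, &ents[i]))
            out[count++] = i;
    }
    return count;
}
```

In the real server the candidate set comes from walking the areanode tree, which is a read of shared data; the later re-linking of the player is a write to the same tree, so both ends of a move touch state that other clients' requests may be using.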
Move Command Execution
[Figure: Execute Move call diagram with ClientPhysics, ClientThink, PmoveInit, AddLinksToPmove, PlayerMove, LinkEntity and PlayerTouch, divided into Transactions 1 to 4 (T1 to T4).]
• ClientPhysics: client's physics update.
• ClientThink: executes actions registered in previous frames.
• PmoveInit: pmove (player move) structure initialization.
• AddLinksToPmove: determines which entities could be affected by the current move command.
• PlayerMove: constructs a trajectory line and determines the client's final position.
• LinkEntity: re-links the player's entity to the new position in the areanode tree.
• PlayerTouch: models the influence on the other game objects.
ReachPoints
    int reachpoints[NumThreads][x*16];

    // TM_PURE: the function is called from a transaction without instrumentation
    TM_PURE
    void PointReached(int check) {
        reachpoints[ThreadId][check]++;
    }

    int main () {
        . . .
        TRANSACTION
            PointReached (1);
            statement_1;
            PointReached (2);
        TRANSACTION_END
        . . .
    }

Helps to:
• Identify thread private variables.
• Discover where transactions abort.
• Discover causes for the aborts.
• Discover TM false sharing conflicts (conflict management granularity).
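How the counters localize aborts can be shown with a small simulation. The sketch assumes that counter updates made through a pure function are not undone when a transaction aborts, so every attempt leaves a trace. All names here (`point_reached`, `run_transaction`, `aborts_between`) and the array sizes are hypothetical, and the abort is simulated with a retry loop rather than a real TM runtime.

```c
#define NUM_THREADS 8
#define NUM_POINTS  16  /* assumed row size: 16 ints, one 64-byte cache line */

/* One row of counters per thread, so threads do not share cache lines. */
static int reachpoints[NUM_THREADS][NUM_POINTS];

/* Record that a thread passed a numbered point in the code. */
void point_reached(int thread_id, int check) {
    reachpoints[thread_id][check]++;
}

/* Simulate one transaction that aborts `aborts_before_commit` times
 * between point 1 and point 2 before it finally commits. */
void run_transaction(int thread_id, int aborts_before_commit) {
    for (int attempt = 0; attempt <= aborts_before_commit; attempt++) {
        point_reached(thread_id, 1);  /* every attempt gets this far */
        if (attempt < aborts_before_commit)
            continue;                 /* simulated abort and retry */
        point_reached(thread_id, 2);  /* only the committing attempt */
    }
}

/* Attempts that reached `from` but never reached `to` aborted in between. */
int aborts_between(int thread_id, int from, int to) {
    return reachpoints[thread_id][from] - reachpoints[thread_id][to];
}
```

Placing more points inside a long transaction narrows down the statement range in which the aborts occur, which is useful when the TM runtime offers little profiling support.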
Evaluation
• TraceBot:
– An automatic trace client.
– Behavior is controlled by a finite state machine.
• VideoClient:
– Normal graphical client for proving correctness.
– For trace creation.
• The server runs on one machine, the clients on the other.
– Server: 8 cores (4 x dual-core 64-bit Intel® Xeon™).
• Frame execution time as a performance measure.
• Prototype version 3.0 of the Intel STM C/C++ compiler.
– In-place updates.
– Cache line granularity conflict detection.
– Transactions validate the read set at commit time and, if necessary, during the read operation.
– Function annotations: tm_callable, tm_pure and tm_unknown.
– Closed nesting: flattening.
Results - Normalized average frame execution times (coarse)
[Figure: normalized execution time versus number of clients (1, 2, 4, 8, 16) for serial, global_lock and TM_coarse.]
• The baseline is always the average frame execution time of the sequential server for the respective number of clients.
• TM version overhead: 3.5x to 6x.
• More than 85% of the time is spent in critical sections.
• Overhead is too high.
Results - performance of coarse-grained configurations
[Figure: Comparative performance of parallel configurations. Frame time [ms] versus number of threads (1, 2, 4, 8) for 1, 2, 4, 8 and 16 clients; series: global_lock, TM_coarse.]
[Figure: Transactional server running with 16 clients (speedup & scalability). Speedup of TM_coarse with 2, 4 and 8 threads, and average frame time [ms] versus threads (1, 2, 4, 8) for global_lock and TM_coarse.]
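Transactional statistics – coarse-grained

| Clients | Transactions | Aborts | Abort rate [%] | Access | Mean [KB] | Max [KB] | Total [MB] |
|---|---|---|---|---|---|---|---|
| 1 | 34754 | 0 | 0.0 | Reads | 3.0 | 104 | 105 |
| | | | | Writes | 0.6 | 17 | 20 |
| 2 | 95980 | 1970 | 2.1 | Reads | 2.8 | 863 | 263 |
| | | | | Writes | 0.6 | 164 | 55 |
| 4 | 179241 | 10820 | 6.0 | Reads | 3.4 | 1413 | 570 |
| | | | | Writes | 0.6 | 269 | 108 |
| 8 | 364305 | 76560 | 21.0 | Reads | 4.2 | 1478 | 1207 |
| | | | | Writes | 0.8 | 251 | 216 |
| 16 | 524561 | 184992 | 35.3 | Reads | 5.1 | 1704 | 1725 |
| | | | | Writes | 0.9 | 262 | 296 |

The abort rate is significant.
TM server running with 8 threads.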
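The abort rate column is consistent with dividing the number of aborts by the number of transactions. That definition is inferred from the table values rather than stated on the slide, and the helper name `abort_rate_percent` is hypothetical; a minimal check looks like this:

```c
/* Abort rate in percent, assuming rate = aborts / transactions * 100.
 * Returns 0 when no transactions were recorded. */
double abort_rate_percent(long aborts, long transactions) {
    if (transactions <= 0)
        return 0.0;
    return 100.0 * (double)aborts / (double)transactions;
}
```

For 16 clients, 184992 aborts out of 524561 transactions gives roughly 35.27%, which rounds to the 35.3% reported; for 2 clients, 1970 out of 95980 gives roughly 2.05%, reported as 2.1%.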
The Overhead Breakdown
Multithreaded execution: 8 threads, 16 clients

| TM block | Total [10^9 cycles] | Instrumentation time [10^9 cycles] | Instrumentation time [%] | Abort overhead [10^9 cycles] | Abort overhead [%] | Abort rate [%] |
|---|---|---|---|---|---|---|
| 1 | 13.5 | 10.3 | 75.8 | 3.3 | 24.2 | 19.5 |
| 2 | 9.5 | 9.0 | 94.1 | 0.6 | 5.9 | 18.0 |
| 3 | 17.2 | 15.1 | 87.9 | 2.1 | 12.1 | 52.7 |
| 4 | 11.6 | 10.9 | 94.3 | 0.7 | 5.7 | 22.4 |
| 5 | 5.9 | 3.2 | 53.7 | 2.8 | 46.3 | 61.1 |
| overall | 57.9 | 48.5 | 83.8 | 9.4 | 16.2 | 35.2 |

• The possibilities for profiling are limited.
• The TM instrumentation overhead appears to be the more important cost.
Results - Normalized average frame execution times (fine)
[Figure: normalized execution time versus number of clients (1, 2, 4, 8, 16) for serial, lock_fine and TM_fine.]
• TM version overhead: 2.4x to 3x.
Results - performance of fine-grained configurations
[Figure: Comparative performance of parallel configurations. Frame time [ms] versus number of threads (1, 2, 4, 8) for 1, 2, 4, 8 and 16 clients; series: lock_fine, TM_fine.]
[Figure: Transactional server running with 16 clients (speedup & scalability). Speedup of lock_fine and TM_fine with 2, 4 and 8 threads, and average frame time [ms] versus threads (1, 2, 4, 8) for global_lock, lock_fine, TM_coarse and TM_fine.]
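Transactional statistics – fine-grained

| Clients | Transactions | Aborts | Abort rate [%] | Access | Mean [B] | Max [B] | Total [MB] |
|---|---|---|---|---|---|---|---|
| 1 | 190206 | 0 | 0.0 | Reads | 65.1 | 58511 | 12 |
| | | | | Writes | 5.2 | 20102 | 1 |
| 2 | 367118 | 826 | 0.2 | Reads | 66.0 | 62728 | 25 |
| | | | | Writes | 5.7 | 24397 | 2 |
| 4 | 655020 | 4165 | 0.6 | Reads | 83.7 | 80275 | 55 |
| | | | | Writes | 8.2 | 39726 | 5 |
| 8 | 1439874 | 20593 | 1.4 | Reads | 102.5 | 102470 | 145 |
| | | | | Writes | 9.6 | 57552 | 14 |
| 16 | 3226759 | 131814 | 4.1 | Reads | 133.3 | 231593 | 192 |
| | | | | Writes | 15.5 | 211651 | 22 |

TM server running with 8 threads.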
QuakeTM Characteristics
• 27,600 lines of code.
• 49 files.
• Configurable with macros:
– Synchronization, granularity, nesting, TM implementation.
• Coarse-grained setup:
– 8 critical regions (TM or global lock).
• Fine-grained setup:
– 58 critical regions (TM or fine-grained locks).
• Available at www.bscmsrc.eu
Conclusion
• The transactional overhead is excessive:
– 6x slowdown
– 35.3% abort rate
• A coarse-grained approach is not a good option for the current STM systems.
• Significant programmer time investment (10 man-months).
• A fine-grained approach may be the only solution.
Questions?
Thank you!
Download QuakeTM
www.bscmsrc.eu
Intel Compiler
• Single lock atomicity semantics and weak atomicity guarantees.
– Strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions.
Atomic Quake
• The main objective was to evaluate the effort of replacing locks with transactions.
• The lock parallelization is not block structured, which required code reorganization to adapt to the TM model.
• The second problem was to avoid I/O operations inside transactions, which is not an issue in a lock-based system.
• Finally, a big fraction of the development time was spent understanding how locks are associated with the variables and getting to grips with the locking strategy.
Atomic Quake 2
• Thread private data: call to get_specific.
• The conditional variables: no retry.
• I/O in transactions: tm_pure.
• Proposition for error handling:
– When an error happens, commit the transaction and handle the error outside the atomic block.
• Privatization examples:
– Custom memory manager allocates a block of memory for string operations.
• TM fits for guarding access to different shared data (separate locks).