Using ATLAS for Performance Tuning and Debugging
Sewook Wee and Njuguna Njoroge
Computer Systems Laboratory, Stanford University
http://tcc.stanford.edu/prototypes
Tutorial Set-up
Wireless Router access
• SSID: RAMP-DEMO
• Passphrase: rampramp
Team setup (Extreme Programming)
• One member = driver, will code on his or her laptop
• Rest of team = passengers, will review and help driver
Server connection
• ssh 10.0.0.2
• Username and password are on your desk
Environment variables check
• check $BEE2_BOARD $VACATION $DLL
%> echo $BEE2 $VACATION $DLL
Make sure that your favorite text editor is working properly
VNC viewer
• VNC Viewer executable: http://www.realvnc.com/cgi-bin/download.cgi
• Open up VNC in shared mode
Transactionalizing vacation
vacation – Part of TCC group’s STAMP benchmark suite
• STAMP = Stanford’s Transactional Applications for Multi-Processing
• http://stamp.stanford.edu
• Modeled after SPECjbb2000
About vacation …
• Implements travel reservation system powered by database
• Workload consists of clients interacting with DB manager
• Four tables in DB: cars, rooms, flights, and customers
• The table of customers tracks the reservations and total price
• The tables are implemented as Red-Black trees
Running vacation …
Let’s run vacation in its original form
ACTION!
%> cd $VACATION
%> make run_seq
vacation pseudocode – main
In vacation.c, function main, starting on line 340
initializeManager;
initializeClients;
PROFILER_ON;
client_run;
PROFILER_OFF;
vacation pseudocode – client work
In client.c, function client_run, starting on line 109
for (i = 0; i < numOperation; i++) {
    action = selectAction;
    switch (action) {
        case ACTION_MAKE_RESERVATION: // 1st case
            for (j = 0; j < numQueries; j++)
                switch (query_type) { … }
        case ACTION_BILL_CUSTOMER: // 2nd case
        case ACTION_UPDATE_TABLES: // 3rd case
            for (j = 0; j < numUpdates; j++)
                switch (update_type) { … }
    }
} ...
Verify success of sequential run
Initializing manager... done.
Manager Stats are initialized
Initializing clients... done.
Transactions = 1024
Clients = 1
Transactions/client = 1024
Queries/transaction = 1
Relations = 4096
Query percent = 99
Query range = 4055
Percent user = 80
Running clients... done.
Checking tables... done.
Deallocating memory...
Number of total adds = 24700
Number of total deletes = 56
Number of total queries = 3341
Number of total reservations = 1618
Number of total cancellations = 0
Done.
ACTION!
%> more trace/0/atlas.stdout
Quick overview of TCC API
TM_PARALLEL(function_ptr, arg_ptr, numThreads)
• function_ptr = pointer to parallel function
• arg_ptr = pointer to function’s arguments
• numThreads = number of threads; for TCC, the number of CPUs
TM_BEGIN(), TM_END()
• Indicate start and end of a transaction
TM_GET_THREAD_ID(), TM_GET_NUM_THREAD()
• Retrieve thread’s ID and number of threads
High-level language, OpenTM (resembles OpenMP), is in the works
Transactionalizing vacation – Step 1
OPEN vacation.c, CHANGE line 362 to:
358 MEMORY_INIT();
359 PROFILER_ON();
360
361 /* Run transactions */
362 TM_PARALLEL(client_run, (void*)clients, global_params[PARAM_CLIENTS]);
363
Note: global_params[PARAM_CLIENTS] = Number of Processors
• command-line parsing code sets this value
Transactionalizing vacation – Step 2
OPEN client.c
• ADD #include "tm.h" on line 19
16 #include "client.h"
17 #include "manager.h"
18 #include "reservation.h"
19 #include "tm.h"
20
• ADD int myId = TM_GET_THREAD_ID(); on line 113 and CHANGE line 115 from clients[0] to clients[myId]
112 int i;
113 int myId = TM_GET_THREAD_ID();
114 client_t** clients = (client_t**)(args);
115 client_t* clientPtr = clients[myId];
Transactionalizing vacation – Step 3
Still in client.c, ADD TM_BEGIN(); on line 129
124 for (i = 0; i < numOperation; i++) {
125
126 int r = random_generate(randomPtr) % 100;
127 action_t action = selectAction(r, percentUser);
128
129 TM_BEGIN();
130
131 switch (action) {
Still in client.c, ADD TM_END(); on line 242
239 } /* switch (action) */
240 formattingAndProtocol(&i);
241
242 TM_END();
243
ACTION!
%> make run_par
Profiling File Output Format
While vacation is running, open the sequential run’s stats
ACTION!
%> more trace/0/atlas.log
1-way ATLAS Polling BRAM...
*****************************
Profiling Info from TCC[0/1]
*****************************
TOTAL: 1339369387
PERF: 595385277
BUSY: 592922718
L1_MISS: 2451797
ARBIT: 1703
COMMIT: 9059
SYNC: 0
VIOL: 0
MISC: 0
...
OVFL CYCLE: 592071059
# OVFL: 98
# LRU OVFL: 98
# READ: 246260
# R-MISS: 10793
# WRITE: 108005
# W-MISS: 1866
# Inst.: 271417058
# Trans: 3
# Violation: 0
# ITLBMISS: 595
# DTLBMISS: 4775
# DStorage: 0
# SC: 0
ITLBCYCLE: 546605
DTLBCYCLE: 8177964
DS CYCLE: 0
SC CYCLE: 0
# SYS Inst.: 1554382
# SYS CYCLE: 8724569
# Timeout: 1394
# TimeoutL: 0
Analyzing scalability of vacation
ACTION!
%> less trace/8/atlas.log
Look at reported speedup
• Gets slower when we add more processors!!
• Violations dominate PERF time!
Reading TAPE-violation report
Let’s look at the violation log file
ACTION!
%> less trace/8/viol.log
• Report says … lines 132 to 135 of manager.c should be the largest offenders
In manager.c, go to the above lines and examine the code
• Function increment_stats reads and writes global variables => lots of conflicts between transactions
• Incrementing these stats causes many violations
Read_PC   Object_Addr  Occurence  Loss     Write_Proc  Line
10001500  100830e0     32         1265341  3           ..//vacation/manager.c:134
10001448  100830e0     29         766816   4           ..//vacation/manager.c:134
10001390  100830e0     30         6446858  1           ..//vacation/manager.c:134
10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105
10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105
Fixing violations in vacation
The problem => Violations on global stats variables
The fix => Privatize global variables
Simple privatization scheme
• Make an 8-element array for each stat variable, i.e. int num_adds; => int num_adds[MAX_CPUS];
• Each element is owned by a processor, i.e. num_adds[x] = Processor x’s element
• In the stats printing function, aggregate the array elements into one single variable
Privatization of vacation – Step 1
OPEN manager.c, CHANGE lines 111-115 to:
110 #ifdef manager_stats
111 int num_adds[MAX_CPUS];
112 int num_deletes[MAX_CPUS];
113 int num_queries[MAX_CPUS];
114 int num_reservations[MAX_CPUS];
115 int num_cancels[MAX_CPUS];
116 #endif
Then, CHANGE lines 132-136 to:
130 switch (stat)
131 {
132 case ADDS: num_adds[TM_GET_THREAD_ID()]++; break;
133 case DELETES: num_deletes[TM_GET_THREAD_ID()]++; break;
134 case QUERIES: num_queries[TM_GET_THREAD_ID()]++; break;
135 case RESERVATIONS: num_reservations[TM_GET_THREAD_ID()]++; break;
136 case CANCELS: num_cancels[TM_GET_THREAD_ID()]++; break;
137 default: break;
138 }
Privatization of vacation – Step 2
In function manager_initStats in manager.c, ADD lines 153-155 and CHANGE lines 156-161:
149 void
150 manager_initStats(void)
151 {
152 #ifdef manager_stats
153 int i;
154 for(i = 0; i < MAX_CPUS; i++)
155 {
156 num_adds[i] = 0;
157 num_deletes[i] = 0;
158 num_queries[i] = 0;
159 num_reservations[i] = 0;
160 num_cancels[i] = 0;
161 }
162 #endif
163
164 printf("Manager Stats are initialized\n");
165
166 }
Privatization of vacation – Step 3
In function manager_printStats in manager.c, CHANGE line 177 and lines 192-196:
176 #ifdef manager_stats
177 #if 1
178 int i;
179 int num_adds_t = 0, num_deletes_t = 0, num_queries_t = 0;
180 int num_reservations_t = 0, num_cancels_t = 0;
181 /* aggregate stats */
….
191 printf("\n");
192 printf("Number of total adds = %d\n", num_adds_t);
193 printf("Number of total deletes = %d\n", num_deletes_t);
194 printf("Number of total queries = %d\n", num_queries_t);
195 printf("Number of total reservations = %d\n", num_reservations_t);
196 printf("Number of total cancellations = %d\n", num_cancels_t);
ACTION!
%> make run_par
Summary of Transactional vacation
After ~2 minutes, observe that the 8-processor run is approximately 6 times faster than the uniprocessor configuration
• Note: In OpenTM, privatization and reduction will be automated by flagging variables
    Compiler will insert the privatization and reduction code for us
In this exercise, we demonstrated
• Ease of use of transactional memory
    Intuitive coarse-grain parallelization
    Did not require low-level understanding of code
• Guided performance tuning
    Identifies significant performance bottlenecks
    Without profiler and TAPE, finding such bottlenecks is like “looking for a needle in a haystack”!
Debugging Parallel Code
There are established techniques for debugging sequential code
• Standard debugger (e.g. GDB)
How about parallel code?
• Non-deterministic runtime behavior
• Sometimes you have to understand underlying architecture
How can transactional memory help?
• Atomicity & Isolation: no intrusion from other threads inside the transaction
• Deterministic replay
• Infinite data-watches
Our focus today => deterministic replay & GDB support
Functional Debugging of Transactional Apps
Once app is transactional, most common type of functional bug is atomicity violation
What are atomicity violations?
• In TM, the programmer splits an atomic region of code into two or more transactions
    Intermediate values of shared data in one transaction are prematurely exposed to other transactions
• In fine-grain lock-based programming, much easier to introduce such violations
Challenge: Atomicity violations are non-deterministic and hard to regenerate
Fixing atomicity violations in ATLAS
In this session, you will debug an application with atomicity violations
ATLAS provides framework for deterministic replay
• 1st Step: Run a small application with atomicity violations
• 2nd Step: Deterministically regenerate the buggy execution
• 3rd Step: Add monitor code to identify origins of bugs
• 4th Step: Fix the code!
Example Code: Doubly Linked List
Toy example: Goal is to demonstrate the tool
Global doubly-linked-list queue
• Head and Tail pointers
• Each thread dequeues an item from the Head pointer and enqueues it back at the Tail
• Threads use dequeue and enqueue functions, which are individually synchronized using transactions
Programmer’s intention:
• The order of items in the queue remains the same
• As if a single thread repeated dequeue and enqueue
High-level Pseudo code
run:
    for i = 0:NUM_ITERS
        item = atomic_dequeue();
        atomic_enqueue(item);
    end

atomic_dequeue:
    TM_BEGIN()
    item = dequeue_from_Head
    TM_END()
    return item

atomic_enqueue (item):
    TM_BEGIN()
    enqueue_to_Tail(item)
    TM_END()

[Figure: queue of numbered items between Head and Tail, accessed by Thread A and Thread B]
Execution Step 1: Test Drive
ACTION!
%> cd $DLL
%> make run
The result from ATLAS may or may not meet the spec.
Here’s one possible example of an undesired execution result.
Try it several times. You will see different results.
[Figure: Correct Output vs. Actual Result, showing the queue items in different orders]
Execution Step 2: Replay
Log files
• After the test drive, you will get atlas.stdout and commit.out
• atlas.stdout: Standard output from the application
• commit.out: transaction order
Now we will replay the previous run with commit.out
ACTION!
%> make replay LOGFILE=commit.out
Unlike the previous step, you see the same behavior every time
TIP
%> make replay LOGFILE=commit.correct
%> make replay LOGFILE=commit.error
Execution Step 3: Finding the bug
Hypothesis: Enqueuing order may be different from the dequeuing order.
For example, instead of
    Thread A dequeue X
    Thread A enqueue X
    Thread B dequeue Y
    Thread B enqueue Y
the transactions may commit as
    Thread A dequeue X
    Thread B dequeue Y
    Thread B enqueue Y
    Thread A enqueue X
How to test hypothesis?
• Let’s make it simple: add printf
Execution Step 3: Finding the bug
Let’s add monitoring code
• printf in the transaction will not affect the transaction order
• Therefore, you will get exactly the same behavior
EDIT dll.c
96 Head = next;
97 printf("Dequeue(%d)\n", item->id);
98
...
108
109 printf("\t\tEnqueue(%d)\n", item->id);
110 item->prev = NULL;
ACTION!
%> make replay LOGFILE=commit.out
You see that your hypothesis is right.
Execution Step 4: Fix it
Make dequeue/enqueue one atomic block
• Untransactionalize dequeue/enqueue
• Transactionalize dequeue/enqueue pair
EDIT dll.c
84 //TM_BEGIN();
...
99 //TM_END();
...
107 //TM_BEGIN();
...
122 //TM_END();

73 for (i = 0; i < NUM_ITERS; i+= TM_GET_NUM...
74
75 TM_BEGIN();
76
77 item = dequeue();
78 enqueue(item);
79
80 TM_END();
81
82 }
ACTION!
%> make run
Replay on Local Machine
Runs sequentially following LOGFILE
Faster execution
GDB support already exists
Does not support machine-specific code
ACTION!
%> make replay_local LOGFILE=commit.out
GDB and Replay
ACTION!
%> echo "commit_in commit.error" > config.tcc
%> gdb --args ./dll_local 8
About to dequeue
(gdb) break dll.c:88
(gdb) condition 1 Head->id==3
(gdb) run
(gdb) p Head->id
(gdb) p Tail->id
(gdb) p myid
About to enqueue
(gdb) break dll.c:110
(gdb) condition 2 myid==3
(gdb) continue
(gdb) p myid
(gdb) p Tail->id
[Slide callouts: 1, 4]
Debugging Conclusion Slide
Deterministic replay
• Provides regeneration of buggy scenario
• Allows embedding monitoring code without contaminating the buggy scenario
All Transactions All The Time concept helps parallel code debugging
• Easier deterministic replay
• Easier GDB support