Scalable Dynamic Formal Verification and Correctness Checking of MPI Applications
Ganesh Gopalakrishnan1, Matthias Müller2, Bronis R. de Supinski3, Tobias Hilbrich2, Anh Vo1, Alan Humphrey1, and Christopher Derrick1
University of Utah1
Technische Universität Dresden2
Lawrence Livermore National Laboratory3
Organization
• Overview of Erroneous Programming in MPI
• MPI Runtime Error Detection with Marmot
• ISP: A Runtime Checker Emphasizing Non-determinism
• GEM: Graphical Explorer of MPI Programs
• Improved Scalability Through Umpire
BREAK
• Verification at Large Scale: DAMPI
• Scalable MPI Error Detection with MUST
• MPI Runtime Error Detection in Hybrid OpenMP/MPI Applications
• Multiple Concurrency Models
• Concluding Remarks and LiveDVD Distribution
An Overview of Erroneous Programming in MPI
MPI was designed to support performance
• Complex standard with many operations
  – Includes non-blocking and collective operations
  – Can specify messaging choices precisely
  – The library is not required to detect non-compliant usage
• Many erroneous or unsafe actions
  – Incorrect arguments
  – Resource errors
  – Buffer usage
  – Type matching errors
  – Deadlock
• Includes the concept of “unsafe” sends
Incorrect Arguments
• Incorrect arguments manifest during:
  – Compilation (type mismatch)
  – Runtime (crash in MPI or unexpected behavior)
  – Porting (only manifests for some MPIs/systems)
• Example (C): MPI_Send (buf, count, MPI_INTEGER, …); here MPI_INTEGER is the Fortran integer datatype; C code should use MPI_INT (sketched below)
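A minimal, self-contained version of this error might look as follows (a sketch; the destination rank, tag, and the matching receive on rank 1 are our assumptions):

  /* Sketch of the incorrect-argument example: MPI_INTEGER is the Fortran
   * integer datatype; C code should pass MPI_INT for an int buffer.
   * Some MPI libraries accept this silently, others fail at runtime. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int buf[4] = {1, 2, 3, 4};
      MPI_Init(&argc, &argv);
      /* Erroneous: Fortran datatype used from C (assumes a rank 1 exists
       * that posts a matching receive) */
      MPI_Send(buf, 4, MPI_INTEGER, 1, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }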
Resource Tracking Errors
• Many MPI features require resource allocations
  – Communicators
  – Data types
  – Requests
  – Groups, error handlers, reduction operations
• Simple “MPI_Op leak” example (see the sketch below):
  MPI_Op_create (..., &op);
  MPI_Finalize ();
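A minimal sketch of this leak (the user function my_sum and its use are our assumptions):

  /* Sketch of the "MPI_Op leak": the user-defined reduction operation is
   * created but never freed. The fix is the commented-out MPI_Op_free. */
  #include <mpi.h>

  void my_sum(void *in, void *inout, int *len, MPI_Datatype *type) {
      for (int i = 0; i < *len; i++)
          ((int *)inout)[i] += ((int *)in)[i];
  }

  int main(int argc, char **argv) {
      MPI_Op op;
      MPI_Init(&argc, &argv);
      MPI_Op_create(my_sum, /*commute*/ 1, &op);
      /* ... use op in MPI_Reduce / MPI_Allreduce ... */
      /* MPI_Op_free(&op);  <-- missing: 'op' leaks */
      MPI_Finalize();
      return 0;
  }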
Dropped and Lost Requests
• Two resource errors with message requests
  – Leaked by the creator (i.e., never completed)
  – Never matched by src/dest (dropped request)
• Simple “lost request” example (see the sketch below):
  MPI_Irecv (..., &req);
  MPI_Irecv (..., &req);
  MPI_Wait (&req, …);
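A minimal sketch of the lost request (buffers, source rank, and tags are our assumptions):

  /* Sketch of a "lost request": the second MPI_Irecv overwrites 'req',
   * so the first request can never be completed or cancelled. */
  #include <mpi.h>

  void lost_request_example(int src) {
      int a, b;
      MPI_Request req;
      MPI_Irecv(&a, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &req);
      MPI_Irecv(&b, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &req); /* overwrites req */
      MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completes only the second receive */
  }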
Buffer Usage Errors
• Buffers passed to MPI_Isend, MPI_Irecv, …
  – Must not be written to until MPI_Wait is called
  – Must not be read for non-blocking receive calls
• Example (see the sketch below):
  MPI_Irecv (buf, ..., &request);
  read(buf[i]);
  MPI_Wait (&request, ...);
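A minimal sketch of this illegal buffer access (peer rank, tag, and buffer size are our assumptions):

  /* Sketch of a buffer usage error: 'buf' belongs to the MPI library
   * between MPI_Irecv and MPI_Wait, so reading it before the wait races
   * with the incoming message. */
  #include <stdio.h>
  #include <mpi.h>

  void buffer_usage_error(int src) {
      int buf[8];
      MPI_Request request;
      MPI_Irecv(buf, 8, MPI_INT, src, 0, MPI_COMM_WORLD, &request);
      printf("%d\n", buf[0]);            /* erroneous read before completion */
      MPI_Wait(&request, MPI_STATUS_IGNORE);
  }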
MPI Type Matching
• Three kinds of MPI type matching
  – Send buffer type and MPI send data type
  – MPI send type and MPI receive type
  – MPI receive type and receive buffer type
• Similar requirements for collective operations
• Buffer type <=> MPI type matching
  – Requires compiler support
  – MPI_BOTTOM, MPI_LB & MPI_UB complicate it
  – Not provided by our tools
Basic MPI Type Matching Example
• The MPI standard provides support for heterogeneity
  – Endian-ness
  – Data formats
  – Limitations
• Simple example code:
  Task 0: MPI_Send(1, MPI_INT)    Task 1: MPI_Recv(8, MPI_BYTE)
• Do the types match?
  – Buffer type <=> MPI type: Yes
  – MPI send type <=> MPI receive type? NO! A common misconception: MPI_BYTE matches only MPI_BYTE
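A minimal sketch of this mismatch (the tag and the assumption of exactly two ranks are ours):

  /* Sketch of the untyped-receive misconception: MPI_BYTE matches only
   * MPI_BYTE, so receiving an MPI_INT message as bytes is erroneous even
   * when the byte count would cover the data. */
  #include <mpi.h>

  void type_mismatch_example(int rank) {
      int value = 42;
      char bytes[8];
      if (rank == 0)
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(bytes, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }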
Derived MPI Type Matching Example
• Consider MPI derived types corresponding to:
  – T1: struct {double, char}
  – T2: struct {double, char, double}
• Do these types match?
  Example 1: Task 0: MPI_Send(1, T1)    Task 1: MPI_Recv(1, T2)
    Yes: MPI supports partial receives (allows efficient algorithms); double <=> double, char <=> char
  Example 2: Task 0: MPI_Send(1, T2)    Task 1: MPI_Recv(2, T1)
    Yes: double <=> double, char <=> char, double <=> double
  Example 3: Task 0: MPI_Send(2, T1); MPI_Send(2, T2)    Task 1: MPI_Recv(2, T2); MPI_Recv(4, T1)
    No! What happens? Nothing good!
Basic MPI Deadlocks
• Unsafe or erroneous MPI programming practices
• Code results depend on:
  – MPI implementation limitations
  – User input parameters
• Classic example code (see the sketch below):
  Task 0: MPI_Send; MPI_Recv    Task 1: MPI_Send; MPI_Recv
• Assume the application uses “thread funneled” mode
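A minimal sketch of the classic head-to-head send (the message size N is our assumption; the deadlock appears once N exceeds the implementation's eager/buffering threshold):

  /* Sketch of the "unsafe send" deadlock: both ranks block in MPI_Send
   * once the library stops buffering, so neither reaches MPI_Recv. */
  #include <mpi.h>
  #define N (1 << 20)

  int main(int argc, char **argv) {
      static int sbuf[N], rbuf[N];
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      int peer = 1 - rank;  /* assumes exactly 2 ranks */
      MPI_Send(sbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);  /* both may block here */
      MPI_Recv(rbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Finalize();
      return 0;
  }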
Deadlocks with MPI Collectives
• Erroneous MPI programming practice
• Simple example code (see the sketch below):
  Tasks 0, 1, & 2: MPI_Bcast; MPI_Barrier    Task 3: MPI_Barrier; MPI_Bcast
• Possible code results:
  – Deadlock
  – Correct message matching
  – Incorrect message matching
  – Mysterious error messages
• Each collective “waits for” every task in the communicator
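A minimal sketch of this collective mismatch (assuming exactly 4 ranks; root and buffer are our choices):

  /* Sketch of mismatched collective order: ranks 0-2 enter MPI_Bcast
   * while rank 3 enters MPI_Barrier first. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, x = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank != 3) {
          MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
          MPI_Barrier(MPI_COMM_WORLD);
      } else {
          MPI_Barrier(MPI_COMM_WORLD);  /* wrong order on rank 3 */
          MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }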
Consider Dependency Types in MPI Programs
• Example wait-for graph (figure): process 1 blocks in MPI_Recv(from:3), process 2 in MPI_Barrier, process 3 in MPI_Recv(ANY_SOURCE)
  – A barrier waits for all processes (AND); a wildcard receive waits for any process (OR)
• Simple cycle detection is only sufficient for the AND case
• The more general AND-OR model is suitable, but:
  – Visualization of deadlocks is unsatisfactory
  – More general than needed: each MPI call uses either AND or OR, never both
• We developed a model specifically designed for MPI
The Either AND or OR Model
• Umpire uses an enhanced wait-for graph (WFG) with:
  – AND-semantic arcs (drawn solid): waits for all
  – OR-semantic arcs (drawn dashed): waits for any
  – Each node uses only one arc type; each task executes the waited-for calls
• Deadlock criterion? A knot: a non-empty set of nodes N where, for all nodes x in N, descendants(x) equals N
  – A cycle is not sufficient: a best-case reduction of the example removes the cycle, so no deadlock exists despite it
  – A knot is not necessary: an example with OR arcs deadlocks although no knot is present
A Necessary and Sufficient Deadlock Condition: The OR-Knot
• The OR-Knot is a relaxed knot:
  – A set of nodes N where each node can reach all nodes in N
  – Nodes may also reach further nodes
  – But: there must not be an AND arc from a node in N to a node not in N
• Examples (figure): an OR-Knot (in red); a graph that is still an OR-Knot; a graph with no OR-Knot
Signal Reduction Detection for the Either AND or OR Model
• Uses best-case reduction of wait-for conditions
• Sinks (fan-out = 0) can satisfy wait-for conditions
• Two reduction types (sketched in code below):
  – AND: removes one incoming arc of a sink
  – OR: removes all outgoing arcs of a node connected to a sink
• Example (figure): an OR reduction followed by an AND reduction
• Deadlock if the resulting WFG has a non-empty arc set
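A sketch of this reduction under our own data-layout assumptions (an adjacency-matrix WFG where arc[i][j] means "i waits for j", plus a per-node AND/OR type):

  /* Best-case reduction: sinks satisfy wait-for conditions of their
   * predecessors; deadlock iff arcs remain when no reduction applies. */
  #include <stdbool.h>

  #define MAXN 64
  typedef enum { AND_NODE, OR_NODE } NodeType;

  bool deadlocked(int n, bool arc[MAXN][MAXN], NodeType type[MAXN]) {
      bool changed = true;
      while (changed) {
          changed = false;
          for (int s = 0; s < n; s++) {
              bool sink = true;                    /* fan-out == 0 ? */
              for (int j = 0; j < n; j++)
                  if (arc[s][j]) sink = false;
              if (!sink) continue;
              for (int i = 0; i < n; i++) {
                  if (!arc[i][s]) continue;
                  if (type[i] == AND_NODE) {
                      arc[i][s] = false;           /* AND: remove one incoming arc of the sink */
                  } else {
                      for (int j = 0; j < n; j++)
                          arc[i][j] = false;       /* OR: one satisfied arc releases node i */
                  }
                  changed = true;
              }
          }
      }
      for (int i = 0; i < n; i++)                  /* deadlock iff arcs remain */
          for (int j = 0; j < n; j++)
              if (arc[i][j]) return true;
      return false;
  }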
MPI Runtime Error Detection with Marmot
Content
• Motivation & Overview
• Architecture
• Usage
• Example
• Integrations
  – Cube
  – Visual Studio
  – DDT
  – Vampir
Motivation & Overview
Open source: http://www.hlrs.de/organization/av/amt/research/marmot
• Project founded 2003 at HLRS in Germany
  – Now developed by ZIH (TU Dresden) and HLRS
• Goal: enhance MPI usability
  – A consequence of a very lengthy debugging session
• Funded in/by: CrossGrid, Microsoft, VI-HPS, ParMA, H4H
• Design philosophy:
  – C++ library
  – Requires no source modifications
  – MPI-1.2 support + some MPI-2
  – Lots of usability
Architecture
• Process-local checks on the application processes
• Non-local checks on an additional “Debug Server” process (e.g., timeout-based deadlock detection)
Usage
• Use the Marmot compiler wrappers to compile and link:
  – Replace compiler calls by the appropriate wrapper
  – For C/C++: marmotcc or marmotcxx
  – For Fortran: marmotf77 or marmotf90
  – Source code instrumentation is added automatically
• Execution with Marmot requires one additional process
  – Used for the Debug Server
  – Instead of mpirun -np n call mpirun -np n+1
  – Marmot's checks cause overhead
• Environment variables control Marmot's behaviour
Example – Code

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Type_contiguous (2, MPI_INT, &cont2Int);
  /* note: cont2Int is never committed with MPI_Type_commit before use,
     which is the kind of datatype usage error Marmot reports */

  assert (size >= 2);

  if (rank == 0)
      MPI_Send (s_buf, 1, cont2Int, 1, 7 /*Tag*/, MPI_COMM_WORLD);

  if (rank == 1)
      MPI_Recv (r_buf, 1, cont2Int, 0, 7 /*Tag*/, MPI_COMM_WORLD, &status);

  MPI_Type_free (&cont2Int);
  MPI_Finalize ();
Example – Building and Running
• Build:
  -> marmotcc datatype_sc10.c -o my_exe
• Run (2 application processes):
  -> mpirun -np 3 my_exe
• Result (for HTML mode):
  -> Marmot_my_exe.<TIMESTAMP>.html
Example – Result (Environment Settings)
The beginning of the Marmot correctness report lists the environment settings.
Example – MPI Usage Error
The report shows: the correctness error, what the issue is, the details in the MPI standard, where it occurs in the source, and background information.
Integrations
• Marmot is integrated in/with:
  – Cube (data visualization tool)
  – DDT (debugger)
  – MS Visual Studio (IDE)
  – Eclipse (IDE, beta)
  – VampirTrace
• Our integration goal: provide a “check with Marmot” button
Integrations – Cube
• Cube from FZJ (Jülich) provides a convenient overview: what error occurred, where in the source, and on which processes.
Integrations – Visual Studio
• Integration with MS Visual Studio (figure): a launch tool for Marmot and Cube-like error visualization with source highlighting.
Integrations – Visual Studio (contd.)
• The integration with MS Visual Studio includes:
  – A Marmot port to Windows
  – A Visual Studio plug-in with:
    • Error visualizer
    • Launch tool
  – MPI API help (MPI 1.2 and MPI 2.1)
  – An installer for MPICH and MS-MPI
Integrations – Vampir[Trace]
• Marmot errors are shown in the timeline, with details for the selected error.
Integrations – VampirTrace
UniMCI (open source): www.tu-dresden.de/zih/unimci
• The integration uses UniMCI:
  – Universal MPI Correctness Interface
  – Provides MPI correctness checking to other tools
  – Host tool: wants to use correctness checking
  – Guest tool: implements correctness checking
  – Schematic (figure): Host Tool <-> UniMCI <-> Guest Tools 1..N (correctness tools)
• Software installation order: (1) Marmot; (2) UniMCI; (3) VampirTrace
Heat Flow Example
• MPI implementation of 2D heat conduction
• Border exchange (figure): the domain is split across processes (P0, P1, P2, P3, …); each process Pi exchanges its borders with its neighbors Pk, Pj, Pm, and Pn, with a send to and a receive from each neighbor
Live Demonstration
• Usage:
  – Replace the compiler command with the Marmot tool, e.g., mpicc -> marmotcc
  – Run with 1 extra process
• Examples:
  – Datatype example
  – Heat conduction
ISP: A Dynamic MPI Checker Emphasizing Nondeterminism
Description of an Idealized Testing Tool
1. Eliminates redundant tests
   – Example: posting a deterministic send/receive in both orders is wasteful (w.r.t. testing priorities)
2. De-biases from absolute speeds
   – Schedules must not be a victim of the sequential execution speed of individual processes
3. Forces non-determinism coverage
   – Not only determines where non-determinism is, but also forces those cases to get tested
4. Forces non-determinism coverage even around complex operations (e.g., collectives)
   – Testing unbiased by collective semantics
5. Is based on a uniform underlying theory
   – Say, a “happens-before” model
6. Is able to cover the input space
7. Provides an intuitive user interface within popular frameworks
Meeting all these goals is difficult!
• ISP (In-situ Partial Order) is our tool that meets all goals except Goal 6 for a reasonably large subset of MPI 2.0
• Goal 6 (input space coverage) typically requires symbolic analysis of program paths (not supported)
• Goals 1-5 are met using a special verification scheduler
  – The ISP scheduler AUTOMATICALLY replays a given program enough times until the non-determinism space is covered!
• Goal 7 is met by embedding within Eclipse PTP (GEM, an officially released PTP 4.0 component)
Flow of ISP
• The MPI program is compiled into an executable whose processes (Proc1 … Procn) run against an interposition layer that hijacks MPI calls (see the PMPI sketch below)
• The ISP scheduler decides how the calls are sent to the MPI runtime
• The scheduler plays out only the RELEVANT interleavings
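A minimal sketch of the PMPI-based interposition such a layer builds on (illustrative only, not ISP's actual code; notify_scheduler is a hypothetical hook):

  #include <stdio.h>
  #include <mpi.h>

  /* Hypothetical scheduler hook; a real tool would block here until the
   * scheduler permits the call to proceed. */
  static void notify_scheduler(const char *call, int peer) {
      fprintf(stderr, "intercepted %s (peer %d)\n", call, peer);
  }

  /* The wrapper shadows MPI_Send; the real library is reached through
   * the PMPI profiling entry point. */
  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm) {
      notify_scheduler("MPI_Send", dest);
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
  }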
Example Illustrating Goals 1 and 2

  Process P0           Process P1         Process P2
  R(from:*, r1);       Sleep(rand());     Sleep(rand());
  R(from:2, r2);       S(to:0, r1);       S(to:0, r1);
  S(to:2, r3);                            R(from:0, r2);
  R(from:*, r4);                          S(to:0, r3);
  All the Ws…          All the Ws…        All the Ws…

• Cover this case: P0's first wildcard receive matches P1's send (no deadlock)
• … and also this case (no more): it matches P2's send instead (deadlock)
Meeting Goal 3: Determinize to “fire and forget”

  Process P0           Process P1         Process P2
  R(from:*, r1); …     Sleep(rand());     Sleep(rand());
                       S(to:0, r1); …     S(to:0, r1); …

• Before issuing into the MPI runtime, the scheduler rewrites the wildcard receive to a specific source: R(from:1, r1) in one run and R(from:2, r1) in the replay, each issued together with the matching S(to:0, r1)
Goal 4: Forcing ND-coverage around collectives

  P0                  P1     P2
  IS1 (to: P2);       B;     IR(from: *);
  B                          B;
                             IS2 (to: P2)

• Can S2 (to: P2) match R(from: *)? YES!
• ISP handles this situation through:
  – Out-of-order execution
  – Dynamic instruction rewriting
• The scheduler collects the calls of all processes, issues them into the MPI runtime, and forms the match sets. It rewrites IR(from: *) to one specific sender and plays that match; it then re-executes to get back to the same point and rewrites the receive to the other sender to play the remaining match.
Goal 5: Scheduling based on Happens-before
• ISP schedules based on a “happens-before” (HB) model; all these “funny” examples are handled uniformly by ISP
• Even the effects of buffering, and MPI's “eager” buffering, are handled this way
• A simple “auto-send” example (HB graph in the figure):
  P0: IR(from:0, h1); B; IS(to:0, h2); W(h1); W(h2);
• The example used in the scheduler animation that follows:
  Process P0: Isend(1, req); Barrier; Wait(req);
  Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
  Process P2: Barrier; Isend(1, req); Wait(req);
Putting it all together: Study one example
• ISP scheduler actions (animation), for the P0/P1/P2 example above:
  – P0's Isend(1) and P1's Irecv(*) are collected by the scheduler, which answers each process with sendNext until all three processes reach Barrier
  – The scheduler then issues the three Barriers into the MPI runtime
  – P1 proceeds to Recv(2) and Wait; P2 issues its Isend(1); the scheduler rewrites Irecv(*) to Irecv(2) and issues the Isend/Irecv/Wait operations
  – In the interleaving where Irecv(*) is matched with P2's send, no match set remains for the outstanding operations: Deadlock!
ISP Summary
• Tested on many examples
• Tested against five MPI libraries
  – MPICH2, OpenMPI, MVAPICH, Microsoft MPI, IBM MPI
• Versatile
  – Runs on one's laptop
  – Can handle ParMETIS (15K LOC) for 32 procs on a laptop
• Suggested uses
  – Debug using ISP; then perform large-scale debugging
  – “Knows” enough about MPI that it can supplement textbooks
    • Embedding of Pacheco's book examples as projects within GEM
• Future work:
  – Hybrid verification
  – Finding errors in the C-space of behaviors
GEM: Graphical Explorer of MPI Programs (Demo of ISP Integration with Eclipse PTP 4.0)
Live Demonstration
• General features of GEM
• Examples:
  – Examples with deadlocks and leaks
  – Heat conduction
Improved Scalability through Umpire
Umpire: greater scalability through asynchrony
• Similar error detection capability to Marmot
• Asynchronous design that separates error detection from application execution
• Rudimentary error reports
• GUI support based on ToolGear
  – Not currently maintained
• Planned replacement by MUST (later session)
Umpire divides MPI usage information and correctness properties into local and global
• For each MPI task:
  – MPI call data: parameters, program counter
  – Local properties: resource usage, buffer management
• Global/non-local properties: deadlock freedom, message reception, message type matching
Umpire uses the MPI profiling layer to collect MPI call information
• Interposition via the MPI profiling layer (PMPI) sits between the MPI application and the MPI runtime system
• It collects pre- and post-MPI-call information
Outfielder thread checks local MPI properties
• An “outfielder” thread per task receives the MPI call information through shared memory
• The outfielder tracks resource usage and performs all other local asynchronous tasks*
  * Except data types and non-blocking send buffer write errors; also, it does not track MPI communicators
Checking Global MPI Properties
• A manager performs message matching, detects deadlocks, and checks data integrity
• Out-of-band communication mechanisms between the outfielders (one per MPI task) and the manager:
  – Shared memory
  – TCP/IP (originally planned; currently on hold)
  – MPI
Putting the Pieces Together
• One outfielder per MPI task; the outfielders send heartbeats to the manager over MPI
• The manager performs message matching, detects deadlocks, and checks data integrity
• Maximizes asynchrony; assumes MPI_THREAD_MULTIPLE
Umpire Manager Report

  --------------- Umpire Manager Report ---------------
  --- Communicators
  ---   Communicator Leaks
  ---   Redundant Comm Frees
  --- Datatype Errors
  ---   Type Mismatches
  ---   Bad Type Handles
  ---   Redundant Type Commits
  ---   Incompatible Types for MPI_Op
  --- Request Errors
  ---   Unfreed Persistent Init Errors
  ---   Lost Requests
  ---   Dropped Requests
  ---   Bad Handles for Request Frees
  ---   Active Request Frees
  ---   Other Bad Request Handles
  ---   Bad Activations
  ---   Bad Reactivations
  --------------- End of Umpire Manager Report ---------------
Umpire Outfielder Report

  --------------- Umpire Outfielder Report ---------------
  --- Unreleased Derived Types
  --- Redundant Type Frees
  --- Reused Type Handles
  --- Communications with Uncommitted Types
  --- Unreleased Groups
  --- Redundant Group Frees
  --- Reused Group Handles
  --- Unreleased Errhandlers
  --- Redundant Errhandler Frees
  --- Reused Errhandler Handles
  --- Unreleased MPI_Ops
  --- Redundant MPI_Op Frees
  --- Reused MPI_Op Handles
  --- Send Overwrites
  --------------- End of Umpire Outfielder Report ---------------
Umpire detects resource tracking errors
• Tracks most resources in the outfielder
  – Tracks the MPI-assigned handle (follows the PMPI call)
  – A variable may change without a leak
• Requests and communicators are tracked by the manager
  – Needed to perform MPI message matching
    • Used to detect dropped requests
    • Also needed for type matching and deadlock detection
  – Avoids duplicate storage by not tracking them in the outfielder
• Detects lost requests similarly to other resource tracking errors
• Also ensures proper use of persistent requests: Init (Start Complete)* Free (see the sketch below)
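A sketch of the persistent-request lifecycle that this check enforces (the peer, buffer, and iteration count are our assumptions):

  /* Persistent request lifecycle: Init (Start Complete)* Free. Umpire
   * flags deviations such as freeing an active request or never
   * freeing the handle. */
  #include <mpi.h>

  void persistent_lifecycle(int peer, int iterations) {
      int buf[4] = {0};
      MPI_Request req;
      MPI_Send_init(buf, 4, MPI_INT, peer, 0, MPI_COMM_WORLD, &req); /* Init */
      for (int i = 0; i < iterations; i++) {
          MPI_Start(&req);                     /* Start */
          MPI_Wait(&req, MPI_STATUS_IGNORE);   /* Complete */
      }
      MPI_Request_free(&req);                  /* Free */
  }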
MPI Type Matching in Umpire
• Represents types as regular expressions
  – Determined in the user process when committed
  – Factored using a “smallest first” canonical order
    • e.g., {(c2(dc)3)3}, not {(c(cd)3c)3}
    • e.g., {((dc)2dc3)5}, not {((dc)3c2)5}
• Compares “greatest contained” factors
  – In the manager, “on demand”
  – Ignores the outermost count
  – Can match exactly, or partially one-way or both ways
  – Computes and stores a “partial count”
  – “Both” implies a count of at most one for one type
  – Remembers results and combines exact matches
• Compares using send/recv count * outermost count
Umpire MPI Type Mismatch Output

  --- Type Mismatches
  57 type mismatch errors found:
  1 occurrence at 10001938 (MPI_COMM_WORLD rank 1)
  1/4/10001938: 1011 MPI_Bcast pre
    umpi_op_ref_count = 1  buf = 540349096  count = 512  datatype = 2  root = 0  comm = 0
  Secondary ops:
  1 occurrence at 10000830 (MPI_COMM_WORLD rank 0)
  0/4/10000830: 1011 MPI_Bcast pre
    umpi_op_ref_count = 1  buf = 540361272  count = 128  datatype = 8  root = 0  comm = 0
  1 occurrence at 10000894 (MPI_COMM_WORLD rank 0)
  0/5/10000894: 1047 MPI_Gather pre
  Etc.
Umpire has two deadlock detection algorithms
• MPI deadlock queues
  – One per task in the manager
  – Track blocking MPI messaging operations
    • Items are added through transactions
    • Removed when safely matched
• Simple reduction algorithm described earlier (see the ICS 2009 paper)
• Automatically detects deadlocks
  – MPI operations only
  – Wait-for graph
  – Recursive algorithm
  – Invoked when a queue head changes
• Also supports timeouts (not currently used)
Umpire MPI Deadlock Output

  0/0: DEADLOCK DETECTED. Aborting
  MGR DEADLOCK Q HEADS --------------------
  -----TASK 0 -----
  0/2/10000460: 1011 MPI_Bcast pre
    umpi_op_ref_count = 3  buf = 804397376  count = 128  datatype = 8  root = 1  comm = 0
  -----TASK 1 -----
  1/2/1000047c: 1010 MPI_Barrier pre
    umpi_op_ref_count = 4  comm = 0
  MGR DEADLOCK Q HEADS END--------------------
  MGR DEADLOCK Q DUMP --------------------
  MGR DEADLOCK Q 0 -----
  0/2/10000460: 1011 MPI_Bcast pre
    umpi_op_ref_count = 3  buf = 804397376  count = 128  datatype = 8  root = 1  comm = 0
  MGR DEADLOCK Q 1 -----
  1/2/1000047c: 1010 MPI_Barrier pre
    umpi_op_ref_count = 4  comm = 0
  MGR DEADLOCK Q DUMP END--------------------
MPI_ANY_SOURCE Receive Deadlocks
• Complicate deadlock detection significantly
  – Must obtain the actual source from the implementation
  – Timing-dependent deadlocks
• Simple example code:
  Task 0: MPI_Recv(ANY); MPI_Send(1)
  Task 1: MPI_Send(0); MPI_Recv(0)
  Task 2: MPI_Send(0); MPI_Recv(ANY)
• Umpire detects errors that actually occur
Impact on Point-to-point Bandwidth
(Figure: bandwidth in MB/s vs. message size, Base Hera vs. Umpire Hera)

Application Impact: sPPM
(Figure: time in seconds for 16 to 256 tasks, base vs. Umpire, at 4 and 8 tasks per node)
• Computation-bound
• NO checksums
Application Impact: sPPM (contd.)
(Figure: slowdown (Umpire time / base time) for 16 to 256 tasks, Umpire at 4 and 8 tasks per node)
Application Impact: SMG2000
(Figure: slowdown (Umpire time / base time) for 16 to 128 tasks)
• Excessive MPI calls with little overlapping
• NO checksums
Verification at Large Scale: DAMPI
Motivation
• ISP does not scale beyond 100+ processes
• We need a verification tool at large scale:
  – Some MPI programs require a large number of processes to run certain problem sizes
  – Some bugs only become manifest at large scale (buffer overflows, indices out of range)
  – Large-scale programs stress the MPI implementation and will expose MPI implementation bugs
Rethinking ISP for Large Scale
• Why ISP did not do well at large scale:
  – Centralized scheduler with tightly integrated checks
  – Not using each process's own cycles
    • This is essential at large scale
• How about:
  – Using all processes' cycles to accomplish the ISP scheduler's work
  – Giving users more flexibility:
    • Modularize all error checks
    • Allow users to focus coverage on specific code regions
DAMPI at a glance
• DAMPI: Distributed Analyzer for MPI
• The MPI program's executable (Proc1 … Procn) runs on top of the DAMPI PnMPI modules and the MPI runtime
• The PnMPI modules report alternate matches to a schedule generator, which produces epoch decisions and reruns the program to enforce them
DAMPI modules
• The DAMPI driver sits on top of the DAMPI PnMPI modules:
  – Core module
  – Piggyback module
  – Optional error checking modules: status, request, communicator, type, and deadlock modules
Discovering potential non-deterministic matches
• Recall the key concept (from ISP): a send and a receive can match if they are co-enabled
• Realizing happens-before (HB) edges is the key
  – Easy to do with centralized bookkeeping
  – In a distributed setting: more sophisticated, but possible
    • Use logical clocks (Lamport clocks, vector clocks)
A glimpse into the DAMPI algorithm
• Keeps track of non-deterministic (ND) events through Lamport clocks
• Clocks are communicated through piggybacking (see the sketch below)
• The DAMPI driver uses these clocks to determine whether sends and receives have happens-before edges
• Enforces each potential outcome through replays
• Want to know more? Wednesday
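A minimal sketch of Lamport-clock piggybacking via PMPI wrappers (illustrative, not DAMPI's actual mechanism: here the clock travels as a separate message on a hypothetical reserved tag; real implementations pack it with the user data):

  #include <mpi.h>

  static int lamport_clock = 0;
  #define PB_TAG 999  /* hypothetical reserved piggyback tag */

  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm) {
      lamport_clock++;                              /* tick on send */
      PMPI_Send(&lamport_clock, 1, MPI_INT, dest, PB_TAG, comm);
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
  }

  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int src,
               int tag, MPI_Comm comm, MPI_Status *status) {
      MPI_Status local;            /* needed to learn the actual source */
      int rc = PMPI_Recv(buf, count, datatype, src, tag, comm, &local);
      int incoming;
      PMPI_Recv(&incoming, 1, MPI_INT, local.MPI_SOURCE, PB_TAG, comm,
                MPI_STATUS_IGNORE);
      if (incoming > lamport_clock)                 /* merge clocks, */
          lamport_clock = incoming;
      lamport_clock++;                              /* then tick on receive */
      if (status != MPI_STATUS_IGNORE) *status = local;
      return rc;
  }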
DAMPI Experimental Results
(Figure: ParMETIS-3.1 (no wildcard), time in seconds vs. number of tasks (4 to 32), ISP vs. DAMPI)
DAMPI Results
(Figure: matrix multiplication with wildcard receives, time in seconds vs. number of interleavings (250 to 1000), ISP vs. DAMPI)
Impact on large applications: SPEC MPI2007 and NAS-PB

  Benchmark      Slowdown  Total R*  Communicator Leak  Request Leak
  ParMETIS-3.1   1.18      0         Yes                No
  104.milc       15        51K       Yes                No
  107.leslie3d   1.14      0         No                 No
  113.GemsFDTD   1.13      0         Yes                No
  126.lammps     1.88      0         No                 No
  130.socorro    1.25      0         No                 No
  137.lu         1.04      732       Yes                No
  BT             1.28      0         Yes                No
  CG             1.09      0         No                 No
  DT             1.01      0         No                 No
  EP             1.02      0         No                 No
  FT             1.01      0         Yes                No
  IS             1.09      0         No                 No
  LU             2.22      1K        No                 No
  MG             1.15      0         No                 No
Scalable MPI Runtime Error Detection with MUST
Content
• Correctness Checking at Scale
• MUST
  – Overview
  – Design
  – Scalable correctness checks
  – Required tool infrastructure
  – Software layers
• GTI
  – Overview
  – Communication system
  – Generation and instantiation
• Summary
Correctness Checking at Scale
• Marmot:
  – Usability tools (compiler wrappers, …)
  – Many integrations
• However, at scale:
  – The Debug Server is a bottleneck
  – (Processes notify the server, blocking, before executing the actual MPI call)
• MPI correctness checking needs to scale, e.g., a case from Sabine Roller and Harald Klimach (GRS-Sim, Aachen): the smallest runnable job configuration providing usable results needs 1680 cores and at least 4 days
Correctness Checking at Scale (contd.)
• Marmot performance test with SPEC MPI2007 (figure)
  – Combination with VampirTrace, only local checks
MUST – Overview
• MUST (Marmot Umpire Scalable Tool) = Umpire + Marmot + PnMPI
  – Local checks from Marmot
  – Non-local checks from Umpire (e.g., deadlock detection)
  – PnMPI as infrastructure
  – Ongoing project, still in development
• Goals:
  – Combine the checks into one tool
  – Overcome scalability limitations
  – Maintainable and extendable checks (e.g., MPI-3)
MUST – Overview: A Dynamic Tool Infrastructure (PnMPI)
• Transparent layering of MPI tools (figure: Application -> PMPI Tool 1 -> PMPI Tool 2 -> MPI Library)
  – Binary rewrite of PMPI tools into modules
  – Configuration at application startup
  – The PnMPI core loads modules into stacks
• Optional: tool modules can register with the core
  – Share/request services
MUST – Design
• Uses PnMPI and fine-grained modules
• Each correctness check is a module:
  – Needs specified input data
  – Can run anywhere
  – May use other modules for collaboration
• Checks run on a “place”:
  – An application thread, or
  – An extra thread/process
• Places are connected with a communication network (TBON)
MUST – Required Tool Infrastructure (1/2)
• Example configuration for 8 application processes (figure: a tree network over processes 0-7):
  – Layout for places
  – Communication system
  – Distribution of correctness checks
MUST – Required Tool Infrastructure (2/2)
• MUST needs an infrastructure that provides:
  – Generation of MPI wrappers
  – Spawning of extra tool threads/processes
  – Records to communicate MPI trace data
  – A (flexible + scalable) communication system
  – Forwarding of trace data to the checks
  – Handling of application crashes
• Existing approaches:
  – MRNet is close, but: no wrappers, no records, no data forwarding to checks, (no crash handling)
MUST – Marmot & Umpire
• MUST re-uses checks:
  – Local checks from Marmot: invalid arguments, resource errors, buffer errors, call order errors, …
  – Non-local checks from Umpire: deadlock detection, type matching, collective validation, … (the Umpire checks are currently centralized)
MUST – Scalable Correctness Checks
• MUST uses a reduction network
  – Well suited to verify collective calls, e.g., MPI_Bcast(buf, count, type, root, comm):
    • root must match on all tasks
    • The signature spawned by (count, type) must match on all tasks
  – (Figure: places p1-p3 forward their (root, count, type) tuples to a place q, which reduces them to CORRECT or INCORRECT)
MUST – Scalable Correctness Checks (contd.)
• A reduction network is less suited for message matching:
  – N tasks: N^2 possible matches exist (in fact N + ((N-1)*N)/2)
  – With M places, each receiving from N/M tasks, each place can detect (N/M)^2 matches
  – Total matches detected: M*(N/M)^2 = N^2/M
  – E.g.: M = 100, N = 1000 => the first tool layer only detects 1% of the matches
• Solution: use a layer with inter-communication for the matching
MUST – Software Layers
• MUST uses a 3-layer software stack:
  – MUST: the checks
  – Generic Tool Infrastructure (GTI): trace records, communication, places, …
  – PnMPI: module infrastructure, basic modules
GTI – Overview
• GTI (Generic Tool Infrastructure) consists of:
  – Interfaces for modules with different tasks
  – Implementations for these interfaces
  – A complex generator for wrapper generation, trace record generation, and instantiation
• MPI and MUST agnostic:
  – Basically an infrastructure for executing analyses in a parallel environment
  – For MUST: “analysis” = “correctness check”
  – The API being used is an input for the GTI generation; e.g., for MUST this is an MPI description
GTI – Communication System (1/4)
• An instance of the tool has multiple layers
• Pairs of layers may be connected (no cycles)
• E.g. (figure): Layer 1 -> Layer 2 -> Layer 3
GTI – Communication System (2/4)
• Each layer may contain multiple places
• For MPI: the first layer would contain all MPI tasks
• E.g. (figure): Layer 1 holds tasks 0-7, Layer 2 holds places a-d, Layer 3 holds one place
GTI – Communication System (3/4)
• A connection between layers i and j means that each process in i is connected to exactly one process of j
• E.g. (figure): a reduction network from tasks 0-7 through places a-d to one root place
GTI – Communication System (4/4)
• Each pair of connected processes uses a strategy and a protocol for its communication
  – Strategy: decides when to transfer data, e.g., aggregation of records
  – Protocol: decides how to transfer data, e.g., MPI, TCP, …
GTI – Generation and Instantiation
• Each place has one ingoing strategy
  – Which may receive data from multiple processes
  – Except the application layer -> no ingoing strategy
• Tasks:
  – Receive trace data
  – Forward it to a processing and forwarding interface
• Generated wrapper: intercept, call checks, create records, forward records
• Generated receival/forward module: unpack the serialized record, call checks, forward
GTI – Generation and Instantiation (contd.)
• Modules for wrappers, records, and data forwarding need to be generated to instantiate MUST
• The following things need to be specified:
  – What calls exist and what data they provide
  – What checks exist and what data they require
  – What tool layers are used, how they are connected, and what checks they run
  – What communication modules should be used
• This is specified in XML files
• The “System Builder” (part of GTI) processes these
GTI – Generation and Instantiation: From Specifications to an Instance
• Four kinds of specifications feed the generation:
  – GTI specification: what communication modules and what types of places are available
  – Analysis specifications: what checks exist; what are their collaborations, inputs, …
  – Layout specification (optionally produced with a layout GUI): how many layers there are, how the layers are connected, and what checks (analyses) run on each layer
  – API specifications: what calls can be wrapped, what their arguments are, and what analyses use the arguments
• The System Builder's central component, the “Weaver”, processes and relates these specifications; it generates no code itself
• The Weaver emits generation inputs for wrappers and receival/forwarding: XML files, one for each layer
• The Wrapper Generator and the Receival/Forward Generator process these input XMLs and create code, including the trace records (details later)
• Together with intermediate modules and the module library (check, communication, and place modules), the generated wrapper and receival/forward modules form a PnMPI configuration and an executable tool instance
Summary
• MUST: scalable MPI correctness checking
• Checks are modules and can run anywhere
  – PnMPI as base infrastructure
• Based on the Generic Tool Infrastructure
  – Very flexible (MPI agnostic)
  – Uses reduction networks
  – Instantiation uses generated code
  – Generation works on XML descriptions
MPI Runtime Error Detection in Hybrid OpenMP/MPI Applications
Overview
• OpenMP
  – Parallel threads, shared memory
• MPI
  – Parallel processes
  – No shared memory; messages are used for exchange
• (Figure: threads within a process share memory; processes exchange messages via MPI)
Overview – Hybrid OpenMP/MPI
• (Figure: processes 0..m, each with its own memory and threads 0..n, connected through the MPI interface)
• Communication:
  – Between threads of one process via shared memory
  – Between threads of different processes via MPI
Overview – MPI Thread Support
• The MPI-2 standard defines levels of thread support:
  – MPI_THREAD_SINGLE: there is only one thread
  – MPI_THREAD_FUNNELED: only the main thread performs MPI calls
  – MPI_THREAD_SERIALIZED: only one thread is in MPI at a time
  – MPI_THREAD_MULTIPLE: threads may call MPI simultaneously
• MPI_Init_thread is used to request a certain level
• Further restrictions in the MPI standard
  – E.g.: a communicator, file handle, or window must not be used in multiple collective calls simultaneously
Marmot Support for Hybrid OpenMP/MPI
• Three steps:
  (1) Marmot extensions to operate in hybrid mode
      – Synchronization
  (2) Marmot checks for hybrid OpenMP/MPI errors
      – Usage errors of MPI that result from OpenMP usage
      – Detect errors that appear in a run with Marmot
  (3) Advanced checks that detect errors in alternative execution orders
      – Uses Intel Thread Checker
Step 1: Synchronization
• (Figure: call path Application -> Wrapper -> Synchronization -> Core -> MPI: MPI_X, enterMARMOT, checkAndExecute, pre-call checks, enterPMPI, PMPI_X, leavePMPI, post-call checks, leaveMARMOT; the MARMOT core sections are protected)
Step 2: Marmot Checks (Example)
• “Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.” [MPI-1, p. 130, lines 37-41]
• Implementation pseudo-code:
  Pre-execution code for all collectives:
    if check_is_comm_in_use(comm) == TRUE then print_error()
    register_comm_as_used(comm)
  Post-execution code for all collectives:
    unregister_comm(comm)
Example – Code

  // init MPI
  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  if ((rank == 0) && (provided != MPI_THREAD_MULTIPLE))
      printf ("WARNING MPI_THREAD_MULTIPLE not supported\n");

  // set num threads
  omp_set_num_threads (4);

  #pragma omp parallel
  {
      MPI_Barrier (MPI_COMM_WORLD); // this is erroneous!
  }
Step 3: Advanced Checks
• Consider:

  #pragma omp parallel private(thread)
  {
      #pragma omp sections
      {
          #pragma omp section
          { MPI_Barrier(MPI_COMM_WORLD); }
          #pragma omp section
          { sleep(5); MPI_Barrier(MPI_COMM_WORLD); }
      }
  }

• It is almost certain that Marmot won't detect this error! The sleep makes the two barrier calls execute at different times in nearly every real run, so the potential concurrent use of MPI_COMM_WORLD never manifests for Marmot's runtime check.
Step 3: Advanced Checks (contd.)
• Intel Thread Checker
  – Detects data races, deadlocks, and erroneous parallel execution
  – Uses a simulation approach in order to detect race conditions deterministically
  – Requires binary or source instrumentation
• Race detection is used to enhance Marmot's checks
Step 3: Advanced Checks (contd.)
• Assume Marmot writes to a variable “MyDataRace” when wrapping an MPI call:
  – Intel Thread Checker will detect a race on “MyDataRace” if and only if two MPI calls could be executed in parallel on one process
  – Thus, a race occurs if MPI_THREAD_MULTIPLE is used
  – Thread Checker's simulation detects this for every interleaving
• Marmot uses artificial races for MPI error detection
• Correctness errors are listed in the Thread Checker output
Example – Data Race for Communicator Error
• Wanted: a race that appears if the collective-communicator restriction is violated:
  – Requires two threads calling a collective
  – Requires usage of the same communicator
  => One conflict variable per communicator
• Pseudo-code:
  Pre-execution code for all collectives (before synchronization starts):
    begin_critical()
    index = map_comm_2_index(comm)
    end_critical()
    conflict_variable[index] = 1
Summary
• Marmot supports hybrid OpenMP/MPI
• Detects several MPI usage errors that result from the presence of multiple threads
• Advanced detection:
  – Uses artificial data races
  – Needs a data race detector, e.g., Intel Thread Checker
  – Improves the quality of the checks
Multiple Concurrency Models
Issues Verifying Hybrid Concurrency Models
• Semantics of interaction are unclear
• Some APIs (e.g., OpenMP) don't provide a standard interface for OMP thread/task scheduling control
• One approach we have tried with OMP + MPI: determinize OMP schedules, to allow MPI non-determinism to replay successfully
• Another project: verify MPI + CUDA
• A little tutorial on our CUDA verifier follows
Our CUDA / OpenCL Verification Flow
• A C application containing multiple kernels is analyzed into kernel invocation contexts, kernel descriptions, and CPU/GPU communication code
• The PUG analyzer checks the kernels for races and assertions; the CPU/GPU Communication Verifier (CGV) checks the communication code
• Both produce verification results
PUG's Symbolic Approach
• The analyzer (supported by LLNL's ROSE) turns the C application's kernels into verification conditions, i.e., “constraints”
• A constraint solver (fast logical decision procedures) checks them:
  – UNSAT: the instance is “OK”, i.e., race-free, no mismatched barriers, passes user assertions
  – SAT: the instance has bugs; the SAT instance puts out “bread crumbs” to help debug
Sample results: Bug-free Examples
• +O: required assertions to specify that bit-vector computations don't overflow
• +C: required constraints on the input values
• +R: required manual loop refinement
• B.C.: measures how serious the bank conflicts are
• Time: SMT solving time in seconds to confirm the absence of issues
Sample results: Buggy Examples
• We tested 57 assignment submissions from a recently completed graduate GPU class taught in our department
• Defects: indicates how many kernels are not well parameterized, i.e., work only in certain configurations
• Refinement: measures how many loops need automatic refinement
Real race (GPU class)

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
      d_out[threadIdx.x] = 0;
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
          d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
      }
      __syncthreads();
      assume(blockDim.x <= BLOCKSIZE / 2); // for testing
      if (threadIdx.x % 2 == 0) {
          for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
              d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
          }
      }
      /* The counterexample given by PUG is (TRY HITTING THIS VIA RANDOM TESTING!):
         t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0, that is,
         d_out[threadIdx.x + 8*i] += d_out[threadIdx.x + 8*i + 1];
         d_out[2 + 8*1]  += d_out[10 + 8*0 + 1];
         d_out[10]       += d_out[10]   -- a race!!! */
  }
Real __syncthreads() error

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
      // *d_sum = 0;
      d_out[threadIdx.x] = 0;
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
          d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
      }
      __syncthreads();
      // PUG found this synchronization error:
      if (threadIdx.x % 2 == 0) {
          for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
              d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
              __syncthreads(); // barrier inside a divergent branch
          }
      }
      …
  }
Combined MPI + CUDA / OpenCL verification
• Separately verify kernels using PUG (e.g.)
• Put the kernel codes within MPI and verify using one of the tools described here
• PUG can generate input sets that MPI testing must “hit” when invoking kernels
• Many open issues:
  – Multiple concurrent kernel launches
  – Data sharing between kernels and MPI
• Much more research is needed