Scalable Dynamic Formal Verification and Correctness Checking of MPI Applications
Ganesh Gopalakrishnan1, Matthias Müller2, Bronis R. de Supinski3, Tobias Hilbrich2, Anh Vo1, Alan Humphrey1, and Christopher Derrick1
University of Utah1
Technische Universität Dresden2
Lawrence Livermore National Laboratory3
Organization
• Overview of Erroneous Programming in MPI
• MPI Runtime Error Detection with Marmot
• ISP: A Runtime Checker Emphasizing Non-determinism
• GEM: Graphical Explorer of MPI Programs
• Improved Scalability Through Umpire
BREAK
• Verification at Large Scale: DAMPI
• Scalable MPI Error Detection with MUST
• MPI Runtime Error Detection in Hybrid OpenMP/MPI Applications
• Multiple Concurrency Models
• Concluding Remarks and LiveDVD Distribution
An Overview of Erroneous Programming in MPI
MPI was designed to support performance
• Complex standard with many operations
  – Includes non-blocking and collective operations
  – Can specify messaging choices precisely
  – The library is not required to detect non-compliant usage
• Many erroneous or unsafe actions
  – Incorrect arguments
  – Resource errors
  – Buffer usage
  – Type matching errors
  – Deadlock
• Includes the concept of “unsafe” sends
Incorrect Arguments
• Incorrect arguments manifest during:
  – Compilation (type mismatch)
  – Runtime (crash in MPI or unexpected behavior)
  – Porting (only manifests for some MPIs/systems)
• Example (C): MPI_Send (buf, count, MPI_INTEGER, …); here MPI_INTEGER is the Fortran integer datatype; C code should use MPI_INT (sketched below)
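A minimal, self-contained version of this error might look as follows (a sketch; the destination rank, tag, and the matching receive on rank 1 are our assumptions):

  /* Sketch of the incorrect-argument example: MPI_INTEGER is the Fortran
   * integer datatype; C code should pass MPI_INT for an int buffer.
   * Some MPI libraries accept this silently, others fail at runtime. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int buf[4] = {1, 2, 3, 4};
      MPI_Init(&argc, &argv);
      /* Erroneous: Fortran datatype used from C (assumes a rank 1 exists
       * that posts a matching receive) */
      MPI_Send(buf, 4, MPI_INTEGER, 1, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }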
Resource Tracking Errors
• Many MPI features require resource allocations
  – Communicators
  – Data types
  – Requests
  – Groups, error handlers, reduction operations
• Simple “MPI_Op leak” example (see the sketch below):
  MPI_Op_create (..., &op);
  MPI_Finalize ();
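A minimal sketch of this leak (the user function my_sum and its use are our assumptions):

  /* Sketch of the "MPI_Op leak": the user-defined reduction operation is
   * created but never freed. The fix is the commented-out MPI_Op_free. */
  #include <mpi.h>

  void my_sum(void *in, void *inout, int *len, MPI_Datatype *type) {
      for (int i = 0; i < *len; i++)
          ((int *)inout)[i] += ((int *)in)[i];
  }

  int main(int argc, char **argv) {
      MPI_Op op;
      MPI_Init(&argc, &argv);
      MPI_Op_create(my_sum, /*commute*/ 1, &op);
      /* ... use op in MPI_Reduce / MPI_Allreduce ... */
      /* MPI_Op_free(&op);  <-- missing: 'op' leaks */
      MPI_Finalize();
      return 0;
  }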
Dropped and Lost Requests
• Two resource errors with message requests
  – Leaked by the creator (i.e., never completed)
  – Never matched by src/dest (dropped request)
• Simple “lost request” example (see the sketch below):
  MPI_Irecv (..., &req);
  MPI_Irecv (..., &req);
  MPI_Wait (&req, …);
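A minimal sketch of the lost request (buffers, source rank, and tags are our assumptions):

  /* Sketch of a "lost request": the second MPI_Irecv overwrites 'req',
   * so the first request can never be completed or cancelled. */
  #include <mpi.h>

  void lost_request_example(int src) {
      int a, b;
      MPI_Request req;
      MPI_Irecv(&a, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &req);
      MPI_Irecv(&b, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &req); /* overwrites req */
      MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completes only the second receive */
  }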
Buffer Usage Errors
• Buffers passed to MPI_Isend, MPI_Irecv, …
  – Must not be written to until MPI_Wait is called
  – Must not be read for non-blocking receive calls
• Example (see the sketch below):
  MPI_Irecv (buf, ..., &request);
  read(buf[i]);
  MPI_Wait (&request, ...);
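A minimal sketch of this illegal buffer access (peer rank, tag, and buffer size are our assumptions):

  /* Sketch of a buffer usage error: 'buf' belongs to the MPI library
   * between MPI_Irecv and MPI_Wait, so reading it before the wait races
   * with the incoming message. */
  #include <stdio.h>
  #include <mpi.h>

  void buffer_usage_error(int src) {
      int buf[8];
      MPI_Request request;
      MPI_Irecv(buf, 8, MPI_INT, src, 0, MPI_COMM_WORLD, &request);
      printf("%d\n", buf[0]);            /* erroneous read before completion */
      MPI_Wait(&request, MPI_STATUS_IGNORE);
  }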
MPI Type Matching
• Three kinds of MPI type matching
  – Send buffer type and MPI send data type
  – MPI send type and MPI receive type
  – MPI receive type and receive buffer type
• Similar requirements for collective operations
• Buffer type <=> MPI type matching
  – Requires compiler support
  – MPI_BOTTOM, MPI_LB & MPI_UB complicate it
  – Not provided by our tools
Basic MPI Type Matching Example
• The MPI standard provides support for heterogeneity
  – Endian-ness
  – Data formats
  – Limitations
• Simple example code:
  Task 0: MPI_Send(1, MPI_INT)    Task 1: MPI_Recv(8, MPI_BYTE)
• Do the types match?
  – Buffer type <=> MPI type: Yes
  – MPI send type <=> MPI receive type? NO! A common misconception: MPI_BYTE matches only MPI_BYTE
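A minimal sketch of this mismatch (the tag and the assumption of exactly two ranks are ours):

  /* Sketch of the untyped-receive misconception: MPI_BYTE matches only
   * MPI_BYTE, so receiving an MPI_INT message as bytes is erroneous even
   * when the byte count would cover the data. */
  #include <mpi.h>

  void type_mismatch_example(int rank) {
      int value = 42;
      char bytes[8];
      if (rank == 0)
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(bytes, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }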
Derived MPI Type Matching Example
• Consider MPI derived types corresponding to:
  – T1: struct {double, char}
  – T2: struct {double, char, double}
• Do these types match?
  Example 1: Task 0: MPI_Send(1, T1)    Task 1: MPI_Recv(1, T2)
    Yes: MPI supports partial receives (allows efficient algorithms); double <=> double, char <=> char
  Example 2: Task 0: MPI_Send(1, T2)    Task 1: MPI_Recv(2, T1)
    Yes: double <=> double, char <=> char, double <=> double
  Example 3: Task 0: MPI_Send(2, T1); MPI_Send(2, T2)    Task 1: MPI_Recv(2, T2); MPI_Recv(4, T1)
    No! What happens? Nothing good!
Basic MPI Deadlocks
• Unsafe or erroneous MPI programming practices
• Code results depend on:
  – MPI implementation limitations
  – User input parameters
• Classic example code (see the sketch below):
  Task 0: MPI_Send; MPI_Recv    Task 1: MPI_Send; MPI_Recv
• Assume the application uses “thread funneled” mode
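A minimal sketch of the classic head-to-head send (the message size N is our assumption; the deadlock appears once N exceeds the implementation's eager/buffering threshold):

  /* Sketch of the "unsafe send" deadlock: both ranks block in MPI_Send
   * once the library stops buffering, so neither reaches MPI_Recv. */
  #include <mpi.h>
  #define N (1 << 20)

  int main(int argc, char **argv) {
      static int sbuf[N], rbuf[N];
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      int peer = 1 - rank;  /* assumes exactly 2 ranks */
      MPI_Send(sbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);  /* both may block here */
      MPI_Recv(rbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Finalize();
      return 0;
  }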
Deadlocks with MPI Collectives
• Erroneous MPI programming practice
• Simple example code (see the sketch below):
  Tasks 0, 1, & 2: MPI_Bcast; MPI_Barrier    Task 3: MPI_Barrier; MPI_Bcast
• Possible code results:
  – Deadlock
  – Correct message matching
  – Incorrect message matching
  – Mysterious error messages
• Each collective “waits for” every task in the communicator
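A minimal sketch of this collective mismatch (assuming exactly 4 ranks; root and buffer are our choices):

  /* Sketch of mismatched collective order: ranks 0-2 enter MPI_Bcast
   * while rank 3 enters MPI_Barrier first. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, x = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank != 3) {
          MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
          MPI_Barrier(MPI_COMM_WORLD);
      } else {
          MPI_Barrier(MPI_COMM_WORLD);  /* wrong order on rank 3 */
          MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }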
Consider Dependency Types in MPI Programs
• Example wait-for graph (figure): process 1 blocks in MPI_Recv(from:3), process 2 in MPI_Barrier, process 3 in MPI_Recv(ANY_SOURCE)
  – A barrier waits for all processes (AND); a wildcard receive waits for any process (OR)
• Simple cycle detection is only sufficient for the AND case
• The more general AND-OR model is suitable, but:
  – Visualization of deadlocks is unsatisfactory
  – More general than needed: each MPI call uses either AND or OR, never both
• We developed a model specifically designed for MPI
The Either AND or OR Model
• Umpire uses an enhanced wait-for graph (WFG) with:
  – AND-semantic arcs (drawn solid): waits for all
  – OR-semantic arcs (drawn dashed): waits for any
  – Each node uses only one arc type; each task executes the waited-for calls
• Deadlock criterion? A knot: a non-empty set of nodes N where, for all nodes x in N, descendants(x) equals N
  – A cycle is not sufficient: a best-case reduction of the example removes the cycle, so no deadlock exists despite it
  – A knot is not necessary: an example with OR arcs deadlocks although no knot is present
A Necessary and Sufficient Deadlock Condition: The OR-Knot
• The OR-Knot is a relaxed knot:
  – A set of nodes N where each node can reach all nodes in N
  – Nodes may also reach further nodes
  – But: there must not be an AND arc from a node in N to a node not in N
• Examples (figure): an OR-Knot (in red); a graph that is still an OR-Knot; a graph with no OR-Knot
Signal Reduction Detection for the Either AND or OR Model
• Uses best-case reduction of wait-for conditions
• Sinks (fan-out = 0) can satisfy wait-for conditions
• Two reduction types (sketched in code below):
  – AND: removes one incoming arc of a sink
  – OR: removes all outgoing arcs of a node connected to a sink
• Example (figure): an OR reduction followed by an AND reduction
• Deadlock if the resulting WFG has a non-empty arc set
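A sketch of this reduction under our own data-layout assumptions (an adjacency-matrix WFG where arc[i][j] means "i waits for j", plus a per-node AND/OR type):

  /* Best-case reduction: sinks satisfy wait-for conditions of their
   * predecessors; deadlock iff arcs remain when no reduction applies. */
  #include <stdbool.h>

  #define MAXN 64
  typedef enum { AND_NODE, OR_NODE } NodeType;

  bool deadlocked(int n, bool arc[MAXN][MAXN], NodeType type[MAXN]) {
      bool changed = true;
      while (changed) {
          changed = false;
          for (int s = 0; s < n; s++) {
              bool sink = true;                    /* fan-out == 0 ? */
              for (int j = 0; j < n; j++)
                  if (arc[s][j]) sink = false;
              if (!sink) continue;
              for (int i = 0; i < n; i++) {
                  if (!arc[i][s]) continue;
                  if (type[i] == AND_NODE) {
                      arc[i][s] = false;           /* AND: remove one incoming arc of the sink */
                  } else {
                      for (int j = 0; j < n; j++)
                          arc[i][j] = false;       /* OR: one satisfied arc releases node i */
                  }
                  changed = true;
              }
          }
      }
      for (int i = 0; i < n; i++)                  /* deadlock iff arcs remain */
          for (int j = 0; j < n; j++)
              if (arc[i][j]) return true;
      return false;
  }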
MPI Runtime Error Detection with Marmot
Content
• Motivation & Overview
• Architecture
• Usage
• Example
• Integrations
  – Cube
  – Visual Studio
  – DDT
  – Vampir
Motivation & Overview
Open source: http://www.hlrs.de/organization/av/amt/research/marmot
• Project founded 2003 at HLRS in Germany
  – Now developed by ZIH (TU Dresden) and HLRS
• Goal: enhance MPI usability
  – A consequence of a very lengthy debugging session
• Funded in/by: CrossGrid, Microsoft, VI-HPS, ParMA, H4H
• Design philosophy:
  – C++ library
  – Requires no source modifications
  – MPI-1.2 support + some MPI-2
  – Lots of usability
Architecture
• Process-local checks on the application processes
• Non-local checks on an additional “Debug Server” process (e.g., timeout-based deadlock detection)
Usage
• Use the Marmot compiler wrappers to compile and link:
  – Replace compiler calls by the appropriate wrapper
  – For C/C++: marmotcc or marmotcxx
  – For Fortran: marmotf77 or marmotf90
  – Source code instrumentation is added automatically
• Execution with Marmot requires one additional process
  – Used for the Debug Server
  – Instead of mpirun -np n call mpirun -np n+1
  – Marmot's checks cause overhead
• Environment variables control Marmot's behaviour
Example – Code

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Type_contiguous (2, MPI_INT, &cont2Int);
  /* note: cont2Int is never committed with MPI_Type_commit before use,
     which is the kind of datatype usage error Marmot reports */

  assert (size >= 2);

  if (rank == 0)
      MPI_Send (s_buf, 1, cont2Int, 1, 7 /*Tag*/, MPI_COMM_WORLD);

  if (rank == 1)
      MPI_Recv (r_buf, 1, cont2Int, 0, 7 /*Tag*/, MPI_COMM_WORLD, &status);

  MPI_Type_free (&cont2Int);
  MPI_Finalize ();
Example – Building and Running
• Build:
  -> marmotcc datatype_sc10.c -o my_exe
• Run (2 application processes):
  -> mpirun -np 3 my_exe
• Result (for HTML mode):
  -> Marmot_my_exe.<TIMESTAMP>.html
Example – Result (Environment Settings)
The beginning of the Marmot correctness report lists the environment settings.
Example – MPI Usage Error
The report shows: the correctness error, what the issue is, the details in the MPI standard, where it occurs in the source, and background information.
Integrations
• Marmot is integrated in/with:
  – Cube (data visualization tool)
  – DDT (debugger)
  – MS Visual Studio (IDE)
  – Eclipse (IDE, beta)
  – VampirTrace
• Our integration goal: provide a “check with Marmot” button
Integrations – Cube
• Cube from FZJ (Jülich) provides a convenient overview: what error occurred, where in the source, and on which processes.
Integrations – Visual Studio
• Integration with MS Visual Studio (figure): a launch tool for Marmot and Cube-like error visualization with source highlighting.
Integrations – Visual Studio (contd.)
• The integration with MS Visual Studio includes:
  – A Marmot port to Windows
  – A Visual Studio plug-in with:
    • Error visualizer
    • Launch tool
  – MPI API help (MPI 1.2 and MPI 2.1)
  – An installer for MPICH and MS-MPI
Integrations – Vampir[Trace]
• Marmot errors are shown in the timeline, with details for the selected error.
Integrations – VampirTrace
UniMCI (open source): www.tu-dresden.de/zih/unimci
• The integration uses UniMCI:
  – Universal MPI Correctness Interface
  – Provides MPI correctness checking to other tools
  – Host tool: wants to use correctness checking
  – Guest tool: implements correctness checking
  – Schematic (figure): Host Tool <-> UniMCI <-> Guest Tools 1..N (correctness tools)
• Software installation order: (1) Marmot; (2) UniMCI; (3) VampirTrace
Heat Flow Example
• MPI implementation of 2D heat conduction
• Border exchange (figure): the domain is split across processes (P0, P1, P2, P3, …); each process Pi exchanges its borders with its neighbors Pk, Pj, Pm, and Pn, with a send to and a receive from each neighbor
Live Demonstration
• Usage:
  – Replace the compiler command with the Marmot tool, e.g., mpicc -> marmotcc
  – Run with 1 extra process
• Examples:
  – Datatype example
  – Heat conduction
ISP: A Dynamic MPI Checker Emphasizing Nondeterminism
Description of an Idealized Testing Tool
1. Eliminates redundant tests
   – Example: posting a deterministic send/receive in both orders is wasteful (w.r.t. testing priorities)
2. De-biases from absolute speeds
   – Schedules must not be a victim of the sequential execution speed of individual processes
3. Forces non-determinism coverage
   – Not only determines where non-determinism is, but also forces those cases to get tested
4. Forces non-determinism coverage even around complex operations (e.g., collectives)
   – Testing unbiased by collective semantics
5. Is based on a uniform underlying theory
   – Say, a “happens-before” model
6. Is able to cover the input space
7. Provides an intuitive user interface within popular frameworks
Meeting all these goals is difficult!
• ISP (In-situ Partial Order) is our tool that meets all goals except Goal 6 for a reasonably large subset of MPI 2.0
• Goal 6 (input space coverage) typically requires symbolic analysis of program paths (not supported)
• Goals 1-5 are met using a special verification scheduler
  – The ISP scheduler AUTOMATICALLY replays a given program enough times until the non-determinism space is covered!
• Goal 7 is met by embedding within Eclipse PTP (GEM, an officially released PTP 4.0 component)
Flow of ISP
• The MPI program is compiled into an executable whose processes (Proc1 … Procn) run against an interposition layer that hijacks MPI calls (see the PMPI sketch below)
• The ISP scheduler decides how the calls are sent to the MPI runtime
• The scheduler plays out only the RELEVANT interleavings
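A minimal sketch of the PMPI-based interposition such a layer builds on (illustrative only, not ISP's actual code; notify_scheduler is a hypothetical hook):

  #include <stdio.h>
  #include <mpi.h>

  /* Hypothetical scheduler hook; a real tool would block here until the
   * scheduler permits the call to proceed. */
  static void notify_scheduler(const char *call, int peer) {
      fprintf(stderr, "intercepted %s (peer %d)\n", call, peer);
  }

  /* The wrapper shadows MPI_Send; the real library is reached through
   * the PMPI profiling entry point. */
  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm) {
      notify_scheduler("MPI_Send", dest);
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
  }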
Example Illustrating Goals 1 and 2

  Process P0           Process P1         Process P2
  R(from:*, r1);       Sleep(rand());     Sleep(rand());
  R(from:2, r2);       S(to:0, r1);       S(to:0, r1);
  S(to:2, r3);                            R(from:0, r2);
  R(from:*, r4);                          S(to:0, r3);
  All the Ws…          All the Ws…        All the Ws…

• Cover this case: P0's first wildcard receive matches P1's send (no deadlock)
• … and also this case (no more): it matches P2's send instead (deadlock)
Meeting Goal 3: Determinize to “fire and forget”

  Process P0           Process P1         Process P2
  R(from:*, r1); …     Sleep(rand());     Sleep(rand());
                       S(to:0, r1); …     S(to:0, r1); …

• Before issuing into the MPI runtime, the scheduler rewrites the wildcard receive to a specific source: R(from:1, r1) in one run and R(from:2, r1) in the replay, each issued together with the matching S(to:0, r1)
Goal 4: Forcing ND-coverage around collectives

  P0                  P1     P2
  IS1 (to: P2);       B;     IR(from: *);
  B                          B;
                             IS2 (to: P2)

• Can S2 (to: P2) match R(from: *)? YES!
• ISP handles this situation through:
  – Out-of-order execution
  – Dynamic instruction rewriting
• The scheduler collects the calls of all processes, issues them into the MPI runtime, and forms the match sets. It rewrites IR(from: *) to one specific sender and plays that match; it then re-executes to get back to the same point and rewrites the receive to the other sender to play the remaining match.
Goal 5: Scheduling based on Happens-before
• ISP schedules based on a “happens-before” (HB) model; all these “funny” examples are handled uniformly by ISP
• Even the effects of buffering, and MPI's “eager” buffering, are handled this way
• A simple “auto-send” example (HB graph in the figure):
  P0: IR(from:0, h1); B; IS(to:0, h2); W(h1); W(h2);
• The example used in the scheduler animation that follows:
  Process P0: Isend(1, req); Barrier; Wait(req);
  Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
  Process P2: Barrier; Isend(1, req); Wait(req);
Putting it all together: Study one example
• ISP scheduler actions (animation), for the P0/P1/P2 example above:
  – P0's Isend(1) and P1's Irecv(*) are collected by the scheduler, which answers each process with sendNext until all three processes reach Barrier
  – The scheduler then issues the three Barriers into the MPI runtime
  – P1 proceeds to Recv(2) and Wait; P2 issues its Isend(1); the scheduler rewrites Irecv(*) to Irecv(2) and issues the Isend/Irecv/Wait operations
  – In the interleaving where Irecv(*) is matched with P2's send, no match set remains for the outstanding operations: Deadlock!
ISP Summary
• Tested on many examples
• Tested against five MPI libraries
  – MPICH2, OpenMPI, MVAPICH, Microsoft MPI, IBM MPI
• Versatile
  – Runs on one's laptop
  – Can handle ParMETIS (15K LOC) for 32 procs on a laptop
• Suggested uses
  – Debug using ISP; then perform large-scale debugging
  – “Knows” enough about MPI that it can supplement textbooks
    • Embedding of Pacheco's book examples as projects within GEM
• Future work:
  – Hybrid verification
  – Finding errors in the C-space of behaviors
GEM: Graphical Explorer of MPI Programs (Demo of ISP Integration with Eclipse PTP 4.0)
Live Demonstration
• General features of GEM
• Examples:
  – Examples with deadlocks and leaks
  – Heat conduction
Improved Scalability through Umpire
Umpire: greater scalability through asynchrony
• Similar error detection capability to Marmot
• Asynchronous design that separates error detection from application execution
• Rudimentary error reports
• GUI support based on ToolGear
  – Not currently maintained
• Planned replacement by MUST (later session)
Umpire divides MPI usage information and correctness properties into local and global
• For each MPI task:
  – MPI call data: parameters, program counter
  – Local properties: resource usage, buffer management
• Global/non-local properties: deadlock freedom, message reception, message type matching
Umpire uses the MPI profiling layer to collect MPI call information
• Interposition via the MPI profiling layer (PMPI) sits between the MPI application and the MPI runtime system
• It collects pre- and post-MPI-call information
Outfielder thread checks local MPI properties
• An “outfielder” thread per task receives the MPI call information through shared memory
• The outfielder tracks resource usage and performs all other local asynchronous tasks*
  * Except data types and non-blocking send buffer write errors; also, it does not track MPI communicators
Checking Global MPI Properties
• A manager performs message matching, detects deadlocks, and checks data integrity
• Out-of-band communication mechanisms between the outfielders (one per MPI task) and the manager:
  – Shared memory
  – TCP/IP (originally planned; currently on hold)
  – MPI
Putting the Pieces Together
• One outfielder per MPI task; the outfielders send heartbeats to the manager over MPI
• The manager performs message matching, detects deadlocks, and checks data integrity
• Maximizes asynchrony; assumes MPI_THREAD_MULTIPLE
Umpire Manager Report

  --------------- Umpire Manager Report ---------------
  --- Communicators
  ---   Communicator Leaks
  ---   Redundant Comm Frees
  --- Datatype Errors
  ---   Type Mismatches
  ---   Bad Type Handles
  ---   Redundant Type Commits
  ---   Incompatible Types for MPI_Op
  --- Request Errors
  ---   Unfreed Persistent Init Errors
  ---   Lost Requests
  ---   Dropped Requests
  ---   Bad Handles for Request Frees
  ---   Active Request Frees
  ---   Other Bad Request Handles
  ---   Bad Activations
  ---   Bad Reactivations
  --------------- End of Umpire Manager Report ---------------
Umpire Outfielder Report

  --------------- Umpire Outfielder Report ---------------
  --- Unreleased Derived Types
  --- Redundant Type Frees
  --- Reused Type Handles
  --- Communications with Uncommitted Types
  --- Unreleased Groups
  --- Redundant Group Frees
  --- Reused Group Handles
  --- Unreleased Errhandlers
  --- Redundant Errhandler Frees
  --- Reused Errhandler Handles
  --- Unreleased MPI_Ops
  --- Redundant MPI_Op Frees
  --- Reused MPI_Op Handles
  --- Send Overwrites
  --------------- End of Umpire Outfielder Report ---------------
Umpire detects resource tracking errors
• Tracks most resources in the outfielder
  – Tracks the MPI-assigned handle (follows the PMPI call)
  – A variable may change without a leak
• Requests and communicators are tracked by the manager
  – Needed to perform MPI message matching
    • Used to detect dropped requests
    • Also needed for type matching and deadlock detection
  – Avoids duplicate storage by not tracking them in the outfielder
• Detects lost requests similarly to other resource tracking errors
• Also ensures proper use of persistent requests: Init (Start Complete)* Free (see the sketch below)
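A sketch of the persistent-request lifecycle that this check enforces (the peer, buffer, and iteration count are our assumptions):

  /* Persistent request lifecycle: Init (Start Complete)* Free. Umpire
   * flags deviations such as freeing an active request or never
   * freeing the handle. */
  #include <mpi.h>

  void persistent_lifecycle(int peer, int iterations) {
      int buf[4] = {0};
      MPI_Request req;
      MPI_Send_init(buf, 4, MPI_INT, peer, 0, MPI_COMM_WORLD, &req); /* Init */
      for (int i = 0; i < iterations; i++) {
          MPI_Start(&req);                     /* Start */
          MPI_Wait(&req, MPI_STATUS_IGNORE);   /* Complete */
      }
      MPI_Request_free(&req);                  /* Free */
  }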
MPI Type Matching in Umpire
• Represents types as regular expressions
  – Determined in the user process when committed
  – Factored using a “smallest first” canonical order
    • e.g., {(c2(dc)3)3}, not {(c(cd)3c)3}
    • e.g., {((dc)2dc3)5}, not {((dc)3c2)5}
• Compares “greatest contained” factors
  – In the manager, “on demand”
  – Ignores the outermost count
  – Can match exactly, or partially one-way or both ways
  – Computes and stores a “partial count”
  – “Both” implies a count of at most one for one type
  – Remembers results and combines exact matches
• Compares using send/recv count * outermost count
Umpire MPI Type Mismatch Output

  --- Type Mismatches
  57 type mismatch errors found:
  1 occurrence at 10001938 (MPI_COMM_WORLD rank 1)
  1/4/10001938: 1011 MPI_Bcast pre
    umpi_op_ref_count = 1  buf = 540349096  count = 512  datatype = 2  root = 0  comm = 0
  Secondary ops:
  1 occurrence at 10000830 (MPI_COMM_WORLD rank 0)
  0/4/10000830: 1011 MPI_Bcast pre
    umpi_op_ref_count = 1  buf = 540361272  count = 128  datatype = 8  root = 0  comm = 0
  1 occurrence at 10000894 (MPI_COMM_WORLD rank 0)
  0/5/10000894: 1047 MPI_Gather pre
  Etc.
Umpire has two deadlock detection algorithms
• MPI deadlock queues
  – One per task in the manager
  – Track blocking MPI messaging operations
    • Items are added through transactions
    • Removed when safely matched
• Simple reduction algorithm described earlier (see the ICS 2009 paper)
• Automatically detects deadlocks
  – MPI operations only
  – Wait-for graph
  – Recursive algorithm
  – Invoked when a queue head changes
• Also supports timeouts (not currently used)
Umpire MPI Deadlock Output

  0/0: DEADLOCK DETECTED. Aborting
  MGR DEADLOCK Q HEADS --------------------
  -----TASK 0 -----
  0/2/10000460: 1011 MPI_Bcast pre
    umpi_op_ref_count = 3  buf = 804397376  count = 128  datatype = 8  root = 1  comm = 0
  -----TASK 1 -----
  1/2/1000047c: 1010 MPI_Barrier pre
    umpi_op_ref_count = 4  comm = 0
  MGR DEADLOCK Q HEADS END--------------------
  MGR DEADLOCK Q DUMP --------------------
  MGR DEADLOCK Q 0 -----
  0/2/10000460: 1011 MPI_Bcast pre
    umpi_op_ref_count = 3  buf = 804397376  count = 128  datatype = 8  root = 1  comm = 0
  MGR DEADLOCK Q 1 -----
  1/2/1000047c: 1010 MPI_Barrier pre
    umpi_op_ref_count = 4  comm = 0
  MGR DEADLOCK Q DUMP END--------------------
MPI_ANY_SOURCE Receive Deadlocks
• Complicate deadlock detection significantly
  – Must obtain the actual source from the implementation
  – Timing-dependent deadlocks
• Simple example code:
  Task 0: MPI_Recv(ANY); MPI_Send(1)
  Task 1: MPI_Send(0); MPI_Recv(0)
  Task 2: MPI_Send(0); MPI_Recv(ANY)
• Umpire detects errors that actually occur
Impact on Point-to-point Bandwidth
(Figure: bandwidth in MB/s vs. message size, Base Hera vs. Umpire Hera)

Application Impact: sPPM
(Figure: time in seconds for 16 to 256 tasks, base vs. Umpire, at 4 and 8 tasks per node)
• Computation-bound
• NO checksums
Application Impact: sPPM (contd.)
(Figure: slowdown (Umpire time / base time) for 16 to 256 tasks, Umpire at 4 and 8 tasks per node)
Application Impact: SMG2000
(Figure: slowdown (Umpire time / base time) for 16 to 128 tasks)
• Excessive MPI calls with little overlapping
• NO checksums
Verification at Large Scale: DAMPI
Motivation
• ISP does not scale beyond 100+ processes
• We need a verification tool at large scale:
  – Some MPI programs require a large number of processes to run certain problem sizes
  – Some bugs only become manifest at large scale (buffer overflows, indices out of range)
  – Large-scale programs stress the MPI implementation and will expose MPI implementation bugs
Rethinking ISP for Large Scale
• Why ISP did not do well at large scale:
  – Centralized scheduler with tightly integrated checks
  – Not using each process's own cycles
    • This is essential at large scale
• How about:
  – Using all processes' cycles to accomplish the ISP scheduler's work
  – Giving users more flexibility:
    • Modularize all error checks
    • Allow users to focus coverage on specific code regions
DAMPI at a glance
• DAMPI: Distributed Analyzer for MPI
• The MPI program's executable (Proc1 … Procn) runs on top of the DAMPI PnMPI modules and the MPI runtime
• The PnMPI modules report alternate matches to a schedule generator, which produces epoch decisions and reruns the program to enforce them
DAMPI modules
• The DAMPI driver sits on top of the DAMPI PnMPI modules:
  – Core module
  – Piggyback module
  – Optional error checking modules: status, request, communicator, type, and deadlock modules
Discovering potential non-deterministic matches
• Recall the key concept (from ISP): a send and a receive can match if they are co-enabled
• Realizing happens-before (HB) edges is the key
  – Easy to do with centralized bookkeeping
  – In a distributed setting: more sophisticated, but possible
    • Use logical clocks (Lamport clocks, vector clocks)
A glimpse into the DAMPI algorithm
• Keeps track of non-deterministic (ND) events through Lamport clocks
• Clocks are communicated through piggybacking (see the sketch below)
• The DAMPI driver uses these clocks to determine whether sends and receives have happens-before edges
• Enforces each potential outcome through replays
• Want to know more? Wednesday
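A minimal sketch of Lamport-clock piggybacking via PMPI wrappers (illustrative, not DAMPI's actual mechanism: here the clock travels as a separate message on a hypothetical reserved tag; real implementations pack it with the user data):

  #include <mpi.h>

  static int lamport_clock = 0;
  #define PB_TAG 999  /* hypothetical reserved piggyback tag */

  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm) {
      lamport_clock++;                              /* tick on send */
      PMPI_Send(&lamport_clock, 1, MPI_INT, dest, PB_TAG, comm);
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
  }

  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int src,
               int tag, MPI_Comm comm, MPI_Status *status) {
      MPI_Status local;            /* needed to learn the actual source */
      int rc = PMPI_Recv(buf, count, datatype, src, tag, comm, &local);
      int incoming;
      PMPI_Recv(&incoming, 1, MPI_INT, local.MPI_SOURCE, PB_TAG, comm,
                MPI_STATUS_IGNORE);
      if (incoming > lamport_clock)                 /* merge clocks, */
          lamport_clock = incoming;
      lamport_clock++;                              /* then tick on receive */
      if (status != MPI_STATUS_IGNORE) *status = local;
      return rc;
  }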
DAMPI Experimental Results
(Figure: ParMETIS-3.1 (no wildcard), time in seconds vs. number of tasks (4 to 32), ISP vs. DAMPI)
DAMPI Results
(Figure: matrix multiplication with wildcard receives, time in seconds vs. number of interleavings (250 to 1000), ISP vs. DAMPI)
Impact on large applications: SPEC MPI2007 and NAS-PB

  Benchmark      Slowdown  Total R*  Communicator Leak  Request Leak
  ParMETIS-3.1   1.18      0         Yes                No
  104.milc       15        51K       Yes                No
  107.leslie3d   1.14      0         No                 No
  113.GemsFDTD   1.13      0         Yes                No
  126.lammps     1.88      0         No                 No
  130.socorro    1.25      0         No                 No
  137.lu         1.04      732       Yes                No
  BT             1.28      0         Yes                No
  CG             1.09      0         No                 No
  DT             1.01      0         No                 No
  EP             1.02      0         No                 No
  FT             1.01      0         Yes                No
  IS             1.09      0         No                 No
  LU             2.22      1K        No                 No
  MG             1.15      0         No                 No
Scalable MPI Runtime Error Detection with MUST
Content
• Correctness Checking at Scale
• MUST
  – Overview
  – Design
  – Scalable correctness checks
  – Required tool infrastructure
  – Software layers
• GTI
  – Overview
  – Communication system
  – Generation and instantiation
• Summary
Correctness Checking at Scale
• Marmot:
  – Usability tools (compiler wrappers, …)
  – Many integrations
• However, at scale:
  – The Debug Server is a bottleneck
  – (Processes notify the server, blocking, before executing the actual MPI call)
• MPI correctness checking needs to scale, e.g., a case from Sabine Roller and Harald Klimach (GRS-Sim, Aachen): the smallest runnable job configuration providing usable results needs 1680 cores and at least 4 days
Correctness Checking at Scale (contd.)
• Marmot performance test with SPEC MPI2007 (figure)
  – Combination with VampirTrace, only local checks
MUST – Overview
• MUST (Marmot Umpire Scalable Tool) = Umpire + Marmot + PnMPI
  – Local checks from Marmot
  – Non-local checks from Umpire (e.g., deadlock detection)
  – PnMPI as infrastructure
  – Ongoing project, still in development
• Goals:
  – Combine the checks into one tool
  – Overcome scalability limitations
  – Maintainable and extendable checks (e.g., MPI-3)
MUST – Overview: A Dynamic Tool Infrastructure (PnMPI)
• Transparent layering of MPI tools (figure: Application -> PMPI Tool 1 -> PMPI Tool 2 -> MPI Library)
  – Binary rewrite of PMPI tools into modules
  – Configuration at application startup
  – The PnMPI core loads modules into stacks
• Optional: tool modules can register with the core
  – Share/request services
MUST – Design
• Uses PnMPI and fine-grained modules
• Each correctness check is a module:
  – Needs specified input data
  – Can run anywhere
  – May use other modules for collaboration
• Checks run on a “place”:
  – An application thread, or
  – An extra thread/process
• Places are connected with a communication network (TBON)
MUST – Required Tool Infrastructure (1/2)
• Example configuration for 8 application processes (figure: a tree network over processes 0-7):
  – Layout for places
  – Communication system
  – Distribution of correctness checks
MUST – Required Tool Infrastructure (2/2)
• MUST needs an infrastructure that provides:
  – Generation of MPI wrappers
  – Spawning of extra tool threads/processes
  – Records to communicate MPI trace data
  – A (flexible + scalable) communication system
  – Forwarding of trace data to the checks
  – Handling of application crashes
• Existing approaches:
  – MRNet is close, but: no wrappers, no records, no data forwarding to checks, (no crash handling)
MUST – Marmot & Umpire
• MUST re-uses checks:
  – Local checks from Marmot: invalid arguments, resource errors, buffer errors, call order errors, …
  – Non-local checks from Umpire: deadlock detection, type matching, collective validation, … (the Umpire checks are currently centralized)
MUST – Scalable Correctness Checks
• MUST uses a reduction network
  – Well suited to verify collective calls, e.g., MPI_Bcast(buf, count, type, root, comm):
    • root must match on all tasks
    • The signature spawned by (count, type) must match on all tasks
  – (Figure: places p1-p3 forward their (root, count, type) tuples to a place q, which reduces them to CORRECT or INCORRECT)
MUST – Scalable Correctness Checks (contd.)
• A reduction network is less suited for message matching:
  – N tasks: N^2 possible matches exist (in fact N + ((N-1)*N)/2)
  – With M places, each receiving from N/M tasks, each place can detect (N/M)^2 matches
  – Total matches detected: M*(N/M)^2 = N^2/M
  – E.g.: M = 100, N = 1000 => the first tool layer only detects 1% of the matches
• Solution: use a layer with inter-communication for the matching
MUST – Software Layers
• MUST uses a 3-layer software stack:
  – MUST: the checks
  – Generic Tool Infrastructure (GTI): trace records, communication, places, …
  – PnMPI: module infrastructure, basic modules
GTI – Overview
• GTI (Generic Tool Infrastructure) consists of:
  – Interfaces for modules with different tasks
  – Implementations for these interfaces
  – A complex generator for wrapper generation, trace record generation, and instantiation
• MPI and MUST agnostic:
  – Basically an infrastructure for executing analyses in a parallel environment
  – For MUST: “analysis” = “correctness check”
  – The API being used is an input for the GTI generation; e.g., for MUST this is an MPI description
GTI – Communication System (1/4)
• An instance of the tool has multiple layers
• Pairs of layers may be connected (no cycles)
• E.g. (figure): Layer 1 -> Layer 2 -> Layer 3
GTI – Communication System (2/4)
• Each layer may contain multiple places
• For MPI: the first layer would contain all MPI tasks
• E.g. (figure): Layer 1 holds tasks 0-7, Layer 2 holds places a-d, Layer 3 holds one place
GTI – Communication System (3/4)
• A connection between layers i and j means that each process in i is connected to exactly one process of j
• E.g. (figure): a reduction network from tasks 0-7 through places a-d to one root place
GTI – Communication System (4/4)
• Each pair of connected processes uses a strategy and a protocol for its communication
  – Strategy: decides when to transfer data, e.g., aggregation of records
  – Protocol: decides how to transfer data, e.g., MPI, TCP, …
GTI – Generation and Instantiation
• Each place has one ingoing strategy
  – Which may receive data from multiple processes
  – Except the application layer -> no ingoing strategy
• Tasks:
  – Receive trace data
  – Forward it to a processing and forwarding interface
• Generated wrapper: intercept, call checks, create records, forward records
• Generated receival/forward module: unpack the serialized record, call checks, forward
GTI – Generation and Instantiation (contd.)
• Modules for wrappers, records, and data forwarding need to be generated to instantiate MUST
• The following things need to be specified:
  – What calls exist and what data they provide
  – What checks exist and what data they require
  – What tool layers are used, how they are connected, and what checks they run
  – What communication modules should be used
• This is specified in XML files
• The “System Builder” (part of GTI) processes these
GTI – Generation and Instantiation: From Specifications to an Instance
• Four kinds of specifications feed the generation:
  – GTI specification: what communication modules and what types of places are available
  – Analysis specifications: what checks exist; what are their collaborations, inputs, …
  – Layout specification (optionally produced with a layout GUI): how many layers there are, how the layers are connected, and what checks (analyses) run on each layer
  – API specifications: what calls can be wrapped, what their arguments are, and what analyses use the arguments
• The System Builder's central component, the “Weaver”, processes and relates these specifications; it generates no code itself
• The Weaver emits generation inputs for wrappers and receival/forwarding: XML files, one for each layer
• The Wrapper Generator and the Receival/Forward Generator process these input XMLs and create code, including the trace records (details later)
• Together with intermediate modules and the module library (check, communication, and place modules), the generated wrapper and receival/forward modules form a PnMPI configuration and an executable tool instance
Summary
• MUST: scalable MPI correctness checking
• Checks are modules and can run anywhere
  – PnMPI as base infrastructure
• Based on the Generic Tool Infrastructure
  – Very flexible (MPI agnostic)
  – Uses reduction networks
  – Instantiation uses generated code
  – Generation works on XML descriptions
MPI Runtime Error Detection in Hybrid OpenMP/MPI Applications
Overview
• OpenMP
  – Parallel threads, shared memory
• MPI
  – Parallel processes
  – No shared memory; messages are used for exchange
• (Figure: threads within a process share memory; processes exchange messages via MPI)
Overview – Hybrid OpenMP/MPI
• (Figure: processes 0..m, each with its own memory and threads 0..n, connected through the MPI interface)
• Communication:
  – Between threads of one process via shared memory
  – Between threads of different processes via MPI
Overview – MPI Thread Support
• The MPI-2 standard defines levels of thread support:
  – MPI_THREAD_SINGLE: there is only one thread
  – MPI_THREAD_FUNNELED: only the main thread performs MPI calls
  – MPI_THREAD_SERIALIZED: only one thread is in MPI at a time
  – MPI_THREAD_MULTIPLE: threads may call MPI simultaneously
• MPI_Init_thread is used to request a certain level
• Further restrictions in the MPI standard
  – E.g.: a communicator, file handle, or window must not be used in multiple collective calls simultaneously
Marmot Support for Hybrid OpenMP/MPI
• Three steps:
  (1) Marmot extensions to operate in hybrid mode
      – Synchronization
  (2) Marmot checks for hybrid OpenMP/MPI errors
      – Usage errors of MPI that result from OpenMP usage
      – Detect errors that appear in a run with Marmot
  (3) Advanced checks that detect errors in alternative execution orders
      – Uses Intel Thread Checker
Step 1: Synchronization
• (Figure: call path Application -> Wrapper -> Synchronization -> Core -> MPI: MPI_X, enterMARMOT, checkAndExecute, pre-call checks, enterPMPI, PMPI_X, leavePMPI, post-call checks, leaveMARMOT; the MARMOT core sections are protected)
Step 2: Marmot Checks (Example)
• “Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.” [MPI-1, p. 130, lines 37-41]
• Implementation pseudo-code:
  Pre-execution code for all collectives:
    if check_is_comm_in_use(comm) == TRUE then print_error()
    register_comm_as_used(comm)
  Post-execution code for all collectives:
    unregister_comm(comm)
Example – Code

  // init MPI
  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  if ((rank == 0) && (provided != MPI_THREAD_MULTIPLE))
      printf ("WARNING MPI_THREAD_MULTIPLE not supported\n");

  // set num threads
  omp_set_num_threads (4);

  #pragma omp parallel
  {
      MPI_Barrier (MPI_COMM_WORLD); // this is erroneous!
  }
Step 3: Advanced Checks
• Consider:

  #pragma omp parallel private(thread)
  {
      #pragma omp sections
      {
          #pragma omp section
          { MPI_Barrier(MPI_COMM_WORLD); }
          #pragma omp section
          { sleep(5); MPI_Barrier(MPI_COMM_WORLD); }
      }
  }

• It is almost certain that Marmot won't detect this error! The sleep makes the two barrier calls execute at different times in nearly every real run, so the potential concurrent use of MPI_COMM_WORLD never manifests for Marmot's runtime check.
Step 3: Advanced Checks (contd.)
• Intel Thread Checker
  – Detects data races, deadlocks, and erroneous parallel execution
  – Uses a simulation approach in order to detect race conditions deterministically
  – Requires binary or source instrumentation
• Race detection is used to enhance Marmot's checks
Step 3: Advanced Checks (contd.)
• Assume Marmot writes to a variable “MyDataRace” when wrapping an MPI call:
  – Intel Thread Checker will detect a race on “MyDataRace” if and only if two MPI calls could be executed in parallel on one process
  – Thus, a race occurs if MPI_THREAD_MULTIPLE is used
  – Thread Checker's simulation detects this for every interleaving
• Marmot uses artificial races for MPI error detection
• Correctness errors are listed in the Thread Checker output
Example – Data Race for Communicator Error
• Wanted: a race that appears if the collective-communicator restriction is violated:
  – Requires two threads calling a collective
  – Requires usage of the same communicator
  => One conflict variable per communicator
• Pseudo-code:
  Pre-execution code for all collectives (before synchronization starts):
    begin_critical()
    index = map_comm_2_index(comm)
    end_critical()
    conflict_variable[index] = 1
Summary
• Marmot supports hybrid OpenMP/MPI
• Detects several MPI usage errors that result from the presence of multiple threads
• Advanced detection:
  – Uses artificial data races
  – Needs a data race detector, e.g., Intel Thread Checker
  – Improves the quality of the checks
Multiple Concurrency Models
Issues Verifying Hybrid Concurrency Models
• Semantics of interaction are unclear
• Some APIs (e.g., OpenMP) don't provide a standard interface for OMP thread/task scheduling control
• One approach we have tried with OMP + MPI: determinize OMP schedules, to allow MPI non-determinism to replay successfully
• Another project: verify MPI + CUDA
• A little tutorial on our CUDA verifier follows
Our CUDA / OpenCL Verification Flow
• A C application containing multiple kernels is analyzed into kernel invocation contexts, kernel descriptions, and CPU/GPU communication code
• The PUG analyzer checks the kernels for races and assertions; the CPU/GPU Communication Verifier (CGV) checks the communication code
• Both produce verification results
PUG's Symbolic Approach
• The analyzer (supported by LLNL's ROSE) turns the C application's kernels into verification conditions, i.e., “constraints”
• A constraint solver (fast logical decision procedures) checks them:
  – UNSAT: the instance is “OK”, i.e., race-free, no mismatched barriers, passes user assertions
  – SAT: the instance has bugs; the SAT instance puts out “bread crumbs” to help debug
Sample results: Bug-free Examples
• +O: required assertions to specify that bit-vector computations don't overflow
• +C: required constraints on the input values
• +R: required manual loop refinement
• B.C.: measures how serious the bank conflicts are
• Time: SMT solving time in seconds to confirm the absence of issues
Sample results: Buggy Examples
• We tested 57 assignment submissions from a recently completed graduate GPU class taught in our department
• Defects: indicates how many kernels are not well parameterized, i.e., work only in certain configurations
• Refinement: measures how many loops need automatic refinement
Real race (GPU class)

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
      d_out[threadIdx.x] = 0;
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
          d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
      }
      __syncthreads();
      assume(blockDim.x <= BLOCKSIZE / 2); // for testing
      if (threadIdx.x % 2 == 0) {
          for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
              d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
          }
      }
      /* The counterexample given by PUG is (TRY HITTING THIS VIA RANDOM TESTING!):
         t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0, that is,
         d_out[threadIdx.x + 8*i] += d_out[threadIdx.x + 8*i + 1];
         d_out[2 + 8*1]  += d_out[10 + 8*0 + 1];
         d_out[10]       += d_out[10]   -- a race!!! */
  }
Real __syncthreads() error

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
      // *d_sum = 0;
      d_out[threadIdx.x] = 0;
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
          d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
      }
      __syncthreads();
      // PUG found this synchronization error:
      if (threadIdx.x % 2 == 0) {
          for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
              d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
              __syncthreads(); // barrier inside a divergent branch
          }
      }
      …
  }
Combined MPI + CUDA / OpenCL verification
• Separately verify kernels using PUG (e.g.)
• Put the kernel codes within MPI and verify using one of the tools described here
• PUG can generate input sets that MPI testing must “hit” when invoking kernels
• Many open issues:
  – Multiple concurrent kernel launches
  – Data sharing between kernels and MPI
• Much more research is needed