Effectively Mapping Linguistic Abstractions for Message-passing Concurrency to Threads on the Java Virtual Machine

Ganesha Upadhyaya (ganeshau@iastate.edu) and Hridesh Rajan (hridesh@iastate.edu), Iowa State University

Supported in part by the NSF grants CCF-08-46059, CCF-11-17937, and CCF-14-23370.


Page 1:

Effectively Mapping Linguistic Abstractions for Message-passing Concurrency to Threads on the Java Virtual Machine

Ganesha Upadhyaya (ganeshau@iastate.edu) and Hridesh Rajan (hridesh@iastate.edu), Iowa State University

Supported in part by the NSF grants CCF-08-46059, CCF-11-17937, and CCF-14-23370.

Page 2:

1)  Local computation and communication behavior of a concurrent entity can help predict its performance.
2)  A modular and static technique can solve the mapping problem.

Page 3:

Problem

[Diagram: MPC abstractions a1–a5 are mapped onto cores core0–core3 (MPC System Architecture).]

MPC: Message-passing Concurrency, a1, a2, a3, a4 are MPC abstractions

2

Page 4:

Problem

[Diagram: MPC abstractions a1–a5 are mapped to JVM threads, and the OS scheduler maps those threads onto cores core0–core3 (MPC System Architecture).]

2

MPC: Message-passing Concurrency, a1, a2, a3, a4 are MPC abstractions

Page 5:

Problem

[Diagram as on Page 4.]

2

MPC: Message-passing Concurrency, a1, a2, a3, a4 are MPC abstractions

In this work : Mapping Abstractions to JVM threads

Page 6:

Problem

[Diagram as on Page 4.]

2

MPC: Message-passing Concurrency, a1, a2, a3, a4 are MPC abstractions

PROBLEM: Mapping Abstractions to JVM threads

GOAL: Concurrency

Page 7:

Problem

[Diagram as on Page 4.]

2

MPC: Message-passing Concurrency, a1, a2, a3, a4 are MPC abstractions

PROBLEM: Mapping Abstractions to JVM threads

GOAL: Concurrency + Performance (Fast and Efficient)

Page 8:

•  State of the art: Abstractions to Threads mapping is performed by programmers.

Motivation

3

Page 9:

•  Abstractions to Threads mapping is performed by programmers

•  MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads

Motivation

3

Page 10:

JVM-based MPC Frameworks: Akka, Kilim, Scala Actors, SALSA, Jetlang, Actors Guild, ActorFoundry

Dispatchers in Akka: default, pinned, balancing, calling-thread
Schedulers in Scala Actors: thread-based, event-based

4

The availability of a wide variety of schedulers and dispatchers suggests that:
–  programmers can choose the ones that work best for their applications, and
–  perform the mapping carefully.

Page 11:

•  Abstractions to Threads mapping is performed by programmers

•  MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads

•  Programmers find it hard to manually perform the mapping.
–  They start with an initial mapping and incrementally improve it.
–  This process can be tedious and time-consuming.

Motivation

3

Page 12:

•  Abstractions to Threads mapping is performed by programmers
•  MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads
•  Stack Overflow discussions about configuring and fine-tuning the mapping suggest that,
–  randomly tweaking the mapping without finding the root cause of the performance problem does not help, and
–  without knowing the nature of the tasks performed by the abstractions, the mapping task becomes hard.

Motivation

3

Page 13:

When manual tuning is hard,
•  Programmers use default mappings (default schedulers/dispatchers)

Motivation

5

Page 14:

When manual tuning is hard,
•  Programmers use default mappings (default schedulers/dispatchers)
•  Problem: a single default mapping may not work across programs

Motivation

5

Page 15:

Motivation

When manual tuning is hard,
•  Programmers use default mappings (default schedulers/dispatchers)
•  Problem: a single default mapping may not work across programs

5

LogisticMap (computes logistic map using a recurrence relation)
ScratchPad (counts lines for all files in a directory)

[Bar charts: execution time (s) vs. core setting (2, 4, 8, 12) for LogisticMap and ScratchPad under the thread, round-robin, random, and work-stealing default mappings.]

Page 16:

Motivation

When manual tuning is hard,
•  Programmers use default mappings (default schedulers/dispatchers)
•  Problem: a single default mapping may not work across programs

5

LogisticMap (computes logistic map using a recurrence relation)

[Bar chart: execution time (s) vs. core setting (2, 4, 8, 12) for LogisticMap under the four default mappings.]

Page 17:

Motivation

When manual tuning is hard,
•  Programmers use default mappings (default schedulers/dispatchers)
•  Problem: a single default mapping may not work across programs

5

ScratchPad (counts lines for all files in a directory)

[Bar chart: execution time (s) vs. core setting (2, 4, 8, 12) for ScratchPad under the four default mappings.]


Page 21:

When manual tuning is hard and default mappings may not produce the desired performance,

•  Brute force technique that tries all possible combinations of Abstractions to Threads mapping could be used

Motivation

6

Page 22:

When manual tuning is hard and default mappings may not produce the desired performance,

•  brute force technique that tries all possible combinations of Abstractions to Threads mapping could be used

•  Problem: Combinatorial explosion

Motivation

6

Page 23:

When manual tuning is hard and default mappings may not produce the desired performance,

•  brute force technique that tries all possible combinations of Abstractions to Threads mapping could be used

•  Problem: Combinatorial explosion
•  For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 65536 (4^8) different combinations (some may even violate the concurrency correctness property)

Motivation

6

Page 24:

When manual tuning is hard and default mappings may not produce the desired performance,

•  brute force technique that tries all possible combinations of Abstractions to Threads mapping could be used

•  Problem: Combinatorial explosion
•  For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 65536 (4^8) different combinations (some may even violate the concurrency correctness property)

•  Also, a small change to the program may require re-doing the mapping.

Motivation

6

Page 25:

When manual tuning is hard and default mappings may not produce the desired performance,

•  brute force technique that tries all possible combinations of Abstractions to Threads mapping could be used

•  Problem: Combinatorial explosion
•  For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 65536 (4^8) different combinations

•  Also, a small change to the program may require redoing the mapping.

A mapping solution that yields significant performance improvement over default mappings is desirable

Motivation

6

Page 26:

1) Local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping

Key Ideas

7

Page 27:

1) Local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping

•  Computation and communication behaviors:
–  externally blocking behavior
–  local state
–  computational workload
–  message send/receive pattern
–  inherent parallelism

Key Ideas

7

Page 28:

1) Local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping

•  Computation and communication behaviors:
–  externally blocking behavior
–  local state
–  computational workload
–  message send/receive pattern
–  inherent parallelism

2) Determining these behaviors at a coarse/abstract level is sufficient to solve the mapping problem

Key Ideas

7

Page 29:

•  Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner),

Solution Outline

8

Input Program

Page 30:

•  Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner),

•  Perform local program analyses statically to determine the behaviors (the proposed solution is both modular and static)

Solution Outline

8

[Diagram: Input Program → cVector Analysis → cVector.]

Page 31:

•  Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner)

•  Perform local program analyses statically to determine the behaviors (the proposed solution is both modular and static)

•  A Mapping Function takes the represented behaviors as input and produces an execution policy for each abstraction. Execution policy: describes how the messages of an MPC abstraction are processed (in detail later)

Solution Outline

8

[Diagram: Input Program → cVector Analysis → cVector → Mapping Function → Execution Policy.]

Page 32:

Panini Capsules

Mapping Function

cVector Analysis

9

Execution Policy

cVector Input Program

General MPC framework  |  Panini
MPC Abstraction        |  Capsule
Message Handlers       |  Capsule Procedures

Page 33:

Panini Capsules

Mapping Function

cVector Analysis

9

Execution Policy

cVector Input Program

[Diagram: a Capsule box containing State and procedures p0, p1, ..., pn.]

capsule Ship {
  short state = 0;            // capsule-local state
  int x = 5;                  // ship position
  void die()       { state = 2; }
  void fire()      { state = 1; }
  void moveLeft()  { if (x > 0)  x--; }
  void moveRight() { if (x < 10) x++; }
}

General MPC framework  |  Panini
MPC Abstraction        |  Capsule
Message Handlers       |  Capsule Procedures

Page 34:

Solution Outline

[Diagram: Input Program → cVector Analysis → cVector → Mapping Function → Execution Policy. For each capsule procedure pi, the cVector Analysis runs five static analyses (May-Block Analysis, State Analysis, Call-claim Analysis, Communication Summary Analysis, Computational Workload Analysis) and composes the per-procedure behaviors into the capsule's cVector.]

9


Page 36:

Solution Outline

[Diagram as on Page 34, now showing the Mapping Function consuming the capsule's cVector and producing an execution policy (EP).]

9

Page 37:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, primitive, large}
•  Inherent parallelism (π)
–  inherent parallelism exposed by a capsule when it communicates with other capsules
–  dom(π): {sync, async, future}
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}
•  Communication pattern (ρ, ρ): message send and receive patterns

Representing Behaviors

10

Characteristics Vector (cVector)

<β, σ, π, ρ, ρ, ω>
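To make the characteristics vector concrete, the following is a minimal sketch in plain Java of how the six components and the domains listed on this slide could be represented. The enum, class, and field names are illustrative assumptions, not the Panini compiler's actual data structures.

// Illustrative representation of a characteristics vector <β, σ, π, ρ, ρ, ω>.
// Only the component domains come from the slides; all names are hypothetical.
enum Blocking    { TRUE, FALSE }                 // β
enum LocalState  { NIL, PRIMITIVE, LARGE }       // σ
enum Parallelism { SYNC, ASYNC, FUTURE }         // π
enum SendPattern { LEAF, ROUTER, SCATTER }       // ρ (message send pattern)
enum RecvPattern { GATHER, REQUEST_REPLY }       // ρ (message receive pattern)
enum Workload    { MATH, IO }                    // ω

final class CVector {
  final Blocking beta;
  final LocalState sigma;
  final Parallelism pi;
  final SendPattern rhoSend;
  final RecvPattern rhoRecv;
  final Workload omega;

  CVector(Blocking beta, LocalState sigma, Parallelism pi,
          SendPattern rhoSend, RecvPattern rhoRecv, Workload omega) {
    this.beta = beta; this.sigma = sigma; this.pi = pi;
    this.rhoSend = rhoSend; this.rhoRecv = rhoRecv; this.omega = omega;
  }
}

The mapping function described later in the talk consumes one such vector per capsule.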

Page 38:


Representing Behaviors

10

Characteristics Vector (cVector)

<β, σ, π, ρ, ρ, ω>

All behaviors except σ, i.e., <β, π, ρ, ρ, ω>, are defined per capsule procedure and combined to form the capsule's behaviors using behavior composition (described later)

Page 39:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, primitive, large}
•  Inherent parallelism (π)
–  inherent parallelism exposed by a capsule when it communicates with other capsules
–  dom(π): {sync, async, future}
•  Communication pattern (ρ, ρ)
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}

Representing Behaviors

Characteristics Vector (cVector)

10

BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
try {
  userName = br.readLine();   // blocking call: this procedure may block externally
} catch (IOException ioe) { }

<β, σ, π, ρ, ρ, ω>

May-Block Analysis
•  Input: a manually created dictionary of blocking library calls
•  Analysis: a flow analysis with message receives as sources and blocking library calls as sinks
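A minimal sketch of the may-block idea, under stated assumptions: the blocking-call dictionary entries and the tiny call-graph encoding below are hypothetical, and the real analysis is a flow analysis from message receives to blocking sinks rather than plain call-graph reachability.

import java.util.*;

// Sketch: flag a capsule procedure as may-block (β = true) if it can reach a
// call that appears in a hand-written dictionary of blocking library calls.
final class MayBlockAnalysis {
  // Hypothetical dictionary entries (method signatures as strings).
  private static final Set<String> BLOCKING_CALLS = Set.of(
      "java.io.BufferedReader.readLine",
      "java.net.Socket.connect",
      "java.sql.Statement.executeQuery");

  // calls: procedure or method name -> callee signatures (a tiny call graph).
  static boolean mayBlock(String proc, Map<String, Set<String>> calls) {
    Deque<String> work = new ArrayDeque<>(List.of(proc));
    Set<String> seen = new HashSet<>();
    while (!work.isEmpty()) {
      String m = work.pop();
      if (!seen.add(m)) continue;                          // already visited
      for (String callee : calls.getOrDefault(m, Set.of())) {
        if (BLOCKING_CALLS.contains(callee)) return true;  // reached a blocking sink
        work.push(callee);
      }
    }
    return false;                                          // β = false
  }
}

A procedure for which mayBlock returns true contributes β = true to its capsule's cVector.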

Page 40:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, fixed, variable}
•  Inherent parallelism (π)
–  inherent parallelism exposed by a capsule when it communicates with other capsules
–  dom(π): {sync, async, future}
•  Communication pattern (ρ, ρ)
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}

Representing Behaviors

10

// σ = nil: no local state
capsule Receiver { void receive() { } }

// σ = primitive (fixed-size): a few primitive-typed fields
capsule Receiver { int state1; int state2; void receive() { } }

// σ = large (variable-size): collection-typed state
capsule Receiver { Map<String,String> dataMap = new HashMap<>(); void receive() { } }

Characteristics Vector (cVector)

<β, σ, π, ρ, ρ, ω>

State Analysis
•  Input: state variables
•  Analysis: checks the types of the state variables that compose the capsule's state for primitive or collection types.
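A sketch of how the state classification could work from the declared field types alone; the use of java.lang.Class values and the primitive/collection checks are illustrative assumptions about how capsule state is summarized.

import java.util.*;

// Sketch: classify a capsule's local state σ from its field declarations.
final class StateAnalysis {
  enum LocalState { NIL, PRIMITIVE, LARGE }

  static LocalState classify(List<Class<?>> fieldTypes) {
    if (fieldTypes.isEmpty()) return LocalState.NIL;          // no state at all
    for (Class<?> t : fieldTypes) {
      // Array-, collection-, or map-typed state is variable-sized, hence "large".
      if (t.isArray() || Collection.class.isAssignableFrom(t)
          || Map.class.isAssignableFrom(t)) {
        return LocalState.LARGE;
      }
    }
    return LocalState.PRIMITIVE;                              // only fixed-size fields
  }

  public static void main(String[] args) {
    System.out.println(classify(List.of()));                        // NIL
    System.out.println(classify(List.of(int.class, short.class)));  // PRIMITIVE
    System.out.println(classify(List.of(HashMap.class)));           // LARGE
  }
}

The three example capsules above would be classified NIL, PRIMITIVE (fixed), and LARGE (variable), respectively.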

Page 41:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, primitive, large}
•  Inherent parallelism (π)
–  kind of parallelism exposed by the capsule while communicating
–  dom(π): {sync, async, future}
•  Communication pattern (ρ, ρ)
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}

Representing Behaviors

Characteristics Vector (cVector)

10

// π = sync: the result of sender.get() is claimed immediately
capsule Receiver (Sender sender) {
  void receive() {
    int i = sender.get();
    // print i;
  }
}

// π = async: no result is claimed
capsule Receiver (Sender sender) {
  void receive() { sender.done(); }
}

// π = future: the result is claimed only after some intervening computation
capsule Receiver (Sender sender) {
  void receive() {
    int i = sender.get();
    // some computation
    // print i;
  }
}

<β, σ, π, ρ, ρ, ω>
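The three snippets above suggest how a call-claim style analysis could decide π: it looks at whether, and when, the result of an inter-capsule call is claimed. The statement encoding below is an illustrative assumption used to keep the sketch self-contained.

import java.util.*;

// Sketch of the call-claim idea behind π: classify an inter-capsule call as
// sync (result claimed immediately), future (claimed after other work), or
// async (no result claimed).
final class CallClaimAnalysis {
  enum Parallelism { SYNC, ASYNC, FUTURE }

  // stmtsAfterCall: simplified statement kinds following the call, where
  // "claim" marks the first use of the call's result and "compute" anything else.
  static Parallelism classify(boolean callReturnsValue, List<String> stmtsAfterCall) {
    if (!callReturnsValue) return Parallelism.ASYNC;          // fire-and-forget
    int firstClaim = stmtsAfterCall.indexOf("claim");
    if (firstClaim < 0) return Parallelism.ASYNC;             // result never claimed
    return firstClaim == 0 ? Parallelism.SYNC                 // claimed right away
                           : Parallelism.FUTURE;              // work overlaps the call
  }

  public static void main(String[] args) {
    System.out.println(classify(true,  List.of("claim", "compute")));  // SYNC
    System.out.println(classify(false, List.of("compute")));           // ASYNC
    System.out.println(classify(true,  List.of("compute", "claim")));  // FUTURE
  }
}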

Page 42:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, primitive, large}
•  Inherent parallelism (π)
–  dom(π): {sync, async, future}
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}
•  Communication pattern (ρ, ρ)

Representing Behaviors

Characteristics Vector (cVector)

10

<β, σ, π, ρ, ρ, ω>

Inherent Parallelism Analysis

Page 43:

•  Blocking behavior (β)
–  represents externally blocking behavior due to I/O, socket, or database primitives
–  dom(β): {true, false}
•  Local state (σ)
–  local state variables
–  dom(σ): {nil, primitive, large}
•  Inherent parallelism (π)
–  dom(π): {sync, async, future}
•  Computational workload (ω)
–  represents computations performed by the capsule
–  dom(ω): {math, io}
–  math: computation-to-wait ratio > 1
–  io: computation-to-wait ratio <= 1

Representing Behaviors

Characteristics Vector (cVector)

10

<β, σ, π, ρ, ρ, ω>

Computational Workload Analysis
•  Computation summary: recursive calls, high-cost library calls, unbounded loops, complete reads/writes of variable-sized state.
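A sketch of the ω classification stated on this slide: a procedure whose computation-to-wait ratio exceeds 1 is classified math, otherwise io. How the two costs are estimated from the computation summary (recursive calls, costly library calls, unbounded loops, state access) is abstracted into two numbers here, which is an assumption for illustration.

// Sketch: classify computational workload ω from estimated costs.
final class WorkloadAnalysis {
  enum Workload { MATH, IO }

  static Workload classify(double computationCost, double waitCost) {
    double ratio = computationCost / Math.max(waitCost, 1e-9);  // avoid divide-by-zero
    return ratio > 1.0 ? Workload.MATH : Workload.IO;
  }

  public static void main(String[] args) {
    System.out.println(classify(100.0, 5.0));  // MATH: CPU-heavy procedure
    System.out.println(classify(2.0, 50.0));   // IO: mostly waiting
  }
}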

Page 44:

•  Communication pattern
–  Message send pattern: dom(ρ): {leaf, router, scatter}
–  Message receive pattern: dom(ρ): {gather, request-reply}

Representing Behaviors

11

leaf: no outgoing communication
router: one-to-one communication
scatter: batch communication
gather: recv-to-send ratio > 1
request-reply: recv-to-send ratio <= 1

Page 45:

•  Communication pattern
–  Message send pattern: dom(ρ): {leaf, router, scatter}
–  Message receive pattern: dom(ρ): {gather, request-reply}

Representing Behaviors

11

leaf: no outgoing communication
router: one-to-one communication
scatter: batch communication
gather: recv-to-send ratio > 1
request-reply: recv-to-send ratio <= 1

Communication Pattern Analysis
–  Builds a Communication Summary, which abstracts away all expressions except message sends/receives and state reads/writes.
–  The analysis is a function that takes the communication summary of a procedure and produces a message send/receive pattern tuple as output.
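A sketch of how the pattern tuple could be computed from a procedure's communication summary, reduced here to message counts; the threshold for distinguishing router from scatter (more than one distinct target) is an assumption, while the recv-to-send ratio test comes from the previous slide.

// Sketch: derive the (send pattern, receive pattern) tuple from message counts.
final class CommPatternAnalysis {
  enum SendPattern { LEAF, ROUTER, SCATTER }
  enum RecvPattern { GATHER, REQUEST_REPLY }

  static SendPattern sendPattern(int sends, int distinctTargets) {
    if (sends == 0) return SendPattern.LEAF;                  // no outgoing messages
    return distinctTargets > 1 ? SendPattern.SCATTER          // batch communication
                               : SendPattern.ROUTER;          // one-to-one
  }

  static RecvPattern recvPattern(int receives, int sends) {
    if (receives == 0) return RecvPattern.REQUEST_REPLY;
    double ratio = sends == 0 ? Double.POSITIVE_INFINITY : (double) receives / sends;
    return ratio > 1.0 ? RecvPattern.GATHER : RecvPattern.REQUEST_REPLY;
  }

  public static void main(String[] args) {
    System.out.println(sendPattern(0, 0));   // LEAF
    System.out.println(sendPattern(4, 4));   // SCATTER: one message per worker
    System.out.println(recvPattern(8, 1));   // GATHER: many receives per send
  }
}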

Page 46:

Procedure Behavior Composition

[Diagram as on Page 34, highlighting the Procedure Behavior Composition step that combines the per-procedure behaviors of p0, p1, ..., pn into the capsule's cVector before the Mapping Function produces an execution policy (EP).]

9

Page 47:

Procedure Behavior Composition

12

•  A capsule may have multiple procedures
•  The behavior of the capsule is determined by combining the behaviors of its procedures
•  For instance, a capsule has blocking behavior if any of its procedures is blocking
•  Key idea: capsule behavior is predominantly defined by the procedure that executes most often
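A sketch of the composition idea on this slide, restricted to two components to stay short: β is the disjunction over all procedures, and the remaining components are taken from the dominant (most frequently executed) procedure. How execution frequency is estimated statically is an assumption here.

import java.util.*;

// Sketch: compose per-procedure behaviors into a capsule behavior.
final class BehaviorComposition {
  record ProcBehavior(boolean blocking, String workload, double estimatedFrequency) {}
  record CapsuleBehavior(boolean blocking, String workload) {}

  static CapsuleBehavior compose(List<ProcBehavior> procs) {
    boolean anyBlocking = procs.stream().anyMatch(ProcBehavior::blocking);    // β
    ProcBehavior dominant = procs.stream()                                    // other components
        .max(Comparator.comparingDouble(ProcBehavior::estimatedFrequency))
        .orElseThrow();
    return new CapsuleBehavior(anyBlocking, dominant.workload());
  }

  public static void main(String[] args) {
    var procs = List.of(
        new ProcBehavior(false, "math", 0.9),   // hot procedure dominates workload
        new ProcBehavior(true,  "io",   0.1));  // rare but blocking procedure
    System.out.println(compose(procs));         // blocking=true, workload=math
  }
}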

Page 48:

[Diagram: Input Program → cVector Analysis → cVector → Mapping Function → Execution Policy.]

9


Page 50:

•  THREAD: the capsule is assigned a dedicated thread.
•  TASK: the capsule is assigned to a task pool, and the shared thread of the task pool processes its messages.
•  SEQ/MONITOR: the calling capsule's thread itself executes the behavior at the callee capsule.

[Diagram: four configurations of capsules A and B: A: Th, B: Th; A: Ta, B: Ta; A: Th, B: S; A: Th, B: M.]

Th: Thread  Ta: Task  S: Sequential  M: Monitor

Execution Policies

13
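The three policies can be pictured with plain java.util.concurrent primitives. The sketch below is an illustration of the policy flavors, not Panini's actual runtime; the mailbox representation and method names are assumptions.

import java.util.concurrent.*;

// Sketch of the three execution-policy flavors on plain JVM primitives.
final class ExecutionPolicies {
  // THREAD: the capsule gets a dedicated thread draining its own mailbox.
  static Thread threadPolicy(BlockingQueue<Runnable> mailbox) {
    Thread t = new Thread(() -> {
      try {
        while (true) mailbox.take().run();     // process messages one by one
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();    // stop when interrupted
      }
    });
    t.start();
    return t;
  }

  // TASK: the capsule shares a task pool; each message becomes a submitted task.
  static void taskPolicy(ExecutorService pool, Runnable message) {
    pool.submit(message);
  }

  // SEQ/MONITOR: the caller's own thread runs the callee's behavior directly;
  // MONITOR additionally synchronizes on the callee's state.
  static void sequentialPolicy(Object calleeState, Runnable message, boolean monitor) {
    if (monitor) {
      synchronized (calleeState) { message.run(); }
    } else {
      message.run();
    }
  }
}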


Page 52:

•  Encodes several intuitions about MPC abstractions

Mapping Heuristics

Heuristic   Execution Policy
Blocking    Th
Heavy       Th
HighCPU     Ta
LowCPU      M
Hub         Ta
Affinity    M/S
Master      Ta
Worker      Ta

Th - Thread, Ta - Task, M - Monitor, S - Sequential

14

Page 53:

•  Encodes several intuitions about MPC abstractions

•  Examples
–  Blocking heuristic:
•  capsules with externally blocking behavior
•  should be assigned the Th execution policy
•  rationale: other policies may lead to blocking of the executing thread, starvation, and system deadlocks.

Mapping Heuristics

(Heuristics table as on Page 52. Th - Thread, Ta - Task, M - Monitor, S - Sequential)

14

Page 54:

•  Encodes several intuitions about MPC abstractions

•  Examples
–  Blocking heuristic:
•  capsules with externally blocking behavior
•  should be assigned the Th execution policy
•  rationale: other policies may lead to blocking of the executing thread, starvation, and system deadlocks.

Mapping Heuristics

(Heuristics table as on Page 52. Th - Thread, Ta - Task, M - Monitor, S - Sequential)

14

The goal of the heuristics:
–  reduce mailbox contention, message-passing and processing overheads, and cache misses

Page 55:

Mapping Function: Flow Diagram

15

•  The mapping function takes a cVector as input and assigns an execution policy
•  It encodes several intuitions (heuristics), shown as a flow diagram in the figure
•  It is complete and assigns a single policy
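Since the flow diagram itself is not reproduced in this transcript, the sketch below shows one way the mapping function could be encoded as a decision chain over cVector-derived predicates. Only the heuristic-to-policy outcomes come from the table on Page 52; the predicate order and the boolean inputs are assumptions.

// Sketch: a complete mapping function that always returns a single policy.
final class MappingFunction {
  enum Policy { THREAD, TASK, MONITOR, SEQUENTIAL }

  static Policy map(boolean blocking, boolean heavy, boolean highCpu,
                    boolean hub, boolean affinity) {
    if (blocking) return Policy.THREAD;      // Blocking -> Th
    if (heavy)    return Policy.THREAD;      // Heavy    -> Th
    if (affinity) return Policy.MONITOR;     // Affinity -> M (or S)
    if (hub)      return Policy.TASK;        // Hub      -> Ta
    if (highCpu)  return Policy.TASK;        // HighCPU  -> Ta (Master/Worker likewise)
    return Policy.MONITOR;                   // LowCPU   -> M
  }

  public static void main(String[] args) {
    // A blocking, I/O-style capsule gets its own thread; a CPU-bound worker is pooled.
    System.out.println(map(true,  false, false, false, false)); // THREAD
    System.out.println(map(false, false, true,  false, false)); // TASK
  }
}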

Page 56:

Evaluation

•  Benchmark programs (15 total) that exhibit data, task, and pipeline parallelism at coarse and fine granularities.
•  Comparing the cVector mapping against the thread-all, round-robin-task-all, random-task-all, and work-stealing-task-all defaults.
•  Measured the reduction in program execution time and CPU consumption over the default mappings on different core settings.

16

Page 57:

Results: Improvement In Program Runtime

[Figure: grouped bar charts of % runtime improvement for bang, mbrot, serialmsg, FileSearch, ScratchPad, Polynomial, fasta, DCT, Knucleotide, Fannkuchred, RayTracer, Beamformer, logmap, concdict, and concsll.]

% runtime improvement over default mappings, for fifteen benchmarks. For each benchmark there are four core settings (2, 4, 8, 12 cores) and for each core setting there are four bars (Ith, Irr, Ir, Iws) showing improvement over the four default mappings (thread, round-robin, random, work-stealing). Higher bars are better.

17

Page 58:

Results: Improvement In Program Runtime

[Same figure as on Page 57. X-axis: core settings (2, 4, 8, 12); Y-axis: % runtime improvement.]

17

Page 59:

Results: Improvement In Program Runtime

[Same figure as on Page 57. Each core setting has four bars indicating % runtime improvement over the thread-all, round-robin, random, and work-stealing default strategies.]

17

Page 60:

Results: Improvement In Program Runtime

[Same figure as on Page 57.]

o  12 of 15 programs showed improvements
o  On average, 40.56%, 30.71%, 59.50%, and 40.03% improvements in execution time over the thread, round-robin, random, and work-stealing mappings, respectively
o  3 programs showed no improvements (data-parallel programs)

17

Page 61:

Analysis: Improvement In Program Runtime

[Same figure as on Page 57.]

1)  Reduced mailbox contention
2)  Reduced message-passing and processing overheads
3)  Reduced mailbox contention and cache misses

17

Page 62:

•  We presented the cVector mapping technique for capsules; however, the technique should be applicable to other MPC frameworks
•  Proof of concept: evaluation on Akka
–  Similar results for most benchmarks
–  Data-parallel applications show no improvements (as in Panini)

Applicability

18

[Bar chart: % execution time improvements on Akka for bang, mbrot, serialmsg, fasta, scratchpad, logmap, concdict, and concsll at 2, 4, 8, and 12 cores, over Akka's default mappings (bars Ith, Ita, ...).]

Page 63:

•  Placing Erlang actors on multicore efficiently, Francesquini et al. [Erlang'13 Workshop]
–  considers only hub-affinity behavior that is annotated by the programmer
–  our technique handles many other behaviors
•  Mapping task graphs to cores, Survey [DAC'13]
–  not directly applicable to JVM-based MPC frameworks, because the threads-to-cores mapping is left to the OS scheduler
•  Efficient strategies for mapping threads to cores for OpenMP multi-threaded programs, Tousimojarad and Vanderbauwhede [Journal of Parallel Computing'14]
–  our technique maps capsules to threads, not threads to cores

Related Works

19

Page 64:

Related Works

19

o  Not directly applicable to JVM-based MPC frameworks
o  Non-automatic

Page 65:

Conclusion

Ganesha Upadhyaya and Hridesh Rajan {ganeshau,hridesh}@iastate.edu

Supported in part by the NSF grants CCF-08-46059, CCF-11-17937, and CCF-14-23370.

Questions?

[Diagram as on Page 4: MPC abstractions a1–a5 are mapped to JVM threads, which the OS scheduler places on cores core0–core3.]

Mapping Abstractions to JVM threads; evaluated on Panini and Akka