
Page 1: Executing Parallel Programs with Potential Bottlenecks Efficiently

Executing Parallel Programs with Potential Bottlenecks Efficiently

University of Tokyo

Yoshihiro Oyama
Kenjiro Taura (visiting UCSD)

Akinori Yonezawa

Page 2: Executing Parallel Programs with Potential Bottlenecks Efficiently

Programs We Consider

Programs updating shared data frequently with mutex operations

[Figure: many exclusive methods concurrently issuing updates to a single bottleneck object (e.g., a counter)]

Context: implementation of concurrent OO langs on SMPs and DSM machines

e.g., synchronized methods in Java

Page 3: Executing Parallel Programs with Potential Bottlenecks Efficiently

Amdahl’s Law

int foo(…) {
  int x = 0, y = 0;
  parallel for (…) { ... }
  lock(); printf(…); unlock();
  parallel for (…) { c[i] = 0; }
  parallel for (…) { baz(5); }
  return x * 2 + y;
}

90% → can execute in parallel
10% → must execute sequentially (bottleneck)

→ 10 times speedup, at most

You expect 10 times speedup, but... can you really gain 10 times speedup???
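For reference (my addition, not on the original slide), Amdahl's Law is the bound behind this slide: with a parallelizable fraction p running on n PEs,

  S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

so with p = 0.9 the speedup can never exceed 1/0.1 = 10, however many PEs are used.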

Page 4: Executing Parallel Programs with Potential Bottlenecks Efficiently

Speedup Curves for Programs with Bottlenecks

[Graph: execution time vs. # of PEs, with an ideal curve and a real curve; the real curve turns upward as PEs are added]

“Excessive” processors may be used!
∵ It is difficult to predict dynamic behavior
∵ Different phases need different numbers of PEs

Page 5: Executing Parallel Programs with Potential Bottlenecks Efficiently

Preliminary Experiments using a Simple Counter Program in C

[Graph: execution time (msec, 0–2000) vs. # of PEs (0–70) for spin locks, simple blocking locks, and our scheme]

• Solaris threads & Ultra Enterprise 10000
• Each processor increments a shared counter in parallel

The time did not remain constant but increased dramatically.

Page 6: Executing Parallel Programs with Potential Bottlenecks Efficiently

Goal

• Efficient execution of programs with bottlenecks
  – Focusing on synchronization of methods

[Figure: for 1 PE and for 50 PEs, the goal is to make the time to execute a whole program in parallel (bottleneck parts + other parts) closer to the time to execute only the bottlenecks sequentially]

Page 7: Executing Parallel Programs with Potential Bottlenecks Efficiently

What Problem Should We Solve?

Stop the increase of the time consumed in bottlenecks!

[Figure: bar charts for 1 PE and 50 PEs; in a naïve implementation the time spent in bottleneck parts grows with the number of PEs, while in the ideal implementation it stays constant]

Page 8: Executing Parallel Programs with Potential Bottlenecks Efficiently

Put it in Prof. Ito’s terminology!

• He aims at keeping:
  – the PP/M ≧ SP/S property

• Our work aims at keeping:
  – the PP/M ≧ PP/S property

Performance on 100 PEs should be higher than that on 1 PE!

Page 9: Executing Parallel Programs with Potential Bottlenecks Efficiently

Presentation Overview

• Examples of potential bottlenecks
• Two naïve schemes and their problems
  – Local-based execution
  – Owner-based execution
• Our scheme
  – Detachment of requests
  – Priority mechanism using compare-and-swap
  – Two compile-time optimizations
• Performance evaluation & related work

Page 10: Executing Parallel Programs with Potential Bottlenecks Efficiently

Examples of Potential Bottleneck Objects

• Objects introduced to easily reuse MT-unsafe functions in an MT environment

• Abstract I/O objects

• Stubs in distributed systems
  – One stub conducts all communications in a site

• Shared global variables
  – e.g., counters to collect statistics information

It is sometimes difficult to eliminate them.

Page 11: Executing Parallel Programs with Potential Bottlenecks Efficiently
Page 12: Executing Parallel Programs with Potential Bottlenecks Efficiently

Local-based Execution(e.g., Implementation with Spin-locks)

[Figure: each PE locks the object and runs methods on it directly]

Each PE executes methods by itself
↓
Each PE references/updates an object by itself

Advantage: No need to move “computation”

Disadvantage: Cache misses when referencing an object (due to invalidation/update of the cache by other processors)
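To make the local-based scheme concrete, here is a minimal sketch of my own (not code from the talk), assuming the GCC/Clang __sync_lock_test_and_set and __sync_lock_release builtins as the spin-lock primitives: every PE locks the object and updates its instance variables itself, which is exactly the access pattern that causes the cache-line ping-pong described above.

#include <stddef.h>

typedef struct {
    volatile int lock;   /* 0 = free, 1 = held */
    long value;          /* the shared counter (an instance variable) */
} counter_t;

static void spin_lock(volatile int *l) {
    /* spin until we atomically change 0 -> 1 */
    while (__sync_lock_test_and_set(l, 1))
        while (*l) ;     /* read-only spin while the lock is held */
}

static void spin_unlock(volatile int *l) {
    __sync_lock_release(l);
}

/* Local-based execution: the calling PE references/updates the object itself. */
void counter_inc(counter_t *c) {
    spin_lock(&c->lock);
    c->value++;          /* every PE writes the same cache line */
    spin_unlock(&c->lock);
}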

Page 13: Executing Parallel Programs with Potential Bottlenecks Efficiently

Confirmation of Overhead in Local-based Execution

[Graph: execution time (msec, 0–2000) vs. # of PEs (0–70) for an empty method and a counter method; C program on Ultra Enterprise 10000]

Overhead of referencing/updating an object
• increases with the number of PEs
• occupies 1/3 of the whole execution time on 60 PEs

Page 14: Executing Parallel Programs with Potential Bottlenecks Efficiently

Owner-based Execution

[Figure: non-owners create requests (data structures containing method info) and insert them into the object's queue; the owner executes them]

owner = the processor currently holding an object's lock

owner present → creates and inserts a request
owner absent → becomes the owner and executes the method
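The dispatch rule can be sketched in C as follows (my own encoding, not the paper's: I assume the object's 1-word lock area is NULL when no owner is present, and use the GCC/Clang __atomic_compare_exchange_n builtin):

#include <stddef.h>

/* Sketch only: the 1-word lock area added to each object. */
typedef struct { void *lock; /* ... instance variables ... */ } object_t;

#define OWNED ((void *)1)    /* assumed sentinel: owner present, no pending requests */

/* Try to become the owner; succeeds only if no owner was present. */
static int try_become_owner(object_t *o) {
    void *expected = NULL;
    return __atomic_compare_exchange_n(&o->lock, &expected, OWNED,
                                       0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* owner absent -> execute the method yourself; owner present -> insert a request
   (the insertion and detachment steps are sketched on the following slides). */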

Page 15: Executing Parallel Programs with Potential Bottlenecks Efficiently

Owner-based Execution with Simple Blocking Locks

[Figure: requests are queued on the object and dequeued one by one, using auxiliary locks]

One processor is likely to execute multiple methods consecutively

Page 16: Executing Parallel Programs with Potential Bottlenecks Efficiently

Advantages/Disadvantages of Owner-based Execution

Advantage:
  Fewer cache misses to reference an object

Disadvantages (focusing on the owner's execution, which typically gives the critical path):
  Overhead to move “computation”
  • Synchronization operations for a queue
  • Waiting time to manipulate a queue
  • Cache misses to read requests

Can they be reduced?

Page 17: Executing Parallel Programs with Potential Bottlenecks Efficiently
Page 18: Executing Parallel Programs with Potential Bottlenecks Efficiently

Overview of Our Scheme

• Improve simple blocking locks
  – Detach requests
    • Reduce the frequency of mutex operations
  – Give high priority to the owner
    • Reduce the time required to take control of requests
  – Prefetch requests
    • Reduce cache misses in reading requests

Our scheme is realized implicitly by the compiler and runtime of the concurrent object-oriented language Schematic.

Page 19: Executing Parallel Programs with Potential Bottlenecks Efficiently

Data Structures

• Requests are managed with a list
  – A 1-word pointer area (lock area) is added to each object
  – Non-owner: creates and inserts a request
  – Owner: picks requests out and executes them

[Figure: an object whose 1-word lock area points to a linked list of requests]
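One plausible C rendering of these data structures (my sketch; the field and type names are assumptions, not taken from the paper):

#include <stddef.h>

/* A request: one pending invocation of an exclusive method. */
typedef struct request {
    struct request *next;                    /* link to the next pending request */
    void (*method)(void *self, void *args);  /* method info */
    void *args;
} request_t;

/* An object: its instance variables plus the added 1-word lock area. */
typedef struct object {
    request_t *lock;     /* head of the pending-request list, accessed only with
                            atomic operations (the exact encoding of "no owner"
                            vs. "owner present, empty list" is not shown here) */
    long value;          /* ... instance variables (e.g., a shared counter) ... */
} object_t;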

Page 20: Executing Parallel Programs with Potential Bottlenecks Efficiently

Design Policy

• The owner's behavior determines the critical path

• We make owner’s execution fast, above all

• We allow non-owners’ execution to be slow

Battle in Bottleneck: 1 owner vs. 99 non-owners

We should help him!

Page 21: Executing Parallel Programs with Potential Bottlenecks Efficiently

Non-owners Inserting a Request

[Figure: several non-owner processors each prepare a request to insert at the head of the object's request list]

Page 22: Executing Parallel Programs with Potential Bottlenecks Efficiently

Non-owners Inserting a Request

Update with compare-and-swap
Retry if interrupted

Page 23: Executing Parallel Programs with Potential Bottlenecks Efficiently

Non-owners Inserting a Request

Non-owners repeat the loop until success
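A sketch of the non-owner's insertion loop (my code; the GCC/Clang __atomic_compare_exchange_n builtin stands in for compare-and-swap, and the request/object types follow the earlier sketch):

#include <stddef.h>

typedef struct request { struct request *next; /* + method info */ } request_t;
typedef struct object  { request_t *lock; /* + instance variables */ } object_t;

/* Non-owner side: push a request onto the object's list with compare-and-swap. */
static void insert_request(object_t *o, request_t *req) {
    request_t *old;
    do {
        old = __atomic_load_n(&o->lock, __ATOMIC_ACQUIRE);  /* snapshot the head */
        req->next = old;                                     /* link in front of it */
        /* Retry if another processor changed the head in the meantime. */
    } while (!__atomic_compare_exchange_n(&o->lock, &old, req,
                                          0, __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}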

Page 24: Executing Parallel Programs with Potential Bottlenecks Efficiently

[Figure: the owner detaches the whole request list from the object with a single swap]

Important
• The whole list is detached at once
• The update with swap always succeeds
  → the owner is never interrupted by other processors
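The detach step might look like this in C (my sketch, using __atomic_exchange_n as the “swap”; in the real runtime the value left behind would have to keep encoding “owner still present”, a detail omitted here):

#include <stddef.h>

typedef struct request { struct request *next; /* + method info */ } request_t;
typedef struct object  { request_t *lock; /* + instance variables */ } object_t;

/* Owner side: detach every pending request with one unconditional swap.
   Unlike compare-and-swap, an exchange cannot fail, so the owner never retries. */
static request_t *detach_requests(object_t *o) {
    return __atomic_exchange_n(&o->lock, (request_t *)NULL, __ATOMIC_ACQ_REL);
}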

Page 25: Executing Parallel Programs with Potential Bottlenecks Efficiently


Page 26: Executing Parallel Programs with Potential Bottlenecks Efficiently

Owner Executing Requests

[Figure: the owner executes the detached requests in turn without mutex operations, while non-owners keep inserting new requests without disturbing the owner]

1. No synchronization operations by the owner

Page 27: Executing Parallel Programs with Potential Bottlenecks Efficiently

Giving Higher Priority to Owner

• Insertion by a non-owner (compare-and-swap): may fail many times
• Detachment by the owner (swap): always succeeds in a constant number of steps

2. The owner never spins to manipulate requests

Page 28: Executing Parallel Programs with Potential Bottlenecks Efficiently

Compile-time Optimization (1/2)

• Prefetch requests
  – While one request is processed, the next request is prefetched

3. Reduce cache misses to read requests

  ...
  while (req != NULL) {
      PREFETCH(req->next);
      EXECUTE(req);
      req = req->next;
  }
  ...

Page 29: Executing Parallel Programs with Potential Bottlenecks Efficiently

Compile-time Optimization (2/2)

• Caching instance variables in registers
  – Non-owners do not reference/update an object while detached requests are processed

Two versions of code are provided for one method:
  Code to process requests: uses instance variables on memory
  Code to execute methods directly: uses instance variables in registers

[Figure: instance variables are passed in registers between consecutive method executions on the detached list]
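A hand-written approximation of the two versions for a counter's increment method (my illustration; the actual Schematic-generated code is not shown on the slides):

#include <stddef.h>

typedef struct request { struct request *next; /* + method info */ } request_t;
typedef struct { long value; /* + lock word */ } counter_t;

/* Version used while the owner processes a detached list of requests:
   the instance variable lives in a register-allocated local and is
   written back to memory once, after the whole batch. */
void counter_process_requests(counter_t *c, request_t *req) {
    long v = c->value;            /* load the instance variable once */
    while (req != NULL) {
        v++;                      /* the "inc" method, executed from a register */
        req = req->next;
    }
    c->value = v;                 /* single write-back to memory */
}

/* Version used when a method is executed directly: instance variables
   are used on memory as usual. */
void counter_inc_direct(counter_t *c) {
    c->value++;
}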

Page 30: Executing Parallel Programs with Potential Bottlenecks Efficiently

Achieving Similar Effects in Low-level Languages (e.g., in C)

• “Always spin-lock” approach
  – Waste of CPU cycles and memory bandwidth
  – Deadlocks

• “Finding bottlenecks → rewriting code” approach
  – Implements owner-based execution only in bottlenecks
  – Harder than the “support of a high-level language” approach
    • Implementing owner-based execution is troublesome
    • Bottlenecks appear dynamically in some programs

Page 31: Executing Parallel Programs with Potential Bottlenecks Efficiently
Page 32: Executing Parallel Programs with Potential Bottlenecks Efficiently

Experimental Results (1/2)
RNA secondary structure prediction (with stat.)

in Schematic on Ultra Enterprise 10000

[Graph: execution time (msec, 0–7000) vs. # of PEs (0–70) for spin lock, blocking, spin block, and our scheme]

Page 33: Executing Parallel Programs with Potential Bottlenecks Efficiently

Experimental Results (2/2)
RNA secondary structure prediction (with stat.)

in Schematic on Origin 2000

[Graph: execution time (msec, 0–16000) vs. # of PEs (0–110) for spin lock, blocking, spin block, and our scheme]

Page 34: Executing Parallel Programs with Potential Bottlenecks Efficiently

Interesting Results using a Simple Counter Program in C

• Simple blocking locks: waiting time was the largest overhead
  – 70% of the owner's whole execution time

• Our scheme is efficient also on a uniprocessor (execution time)
  – Spin-locks: 641 msec
  – Simple blocking locks: 1025 msec
  – Our scheme: 810 msec

Page 35: Executing Parallel Programs with Potential Bottlenecks Efficiently

Related Work (1/3) - execution of methods invoked in parallel -

• ICC++ [Chien et al. 96]
  – Detects nonexclusive methods through static analysis
• Concurrent Aggregates [Chien 91]
  – Realizes interleaving through explicit programming
• Cooperative Technique [Barnes 93]
  – A PE entering a critical section later “helps” its predecessors

• Focus on exposing parallelism among nonexclusive operations
• No remark on performance loss in bottlenecks

Page 36: Executing Parallel Programs with Potential Bottlenecks Efficiently

Related Work (2/3) - efficient spin-locks when contention occurs -

• MCS Lock [Mellor-Crummey et al. 91]
  – Provides a spin area for each processor
• Exponential Backoff [Anderson 90]
  – A heuristic to “withdraw” processors that failed in lock acquisition
  – Needs some skill to determine the parameters

These locks give local-based execution
→ Low locality in referencing bottleneck objects

Page 37: Executing Parallel Programs with Potential Bottlenecks Efficiently

Related Work (3/3) - efficient Java monitors -

• Bimodal object-locking [Onodera et al. 98], Thin Locks [Bacon et al. 98]
  – Affected our low-level implementation
  – Use unoptimized “fat locks” for contended objects
• Meta-locks [Agesen et al. 99]
  – A clever technique similar to MCS locks
  – No busy-waiting even for contended objects

• Their primary concern lies in uncontended cases
• They do not take the locality of object references into account

Page 38: Executing Parallel Programs with Potential Bottlenecks Efficiently

Summary

• Serious performance loss in existing schemes
  – Spin-locks: low locality of object references
  – Blocking locks: overhead in the contended request queue

• Very fast execution on contended objects
  – Highly optimized owner-based execution

• Excellent performance
  – Several times faster than simple schemes (several hundred percent speedup!)

Page 39: Executing Parallel Programs with Potential Bottlenecks Efficiently

Future Work

• Solving the problem of large memory use in some cases
  – A long list of requests may be formed
  – The problem is common to owner-based schemes
  – This work focused on time-efficiency, not on space-efficiency
  – Simple solution: when the memory used for requests ≧ some threshold,
    dynamically switch to local-based execution

• Increasing/decreasing PEs according to execution status
  – The system automatically decides the “best” number of PEs for each program point
  – This eliminates the existence of excessive processors itself

Page 40: Executing Parallel Programs with Potential Bottlenecks Efficiently

Slides from here on are shown during the question time (backup slides).

Page 41: Executing Parallel Programs with Potential Bottlenecks Efficiently

More Detailed Measurements using a Counter Program in C

[Graph: execution time (msec, 0–2000) vs. # of PEs (0–70) for the variants spin, block, block (detach), getone, detach, and reg. + pref.]

• Solaris threads & Sun Ultra Enterprise 10000
• Each processor increments a shared counter

Page 42: Executing Parallel Programs with Potential Bottlenecks Efficiently

No guarantee of FIFO order

• The method invoked later may be executed earlier
  – Simple solution: “reverse” the detached requests
  – Better solution:

• Can we use a queue instead of a list?
• Are 64-bit compare-and-swap/swap necessary?
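For the “reverse the detached requests” fix, a minimal sketch (my code; the request type is the singly linked list assumed in the earlier sketches): non-owners push at the head, so the detached list comes out in LIFO order, and an in-place reversal restores invocation order before the owner executes it.

#include <stddef.h>

typedef struct request { struct request *next; /* + method info */ } request_t;

/* Reverse a detached (LIFO) list so requests run in invocation order. */
static request_t *reverse_requests(request_t *head) {
    request_t *prev = NULL;
    while (head != NULL) {
        request_t *next = head->next;
        head->next = prev;        /* point this node back at the previous one */
        prev = head;
        head = next;
    }
    return prev;                  /* new head: oldest request first */
}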