
Dynamic Feedback: An Effective Technique for Adaptive Computing

Pedro Diniz and Martin Rinard

Department of Computer Science
University of California, Santa Barbara

http://www.cs.ucsb.edu/~{pedro,martin}

Basic Issue: Efficient Implementation of Atomic Operations in Object-Based Languages

Approach: Reduce Lock Overhead by Coarsening Lock Granularity

Problem: Coarsening Lock Granularity May Reduce Available Concurrency

Solution: Dynamic Feedback

• Multiple Lock Coarsening Policies
• Dynamic Feedback
  • Generate Multiple Versions of Code
  • Measure Dynamic Overhead of Each Policy
  • Dynamically Select Best Version
• Context
  • Parallelizing Compiler
  • Irregular Object-Based Programs
  • Pointer-Based Data Structures
  • Commutativity Analysis

Talk Outline

• Lock Coarsening

• Dynamic Feedback

• Experimental Results

• Related Work

• Conclusions

Model of Computation

• Parallel Programs
  • Serial Phases
  • Parallel Phases
• Atomic Operations on Shared Objects
  • Mutual Exclusion Locks
  • Acquire Constructs
  • Release Constructs

[Figure: a Serial Phase, a Parallel Phase, and a second Serial Phase; within the Parallel Phase, each of the Atomic Operations is a Mutual Exclusion Region bracketed by L.acquire() and L.release().]

Problem: Lock Overhead

[Figure: two consecutive L.acquire() / L.release() pairs.]

Solution: Lock Coarsening

[Figure: Original code with two L.acquire() / L.release() pairs, shown beside the code After Lock Coarsening with a single L.acquire() / L.release() pair.]

Reference: Diniz and Rinard, “Synchronization Transformations for Parallel Computing”, POPL 97
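The transformation can be illustrated with a minimal Python sketch. This is not the compiler's generated code: `CountingLock`, `Body`, `update_original`, and `update_coarsened` are hypothetical names, and the lock wrapper exists only to count how many acquire/release pairs execute.

```python
import threading

class CountingLock:
    """Mutual exclusion lock that counts how many times it is acquired."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquires = 0

    def acquire(self):
        self._lock.acquire()
        self.acquires += 1  # updated while the lock is held

    def release(self):
        self._lock.release()

class Body:
    """Hypothetical shared object with two fields updated atomically."""
    def __init__(self):
        self.L = CountingLock()
        self.force = 0.0
        self.energy = 0.0

def update_original(body, f, e):
    # Original: each atomic operation has its own acquire/release pair.
    body.L.acquire()
    body.force += f
    body.L.release()
    body.L.acquire()
    body.energy += e
    body.L.release()

def update_coarsened(body, f, e):
    # After lock coarsening: one acquire/release pair covers both updates.
    body.L.acquire()
    body.force += f
    body.energy += e
    body.L.release()

a, b = Body(), Body()
for _ in range(1000):
    update_original(a, 1.0, 2.0)
    update_coarsened(b, 1.0, 2.0)
print(a.L.acquires, b.L.acquires)  # 2000 1000
```

Both versions compute the same field values; the coarsened version executes half as many acquire/release pairs.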

Lock Coarsening Trade-Off

• Advantage:
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
• Disadvantage: May Introduce False Exclusion
  • Multiple Processors Attempt to Acquire Same Lock
  • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region

False Exclusion

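[Figure: Original code with several separate L.acquire() / L.release() pairs, shown beside the code After Lock Coarsening, where a single L.acquire() / L.release() pair spans the whole sequence; the part of the coarsened region that was originally outside any mutual exclusion region is labeled False Exclusion.]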
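A small deterministic model can make the cost of false exclusion concrete. The sketch below is an illustration under stated assumptions, not a measurement: two processors share one lock, time is in abstract units, the lock is granted in request order, and `simulate` and the two example programs are hypothetical.

```python
def simulate(programs):
    """Each program is a list of (duration, holds_lock) regions for one processor.

    A region with holds_lock=True acquires the single shared lock at its start
    and releases it at its end. Returns (total_wait_time, finish_time).
    """
    clock = [0] * len(programs)      # current time on each processor
    nxt = [0] * len(programs)        # index of each processor's next region
    lock_free_at = 0                 # time at which the lock is next released
    total_wait = 0
    while True:
        ready = [i for i in range(len(programs)) if nxt[i] < len(programs[i])]
        if not ready:
            break
        # Advance the processor that is furthest behind, so lock requests
        # are served in the order they are made.
        i = min(ready, key=lambda p: clock[p])
        duration, holds_lock = programs[i][nxt[i]]
        nxt[i] += 1
        if holds_lock:
            start = max(clock[i], lock_free_at)
            total_wait += start - clock[i]
            clock[i] = start + duration
            lock_free_at = clock[i]
        else:
            clock[i] += duration
    return total_wait, max(clock)

# Original: two short atomic operations separated by 8 units of unlocked work.
original = [[(1, True), (8, False), (1, True)]] * 2
# Coarsened: the lock is held across all 10 units, including the unlocked work.
coarsened = [[(10, True)]] * 2

print(simulate(original))   # (1, 11)
print(simulate(coarsened))  # (10, 20)
```

In this model the coarsened version serializes the two processors completely, because the lock holder spends most of its time in code that originally needed no lock.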

Lock Coarsening Policy

Goal: Limit Potential Severity of False Exclusion

Mechanism: Multiple Lock Coarsening Policies

• Original: Never Coarsen Granularity
• Bounded: Coarsen Granularity Only Within Cycle-Free Subgraphs of ICFG

• Aggressive: Always Coarsen Granularity

Choosing Best Policy

• Best Lock Coarsening Policy May Depend On
  • Topology of Data Structures
  • Dynamic Schedule Of Computation
• Information Required to Choose Best Policy Unavailable at Compile Time
• Complications
  • Different Phases May Have Different Best Policy
  • In Same Phase, Best Policy May Change Over Time

Solution: Dynamic Feedback

• Generated Code Executes
  • Sampling Phases: Measure Performance of Different Policies
  • Production Phases: Use Best Policy From Sampling Phase

• Periodically Resample to Discover Best Policy Changes

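[Figure: Overhead versus Time for the Aggressive, Original, and Bounded policies across a Sampling Phase, a Production Phase, and a second Sampling Phase; a Code Version row beneath the plot shows Aggressive, then Original.]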
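The alternation of sampling and production phases can be sketched as a loop in virtual time. This is a simplified model under assumptions, not the generated code: `dynamic_feedback` and `overhead` are hypothetical, the overhead values are invented for illustration, and each policy is sampled by reading its overhead once rather than by running it.

```python
def dynamic_feedback(overhead, policies, total_time, S, P):
    """Simulate alternating sampling and production phases.

    overhead(policy, t) returns the fraction of time lost to locking at time t.
    S is the sampling interval per policy, P the production interval.
    Returns a list of (start_time, policy) for each production phase.
    """
    t = 0.0
    choices = []
    while t < total_time:
        sampled = {}
        for p in policies:            # sampling phase: try each version for S
            sampled[p] = overhead(p, t)
            t += S
        best = min(policies, key=lambda p: sampled[p])
        choices.append((t, best))
        t += P                        # production phase: run the best version
    return choices

def overhead(policy, t):
    # Invented overhead curves: Aggressive is best early, then contention rises.
    if policy == "original":
        return 0.30
    if policy == "bounded":
        return 0.20
    return 0.10 if t < 20 else 0.60

choices = dynamic_feedback(overhead, ["original", "bounded", "aggressive"],
                           total_time=40, S=0.01, P=10)
print([p for _, p in choices])
# ['aggressive', 'aggressive', 'bounded', 'bounded']
```

The third sampling phase notices that the Aggressive version has become expensive, so the following production phases use Bounded instead.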

Guaranteed Performance Bounds

• Assumptions:
  • Overhead Changes Bounded by Exponential Decay Functions
• Worst Case Scenario:
  • No Useful Work During Sampling Phase
  • Sampled Overheads Are Same For All Versions
  • Overhead of Selected Version Increases at Maximum Rate
  • Overhead of Other Versions Decreases at Maximum Rate

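[Figure: Overhead versus Time in the worst-case scenario, across alternating sampling (S) and production (P) phases; the plot carries the label V0.]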

Guaranteed Performance Bound

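Definition 1. Policy $p_i$ is at Most $\epsilon$ Worse Than Policy $p_j$ over a Time Interval $T$ if

$$\mathrm{Work}_j^{T} - \mathrm{Work}_i^{T} \le \epsilon T \quad\text{where}\quad \mathrm{Work}_i^{T} = \int_0^{T} \bigl(1 - o_i(t)\bigr)\,dt$$

Definition 2. Dynamic Feedback is at Most $\epsilon$ Worse Than the Optimal if

$$\mathrm{Work}_{opt}^{P+SN} - \mathrm{Work}_0^{P} \le \epsilon\,(P + SN) \quad\text{where}\quad \mathrm{Work}_{opt}^{P+SN} = \int_0^{P+SN} \bigl(1 - o_1(t)\bigr)\,dt$$

Result 1. To Guarantee this Bound

$$(1 - \epsilon)\,P + \frac{1}{\lambda}\,e^{-\lambda P} \le (\epsilon - 1)\,SN + \frac{1}{\lambda}$$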

Guaranteed Performance Bounds

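[Figure: Constraint Values versus Production Interval P, plotting $(1 - \epsilon)P + (1/\lambda)e^{-\lambda P}$ against $(\epsilon - 1)SN + (1/\lambda)$; the Feasible Region is the range of P where the first curve does not exceed the second.]

• Production Interval Too Long: May Execute Suboptimal Policy for Long Time
• Production Interval Too Short: Unable to Amortize Sampling Overhead
• Basic Constraint: Decay Rate ($\lambda$) Must be Small Enough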
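The feasibility condition can be checked numerically for a candidate production interval. The sketch below assumes the constraint has the form $(1-\epsilon)P + (1/\lambda)e^{-\lambda P} \le (\epsilon-1)SN + (1/\lambda)$, reads $SN$ as the sampling interval $S$ times the number of policies $N$, and uses invented parameter values; `constraint_gap` and `feasible` are hypothetical helper names.

```python
import math

def constraint_gap(P, eps, lam, S, N):
    """Left side minus right side of the constraint; feasible when <= 0."""
    lhs = (1 - eps) * P + math.exp(-lam * P) / lam
    rhs = (eps - 1) * S * N + 1 / lam
    return lhs - rhs

def feasible(P, eps, lam, S, N):
    return constraint_gap(P, eps, lam, S, N) <= 0

# Invented values: eps = 0.5, decay rate 0.01 per second,
# 10 ms sampling interval, 3 policies.
params = dict(eps=0.5, lam=0.01, S=0.01, N=3)
for P in (0.001, 10, 1000):
    print(P, feasible(P, **params))
```

With these values a very short production interval fails (sampling cost is not amortized), a 10 second interval satisfies the constraint, and a very long interval fails again (a suboptimal policy may run too long).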

Dynamic Feedback: Implementation

• Code Generation

• Measuring Policy Overhead

• Interval Selection

• Interval Expiration

• Policy Switch

Code Generation

• Statically Generate Different Code Versions for Each Policy
  • Alternative: Dynamic Code Generation
• Advantages of Static Code Generation:
  • Simplicity of Implementation
  • Fast Policy Switching
• Potential Drawback of Static Code Generation
  • Code Size (In Practice Not a Problem)
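As a rough illustration of statically generated versions, the sketch below writes three variants of one loop by hand and selects among them through a table, so a policy switch is only a lookup. The names (`SharedSum`, `add_all_*`, `VERSIONS`, `run`) are hypothetical, and the placement of locks in the Bounded and Aggressive variants is an assumed reading of the policies (Bounded does not hoist the lock out of the loop, Aggressive does).

```python
import threading

class SharedSum:
    """Hypothetical shared object; counts executed acquires for comparison."""
    def __init__(self):
        self.L = threading.Lock()
        self.acquires = 0
        self.total = 0
        self.count = 0

    def acquire(self):
        self.L.acquire()
        self.acquires += 1

    def release(self):
        self.L.release()

def add_all_original(obj, values):
    for v in values:                 # one lock pair per atomic operation
        obj.acquire(); obj.total += v; obj.release()
        obj.acquire(); obj.count += 1; obj.release()

def add_all_bounded(obj, values):
    for v in values:                 # coarsened within the loop body only
        obj.acquire()
        obj.total += v
        obj.count += 1
        obj.release()

def add_all_aggressive(obj, values):
    obj.acquire()                    # coarsened across the whole loop
    for v in values:
        obj.total += v
        obj.count += 1
    obj.release()

# One precompiled version per policy; switching policy is a table lookup.
VERSIONS = {
    "original": add_all_original,
    "bounded": add_all_bounded,
    "aggressive": add_all_aggressive,
}

def run(policy, values):
    obj = SharedSum()
    VERSIONS[policy](obj, values)
    return obj.acquires, obj.total, obj.count

for policy in VERSIONS:
    print(policy, run(policy, range(100)))
```

All three versions produce the same totals; they differ only in how many acquire/release pairs they execute and in how long the lock is held.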

Measuring Policy Overhead

• Sources of Overhead
  • Locking Overhead
  • Waiting Overhead
• Compute Locking Overhead
  • Count Number of Executed Acquire/Release Constructs
• Estimate Waiting Overhead
  • Count Number of Spins on Locks Waiting to be Released

$$\text{Sampled Overhead} = \frac{(\text{Number of Acquire/Release} \times \text{Acquire/Release Execution Time}) + (\text{Number of Spins} \times \text{Spin Time})}{\text{Sampling Time}}$$
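The overhead formula translates directly into a small function. The sketch below assumes the sampled overhead is the locking time plus the waiting time, divided by the sampling time; the counter values and per-operation costs in the example are invented, and `sampled_overhead` is a hypothetical name.

```python
def sampled_overhead(num_acquire_release, acquire_release_time,
                     num_spins, spin_time, sampling_time):
    """Fraction of the sampling interval spent on locking and waiting.

    All times must use the same unit (microseconds in the example below).
    """
    locking = num_acquire_release * acquire_release_time
    waiting = num_spins * spin_time
    return (locking + waiting) / sampling_time

# Invented numbers: 1000 acquire/release pairs at 2 us each and 500 spins
# at 1 us each, measured over a 10 ms (10000 us) sampling interval.
print(sampled_overhead(1000, 2, 500, 1, 10000))  # 0.25
```

Only two counters need to be maintained at run time; the per-operation costs are constants.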

Interval Selection and Expiration

• Fixed Interval Values
  • Sampling Interval: 10 milliseconds
  • Production Interval: 10 seconds
  • Good Results for Wide Range of Interval Values
• Polling Code for Expiration Detection
  • Location: Back Edges of Parallel Loop
  • Advantage: Low Overhead
  • Disadvantage: Potential Interaction with Iteration Size

[Figure: Atomic Operations in the parallel loop, with Polling Points marked.]
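Polling at the loop back edge, and its interaction with iteration size, can be sketched with a fake clock. Everything here is illustrative: `run_interval` and `FakeClock` are hypothetical, time is in integer milliseconds, and each iteration is assumed to take 4 ms.

```python
class FakeClock:
    """Deterministic stand-in for a hardware timer (milliseconds)."""
    def __init__(self):
        self.t = 0

    def now(self):
        return self.t

    def advance(self, dt):
        self.t += dt

def run_interval(work_items, do_iteration, now, deadline):
    """Run iterations until work runs out or the interval has expired.

    The timer is polled only at the loop back edge, so expiration is
    detected after the current iteration finishes, never in the middle.
    """
    done = 0
    for item in work_items:
        do_iteration(item)
        done += 1
        if now() >= deadline:   # polling point on the back edge
            break
    return done

clock = FakeClock()
done = run_interval(range(100), lambda _: clock.advance(4), clock.now, deadline=10)
print(done, clock.now())  # 3 12
```

The 10 ms interval is detected as expired at 12 ms: the larger the iterations, the further an interval can overshoot its nominal length.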

Policy Switch

• Synchronous
  • Processors Poll Timer to Detect Interval Expiration
  • Barrier At End of Each Interval
• Advantages:
  • Consistent Transitions
  • Clean Overhead Measurements
• Disadvantages:
  • Need to Synchronize All Processors
  • Potential Idle Time At Barrier
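A synchronous switch can be sketched with Python's `threading.Barrier`, whose `action` callback runs once per barrier cycle before any waiting thread is released. This is only an illustration of the barrier structure: the worker threads stand in for processors, the policy simply advances in a fixed order instead of being chosen from measurements, and all names are hypothetical.

```python
import threading

POLICIES = ["original", "bounded", "aggressive"]
NUM_WORKERS = 4
state = {"policy": 0}   # index of the policy every worker currently uses

def switch_policy():
    # Runs exactly once per barrier cycle, before any worker is released,
    # so all workers observe the same policy in the next interval.
    state["policy"] = (state["policy"] + 1) % len(POLICIES)

barrier = threading.Barrier(NUM_WORKERS, action=switch_policy)
history = [[] for _ in range(NUM_WORKERS)]

def worker(wid):
    for _ in range(3):
        # One interval of work under the current policy (work omitted).
        history[wid].append(POLICIES[state["policy"]])
        # Interval expired: wait for every worker before switching.
        barrier.wait()

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(history[0])  # ['original', 'bounded', 'aggressive']
```

Every worker records the same policy sequence, which is the consistency the barrier buys; the cost is that fast workers sit idle at the barrier until the slowest one arrives.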

Experimental Results

• Parallelizing Compiler Based on Commutativity Analysis [PLDI’96]

• Set of Complete Scientific Applications
  • Barnes-Hut N-Body Solver (1500 lines of C++)
  • Liquid Water Simulation Code (1850 lines of C++)
  • Seismic Modeling String Code (2050 lines of C++)

• Different Lock Coarsening Policies

• Dynamic Feedback

• Performance on Stanford DASH Multiprocessor

Code Sizes

[Figure: three bar charts of Size of Text Segment (Kbytes, axis 0 to 60) for Barnes-Hut, Water, and String, each comparing the Serial, Original, and Dynamic versions.]

Lock Overhead

[Figure: three bar charts of Percentage Lock Overhead (axis 0 to 60): Barnes-Hut (16K Particles) with Original, Bounded, and Aggressive; Water (512 Molecules) with Original, Bounded, and Aggressive; String (Big Well Model) with Original and Aggressive.]

Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

Contention Overhead

[Figure: three plots of Contention Percentage (0 to 100) versus Processors (0 to 16) for the Original, Bounded, and Aggressive policies: Barnes-Hut (16K Particles), Water (512 Molecules), and String (Big Well Model).]

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

Performance Results: Barnes-Hut

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Barnes-Hut on DASH (16K Particles); curves: Ideal, Aggressive, Dynamic Feedback, Bounded, Original.]

Performance Results: Water

[Figure: Speedup (0 to 16) versus number of processors (0 to 16) for Water on DASH (512 Molecules); curves: Ideal, Bounded, Original, Aggressive, Dynamic Feedback.]

Performance Results: String

[Figure: Speedup (0 to 16) versus number of processors (0 to 16) for String on DASH (Big Well Model); curves: Ideal, Original, Aggressive, Dynamic Feedback.]

Summary

• Code Size Is Not An Issue

• Lock Coarsening Has Significant Performance Impact

• Best Lock Coarsening Policy Varies With Application

• Dynamic Feedback Delivers Code With Performance Comparable to The Best Static Lock Coarsening Policy

Related Work

• Adaptive Execution Techniques (Saavedra Park:PACT96)

• Dynamic Dispatch Optimizations (Hölzle Ungar:PLDI94)

• Dynamic Code Generation (Engler:PLDI96)

• Profiling (Brewer:PPoPP95)

• Synchronization Optimizations (Plevyak et al:POPL95)

Conclusions

• Dynamic Feedback
  • Generated Code Adapts to Different Execution Environments
• Integration with Parallelizing Compiler
  • Irregular Object-Based Programs
  • Pointer-Based Linked Data Structures
  • Commutativity Analysis
• Evaluation with Three Complete Applications
  • Performance Comparable to Best Hand-Tuned Optimization

BACKUP SLIDES

Performance Results: Barnes-Hut

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Barnes-Hut (16K Particles); curves: Ideal, Aggressive, Bounded, Original.]

Performance Results: Water

[Figure: Speedup (0 to 16) versus number of processors (0 to 16) for Water (512 Molecules); curves: Ideal, Aggressive, Bounded, Original.]

Performance Results: String

[Figure: Speedup (0 to 16) versus number of processors (0 to 16) for String (Big Well Model); curves: Ideal, Original, Aggressive.]

Policy Switch

[Figure: timeline in which Policy 1 runs until the Timer Expires, then Policy 2 runs until the Timer Expires again.]

Motivation

Challenges:
• Match Best Implementation to Environment
• Heterogeneous and Mobile Systems

Goal:
• Develop Mechanisms to Support Code that Adapts to Environment Characteristics

Technique:
• Dynamic Feedback

Overhead for Barnes-Hut

[Figure: Sampled Overhead (0 to 0.5) versus Execution Time (0 to 25 Seconds) for the Original, Aggressive, and Bounded policies; Barnes-Hut on DASH (8 Processors), FORCES Loop, Data Set: 16K Particles.]

Overhead for Water

[Figure: Sampled Overhead (0 to 0.5) over a time axis from 0 to 60 for the Original and Bounded policies; Water on DASH (8 Processors), INTERF Loop, Data Set: 512 Molecules.]

Overhead for Water

[Figure: Sampled Overhead (0 to 1) over a time axis from 0 to 60 for the Aggressive and Original policies; Water on DASH (8 Processors), POTENG Loop, Data Set: 512 Molecules.]

Overhead for String

[Figure: Sampled Overhead (0 to 1) over a time axis from 0 to 500 for the Aggressive and Original policies; String on DASH (8 Processors), PROJFWD Loop, Data Set: Big Well.]

Dynamic Feedback

[Figure: Overhead versus Time for the Aggressive, Original, and Bounded policies across a Sampling Phase, a Production Phase, and a second Sampling Phase; a Code Version row beneath the plot shows Aggressive.]
