work stealing and persistence-based load balancers for iterative overdecomposed applications

26
Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012

Upload: danae

Post on 24-Feb-2016

46 views

Category:

Documents


0 download

DESCRIPTION

Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications. Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012. Dynamic load balancing on 100,000 processor cores and beyond. Iterative Applications. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

Work Stealing and Persistence-based Load Balancers for Iterative

Overdecomposed Applications

Jonathan Lifflander, UIUCSriram Krishnamoorthy, PNNL*

Laxmikant Kale, UIUC

HPDC 2012

Page 2: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

Dynamic load balancing on 100,000 processor cores

and beyond

Page 3: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Iterative Applications

Applications repeatedly executing the same computation

Static or slowly evolving execution characteristics

Execution characteristics preclude static balancingApplication characteristics (comm. pattern, sparsity,…)Execution environment (topology, asymmetry, …)

Challenge: Load-balancing such applications

Page 4: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Overdecomposition

Expose greater levels concurrency than supported by hardware

Middleware (runtime) dynamically maps the concurrent tasks to hardware resources

Abstraction supports continuous optimization and adaptation

Improvements to load balancingNew metrics (power, energy, graceful degradation, …)New features: fault tolerance, power/energy-awareness

Page 5: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Problem Statement

Scalable load balancers for iterative overdecomposed applications

We consider two alternatives:Persistence-based load balancingWork stealing

How do these algorithms behave at scale?

How do they compare?

Page 6: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Related Work

Overdecomposition is a widely used approach

Inspector-executor approaches employ start-time load balancers

Hierarchical load balancers in the past typically do not consider localization

Scalability of work stealing not well understood – largest prior demonstration was on 8192 cores

No comparative evaluation of the two schemes

Page 7: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

TASCEL: Task Scheduling Library

Runtime library for task-parallel programs

Manages task collections for execution on distributed memory machines

Compatible with native MPI programs

Phase-based switch between SPMD and non-SPMD modes of execution

Page 8: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

TASCEL Execution

Task: basic unit of migrateable executionTypical workflow:

Create a task collectionSeed it with one or more tasksProcess tasks in the collection till termination detection

Processing of task collectionsManages concurrency, faults, …

Trade-offs exposed through implementation specializations

Dynamic load balancing schemesFault tolerance protocols…

Page 9: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Load Balancers

Greedy localized hierarchical persistence-based load balancing

Retentive work stealing

Page 10: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

0 1 2 3 4 5

3 4 5

1 2

0

Greedy Localized Hierarchical Persistence-based LB

Intuition: Satisfy local imbalance first

Page 11: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

0 1 2 3 4 5

3 4 5

1 2

0

Greedy Localized Hierarchical Persistence-based LB

Intuition: Satisfy local imbalance first

Page 12: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Proc 1

Proc 2

Proc 3 …

Proc n

Local Queues Work Pool

Retentive Work Stealing

Page 13: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

head split stail

Local Remote

Retentive Work Stealing

Page 14: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

head split

addTask(): add task to local region

getTask(): remove task from local region

stail

Buffer of locally executed tasks

Retentive Work Stealing

Page 15: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

head split

releaseToShared(): move to shared portion

acquireFromShared(): move to local portion

stail

Retentive Work Stealing

Page 16: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

head split

1. Mark tasks stolen at stail and begin transfer

itail ctail

stail: beginning of tasks available to be stolenitail: number of tasks that have finished transferctail: past this marker it is safe to use buffer

stail

2. Atomically increment itail on completion of transfer 3. Worker updates ctail when stail == itail

==itail ==ctail

Retentive Work Stealing

Page 17: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Proc 1Proc 2Proc 3 …

Proc n

Seeded Local Queues

Proc 1Proc 2Proc 3Proc n

Actual Executed Tasks

Intuition: Stealing indicates poor initial balance

Retentive Work Stealing

Page 18: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Retentive Work Stealing

Active message based work stealing optimized for distributed memory

Exploit persistence across work stealing iterations

Each work stealing phaseTrack tasks executed by this worker in this iterationSeed with tasks executed by this worker for the next iteration

Page 19: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Experimental Setup

Multi-threaded MPI; one core per node for active messages“Flat” execution – each core is an independent worker

No. nodes

Cores per

node

Memory per

node

Max cores in queue

Hopper (Cray XE6) 6384 24 32GB 146400Intrepid (BG/P) 40960 4 4GB 163840Titan (Cray XK6) 18688 16 32GB 298592

Page 20: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Hartree-Fock Benchmark

Basis for several electronic structure theoriesTwo-electron contributionSchwarz-screening: data dependent sparsity screening at runtimeTasks vary in size from milliseconds to seconds

HF-Be512 (20) HF-Be512 (40)

Total tasks 2.2x1010 1.4x109

Non-null tasks 9.1x106 8.6x105

Page 21: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Hopper: Performance

Persistence-based load balancing “converges” fasterRetentive stealing also improves efficiencyStealing effective even with limited parallelism

Persistence-based load balancing Retentive Stealing

Effi

cien

cy

Core count Core count

Avg. tasks per core

Page 22: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Intrepid: Performance

Much worse performance for the first iterationConverges to a better efficiency than on Hopper

Persistence-based load balancing Retentive Stealing

Effi

cien

cy

Core count Core count

Avg. tasks per core

Page 23: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Titan: Performance

Similar behavior as on Intrepid

Persistence-based load balancing Retentive Stealing

Effi

cien

cy

Core count Core count

Avg. tasks per core

Page 24: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Intrepid: Num. Steals

Retentive stealing stabilizes stealing costsSimilar trends on all systems

Core count Core count

Num

. ste

als

Attempted steals Successful steals

Page 25: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Utilization

HF-Be256 on 9600 cores on HopperInitial stealing has high costs during ramp-downRetentive stealing does a better job reducing this cost

Steal (13.6secs) StealRet-final (12.6secs) PLB (12.2secs)

Util

izat

ion

(%)

Time Time Time

Page 26: Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications

HPDC’12, Delft

Summary of InsightsRetentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan

Retentive stealing and persistence-based load balancing perform comparably

Retentive stealing incrementally improves balance

Number of steals does not grow substantially with scale

Greedy hierarchical persistence-based load balancer achieves good load balance quality as compared to a centralized scheme (details in paper)