Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions. David Ozog*, Jeff R. Hammond†, James Dinan†, Pavan Balaji†, Sameer Shende*, Allen Malony*. *University of Oregon, †Argonne National Laboratory. 2013 International Conference on Parallel Processing (ICPP), October 2, 2013.



Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions

David Ozog*, Jeff R. Hammond†, James Dinan†, Pavan Balaji†, Sameer Shende*, Allen Malony*

*University of Oregon †Argonne National Laboratory

2013 International Conference on Parallel Processing (ICPP)
October 2, 2013


Outline

1. NWChem, Coupled Cluster, Tensor Contraction Engine
2. Load Balance Challenges
3. Dynamic Load Balancing with Global Arrays (GA)
4. Nxtval Performance Experiments
5. Inspector/Executor Design
6. Performance Modeling (DGEMM and TCE Sort)
7. Largest Processing Time (LPT) Algorithm
8. Dynamic Buckets – Design and Implementation
9. Results
10. Conclusions
11. Future Work


NWChem and Coupled Cluster

NWChem:
• Wide range of methods, accuracies, and supported supercomputer architectures
• Well-known for its support of many quantum mechanical methods on massively parallel systems
• Built on top of Global Arrays (GA) / ARMCI

Coupled Cluster (CC):
• Ab initio, i.e., highly accurate
• Solves an approximate Schrödinger equation
• Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDT(Q) < CCSDTQ
• The respective computational costs: $O(n^6),\ O(n^7),\ O(n^8),\ O(n^9),\ O(n^{10})$
• And respective storage costs: $O(n^4),\ O(n^4),\ O(n^6),\ O(n^6),\ O(n^8)$

*Photos from nwchem-sw.org


NWChem and Coupled Cluster

[Diagram labels: Global Address Space; Distributed Memory Spaces]

*Diagram from GA tutorial (ACTS 2009)


DGEMM Tasks - Load Imbalance

• In CCSX (X=D,T,Q), one tensor contraction contains between one hundred and one million DGEMMs

• MFLOPs per task depend on:
  • Number of atoms
  • Spin and spatial symmetry
  • Accuracy of the chosen basis
  • The tile size


Computational Challenges

• Benzene: highly symmetric
• Water clusters: asymmetric
• Macro-molecules: QM/MM

• Load balance is crucially important for performance
• Obtaining optimal load balance is an NP-hard problem

*Photos from nwchem-sw.org


GA Dynamic Load Balancing Template


Works best when:

• On a single node (in SysV shared memory)

• Time spent in FOO(a) is huge

• On high-speed interconnects

• Number of simultaneous calls is reasonably small (less than 1,000).
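The template itself did not survive as text in this transcript. As a rough illustration only, the sketch below mimics the usual shared-counter pattern: every worker atomically fetches the next task index from one counter and runs the task it drew. Python threads stand in for GA processes, and `SharedCounter`, `next_val`, `foo`, and `run_dynamic` are hypothetical names, not the GA or NWChem API.

```python
import threading


class SharedCounter:
    """Stand-in for a globally shared task counter (hypothetical, not the GA API)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next_val(self):
        # Atomic fetch-and-increment: each caller receives a unique task index.
        with self._lock:
            value = self._value
            self._value += 1
            return value


def foo(a):
    # Placeholder for the real work, written FOO(a) on the slide.
    return a * a


def run_dynamic(num_tasks, num_workers):
    counter = SharedCounter()
    results = [None] * num_tasks
    executed_by = [None] * num_tasks

    def worker(rank):
        while True:
            task = counter.next_val()   # one counter access per task
            if task >= num_tasks:
                break                   # no work left
            results[task] = foo(task)
            executed_by[task] = rank

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, executed_by


results, executed_by = run_dynamic(num_tasks=20, num_workers=4)
```

Every task costs one access to the counter, which is consistent with the conditions listed above: the pattern pays off when FOO(a) is expensive relative to the fetch and few callers hit the counter at once.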


Nxtval - Performance Experiments

TAU Profiling
• 14 water molecules, aug-cc-pVDZ
• 123 nodes, 8 ppn
• Nxtval consumes a large percentage of the execution time.

Flooding micro-benchmark
• The proportion of time spent in Nxtval increases as more processes participate.
• When the arrival rate exceeds the processing rate, the process hosting the counter must use buffering and flow control.


Nxtval Performance Experiments

Strong Scaling

• 10 water molecules (aDZ)
• 14 water molecules (aDZ)

• 8 processes per node

• Percentage of overall execution time within Nxtval increases with scaling.


Inspector/Executor Design

1. Inspector
   • Calculate memory requirements
   • Remove null tasks
   • Collate task list

2. Task Cost Estimator
   • Two options:
     • Use performance models
     • Load gettimeofday() measurements from previous iteration(s)
   • Deduce performance models off-line

3. Static Partitioner
   • Partition into N groups, where N is the number of MPI processes
   • Minimize load imbalance according to cost estimates
   • Write task list information for each proc/contraction to volatile memory

4. Executor
   • Launch all tasks
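The four stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the NWChem/TCE implementation: the `Task` fields, the `symmetry_allowed` flag used to mark null tasks, and the `m*n*k` stand-in cost are all hypothetical, and the partitioner here is a plain greedy pass over the task list in its given order.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    m: int
    n: int
    k: int
    symmetry_allowed: bool  # hypothetical flag: False marks a null task


def inspect(candidates):
    """Inspector: drop null tasks and collate the remaining task list."""
    return [t for t in candidates if t.symmetry_allowed]


def estimate_cost(task):
    """Task cost estimator: crude stand-in model proportional to m*n*k."""
    return float(task.m * task.n * task.k)


def partition(tasks, num_procs):
    """Static partitioner: give each task to the least-loaded process so far."""
    loads = [0.0] * num_procs
    task_lists = [[] for _ in range(num_procs)]
    for task in tasks:
        p = loads.index(min(loads))  # linear scan for the minimum load
        task_lists[p].append(task)
        loads[p] += estimate_cost(task)
    return task_lists, loads


def execute(my_tasks):
    """Executor: run every task on this process's list; no shared counter needed."""
    return [t.task_id for t in my_tasks]  # placeholder for the real contraction work


candidates = [
    Task(0, 4, 4, 4, True), Task(1, 4, 4, 4, False), Task(2, 8, 8, 8, True),
    Task(3, 2, 2, 2, True), Task(4, 4, 4, 4, False), Task(5, 6, 6, 6, True),
]
tasks = inspect(candidates)
task_lists, loads = partition(tasks, num_procs=2)
executed = [execute(my_tasks) for my_tasks in task_lists]
```

In this toy run the inspector discards tasks 1 and 4, and the executor on each process simply walks its own precomputed list.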


Performance Modeling - DGEMM

DGEMM: $C \leftarrow \alpha A B + \beta C$
• A(m,k), B(k,n), and C(m,n) are 2D matrices
• α and β are scalar coefficients

Our Performance Model:
• (mn) dot products of length k
• Corresponding (mn) store operations in C
• m loads of size k from A
• n loads of size k from B
• The model coefficients a, b, c, and d are found by solving a nonlinear least squares problem (in Matlab)
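The model equation did not survive in this transcript. Reading the four terms above as costs proportional to mnk, mn, mk, and nk suggests one plausible form, t(m,n,k) ≈ a·mnk + b·mn + c·mk + d·nk; this form is an assumption reconstructed from the bullets, not the slide's equation. Because that form is linear in a, b, c, and d, the sketch below swaps in ordinary linear least squares (normal equations) instead of the nonlinear Matlab solve mentioned above. The timings are synthetic and the coefficient values are hypothetical.

```python
def features(m, n, k):
    # Assumed model terms: dot-product work, stores to C, loads from A, loads from B.
    return [m * n * k, m * n, m * k, n * k]


def predict(coeffs, m, n, k):
    """Estimated DGEMM time for given coefficients [a, b, c, d]."""
    return sum(c * x for c, x in zip(coeffs, features(m, n, k)))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit(samples):
    """samples: list of ((m, n, k), measured_time). Returns [a, b, c, d]."""
    rows = [features(*dims) for dims, _ in samples]
    times = [t for _, t in samples]
    p = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(p)]
    return solve_linear_system(gram, rhs)


# Synthetic "measurements" generated from hypothetical coefficients (nanoseconds).
true_coeffs = [0.5, 2.0, 1.0, 1.5]
sizes = [4, 8, 16, 32]
samples = [((m, n, k), predict(true_coeffs, m, n, k))
           for m in sizes for n in sizes for k in sizes]
fitted = fit(samples)
```

With noise-free synthetic data the fit recovers the generating coefficients; real timings would scatter around the model, and `predict` would then serve as the inspector's per-task cost estimate.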



Performance Modeling – TCE “Sort”

Our Performance Model:

• TCE “Sorts” are actually matrix permutations

• 3rd order polynomial fit suffices

• Data always fits in L2 cache for this architecture

• Somewhat noisy measurements, but that’s OK.

[Plot omitted; x-axis: data size (bytes)]
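A third-order polynomial cost model is cheap to evaluate inside an inspector. The sketch below evaluates such a cubic in the data size with Horner's rule; `sort_cost` is a hypothetical helper and the coefficients are made-up placeholders, not the fitted values behind the plot.

```python
def sort_cost(num_bytes, coeffs):
    """Evaluate c0 + c1*x + c2*x**2 + c3*x**3 at x = num_bytes using Horner's rule."""
    c0, c1, c2, c3 = coeffs
    return c0 + num_bytes * (c1 + num_bytes * (c2 + num_bytes * c3))


# Hypothetical coefficients (seconds), chosen only for illustration.
coeffs = (1.0e-6, 2.0e-9, 0.0, 1.0e-19)
estimate = sort_cost(8 * 1024, coeffs)
```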


Largest Processing Time (LPT) Algorithm

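1. Sort tasks by cost in descending order

2. Assign each task, in that order, to the least-loaded process so far

• Polynomial-time algorithm applied to an NP-hard problem

• Proven to be a 4/3-approximation by Ronald Graham*: for N processes, the LPT makespan is at most (4/3 − 1/(3N)) times the optimal makespan

*R. L. Graham, SIAM Journal on Applied Mathematics, Vol. 17, No. 2 (Mar. 1969), pp. 416-429.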



LPT - Binary Min Heap

1. Initialize a heap with N nodes (N = # of procs) each having zero cost.

2. Perform IncreaseMin() operation for each new cost from the sorted list of tasks.

• IncreaseMin() is quite efficient because UpdateRoot() often occurs in O(1) time.

• Far more efficient than the naïve approach of iterating through an array to find the min.

• Execution time for this phase is negligible.
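The two steps above map naturally onto a library min-heap. The sketch below is one possible rendering using Python's `heapq`, where `heapq.heapreplace` plays the role of IncreaseMin(); the function name `lpt_partition` and the task costs are hypothetical.

```python
import heapq


def lpt_partition(costs, num_procs):
    """Assign task indices to processes, largest processing time first."""
    # Sort task indices by estimated cost, most expensive first.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)

    # Min-heap of (load, process id): one zero-cost node per process.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)

    assignment = [[] for _ in range(num_procs)]
    for i in order:
        load, p = heap[0]  # the least-loaded process sits at the root
        assignment[p].append(i)
        # Replace the root with its increased load and restore heap order.
        heapq.heapreplace(heap, (load + costs[i], p))

    loads = [0.0] * num_procs
    for load, p in heap:
        loads[p] = load
    return assignment, loads


costs = [7.0, 5.0, 4.0, 4.0, 3.0, 3.0, 2.0]  # hypothetical task costs
assignment, loads = lpt_partition(costs, num_procs=3)
# assignment == [[0, 5], [1, 4, 6], [2, 3]], loads == [10.0, 10.0, 8.0]
```

Each assignment costs at most O(log N) heap work, compared with an O(N) scan of an array of loads.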


LPT - Load Balance

(a) Original with Nxtval Measured

(b) Inspector/Executor with Nxtval Measured

(c) LPT – 1st iteration

(d) LPT – subsequent iterations


Dynamic Buckets Design


Dynamic Buckets Implementation


Dynamic Buckets Load Balance

(a) LPT Predicted

(b) LPT Measured

(c) Dynamic Buckets Predicted

(d) Dynamic Buckets Measured


I/E Results

[Plots: Nitrogen - CCSDT; Benzene - CCSD]


10-H2O Cluster Results (DB)

CCSD_t2_7_3 CCSD_t2_7


Conclusions

1. Nxtval can be expensive at large scales.

2. Static partitioning can fix the problem, but has weaknesses:
   • Requires a performance model
   • Noise degrades results

3. Dynamic Buckets is a viable alternative and requires few changes to GA applications.

4. Load balance issues differ from problem to problem; more work is needed to pinpoint why and what to do about it.


Future Work (Research)

1. Cyclops Tensor Framework (CTF)

2. DAG scheduling of tensor contractions

3. What happens with accelerators (MIC/GPU)?
   1. Performance model
   2. Balancing load across both CPU and device

4. Comparison with hierarchical distributed load balancing, work stealing, etc.

5. Hypergraph partitioning / data locality