m. minkoff and d. kaushik - argonne national laboratorymmin/mcs/mike_dinesh.pdf · m. minkoff and...

28
1 Pioneering Science and Technology Office of Science U.S. Department of Energy Using Tensor Methods, PETSc, and SLEPc to Obtain Exact Cumulative Reaction Probability M. Minkoff and D. Kaushik

Upload: hahuong

Post on 09-May-2018

218 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

1

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Using Tensor Methods, PETSc, and SLEPc to ObtainExact Cumulative Reaction Probability

M. Minkoff and D. Kaushik

Page 2: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

2

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Outline

• Overview• Description of CRP Simulation Problem• Using PETSc and SLEPc for Application to CRP• Computations for Sparse Matrices• Results for Banded Preconditioning• Future Directions• Tensor Matrix Multiplication

Page 3: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

3

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Chemical Dynamics Theory3 angles, 3 stretches6 degrees of freedom

Page 4: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

4

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Chemical Dynamics Theory• Reaction rates are related to

- Cumulative Reaction Probability (CRP), N(E)N(E) = 4 Tr [ εr

1/2 G(E)† εp G(E) εr1/2 ]

• Where Tr is the trace of the matrix, † is the adjoint, εr and εp are the

absorbing potential in the reactant and product regions.• εr

+ εp = ε, the given total absorbing potential.The Green’s functions have the form:

G(E) = (E + iε - H)-1 , where i imaginary and H is the Hamiltonian.

We need to solve two linear systems (at each iteration):

(E + iε - H) y = x and it’s adjoint where x is known.

This system is solved via GMRES with preconditioning methods (initiallydiagonal scaling).

Page 5: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

5

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Hamiltonian Sparsity Pattern

• Sparsity for d=2 and d>2

Page 6: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

6

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Non-zero Storage for Green’s Matrix

0.01

0.1

1

10

100

1000

104

1 10

Storage Required for F=3.5 Cutoff, .32ev

(3% Accuracy)

Matrix Size (MWords)Extrapolation Curve

Matrix Size (MWords)

Dimension

Page 7: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

7

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Eigenvalues vs. Total Energy

Page 8: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

8

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Performance vs. Processors

Page 9: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

9

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Banded Single-Processor Approach

• Compare diagonal and bandedpreconditioner in terms of reducing totaliteration count and cost

• Compare iterative methods (Davidson,GMRES) to benchmark PETSc results

• Evaluate relative cost of banded operationswith sparse-matrix approach in PETSc

Page 10: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

10

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

10- 2

100

102

104

106

108

1010

1012

1014

3 4 5 6 7 8 9 10 11 12 13

TIm

e-to

-sol

n (m

in)

No. of Dimensions

banded pr econdi ti oner

di agonal pr econdi ti oner

1 hour

1 day

512 hour s

Diagonal vs. Banded Diagonal vs. Banded PreconditionerPreconditioner

Page 11: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

11

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

10

100

1000

10000

100000

3 4 4 5 6

Non LU Decomposition

Time (in .01sec)

Tim

e (i

n .0

1sec

)

No. of Dimensi ons

Y = M0*e M1*X

0.053049M02.5021M1

0.99999R

Page 12: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

12

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

10 0

10 1

10 2

10 3

10 4

10 5

3 4 5

Time for LU Preconditioner

Time ( i n .01 sec)Ti

me[

LU]

(.01

sec)

No. of Dimensions

Y = M0* e M1* X

4.0707e- 05M03.9286M1

0.99999R

Page 13: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

13

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Results and Future Work

• Developing global and block orthogonal(W. Poirier) preconditioning methods

• Use SLEPc for Lanczos iteration andtensor products for efficient matrix-vectormultiplies

• Use NLCF IBM BGL to solve 10 DOFproblems

!

Page 14: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

A U.S. Department of EnergyOffice of Science LaboratoryOperated by The University of Chicago

Argonne National Laboratory

Office of ScienceU.S. Department of Energy

Improving the Performance of Tensor MatrixVector Product

Dinesh Kaushik Argonne National Laboratory

Page 15: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

15

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Tensor Matrix Vector Product

• Operator comes from the tensor product of a densematrix with the identity matrix

• Ax, Ay, Az are one directional operators (dense)• v and w are vectors of size n3

Avw =

xyz AIIIAIIIAA !!+!!+!!=

Page 16: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

16

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Two Ways

• Build the large sparse matrix- Large sparse matrix of size (n3 x n3 for 3D case)- Slow memory bandwidth limited performance

• Just evaluate the action of A on v (withoutexplicitly forming A)- Done as dense matrix-matrix multiplication- Very efficient implementation- Huge savings in memory

Page 17: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

17

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Performance Issues for Sparse Matrix Vector Product

• Little data reuse• High ratio of load/store to

instructions/floating-point ops• Stalling of multiple load/store functional

units on the same cache line• Low available memory bandwidth

Page 18: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

18

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Sparse Matrix Vector Algorithm: A GeneralForm

for every row, i { fetch ia(i+1) for j = ia(i) to ia(i + 1) { // loop over the non-zeros of the

row fetch ja(j), a(j), x1(ja(j)), ..…xN(ja(j)) do N fmadd (floating multiply add) } Store y1(i) ..…yN(i) }

Page 19: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

19

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Estimating the Memory Bandwidth Limitation

Assumptions

• Perfect Cache (only compulsory misses; no overhead)• No memory latency• Unlimited number of loads and stores per cycle

Data Volume (AIJ Format)

m*sizeof(int) + N*(m+n)*sizeof(double) // ia, N input (size n) and output (size m) vectors

+ Nnz* (sizeof(int) + sizeof(double))// ja, and a arrays

= 4*(m+nnz) + 8*(N*(m+n)+ Nnz)

Page 20: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

20

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

• Number of Floating-Point Multiply Add (fmadd) Ops = N*nz• For square matrices,

(Since Nnz >> n, Bytes transferred / fmadd ~12/N)

• Similarly, for Block AIJ (BAIJ) format

Estimating the Memory Bandwidth Limitation(Contd.)

Page 21: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

21

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Realistic Measures of Peak PerformanceSparse Matrix Vector Product

One vector, matrix size, m = 90,708, nonzero entries nz = 5,047,120

Page 22: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

22

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Second Choice: Dense Matrix-Matrix Multiplication

• We just need to store the small densematrices of size nxn- for 3 dimensions memory needed is 3n2

- Good ratio of flops to bytes: O(n4) operations O(n3)doubles

- Gets better for higher dimensions

Page 23: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

23

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Evaluating the Tensor Product Terms

• Type 1

• Type 2

• Type 3

- Loop over Type 2 for i = 1, p

[ ] [ ]nxmnxnnmnnm

VAvAI =! )(

[ ] [ ] T

nxnnmxnmnmnAVvIA =! )(

mnmnp vIAI )( !!

Page 24: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

24

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Performance of Tensor Matrix-VectorMultiplication – 3D case

(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)Memory Bandwidth Limited Bound 670 Mflops/s

n

Mflops/s

20 40 60 80 100

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

Custom

MXM

DGEMM

Page 25: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

25

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Performance of Tensor Matrix-Vector Multiplication –Fixed Mesh Points (n=7)

(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)

Dimensions

Mflops/s

3 4 5 6 7 8 9 10

250

500

750

1000

1250

1500

1750

2000

2250

2500

2750

3000

3250

3500

Custom

MXM

DGEMM

Page 26: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

26

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Performance of Tensor Matrix-Vector Multiplication–Long Reaction Co-ordinate

(51 points along reaction path and 7 points in other dimensions)

Dimensions

Mflops/s

3 4 5 6 7 8 9 10

1000

2000

3000

4000

5000

Custom

MXM

DGEMM

Page 27: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

27

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Conclusions and Future Work

• Very efficient implementation- Sparse matvecs take about 80% of execution time- We expect that tensor product implementation can

improve the performance by a factor of three to five• Possible to solve much larger problems

because of huge savings in memoryrequirement

• Parallel implementation

Page 28: M. Minkoff and D. Kaushik - Argonne National Laboratorymmin/MCS/mike_dinesh.pdf · M. Minkoff and D. Kaushik. 2 Pioneering Science and T echnology Office of Science U.S. Department

28

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Acknowledgements

• Barry Smith, William Gropp, and PaulFischer for many helpful discussions