m. minkoff and d. kaushik - argonne national laboratorymmin/mcs/mike_dinesh.pdf · m. minkoff and...

1

PioneeringScience andTechnology

Office of Science U.S. Department

of Energy

Using Tensor Methods, PETSc, and SLEPc to ObtainExact Cumulative Reaction Probability

M. Minkoff and D. Kaushik

2



of Energy

Outline

• Overview• Description of CRP Simulation Problem• Using PETSc and SLEPc for Application to CRP• Computations for Sparse Matrices• Results for Banded Preconditioning• Future Directions• Tensor Matrix Multiplication

3



of Energy

Chemical Dynamics Theory3 angles, 3 stretches6 degrees of freedom

4



of Energy

Chemical Dynamics Theory• Reaction rates are related to

- Cumulative Reaction Probability (CRP), N(E)N(E) = 4 Tr [ εr

1/2 G(E)† εp G(E) εr1/2 ]

• Where Tr is the trace of the matrix, † is the adjoint, εr and εp are the

absorbing potential in the reactant and product regions.• εr

+ εp = ε, the given total absorbing potential.The Green’s functions have the form:

G(E) = (E + iε - H)-1 , where i imaginary and H is the Hamiltonian.

We need to solve two linear systems (at each iteration):

(E + iε - H) y = x and it’s adjoint where x is known.

This system is solved via GMRES with preconditioning methods (initiallydiagonal scaling).

5



of Energy

Hamiltonian Sparsity Pattern

• Sparsity for d=2 and d>2

6



of Energy

Non-zero Storage for Green’s Matrix

0.01

0.1

1

10

100

1000

104

1 10

Storage Required for F=3.5 Cutoff, .32ev

(3% Accuracy)

Matrix Size (MWords)Extrapolation Curve

Matrix Size (MWords)

Dimension

7



of Energy

Eigenvalues vs. Total Energy

8



of Energy

Performance vs. Processors

9



of Energy

Banded Single-Processor Approach

• Compare diagonal and bandedpreconditioner in terms of reducing totaliteration count and cost

• Compare iterative methods (Davidson,GMRES) to benchmark PETSc results

• Evaluate relative cost of banded operationswith sparse-matrix approach in PETSc

10



of Energy

10- 2

100

102

104

106

108

1010

1012

1014

3 4 5 6 7 8 9 10 11 12 13

TIm

e-to

-sol

n (m

in)

No. of Dimensions

banded pr econdi ti oner

di agonal pr econdi ti oner

1 hour

1 day

512 hour s

Diagonal vs. Banded Diagonal vs. Banded PreconditionerPreconditioner

11



of Energy

10

100

1000

10000

100000

3 4 4 5 6

Non LU Decomposition

Time (in .01sec)

Tim

e (i

n .0

1sec

)

No. of Dimensi ons

Y = M0*e M1*X

0.053049M02.5021M1

0.99999R

12



of Energy

10 0

10 1

10 2

10 3

10 4

10 5

3 4 5

Time for LU Preconditioner

Time ( i n .01 sec)Ti

me[

LU]

(.01

sec)

No. of Dimensions

Y = M0* e M1* X

4.0707e- 05M03.9286M1

0.99999R

13



of Energy

Results and Future Work

• Developing global and block orthogonal(W. Poirier) preconditioning methods

• Use SLEPc for Lanczos iteration andtensor products for efficient matrix-vectormultiplies

• Use NLCF IBM BGL to solve 10 DOFproblems

!

A U.S. Department of EnergyOffice of Science LaboratoryOperated by The University of Chicago

Argonne National Laboratory

Office of ScienceU.S. Department of Energy

Improving the Performance of Tensor MatrixVector Product

Dinesh Kaushik Argonne National Laboratory

15



of Energy

Tensor Matrix Vector Product

• Operator comes from the tensor product of a densematrix with the identity matrix

• Ax, Ay, Az are one directional operators (dense)• v and w are vectors of size n3

Avw =

xyz AIIIAIIIAA !!+!!+!!=

16



of Energy

Two Ways

• Build the large sparse matrix- Large sparse matrix of size (n3 x n3 for 3D case)- Slow memory bandwidth limited performance

• Just evaluate the action of A on v (withoutexplicitly forming A)- Done as dense matrix-matrix multiplication- Very efficient implementation- Huge savings in memory

17



of Energy

Performance Issues for Sparse Matrix Vector Product

• Little data reuse• High ratio of load/store to

instructions/floating-point ops• Stalling of multiple load/store functional

units on the same cache line• Low available memory bandwidth

18



of Energy

Sparse Matrix Vector Algorithm: A GeneralForm

for every row, i { fetch ia(i+1) for j = ia(i) to ia(i + 1) { // loop over the non-zeros of the

row fetch ja(j), a(j), x1(ja(j)), ..…xN(ja(j)) do N fmadd (floating multiply add) } Store y1(i) ..…yN(i) }

19



of Energy

Estimating the Memory Bandwidth Limitation

Assumptions

• Perfect Cache (only compulsory misses; no overhead)• No memory latency• Unlimited number of loads and stores per cycle

Data Volume (AIJ Format)

m*sizeof(int) + N*(m+n)*sizeof(double) // ia, N input (size n) and output (size m) vectors

+ Nnz* (sizeof(int) + sizeof(double))// ja, and a arrays

= 4*(m+nnz) + 8*(N*(m+n)+ Nnz)

20



of Energy

• Number of Floating-Point Multiply Add (fmadd) Ops = N*nz• For square matrices,

(Since Nnz >> n, Bytes transferred / fmadd ~12/N)

• Similarly, for Block AIJ (BAIJ) format

Estimating the Memory Bandwidth Limitation(Contd.)

21



of Energy

Realistic Measures of Peak PerformanceSparse Matrix Vector Product

One vector, matrix size, m = 90,708, nonzero entries nz = 5,047,120

22



of Energy

Second Choice: Dense Matrix-Matrix Multiplication

• We just need to store the small densematrices of size nxn- for 3 dimensions memory needed is 3n2

- Good ratio of flops to bytes: O(n4) operations O(n3)doubles

- Gets better for higher dimensions

23



of Energy

Evaluating the Tensor Product Terms

• Type 1

• Type 2

• Type 3

- Loop over Type 2 for i = 1, p

[ ] [ ]nxmnxnnmnnm

VAvAI =! )(

[ ] [ ] T

nxnnmxnmnmnAVvIA =! )(

mnmnp vIAI )( !!

24



of Energy

Performance of Tensor Matrix-VectorMultiplication – 3D case

(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)Memory Bandwidth Limited Bound 670 Mflops/s

n

Mflops/s

20 40 60 80 100

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

Custom

MXM

DGEMM

25



of Energy

Performance of Tensor Matrix-Vector Multiplication –Fixed Mesh Points (n=7)

(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)

Dimensions

Mflops/s

3 4 5 6 7 8 9 10

250

500

750

1000

1250

1500

1750

2000

2250

2500

2750

3000

3250

3500

Custom

MXM

DGEMM

26



of Energy

Performance of Tensor Matrix-Vector Multiplication–Long Reaction Co-ordinate

(51 points along reaction path and 7 points in other dimensions)

Dimensions

Mflops/s

3 4 5 6 7 8 9 10

1000

2000

3000

4000

5000

Custom

MXM

DGEMM

27



of Energy

Conclusions and Future Work

• Very efficient implementation- Sparse matvecs take about 80% of execution time- We expect that tensor product implementation can

improve the performance by a factor of three to five• Possible to solve much larger problems

because of huge savings in memoryrequirement

• Parallel implementation

28



of Energy

Acknowledgements

• Barry Smith, William Gropp, and PaulFischer for many helpful discussions

m. minkoff and d. kaushik - argonne national laboratorymmin/mcs/mike_dinesh.pdf · m. minkoff and...

Documents