Toward an Automatically Tuned Dense Symmetric Eigensolver
for Shared Memory Machines
Yusaku Yamamoto
Dept. of Computational Science & Engineering, Nagoya University
Outline of the talk
1. Introduction
2. High performance tridiagonalization algorithms
3. Performance comparison of two tridiagonalization algorithms
4. Optimal choice of algorithm and its parameters
5. Conclusion
Introduction
• Target problem
– Standard eigenvalue problem Ax = λx
• A: n-by-n real symmetric matrix
• We assume that A is dense.
• Applications
– Molecular orbital methods
• Solution of dense eigenproblems of order more than 100,000 is required to compute the electronic structure of large protein molecules.
– Principal component analysis
Target machines
• Symmetric Multi-Processors (SMPs)
• Background
– PC servers with 4 to 8 processors are quite common now.
– The number of CPU cores integrated into one chip is rapidly increasing.
[Photos: Cell processor (1+8 cores); dual-core Xeon]
Flow chart of a typical eigensolver for dense symmetric matrices
Starting from the real symmetric matrix A:
1. Tridiagonalization: QᵗAQ = T (Q: orthogonal) produces the tridiagonal matrix T.
2. Eigensolution of the tridiagonal matrix: Tyᵢ = λᵢyᵢ gives the eigenvalues {λᵢ} and eigenvectors {yᵢ} of T.
3. Back transformation: xᵢ = Qyᵢ gives the eigenvectors {xᵢ} of A.
Computation and algorithm:
– Tridiagonalization: Householder method
– Eigensolution of the tridiagonal matrix: QR method, D & C method, Bisection & inverse iteration, MRRR (MR³) algorithm
– Back transformation: application of the stored Householder transformations
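A minimal sketch of this three-phase flow, using generic SciPy routines as stand-ins for the tuned kernels discussed in this talk (the routine choices here are illustrative, not the algorithms compared below):

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

def dense_symmetric_eig(A):
    """Three-phase flow of a dense symmetric eigensolver."""
    # 1. Tridiagonalization: Q^t A Q = T (the Hessenberg form of a
    #    symmetric matrix is tridiagonal).
    T, Q = hessenberg(A, calc_q=True)
    d = np.diag(T).copy()      # diagonal of T
    e = np.diag(T, -1).copy()  # subdiagonal of T
    # 2. Eigensolution of the tridiagonal matrix: T y_i = lambda_i y_i.
    lam, Y = eigh_tridiagonal(d, e)
    # 3. Back transformation: x_i = Q y_i.
    X = Q @ Y
    return lam, X
```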
Objective of this study
• Study the performance of tridiagonalization and back transformation algorithms on various SMP machines.
• The optimal choice of algorithm and its parameters will differ greatly depending on the computational environment and the problem size.
• The obtained data will be useful in designing an automatically tuned eigensolver for SMP machines.
Algorithms for tridiagonalization and back transformation
• Tridiagonalization by the Householder method
– Reduction by Householder transformations H = I – uuᵗ (u scaled so that ‖u‖² = 2)
• H is a symmetric orthogonal matrix.
• H eliminates all but the first element of a vector.
• Computation at the k-th step (figure below)
• Back transformation
– Apply the Householder transformations in reverse order.
[Figure: at the k-th step, multiplying H from the left eliminates the elements of column k below the subdiagonal, and multiplying H from the right eliminates the corresponding row elements; the elements of the trailing submatrix are modified, and the remaining nonzero elements form the unreduced part.]
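For reference, a minimal NumPy sketch of this conventional unblocked reduction (one common textbook formulation; the normalization and sign conventions are choices made here, not taken from the talk):

```python
import numpy as np

def householder_tridiagonalize(A):
    """Reduce a dense symmetric matrix to tridiagonal form.

    Unblocked (level-2 BLAS) sketch: at step k a reflector
    H = I - beta*u*u^t eliminates A[k+2:, k] and is applied from
    both sides via a symmetric rank-2 update.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k+1:, k]
        # Choose the sign of alpha to avoid cancellation.
        alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
        u = x.copy()
        u[0] -= alpha
        unorm2 = u @ u
        if unorm2 == 0.0:
            continue  # column already in the desired form
        beta = 2.0 / unorm2
        # Matrix-vector multiplication: (2/3)n^3 of the total work.
        p = beta * (A[k+1:, k+1:] @ u)
        w = p - 0.5 * beta * (u @ p) * u
        # Symmetric rank-2 update: (2/3)n^3 of the total work.
        A[k+1:, k+1:] -= np.outer(u, w) + np.outer(w, u)
        A[k+1:, k] = 0.0
        A[k, k+1:] = 0.0
        A[k+1, k] = alpha
        A[k, k+1] = alpha
    return A
```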
Problems with the conventional Householder method
• Characteristics of the algorithm
– Almost all the computational work is done in matrix-vector multiplication and rank-2 updates, both of which are level-2 BLAS.
• Total computational work: (4/3)n³
• Matrix-vector multiplication: (2/3)n³
• Rank-2 update: (2/3)n³
• Problems
– Poor data reuse due to the level-2 BLAS
• The effect of cache misses and memory access contention among the processors is large.
– Cannot attain high performance on modern SMP machines.
High performance tridiagonalization algorithms (I)
• Dongarra’s algorithm (Dongarra et al., 1992)
– Defer the application of the Householder transformations until several (M) transformations are ready.
– The rank-2 update can then be replaced by a rank-2M update, enabling more efficient use of cache memory.
– Adopted by LAPACK, ScaLAPACK and other libraries.
• Back transformation
– Can be blocked in the same manner to use the level-3 BLAS.
A^(K·M) = A^((K–1)·M) – U^(K·M) (Q^(K·M))ᵗ – Q^(K·M) (U^(K·M))ᵗ
– a rank-2M update, computed with the level-3 BLAS
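In code, this deferred update is just two matrix-matrix multiplications; a minimal sketch (array names mirror the equation above and are otherwise illustrative):

```python
import numpy as np

def rank_2m_update(A22, U, Q):
    """Dongarra-style deferred update:
        A22 <- A22 - U @ Q^t - Q @ U^t
    U and Q hold the M accumulated update vectors as columns, so
    each product is a single level-3 BLAS (GEMM) call instead of
    2M separate rank-1 (level-2) updates.
    """
    A22 -= U @ Q.T
    A22 -= Q @ U.T
    return A22
```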
Properties of Dongarra’s algorithm
• Computational work
– Tridiagonalization: (4/3)n³
– Back transformation: 2n²m (m: number of eigenvectors)
• Parameters to be determined
– MT (block size for tridiagonalization)
– MB (block size for back transformation)
• Level-3 aspects
– In the tridiagonalization phase, only half of the total work can be done with the level-3 BLAS.
– The other half is level-2 BLAS, thereby limiting the performance on SMP machines.
High performance tridiagonalization algorithms (II)
• Two-step tridiagonalization (Bischof et al., 1993, 1994)
– First, transform A into a band matrix B (of half bandwidth L).
• This can be done almost entirely with the level-3 BLAS.
– Next, transform B into a tridiagonal matrix T.
• The work is O(n²L) and is much smaller than that of the first step.
• Transformation into a band matrix
[Figure: A → B (reduction to band form, (4/3)n³ work) → T (Murata’s algorithm, O(n²L) work); B has half bandwidth L.]
[Figure: elimination pattern of Murata’s algorithm: H is multiplied from the left and from the right, zeroing the elements to be eliminated and modifying the nearby nonzero elements.]
High performance tridiagonalization algorithms (II)
• Wu’s improvement (Wu et al., 1996)
– Defer the application of the block Householder transformations until several (MT) transformations are ready.
– The rank-2L update can then be replaced by a rank-2LMT update, enabling more efficient use of cache memory.
• Back transformation
– Two-step back transformation is necessary.
• 1st step: eigenvectors of T → eigenvectors of B
• 2nd step: eigenvectors of B → eigenvectors of A
– Both steps can be reorganized to use the level-3 BLAS.
– In the 2nd step, several (MB) block Householder transformations can be aggregated to enhance cache utilization (a sketch follows).
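A sketch of such a level-3 back transformation, assuming each block Householder transformation is stored in compact WY form Q_j = I – V_j T_j V_jᵗ (a standard representation; the talk does not specify the storage format, and the MB-way aggregation is omitted for brevity):

```python
import numpy as np

def back_transform(Y, blocks):
    """Apply stored block Householder transformations to the
    eigenvector matrix Y using only matrix-matrix products.

    blocks: list of (V, T) compact-WY factors in the order they
    were generated during the reduction; they are applied in
    reverse order, as described above.
    """
    X = Y.copy()
    for V, T in reversed(blocks):
        # X <- (I - V T V^t) X : three GEMMs, all level-3 BLAS
        X -= V @ (T @ (V.T @ X))
    return X
```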
Properties of Bischof’s algorithm
• Computational work
– Reduction to a band matrix: (4/3)n³
– Murata’s algorithm: 6n²L
– Back transformation
• 1st step: 2n²m
• 2nd step: 2n²m
• Parameters to be determined
– L (half bandwidth of the intermediate band matrix)
– MT (block size for tridiagonalization)
– MB (block size for back transformation)
• Level-3 aspects
– All the computations that require O(n³) work can be done with the level-3 BLAS.
Performance comparison of two tridiagonalization algorithms
• We compare the performance of Dongarra’s and Bischof’s algorithms under the following conditions:
– Computational environments
• Xeon (2.8 GHz), Opteron (1.8 GHz), Fujitsu PrimePower HPC2500, IBM pSeries (Power5, 1.9 GHz)
– Number of processors
• 1 to 32 (depending on the target machine)
– Problem size
• n = 1500, 3000, 6000, 12000 and 24000
• All eigenvalues & eigenvectors, or eigenvalues only
– Parameters in the algorithms
• Nearly optimal values chosen for each environment and matrix size.
Performance on the Xeon (2.8GHz) processor
Time for computing all the eigenvalues and eigenvectors (n = 1500, 3000, 6000):
[Charts: execution time (sec.) vs. number of processors (1, 2, 4), Bischof vs. Dongarra]
Time for computing all the eigenvalues only (n = 1500, 3000, 6000):
[Charts: execution time (sec.) vs. number of processors (1, 2, 4), Bischof vs. Dongarra]
Bischof’s algorithm is preferable when only the eigenvalues are needed.
Performance on the Opteron (1.8GHz) processor
Time for computing all the eigenvalues and eigenvectors (n = 1500, 3000, 6000):
[Charts: execution time (sec.) vs. number of processors (1, 2, 4), Bischof vs. Dongarra]
Time for computing all the eigenvalues only (n = 1500, 3000, 6000):
[Charts: execution time (sec.) vs. number of processors (1, 2, 4), Bischof vs. Dongarra]
Bischof’s algorithm is preferable in the 4-processor case even when all the eigenvalues and eigenvectors are needed.
Performance on the Fujitsu PrimePower HPC2500
Time for computing all the eigenvalues and eigenvectors (n = 6000, 12000, 24000):
[Charts: execution time (sec.) vs. number of processors (1 to 32), Bischof vs. Dongarra]
Time for computing all the eigenvalues only (n = 6000, 12000, 24000):
[Charts: execution time (sec.) vs. number of processors (1 to 32), Bischof vs. Dongarra]
Bischof’s algorithm is preferable when the number of processors is greater than 2, even for computing all the eigenvectors.
Performance on the IBM pSeries (Power5, 1.9GHz)
Time for computing all the eigenvalues and eigenvectors (n = 6000, 12000, 24000):
[Charts: execution time (sec.) vs. number of processors (1 to 16), Bischof vs. Dongarra]
Time for computing all the eigenvalues only (n = 6000, 12000, 24000):
[Charts: execution time (sec.) vs. number of processors (1 to 16), Bischof vs. Dongarra]
Bischof’s algorithm is preferable when only the eigenvalues are needed.
Optimal algorithm as a function of the number of processors and necessary eigenvectors
• Xeon (2.8GHz)
• Opteron (1.8GHz)
[Charts for each machine: optimal algorithm (Bischof vs. Dongarra) as a function of the fraction of eigenvectors needed (0–100%) and the number of processors (1, 2, 4); n = 1500, 3000, 6000]
Optimal algorithm as a function of the number of processors and necessary eigenvectors (Cont’d)
• PrimePower HPC2500
• pSeries
[Charts for each machine: optimal algorithm (Bischof vs. Dongarra) as a function of the fraction of eigenvectors needed (0–100%) and the number of processors (1 to 32 on the HPC2500, 1 to 16 on the pSeries); n = 6000, 12000, 24000]
Optimal choice of algorithm and its parameters
• To construct an automatically tuned eigensolver, we need to select the optimal algorithm and its parameters according to the computational environment and the problem size.
• To do this efficiently, we need an accurate and inexpensive performance model that predicts the performance of an algorithm given the computational environment, problem size and the parameter values.
• A promising way to achieve this goal seems to be the hierarchical modeling approach (Cuenca et al., 2004), sketched below:
– Model the execution time of BLAS primitives accurately based on the measured data.
– Predict the execution time of the entire algorithm by summing up the execution times of the BLAS primitives.
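A minimal sketch of the two levels; the sampled shapes, the linear cost model, and the schematic step loop are illustrative assumptions, not the model actually used in the talk:

```python
import time
import numpy as np

def time_gemm(m, n, k, trials=3):
    """Measure one BLAS primitive (GEMM) for a given shape."""
    A, B = np.random.rand(m, k), np.random.rand(k, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        A @ B
        best = min(best, time.perf_counter() - t0)
    return best

def fit_gemm_model(shapes):
    """Level 1: fit t ~ c0 + c1*flops to the measured data."""
    flops = np.array([2.0 * m * n * k for m, n, k in shapes])
    times = np.array([time_gemm(m, n, k) for m, n, k in shapes])
    X = np.column_stack([np.ones_like(flops), flops])
    c, *_ = np.linalg.lstsq(X, times, rcond=None)
    return c  # (c0, c1)

def predict_band_reduction(n, L, MT, c):
    """Level 2: sum the primitive model over the GEMM calls the
    blocked reduction would issue (schematic: the trailing block
    shrinks by L*MT columns per aggregated rank-2LMT update)."""
    c0, c1 = c
    total, k = 0.0, n
    while k > L * MT:
        total += c0 + c1 * (4.0 * k * k * L * MT)  # two GEMMs
        k -= L * MT
    return total
```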
Performance prediction of Bischof’s algorithm
• Predicted and measured execution time of reduction to a band matrix (Opteron, n=1920)
• The model reproduces the variation of the execution time as a function of L and MT fairly well.
• Prediction errors: less than 10%
• Prediction time: 0.5 s (for all the values of L and MT)
[Charts: predicted and measured time (s) vs. MT (1 to 64), one curve per L = 3, 6, 12, 24, 48, 96]
Automatic optimization of parameters L and MT
• Performance for the optimized values of (L, MT)
• The values of (L , MT) determined from the model were actually optimal for almost all cases.
• Parameter optimization in the other phases and selection of the optimal algorithm are in principle possible using the same methodology (a search sketch follows the chart below).
[Chart: performance (MFLOPS) vs. matrix size n (960 to 9600) for fixed (L, MT) = (12,8), (24,4), (48,4), and for the auto-tuned and optimal parameter values]
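Given the fitted model, parameter selection reduces to an argmin over a candidate grid; a sketch reusing `predict_band_reduction` from the previous sketch (the candidate grids here are illustrative):

```python
def tune_parameters(n, c, Ls=(3, 6, 12, 24, 48, 96),
                    MTs=(1, 2, 4, 8, 16, 32, 64)):
    """Exhaustive search over the (L, MT) grid using the model;
    evaluating every candidate is cheap (the talk reports about
    0.5 s for all values of L and MT)."""
    return min(((L, MT) for L in Ls for MT in MTs),
               key=lambda p: predict_band_reduction(n, p[0], p[1], c))
```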
Conclusion
• We studied the performance of Dongarra’s and Bischof’s tridiagonalization algorithms on various SMP machines for various problem sizes.
• The results show that the optimal algorithm strongly depends on the target machine, the number of processors and the number of necessary eigenvectors. On the Opteron and HPC2500, Bischof’s algorithm is always preferable when the number of processors exceeds 2.
• By using the hierarchical modeling approach, it seems possible to predict the optimal algorithm and its parameters prior to execution. This will open the way to an automatically tuned eigensolver.