one-sided communication implementation in fmo method j. maki, y. inadomi, t. takami, r. susukita...
TRANSCRIPT
One-sided Communication Implementation in FMO Method
J. Maki, Y. Inadomi, T. Takami, R. Susukita†, H. Honda,J. Ooba, T. Kobayashi, R. Nogita, K. Inoue and M. Aoyagi
Computing and Communications Center, Kyushu University†Fukuoka Industry, Science & Technology Foundation
Overview of FMO method (1)• FMO(Fragment Molecular Orbital method) has been devel
oped by Kitaura (AIST) and co-workers to calculate electronic states of a macromolecule such as a protein and nucleotide
• The target molecule is divided into fragments (monomers)
• ab initio molecular orbital (MO) calculation is carried out on each fragment and fragment pair (dimer).
• Usually executed with high parallel efficiency (>90%)
• The method can execute all-electron calculation on a protein molecule with 10,000 atoms.
Hamiltonian
monomer calculation
Overview of FMO method (2)
1
| |[ ]I J
i i
i ji I J I i j I
H h V
rr r
environmentalelectrostatic potential
(potential from other monomers)
monomer calculation depends on other monomers electron density
electron density of monomer J
,
KKK A
A K i j KA
ZV d
r
r rr R r r
JJ J
iterated until all electron densities unchanged(SCC procedure)
I I I IH E
Overview of FMO method (3)
Schrødinger equation
basis functions
2 I I II I I IIj k ji ki j jk
jk i jk
C C D
MO coefficients
density matrix elements
RHF (Restricted Hartree-Fock) SCF method
Molecular Orbital (MO)
electron density
number of basis functions
Hamiltonian
dimer calculation
( 2)IJ I
I J I
E E N E
Total energy
IJ IJ IJ IJH E
Overview of FMO method (4)
,
1
| |[ ]IJ K
i i
i ji IJ K I J i j IJ
H h V
rr r
environmental electrostatic potential
Schrødinger equation
,
KKK A
A K i j KA
ZV d
r
r rr R r r
not all dimers are calculated by SCF
dimer-es approximation• For distant dimers, SCF calculation are not carried out • Energy is calculated by electrostatic approximation
monomer
SCFdimer
ES (non-SCF)dimer
1 21 2
1 2
I JIJ I JE E E d d
r r
r rr r
SCC procedure
convergence check of SCC procedure
already converged
not yet converged
electronic structure calculation for monomer
initial density calculation 0ID
electronic structure calculation for SCF-dimer
ES dimer calculation
total energy calculationFor all monomers
For all SCF-dimers
For all separated dimers
Flow chart of FMO calculation
( 2)IJ I
I J I
E E N E
I I I IH E
IJ IJ IJ IJH E
1 21 2
1 2
I JIJ I JE E E d d
r r
r rr r
SCC procedure
convergence check of SCC procedure
already converged
not yet converged
electronic structure calculation for monomer
initial density calculation 0ID
electronic structure calculation for SCF-dimer
ES-dimer calculation
total energy calculation
update monomer density matrices
refer monomer matrices needed in electronic structure calculation
refer two monomer matrices needed in dimer-ES calculation
refer monomer matrices needed in electronic structure calculation
update monomer density matrices
For all monomers
For all SCF-dimers
For all separated dimers
Update all monomer density matrices
Refer some monomer density matrices
( 2)IJ I
I J I
E E N E
I I I IH E
IJ IJ IJ IJH E
1 21 2
1 2
I JIJ I JE E E d d
r r
r rr r
Density requirements and update in FMO calculation
4-center two electron integral calculationsfor all environment monomers are prohibitive
,
K I K Iij i j
KI I I IAi j i j
A K i j KA
V V d
Zd d d
r r r r
rr r r r r r rr R r r
,
KI Ii j
I I K Ki j k l K
klk l K
d d
d d D
rr r r r
r r
r r r rr r
r r
KijVmatrix element of KV r (needed in SCF calculation)
involves 4 center tow electron integrals
order of calculation costs number of basis functions
Approximation of calculation (1)
esp-aoc approximation
Approximation of
KS
K K
iiD S
K Ki iK K KA
iiA K i KA
ZV d D S
r r
r rr R r r
calculation (2)
overlapmatrix of fragment K
Mulliken AO population of
K K K K Ki i ii
i K
D S
r r r
K K Kij i jS d r r r( )
Ki r
Approximation of
,
K K K K KA ij jiii
i A i A j K
Q D S D S
K K
K A A
A K A KA A
Z QV
rr R r R
K KA A
A K
Q
r r R
calculation (3)
esp-ptc approximation
Mulliken atomic charge of the nucleus AKAQ
point charge approx. of
number of environment monomersfor which 4-center two electron integralcalculations are carried out
Hypothetical petascale computing environment
Number of fragments : 10,000-100,000
~1,000) (the current largest
Number of CPUs : 1PFlops = 100,000 CPU
(current one CPU peak performance ~10GFlops)
Memory requirement in FMO
RIJ
1,000 350MB 4MB
5,000 1.7GB 95MB
10,000 3.4GB 380MB
50,000 17GB 9.3GB
100,000 34GB 37GB
Number of fragments
It is difficult for each process to store all necessary data
Memory per one CPU core > 10 GB×
The size of one density matrix ≒ 350KBRIJInterfragment distance data ≒ / 2
OSC implementation in OpenFMO (1)
node00
0 1
node01
2 3
node02
4 5
node03
6 7
node04
8 9
Group00 Group01 Group02
• In its execution, only the process of rank 0 is used as a server process for dynamic load balancing. All the other processes are worker processes and are divided into groups and the two-level parallelization is used as the other implementations.
OSC implementation in OpenFMO (2)
node00
0 1
node01
2 3
node02
4 5
node03
6 7
node04
8 9
Group00 Group01 Group02
memory window created by MPI_Win
density matrix data
MPI_Get
Estimation of communication cost (1)Assumptions
1. The sizes of all density matrices are the same.All groups have the same number of worker processes.
2. The time to put or get one density matrix is equal to the average point-to-point communication time .
3. MPI_Bcast implements a binomial tree algorithm so that the time to broadcast one density matrix over processes is obtained by .
4. And for OSC scheme, dynamic load balancing works well and the delay due to competing put or get requests is ignorable. Therefore all groups execute the same number of monomer and dimer jobs and this means that all groups have the same
N
SCF dimers
SCC iterationsneighboring monomers (average)
neighboring monomers (average)
17
20
10,000-100,000
96,000
32
7
12
Estimation of communication cost (2)
monomers with which a monomerforms a SCF dimer
point-to-point communication time to transfer one density matrix
mon monget SCC 4C 1( )fN N N N
number of SCF dimers
total number of dimers
mon dim esget get get get
monSCC 4C
dim dimSCF 4C
1
/ 2 1
( )
( )f
f f f
N N N N
N N N
N N N N N
total number of get requests
Estimation of communication cost (3)
dim dim dimget SCF 4C 2 / 2( )fN N N N
number of monomerget requests
number of SCFdimerget requests
OSC scheme
dimSCF1 / 2 / 2( )f f fN N N N number of ES dimers
es dimget SCF1 / 2 / 2 2( )f f fN N N N N
number of ES dimerget requests
p2p 2 proc group1 log /( )t N N
SCC p2p group/fN N t N
Estimation of communication cost (4)
Bcast scheme
total cost per one group
SCC p2p groupOSC
p2p proc group get group2
/
1 log / /( )fT N N t N
t N N N N
cost of MPI_Bcast
cost for one get requests
total cost of put requests
total cost per one group
SCC p2p procBcast 2logfT N N t N
OSC scheme
Simulation using skelton code (1)• FMO calculation was simulated using skelton code
– quantum calculation parts are removed• Skelton code accumulates estimated time of :
– molecular integral calculation– Fock matrix builging, SCF procedure
• Time of each communication are estimated and accumulated– MPI_Send, MPI_recv, MPI_get, MPI_put
estimated from measurements– MPI_Bcast, MPI_Allreduce
estimated from point to point communication timeby assuming binomial tree, butterfly algorithm
Simulation using skelton code (2)• integral calculation time estimation formula
integrals required for environmental potential
1 electron
integrals required for SCF
• Fock builging time estimation
A=1.12072E-05 B=6.03297E-08 env 2
1e b b bf N AN BN
env 22 3e C b b bf N AN BN
env 22 4e C b b bf N AN BN
2 32e b b bf N AN BN
2 3Fock b b bf N AN BN
2 electron (3center)A=5.28555E-05 B=2.60718E-07
A=7.57123E-05 B=9.33117E-07
A=5.84313E-05 B=1.12603E-06
2 electron (4center)A=0.00050 B=3.26399E-06
simulation using skelton codes (3)
OSC cost < Bcast cost
> 512 with
Aquaporin protein
6-31G* basis set
PDBID 2F2B,
=492, 14,492 atoms
> 32) (
=4, 8, 16, 32, 64
=16
Linux cluster in RIKEN Super Combined Cluster (RSCC) was used
Summary
• OSC of MPI-2 standard has been implemented in a new FMO code to reduce the memory requirement per process.
• These results show OSC scheme is prospective for the petascale computing environment.
• Evaluation of communication costs shows that OSC scheme has an advantage over the Bcast scheme for
=10,000-100,000, =96,000.
by broadcast unnecessary.• OSC implementation also makes the exchange of
• Simulations using skelton code also show that OSC scheme has an advantage in communication cost with large .