Tensor contraction engine & extensible many-electron theory module in NWChem
So Hirata
Pacific Northwest National Laboratory
MSS group meeting (24 Oct, 2002)
Collaborators & Sponsors
• M. Nooijen (Princeton University)
• R. J. Harrison & D. Bernholdt (Oak Ridge National Laboratory)
• D. Cociorva, G. Baumgartner, R. Pitzer, & P. Sadayappan (Ohio State University)
• J. Ramanujam (Louisiana State University)
• Office of Basic Energy Science, Department of Energy
• Office of Biological and Environmental Research, Department of Energy
Purpose of this project
• Create a high-level symbolic manipulation language that derives working equations of second-quantized many-electron theories and implements them automatically
  • Expedites complex and error-prone many-electron theory implementation
  • Helps develop and examine new theories or approximations
  • Facilitates parallelization and other laborious code optimizations
  • CCSDT T3 code is >18000 lines in Fortran77!
Operator contraction engine (OCE)
• Object-oriented symbolic manipulation program that derives working equations from any well-defined second-quantized many-electron theory ansatz
• Performs valid contractions of normal-ordered operators according to Wick’s theorem and reduces any given ansatz into the simplest form of tensor contraction expressions
• Consolidates identical terms and recognizes terms that are related by permutation symmetry
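As a rough illustration of the bookkeeping behind Wick's theorem, the sketch below evaluates the true-vacuum expectation value of a string of creation and annihilation operators by enumerating all full contractions. It is a toy under stated assumptions, not the OCE's algorithm: it uses the physical vacuum instead of normal ordering relative to a reference determinant, and the names `pairings`, `crossings`, and `vacuum_expectation` are hypothetical.

```python
def pairings(positions):
    """Yield every way of splitting the positions into unordered pairs."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def crossings(pairs):
    """Count pairs of contraction lines that cross each other."""
    count = 0
    for x in range(len(pairs)):
        for y in range(x + 1, len(pairs)):
            (a, b), (c, d) = pairs[x], pairs[y]
            if a < c < b < d or c < a < d < b:
                count += 1
    return count


def vacuum_expectation(ops):
    """Return a sorted list of (sign, deltas) terms for <0| ops |0>.

    Each operator is ("annihilate", label) or ("create", label).
    A contraction is non-zero only when an annihilator stands to the
    left of a creator; it then contributes delta(label_i, label_j).
    """
    terms = []
    for pairs in pairings(list(range(len(ops)))):
        deltas = []
        for i, j in pairs:
            (kind_i, label_i), (kind_j, label_j) = ops[i], ops[j]
            if kind_i == "annihilate" and kind_j == "create":
                deltas.append((label_i, label_j))
            else:
                break  # this full contraction vanishes
        else:
            sign = -1 if crossings(pairs) % 2 else 1
            terms.append((sign, tuple(sorted(deltas))))
    return sorted(terms)


# <0| a_p a_q a+_r a+_s |0>
example = vacuum_expectation([
    ("annihilate", "p"), ("annihilate", "q"),
    ("create", "r"), ("create", "s"),
])
```

For the four-operator string above, `example` holds two terms: minus delta(p,r) delta(q,s) and plus delta(p,s) delta(q,r).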
Tensor contraction engine (TCE)
• Object-oriented symbolic manipulation program that analyzes tensor contraction expressions and implements them as efficient programs
• Breaks down multiple tensor contractions (A=BCDE) into a sequence of elementary tensor contractions (X=DE; Y=BX; A=YC) with minimal operation costs
• Factorizes common contractions [X=BC+BD into X=B(C+D)]
• Generates debug-level Fortran90 programs and release-level parallel Fortran77 programs
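The breakdown of a multiple contraction into binary contractions can be sketched as an exhaustive search over pairwise contraction orders. This is only a sketch: the cost model (product of the ranges of all indices touched by a binary contraction), the function name `best_order`, and the example dimensions are assumptions, and the slides do not say which search strategy TCE actually uses.

```python
from itertools import combinations


def best_order(tensors, dims, keep):
    """Find the cheapest sequence of binary contractions.

    tensors: dict mapping a tensor name to a frozenset of index labels
    dims:    dict mapping an index label to its range
    keep:    set of index labels carried by the final result
    Returns (total operation count, parenthesized contraction order).
    """
    def solve(items):
        if len(items) == 1:
            return 0, items[0][0]
        best = None
        for i, j in combinations(range(len(items)), 2):
            (name_a, idx_a), (name_b, idx_b) = items[i], items[j]
            others = [items[k] for k in range(len(items)) if k not in (i, j)]
            # Cost of this binary step: loop over every index involved.
            cost = 1
            for label in idx_a | idx_b:
                cost *= dims[label]
            # An index survives if the result or another tensor needs it.
            needed = set(keep)
            for _, idx_other in others:
                needed |= idx_other
            result = frozenset((idx_a | idx_b) & needed)
            merged = ("(" + name_a + name_b + ")", result)
            sub_cost, order = solve(tuple(others) + (merged,))
            if best is None or cost + sub_cost < best[0]:
                best = (cost + sub_cost, order)
        return best

    return solve(tuple(sorted(tensors.items())))


# Hypothetical three-tensor chain A(i,l) = sum_{j,k} B(i,j) C(j,k) D(k,l)
dims = {"i": 10, "j": 100, "k": 5, "l": 50}
tensors = {
    "B": frozenset("ij"),
    "C": frozenset("jk"),
    "D": frozenset("kl"),
}
cost, order = best_order(tensors, dims, keep={"i", "l"})
```

With these made-up ranges, contracting B with C first costs 7500 operations in total, whereas contracting C with D first costs 75000, so the search picks the former.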
OCE & TCE demonstration
What is new?
• Full exploitation of index permutation symmetry
  • Taking advantage of spin and spatial symmetry as well, the programs generated by TCE are theoretically operation cost minimal
• OCE extracts permutation symmetries among working equations
• TCE breaks down permutation operators into elementary permutation operators, analyzes which permutation symmetries can be exploited, and reflects the result in the generated code
Permutation symmetry
• Primitive tensors that appear in many-electron theories possess “permutation anti-symmetry.” For example, a two-electron integral tensor and a three-electron excitation amplitude tensor have the following properties:

$$v^{pq}_{rs} = -v^{pq}_{sr} = -v^{qp}_{rs} = v^{qp}_{sr}$$

$$t^{abc}_{ijk} = -t^{abc}_{ikj} = -t^{abc}_{jik} = t^{abc}_{jki} = t^{abc}_{kij} = -t^{abc}_{kji}$$

and likewise for each of the six orderings of the superscripts ($abc$, $acb$, $bac$, $bca$, $cab$, $cba$), giving 36 equivalent elements whose signs are the product of the parities of the superscript and subscript permutations.
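The 36 equivalent elements of a three-electron amplitude can be mapped onto one stored element plus a sign. The following minimal sketch shows that mapping; the helper names `canonicalize` and `equivalent_element` are hypothetical and not taken from TCE.

```python
from itertools import permutations


def canonicalize(indices):
    """Sort an anti-symmetric index group; return (sorted tuple, sign).

    The sign is the parity of the permutation that sorts the indices,
    or 0 when an index repeats (the element then vanishes).
    """
    idx = list(indices)
    if len(set(idx)) < len(idx):
        return tuple(sorted(idx)), 0
    inversions = sum(
        1
        for x in range(len(idx))
        for y in range(x + 1, len(idx))
        if idx[x] > idx[y]
    )
    return tuple(sorted(idx)), (-1 if inversions % 2 else 1)


def equivalent_element(upper, lower):
    """Map t^{upper}_{lower} onto its stored representative and a sign."""
    up, sign_up = canonicalize(upper)
    low, sign_low = canonicalize(lower)
    return up, low, sign_up * sign_low


# All 36 orderings of t^{abc}_{ijk} reduce to one stored element.
images = [
    equivalent_element(u, l)
    for u in permutations("abc")
    for l in permutations("ijk")
]
```

Every entry of `images` points at the same representative, with upper indices `('a', 'b', 'c')` and lower indices `('i', 'j', 'k')`; half of them carry a plus sign and half a minus sign.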
Implication
• Reduced storage size
  • Instead of storing full $t^{ab}_{ij}$, we may keep only $t^{a<b}_{i<j}$
• Reduced operation cost by shorter summation index ranges

$$\sum_{c,d} v^{ab}_{cd}\, t^{cd}_{ij} = 2 \sum_{c<d} v^{ab}_{cd}\, t^{cd}_{ij}$$

• Reduced operation cost by shorter target index ranges
  • Instead of computing full $2 \sum_{c<d} v^{ab}_{cd}\, t^{cd}_{ij}$, we may obtain just $2 \sum_{c<d} v^{a<b}_{cd}\, t^{cd}_{i<j}$
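The shorter summation range can be checked numerically. In the sketch below the target indices are held fixed, so both tensors reduce to anti-symmetric matrices in the summation pair (c, d); the random data and the helper name `antisymmetric` are made up for the illustration.

```python
import random


def antisymmetric(n, rng):
    """Return an n x n matrix m with m[c][d] == -m[d][c]."""
    m = [[0.0] * n for _ in range(n)]
    for c in range(n):
        for d in range(c + 1, n):
            value = rng.uniform(-1.0, 1.0)
            m[c][d] = value
            m[d][c] = -value
    return m


rng = random.Random(0)
n = 6
v = antisymmetric(n, rng)  # stands for v^{ab}_{cd} at fixed a, b
t = antisymmetric(n, rng)  # stands for t^{cd}_{ij} at fixed i, j

# Full double sum over c and d.
full = sum(v[c][d] * t[c][d] for c in range(n) for d in range(n))

# Restricted sum over c < d, multiplied by 2.
restricted = 2 * sum(
    v[c][d] * t[c][d] for c in range(n) for d in range(c + 1, n)
)

stored_pairs = n * (n - 1) // 2  # non-redundant (c, d) pairs
```

The two sums agree to rounding error, while the restricted form visits 15 index pairs instead of 36 for `n = 6`.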
Challenges
• What is the index permutation symmetry of an intermediate tensor?
  • Consider the intermediate $i^{ab}_{ij} = t^{a}_{i}\, t^{b}_{j}$
• What is the best way to store just the non-redundant elements of tensors?
• What is the operation cost minimal contraction of two tensors with permutation symmetry?
• How can TCE generate code that exploits spin, spatial, and permutation symmetries at the same time?
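To see why the first question is not trivial, consider an intermediate built as a product of two single-excitation amplitudes, i(a,b,i,j) = t(a,i) t(b,j). The sketch below, with made-up random amplitudes and hypothetical names, checks which symmetries such a product has.

```python
import random

rng = random.Random(1)
n_vir, n_occ = 4, 3

# t1[a][i]: made-up single-excitation amplitudes
t1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_occ)] for _ in range(n_vir)]


def intermediate(a, b, i, j):
    """Product intermediate i^{ab}_{ij} = t^a_i * t^b_j."""
    return t1[a][i] * t1[b][j]


index_sets = [
    (a, b, i, j)
    for a in range(n_vir)
    for b in range(n_vir)
    for i in range(n_occ)
    for j in range(n_occ)
]

# Swapping the superscripts and the subscripts together changes nothing.
symmetric_under_joint_swap = all(
    abs(intermediate(a, b, i, j) - intermediate(b, a, j, i)) < 1e-12
    for a, b, i, j in index_sets
)

# Swapping only the superscripts does not simply flip the sign.
antisymmetric_in_superscripts = all(
    abs(intermediate(a, b, i, j) + intermediate(b, a, i, j)) < 1e-12
    for a, b, i, j in index_sets
)
```

The product is symmetric under the simultaneous swap of both index pairs but is not anti-symmetric in the superscripts alone, so it cannot be stored with the a<b, i<j scheme used for the primitive tensors.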
Index permutation symmetry versus permutation symmetry of tensor contraction expressions
• Index permutation anti-symmetry ultimately reflects the Pauli principle of fermions; any tensor having electron indices (such as integrals, excitation amplitudes) is anti-symmetric
• When there is such a multiple tensor contraction as

$$i^{abc}_{ijk} = \sum_{d,n,m} t^{abd}_{jkn}\, t^{c}_{m}\, v^{mn}_{id}$$

there “must” be also

$$P(ca)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(cb)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(ji)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(ki)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},$$

$$P(ca)P(ji)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(cb)P(ji)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(ca)P(ki)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id},\quad P(cb)P(ki)\, t^{abd}_{jkn} t^{c}_{m} v^{mn}_{id}$$
Break down of permutation operators
• When breaking down a multiple tensor contraction into a sequence of binary tensor contractions, we should break down the permutation operators appropriately, so that each intermediate has maximum index permutation symmetries

$$r^{abc}_{ijk} = P(ab/c)\, P(jk/i)\, t^{abd}_{jkn}\, t^{c}_{m}\, v^{mn}_{id}$$

$$r^{abc}_{ijk} = P(ab/c) \sum_{m} t^{c}_{m}\, i^{abm}_{ijk}, \qquad i^{abm}_{ijk} = P(jk/i) \sum_{n,d} t^{abd}_{jkn}\, v^{mn}_{id}$$
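The role of a composite permutation operator can be illustrated on a small example. The sketch assumes that an operator of the type P(ab/c) stands for the three-term anti-symmetrizer 1 - P(ac) - P(bc), where each P swaps two index labels; the slides do not spell out the sign convention, so treat that expansion and all names below as assumptions.

```python
import random

rng = random.Random(2)
n = 4

# A[a][b]: made-up tensor that is already anti-symmetric in (a, b)
A = [[0.0] * n for _ in range(n)]
for a in range(n):
    for b in range(a + 1, n):
        A[a][b] = rng.uniform(-1.0, 1.0)
        A[b][a] = -A[a][b]

# u[c]: made-up one-index tensor supplying the third index
u = [rng.uniform(-1.0, 1.0) for _ in range(n)]


def x(a, b, c):
    """Raw product: anti-symmetric in (a, b) only."""
    return A[a][b] * u[c]


def y(a, b, c):
    """Assumed expansion of P(ab/c) x = (1 - P(ac) - P(bc)) x."""
    return x(a, b, c) - x(c, b, a) - x(a, c, b)


def fully_antisymmetric(f):
    """True if f changes sign under every pairwise index swap."""
    tol = 1e-12
    return all(
        abs(f(a, b, c) + f(b, a, c)) < tol
        and abs(f(a, b, c) + f(c, b, a)) < tol
        and abs(f(a, b, c) + f(a, c, b)) < tol
        for a in range(n)
        for b in range(n)
        for c in range(n)
    )


raw_is_antisymmetric = fully_antisymmetric(x)
result_is_antisymmetric = fully_antisymmetric(y)
```

Only three terms are needed, not six, because the raw product is already anti-symmetric in its first two indices. That is the sense in which keeping maximum symmetry in each intermediate shortens the list of permutations still to be applied.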
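What is the best way to store an intermediate?
• An intermediate tensor has much more limited index permutation symmetries. Super (sub) indices are categorized into global targets and local targets, and permutation anti-symmetry exists among just global targets and among just local targets. So in general, the non-redundant elements are:
[Equation: the intermediate $i$ whose superscripts are ordered global targets $g_1 < g_2 < g_3 < \cdots < g_n$ followed by $p$ ordered local targets, and whose subscripts are ordered global targets $g_1 < g_2 < g_3 < \cdots < g_m$ followed by $q$ ordered local targets; the local-target symbols did not survive in the transcript.]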
What is the general form of tensor contraction with permutation symmetry?
• Expansion
[Equations not recoverable from the transcript: an intermediate $i$ and an excitation amplitude $t$ are each rewritten so that their index groups labeled $g$ and $c$ appear as separately ordered groups.]
Note that an excitation amplitude tensor will not have local target indices. This is because two excitation amplitudes cannot contract (as both have particle superscripts and hole subscripts).
What is the general form of tensor contraction with permutation symmetry?
• Contraction
[Equation not recoverable from the transcript: a contraction of an expanded intermediate $i$ with an expanded amplitude $t$ over the index groups labeled $c$, carrying the factorials $t!$ and $u!$, which yields an intermediate $i$ indexed by the groups labeled $g$.]
Note that at least one of the two tensors is always an excitation amplitude tensor.
What is the general form of tensor contraction with permutation symmetry?
• Compression
[Equation not recoverable from the transcript: the intermediate $i$ with fully ordered index groups labeled $g$ is obtained by applying a permutation operator $P$ to the contracted intermediate.]
Spin & spatial symmetry
• Spin symmetry

$$\sum_{\text{superscript indices}} s_p = \sum_{\text{subscript indices}} s_p$$

• Spatial symmetry

$$\Gamma_p \otimes \Gamma_q \otimes \cdots \otimes \Gamma_z = \text{Totally symmetric}$$
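Both conditions act as cheap screens that decide whether a tile-level block can be non-zero. The sketch below makes two assumptions that go beyond the slides: spins are encoded as +1 (alpha) and -1 (beta), and the point group is Abelian with irreps numbered so that the direct product is a bitwise XOR and the totally symmetric irrep is 0. All names are hypothetical.

```python
from functools import reduce
from operator import xor

ALPHA, BETA = 1, -1  # assumed encoding of the spin labels


def spin_conserving(upper_spins, lower_spins):
    """Sum of spins over superscripts equals the sum over subscripts."""
    return sum(upper_spins) == sum(lower_spins)


def totally_symmetric(irreps):
    """Direct product of all irreps is the totally symmetric irrep (0).

    Valid only for Abelian groups with XOR-compatible irrep numbering.
    """
    return reduce(xor, irreps, 0) == 0


def block_may_be_nonzero(upper, lower):
    """upper and lower are lists of (spin, irrep) pairs, one per index."""
    spins_ok = spin_conserving(
        [spin for spin, _ in upper], [spin for spin, _ in lower]
    )
    irreps_ok = totally_symmetric(
        [irrep for _, irrep in upper] + [irrep for _, irrep in lower]
    )
    return spins_ok and irreps_ok


allowed = block_may_be_nonzero(
    [(ALPHA, 0), (BETA, 1)], [(ALPHA, 1), (BETA, 0)]
)
spin_forbidden = block_may_be_nonzero(
    [(ALPHA, 0), (ALPHA, 0)], [(ALPHA, 0), (BETA, 0)]
)
space_forbidden = block_may_be_nonzero(
    [(ALPHA, 0), (BETA, 1)], [(ALPHA, 0), (BETA, 0)]
)
```

A block that fails either test can be skipped before any data are read or multiplied.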
An example

$$x^{lbc}_{ijk} = P(jk/i)\, P(bc) \sum_{d} t^{bd}_{jk}\, v^{lc}_{id}$$

    LOOP OVER b,j<=k BLOCKS
      LOOP OVER l,c,i BLOCKS
        LOOP OVER d BLOCKS
          IF (b<=d) READ t(b<=d,j<=k)
          IF (d<b) READ t(d<b,j<=k)
          READ v(l<c,i<d)                  ! Always holes < particles
          IF (spin/spatial sym block of t is non-zero) THEN
            IF (spin/spatial sym block of v is non-zero) THEN
              MAKE x(l,b,c,i,j<=k) BLOCK BY DGEMM
              IF (b<=c and i<=j) ACCUM x(l,b<=c,i<=j<=k)
              IF (b<=c and j<=i,i<=k) ACCUM -x(l,b<=c,j<=i<=k)
              IF (b<=c and k<=i) ACCUM x(l,b<=c,j<=k<=i)
              IF (c<=b and i<=j) ACCUM -x(l,c<=b,i<=j<=k)
              IF (c<=b and j<=i,i<=k) ACCUM x(l,c<=b,j<=i<=k)
              IF (c<=b and k<=i) ACCUM -x(l,c<=b,j<=k<=i)
            END IF   ! Note that b=c, i=j block is accumulated
          END IF     ! multiple times
        END LOOP
      END LOOP
    END LOOP
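The six ACCUM branches follow one rule: sort the block indices into their canonical order and attach the parity of the reordering. The sketch below restates that rule for block indices. The function name `accumulation_targets` is hypothetical, and the ordering written for the k<=i case (j, k, i) is the one implied by the loop condition j<=k.

```python
def accumulation_targets(b, c, i, j, k):
    """List the (upper pair, lower triple, sign) targets for one x block.

    Mirrors the six ACCUM branches of the pseudocode: b is ordered
    against c, and i is inserted into the already ordered pair j <= k.
    """
    if j > k:
        raise ValueError("the loop only generates blocks with j <= k")

    upper = []
    if b <= c:
        upper.append(((b, c), +1))
    if c <= b:
        upper.append(((c, b), -1))

    lower = []
    if i <= j:
        lower.append(((i, j, k), +1))   # i already in front
    if j <= i <= k:
        lower.append(((j, i, k), -1))   # one transposition
    if k <= i:
        lower.append(((j, k, i), +1))   # two transpositions

    return [
        (up, low, sign_up * sign_low)
        for up, sign_up in upper
        for low, sign_low in lower
    ]


ordered = accumulation_targets(b=1, c=2, i=5, j=3, k=4)
swapped = accumulation_targets(b=2, c=1, i=3, j=2, k=4)
diagonal = accumulation_targets(b=1, c=1, i=2, j=2, k=3)
```

When block indices coincide (as in `diagonal`), several branches fire at once, which is the multiple accumulation mentioned in the pseudocode comment.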
Extensible many-electron theory module in NWChem
• “Extensible” because a new many-electron method can be added relatively easily by TCE
• Very general tensor storage interface (needs only the size & offsets of one-dimensional compressed tensor arrays; intermediate arrays’ offsets are also computed at run time by programs generated by TCE)
• Compatible one- and two-electron integral transformation codes and offset generators
Optimizations
• Spin, spatial, permutation symmetries
• Dynamic tiling (orbital ranges are “tiled” (or blocked) into smaller sections so that the peak local memory usage does not exceed the user-specified limit)
• Dynamic load balancing parallelism (each tile-level tensor contraction is carried out on one processor with virtually no communication)
• Disk I/O is based on Shared File Library of ParSoft, which allows one-sided (independent) read/write without Global Array cache
• Local sorting of array elements (so that the composite summation indices become contiguous in memory) followed by local DGEMM (with absolutely no communication in this critical step)
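Tiling and dynamic load balancing can be sketched together. In the toy below, a maximum tile length stands in for the user-specified memory limit, threads stand in for processors, and a lock-protected counter stands in for whatever shared task counter the generated programs use. All names (`make_tiles`, `TaskCounter`, `run_tasks`) are hypothetical.

```python
import threading


def make_tiles(n_orbitals, max_tile):
    """Split range(n_orbitals) into contiguous tiles of <= max_tile."""
    return [
        range(start, min(start + max_tile, n_orbitals))
        for start in range(0, n_orbitals, max_tile)
    ]


class TaskCounter:
    """Shared counter: each call to next() returns a distinct number."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            value = self._value
            self._value += 1
            return value


def run_tasks(tasks, n_workers):
    """Let each worker repeatedly claim the next unprocessed task."""
    counter = TaskCounter()
    done = [[] for _ in range(n_workers)]

    def worker(rank):
        while True:
            task_id = counter.next()
            if task_id >= len(tasks):
                return
            # A real worker would read, sort, and multiply tiles here.
            done[rank].append(tasks[task_id])

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


tiles = make_tiles(n_orbitals=7, max_tile=3)
tasks = [(p, q) for p in range(len(tiles)) for q in range(len(tiles))]
done = run_tasks(tasks, n_workers=4)
```

Each tile pair is claimed exactly once, and a worker that finishes early simply claims more tasks, which is what balances the load without a fixed schedule.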
Previous & new algorithms
• Previous: DRA to GA by collective I/O (synchronization!) & GA cache, then GA to MA sort (communications!)
• New: SF to MA by one-sided I/O (no synchronization!), then MA to MA sort (no communications!)
Methods available
• Various spin-unrestricted coupled-cluster methods
  • LCCD, CCD, LCCSD, CCSD, CCSDT
  • More to follow (higher CC, CI, MBPT, EOM-CC, etc.)
• Input syntax
  • Uses NWDFT module for the ground state

    dft
      xc HFexch 1.0
    end
    tce
      ccsd
      thresh 1e-6
      maxiter 100
    end
    task tce energy
A sample output (water CCSD/sto-3g)

          NWChem General Electron-Correlation Theory Module
          -------------------------------------------------
          Programs generated by a Tensor Contraction Engine

 General Information
 -------------------
   Wavefunction type : Restricted
   No. of electrons  : 10
   Alpha electrons   : 5
   Beta electrons    : 5

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

 Correlation Information
 -----------------------
   Calculation type   : Coupled-cluster singles & doubles (CCSD)
   Max iterations     : 100
   Residual threshold : 0.10E-09

 Memory Information
 ------------------
   Available GA+MA space size is 26213624 doubles
   Maximum block size 50 doubles
A sample output (continued)

 Suggested orbital blocking

 Block   Spin    Irrep   Size        Offset
 -----------------------------------------
   1     alpha   a       5 doubles   0
   2     beta    a       5 doubles   5
   3     alpha   a       2 doubles   10
   4     beta    a       2 doubles   12

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

 2-e file size = 5443
 2-e file name = ./temp.v2
 Cpu time / sec 0.0

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

 t2 file size = 300
 t2 file name = ./temp.t2
 Cpu time / sec 0.0

 MBPT(2) correlation energy = -0.035867246917899 hartree
 MBPT(2) total energy = -74.998530309066552 hartree
 Cpu time / sec 0.0
A sample output (continued)

 -------------------------------------------------------
 Iter   Residuum            Correlation          Cpu/Sec
 -------------------------------------------------------
   1    0.089123237955088   -0.035867246917899   0.1
   2    0.031759620132034   -0.045406888265697   0.1
   3    0.012682891602275   -0.048387005902666   0.1
   4    0.005383277884425   -0.049437059764660   0.1
   5    0.002395445228466   -0.049839118488995   0.1
   6    0.001110827268269   -0.050002172402908   0.1

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

  26    0.000000002031284   -0.050127328255753   0.1
  27    0.000000001066715   -0.050127328323605   0.0
  28    0.000000000560286   -0.050127328359134   0.1
  29    0.000000000294338   -0.050127328377747   0.1
  30    0.000000000154649   -0.050127328387501   0.1
  31    0.000000000081266   -0.050127328392616   0.1
 -------------------------------------------------------
 CC iteration converged
 CCSD correlation energy = -0.050127328392616 hartree
 CCSD total energy = -75.012790390541269 hartree
Task times cpu: 2.0s wall: 2.4s
Performance
• Titan spin-adapted parallel CCSD code
  • H2O CCSD/cc-pVTZ
    Energy = -0.2850225 hartree
    1 node  sym=off  16.8 secs/iter
    1 node  sym=on   16.6 secs/iter
    2 nodes sym=off   8.2 secs/iter
    2 nodes sym=on    8.3 secs/iter
• Present spin-unrestricted parallel CCSD code
  • H2O CCSD/cc-pVTZ
    Energy = -0.2850225 hartree
    1 node  sym=off  49.1 secs/iter
    1 node  sym=on   14.5 secs/iter
    2 nodes sym=off  25.2 secs/iter
    2 nodes sym=on    7.5 secs/iter
Spin-unrestricted code has to deal with 3 times as many t-amplitudes as spin-adapted code does, so theoretically the spin-adapted code should be 3 times as fast as the spin-unrestricted code.
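The timings quoted above can be turned into speedup ratios with a few lines of arithmetic. The numbers are the ones on the slide; the dictionary layout and function names are just one hypothetical way to organize them.

```python
# Seconds per iteration, keyed by (code, nodes, symmetry on/off)
timings = {
    ("titan", 1, False): 16.8,
    ("titan", 1, True): 16.6,
    ("titan", 2, False): 8.2,
    ("titan", 2, True): 8.3,
    ("present", 1, False): 49.1,
    ("present", 1, True): 14.5,
    ("present", 2, False): 25.2,
    ("present", 2, True): 7.5,
}


def symmetry_speedup(code, nodes):
    """Time without symmetry divided by time with symmetry."""
    return timings[(code, nodes, False)] / timings[(code, nodes, True)]


def parallel_speedup(code, symmetry):
    """One-node time divided by two-node time."""
    return timings[(code, 1, symmetry)] / timings[(code, 2, symmetry)]


def relative_to_titan(nodes, symmetry):
    """Titan time divided by the present code's time."""
    return (
        timings[("titan", nodes, symmetry)]
        / timings[("present", nodes, symmetry)]
    )


present_gain_from_symmetry = symmetry_speedup("present", 1)
titan_gain_from_symmetry = symmetry_speedup("titan", 1)
present_two_node_speedup = parallel_speedup("present", True)
present_vs_titan = relative_to_titan(1, True)
```

On these figures, switching symmetry on makes the present code roughly 3.4 times faster on one node, whereas the Titan code barely changes. With symmetry on, the present code is slightly faster per iteration than Titan on both one and two nodes.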
Future plans
• CCSDTQ, CI, MBPT, EOM-CC implementation
  • What is the appropriate tensor formulation for MBPT? (Are the MBPT denominators tensors?) See Head-Gordon et al.
  • “Persistent intermediates” (or the so-called similarity transformed Hamiltonian matrix elements) in EOM-CC
• CC(2)PT(2) implementation
  • Post-CCSD(T) O(n^7) method that includes perturbative quadruples
• Further optimization (loop fusion, more aggressive factorization, space-time tradeoffs, etc.) by computer scientist colleagues
• Modular extensibility of operator contraction engine
  • Active spaces (multi-reference methods)
  • Orbital rotations (atomic-orbital-based or local correlation methods)