TRANSCRIPT
Terascale Computations of Multiscale Magnetohydrodynamics for Fusion Plasmas:
ORNL LDRD Project
D. A. Spong*, E. F. D’Azevedo†, D. B. Batchelor*,
D. del-Castillo-Negrete*, M. Fahey‡, S. P. Hirshman*, R. T. Mills‡
*Fusion Energy Division, †Computer Science and Mathematics Division, ‡National Center for Computational Sciences
Oak Ridge National Laboratory
S. Jardin, W. Park, G. Y. Fu, J. Breslau, J. Chen,
S. Ethier, S. Klasky
Princeton Plasma Physics Laboratory
Our project identified six tasks to work on over two years. Our focus during the past year has
been on the first three.
1. Adapt/optimize M3D for the Cray
2. Adapt/optimize DELTA5D for the Cray
3. Develop particle closure relations
4. Couple M3D and particle closures
5. Demonstrate code on realistic examples
6. Final report, documentation
• Our focus will shift to tasks 3 - 6 for next year
• Access to ORNL X1E/XT3 computers required - more significant as project progresses
Significant progress on tasks that are critical to this LDRD project:
• DELTA5D porting and optimization - vectorized particle stepping + rapid field evaluation (3D spline, SVD compression)
• M3D porting and optimization - new vectorized sparse matrix-vector multiply algorithm
• Particle-based closure relations - improved compatibility with MHD (flows isolated from particle model), benchmarked with other codes (DKES)
• PPPL subcontract added (May - September, 2005) - addresses issues for fusion scientific end station applications
DELTA5D and M3D optimization
Particle simulation performance has been significantly improved by optimizing the
magnetic field evaluation routines:
• DELTA5D converted to cylindrical geometry for compatibility with M3D
• 3D B-spline routine optimized by vectorization
• For larger problems, spline memory requirements will limit number of particles per processor
• New data compression techniques developed
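The vectorized field evaluation can be sketched as follows: build one interpolant and evaluate it for all particles in a single call, rather than one scalar spline call per particle per step. This is a minimal illustration, assuming a toy ~1/R field on an (R, Z, φ) grid; the grid, field shape, and SciPy `RegularGridInterpolator` (standing in for the bspllib 3D B-spline) are not DELTA5D's actual data or routines.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Toy |B| data on an (R, Z, phi) grid - an assumed stand-in for the
# VMEC/M3D field, chosen only to exercise the interpolation path.
R = np.linspace(1.0, 2.0, 32)
Z = np.linspace(-0.5, 0.5, 32)
phi = np.linspace(0.0, 2.0 * np.pi, 33)
RR, ZZ, PP = np.meshgrid(R, Z, phi, indexing="ij")
Bmag = 1.0 / RR + 0.05 * ZZ**2 + 0.1 * np.cos(PP)   # ~1/R field + ripple

# One cubic interpolant, built once ('cubic' stands in for the 3D B-spline).
field = RegularGridInterpolator((R, Z, phi), Bmag, method="cubic")

# Vectorized evaluation: one call for all particles per time step,
# instead of a scalar spline call per particle.
rng = np.random.default_rng(0)
npart = 10_000
pts = np.column_stack([
    rng.uniform(1.0, 2.0, npart),
    rng.uniform(-0.5, 0.5, npart),
    rng.uniform(0.0, 2.0 * np.pi, npart),
])
B_at_particles = field(pts)
```

The memory point in the bullets above also shows up here: the interpolant stores the full 3D coefficient array per processor, which is what motivates the compression described next.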
[Figure: Initial 3D spline algorithm tests - timing vs. number of particles per processor (10 to 10^4) on Cheetah (IBM Power4), Cray X1, and Cray XT3, showing the impact of optimization and vectorization.]
DELTA5D
High performance + small memory footprint SVD* fits of magnetic/electric field data have been developed.
Strategy: combine a 2-D SVD* fit in (R,Z) with a 1-D Fourier series in φ. An N x N matrix is replaced by a rank-r approximation,
$$ B_{ij} = B(\theta_i,\phi_j) \approx \sum_{k=1}^{r} w_k\, g_k(\theta_i)\, f_k(\phi_j), $$
requiring only 2Nr terms << N^2.
*SVD: Singular value decomposition
DELTA5D
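The rank-r compression above can be sketched with NumPy's SVD. The toy angular field below is an assumption standing in for the fitted field data; it only illustrates the storage argument (r·(2N+1) numbers instead of N·N).

```python
import numpy as np

# Toy angular field B(theta_i, phi_j) on an N x N grid (an illustrative
# stand-in for the spline-fit field data, not actual VMEC output).
N = 64
theta = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
phi = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
B = (1.0
     + 0.30 * np.outer(np.cos(theta), np.cos(phi))
     + 0.05 * np.outer(np.sin(2.0 * theta), np.sin(3.0 * phi)))

# Truncated SVD: B_ij ~ sum_{k=1}^r w_k g_k(theta_i) f_k(phi_j).
U, w, Vt = np.linalg.svd(B, full_matrices=False)
r = 4
B_r = U[:, :r] * w[:r] @ Vt[:r, :]    # rank-r reconstruction

# Storage drops from N*N values to r*(2N + 1).
rel_err = np.linalg.norm(B - B_r) / np.linalg.norm(B)
```

For this (exactly low-rank) toy field the rank-4 reconstruction recovers B to machine precision; for real field data r is chosen to meet an error tolerance.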
New sparse matrix-vector multiply routine (vectorized) developed: 10 times faster
• M3D spends much time in sparse elliptic equation solvers
• PETSc matrix storage - CSR (Compressed Sparse Row): unit stride is good for scalar but poor for vector processors
• CSRP (CSR with Permutation) - reorders/groups rows to match the vector register size; "strip-mining" breaks up longer loops
• CSRP algorithm encapsulated into a new PETSc matrix object
M3D
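The CSRP idea - permute rows into groups with equal nonzero counts so the inner loop has a fixed trip count that maps onto vector registers - can be sketched in NumPy. This is an illustration of the concept only, not the PETSc CSRP matrix object described above.

```python
import numpy as np
import scipy.sparse as sp

def csrp_matvec(A, x):
    """CSRP-style SpMV sketch: permute rows into groups with equal nonzero
    counts so each group's inner loop has a fixed trip count (the property
    that maps onto vector registers on the X1).  Illustrative only."""
    A = A.tocsr()
    nnz_per_row = np.diff(A.indptr)
    perm = np.argsort(nnz_per_row, kind="stable")   # the row permutation
    y = np.zeros(A.shape[0])
    for L in np.unique(nnz_per_row):
        if L == 0:
            continue
        rows = perm[nnz_per_row[perm] == L]
        # Gather the group's indices/values into dense (len(rows), L) blocks:
        # the uniform inner length L allows a single vectorized reduction.
        idx = A.indptr[rows][:, None] + np.arange(L)[None, :]
        y[rows] = (A.data[idx] * x[A.indices[idx]]).sum(axis=1)
    return y

# Verify against the reference CSR product.
A = sp.random(300, 300, density=0.03, format="csr", random_state=1)
x = np.linspace(0.0, 1.0, 300)
y_csrp = csrp_matvec(A, x)
y_ref = A @ x
```

A production version would also strip-mine very long groups into vector-register-sized chunks; that bookkeeping is omitted here.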
[Figure: Timings for matrix-vector multiplies and the ILU preconditioner (forward/backward solves) on test problems ASTRO, BCSSTK18, 7PT, and 77PTB, comparing PETSc CSR with the vectorized CSRP routine in SSP and MSP modes.]
This substantial improvement in matrix-vector operations has exposed the next layer of opportunity for M3D performance gains:
• For small test problems, up to 15% improvement is observed with the vectorized CSRP routine
• Incomplete LU factorization uses sparse triangular solves that are not easily vectorizable
• Other preconditioning techniques offer better vectorization - tradeoff between effectiveness and cost of iterations
 – Multi-color ordering has been used on the Earth Simulator to vectorize sparse triangular solves; it forces diagonal blocks in the triangular matrices to be diagonal
• Other novel vectorizable preconditioners
 – Sparse approximate inverse (SPAI)
 – Sparse direct solvers
M3D
Neoclassical Closure Relations
Our goal is to couple kinetic transport effects with an MHD model - important for long collisional
path length plasmas such as ITER
• Closure relations enter through the momentum balance equation and Ohm's law:
$$ nm\left(\frac{\partial}{\partial t} + \mathbf{V}\cdot\nabla\right)\mathbf{V} = -\nabla p - \nabla\cdot\Pi + \mathbf{J}\times\mathbf{B} $$
$$ \left(\frac{\partial}{\partial t} + \mathbf{V}\cdot\nabla\right) p = -\gamma p\,\nabla\cdot\mathbf{V} + (\gamma-1)\left(Q - \nabla\cdot\mathbf{q}\right) $$
$$ \mathbf{E} = -\mathbf{V}\times\mathbf{B} + \eta\mathbf{J} + \frac{1}{ne}\left(\mathbf{J}\times\mathbf{B} - \nabla p_e - \nabla\cdot\Pi_{\|e}\right) $$
$$ \frac{\partial\mathbf{B}}{\partial t} = -\nabla\times\mathbf{E}, \qquad \nabla\cdot\mathbf{B} = 0, \qquad \mu_0\mathbf{J} = \nabla\times\mathbf{B} $$
• Moments hierarchy closed by Π = function of n, T, V, B, E
• Requires solution of the Boltzmann equation: f = f(x,v,t)
• High dimensionality: 3 coordinate + 2 velocity + time
Neoclassical transport closures introduce new challenges:
• Collisions introduce new timescales
 – Lengthy evolution to steady state, especially at low collisionalities
 – Our particle model (DELTA5D) assigns particles to processors, assumes global data can be gathered at each MHD step
 – Multi-layer time-stepping
  • Orbit time step Δt1: v‖Δt1 << L‖ - determined by LSODE
  • Collisional time step Δt2: νΔt2 << 1
  • For low collisionality, Δt1 << Δt2
 – Want to avoid calculating quantities (flows, macroscopic gradients) that are already evolved by the MHD model
• New δf partitioning
Main components of DELTA5D Monte Carlo code:
• Set initial conditions and assign groups of particles to processors
• N x (time step followed by collisional step)
• Collect diagnostic data; check for particles that exit the solution domain (i.e., last closed surface) - reseed or remove; check for collisions that scatter particles outside of |v‖/v| ≤ 1 or energy ≥ 0; for beams or alphas, check for thermalized particles
• Merge diagnostic data from different processors - post-process depending on physics problem, write out from PE 0: neutral beam heating/slowing down, global energy confinement, orbit dispersion/diffusion, PDF data for nonlocal transport, bootstrap current, ICRF heating, alpha particle/runaway electron confinement
DELTA5D equations were converted from magnetic to cylindrical coordinates
Uses bspllib 3D cubic B-spline fit to data from VMEC
$$ \frac{d\mathbf{R}}{dt} = \frac{1}{B_\|^*}\left[ v_\|\,\mathbf{B}^* - \hat{b}\times\left(\mathbf{E}^* - \frac{\mu}{Ze}\nabla B\right)\right] $$
$$ m\frac{dv_\|}{dt} = \frac{\mathbf{B}^*}{B_\|^*}\cdot\left(Ze\,\mathbf{E}^* - \mu\nabla B\right) $$
where
$$ B_\|^* = \hat{b}\cdot\mathbf{B}^*, \qquad \hat{b} = \mathbf{B}/B, \qquad \mu = \frac{m v_\perp^2}{2B} $$
$$ \mathbf{B}^* = \mathbf{B} + \frac{m v_\|}{Ze}\nabla\times\hat{b} = \mathbf{B} - \frac{m v_\|}{Ze}\,\hat{b}\times\left(\hat{b}\cdot\nabla\hat{b}\right) $$
$$ \mathbf{E}^* = \mathbf{E} - \frac{m v_\|}{Ze}\frac{\partial\hat{b}}{\partial t} \approx \mathbf{E} \quad \text{(if $\partial B/\partial t$ is slow compared with $\Omega_c$)} $$
In M3D variables,
$$ \mathbf{B} = \nabla\psi\times\nabla\phi + \frac{1}{F}\nabla_\perp F + \left(R_0 + \tilde{I}\right)\nabla\phi $$
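The stiff orbit integration (LSODE in DELTA5D) can be illustrated on a drastically simplified problem: bounce motion along a field line with a model mirror field, stepped with SciPy's LSODA (a descendant of LSODE). The field B(s), the fixed magnetic moment, and all parameter values are assumptions for the sketch, not DELTA5D's equations above; the conserved energy provides the accuracy check.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy bounce motion along a field line with B(s) = B0*(1 + eps*cos s):
#   ds/dt = v_par,   m dv_par/dt = -mu dB/ds,
# with mu held fixed (adiabatic invariant).  LSODA stands in for the
# LSODE stepper used by DELTA5D; all parameter values are arbitrary.
m, mu, B0, eps = 1.0, 0.5, 1.0, 0.2

def rhs(t, y):
    s, v_par = y
    dBds = -B0 * eps * np.sin(s)      # derivative of the model field
    return [v_par, -mu * dBds / m]

sol = solve_ivp(rhs, (0.0, 50.0), [0.0, 0.3], method="LSODA",
                rtol=1.0e-8, atol=1.0e-10)

# The conserved energy E = m v_par^2/2 + mu*B(s) checks the integration.
def energy(s, v_par):
    return 0.5 * m * v_par**2 + mu * B0 * (1.0 + eps * np.cos(s))

E0 = energy(0.0, 0.3)
E_end = energy(sol.y[0, -1], sol.y[1, -1])
```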
A model perturbed field has been added to mock up tearing modes, B = B_VMEC + ∇×(αB_VMEC), with
$$ \alpha = \sum_{m,n}\left[\alpha_{mn}^{c}(\psi)\cos(m\theta - n\phi) + \alpha_{mn}^{s}(\psi)\sin(m\theta - n\phi)\right] $$
Coulomb collision operator for collisions of test particles (species a) with a background plasma (species b):
$$ C_{ab}f_a = \frac{\nu_D^{ab}}{2}\frac{\partial}{\partial\lambda}\left[(1-\lambda^2)\frac{\partial f_a}{\partial\lambda}\right] + \frac{1}{v^2}\frac{\partial}{\partial v}\left\{ v^2\,\nu_\varepsilon\left[\frac{2v}{\alpha_{ab}^2}\,f_a + \frac{\partial f_a}{\partial v}\right]\right\} $$
where
$$ \nu_D^{ab} = \frac{\nu_0^{ab}}{(v/\alpha_{ab})^3}\left[\phi\!\left(\frac{v}{\alpha_b}\right) - G\!\left(\frac{v}{\alpha_b}\right)\right], \qquad \nu_\varepsilon = \nu_0^{ab}\,G\!\left(\frac{v}{\alpha_b}\right) $$
$$ \phi(x) = \frac{2}{\sqrt{\pi}}\int_0^x t^{1/2}e^{-t}\,dt, \qquad G(x) = \frac{1}{2x^2}\left[\phi(x) - x\,\phi'(x)\right] $$
$$ \alpha_{ab} = \sqrt{\frac{2T_{b0}}{m_a}}, \qquad \alpha_b = \sqrt{\frac{2T_{b0}}{m_b}}, \qquad \nu_0^{ab} = \frac{4\pi n_b \ln\Lambda_{ab}\,(e_a e_b)^2}{(2T_b)^{3/2}\,m_a^{1/2}} $$
Monte Carlo (Langevin) Equivalent of the Fokker-Planck Operator
[A. Boozer, G. Kuo-Petravic, Phys. Fluids 24 (1981)]
$$ \lambda_n = \lambda_{n-1}\left(1 - \nu_d\Delta t\right) \pm \left[\left(1-\lambda_{n-1}^2\right)\nu_d\Delta t\right]^{1/2} $$
$$ E_n = E_{n-1} - 2\nu_\varepsilon\Delta t\left[E_{n-1} - \left(\frac{3}{2} + \frac{E_{n-1}}{\nu_\varepsilon}\frac{d\nu_\varepsilon}{dE}\right)T_b\right] \pm 2\left[T_b E_{n-1}\,\nu_\varepsilon\Delta t\right]^{1/2} $$
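The pitch-angle update can be exercised directly on an ensemble: the mean pitch should relax toward isotropy like (1 − ν_d Δt)^n ≈ exp(−ν_d t). The ensemble size, ν_d, and Δt below are arbitrary test values, not DELTA5D settings.

```python
import numpy as np

rng = np.random.default_rng(42)

def pitch_angle_step(lam, nu_d, dt, rng):
    """One Boozer/Kuo-Petravic Monte Carlo pitch-angle scattering step:
    lam_n = lam_{n-1}*(1 - nu_d*dt) +/- sqrt((1 - lam_{n-1}^2)*nu_d*dt),
    with the sign chosen at random for each particle."""
    sign = rng.choice([-1.0, 1.0], size=lam.shape)
    lam_new = lam * (1.0 - nu_d * dt) + sign * np.sqrt((1.0 - lam**2) * nu_d * dt)
    # Guard the |v_par/v| <= 1 boundary (DELTA5D also checks this).
    return np.clip(lam_new, -1.0, 1.0)

# Ensemble relaxation test: <lam> decays like (1 - nu_d*dt)^n ~ exp(-nu_d*t).
nu_d, dt, nsteps = 1.0, 1.0e-3, 2000
lam = np.full(20000, 0.9)
for _ in range(nsteps):
    lam = pitch_angle_step(lam, nu_d, dt, rng)
mean_lam = lam.mean()
expected = 0.9 * (1.0 - nu_d * dt) ** nsteps   # ~ 0.9*exp(-2)
```

Note the vectorized form: one NumPy operation advances all particles, matching the vectorized particle stepping discussed earlier.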
Local Monte-Carlo equivalent quasilinear ICRF operator (developed by J. Carlsson)
$$ E^+ = E^- + \mu_E + \zeta\,\sigma_{EE}, \qquad \lambda^+ = \lambda^- + \mu_\lambda + \zeta\,\sigma_{\lambda\lambda} $$
$$ \zeta = \text{a zero-mean, unit-variance random number (i.e., } \mu_\zeta = 0 \text{ and } \sigma_\zeta = 1\text{)} $$
$$ \sigma_{EE} = 2m^2 v_\perp^2\,\Delta v_0, \qquad \sigma_{\lambda\lambda} = 2\left(\frac{k_\|}{\omega} - \frac{v_\|}{v^2}\right)^2\frac{v_\perp^3\,\Delta v_0}{v^2} $$
$$ \mu_E = 2\left(1 - \frac{k_\| v_\|}{\omega}\right) m v_\perp\,\Delta v_0 $$
$$ \mu_\lambda = 2\left\{\left[\left(1 - \frac{k_\| v_\|}{\omega}\right) - \frac{v_\perp^2}{v^2}\right]\left(\frac{k_\|}{\omega} - \frac{v_\|}{v^2}\right) + \frac{v_\|}{v^2}\frac{v_\perp^2}{v^2}\right\}\frac{v_\perp\,\Delta v_0}{v} $$
where
$$ \Delta v_0 = \frac{1}{v_\perp}\left|\frac{eZ}{2m}\left(E_+ J_{n-1}(k_\perp\rho) + E_- J_{n+1}(k_\perp\rho)\right)\right|^2 \frac{2\pi}{n\dot\Omega} $$
and, as $\dot\Omega \to 0$,
$$ \frac{2\pi}{n\dot\Omega} \to 2\pi^2\left|\frac{2}{n\ddot\Omega}\right|^{2/3} \mathrm{Ai}^2\!\left(-\frac{n^2\dot\Omega^2}{4}\left|\frac{2}{n\ddot\Omega}\right|^{4/3}\right) $$
A δf method is used that follows the distribution function partitioning used in H. Sugama, S. Nishimura, Phys. Plasmas 9, 4637 (2002):
$$ f = f_M\left[1 + \frac{e}{T}\int\frac{dl}{B}\left(BE_\| - \frac{B^2\,\langle BE_\|\rangle}{\langle B^2\rangle}\right)\right] + \frac{2}{v_{th}}\frac{v_\|}{v}\,x f_M\left[u_\| + \left(x^2 - \frac{5}{2}\right)\frac{2q_\|}{5p}\right] $$
$$ \quad + \frac{f_M}{T}\left\{ f_U\left[\frac{\langle u_\|B\rangle}{\langle B^2\rangle} + \left(x^2 - \frac{5}{2}\right)\frac{2\langle q_\|B\rangle}{5p\,\langle B^2\rangle}\right] + f_X\left[X_1 + X_2\left(x^2 - \frac{5}{2}\right)\right]\right\} + \left(f_U + m v_\| B\right)\left[\cdots\right] $$
$$ (V - C)\,f_U = \sigma_U = -m v^2 P_2(v_\|/v)\,\mathbf{B}\cdot\nabla\ln B $$
$$ (V - C)\,f_X = \sigma_X = -\frac{v^2}{2\Omega}P_2(v_\|/v)\,\mathbf{B}\cdot\nabla\!\left(B\tilde{U}\right) $$
where $x = v/v_{th}$, $\tilde{U}$ = Pfirsch-Schlüter flow $= \dfrac{B_\zeta}{B}\left[1 - \dfrac{B^2}{\langle B^2\rangle}\right]$ for a tokamak, and $P_2(y) = \dfrac{3}{2}y^2 - \dfrac{1}{2}$.
Transport/viscosity coefficients: $\langle f_U,\sigma_U\rangle$; $\langle f_X,\sigma_X\rangle$; $\langle f_U,\sigma_X\rangle$, with
$$ \langle L, L'\rangle = \frac{1}{2}\int_{-1}^{1} d(v_\|/v) \oint\!\!\oint_{\theta,\zeta} L\,L'\,\sqrt{g}\,/\,V' $$
New MHD viscosity-based closure relations are more consistent with the MHD model
μ‖ benchmark with DKES*
[Figure: μ‖ from DELTA5D - running time averages on one flux surface (3200 electrons) vs. time, for collisionalities ν/v from 0.0009 to 0.16 m⁻¹; DELTA5D Monte Carlo results compared with the viscosity coefficient from the DKES code over ν/v = 0.0001 to 0.1 m⁻¹.]
*DKES: Drift Kinetic Equation Solver
DELTA5D
$$ \begin{bmatrix} \langle \mathbf{B}\cdot\nabla\cdot\Pi\rangle \\ \langle \mathbf{B}\cdot\nabla\cdot\Theta\rangle \end{bmatrix} = \begin{bmatrix} M_1 & M_2 \\ M_2 & M_3 \end{bmatrix} \begin{bmatrix} V_\| \\ Q_\| \end{bmatrix} + \begin{bmatrix} N_1 & N_2 \\ N_2 & N_3 \end{bmatrix} \begin{bmatrix} \dfrac{1}{n}\dfrac{\partial p}{\partial s} - e\dfrac{\partial\phi}{\partial s} \\ -\dfrac{\partial T}{\partial s} \end{bmatrix} $$
where
$$ M_j,\, N_j \propto \int_0^\infty dE\; e^{-E/kT}\, E\left(E - \frac{5}{2}kT\right)^{j-1} \mu_{\|,N}(E) $$
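Once μ‖,N(E) is known (e.g., from the Monte Carlo runs), the energy moments above are weighted Laguerre-type integrals, well suited to Gauss-Laguerre quadrature. A sketch, using a hypothetical μ(E) whose functional form is an assumption for illustration only:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

kT = 1.0

def mu_par(E):
    """Hypothetical energy-dependent viscosity coefficient, a stand-in
    for the Monte Carlo result mu_||,N(E)."""
    return 1.0 / (1.0 + E / kT)

# Energy moments M_j ~ int_0^inf dE e^{-E/kT} E (E - 5/2 kT)^{j-1} mu(E).
# Substituting x = E/kT absorbs the e^{-x} weight into Gauss-Laguerre.
x, wts = laggauss(64)
E = kT * x

def moment(j):
    return kT * np.sum(wts * E * (E - 2.5 * kT) ** (j - 1) * mu_par(E))

M = [moment(j) for j in (1, 2, 3)]
```

For this toy μ(E) the j=1 moment has a closed form, 1 − e·E₁(1) ≈ 0.4037, which the 64-point rule reproduces.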
In the next phase, kinetic closure relations will be further developed and coupled with the MHD model:
• Closure relations
 – Collisionality scans, convergence studies
 – Examine/check 2D/3D variation of stress tensor
 – Accelerate slow collisional time evolution of viscosity coefficients
  • Test pre-converged restarts
  • Equation-free projective integration extrapolation methods
 – Extension to more general magnetic field models: B = ∇α × ∇β
 – Green-Kubo molecular dynamics methods - direct viscosity calculation
• DELTA5D/M3D coupling
 – Interface, numerical stability, data locality issues
DELTA5D
For the next step we need data from several saturated M3D tearing mode test cases at different helicities (2/1, 3/2, 4/3, etc.)
Δ′ and island width evolution from my earlier calculations with the FAR/KITE reduced MHD codes
$$ q(r) = q(0)\left[1 + \left(\frac{r}{r_0}\right)^{2\lambda}\right]^{1/\lambda}, \qquad r = \sqrt{\chi/\chi_{edge}} $$
$$ \Delta'_{mn} = \lim_{\epsilon\to 0}\left[\left(\frac{d\psi_{mn}}{dr}\right)_{r_s^+} - \left(\frac{d\psi_{mn}}{dr}\right)_{r_s^-}\right]\left[\psi_{mn}(r_s)\right]^{-1} \quad \text{where } q(r_s) = m/n $$
e.g., for r₀ = 0.56, λ = 1.5 [D. A. Spong, et al., ORNL/TM-1097]:
Δ′₂₁ < 0 for 0.8 < q(0) < 0.9 and Δ′₂₁ > 0 for 0.9 < q(0) < 1.
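Numerically, Δ′ can be estimated from one-sided derivatives of ψ_mn across the rational surface. A sketch on a test profile with a known derivative jump (the piecewise-linear ψ is an assumed shape for verification, not an M3D eigenfunction):

```python
import numpy as np

def delta_prime(r, psi, r_s):
    """Estimate Delta' = [psi'(r_s+) - psi'(r_s-)] / psi(r_s) from
    one-sided finite differences on a radial grid (the eps -> 0 limit
    is taken analytically above)."""
    i = int(np.argmin(np.abs(r - r_s)))          # grid point at the surface
    dpsi_minus = (psi[i] - psi[i - 1]) / (r[i] - r[i - 1])
    dpsi_plus = (psi[i + 1] - psi[i]) / (r[i + 1] - r[i])
    return (dpsi_plus - dpsi_minus) / psi[i]

# Test profile with a known jump at r_s = 0.5:
# psi = r/r_s inside, (1-r)/(1-r_s) outside => Delta' = -1/(1-r_s) - 1/r_s = -4.
r_s = 0.5
r = np.linspace(0.0, 1.0, 2001)
psi = np.where(r <= r_s, r / r_s, (1.0 - r) / (1.0 - r_s))
dp = delta_prime(r, psi, r_s)
```

Applied to the saturated M3D eigenfunctions requested above, the same one-sided differencing gives Δ′ for each helicity.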
Neoclassical island width evolution from the FAR/KITE reduced MHD codes
M3D positive Δ′ runs (based on case from W. Park)
[Figure: Nonlinear M3D 2/1 tearing mode - growth rate γ and mode amplitude vs. t/τ_A (0 to ~140), with a reference curve 0.0025·(0.1)^(-3/5) and restart times marked.]
Strategy/tasks for Terascale MHD LDRD:
• Port and optimize DELTA5D for Cray X1
• Port and optimize M3D for Cray X1
• Adapt particle model to cylindrical geometry, M3D fields, parallelize (by processor or region?)
• Develop Δ′ > 0 tearing mode test cases
• Develop efficient representation for E, B; broadcast to all processors; optimize/test
• Develop closure relations: current or pressure tensor; neoclassical ion polarization drift effects; generalized Sugama Ohm's law; delta-f or full-f; new methods: viscosity from dispersion vs. time of momentum (?), track parallel friction between particles/background
• Couple currents, pressure tensor, … back into M3D - need equations from flows/viscosities to J_BS; optimize/test numerical stability
• Monte Carlo issues: collision operator conservation properties; test particle/background consistency; smoothed particles or bin and then smooth; resolution of trapped/passing boundary layers