First principles modeling with Octopus: massive parallelization towards petaflop computing and more
A. Castro, J. Alberdi and A. Rubio
Outline
- Theoretical Spectroscopy
- The octopus code
- Parallelization
Theoretical Spectroscopy
Electronic excitations:
- Optical absorption
- Electron energy loss
- Inelastic X-ray scattering
- Photoemission
- Inverse photoemission
- …
Goal: a first-principles (from electronic structure) theoretical description of the various spectroscopies ("theoretical beamlines").
Role: interpretation of (complex) experimental findings, e.g. theoretical atomistic structures and the corresponding TEM images.
The European Theoretical Spectroscopy Facility (ETSF):
- Networking
- Integration of tools (formalism, software)
- Maintenance of tools
- Support, service, training
The octopus code is a member of a family of free software codes developed, to a large extent, within the ETSF:
- abinit
- octopus
- dp
Outline
- Theoretical Spectroscopy
- The octopus code
- Parallelization
The octopus code
Targets:
- Optical absorption spectra of molecules, clusters, nanostructures, solids.
- Response to lasers (non-perturbative response to high-intensity fields).
- Dichroic spectra, and other mixed (electric-magnetic) responses.
- Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
- Quantum optimal control theory for molecular processes.
Physical approximations and techniques:
- Density-functional theory and time-dependent density-functional theory to describe the electronic structure.
- Comprehensive set of functionals through the libxc library.
- Mixed quantum-classical systems.
- Both real-time and frequency-domain response ("Casida" and "Sternheimer" formulations).
Numerics:
- Basic representation: real-space grid.
- Usually regular and rectangular, occasionally curvilinear.
- Plane waves for some procedures (especially for periodic systems).
- Atomic orbitals for some procedures.
The derivative at a point is a weighted sum over neighboring points.
- The coefficients c_ij depend on the points used: the stencil.
- More points -> more precision.
- A semi-local operation (see the sketch below).
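As an illustration of this idea, here is a minimal sketch (illustrative C, not Octopus source) of a second-order finite-difference Laplacian on a regular 1D grid with a 3-point stencil:

```c
#include <stddef.h>

/* Second-order finite-difference Laplacian on a regular 1D grid of
   spacing h. 3-point stencil with coefficients {1, -2, 1}/h^2;
   interior points only. */
void laplacian_1d(const double *f, double *lap, size_t n, double h)
{
    const double inv_h2 = 1.0 / (h * h);
    for (size_t i = 1; i + 1 < n; ++i)
        lap[i] = (f[i - 1] - 2.0 * f[i] + f[i + 1]) * inv_h2;
}
```

A higher-order stencil simply uses more neighbors with the corresponding coefficients; because each point depends only on nearby points, the operation is semi-local and well suited to domain parallelization.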
The key equations:
- Ground-state DFT: the Kohn-Sham equations.
- Time-dependent DFT: the time-dependent Kohn-Sham equations (standard forms below).
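In their standard form (Hartree atomic units; v_KS is the Kohn-Sham potential, a functional of the density n), these read:

```latex
% Ground-state Kohn-Sham equations:
\Big[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r})\Big]\,\varphi_i(\mathbf{r})
  = \varepsilon_i\,\varphi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_i^{\mathrm{occ}} |\varphi_i(\mathbf{r})|^2

% Time-dependent Kohn-Sham equations:
i\,\frac{\partial}{\partial t}\,\varphi_i(\mathbf{r},t)
  = \Big[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r},t)\Big]\,\varphi_i(\mathbf{r},t)
```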
Key numerical operations:
- Linear systems with sparse matrices.
- Eigenvalue problems with sparse matrices.
- Non-linear eigenvalue problems.
- Propagation of "Schrödinger-like" equations (one example propagator is written below).
- The dimension can reach 10 million points.
- Storage needs can reach 10 GB.
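As a concrete example of such a propagation, the exponential-midpoint rule is one common scheme (written here for illustration; the slides do not name a specific propagator):

```latex
% Exponential-midpoint approximation to the TDKS propagator:
\varphi_i(t+\Delta t) \;\approx\;
  \exp\!\Big[-\,i\,\Delta t\, H_{\mathrm{KS}}\big(t + \tfrac{\Delta t}{2}\big)\Big]\,\varphi_i(t)
```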
Use of libraries:
- BLAS, LAPACK
- GNU GSL mathematical library
- FFTW
- NetCDF
- ETSF input/output library
- libxc exchange and correlation library
- Other optional libraries
www.tddft.org/programs/octopus/
Outline
- Theoretical Spectroscopy
- The octopus code
- Parallelization
Objective
- Reach petaflop computing with a scientific code.
- Simulate the absorption of light by chlorophyll in photosynthesis.
Multi-level parallelization
- MPI: Kohn-Sham states; real-space domains.
- In node: OpenMP threads (CPU); OpenCL tasks (GPU); vectorization.
Target systems: massive numbers of execution units
- Multi-core processors with vector FPUs
- IBM Blue Gene architecture
- Graphics processing units
High-level parallelization
- MPI parallelization
Parallelization by states/orbitals
- Assign each processor a group of states.
- Time propagation is independent for each state.
- Little communication required.
- Limited by the number of states in the system (see the sketch below).
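A minimal sketch of distributing states over MPI ranks (hypothetical names and a simple block distribution, not Octopus's actual scheme):

```c
#include <mpi.h>
#include <stdio.h>

/* Block-distribute n_states Kohn-Sham states over MPI ranks.
   Each rank propagates only its own contiguous block of states. */
int main(int argc, char **argv)
{
    int rank, size;
    const int n_states = 256;  /* hypothetical total number of states */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int base  = n_states / size, rem = n_states % size;
    int first = rank * base + (rank < rem ? rank : rem);
    int count = base + (rank < rem ? 1 : 0);

    printf("rank %d propagates states %d..%d\n",
           rank, first, first + count - 1);

    /* ... each rank propagates its local states independently;
       communication is only needed when the density or observables
       are summed over all states (e.g. MPI_Allreduce). */

    MPI_Finalize();
    return 0;
}
```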
Domain parallelization
- Assign each processor a set of grid points.
- Partitioning libraries: Zoltan or METIS.
Main operations in domain parallelization
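The dominant operations are the exchange of boundary ("ghost") points between neighboring domains before each stencil application, plus global reductions for dot products. A minimal halo-exchange sketch for a 1D decomposition (real codes use the general partitions produced by Zoltan/METIS; names here are hypothetical):

```c
#include <mpi.h>

/* Exchange one layer of ghost points with left/right neighbors in a
   1D domain decomposition. f has n interior points plus one ghost
   cell at each end: f[0] and f[n+1]. */
void exchange_ghosts(double *f, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send first interior point left; receive right ghost from the right. */
    MPI_Sendrecv(&f[1],     1, MPI_DOUBLE, left,  0,
                 &f[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send last interior point right; receive left ghost from the left. */
    MPI_Sendrecv(&f[n], 1, MPI_DOUBLE, right, 1,
                 &f[0], 1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```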
Low-level parallelization and vectorization
- OpenMP and GPU
Two approaches
OpenMP (a minimal example follows below):
- Thread programming based on compiler directives.
- In-node parallelization.
- Little memory overhead compared to MPI.
- Scaling limited by memory bandwidth.
- Multithreaded BLAS and LAPACK.
OpenCL:
- Hundreds of execution units.
- High memory bandwidth, but with long latency.
- Behaves like a vector processor (vector length > 16).
- Separate memory: copy from/to main memory.
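For the OpenMP approach, an illustrative sketch of directive-based threading over grid points (the same stencil as before, now threaded; not Octopus source):

```c
#include <omp.h>

/* 3-point Laplacian over the grid with compiler-directive threading.
   Each thread handles a static chunk of grid points; no explicit
   communication or synchronization is needed inside the loop. */
void laplacian_omp(const double *f, double *lap, int n, double h)
{
    const double inv_h2 = 1.0 / (h * h);
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < n - 1; ++i)
        lap[i] = (f[i - 1] - 2.0 * f[i] + f[i + 1]) * inv_h2;
}
```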
Supercomputers
- Corvo cluster: x86_64.
- VARGAS (at IDRIS): Power6, 67 teraflops.
- MareNostrum: PowerPC 970, 94 teraflops.
- Jugene: 1 petaflop.
Test Results
Laplacian operator
Performance comparison of the finite-difference Laplacian operator:
- CPU uses 4 threads.
- GPU is 4 times faster.
- Cache effects are visible.
Time propagation
Performance comparison for a time propagation:
- Fullerene molecule.
- The GPU is 3 times faster.
- Limited by memory copies and non-GPU code.
Multi-level parallelization
- Chlorophyll molecule: 650 atoms.
- Jugene (Blue Gene/P).
- Sustained throughput: > 6.5 teraflops.
- Peak throughput: 55 teraflops.
Scaling
Scaling (II)
- Comparison of two atomic systems on Jugene.
Target system
- Jugene, all nodes: 294,912 processor cores = 73,728 nodes.
- Maximum theoretical performance: 1002 teraflops.
- 5879-atom chlorophyll system: a complete molecule from spinach.
Test systems
- Smaller molecules: 180, 441, 650, and 1365 atoms.
- Partitions of the machines: Jugene and Corvo.
Profiling
- Profiled within the code.
- Profiled with the Paraver tool: www.bsc.es/paraver
Trace views: one TD iteration; some "inner" iterations; within one "inner" iteration, Irecv/Isend/Wait calls; the Poisson solver shows 2x Alltoall, Allgather, Allgather, Scatter.
Improvements
Memory improvements in the ground state (GS):
- Split the memory among the nodes.
- Use of ScaLAPACK.
Improvements in the Poisson solver for TD:
- Pipelined execution: run the Poisson solver while continuing with an approximation.
- Use of new algorithms such as FMM.
- Use of parallel FFTs.
Conclusions
- The Kohn-Sham scheme is inherently parallel.
- This can be exploited for parallelization and vectorization.
- Well suited to current and future computer architectures.
- Theoretical improvements for large-system modeling.