First principles modeling with Octopus: massive parallelization towards petaflop computing and more A. Castro, J. Alberdi and A. Rubio

Upload: arnold

Post on 23-Feb-2016


TRANSCRIPT

Page 1: A. Castro, J. Alberdi and A. Rubio

First principles modeling with Octopus: massive parallelization towards petaflop

computing and more

A. Castro, J. Alberdi and A. Rubio

Page 2

Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization

Page 3

Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization

Page 4

Theoretical Spectroscopy

Page 5

Theoretical Spectroscopy

Electronic excitations:
- Optical absorption
- Electron energy loss
- Inelastic X-ray scattering
- Photoemission
- Inverse photoemission
- …

Page 6

Theoretical Spectroscopy

Goal: a first-principles (from electronic structure) theoretical description of the various spectroscopies ("theoretical beamlines").

Page 7

Theoretical Spectroscopy

Role: interpretation of (complex) experimental findings

Page 8

Theoretical Spectroscopy

Role: interpretation of (complex) experimental findings

Theoretical atomistic structures, and corresponding TEM images.

Page 9

Theoretical Spectroscopy

Page 10

Theoretical Spectroscopy

Page 11

Theoretical Spectroscopy

The European Theoretical Spectroscopy Facility (ETSF)

Page 12

Theoretical Spectroscopy

The European Theoretical Spectroscopy Facility (ETSF):
- Networking
- Integration of tools (formalism, software)
- Maintenance of tools
- Support, service, training

Page 13

Theoretical Spectroscopy

The octopus code is a member of a family of free-software codes developed, to a large extent, within the ETSF:
- abinit
- octopus
- dp

Page 14

Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization

Page 15

The octopus code

Targets:
- Optical absorption spectra of molecules, clusters, nanostructures, and solids.
- Response to lasers (non-perturbative response to high-intensity fields).
- Dichroic spectra, and other mixed (electric-magnetic) responses.
- Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
- Quantum Optimal Control Theory for molecular processes.

Page 16

The octopus code

Physical approximations and techniques:
- Density-functional theory (DFT) and time-dependent density-functional theory (TDDFT) to describe the electronic structure.
  • Comprehensive set of functionals through the libxc library.
- Mixed quantum-classical systems.
- Both real-time and frequency-domain response ("Casida" and "Sternheimer" formulations).

Page 17

The octopus code

Numerics:
- Basic representation: a real-space grid.
- Usually regular and rectangular, occasionally curvilinear.
- Plane waves for some procedures (especially for periodic systems).
- Atomic orbitals for some procedures.

Page 18

The octopus code

- Derivative at a point: a sum over neighboring points.
- The coefficients c_ij depend on the points used: the stencil.
- More points -> higher precision.
- A semi-local operation.
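The stencil idea above can be sketched in a few lines. This is a minimal 1-D, 3-point illustration (the grid spacing and stencil order are assumptions, not octopus' defaults):

```python
# Finite-difference Laplacian on a real-space grid: the second
# derivative at a point is a weighted sum over neighboring points,
# with weights c_j fixed by the stencil.

def laplacian_1d(f, h):
    """3-point stencil: f''(x_i) ~ (f[i-1] - 2 f[i] + f[i+1]) / h^2."""
    c = [1.0, -2.0, 1.0]  # stencil coefficients
    return [
        sum(c[j] * f[i - 1 + j] for j in range(3)) / h**2
        for i in range(1, len(f) - 1)
    ]

h = 0.1
grid = [i * h for i in range(11)]
f = [x**2 for x in grid]        # f(x) = x^2, so f'' = 2 everywhere
print(laplacian_1d(f, h)[:3])   # values close to 2.0
```

A larger stencil (more neighbors per point) raises the order of accuracy, at the cost of more communication when the grid is distributed across processors.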

Page 19

The octopus code

The key equations:
- Ground-state DFT: the Kohn-Sham equations.
- Time-dependent DFT: the time-dependent Kohn-Sham equations.
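The equations themselves were shown as images and did not survive the transcript; in standard notation (atomic units) they read:

```latex
% Ground-state Kohn-Sham equations:
\left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r})\right]\varphi_i(\mathbf{r})
  = \varepsilon_i\,\varphi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_i^{\mathrm{occ}} |\varphi_i(\mathbf{r})|^2

% Time-dependent Kohn-Sham equations:
i\,\frac{\partial}{\partial t}\varphi_i(\mathbf{r},t)
  = \left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r},t)\right]\varphi_i(\mathbf{r},t)
```

Here v_KS[n] is the Kohn-Sham potential, a functional of the density n, which is what makes the ground-state problem a non-linear eigenvalue problem.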

Page 20

The octopus code

Key numerical operations:
- Linear systems with sparse matrices.
- Eigenvalue systems with sparse matrices.
- Non-linear eigenvalue systems.
- Propagation of "Schrödinger-like" equations.

- The dimension can reach 10 million points.
- The storage needs can reach 10 GB.
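As an illustration of the "Schrödinger-like" propagation with sparse matrices, here is one Crank-Nicolson step on a tiny 1-D grid. The free-particle Hamiltonian, grid size, and time step are assumptions for the sketch; octopus offers several propagators and this is not a description of its actual implementation:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, h, dt = 200, 0.1, 0.01

# Sparse kinetic-energy Hamiltonian built from the 3-point Laplacian stencil.
lap = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
H = -0.5 * lap
I = sp.identity(n)

def cn_step(psi):
    """Crank-Nicolson: (I + i dt/2 H) psi_new = (I - i dt/2 H) psi.
    One sparse linear solve per time step; unitary for Hermitian H."""
    rhs = (I - 0.5j * dt * H) @ psi
    return spla.spsolve((I + 0.5j * dt * H).tocsc(), rhs)

x = h * np.arange(n)
psi = np.exp(-((x - 10.0) ** 2)).astype(complex)  # Gaussian wave packet
psi /= np.linalg.norm(psi)
psi = cn_step(psi)
print(abs(np.linalg.norm(psi) - 1.0) < 1e-10)     # norm is conserved
```

At production scale the same step becomes a distributed sparse solve, which is exactly where the domain and state parallelization discussed below comes in.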

Page 21

The octopus code

Use of libraries:
- BLAS and LAPACK
- GNU GSL mathematical library
- FFTW
- NetCDF
- ETSF input/output library
- libxc exchange and correlation library
- Other optional libraries

Page 22

www.tddft.org/programs/octopus/

Page 23

Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization

Page 24

Objective

- Reach petaflop computing with a scientific code.
- Simulate the first steps of photosynthesis (light absorption) in chlorophyll.

Page 25

Multi-level parallelization

- MPI: Kohn-Sham states; real-space domains.
- In node, on the CPU: OpenMP threads; vectorization.
- In node, on the GPU: OpenCL tasks.

Page 26

Target systems: a massive number of execution units.
- Multi-core processors with vector FPUs
- IBM Blue Gene architecture
- Graphics processing units

Page 27

High-level parallelization

MPI parallelization

Page 28

Parallelization by states/orbitals

- Assign each processor a group of states.
- Time propagation is independent for each state.
- Little communication required.
- Limited by the number of states in the system.
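A toy sketch of this distribution of states over processes. The contiguous block partition below is an illustrative assumption, not octopus' exact scheme:

```python
# Each MPI process owns a block of Kohn-Sham states and propagates
# them independently; scaling is capped by the number of states.

def states_for_rank(n_states, n_ranks, rank):
    """Contiguous block of state indices owned by `rank`.
    Early ranks absorb the remainder, so block sizes differ by at most 1."""
    base, extra = divmod(n_states, n_ranks)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))

# 10 states over 4 processes: every state is owned by exactly one rank.
for rank in range(4):
    print(rank, states_for_rank(10, 4, rank))
```

With more ranks than states, some ranks would sit idle, which is why this level is combined with domain parallelization below.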

Page 29

Domain parallelization

- Assign each processor a set of grid points.
- Partitioning libraries: Zoltan or METIS.
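The essence of domain parallelization can be sketched with an even 1-D slab split; real partitions come from Zoltan or METIS and are far more general, so treat this as an illustrative assumption:

```python
# The real-space grid is split into domains, one per process. A
# finite-difference stencil applied near a domain boundary needs
# "ghost" points owned by the neighboring domain -- this is the
# communication in the domain parallelization.

def domain(n_points, n_domains, rank, stencil_radius=1):
    """Return (owned, ghosts): the indices this rank owns, and the
    neighbor-owned indices its stencil needs."""
    size = n_points // n_domains
    lo = rank * size
    hi = (rank + 1) * size if rank < n_domains - 1 else n_points
    owned = list(range(lo, hi))
    ghosts = [i for i in list(range(lo - stencil_radius, lo)) +
                         list(range(hi, hi + stencil_radius))
              if 0 <= i < n_points]
    return owned, ghosts

owned, ghosts = domain(100, 4, 1)
print(len(owned), ghosts)   # an interior domain needs ghosts on both sides
```

A wider stencil means a thicker ghost layer, i.e. more data exchanged per Laplacian application, which is the precision/communication trade-off from the stencil slide.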

Page 30

Main operations in domain parallelization

Page 31

Low-level parallelization and vectorization

OpenMP and GPU

Page 32

Two approaches:

OpenMP
- Thread programming based on compiler directives.
- In-node parallelization.
- Little memory overhead compared to MPI.
- Scaling limited by memory bandwidth.
- Multithreaded BLAS and LAPACK.

OpenCL
- Hundreds of execution units.
- High memory bandwidth, but with long latency.
- Behaves like a vector processor (vector length > 16).
- Separate memory: copy from/to main memory.

Page 33

Supercomputers:
- Corvo cluster: x86_64
- VARGAS (at IDRIS): Power6, 67 teraflops
- MareNostrum: PowerPC 970, 94 teraflops
- Jugene (pictured): 1 petaflop

Page 34

Test Results

Page 35

Laplacian operator

A performance comparison of the finite-difference Laplacian operator:
- The CPU uses 4 threads.
- The GPU is 4 times faster.
- Cache effects are visible.

Page 36

Time propagation

A performance comparison for a time propagation:
- Fullerene molecule.
- The GPU is 3 times faster.
- Limited by copying and by non-GPU code.

Page 37

Multi-level parallelization

- Chlorophyll molecule: 650 atoms.
- Jugene, Blue Gene/P.
- Sustained throughput: > 6.5 teraflops.
- Peak throughput: 55 teraflops.

Page 38

Scaling

Page 39

Scaling (II)

Comparison of two atomic systems on Jugene.

Page 40

Target system

Jugene, all nodes:
- 294,912 processor cores = 73,728 nodes.
- Maximum theoretical performance of 1002 teraflops.

5879-atom chlorophyll system:
- The complete molecule from spinach.

Page 41

Test systems

Smaller molecules:
- 180 atoms
- 441 atoms
- 650 atoms
- 1365 atoms

Partitions of the machines:
- Jugene and Corvo

Page 42

Profiling

- Profiled within the code.
- Profiled with the Paraver tool: www.bsc.es/paraver

Page 43

1 TD iteration

Page 44

Some “inner” iterations

Page 45

One "inner" iteration: Irecv, Isend, Wait.

Page 46

Poisson solver: 2x Alltoall, Allgather, Allgather, Scatter.

Page 47

Improvements

Memory improvements in the ground state (GS):
- Split the memory among the nodes.
- Use of ScaLAPACK.

Improvements in the Poisson solver for time-dependent (TD) runs:
- Pipelined execution: run the Poisson solver while continuing with an approximation.
- Use of new algorithms such as FFM.
- Use of parallel FFTs.
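To make the FFT route concrete, here is a minimal reciprocal-space Poisson solve on a 1-D periodic grid. The grid, box length, and Gaussian charge are assumptions for the sketch; octopus' actual solvers (and their parallel FFTs) are far more involved:

```python
import numpy as np

# In reciprocal space, the Poisson equation  lap(phi) = -4*pi*rho
# becomes algebraic:  phi(G) = 4*pi*rho(G) / G^2.

n, L = 64, 10.0
x = L * np.arange(n) / n
rho = np.exp(-((x - L / 2) ** 2))
rho -= rho.mean()                      # neutralize: the G = 0 term is undefined

g = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)   # angular wavevectors G
rho_g = np.fft.fft(rho)
phi_g = np.zeros_like(rho_g)
phi_g[1:] = 4.0 * np.pi * rho_g[1:] / g[1:] ** 2
phi = np.fft.ifft(phi_g).real

# Check: the spectral Laplacian of phi recovers -4*pi*rho.
lap_phi = np.fft.ifft(-g ** 2 * np.fft.fft(phi)).real
print(np.allclose(lap_phi, -4.0 * np.pi * rho))
```

Each forward/backward transform is what a parallel run must distribute, hence the Alltoall-heavy communication pattern seen in the Poisson-solver trace above.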

Page 48

Conclusions

- The Kohn-Sham scheme is inherently parallel.
- This can be exploited for parallelization and vectorization.
- It is well suited to current and future computer architectures.
- Theoretical improvements enable large-system modeling.