
IOP PUBLISHING JOURNAL OF PHYSICS: CONDENSED MATTER

J. Phys.: Condens. Matter 24 (2012) 233202 (11pp) doi:10.1088/0953-8984/24/23/233202

TOPICAL REVIEW

Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project

Xavier Andrade1, Joseba Alberdi-Rodriguez2,3, David A Strubbe4,5, Micael J T Oliveira6, Fernando Nogueira6, Alberto Castro7, Javier Muguerza3, Agustin Arruabarrena3, Steven G Louie4,5, Alan Aspuru-Guzik1, Angel Rubio2,8 and Miguel A L Marques9,10

1 Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA
2 Nano-Bio Spectroscopy Group and ETSF Scientific Development Center, Departamento de Física de Materiales, Centro de Física de Materiales CSIC-UPV/EHU and DIPC, University of the Basque Country UPV/EHU, Avenida Tolosa 72, 20018 Donostia/San Sebastian, Spain
3 Department of Computer Architecture and Technology, University of the Basque Country UPV/EHU, M Lardizabal 1, 20018 Donostia/San Sebastian, Spain
4 Department of Physics, University of California, 366 LeConte Hall MC 7300, Berkeley, CA 94720, USA
5 Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
6 Center for Computational Physics, University of Coimbra, Rua Larga, 3004-516 Coimbra, Portugal
7 Institute for Biocomputation and Physics of Complex Systems (BIFI), Zaragoza Center for Advanced Modelling (ZCAM), University of Zaragoza, Spain
8 Fritz-Haber-Institut der Max-Planck-Gesellschaft, Berlin, Germany
9 Université de Lyon, F-69000 Lyon, France
10 LPMCN, CNRS, UMR 5586, Université Lyon 1, F-69622 Villeurbanne, France

E-mail: [email protected]

Received 4 April 2012
Published 4 May 2012
Online at stacks.iop.org/JPhysCM/24/233202

Abstract
OCTOPUS is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of OCTOPUS. We focus on the real-time variant of TDDFT, where the time-dependent Kohn–Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs).

To harness the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn–Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in OCTOPUS, can be the method of choice for studying the excited states of large molecular systems on modern parallel architectures.

(Some figures may appear in colour only in the online journal)


Contents

1. Introduction
2. Octopus features
   2.1. Theory
   2.2. Grids
   2.3. The software package
3. Parallelization
   3.1. Parallelization in k-points and spin
   3.2. Parallelization in Kohn–Sham states
   3.3. Parallelization in domains
   3.4. Parallelization of the Poisson solver
4. GPU parallelization and code optimization
5. Scalability in massively parallel systems
   5.1. Ground-state calculations
   5.2. Real-time TDDFT calculations
   5.3. Combined MPI–GPU parallelization
6. Conclusions
Acknowledgments
References

1. Introduction

One of the main factors behind the wide adoption of first-principles electronic-structure methods in physics and chemistry has been the fast growth of computer capabilities dictated by Moore's law [1], combined with the development of new algorithms and software capable of exploiting these capabilities. Thanks to these advances, molecular or solid-state systems containing several thousand atoms can now be accurately modelled using supercomputers.

Over the last several years, however, we have witnessed a remarkable change in computer architectures. Previously, the ever-increasing transistor density translated directly into an increase of the speed and capabilities of a single processor core. However, problems related to efficiency, power consumption and heat dissipation forced a change of paradigm: the trend is now to use the extra transistors to provide more processing elements. This is reflected in today's supercomputers, where the number of processor cores is constantly increasing while the capabilities of each processing element progress much more slowly. The same is true for personal computing: parallelism can be obtained not only via multi-core central processing units (CPUs), but also via generally programmable graphics processing units (GPUs). In fact, GPUs effectively convert a desktop computer into a massively parallel computer.

Unfortunately, the task of using all the available parallel units efficiently and concurrently is left to the application programmer. This is a very complex task that presents a real challenge to programmers in general and to scientists in particular. Of course, as new parallel programming paradigms are introduced, scientific codes also have to evolve. Some of the methods we currently use can be adapted to parallel architectures, but many popular scientific computing techniques or algorithms might have to be replaced by others that are more efficient in these new environments. In exchange, this change in computational paradigm offers us the possibility of studying larger systems under more realistic conditions and with more accurate methods than was possible before.

In this paper, we show how time-dependent density-functional theory (TDDFT) [2] in real time can be a very competitive approach for studying the excited states of electronic systems in massively parallel architectures, especially when combined with a spatial grid representation. In fact, real-time TDDFT is a versatile method to model the response of an electronic system (molecular or crystalline [3]) to different kinds of perturbations. It is useful to calculate properties like optical absorption spectra [4, 5], non-linear optical response [6, 7], circular dichroism [8, 9], van der Waals coefficients [10], Raman intensities [11], etc. The numerical advantage of real-time TDDFT is that the propagator preserves the orthogonality of the states [12]. In practice, this allows us to propagate each one of the states independently, which is ideal for parallelization. Since the method does not require expensive orthogonalization steps, the numerical cost scales with the square of the size of the system and not cubically as in many other methods [13].

Here, we present a summary of the work done over the last few years on the optimization and parallelization of the code OCTOPUS [14, 15], in order to profit from state-of-the-art high-performance computing platforms, from GPUs to supercomputers. This code is a widely used tool to perform real-time TDDFT simulations. It uses a real-space grid that provides an accurate and controllable discretization and also allows for an efficient parallelization by domain decomposition.

Directly following (section 2), we give a brief description of OCTOPUS and its features. Then (section 3), we focus our discussion on the parallelization of OCTOPUS: the various parallelization modes and how they can be combined depending on the system characteristics and the available processors. We pay special attention to the crucial bottleneck that Poisson's equation can become. Next we describe how we tackled the GPU parallelization (section 4). Finally (section 5), we present scalability results that illustrate all the efforts described in the previous sections.

2. Octopus features

2.1. Theory

OCTOPUS was originally written to solve the equations of density-functional theory (DFT) in its ground-state [16] and time-dependent [2] forms. In particular, and like the vast majority of DFT applications, we use the Kohn–Sham (KS) [17] formulation of DFT, which leads to a coupled set of single-particle equations whose solution yields the many-body electronic density n(r, t). For example, for the time-dependent case these equations read (atomic units are used throughout this paper)


$$\mathrm{i}\,\frac{\partial}{\partial t}\varphi_i(\mathbf{r},t) = \left[-\frac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r},t) + v_{\mathrm{Hartree}}[n](\mathbf{r},t) + v_{\mathrm{xc}}[n](\mathbf{r},t)\right]\varphi_i(\mathbf{r},t) \qquad (1)$$

$$n(\mathbf{r},t) = \sum_i^{\mathrm{occ}} |\varphi_i(\mathbf{r},t)|^2 \qquad (2)$$

where φi(r, t) are the single-particle KS states (also called KS orbitals), vext(r, t) is the time-dependent external potential, which can be the potential generated by the nuclei, a laser field, etc; vHartree[n](r, t) is the Hartree potential that describes the classical mean-field interaction of the electron distribution; and vxc[n](r, t) is the exchange–correlation (xc) potential that includes all non-trivial many-body contributions.

It is true that the (time-dependent) KS equations are an exact reformulation of (time-dependent) quantum mechanics. However, the exact form of the xc functional is unknown and, therefore, has to be approximated in any practical application of the theory. In OCTOPUS different approximations for this term are implemented, from local and semi-local functionals to more sophisticated orbital-dependent functionals, including hybrids [18] and the optimized effective potential approach [19]. Asymptotic correction methods are also implemented [20].

We note that the local and semi-local xc functionals in OCTOPUS were implemented as a separate component, Libxc [21]. This library is now completely independent of OCTOPUS and is used by several projects like APE [22], GPAW [23], and ABINIT [24]. Currently it contains around 180 functionals for the exchange, correlation, and kinetic energies belonging to the local-density approximation (LDA), the generalized-gradient approximation (GGA), the meta-GGA, and hybrid functional families. Functionals for systems of reduced dimensionality (1D and 2D) are also included.

OCTOPUS can also be used to study model systems of different dimensionalities (currently up to five-dimensional systems are supported). For this, an arbitrary external potential can be given by the user by directly including its formula in the input file. This feature is extremely useful, e.g., to study reduced-dimensional systems of interacting electrons in time-dependent fields [25–28]. In fact, the Schrödinger equation describing two electrons interacting in 1D is equivalent to the Schrödinger equation of one independent electron in 2D. In the same way, the problem of two electrons interacting in 2D can be mapped to the problem of one electron in a 4D space.
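To make this mapping explicit (a standard identity, written here in our own notation rather than taken from the original article), the Hamiltonian of two electrons moving in 1D in an external potential v and interacting through w can be read as that of a single particle in two dimensions,

$$\hat{H} = -\frac{1}{2}\frac{\partial^2}{\partial x_1^2} - \frac{1}{2}\frac{\partial^2}{\partial x_2^2} + v(x_1) + v(x_2) + w(x_1 - x_2) \equiv -\frac{1}{2}\nabla^2_{2\mathrm{D}} + V(x_1, x_2),$$

where $V(x_1, x_2) = v(x_1) + v(x_2) + w(x_1 - x_2)$ plays the role of the external potential felt by one electron moving in the $(x_1, x_2)$ plane.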

Another recent addition to the code is the possibility of performing multi-scale modelling by combining electronic systems with complex electrostatic environments. For example, this has been used to simulate a molecule placed between two metallic plates at a certain voltage bias [29]. Multi-scale QM/MM calculations can be performed as well [30].

Figure 1. Example of the grid of OCTOPUS for a benzene molecule. The different colours represent different processors in a domain decomposition parallelization.

Besides ground-state and real-time TDDFT, OCTOPUS can do other types of calculations. It can perform Ehrenfest-TDDFT non-adiabatic molecular dynamics [14, 31] and adiabatic molecular dynamics based either on the modified Ehrenfest scheme [32, 33], which inherits the scalability properties of real-time TDDFT, or on the standard Born–Oppenheimer and Car–Parrinello [34] schemes. Different response properties in TDDFT [35] can also be obtained using linear-response formalisms like Casida [36], or the density-functional perturbation theory/Sternheimer approach [37–43]. OCTOPUS can also do quantum optimal-control calculations [44–46] and real-time quantum transport calculations [47]. OCTOPUS can generate the DFT data required for GW and Bethe–Salpeter calculations using the BerkeleyGW code [48].

2.2. Grids

OCTOPUS uses a real-space grid discretization to represent fields such as the Kohn–Sham states and the electronic density. Each function is represented by its value over an array of points distributed in real space. Differential operators are approximated by high-order finite-difference methods [49], while integration is performed by a simple sum over the grid point coefficients.
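As an illustration of how such a discretization works in practice (a minimal 1D sketch in Python, not code from OCTOPUS itself; the grid, spacing and test function are made up for the example), differential operators become short stencils and integrals become plain sums over grid coefficients:

```python
import numpy as np

def laplacian_4th_order(f, h):
    """Fourth-order central finite-difference Laplacian on a uniform 1D grid.

    Stencil (-1, 16, -30, 16, -1)/(12 h^2); the two points at each end are left
    at zero for simplicity (a real code treats boundaries via its box shape).
    """
    lap = np.zeros_like(f)
    lap[2:-2] = (-f[:-4] + 16 * f[1:-3] - 30 * f[2:-2]
                 + 16 * f[3:-1] - f[4:]) / (12.0 * h**2)
    return lap

def integrate(f, h):
    """Integration reduces to a sum over the grid coefficients times the volume element."""
    return h * np.sum(f)

# Check against an analytic case, f(x) = exp(-x^2), f''(x) = (4x^2 - 2) exp(-x^2).
h = 0.05
x = np.arange(-8.0, 8.0, h)
f = np.exp(-x**2)
print(integrate(f, h))                                            # ~ sqrt(pi) = 1.7725
exact = (4 * x**2 - 2) * f
print(np.max(np.abs(laplacian_4th_order(f, h) - exact)[2:-2]))    # small discretization error
```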

The real-space grid approach does not impose a particular form for the boundary conditions, so it is possible to model both finite and periodic systems directly. Moreover, in OCTOPUS the grid boundaries can have an arbitrary shape, avoiding unnecessary grid points. For example, for molecular calculations the default box shape corresponds to the union of spheres centred around the atoms (see figure 1 for an example).

One of the main advantages of the real-space grid approach is that it is possible to systematically control the quality of the discretization. By reducing the spacing and increasing the size of the box, the error is systematically decreased, and can be made as small as desired, at the price of an increased computational cost. This is of particular significance for response properties [42, 50].


While the real-space scheme results in a larger number of discretization coefficients than localized basis-set representations, the discretized Hamiltonian is very sparse. In fact, the number of non-zero components depends linearly on the number of coefficients. Moreover, the Hamiltonian only requires information from near neighbours, which is advantageous for parallelization and optimization.

Finally, since the description of the core regions is expensive with a uniform-resolution discretization, in OCTOPUS the ion–electron interaction is usually modelled using norm-conserving pseudo-potentials [51]. At the moment, the code can read pseudo-potentials in several formats: the SIESTA format [52], the Hartwigsen–Goedecker–Hutter format [53], the Fritz–Haber format [54] and its Abinit version [24], and the Quantum Espresso universal pseudo-potential format [55]. Relativistic corrections, like spin–orbit coupling, can also be included by using relativistic pseudo-potentials [56, 57].

2.3. The software package

The source code of OCTOPUS is publicly available under the GNU General Public License (GPL) v2.0. This allows anyone to use the code, study it, modify it and distribute it, as long as these rights are retained for other users. We believe that this is of particular importance for scientific work [58]. The code is written mainly in Fortran 95, with some parts in C and OpenCL [59]. Currently it consists of about 180 000 lines of code (excluding external libraries).

Since the code is publicly available, it is essential to provide documentation so users can learn how to use it. The OCTOPUS website11 contains a user manual and several tutorials that teach users how to perform different types of calculations, including some basic examples. Additionally, all input variables have their own documentation that can be accessed through the website or from a command-line utility. A mailing list is also available, where users can get help with specific questions about OCTOPUS from the developers or other users.

One of the most important points in developing a scientific code is to ensure the correctness of the results. When a new feature is implemented, the developers validate the results by comparing them with known results from other methods and other codes. To ensure that future changes do not modify the results, we use an automated system (BuildBot [60]) that runs the code periodically and compares the results with reference data. A short set of tests is executed each time a change is made in OCTOPUS, while a long one is executed every day. The tests are run on different platforms and with different compilation options, to ensure that the results are consistent across platforms. Users should also run the testsuite to validate the build on their machine before running real calculations.

To avoid users inadvertently using parts of the code that are still being developed or have not been properly validated, these parts are marked as 'experimental'. Experimental features can only be used by explicitly setting a variable in the input file. In any case, users are expected to validate their results in known situations before making scientific predictions based on OCTOPUS results.

11 http://tddft.org/programs/octopus/.

3. Parallelization

In order to take advantage of modern-day architectures, OCTOPUS uses a hybrid parallelization scheme. This scheme is based on a distributed-memory approach that uses the message passing interface (MPI) library for the communication between processes. This is combined with fine-grained parallelism inside each process using either OpenCL or OpenMP.

The MPI parallelization is mainly based on a tree-based data-parallelism approach, even if some steps have already been taken to take advantage of task parallelism. The main piece of data to be divided among processes is the set of KS states, an object that depends on three main indices: a combined k-point and spin index, the state index, and the space coordinate. Each of these indices is associated with a data-parallelization level, where each process is assigned a section of the total range of the index. Note that since the k-point and spin index are both symmetry-related quantum numbers, it is convenient to combine them into a unique index that labels the wavefunctions.

This multi-level parallelization scheme is essential to ensure the scaling of real-time TDDFT. As the size of the system is increased, two factors affect the computational time: first, the region of space that needs to be simulated increases and, second, the number of electrons increases. By dividing each of these degrees of freedom among processors, multi-level parallelization ensures that the total parallel efficiency remains constant as we increase the system size and the number of processors.
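A minimal sketch of such a multi-level decomposition (written in Python with mpi4py purely for illustration; OCTOPUS itself is Fortran/MPI, and the group sizes below are made-up parameters) splits the global communicator into sub-communicators for state parallelization and domain parallelization:

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

# Hypothetical layout, e.g. run with: mpirun -n 8 python multilevel.py
n_state_groups = 4                       # processes arranged as a 4 x (size/4) grid
assert size % n_state_groups == 0
n_domain_groups = size // n_state_groups

state_color = rank // n_domain_groups    # which block of KS states this process holds
domain_color = rank % n_domain_groups    # which spatial domain this process holds

# Ranks sharing state_color exchange grid boundaries and reduce integrals (domain level);
# ranks sharing domain_color exchange state-related data (state level).
domain_comm = world.Split(color=state_color, key=rank)
state_comm = world.Split(color=domain_color, key=rank)

print(f"rank {rank}: states group {state_color}, "
      f"domain {domain_comm.Get_rank()} of {domain_comm.Get_size()}")
```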

3.1. Parallelization in k-points and spin

For independent particles, the Schrödinger Hamiltonian can be exactly partitioned according to the k-point and spin labels. This means that each one of the subproblems, i.e. each k-point and spin label, can be solved independently of the others, thereby reducing the dimension of the Hamiltonian and the computational complexity. Mathematically, in (time-dependent) DFT this is no longer true, as the subproblems are mixed by the density. However, a large part of the numerical solution can still be partitioned effectively, making the parallelization in k-points and spin very efficient, as little communication is required. Such a scheme does not always help for scaling, unfortunately. For example, for finite systems only a single k-point is used, and in many cases we are interested in spin-unpolarized calculations. In this extreme case, there is no advantage at all in this parallelization level.

3.2. Parallelization in Kohn–Sham states

The next level of parallelization concerns the distribution of the KS states between processors. The problem is very different for ground-state DFT and for real-time TDDFT, so we discuss these cases separately.



For real-time TDDFT the propagation of each orbital is almost independent of the others, as the only interaction occurs through the time-dependent density. This again leads to a very efficient parallelization scheme for time propagation that is only limited by the number of available KS states. In fact, when too few states are given to each process (assuming that no other parallelization level is used), the cost of state-independent calculations starts to dominate (in particular the cost of solving the Poisson equation required to obtain the Hartree potential). As a rule of thumb, one should not have fewer than 4 states per process in order not to lose efficiency. Note that when ions are allowed to move during the propagation, a complementary parallelization over atoms is used by OCTOPUS. This ensures that the routines that calculate forces and that regenerate the atomic potential at each step do not spoil the very favourable scaling of TDDFT.
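The structure of one propagation step under this scheme can be sketched as follows (an illustrative Python/mpi4py toy, not OCTOPUS code: the grid, the diagonal toy Hamiltonian and the fourth-order Taylor propagator are all assumptions made for the example). Each process propagates only the states it owns, and the single collective operation is the reduction that rebuilds the density:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def propagate_step(phi_local, apply_h, dt, order=4):
    """Propagate locally stored states with a Taylor expansion of exp(-i H dt).

    phi_local: (n_local_states, n_points) complex array; apply_h(psi) applies the
    (frozen, for this sketch) Hamiltonian to a single state.
    """
    out = np.empty_like(phi_local)
    for k, psi in enumerate(phi_local):
        term, result = psi.copy(), psi.copy()
        for n in range(1, order + 1):
            term = (-1j * dt / n) * apply_h(term)
            result = result + term
        out[k] = result
    return out

# Toy setup: two states per process, purely local (diagonal) Hamiltonian.
n_points = 64
v = np.linspace(0.0, 1.0, n_points)                  # toy potential
apply_h = lambda psi: v * psi
phi = np.random.rand(2, n_points) + 0j

phi = propagate_step(phi, apply_h, dt=0.01)

# The only communication per step: sum |phi_i|^2 over all processes.
local_density = np.sum(np.abs(phi)**2, axis=0)
density = np.empty_like(local_density)
comm.Allreduce(local_density, density, op=MPI.SUM)
```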

For ground-state calculations the parallelization over states is more complex than for the time-propagation case, as the orthogonality constraint forces the diagonalization procedure to explicitly mix different states. Regarding parallelization, the most efficient eigensolver implemented in OCTOPUS turns out to be the residual minimization method/direct inversion in the iterative subspace (RMM-DIIS) [61, 62], where the mixing of orbitals is restricted to two procedures: the Gram–Schmidt orthogonalization [63] and the subspace diagonalization. To parallelize these operations we use the parallel linear-algebra library ScaLAPACK [64]. State parallelization of these procedures is particularly important, as they involve matrices whose dimension is given by the number of KS states. Without state parallelization a complete copy of these matrices is kept by each process, which can easily lead to memory problems for systems with thousands of atoms.

3.3. Parallelization in domains

The final level of MPI parallelization consists in assigning a certain number of grid points to each process. In practice, the space is divided into domains that are assigned to each process. As finite-difference operators, such as the Laplacian, only require the values of neighbouring points, only the (small) boundary regions between domains need to be communicated between processors. In OCTOPUS this is done asynchronously, which means that the boundary values are copied between processors while the Laplacian calculation is done over the points in the central part of the domain that do not require this information. This substantially reduces the problems related to latency, thereby improving the scaling with the number of processes.
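A minimal 1D sketch of this overlap of communication and computation (again illustrative Python/mpi4py, not the OCTOPUS implementation; the ring topology, grid size and second-order stencil are assumptions of the example): each process starts non-blocking sends and receives for its boundary values, applies the stencil to the interior points that need no remote data, and only then waits for the ghost points and finishes the edges:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size    # periodic ring of 1D domains

n_local, h = 1000, 0.1
f = np.random.rand(n_local)                           # local portion of the field
ghost_l, ghost_r = np.empty(1), np.empty(1)

# 1) Start the asynchronous exchange of boundary values (tag 0: rightmost point
#    travelling right; tag 1: leftmost point travelling left).
reqs = [comm.Isend(f[-1:], dest=right, tag=0), comm.Irecv(ghost_l, source=left, tag=0),
        comm.Isend(f[:1], dest=left, tag=1), comm.Irecv(ghost_r, source=right, tag=1)]

# 2) Compute the Laplacian on interior points while the messages are in flight.
lap = np.empty_like(f)
lap[1:-1] = (f[:-2] - 2 * f[1:-1] + f[2:]) / h**2

# 3) Wait for the ghost points and complete the two edge points of the domain.
MPI.Request.Waitall(reqs)
lap[0] = (ghost_l[0] - 2 * f[0] + f[1]) / h**2
lap[-1] = (f[-2] - 2 * f[-1] + ghost_r[0]) / h**2
```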

Note that the calculation of integrals also requires a communication step, to add up the partial sums over each domain, an operation known as a reduction. The strategy to reduce the cost of this communication is to group reductions together, as the cost of reducing small vectors of data is dominated by the latency of the communication.

An important issue in domain parallelization is selecting which points are assigned to each processor. This task, known as grid partitioning, is not trivial for grids with an arbitrary shape. Not only must the number of points be balanced between processors, but the number of points in the boundary regions must also be minimized. OCTOPUS relies on external libraries for this task, with two currently supported: METIS [65] and ZOLTAN [66]. An example of the grid partitioning scheme is shown in figure 1.

Certainly, the communication cost of the domain parallelization is considerably higher than for the other schemes. It is also the most complicated to implement. However, once the basic grid operations are implemented, it is almost transparent for developers to write parallel code.

3.4. Parallelization of the Poisson solver

In order to obtain the Hartree potential from the electronic density, we solve the Poisson equation

$$\nabla^2 v_{\mathrm{Hartree}}(\mathbf{r}) = -4\pi n(\mathbf{r}). \qquad (3)$$

When performing real-time propagation, the solution of this equation becomes the main bottleneck when thousands of processors are used [67]. The reason is clear: as there is only one Poisson equation to solve (regardless of the number of KS states), domain partitioning is the only available scheme to parallelize this operation. Unfortunately, the solution of the Poisson equation is highly non-local, mixing information from all points, which makes it unsuitable for a domain decomposition approach that relies on spatial locality. This can be easily seen from the integral form of equation (3)

$$v_{\mathrm{Hartree}}(\mathbf{r}) = \int \mathrm{d}\mathbf{r}'\, \frac{n(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}. \qquad (4)$$

Fortunately, several solutions to this problem exist, given the wide use of the Poisson equation in different fields. In OCTOPUS this parallelization level is handled in a special way, and all processes are used independently of the distribution scheme. We found two particularly efficient approaches.

The first one is the use of fast Fourier transforms (FFTs) to evaluate (4) in reciprocal space, a very fast method when running on a single processor. However, FFTs are not easy to parallelize. Additionally, the standard FFT approach yields a Poisson potential with periodic boundary conditions, so some modifications have to be made to obtain the free-space solution. Two different FFT-based solvers are available in OCTOPUS: (i) the software package provided by Genovese et al [68], which uses an interpolating scaling functions (ISF) approach and the parallel FFT routine by Goedecker [69]; (ii) the parallel fast Fourier transform (PFFT) library [70] combined with the free-space Coulomb kernel proposed by Rozzi et al [71].
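For reference, the reciprocal-space evaluation behind these solvers reduces, for a periodic cell, to multiplying the Fourier components of the density by the kernel 4π/k². The following serial NumPy sketch (our own toy example, with a made-up Gaussian density; the free-space solvers mentioned above replace this periodic kernel by modified ones) shows the idea:

```python
import numpy as np

def hartree_periodic(n, h):
    """Solve nabla^2 v = -4 pi n on a periodic cubic grid via FFT (atomic units).

    n: density on an (N, N, N) grid with spacing h. The k = 0 component is set
    to zero, which amounts to assuming a neutralizing background charge.
    """
    N = n.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(N, d=h)          # angular wavenumbers
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    n_k = np.fft.fftn(n)
    v_k = np.zeros_like(n_k)
    mask = k2 > 0.0
    v_k[mask] = 4.0 * np.pi * n_k[mask] / k2[mask]    # v(k) = 4 pi n(k) / k^2
    return np.real(np.fft.ifftn(v_k))

# Toy usage: a Gaussian charge distribution on a 32^3 periodic grid.
N, h = 32, 0.5
x = (np.arange(N) - N // 2) * h
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
rho = np.exp(-(X**2 + Y**2 + Z**2))
v_hartree = hartree_periodic(rho, h)
print(v_hartree.shape, float(v_hartree.max()))
```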

The second parallel Poisson solver we tested is the fast multipole method (FMM) [72]. This is an approximate method that reduces the complexity of evaluating equation (4) by using a multipole expansion to approximate the far-field effect of several charges with a single term. Our implementation is based on the FMM library [73].


Figure 2. Time required to solve the Poisson equation for different solvers implemented in OCTOPUS. Grid of size 153 on an IBM Blue Gene/P system.

This library is designed to calculate the interaction between point charges; therefore, we have to assume that the density on the grid corresponds to an array of point charges, and then calculate a correction for the interaction between neighbouring points [74]. This correction term, essential to obtain the necessary accuracy, has the form of a finite-difference operator and is therefore simple to evaluate in our framework.

In figure 2, we compare the time required to solve the Poisson equation in a parallelepiped box of size 153. As we can see, the ISF method is faster for small numbers of processors, while PFFT and FMM become more competitive as their number increases. The crossover point is quite low for the PFFT method (128 processors) and higher for FMM (4096 processors). More details about the parallel solution of the Poisson equation in OCTOPUS can be found in [74].

4. GPU parallelization and code optimization

Graphics processing units were initially designed for generating graphics in real time. They are massively parallel processors with hundreds or thousands of execution units. Given their particular architecture, GPUs require code written in an explicitly parallel language. The OCTOPUS support for GPUs is based on OpenCL, an open and platform-independent framework for high-performance computing on parallel processors. OpenCL implementations are available for GPUs from different vendors, for multi-core CPUs, and for dedicated accelerator boards. Since the OpenCL standard only defines a C interface, we have developed our own interface to call OpenCL from Fortran. This interface is currently available as an independent library, FortranCL [75].

For optimal performance, GPUs must process several streams of independent data simultaneously. This implies that performance-critical routines must receive a considerable amount of data on each call. To do this, the GPU optimization strategy of OCTOPUS is based on the concept of a block of states, i.e., a small group of KS states. Performance-critical routines are designed to operate over these blocks. This approach provides a larger potential for data parallelism than routines that operate over a single state at a time. It turns out that this strategy also works well for CPUs with vectorial floating-point units, where parallelization is based on the OpenMP framework combined with explicit vectorization using compiler directives.
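The gain from blocks of states can be illustrated with a small NumPy sketch (our own example, not OCTOPUS code; the stencil, block size and grid are arbitrary choices): by storing a block as an array whose fastest (contiguous) index is the state index, every stencil coefficient updates all states of the block in one vectorized operation.

```python
import numpy as np

def apply_kinetic_block(stencil, phi_block, h):
    """Apply a 1D finite-difference kinetic operator (-1/2 Laplacian) to a block of states.

    phi_block: (n_points, n_states_in_block) array, state index contiguous in memory.
    Edge points receive a truncated stencil; a real code treats boundaries explicitly.
    """
    n_points = phi_block.shape[0]
    out = np.zeros_like(phi_block)
    half = len(stencil) // 2
    for k, c in enumerate(stencil):
        shift = k - half
        src = slice(max(0, shift), n_points + min(0, shift))
        dst = slice(max(0, -shift), n_points + min(0, -shift))
        out[dst] += c * phi_block[src]      # one vector operation per stencil point
    return -0.5 * out / h**2

# Fourth-order Laplacian stencil applied to a block of 8 states on 1000 points.
stencil = np.array([-1.0, 16.0, -30.0, 16.0, -1.0]) / 12.0
block = np.random.rand(1000, 8)
print(apply_kinetic_block(stencil, block, h=0.2).shape)
```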

For both GPUs and CPUs, memory access tends to be the main limitation to the performance of OCTOPUS. Working with blocks of orbitals improves memory access, provided that the coefficients are ordered in memory by the state index, so that loads and stores are done from and to sequential addresses. However, increasing the KS block size can have a negative effect on memory access, as larger data sets are less likely to benefit from cache memory. This is particularly critical for CPUs, which depend much more than GPUs on caching for optimal performance.

One routine that can greatly benefit from caching is the application of the finite-difference Laplacian operator required for the kinetic-energy term. This is also an important operation, as it represents a considerable part of the execution time of OCTOPUS. We devised an approach to improve cache utilization by controlling how grid points are ordered in memory, i.e., how the three-dimensional grid is enumerated. The standard approach is to use a row-major or column-major order, which often leads to neighbouring points being allocated in distant memory locations. Our approach is instead to enumerate the grid points based on a sequence of small cubic grids, so that close spatial regions are stored close in memory, improving memory locality for the Laplacian operator. The effect of this optimization can be seen in figure 3. For the CPU with the standard ordering of points, the performance decreases as the block size is increased, while with the optimized grid order the parallelism exposed by a larger block size allows a performance gain of approximately 40%. For the GPU the effect of the optimization is less dramatic but still significant.
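The following Python sketch shows the kind of blocked enumeration meant here (a simplified illustration of the idea, not the actual ordering algorithm of OCTOPUS; the grid size, block size and test point are arbitrary):

```python
def blocked_order(n, b):
    """Enumerate an n x n x n grid sub-cube by sub-cube (blocks of b x b x b points),
    so that spatially close points receive nearby memory positions."""
    order = []
    for bx in range(0, n, b):
        for by in range(0, n, b):
            for bz in range(0, n, b):
                for x in range(bx, min(bx + b, n)):
                    for y in range(by, min(by + b, n)):
                        for z in range(bz, min(bz + b, n)):
                            order.append((x, y, z))
    return order

n, b = 32, 4
row_major = {(x, y, z): x * n * n + y * n + z
             for x in range(n) for y in range(n) for z in range(n)}
blocked = {point: i for i, point in enumerate(blocked_order(n, b))}

# Memory distance between a point and its +x neighbour when both lie inside one block:
p, q = (9, 9, 9), (10, 9, 9)
print(row_major[q] - row_major[p])   # n*n = 1024: the neighbour is far away in memory
print(blocked[q] - blocked[p])       # b*b = 16: the neighbour stays nearby in memory
```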

In figure 4 we show a comparison of the numerical throughput of the GPU and CPU implementations of the Laplacian operator as a function of the size of the KS states block. It can be seen that the use of blocks of KS states represents a significant numerical performance gain with respect to working with one state at a time. This is particularly important for GPUs, where the performance with a single state is similar to that of the CPU but can be tripled by using a block of size 16 or 32. The same conclusion can be reached by looking at the numerical throughput of the orbital propagation and the total time required for a TDDFT iteration (see figure 5 and table 1). The use of GPUs gives quite spectacular improvements, with the total iteration time being decreased by more than a factor of 6 with respect to the optimized multi-threaded CPU implementation. In fact, the propagation of the KS orbitals is eight times faster on the GPU, but currently the total speed-up is limited by the Poisson solver, which is executed on the CPU. These results mean that, using a single GPU, OCTOPUS could obtain the absorption spectrum of the C60 molecule in about one hour. More details about the GPU implementation in OCTOPUS can be found in [76, 77].


Figure 3. Effect of the optimization of the grid mapping for data locality on the numerical throughput of the Laplacian operator as a function of the size of the KS states block. Left: computations with an AMD FX 8150 CPU (8 threads). Right: computations with an AMD/ATI Radeon HD 5870 GPU.

Figure 4. Numerical throughput of the OCTOPUS implementation of the Laplacian operator for different processors as a function of the number of KS states in a block. Spherical grid with 5 × 10^5 points. For each point the grid ordering has been adjusted for optimal cache usage.

Table 1. OCTOPUS propagation time per step for different processors. C60 molecule with a grid of spheres of radius 10 a.u. around each atom and spacing 0.375 a.u.

Processor                          Time per step (s)
CPU AMD FX 8150 (8 threads)        9.45
GPU AMD/ATI Radeon HD 5870         1.86
GPU NVIDIA GeForce GTX 480         1.72

5. Scalability in massively parallel systems

In this section we show how all the improvements in algorithms, parallelization, and optimization are combined in the simulation of large electronic systems. As a benchmark we used portions of the spinach photosynthetic unit [78] with 180, 441, 650, 1365 and 2676 atoms, containing several chlorophyll units. As these are molecular systems, only the state and domain decomposition parallelization levels are used. We emphasize that these are real-world examples that were not tweaked in order to artificially improve throughput or scalability.

Figure 5. Numerical throughput of the KS state propagation in OCTOPUS for different processors as a function of the size of the KS states blocks. C60 molecule with a grid of spheres of radius 10 a.u. around each atom and spacing 0.375 a.u.

We used different supercomputers for our benchmarks, as using more than one machine ensures a better portability of the code and demonstrates the inherent scalability of the algorithm. The first one is Curie, a supercomputer that belongs to the French Commissariat à l'Énergie Atomique (CEA) and consists of Intel Xeon processors interconnected by an Infiniband network. In total it has 11 520 cores available to users. The second platform is the IBM Blue Gene/P, a design based on low-power processors (IBM PowerPC 450) interconnected by two networks, one with a toroidal topology and the other with a tree topology. The Blue Gene/P is a challenging architecture as it has a very limited amount of memory per core (512 MiB). Because of this, inside each node OpenMP parallelization is used, as this guarantees a minimal replication of data. Compiler directives are also used to profit from the vectorial floating-point units. The Blue Gene/P calculations presented in this work were performed on the 16 384-core supercomputer of the Rechenzentrum Garching of the Max Planck Society, Germany.

We benchmark separately the two main tasks performed by OCTOPUS.


Figure 6. Ground-state calculation on a Blue Gene/P system for different numbers of atoms. Left: wall-clock time per self-consistency iteration. Right: parallel speed-up compared with the ideal case.

Figure 7. Real-time TDDFT propagation on a Blue Gene/P system for different numbers of atoms. Left: wall-clock time per time step. Right: parallel speed-up compared with the ideal case.

The first concerns the calculation of the ground state of the system, while the second regards real-time TDDFT runs. We stress again that the parallelization of the first task is considerably more difficult than that of the second. However, the parallelization of the time-dependent runs is much more important, as these consume much more computational time than the ground-state runs.

Note that OCTOPUS has to perform several initializations (create the grid, reserve memory, prepare the MPI environment, etc) before starting the actual simulation. This time is usually negligible compared to the total simulation time. Our results are therefore measured in terms of elapsed wall-clock time per self-consistency iteration (for the ground state) or per time step (for the time-dependent runs). The total execution time of a real-world simulation is essentially proportional to these numbers.

5.1. Ground-state calculations

We start by showing scalability tests for ground-state calculations for systems of 180 and 441 atoms executed on a Blue Gene/P system. In order to measure the ground-state iteration time, 10 self-consistency cycles were run and the average time is shown in figure 6. We also show the parallel speed-up, i.e. the relative performance with respect to the run with the smallest number of processors. Excellent scalability is achieved up to 256–512 cores, and up to 4096 cores the time per iteration still decreases as the number of processors is increased. As expected, scaling improves for systems with a larger number of atoms.
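For reference, the speed-up and efficiency figures quoted in this section follow the usual convention (our notation, not reproduced from the original article): relative to a baseline run on $P_0$ cores with wall-clock time $T(P_0)$,

$$S(P) = P_0\,\frac{T(P_0)}{T(P)}, \qquad E(P) = \frac{S(P)}{P},$$

so that ideal scaling corresponds to $S(P) = P$ and $E(P) = 1$; for instance, a speed-up of 3785 on 16 384 cores corresponds to an efficiency of $3785/16\,384 \approx 23\%$.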

Recall that all runs presented here were performed in symmetric multiprocessing (SMP) mode, combining inner-loop OpenMP parallelization with domain MPI parallelization. This allows us not only to take advantage of the architecture of the CPU, but also to make better use of the (limited) memory available in each of the Blue Gene nodes. For instance, the run with 1024 cores was done by launching 256 MPI processes, each of them with 4 OpenMP threads.

5.2. Real-time TDDFT calculations

In figure 7, we show the execution times and scaling for time-propagation runs for the systems of 180, 650 and 1365 atoms, executed on a Blue Gene/P system also in SMP mode. We combined different ways of partitioning the cores between the domain and states parallelization, selecting only the most efficient combination. We used a minimum of 16 cores with 180 atoms, 256 cores with 650 atoms and 1024 cores with 1365 atoms, as runs with fewer cores were not possible due to memory limitations.


Figure 8. Real-time TDDFT propagation on the Curie system for different numbers of atoms. Left: wall-clock time per time step. Right: parallel speed-up compared with the ideal case.

Figure 9. Scaling of the time propagation with multiple Nvidia Tesla M2090 GPUs. Left: wall-clock time per time step. Right: parallel speed-up compared with the ideal case.

Almost perfect scaling was achieved up to 512 cores with the system of 180 atoms, and a remarkable speed-up of 3785 was obtained with 16 384 cores, reaching an efficiency of 23% with 91 cores per atom. In the case of the system composed of 650 atoms the ideal speed-up is maintained up to 2048 cores, while for 16 384 cores a value of 7222 is achieved. Even better performance is obtained with the system of 1365 atoms, with a speed-up of 11 646 for 16 384 cores. Thus, we can maintain an almost ideal speed-up with around 3 cores per atom, and acceptable speed-ups up to 50 cores per atom.

Benchmark tests were also performed on the Curie supercomputer. The results are shown in figure 8. The main advantage of this machine is that it has more powerful processors and more memory per core. The drawback, however, is that the network is not as reliable as the Blue Gene one: the network is a shared resource, and its performance depends not only on the current run but also on all other running jobs. Consequently, only the minimum iteration time is shown here, as an example of the best possible execution. Nevertheless, and confirming the scalability of the algorithm, remarkable speed-ups are achieved up to 8192 cores.

We note that these tests were performed with the PFFT Poisson solver, which shows a great improvement over the default ISF solver when a large number of nodes is used. The improvements made to the Poisson solver are reflected in the entire time-dependent iteration time: for large numbers of processes, the total iteration time is improved by up to 58% with respect to the ISF solver.

5.3. Combined MPI–GPU parallelization

OCTOPUS can also combine the MPI and OpenCL parallelizations in a hybrid approach in order to use multiple GPUs. This creates some additional challenges when compared to multiple-CPU parallelization. First of all, as GPUs offer higher computational throughput than CPUs, the time spent in communication becomes a larger fraction of the total execution time. Secondly, communication times can become higher, as transferring data between GPUs is a three-step procedure: first the data is copied to main memory using OpenCL, then it is sent to the other process using MPI, and finally it is copied to the memory of the second GPU using OpenCL. Still, OCTOPUS scales reasonably well when running on multiple GPUs, as can be seen in figure 9. The speed-up for 8 GPUs is 5.7, an efficiency of 71%. Note that the main limitation to scalability in this case is the lack of a GPU-accelerated parallel Poisson solver.


6. Conclusions

In summary, our results show that the real-time solution of the TDDFT equations is a very parallelisable task. In our calculations we can scale with a reasonable efficiency to almost 100 cores per atom, which allows for real-time TDDFT simulations of systems with thousands of atoms. We would like to remark that all the calculations in this paper were performed using simulation parameters that correspond to actual production calculations, and that they were not tweaked to obtain better scalability. This parallelisability can also be exploited for efficient execution on parallel processors, including GPUs and multi-core CPUs with vectorial floating-point units.

We have also shown a hybrid parallelization scheme for execution on clusters of GPUs. This approach combines the high-level MPI parallelization with a low-level parallelization based on OpenCL. For the moment our tests have been limited to a small number of GPUs, but clusters with thousands of GPUs are already available, so this approach will certainly be developed further with an eye on exaflop computing.

The current capabilities of OCTOPUS ensure that the excited-state properties of systems with thousands of atoms can be studied in the immediate future. The flexibility of the real-time TDDFT method means that not only electronic linear-response properties can be studied: non-linear phenomena or the effect of ionic motion on the response can also be tackled with this formalism. This is of particular interest, for example, in the study of quantum effects in biological molecules, or in the interaction of matter with strong laser fields.

Acknowledgments

We would like to thank all those who have contributed to the development of OCTOPUS, in particular F Lorenzen, H Appel, D Nitsche, C A Rozzi, and M Verstraete. We also acknowledge P García-Risueño, I Kabadshow, and M Pippig for their contribution to the implementation of the FMM and PFFT Poisson solvers in OCTOPUS.

We acknowledge the PRACE Research Infrastructure for providing access to Jugene at the Forschungszentrum Jülich and to Curie at the Commissariat à l'Énergie Atomique. We also acknowledge the Rechenzentrum Garching of the Max Planck Society for access to their Blue Gene/P. Computational resources were also provided by the US Department of Energy at Lawrence Berkeley National Laboratory's NERSC facility.

XA and AA acknowledge hardware donations by Advanced Micro Devices, Inc. (AMD) and Nvidia Corp., and support from the US NSF CDI program (PHY-0835713). MALM acknowledges support from the French ANR (ANR-08-CEXC8-008-01) and computational resources from GENCI (project x2011096017). JAR acknowledges the scholarship of the University of the Basque Country UPV/EHU. DAS acknowledges support from the US NSF IGERT and Graduate Research Fellowship Programs, and by NSF Grant No. DMR10-1006184. SGL was supported by NSF Grant No. DMR10-1006184 and by the Director, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division, US Department of Energy under Contract No. DE-AC02-05CH11231. The authors were partially supported by the European Research Council Advanced Grant DYNamo (ERC-2010-AdG, Proposal No. 267374), Spanish Grants (FIS2011-65702-C02-01 and PIB2010US-00652), ACI-Promociona (ACI2009-1036), Grupos Consolidados UPV/EHU del Gobierno Vasco (IT-319-07), University of the Basque Country UPV/EHU general funding for research groups ALDAPA (GIU10/02), Consolider nanoTHERM (Grant No. CSD2010-00044) and European Commission projects CRONOS (280879-2 CRONOS CP-FP7) and THEMA (FP7-NMP-2008-SMALL-2, 228539).

References

[1] Moore G E 1965 Electronics 38 114–7
[2] Runge E and Gross E K U 1984 Phys. Rev. Lett. 52 997
[3] Bertsch G F, Iwata J-I, Rubio A and Yabana K 2000 Phys. Rev. B 62 7998
[4] Yabana K and Bertsch G F 1996 Phys. Rev. B 54 4484
[5] Castro A, Marques M A L, Alonso J A and Rubio A 2004 J. Comput. Theor. Nanosci. 1 231
[6] Castro A, Marques M A L, Alonso J A, Bertsch G F and Rubio A 2004 Eur. Phys. J. D 28 211
[7] Takimoto Y, Vila F D and Rehr J J 2007 J. Chem. Phys. 127 154114
[8] Yabana K and Bertsch G F 1999 Phys. Rev. A 60 1271
[9] Varsano D, Espinosa Leal L A, Andrade X, Marques M A L, di Felice R and Rubio A 2009 Phys. Chem. Chem. Phys. 11 4481
[10] Marques M A L, Castro A, Malloci G, Mulas G and Botti S 2007 J. Chem. Phys. 127 014107
[11] Aggarwal R, Farrar L, Saikin S, Andrade X, Aspuru-Guzik A and Polla D 2012 Solid State Commun. 152 204
[12] Castro A, Marques M A L and Rubio A 2004 J. Chem. Phys. 121 3425
[13] Helgaker T, Jorgensen P and Olsen J 2012 Molecular Electronic-Structure Theory (New York: Wiley)
[14] Marques M A L, Castro A, Bertsch G F and Rubio A 2003 Comput. Phys. Commun. 151 60
[15] Castro A, Appel H, Oliveira M, Rozzi C A, Andrade X, Lorenzen F, Marques M A L, Gross E K U and Rubio A 2006 Phys. Status Solidi b 243 2465
[16] Hohenberg P and Kohn W 1964 Phys. Rev. 136 B864
[17] Kohn W and Sham L J 1965 Phys. Rev. 140 A1133
[18] Becke A D 1993 J. Chem. Phys. 98 1372
[19] Kümmel S and Perdew J P 2003 Phys. Rev. Lett. 90 043004
[20] Andrade X and Aspuru-Guzik A 2011 Phys. Rev. Lett. 107 183002
[21] Marques M A L, Oliveira M J T and Burnus T 2012 Comput. Phys. Commun. submitted
[22] Oliveira M J T and Nogueira F 2008 Comput. Phys. Commun. 178 524
[23] Mortensen J J, Hansen L B and Jacobsen K W 2005 Phys. Rev. B 71 035109
[24] Gonze X et al 2009 Comput. Phys. Commun. 180 2582
[25] Helbig N, Tokatly I V and Rubio A 2010 Phys. Rev. A 81 022504
[26] Helbig N, Fuks J I, Casula M, Verstraete M J, Marques M A L, Tokatly I V and Rubio A 2011 Phys. Rev. A 83 032503
[27] Fuks J I, Helbig N, Tokatly I V and Rubio A 2011 Phys. Rev. B 84 075107
[28] Fuks J I, Rubio A and Maitra N T 2011 Phys. Rev. A 83 042501
[29] Olivares-Amaya R, Stopa M, Andrade X, Watson M A and Aspuru-Guzik A 2011 J. Phys. Chem. Lett. 2 682
[30] Marques M A L, Lopez X, Varsano D, Castro A and Rubio A 2003 Phys. Rev. Lett. 90 258101
     Lopez X, Marques M A L, Castro A and Rubio A 2005 J. Am. Chem. Soc. 127 12329
[31] Meng S and Kaxiras E 2008 J. Chem. Phys. 129 054110
[32] Alonso J L, Andrade X, Echenique P, Falceto F, Prada-Gracia D and Rubio A 2008 Phys. Rev. Lett. 101 096403
[33] Andrade X, Castro A, Zueco D, Alonso J L, Echenique P, Falceto F and Rubio A 2009 J. Chem. Theory Comput. 5 728
[34] Car R and Parrinello M 1985 Phys. Rev. Lett. 55 2471
[35] Strubbe D A, Lehtovaara L, Rubio A, Marques M A L and Louie S G 2012 Fundamentals of Time-Dependent Density Functional Theory (Lecture Notes in Physics vol 837) ed M A L Marques, N T Maitra, F M S Nogueira, E K U Gross and A Rubio (Berlin: Springer) pp 139–66
[36] Casida M E, Jamorski C, Bohr F, Guan J and Salahub D R 1996 Optical Properties from Density-Functional Theory (Washington, DC: American Chemical Society) chapter 9, pp 145–63
[37] Baroni S, de Gironcoli S, Dal Corso A and Giannozzi P 2001 Rev. Mod. Phys. 73 515
[38] Kadantsev E S and Stott M J 2005 Phys. Rev. B 71 045104
[39] Andrade X, Botti S, Marques M A L and Rubio A 2007 J. Chem. Phys. 126 184106
[40] Botti S, Castro A, Andrade X, Rubio A and Marques M A L 2008 Phys. Rev. B 78 035333
[41] Botti S, Castro A, Lathiotakis N N, Andrade X and Marques M A L 2009 Phys. Chem. Chem. Phys. 11 4523
[42] Vila F D, Strubbe D A, Takimoto Y, Andrade X, Rubio A, Louie S G and Rehr J J 2010 J. Chem. Phys. 133 034111
[43] Zhang G P, Strubbe D A, Louie S G and George T F 2011 Phys. Rev. A 84 023837
[44] Brif C, Chakrabarti R and Rabitz H 2010 New J. Phys. 12 075008
[45] Krieger K, Castro A and Gross E K U 2011 Chem. Phys. 391 50
[46] Castro A and Gross E K U 2012 Fundamentals of Time-Dependent Density Functional Theory (Lecture Notes in Physics vol 837) ed M A L Marques, N T Maitra, F M S Nogueira, E K U Gross and A Rubio (Berlin: Springer) pp 265–76
[47] Kurth S, Stefanucci G, Almbladh C-O, Rubio A and Gross E K U 2005 Phys. Rev. B 72 035308
[48] Deslippe J, Samsonidze G, Strubbe D A, Jain M, Cohen M L and Louie S G 2012 Comput. Phys. Commun. 183 1269
[49] Chelikowsky J R, Troullier N and Saad Y 1994 Phys. Rev. Lett. 72 1240
[50] Rappoport D 2011 ChemPhysChem 12 3404
[51] Troullier N and Martins J L 1991 Phys. Rev. B 43 1993
[52] Soler J M, Artacho E, Gale J D, García A, Junquera J, Ordejón P and Sánchez-Portal D 2002 J. Phys.: Condens. Matter 14 2745
[53] Hartwigsen C, Goedecker S and Hutter J 1998 Phys. Rev. B 58 3641
[54] Fuchs M and Scheffler M 1999 Comput. Phys. Commun. 119 67
[55] Giannozzi P et al 2009 J. Phys.: Condens. Matter 21 395502
[56] Castro A, Marques M A L, Romero A H, Oliveira M J T and Rubio A 2008 J. Chem. Phys. 129 144110
[57] Oliveira M J T, Nogueira F, Marques M A L and Rubio A 2009 J. Chem. Phys. 131 214302
[58] Ince D C, Hatton L and Graham-Cumming J 2012 Nature 482 485
[59] Munshi A (ed) 2009 The OpenCL Specification (Philadelphia, PA: Khronos Group)
[60] http://buildbot.net/
[61] Pulay P 1980 Chem. Phys. Lett. 73 393
[62] Kresse G and Furthmüller J 1996 Phys. Rev. B 54 11169
[63] Gram J P 1883 J. Reine Angew. Math. 94 71
     Schmidt E 1907 Math. Ann. 63 433
[64] Blackford L S et al 1996 Proc. 1996 ACM/IEEE Conf. on Supercomputing (CDROM) (Washington, DC: IEEE Computer Society)
[65] Karypis G and Kumar V 1998 SIAM J. Sci. Comput. 20 359–92
[66] Devine K, Boman E, Heaphy R, Hendrickson B and Vaughan C 2002 Comput. Sci. Eng. 4 90
[67] Alberdi-Rodriguez J 2010 Analysis of Performance and Scaling of the Scientific Code Octopus (Saarbrücken, Germany: LAP LAMBERT Academic Publishing) http://nano-bio.ehu.es/files/alberdi master.pdf
[68] Genovese L, Deutsch T, Neelov A, Goedecker S and Beylkin G 2006 J. Chem. Phys. 125 074105
[69] Goedecker S 1993 Comput. Phys. Commun. 76 294
     Goedecker S, Boulet M and Deutsch T 2003 Comput. Phys. Commun. 154 105
[70] Pippig M 2010 Proc. HPC Status Conf. of the Gauss-Allianz e.V.
[71] Rozzi C A, Varsano D, Marini A, Gross E K U and Rubio A 2006 Phys. Rev. B 73 205119
[72] Greengard L F and Rokhlin V 1987 J. Comput. Phys. 73 325
     Greengard L F and Rokhlin V 1988 The Rapid Evaluation of Potential Fields in Three Dimensions (Berlin: Springer)
[73] Kabadshow I and Lang B 2007 Int. Conf. on Computational Science (1) pp 716–22
[74] García-Risueño P, Alberdi-Rodriguez J, Oliveira M J T, Andrade X, Kabadshow I, Pippig M, Rubio A, Muguerza J and Arruabarrena A 2012 to be submitted
     García-Risueño P 2011 PhD Thesis Universidad de Zaragoza http://nano-bio.ehu.es/files/garciarisueno phd thesis.pdf
[75] http://code.google.com/p/fortrancl/
[76] Andrade X and Genovese L 2012 Fundamentals of Time-Dependent Density Functional Theory (Lecture Notes in Physics) ed M A L Marques, N T Maitra, F M S Nogueira, E K U Gross and A Rubio (Berlin: Springer) pp 401–13
[77] Andrade X and Aspuru-Guzik A 2012 to be submitted
[78] Liu Z, Yan H, Wang K, Kuang T, Zhang J, Gui L, An X and Chang W 2004 Nature 428 287
