


arXiv:astro-ph/0003162v3 31 May 2001

GADGET: A code for collisionless and gasdynamical cosmological simulations

Volker Springel1,2, Naoki Yoshida1 and Simon D. M. White1

1Max-Planck-Institut für Astrophysik, Karl-Schwarzschild-Straße 1, 85740 Garching bei München, Germany

2Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138, USA

Abstract

We describe the newly written code GADGET which is suitable both for cosmological simulations of structure formation and for the simulation of interacting galaxies. GADGET evolves self-gravitating collisionless fluids with the traditional N-body approach, and a collisional gas by smoothed particle hydrodynamics. Along with the serial version of the code, we discuss a parallel version that has been designed to run on massively parallel supercomputers with distributed memory. While both versions use a tree algorithm to compute gravitational forces, the serial version of GADGET can optionally employ the special-purpose hardware GRAPE instead of the tree. Periodic boundary conditions are supported by means of an Ewald summation technique. The code uses individual and adaptive timesteps for all particles, and it combines this with a scheme for dynamic tree updates. Due to its Lagrangian nature, GADGET thus allows a very large dynamic range to be bridged, both in space and time. So far, GADGET has been successfully used to run simulations with up to 7.5×10^7 particles, including cosmological studies of large-scale structure formation, high-resolution simulations of the formation of clusters of galaxies, as well as workstation-sized problems of interacting galaxies. In this study, we detail the numerical algorithms employed, and show various tests of the code. We publicly release both the serial and the massively parallel version of the code.

Key words: methods: numerical – galaxies: interactions – cosmology: dark matter.

1. Introduction

Numerical simulations of three-dimensional self-gravitating fluids have become an indispensable tool in cosmology. They are now routinely used to study the non-linear gravitational clustering of dark matter, the formation of clusters of galaxies, the interactions of isolated galaxies, and the evolution of the intergalactic gas. Without numerical techniques the immense progress made in these fields would have been nearly impossible, since analytic calculations are often restricted to idealized problems of high symmetry, or to approximate treatments of inherently nonlinear problems.

The advances in numerical simulations have been made possible both by the rapid growth of computer performance and by the implementation of ever more sophisticated numerical algorithms. The development of powerful simulation codes still remains a primary task if one wants to take full advantage of new computer technologies.

Early simulations (Holmberg 1941; Peebles 1970; Press & Schechter 1974; White 1976, 1978; Aarseth et al. 1979, among others) largely employed the direct summation method for the gravitational N-body problem, which remains useful in collisional stellar dynamical systems, but is inefficient for large N due to the rapid increase of its computational cost with N. A large number of groups have therefore developed N-body codes for collisionless dynamics that compute the large-scale gravitational field by means of Fourier techniques. These are the PM, P^3M, and AP^3M codes (Eastwood & Hockney 1974; Hohl 1978; Hockney & Eastwood 1981; Efstathiou et al. 1985; Couchman 1991; Bertschinger & Gelb 1991; MacFarland et al. 1998). Modern versions of these codes supplement the force computation on scales below the mesh size with a direct summation, and/or they place mesh refinements on highly clustered regions. Poisson's equation can also be solved on a hierarchically refined mesh by means of finite-difference relaxation methods, an approach taken in the ART code by Kravtsov et al. (1997).

Alternatives to these schemes are the so-called tree algorithms, pioneered by Appel (1981, 1985). Tree algorithms arrange particles in a hierarchy of groups, and compute the gravitational field at a given point by summing over multipole expansions of these groups. In this way the computational cost of a complete force evaluation can be reduced to an O(N log N) scaling. The grouping itself can be achieved


in various ways, for example with Eulerian subdivisions of space (Barnes & Hut 1986), or with nearest-neighbour pairings (Press 1986; Jernigan & Porter 1989). A technique related to ordinary tree algorithms is the fast multipole method (e.g. Greengard & Rokhlin 1987), where multipole expansions are carried out for the gravitational field in a region of space.

While mesh-based codes are generally much faster for close-to-homogeneous particle distributions, tree codes can adapt flexibly to any clustering state without significant losses in speed. This Lagrangian nature is a great advantage if a large dynamic range in density needs to be covered; here tree codes can outperform mesh-based algorithms. In addition, tree codes are basically free from any geometrical restrictions, and they can be easily combined with integration schemes that advance particles on individual timesteps.

Recently, PM and tree solvers have been combined into hybrid Tree-PM codes (Xu 1995; Bagla 1999; Bode et al. 2000). In this approach, the speed and accuracy of the PM method for the long-range part of the gravitational force are combined with a tree computation of the short-range force. This may be seen as a replacement of the direct-summation PP part in P^3M codes with a tree algorithm. The Tree-PM technique is clearly a promising new method, especially if large cosmological volumes with strong clustering on small scales are studied.

Yet another approach to the N-body problem is provided by special-purpose hardware like the GRAPE board (Makino 1990; Ito et al. 1991; Fukushige et al. 1991; Makino & Funato 1993; Ebisuzaki et al. 1993; Okumura et al. 1993; Fukushige et al. 1996; Makino et al. 1997; Kawai et al. 2000). It consists of custom chips that compute gravitational forces by the direct summation technique. By means of their enormous computational speed they can considerably extend the range where direct summation remains competitive with pure software solutions. A recent overview of the family of GRAPE boards is given by Hut & Makino (1999). The newest generation of GRAPE technology, the GRAPE-6, will achieve a peak performance of up to 100 TFlops (Makino 2000), allowing direct simulations of dense stellar systems with particle numbers approaching 10^6. Using sophisticated algorithms, GRAPE may also be combined with P^3M (Brieu et al. 1995) or tree algorithms (Fukushige et al. 1991; Makino 1991a; Athanassoula et al. 1998) to maintain its high computational speed even for much larger particle numbers.

In recent years, collisionless dynamics has also been coupled to gas dynamics, allowing a more direct link to observable quantities. Traditionally, hydrodynamical simulations have usually employed some kind of mesh to represent the dynamical quantities of the fluid. While a particular strength of these codes is their ability to accurately resolve shocks, the mesh also imposes restrictions on the geometry of the problem, and on the dynamic range of spatial scales that can be simulated. New adaptive mesh refinement codes (Norman & Bryan 1998; Klein et al. 1998) have been developed to provide a solution to this problem.

In cosmological applications, it is often sufficient to describe the gas by smoothed particle hydrodynamics (SPH), as invented by Lucy (1977) and Gingold & Monaghan (1977). The particle-based SPH is extremely flexible in its ability to adapt to any given geometry. Moreover, its Lagrangian nature allows a locally changing resolution that 'automatically' follows the local mass density. This convenient feature helps to save computing time by focusing the computational effort on those regions that have the largest gas concentrations. Furthermore, SPH ties naturally into the N-body approach for self-gravity, and can be easily implemented in three dimensions.

These advantages have led a number of authors to develop SPH codes for applications in cosmology. Among them are TREESPH (Hernquist & Katz 1989; Katz et al. 1996), GRAPESPH (Steinmetz 1996), HYDRA (Couchman et al. 1995; Pearce & Couchman 1997), and codes by Evrard (1988), Navarro & White (1993), Hultman & Källander (1997), Davé et al. (1997), and Carraro et al. (1998). See Kang et al. (1994) and Frenk et al. (1999) for a comparison of many of these cosmological hydrodynamic codes.

In this paper we describe our simulation code GADGET (GAlaxies with Dark matter and Gas intEracT), which can be used both for studies of isolated self-gravitating systems including gas, and for cosmological N-body/SPH simulations. We have developed two versions of this code, a serial workstation version, and a version for massively parallel supercomputers with distributed memory. The workstation code uses either a tree algorithm for the self-gravity, or the special-purpose hardware GRAPE, if available. The parallel version works with a tree only. Note that in principle several GRAPE boards, each connected to a separate host computer, can be combined to work as a large parallel machine, but this possibility is not implemented in the parallel code yet. While the serial code largely follows known algorithmic techniques, we employ a novel parallelization strategy in the parallel version.

A particular emphasis of our work has been on the use of a time integration scheme with individual and adaptive particle timesteps, and on the elimination of sources of overhead both in the serial and parallel code under conditions of large dynamic range in timestep. Such conditions occur in dissipative gas-dynamical simulations of galaxy formation, but also in high-resolution simulations of cold dark matter. The code allows the usage of different timestep criteria and cell-opening criteria, and it can be comfortably applied to a wide range of applications, including cosmological simulations (with or without periodic boundaries), simulations of isolated or interacting galaxies, and studies of the intergalactic medium.

We thus think that GADGET is a very flexible code that avoids obvious intrinsic restrictions for the dynamic range of the problems that can be addressed with it. In this methods-paper, we describe the algorithmic choices made in GADGET, which we release in its parallel and serial versions on the internet^1, hoping that it will be useful for people working on cosmological simulations, and that it will stimulate code development efforts and further code-sharing in the community.

This paper is structured as follows. In Section 2, we give a brief summary of the implemented physics. In Section 3, we discuss the computation of the gravitational force both with a tree algorithm, and with GRAPE. We then describe our specific implementation of SPH in Section 4, and we discuss our time integration scheme in Section 5. The parallelization of the code is described in Section 6, and tests of the code are presented in Section 7. Finally, we summarize in Section 8.

2. Implemented physics

2.1. Collisionless dynamics and gravity

Dark matter and stars are modeled as self-gravitating collisionless fluids, i.e. they fulfill the collisionless Boltzmann equation (CBE)

\[
\frac{df}{dt} \equiv \frac{\partial f}{\partial t} + \mathbf{v}\,\frac{\partial f}{\partial \mathbf{r}} - \frac{\partial \Phi}{\partial \mathbf{r}}\,\frac{\partial f}{\partial \mathbf{v}} = 0, \qquad (1)
\]

where the self-consistent potential \(\Phi\) is the solution of Poisson's equation

\[
\nabla^2 \Phi(\mathbf{r}, t) = 4\pi G \int f(\mathbf{r}, \mathbf{v}, t)\, d\mathbf{v}, \qquad (2)
\]

and \(f(\mathbf{r}, \mathbf{v}, t)\) is the mass density in single-particle phase-space. It is very difficult to solve this coupled system of equations directly with finite difference methods. Instead, we will follow the common N-body approach, where the phase fluid is represented by N particles which are integrated along the characteristic curves of the CBE. In essence, this is a Monte Carlo approach whose accuracy depends crucially on a sufficiently high number of particles.

The N-body problem is thus the task of following Newton's equations of motion for a large number of particles under their own self-gravity. Note that we will introduce a softening into the gravitational potential at small separations. This is necessary to suppress large-angle scattering in two-body collisions and effectively introduces a lower spatial resolution cut-off. For a given softening length, it is important to choose the particle number large enough such that relaxation effects due to two-body encounters are suppressed sufficiently, otherwise the N-body system provides no faithful model for a collisionless system. Note that the optimum choice of softening length as a function of particle density is an issue that is still actively discussed in the literature (e.g. Splinter et al. 1998; Romeo 1998; Athanassoula et al. 2000).
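As a concrete illustration of the N-body approach, the following minimal Python sketch computes softened accelerations by direct summation. It is not GADGET code: the O(N^2) double loop and the simple Plummer-type softening parameter `eps` are illustrative stand-ins for the tree algorithm and the spline-softened force law discussed later in this paper.

```python
def accelerations(pos, mass, eps, G=1.0):
    """Direct-summation gravitational accelerations with Plummer-type softening.

    Illustrative O(N^2) sketch only; GADGET itself uses a tree algorithm and a
    spline-softened force law rather than the Plummer form used here.
    pos  -- list of (x, y, z) tuples
    mass -- list of particle masses
    eps  -- softening length, suppressing large-angle two-body scattering
    """
    n = len(mass)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-force
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2  # softened r^2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc
```

The softening enters only through the `eps**2` term in the squared distance; for separations much larger than `eps` the force is close to Newtonian, which is the regime a faithful collisionless model must resolve.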

1 GADGET's web-site is: http://www.mpa-garching.mpg.de/gadget

2.2. Gasdynamics

A simple description of the intergalactic medium (IGM), or the interstellar medium (ISM), may be obtained by modeling it as an ideal, inviscid gas. The gas is then governed by the continuity equation

\[
\frac{d\rho}{dt} + \rho\, \nabla \cdot \mathbf{v} = 0, \qquad (3)
\]

and the Euler equation

\[
\frac{d\mathbf{v}}{dt} = -\frac{\nabla P}{\rho} - \nabla \Phi. \qquad (4)
\]

Further, the thermal energy u per unit mass evolves according to the first law of thermodynamics, viz.

\[
\frac{du}{dt} = -\frac{P}{\rho}\, \nabla \cdot \mathbf{v} - \frac{\Lambda(u, \rho)}{\rho}. \qquad (5)
\]

Here we used Lagrangian time derivatives, i.e.

\[
\frac{d}{dt} = \frac{\partial}{\partial t} + \mathbf{v} \cdot \nabla, \qquad (6)
\]

and we allowed for a piece of 'extra' physics in the form of the cooling function \(\Lambda(u, \rho)\), describing external sinks or sources of heat for the gas.

For a simple ideal gas, the equation of state is

\[
P = (\gamma - 1)\,\rho u, \qquad (7)
\]

where \(\gamma\) is the adiabatic exponent. We usually take \(\gamma = 5/3\), appropriate for a mono-atomic ideal gas. The adiabatic sound speed c of this gas is \(c^2 = \gamma P / \rho\).
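The equation of state and the sound speed can be written as a one-line helper; this is a minimal sketch of equation (7) and the relation c^2 = γP/ρ, with a function name of our choosing rather than anything from GADGET.

```python
def ideal_gas_pressure_and_sound_speed(rho, u, gamma=5.0 / 3.0):
    """Pressure and adiabatic sound speed of an ideal gas, following
    P = (gamma - 1) * rho * u  (eq. 7)  and  c^2 = gamma * P / rho.
    The default gamma = 5/3 is the mono-atomic value used in the text."""
    P = (gamma - 1.0) * rho * u
    c = (gamma * P / rho) ** 0.5
    return P, c
```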

3. Gravitational forces

3.1. Tree algorithm

Alternatives to Fourier techniques, or to direct summation, are the so-called tree methods. In these schemes, the particles are arranged in a hierarchy of groups. When the force on a particular particle is computed, the force exerted by distant groups is approximated by their lowest multipole moments. In this way, the computational cost for a complete force evaluation can be reduced to order O(N log N) (Appel 1985). The forces become more accurate if the multipole expansion is carried out to higher order, but eventually the increasing cost of evaluating higher moments makes it more efficient to terminate the multipole expansion and rather use a larger number of smaller tree nodes to achieve a desired force accuracy (McMillan & Aarseth 1993). We will follow the common compromise to terminate the expansion after quadrupole moments have been included.


Figure 1: Schematic illustration of the Barnes & Hut oct-tree in two dimensions. The particles are first enclosed in a square (root node). This square is then iteratively subdivided into four squares of half the size, until exactly one particle is left in each final square (leaves of the tree). In the resulting tree structure, each square can be progenitor of up to four siblings. Note that empty squares need not be stored.

We employ the Barnes & Hut (1986, henceforth BH) tree construction in this work. In this scheme, the computational domain is hierarchically partitioned into a sequence of cubes, where each cube contains eight siblings, each with half the side-length of the parent cube. These cubes form the nodes of an oct-tree structure. The tree is constructed such that each node (cube) contains either exactly one particle, or is progenitor to further nodes, in which case the node carries the monopole and quadrupole moments of all the particles that lie inside its cube. A schematic illustration of the BH tree is shown in Figure 1.
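The insertion-based construction can be sketched as follows, in two dimensions to match Figure 1. This is our own illustrative sketch, not GADGET's implementation: GADGET uses the three-dimensional oct-tree analogue and additionally carries quadrupole moments, while the sketch tracks only the monopole moments (total mass and center of mass) of each square.

```python
class Node:
    """One square of a 2D BH tree: either a leaf holding one particle,
    or a progenitor of up to four daughter squares (created on demand,
    so empty squares are never stored)."""

    def __init__(self, center, size):
        self.center = center        # (x, y) of the square's center
        self.size = size            # side length of the square
        self.mass = 0.0             # monopole: total mass ...
        self.com = (0.0, 0.0)       # ... and center of mass
        self.particle = None        # set while the node is an occupied leaf
        self.children = None        # list of 4 slots once subdivided

    def _quadrant(self, p):
        # quadrant index 0..3 from the sign of p relative to the center
        return 2 * (p[1] >= self.center[1]) + (p[0] >= self.center[0])

    def _child_center(self, q):
        h = self.size / 4.0
        return (self.center[0] + (h if q & 1 else -h),
                self.center[1] + (h if q & 2 else -h))

    def insert(self, p, m):
        # update the running monopole moments of this node
        M = self.mass + m
        self.com = ((self.com[0] * self.mass + p[0] * m) / M,
                    (self.com[1] * self.mass + p[1] * m) / M)
        self.mass = M
        if self.children is None and self.particle is None:
            self.particle = (p, m)          # empty leaf: store particle here
            return
        if self.children is None:           # occupied leaf: subdivide it
            self.children = [None] * 4
            old, self.particle = self.particle, None
            self._insert_child(*old)        # push the old particle down
        self._insert_child(p, m)

    def _insert_child(self, p, m):
        q = self._quadrant(p)
        if self.children[q] is None:        # create daughter square on demand
            self.children[q] = Node(self._child_center(q), self.size / 2.0)
        self.children[q].insert(p, m)
```

After all particles are inserted, every interior node already holds the mass and center of mass of the particles in its square, which is what the tree walk below consumes.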

A force computation then proceeds by walking the tree, and summing up appropriate force contributions from tree nodes. In the standard BH tree walk, the multipole expansion of a node of size l is used only if

\[
r > \frac{l}{\theta}, \qquad (8)
\]

where r is the distance of the point of reference to the center-of-mass of the cell and \(\theta\) is a prescribed accuracy parameter. If a node fulfills the criterion (8), the tree walk along this branch can be terminated, otherwise it is 'opened', and the walk is continued with all its siblings. For smaller values of the opening angle, the forces will in general become more accurate, but also more costly to compute. One can try to modify the opening criterion (8) to obtain higher efficiency, i.e. higher accuracy at a given length of the interaction list, something that we will discuss in more detail in Section 3.3.
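The acceptance test of equation (8) is a one-liner; the sketch below simply spells it out for reference (the function name is ours).

```python
def bh_node_accepted(r, l, theta):
    """Standard BH cell-acceptance test, eq. (8): the multipole expansion of
    a node of side length l at distance r is used only if r > l / theta;
    otherwise the node must be opened and the walk visits its daughters."""
    return r > l / theta
```

Smaller values of theta open more nodes, lengthening the interaction list but improving the force accuracy, which is exactly the trade-off described in the text.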

A technical difficulty arises when the gravity is softened. In regions of high particle density (e.g. centers of dark haloes, or cold dense gas knots in dissipative simulations), it can happen that nodes fulfill equation (8), and simultaneously one has r < h, where h is the gravitational softening length. In this situation, one formally needs the multipole moments of the softened gravitational field. One can work around this situation by opening nodes always for r < h, but this can slow down the code significantly if regions of very high particle density occur. Another solution is to use the proper multipole expansion for the softened potential, which we here discuss for definiteness. We want to approximate the potential at \(\mathbf{r}\) due to a (distant) bunch of particles with masses \(m_i\) and coordinates \(\mathbf{x}_i\). We use a spline-softened force law, hence the exact potential of the particle group is

\[
\Phi(\mathbf{r}) = -G \sum_k m_k\, g\!\left(|\mathbf{x}_k - \mathbf{r}|\right), \qquad (9)
\]

where the function g(r) describes the softened force law. For Newtonian gravity we have g(r) = 1/r, while the spline-softened gravity with softening length h gives rise to

\[
g(r) = -\frac{1}{h}\, W_2\!\left(\frac{r}{h}\right). \qquad (10)
\]

The function \(W_2(u)\) is given in the Appendix. It arises by replacing the force due to a point mass m with the force exerted by the mass distribution \(\rho(\mathbf{r}) = m\, W(r; h)\), where we take \(W(r; h)\) to be the normalized spline kernel used in the SPH formalism. The spline softening has the advantage that the force becomes exactly Newtonian for r > h, while some other possible force laws, like the Plummer softening, converge relatively slowly to Newton's law.

Let \(\mathbf{s}\) be the center-of-mass, and M the total mass of the particles. Further we define \(\mathbf{y} \equiv \mathbf{r} - \mathbf{s}\). The potential may then be expanded in a multipole series assuming \(|\mathbf{y}| \gg |\mathbf{x}_k - \mathbf{s}|\). Up to quadrupole order, this results in

\[
\Phi(\mathbf{r}) = -G \left\{ M g(y) + \frac{1}{2}\, \mathbf{y}^T \left[ \frac{g''(y)}{y^2}\, \mathbf{Q} + \frac{g'(y)}{y^3}\, (\mathbf{P} - \mathbf{Q}) \right] \mathbf{y} \right\}. \qquad (11)
\]

Here we have introduced the tensors

\[
\mathbf{Q} = \sum_k m_k\, (\mathbf{x}_k - \mathbf{s})(\mathbf{x}_k - \mathbf{s})^T = \sum_k m_k\, \mathbf{x}_k \mathbf{x}_k^T - M \mathbf{s}\, \mathbf{s}^T, \qquad (12)
\]


and

\[
\mathbf{P} = \mathbf{I} \sum_k m_k\, (\mathbf{x}_k - \mathbf{s})^2 = \mathbf{I} \left[ \sum_k m_k\, \mathbf{x}_k^2 - M s^2 \right], \qquad (13)
\]

where \(\mathbf{I}\) is the unit matrix. Note that for Newtonian gravity, equation (11) reduces to the more familiar form

\[
\Phi(\mathbf{r}) = -G \left[ \frac{M}{y} + \frac{1}{2}\, \mathbf{y}^T\, \frac{3\mathbf{Q} - \mathbf{P}}{y^5}\, \mathbf{y} \right]. \qquad (14)
\]

Finally, the quadrupole approximation of the softened gravitational field is given by

\[
\mathbf{f}(\mathbf{r}) = -\nabla\Phi = G \left\{ M g_1(y)\, \mathbf{y} + g_2(y)\, \mathbf{Q}\mathbf{y} + \frac{1}{2}\, g_3(y) \left( \mathbf{y}^T \mathbf{Q} \mathbf{y} \right) \mathbf{y} + \frac{1}{2}\, g_4(y)\, \mathbf{P}\mathbf{y} \right\}. \qquad (15)
\]

Here we introduced the functions \(g_1(y)\), \(g_2(y)\), \(g_3(y)\), and \(g_4(y)\) as convenient abbreviations. Their definition is given in the Appendix. In the Newtonian case, this simplifies to

\[
\mathbf{f}(\mathbf{r}) = G \left[ -\frac{M}{y^3}\, \mathbf{y} + \frac{3\mathbf{Q}}{y^5}\, \mathbf{y} - \frac{15}{2}\, \frac{\mathbf{y}^T \mathbf{Q} \mathbf{y}}{y^7}\, \mathbf{y} + \frac{3}{2}\, \frac{\mathbf{P}}{y^5}\, \mathbf{y} \right]. \qquad (16)
\]
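To make the Newtonian node-particle interaction concrete, the following Python/NumPy sketch builds the moments of equations (12) and (13) for a particle group and evaluates the quadrupole-order force of equation (16); the function names and the direct-summation cross-check are ours, not GADGET's.

```python
import numpy as np

def node_moments(x, m):
    """Monopole (M, s) and quadrupole moments of a particle group.
    Q follows eq. (12); P = p * I with the scalar p from eq. (13)."""
    M = m.sum()
    s = (m[:, None] * x).sum(axis=0) / M          # center of mass
    d = x - s
    Q = np.einsum('k,ki,kj->ij', m, d, d)         # Q = sum_k m_k d_k d_k^T
    p = (m * (d * d).sum(axis=1)).sum()           # trace term of eq. (13)
    return M, s, Q, p

def quadrupole_force(r, M, s, Q, p, G=1.0):
    """Force per unit mass at r from a distant node, eq. (16)."""
    y = r - s
    y1 = np.sqrt(y @ y)
    Qy = Q @ y
    return G * (-M * y / y1**3 + 3.0 * Qy / y1**5
                - 7.5 * (y @ Qy) * y / y1**7 + 1.5 * p * y / y1**5)

def direct_force(r, x, m, G=1.0):
    """Exact Newtonian force per unit mass by direct summation, for comparison."""
    d = x - r
    r3 = ((d * d).sum(axis=1)) ** 1.5
    return G * (m[:, None] * d / r3[:, None]).sum(axis=0)
```

For a compact group viewed from a distance many times its size, the quadrupole result agrees with the direct sum up to the neglected octupole term, consistent with the truncation-error discussion in Section 3.3.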

Note that although equation (15) looks rather cumbersome, its actual numerical computation is only marginally more costly than that of the Newtonian form (16), because all factors involving g(y) and derivatives thereof can be tabulated for later use in repeated force calculations.

3.2. Tree construction and tree walks

The tree construction can be done by inserting the particles one after the other into the tree. Once the grouping is completed, the multipole moments of each node can be recursively computed from the moments of its daughter nodes (McMillan & Aarseth 1993).

In order to reduce the storage requirements for tree nodes, we use single-precision floating point numbers to store node properties. The precision of the resulting forces is still fully sufficient for collisionless dynamics as long as the node properties are calculated accurately enough. In the recursive calculation, node properties will be computed from nodes that are already stored in single precision. When the particle number becomes very large (note that more than 10 million particles can be used in single objects like clusters these days), loss of sufficient precision can then result for certain particle distributions. In order to avoid this problem, GADGET optionally uses an alternative method to compute the node properties. In this method, a link-list structure is used to access all of the particles represented by each tree node, allowing a computation of the node properties in double precision and a storage of the results in single precision. While this technique guarantees that node properties are accurate up to a relative error of about 10^-6, it is also slower than the recursive computation, because it requires of order O(N log N) operations, while the recursive method is only of order O(N).

The tree-construction can be considered very fast in both cases, because the time spent for it is negligible compared to a complete force walk for all particles. However, in the individual time integration scheme only a small fraction of all particles may require a force walk at each given timestep. If this fraction drops below ∼1 per cent, a full reconstruction of the tree can take as much time as the force walk itself. Fortunately, most of this tree construction time can be eliminated by dynamic tree updates (McMillan & Aarseth 1993), which we discuss in more detail in Section 5. The most time-consuming routine in the code will then always remain the tree walk, and optimizing it can considerably speed up tree codes. Interestingly, in the grouping technique of Barnes (1990), the speed of the gravitational force computation can be increased by performing a common tree-walk for a localized group of particles. Even though the average length of the interaction list for each particle becomes larger in this way, this can be offset by saving some of the tree-walk overhead, and by improved cache utilization. Unfortunately, this advantage is not easily kept if individual timesteps are used, where only a small fraction of the particles are active, so we do not use grouping.

GADGET allows different gravitational softenings for particles of different 'type'. In order to guarantee momentum conservation, this requires a symmetrization of the force when particles with different softening lengths interact. We symmetrize the softenings in the form

\[
h = \max(h_i, h_j). \qquad (17)
\]

However, the usage of different softening lengths leads to complications for softened tree nodes, because strictly speaking, the multipole expansion is only valid if all the particles in the node have the same softening. GADGET solves this problem by constructing separate trees for each species of particles with different softening. As long as these species are more or less spatially separated (e.g. dark halo, stellar disk, and stellar bulge in simulations of interacting galaxies), no severe performance penalty results. However, this is different if the fluids are spatially well 'mixed'. Here a single tree would result in higher performance of the gravity computation, so it is advisable to choose a single softening in this case. Note that for SPH particles we nevertheless always create a separate tree to allow its use for a fast neighbour search, as will be discussed below.

3.3. Cell-opening criterion

The accuracy of the force resulting from a tree walk depends sensitively on the criterion used to decide whether the multipole approximation for a given node is acceptable, or whether the node has to be 'opened' for further refinement. The standard BH opening criterion tries to limit the relative error of every particle-node interaction by comparing a rough estimate of the size of the quadrupole term, ∼ M l^2/r^4, with the size of the monopole term, ∼ M/r^2. The result is the purely geometrical criterion of equation (8).

However, as Salmon & Warren (1994) have pointed out, the worst-case behaviour of the BH criterion for commonly employed opening angles is somewhat worrying. Although typically very rare in real astrophysical simulations, the geometrical criterion (8) can then sometimes lead to very large force errors. In order to cure this problem, a number of modifications of the cell-opening criterion have been proposed. For example, Dubinski et al. (1996) have used the simple modification r > l/θ + δ, where the quantity δ gives the distance of the geometric center of the cell to its center-of-mass. This provides protection against pathological cases where the center-of-mass lies close to an edge of a cell.

Such modifications can help to reduce the rate at which large force errors occur, but they usually do not help to deal with another problem that arises for geometric opening criteria in the context of cosmological simulations at high redshift. Here, the density field is very close to being homogeneous and the peculiar accelerations are small. For a tree algorithm this is a surprisingly tough problem, because the tree code always has to sum up partial forces from all the mass in a simulation. Small net forces at high z then arise in a delicate cancellation process between relatively large partial forces. If a partial force is indeed much larger than the net force, even a small relative error in it is enough to result in a large relative error of the net force. For an unclustered particle distribution, the BH criterion therefore requires a much smaller value of the opening angle than for a clustered one in order to achieve a similar level of force accuracy. Also note that in a cosmological simulation the absolute sizes of forces between a given particle and tree-nodes of a certain opening angle can vary by many orders of magnitude. In this situation, the purely geometrical BH criterion may end up investing a lot of computational effort for the evaluation of all partial forces to the same relative accuracy, irrespective of the actual size of each partial force and the size of the absolute error thus induced. It would be better to invest more computational effort in regions that provide most of the force on the particle and less in regions whose mass content is unimportant for the total force.

As suggested by Salmon & Warren (1994), one may therefore try to devise a cell-opening criterion that limits the absolute error in every cell-particle interaction. In principle, one can use analytic error bounds (Salmon & Warren 1994) to obtain a suitable cell-opening criterion, but the evaluation of the relevant expressions can consume significant amounts of CPU time.

Our approach to a new opening criterion is less stringent. Assume that the absolute size of the true total force is already known before the tree walk; in the present code, we use the acceleration of the previous timestep as a handy approximate value for it. We then require that the estimated error of an acceptable multipole approximation be some small fraction of this total force. Since we truncate the multipole expansion at quadrupole order, the octupole moment will in general be the largest term in the neglected part of the series, except when the mass distribution in the cubical cell is close to homogeneous. For a homogeneous cube the octupole moment vanishes by symmetry (Barnes & Hut 1989), such that the hexadecapole moment forms the leading term. We may very roughly estimate the size of these terms as ∼ (M/r^2)(l/r)^3 or ∼ (M/r^2)(l/r)^4, respectively, and take this as a rough estimate of the truncation error. We can then require that this error should not exceed some fraction α of the total force on the particle, where the latter is estimated from the previous timestep. Assuming the octupole scaling, a tree-node then has to be opened if M l^3 > α |a_old| r^5. However, we have found that in practice the opening criterion

M l^4 > α |a_old| r^6 (18)

provides still better performance, in the sense that it produces forces that are more accurate at a given computational expense. It is also somewhat cheaper to evaluate during the tree-walk, because r^6 is simpler to compute than r^5, which would require the evaluation of a square root of the squared node distance. The criterion (18) does not suffer from the high-z problem discussed above, because the same value of α produces a comparable force accuracy independent of the clustering state of the material. However, we still need to compute the very first force using the BH criterion. In Section 7.2, we will show some quantitative measurements of the relative performance of the two criteria, and compare them to the optimum cell-opening strategy.

Note that the criterion (18) is not completely safe from worst-case force errors either. In particular, such errors can occur for opening angles so large that the point of force evaluation falls inside the node itself. If this happens, no upper bound on the force error can be guaranteed (Salmon & Warren 1994). As an option to the code, we therefore combine the opening criterion (18) with the requirement that the point of reference may not lie inside the node itself. We formulate this additional constraint in terms of r > b_max, where b_max is the maximum distance of the center-of-mass from any point in the cell. This additional geometrical constraint provides a very conservative control of force errors if this is needed, but increases the number of opened cells.
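As an illustration, the relative opening criterion (18), together with the optional geometrical safeguard r > b_max, can be sketched as a simple predicate. The function name and argument layout are ours, not GADGET's, and only r^2 is needed since r^6 = (r^2)^3:

```python
def should_open(node_mass, node_len, r2, a_old, alpha, b_max=None):
    """Relative cell-opening test of equation (18): open the node if
    M * l^4 > alpha * |a_old| * r^6, with r the particle-node distance.
    Optionally also open whenever r <= b_max, guarding against the point
    of reference lying inside the node. Illustrative helper only."""
    if b_max is not None and r2 <= b_max * b_max:
        return True
    # compare M l^4 with alpha |a_old| r^6; r^6 = (r^2)^3 avoids a sqrt
    return node_mass * node_len ** 4 > alpha * a_old * r2 ** 3

# A distant, light node is accepted (not opened)...
assert not should_open(1.0, 1.0, r2=100.0, a_old=1.0, alpha=0.02)
# ...while the same node nearby must be opened.
assert should_open(1.0, 1.0, r2=1.0, a_old=1.0, alpha=0.02)
```

The `b_max` branch implements the conservative geometrical fallback described above; leaving it at `None` recovers the pure criterion (18).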

3.4. Special purpose hardware

An alternative to software solutions of the N^2-bottleneck of self-gravity is provided by the GRAPE (GRAvity PipE) special-purpose hardware. It is designed to solve the gravitational N-body problem in a direct summation approach by means of its superior computational speed. The latter is achieved with custom chips that compute the gravitational


force with a hardwired Plummer force law. The Plummer potential of GRAPE takes the form

Φ(r) = −G Σ_j m_j / ( |r − r_j|^2 + ε^2 )^(1/2) . (19)
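For illustration, the Plummer-softened direct summation of equation (19), which GRAPE evaluates in hardware, can be sketched in a few lines. This is a hypothetical software helper, not the GRAPE interface:

```python
import numpy as np

def plummer_potential(r, pos, mass, eps, G=1.0):
    """Direct-summation Plummer potential of equation (19):
    Phi(r) = -G sum_j m_j / (|r - r_j|^2 + eps^2)^(1/2).
    This is the O(N^2) operation that GRAPE performs in hardware."""
    d2 = np.sum((pos - r) ** 2, axis=1)
    return -G * np.sum(mass / np.sqrt(d2 + eps * eps))

# For a single unit mass at the origin and eps -> 0, the potential
# approaches the Newtonian value -G/|r|.
pos = np.zeros((1, 3))
phi = plummer_potential(np.array([2.0, 0.0, 0.0]), pos, np.ones(1), eps=1e-6)
assert abs(phi + 0.5) < 1e-6
```

The softening ε prevents the potential from diverging at small separations, at the price of a reduced force resolution below the softening scale.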

As an example, the GRAPE-3A boards installed at the MPA in 1998 have 40 N-body integrator chips in total, with an approximate peak performance of 25 GFlops. Recently, newer generations of GRAPE boards have achieved even higher computational speeds. In fact, with GRAPE-4 the 1 TFlop barrier was broken (Makino et al. 1997), and even faster special-purpose machines are in preparation (Hut & Makino 1999; Makino 2000). The most recent generation, GRAPE-6, can compute not only accelerations but also their first and second time derivatives. Together with the capability to perform particle predictions, these machines are ideal for the high-order Hermite integration schemes applied in simulations of collisional systems like star clusters. However, our present code is only adapted to the somewhat older GRAPE-3 (Okumura et al. 1993), and the following discussion is limited to it.

The GRAPE-3A boards are connected to an ordinary workstation via a VME or PCI interface. The boards consist of memory chips that can hold up to 131072 particle coordinates, and of integrator chips that can compute the forces exerted by these particles for 40 positions in parallel. Higher particle numbers can be processed by splitting them up into sufficiently small groups. In addition to the gravitational force, the GRAPE board returns the potential, and a list of neighbours for the 40 positions within search radii h_i specified by the user. This latter feature makes GRAPE attractive also for SPH calculations.

The parts of our code that use GRAPE have benefited from the code GRAPESPH by Steinmetz (1996), and are similar to it. In short, the usage of GRAPE proceeds as follows. For the force computation, the particle coordinates are first loaded onto the GRAPE board; then GADGET calls GRAPE repeatedly to compute the force for up to 40 positions in parallel. The communication with GRAPE is done by means of a convenient software interface in C. GRAPE can also provide lists of nearest neighbours. For SPH particles, GADGET computes the gravitational force and the interaction list in just one call of GRAPE. The host computer then does the rest of the work, i.e. it advances the particles and computes the hydrodynamical forces.

In practice, there are some technical complications when one works with GRAPE-3. In order to achieve high computational speed, the GRAPE-3 hardware works internally with special fixed-point formats for positions, accelerations and masses. This results in a reduced dynamic range compared to standard IEEE floating-point arithmetic. In particular, one needs to specify a minimum length scale d_min and a minimum mass scale m_min when working with GRAPE. The spatial dynamic range is then given by d_min [−2^18; 2^18], and the mass range is m_min [1; 64 ε/d_min] (Steinmetz 1996).

While the communication time with GRAPE scales in proportion to the particle number N, the actual force computation on GRAPE is still an O(N^2) algorithm, because the GRAPE board implements a direct summation approach to the gravitational N-body problem. This implies that for very large particle numbers a tree code running on the workstation alone will eventually catch up with and outperform the combination of workstation and GRAPE. For our current set-up at MPA, this break-even point lies at about 300000 particles.

However, it is also possible to combine GRAPE with a tree algorithm (Fukushige et al. 1991; Makino 1991a; Athanassoula et al. 1998; Kawai et al. 2000), for example by exporting tree nodes instead of particles in an appropriate way. Such a combination of tree+GRAPE scales as O(N log N) and is able to outperform pure software solutions even for large N.

4. Smoothed particle hydrodynamics

SPH is a powerful Lagrangian technique for solving hydrodynamical problems with an ease that is unmatched by grid-based fluid solvers (see Monaghan 1992 for an excellent review). In particular, SPH is very well suited for three-dimensional astrophysical problems that do not crucially rely on accurately resolved shock fronts.

Unlike other numerical approaches for hydrodynamics, the SPH equations do not take a unique form. Instead, many formally different versions of them can be derived. Furthermore, a large variety of recipes for specific implementations of force symmetrization, determination of smoothing lengths, and artificial viscosity have been described. Some of these choices are crucial for the accuracy and efficiency of the SPH implementation, others are only of minor importance. See the recent work by Thacker et al. (2000) and Lombardi et al. (1999) for a discussion of the relative performance of some of these possibilities. Below we give a summary of the specific SPH implementation we use.

4.1. Basic equations

The computation of the hydrodynamic force and the rate of change of internal energy proceeds in two phases. In the first phase, new smoothing lengths h_i are determined for the active particles (these are the ones that need a force update at the current timestep; see below), and for each of them, the neighbouring particles inside their respective smoothing radii are found. The Lagrangian nature of SPH arises when this number of neighbours is kept either exactly, or at least roughly, constant. This is achieved by varying the smoothing length h_i of each particle accordingly. The h_i thus adjust adaptively to the local particle density, leading to a constant mass resolution independent of the density of the flow. Nelson & Papaloizou (1994) argue that it is actually best to keep the number of neighbours exactly constant, resulting in the lowest level of noise in SPH estimates of fluid quantities, and in


the best conservation of energy. In practice, similarly good results are obtained if the fluctuations in neighbour number remain very small. In the serial version of GADGET we keep the number of neighbours fixed, whereas it is allowed to vary within a small band in the parallel code.

Having found the neighbours, we compute the density of the active particles as

ρ_i = Σ_{j=1}^{N} m_j W(r_ij; h_i), (20)

where r_ij ≡ r_i − r_j, and we compute new estimates of divergence and vorticity as

ρ_i (∇ · v)_i = Σ_j m_j (v_j − v_i) · ∇_i W(r_ij; h_i), (21)

ρ_i (∇ × v)_i = Σ_j m_j (v_i − v_j) × ∇_i W(r_ij; h_i). (22)

Here we employ the gather formulation for adaptive smoothing (Hernquist & Katz 1989).

For the passive particles, values for density, internal energy, and smoothing length are predicted at the current time based on the values of the last update of those particles (see Section 5). Finally, the pressure of the particles is set to P_i = (γ − 1) ρ_i u_i.
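A minimal sketch of this first phase, the density estimate of equation (20) followed by P_i = (γ − 1) ρ_i u_i, might look as follows. We assume the standard Monaghan cubic-spline kernel with support 2h (kernel conventions vary), and the O(N) pair loop stands in for the tree-based neighbour search of Section 4.2:

```python
import numpy as np

def w_spline(r, h):
    """Cubic-spline kernel in 3D with support 2h, one common normalization
    (an assumption here; GADGET's own kernel convention may differ)."""
    q = r / h
    w = np.zeros_like(q)
    m1 = q < 1.0
    m2 = (q >= 1.0) & (q < 2.0)
    w[m1] = 1.0 - 1.5 * q[m1] ** 2 + 0.75 * q[m1] ** 3
    w[m2] = 0.25 * (2.0 - q[m2]) ** 3
    return w / (np.pi * h ** 3)

def density_and_pressure(i, pos, mass, u, h, gamma=5.0 / 3.0):
    """Gather estimate of equation (20), rho_i = sum_j m_j W(r_ij; h_i),
    followed by the pressure P_i = (gamma - 1) rho_i u_i."""
    r = np.linalg.norm(pos - pos[i], axis=1)
    rho = np.sum(mass * w_spline(r, h))
    return rho, (gamma - 1.0) * rho * u[i]

# Single particle: the density is its own kernel-weighted mass, m W(0; h).
rho, P = density_and_pressure(0, np.zeros((1, 3)), np.ones(1), np.ones(1), 1.0)
assert abs(rho - 1.0 / np.pi) < 1e-12
```

The kernel normalization can be checked numerically: its integral over all space is unity, which is what makes equation (20) a consistent density estimate.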

In the second phase, the actual forces are computed. Here we symmetrize the kernels of the gather and scatter formulations as in Hernquist & Katz (1989). We compute the gasdynamical accelerations as

a_i^gas = − (∇P/ρ)_i + a_i^visc = − Σ_j m_j ( P_i/ρ_i^2 + P_j/ρ_j^2 + Π_ij ) [ (1/2) ∇_i W(r_ij; h_i) + (1/2) ∇_i W(r_ij; h_j) ], (23)

and the change of the internal energy as

du_i/dt = (1/2) Σ_j m_j ( P_i/ρ_i^2 + P_j/ρ_j^2 + Π_ij ) (v_i − v_j) · [ (1/2) ∇_i W(r_ij; h_i) + (1/2) ∇_i W(r_ij; h_j) ]. (24)

Instead of symmetrizing the pressure terms with an arithmetic mean, the code can also be used with a geometric mean according to

P_i/ρ_i^2 + P_j/ρ_j^2 −→ 2 (P_i P_j)^(1/2) / (ρ_i ρ_j). (25)

This may be slightly more robust in certain situations (Hernquist & Katz 1989). The artificial viscosity Π_ij is taken to be

Π_ij = (1/2) (f_i + f_j) \bar{Π}_ij, (26)

with

\bar{Π}_ij = [ −α c_ij μ_ij + 2α μ_ij^2 ] / ρ_ij   if v_ij · r_ij < 0,   and 0 otherwise, (27)

where

f_i = |(∇ · v)_i| / ( |(∇ · v)_i| + |(∇ × v)_i| ), (28)

and

μ_ij = h_ij (v_i − v_j) · (r_i − r_j) / ( |r_i − r_j|^2 + ε h_ij^2 ). (29)

This form of artificial viscosity is the shear-reduced version (Balsara 1995; Steinmetz 1996) of the 'standard' Monaghan & Gingold (1983) artificial viscosity. Recent studies (Lombardi et al. 1999; Thacker et al. 2000) that test SPH implementations endorse it.
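Equations (26)-(29) translate directly into code. The sketch below assumes arithmetic means for the pair quantities h_ij, c_ij and ρ_ij, a common convention that the text does not spell out:

```python
import numpy as np

def viscosity(vi, vj, ri, rj, hi, hj, ci, cj, rhoi, rhoj,
              div_vi, div_vj, curl_vi, curl_vj, alpha=1.0, eps=0.01):
    """Shear-reduced artificial viscosity of equations (26)-(29).
    Pair quantities h_ij, c_ij, rho_ij are arithmetic means (assumed)."""
    vij = vi - vj
    rij = ri - rj
    if np.dot(vij, rij) >= 0.0:      # receding pair: no viscosity, eq (27)
        return 0.0
    hij = 0.5 * (hi + hj)
    cij = 0.5 * (ci + cj)
    rhoij = 0.5 * (rhoi + rhoj)
    mu = hij * np.dot(vij, rij) / (np.dot(rij, rij) + eps * hij ** 2)  # eq (29)
    pi_bar = (-alpha * cij * mu + 2.0 * alpha * mu * mu) / rhoij       # eq (27)
    fi = abs(div_vi) / (abs(div_vi) + abs(curl_vi))                    # eq (28)
    fj = abs(div_vj) / (abs(div_vj) + abs(curl_vj))
    return 0.5 * (fi + fj) * pi_bar                                    # eq (26)
```

For an approaching, shear-free pair the result is positive (dissipative), while receding pairs experience no viscosity, which is exactly the switch encoded in equation (27).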

In equations (23) and (24), a given SPH particle i will interact with a particle j whenever |r_ij| < h_i or |r_ij| < h_j. Standard search techniques can relatively easily find all neighbours of particle i inside a sphere of radius h_i, but making sure that one really finds all interacting pairs in the case h_j > h_i is slightly more tricky. One solution to this problem is to simply find all neighbours of i inside h_i, and to consider the force components

f_ij = − m_i m_j ( P_i/ρ_i^2 + P_j/ρ_j^2 + Π_ij ) (1/2) ∇_i W(r_ij; h_i). (30)

If we add f_ij to the force on i, and −f_ij to the force on j, the sum of equation (23) is reproduced, and momentum conservation is manifest. The same holds for the internal energy. Unfortunately, this only works if all particles are active. In an individual timestep scheme, we therefore need an efficient way to find all the neighbours of particle i in the above sense, and we discuss our algorithm for doing this below.
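The antisymmetric accumulation described above can be sketched as follows; because every pair contributes +f_ij to one particle and −f_ij to the other, the total momentum change cancels identically, whatever the pair force of equation (30) evaluates to:

```python
import numpy as np

def accumulate_pair(i, j, f_ij, acc, mass):
    """Add an antisymmetric pair force, such as equation (30), to both
    particles: +f_ij to i and -f_ij to j, making momentum conservation
    manifest. Illustrative helper, not GADGET's internal routine."""
    acc[i] += f_ij / mass[i]
    acc[j] -= f_ij / mass[j]

# The total momentum change vanishes identically for any pair force.
mass = np.array([1.0, 2.0])
acc = np.zeros((2, 3))
accumulate_pair(0, 1, np.array([0.3, -0.1, 0.0]), acc, mass)
p_dot = mass[:, None] * acc
assert np.allclose(p_dot.sum(axis=0), 0.0)
```

This is the property that breaks when only one member of a pair is active, which is why an efficient symmetric neighbour search is needed in the individual timestep scheme.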

4.2. Neighbour search

In SPH, a basic task is to find the nearest neighbours of each SPH particle in order to construct its interaction list. Specifically, in the implementation we have chosen, we need to find all particles closer than a search radius h_i in order to estimate the density, and all particles with |r_ij| < max(h_i, h_j) for the estimation of the hydrodynamical forces. As for gravity, the naive solution that checks the distances of all particle pairs is an O(N^2) algorithm which slows down prohibitively for large particle numbers. Fortunately, there are faster search algorithms.

When the particle distribution is approximately homogeneous, perhaps the fastest algorithms work with a search grid that has a cell size somewhat smaller than the search radius. The particles are first coarse-binned onto this search grid, and link-lists are established that quickly deliver only those particles that lie in a specific cell of the coarse grid.


The neighbour search then proceeds by range searching: only those mesh cells that have a spatial overlap with the search range have to be opened.
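The coarse-binning scheme can be sketched with a dictionary of index lists standing in for the link-lists (a toy version; cell size and data layout are our choices, not GADGET's):

```python
import numpy as np
from collections import defaultdict

def build_grid(pos, cell):
    """Coarse-bin particles onto a search grid: each grid cell keeps the
    list of its particle indices (the 'link-list' of the text)."""
    grid = defaultdict(list)
    for idx, p in enumerate(pos):
        grid[tuple((p // cell).astype(int))].append(idx)
    return grid

def neighbours(i, pos, h, grid, cell):
    """Range search: open only the grid cells that overlap the search
    sphere of radius h around particle i, then test candidates exactly."""
    lo = ((pos[i] - h) // cell).astype(int)
    hi = ((pos[i] + h) // cell).astype(int)
    out = []
    for cx in range(lo[0], hi[0] + 1):
        for cy in range(lo[1], hi[1] + 1):
            for cz in range(lo[2], hi[2] + 1):
                for j in grid.get((cx, cy, cz), []):
                    if np.dot(pos[j] - pos[i], pos[j] - pos[i]) < h * h:
                        out.append(j)
    return out
```

Against a brute-force O(N^2) scan the result is identical, but only the handful of overlapping cells is ever inspected.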

For highly clustered particle distributions and varying search ranges h_i, the above approach quickly degrades, since the mesh chosen for the coarse grid does not have the optimum size for all particles. A more flexible alternative is to employ a geometric search tree. For this purpose, a tree with a structure just like the BH oct-tree can be employed; Hernquist & Katz (1989) were the first to use the gravity tree for this purpose. In GADGET we use the same strategy and perform a neighbour search by walking the tree. A cell is 'opened' (i.e. followed further) if it has a spatial overlap with the rectangular search range. Testing for such an overlap is faster with a rectangular search range than with a spherical one, so we inscribe the spherical search region into a little cube for the purpose of this walk. If one arrives at a cell with only one particle, this particle is added to the interaction list if it lies inside the search radius. We also terminate a tree walk along a branch if the cell lies completely inside the search range. Then all the particles in the cell can be added to the interaction list without checking any of them for overlap with the search range. The particles in the cell can be retrieved quickly by means of a link-list, which is constructed along with the tree and allows retrieval of all the particles that lie inside a given cell, just as in the coarse-binning approach. Since this short-cut reduces the length of the tree walk and the number of required checks for range overlap, it speeds up the algorithm by a significant amount.

With a slight modification of the tree walk, one can also find all particles with |r_ij| < max(h_i, h_j). For this purpose, we store in each tree node the maximum SPH smoothing length occurring among its particles. The test for overlap is then simply done between the node itself and a cube of side-length max(h_i, h_node) centered on particle i, where h_node is the maximum smoothing length among the particles of the node.

There remains the task of keeping the number of neighbours around a given SPH particle approximately (or exactly) constant. We solve this by predicting a value h_i for the smoothing length based on the length h_i^(old) of the previous timestep, the actual number of neighbours N_i at that timestep, and the local velocity divergence:

h_i = (1/2) h_i^(old) [ 1 + (N_s/N_i)^(1/3) ] + ḣ_i Δt, (31)

where ḣ_i = (1/3) h_i (∇ · v)_i, and N_s is the desired number of neighbours. A similar form for updating the smoothing lengths has been used by Hernquist & Katz (1989); see also Thacker et al. (2000) for a discussion of alternative choices. The term in brackets tries to bring the number of neighbours back to the desired value if N_i deviates from it. Should the resulting number of neighbours nevertheless fall outside a prescribed range of tolerance, we iteratively adjust h_i until the number of neighbours is brought back to the desired range. Optionally, our code allows the user to impose a minimum smoothing length for SPH, typically chosen as some fraction of the gravitational softening length. A larger number of neighbours than N_s is allowed to occur if h_i takes on this minimum value.
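Equation (31) amounts to a one-line update rule. The sketch below omits the iterative fallback for out-of-tolerance neighbour counts:

```python
def update_h(h_old, n_i, n_s, div_v, dt):
    """Predict a new smoothing length via equation (31):
    h = 0.5 h_old [1 + (N_s / N_i)^(1/3)] + hdot * dt,
    with hdot = h_old (div v) / 3, which keeps the kernel mass
    constant under uniform expansion or compression."""
    hdot = h_old * div_v / 3.0
    return 0.5 * h_old * (1.0 + (n_s / n_i) ** (1.0 / 3.0)) + hdot * dt

# With the desired neighbour number and zero divergence, h is unchanged.
assert abs(update_h(1.0, 40, 40, 0.0, 0.1) - 1.0) < 1e-12
# Too many neighbours shrink h; too few grow it.
assert update_h(1.0, 80, 40, 0.0, 0.1) < 1.0 < update_h(1.0, 20, 40, 0.0, 0.1)
```

The 1/3 power reflects the three-dimensional scaling N ∝ ρ h^3: to change the neighbour count by a factor f, the smoothing length must change by f^(1/3).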

One may also decide to keep the number of neighbours exactly constant by defining h_i to be the distance to the N_s-nearest particle. We employ such a scheme in the serial code. Here we carry out a range search with R = 1.2 h_i, on average resulting in ∼ 2N_s potential neighbours. From these we select the closest N_s (fast algorithms for doing this exist; see Press et al. 1995). If there are fewer than N_s particles in the search range, or if the distance of the N_s-nearest particle inside the search range is larger than R, the search is repeated with a larger search range. In the first timestep no previous h_i is known, so we follow the neighbour tree backwards from the leaf of the particle under consideration until we obtain a first reasonable guess for the local particle density (based on the number N of particles in a node of volume l^3), providing an initial guess for h_i.

However, the above scheme for keeping the number of neighbours exactly fixed is not easily accommodated in our parallel SPH implementation, because SPH particles may have a search radius that overlaps several processor domains. In this case, the selection of the closest N_s neighbours becomes non-trivial, because the underlying data are distributed across several independent processor elements. For parallel SPH, we therefore revert to the simpler scheme and allow the number of neighbours to fluctuate within a small band.

5. Time integration

As a time integrator, we use a variant of the leapfrog involving an explicit prediction step. The latter is introduced to accommodate individual particle timesteps in the N-body scheme, as explained later on.

We start by describing the integrator for a single particle. First, a particle position at the middle of the timestep Δt is predicted according to

r^(n+1/2) = r^(n) + v^(n) Δt/2, (32)

and an acceleration based on this position is computed, viz.

a^(n+1/2) = − ∇Φ |_{r^(n+1/2)} . (33)

Then the particle is advanced according to

v^(n+1) = v^(n) + a^(n+1/2) Δt, (34)

r^(n+1) = r^(n) + (1/2) [ v^(n) + v^(n+1) ] Δt. (35)


5.1. Timestep criterion

In the above scheme, the timestep may vary from step to step. It is clear that the choice of timestep size is very important in determining the overall accuracy and computational efficiency of the integration.

In a static potential Φ, the error in specific energy arising in one step with the above integrator is

ΔE = (1/4) (∂^2 Φ / ∂x_i ∂x_j) v_i^(n) a_j^(n+1/2) Δt^3 + (1/24) (∂^3 Φ / ∂x_i ∂x_j ∂x_k) v_i^(n) v_j^(n) v_k^(n) Δt^3 + O(Δt^4) (36)

to leading order in Δt, i.e. the integrator is second-order accurate. Here the derivatives of the potential are taken at coordinate r^(n), and summation over repeated coordinate indices is understood.

In principle, one could try to use equation (36) directly to obtain a timestep by imposing some upper limit on the tolerable error ΔE. However, this approach is quite subtle in practice. First, the derivatives of the potential are difficult to obtain, and second, there is no explicit guarantee that the terms of higher order in Δt are really small.

High-order Hermite schemes use timestep criteria that include the first and second time derivatives of the acceleration (e.g. Makino 1991b; Makino & Aarseth 1992). While these timestep criteria are highly successful for the integration of very nonlinear systems, they are probably not appropriate for our low-order scheme, quite apart from the fact that substantial computational effort would be required to evaluate these quantities directly. Ideally, we therefore want a timestep criterion based only on dynamical quantities that are either already at hand or relatively cheap to compute.

Note that a well-known problem of adaptive timestep schemes is that they usually break the time reversibility and symplectic nature of the simple leapfrog. As a result, the system no longer evolves under a pseudo-Hamiltonian, and secular drifts in the total energy can occur. As Quinn et al. (1997) show, reversibility can be obtained with a timestep that depends only on the relative coordinates of the particles. This is, for example, the case for timesteps that depend only on acceleration or on local density. However, to achieve reversibility the timestep needs to be chosen based on the state of the system in the middle of the timestep (Quinn et al. 1997), or on the beginning and end of the timestep (Hut et al. 1995). In practice, this can be accomplished by discarding trial timesteps appropriately. The present code selects the timestep based on the previous step and is thus not reversible in this way.

One possible timestep criterion is obtained by constraining the absolute size of the second-order displacement of the kinetic energy, assuming a typical velocity dispersion σ^2 for the particles, which corresponds to a scale E = σ^2 for the typical specific energy. This results in

Δt = α_tol σ / |a| . (37)

For a collisionless fluid, the velocity scale σ should ideally be chosen as the local velocity dispersion, leading to smaller timesteps in smaller haloes or, more generally, in 'colder' parts of the fluid. The local velocity dispersion can be estimated from a local neighbourhood of particles, obtained as in the normal SPH formalism.

Alternatively, one can constrain the second-order term in the particle displacement, obtaining

Δt = ( 2 α'_tol ε / |a| )^(1/2) . (38)

Here a length scale α'_tol ε is introduced, which will typically be related to the gravitational softening. This form has quite often been employed in cosmological simulations, sometimes with an additional restriction on the displacement of particles in the form Δt = α ε/|v|. It is unclear, though, why the timestep should depend on the gravitational softening length in this way. In a well-resolved halo, most orbits are not expected to change much if the halo is modeled with more particles and a correspondingly smaller softening length, so it should not be necessary to increase the accuracy of the time integration for all particles by the same factor if the mass/length resolution is increased.

For self-gravitating collisionless fluids, another plausible timestep criterion is based on the local dynamical time:

Δt = α''_tol ( 3 / (8πGρ) )^(1/2) . (39)

One advantage of this criterion is that it provides a monotonically decreasing timestep towards the center of a halo. On the other hand, it requires an accurate estimate of the local density, which may be difficult to obtain, especially in regions of low density. In particular, Quinn et al. (1997) have shown that haloes in cosmological simulations that contain only a small number of particles, about equal to or less than the number employed to estimate the local density, are susceptible to destruction if a timestep based on (39) is used. This is because the kernel estimates of the density are too small in this situation, leading to excessively long timesteps in these haloes.

In simple test integrations of singular potentials, we have found the criterion (37) to give better results than the alternative (38). However, neither of these simple criteria is free of problems in typical applications to structure formation, as we will later show in some test calculations. In the centers of haloes, subtle secular effects can occur under coarse integration settings. The criterion based on the dynamical time does better in this respect, but it does not


work well in regions of very low density. We thus suggest using a combination of (37) and (39) by taking the minimum of the two timesteps. This provides good integration accuracy in low-density environments and simultaneously does well in the central regions of large haloes. For the relative setting of the dimensionless tolerance parameters we use α'' ≃ 3α, which typically results in roughly the same number of particles being constrained by each of the two criteria in an evolved cosmological simulation. The combined criterion is Galilean-invariant and makes no explicit reference to the gravitational softening length employed.
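The combined criterion can be sketched as the minimum of (37) and (39) with α'' = 3α; the exact prefactor of the dynamical-time form follows our reading of equation (39) and should be treated as an assumption:

```python
import math

def timestep(a_mag, sigma, rho, alpha, G=1.0):
    """Minimum of the velocity-scale criterion (37) and the
    dynamical-time criterion (39), with alpha'' = 3 alpha as in the
    text. The sqrt(3 / (8 pi G rho)) factor is our reading of (39)."""
    dt_acc = alpha * sigma / a_mag                                     # eq (37)
    dt_dyn = 3.0 * alpha * math.sqrt(3.0 / (8.0 * math.pi * G * rho))  # eq (39)
    return min(dt_acc, dt_dyn)

# Strong acceleration makes (37) the binding criterion...
assert abs(timestep(a_mag=100.0, sigma=1.0, rho=1.0, alpha=0.05) - 5.0e-4) < 1e-12
# ...while high density hands control to the dynamical-time form (39).
assert timestep(a_mag=1.0, sigma=1.0, rho=1.0e4, alpha=0.05) < 1.0e-3
```

Taking the minimum means each particle is always governed by whichever physical clock, velocity change or local collapse time, is shorter at its location.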

5.2. Integrator for N-body systems

In the context of stellar dynamical integrations, individual particle timesteps have long been in use, having first been introduced by Aarseth (1963). We here employ an integrator with completely flexible timesteps, similar to the one used by Groom (1997) and Hiotelis & Voglis (1991). This scheme differs slightly from the more commonly employed binary hierarchies of timesteps (e.g. Hernquist & Katz 1989; McMillan & Aarseth 1993; Steinmetz 1996).

Each particle has a timestep Δt_i and a current time t_i, at which its dynamical state (r_i, v_i, a_i) is stored. The dynamical state of the particle can be predicted at times t ∈ [t_i ± 0.5 Δt_i] with first-order accuracy.

The next particle k to be advanced is then the one with the minimum prediction time, defined as τ_p ≡ min_i (t_i + 0.5 Δt_i). The time τ_p becomes the new current time of the system. To advance the corresponding particle, we first predict positions for all particles at time τ_p according to

r̃_i = r_i + v_i (τ_p − t_i). (40)

Based on these positions, the acceleration of particle k at the middle of its timestep is calculated as

a_k^(n+1/2) = − ∇Φ({r̃_i}) |_{r_k} . (41)

Position and velocity of particle k are then advanced as

v_k^(n+1) = v_k^(n) + 2 a_k^(n+1/2) (τ_p − t_k), (42)

r_k^(n+1) = r_k^(n) + [ v_k^(n) + v_k^(n+1) ] (τ_p − t_k), (43)

and its current time can be updated to

t_k^(new) = t_k + 2 (τ_p − t_k). (44)

Finally, a new timestep Δt_k^(new) for the particle is estimated.

At the beginning of the simulation, all particles start out with the same current time. However, since the timesteps of the particles are all different, the current times of the particles distribute themselves nearly symmetrically around the current prediction time; hence the prediction step involves forward and backward prediction to a similar extent.

Of course, it is impractical to advance only a single particle at any given prediction time, because the prediction itself and the (dynamic) tree updates induce some overhead. For this reason we advance particles in bunches. The particles may be thought of as being ordered according to their prediction times t_i^p = t_i + (1/2) Δt_i. The simulation works through this time line, and always advances the particle with the smallest t_i^p, as well as all subsequent particles in the time line, until the first one is found with

τ_p ≤ t_i + (1/4) Δt_i . (45)

This condition selects a group of particles at the lower end of the time line, and all the particles of the group are guaranteed to be advanced by at least half of their maximum allowed timestep. Compared to a fixed block step scheme with a binary hierarchy, particles are on average advanced closer to their maximum allowed timestep in this scheme, which results in a slight improvement in efficiency. Also, timesteps can vary more gradually than in a power-of-two hierarchy. However, a perhaps more important advantage of this scheme is that it simplifies work-load balancing in the parallel code, as we will discuss in more detail later on.

In practice, the size M of the group that is advanced at a given step is often only a small fraction of the total particle number N. In this situation it becomes important to eliminate any overhead that scales as O(N). For example, we obviously need to find the particle with the minimum prediction time at every timestep, and also the particles following it in the time line. A loop over all the particles, or a complete sort at every timestep, would induce overhead of order O(N) or O(N log N), which can become comparable to the force computation itself if M/N ≪ 1. We solve this problem by keeping the maximum prediction times of the particles in an ordered binary tree (Wirth 1986) at all times. Finding the particle with the minimum prediction time and the ones that follow it are then operations of order O(log N). Also, once the particles have been advanced, they can be removed and reinserted into this tree at a cost of order O(log N). Together with the dynamic tree updates, which eliminate prediction and tree construction overhead, the cost of a timestep then scales as O(M log N).
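The time-line bookkeeping can be sketched with a binary heap standing in for the ordered binary tree (a heap gives the same O(log N) push/pop costs); reinsertion of the advanced particles with updated times is omitted here:

```python
import heapq

def advance_group(heap):
    """Select the group of particles to advance at this system step.
    Each heap entry is (t_i + dt_i/2, t_i, dt_i, index), ordered by
    prediction time. Particles are popped while tau_p > t_i + dt_i/4,
    i.e. we stop at the first particle satisfying condition (45),
    so every group member is advanced by at least half of its
    maximum allowed timestep. Illustrative sketch only."""
    tau_p = heap[0][0]                      # minimum prediction time
    group = []
    while heap and tau_p > heap[0][1] + heap[0][2] / 4.0:
        _, _, _, idx = heapq.heappop(heap)
        group.append(idx)
    return tau_p, group

# Three particles: A and B qualify for the group, C stops the sweep.
heap = [(0.50, 0.00, 1.0, "A"), (0.55, 0.40, 0.3, "B"), (0.65, 0.45, 0.4, "C")]
heapq.heapify(heap)
tau_p, group = advance_group(heap)
assert tau_p == 0.50 and group == ["A", "B"]
```

In the real code, each advanced particle would be pushed back with its new current time and timestep, keeping every queue operation at O(log N) as described in the text.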

5.3. Dynamic tree updates

If the fraction of particles to be advanced at a given timestep is indeed small, the prediction of all particles and the reconstruction of the full tree would also be significant sources of overhead. However, as McMillan & Aarseth (1993) first discussed, the geometric structure of the tree, i.e. the way the particles are grouped into a hierarchy, evolves only relatively slowly in time. It is therefore sufficient to reconstruct this grouping only every few timesteps, provided one can still obtain accurate node properties (center of mass, multipole moments) at the current prediction time.


We use such a scheme of dynamic tree updates by predicting properties of tree-nodes on the fly, instead of predicting all particles every single timestep. In order to do this, each node carries a center-of-mass velocity in addition to its position at the time of its construction. New node positions can then be predicted while the tree is walked, and only nodes that are actually visited need to be predicted. Note that the leaves of the tree point to single particles. If they are used in the force computation, their prediction corresponds to the ordinary prediction outlined in equation (43).

In our simple scheme we neglect a possible time variation of the quadrupole moments of the nodes, which could in principle be taken into account (McMillan & Aarseth 1993). However, we introduce a mechanism that reacts to fast time variations of tree nodes. Whenever the center-of-mass of a tree node under consideration has moved by more than a small fraction of the node's side-length since the last reconstruction of this part of the tree, the node is completely updated, i.e. the center-of-mass, center-of-mass velocity and quadrupole moment are recomputed from the individual (predicted) phase-space variables of the particles. We also adjust the side-length of a tree node if any of its particles has left its original cubical volume.

Finally, the full tree is reconstructed from scratch every once in a while to take into account the slow changes in the grouping hierarchy. Typically, we update the tree whenever a total of ∼ 0.1N force computations have been done since the last full reconstruction. With this criterion the tree construction is an unimportant fraction of the total computation time. We have not noticed any significant loss of force accuracy induced by this procedure.

In summary, the algorithms described above result in an integration scheme that can smoothly and efficiently evolve an N-body system containing a large dynamic range in timescales. At a given timestep, only a small number M of particles are then advanced, and the total time required for that scales as O(M log N).

5.4. Including SPH

The above time integration scheme may easily be extended to include SPH. Here we also need to integrate the internal energy equation, and the particle accelerations also receive a hydrodynamical component. To compute the latter we also need predicted velocities

ṽ_i = v_i + a_i (τ_p − t_i), (46)

where we have approximated a_i with the acceleration of the previous timestep. Similarly, we obtain predictions for the internal energy

ũ_i = u_i + u̇_i (τ_p − t_i), (47)

and the density of inactive particles as

ρ̃_i = ρ_i + ρ̇_i (τ_p − t_i). (48)

For those particles that are to be advanced at the current system step, these predicted quantities are then used to compute the hydrodynamical part of the acceleration and the rate of change of internal energy with the usual SPH estimates, as described in Section 4.
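In code, the predictions (46)–(48) are simple linear drifts; a sketch with our own function name:

```python
def predict_state(v, a_prev, u, u_dot, rho, rho_dot, t_i, tau_p):
    """Predictions (46)-(48) for an inactive particle at prediction time tau_p.
    The velocity is drifted with the acceleration of the previous timestep."""
    dt = tau_p - t_i
    return (v + a_prev * dt,      # predicted velocity, eq. (46)
            u + u_dot * dt,       # predicted internal energy, eq. (47)
            rho + rho_dot * dt)   # predicted density, eq. (48)
```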

For hydrodynamical stability, the collisionless timestep criterion needs to be supplemented with the Courant condition. We adopt it for the gas particles in the form

∆t_i = α_cour h_i / [ h_i |(∇·v)_i| + max(c_i, |v_i|) (1 + 0.6 α_visc) ], (49)

where α_visc regulates the strength of the artificial bulk viscosity, and α_cour is an accuracy parameter, the Courant factor. Note that we use the maximum of the sound speed c_i and the bulk velocity |v_i| in this expression. This improves the handling of strong shocks when the infalling material is cold, but has the disadvantage of not being Galilean invariant. For the SPH particles, we use either the adopted criterion for collisionless particles or (49), whichever gives the smaller timestep.
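The combined timestep choice can be written compactly; the sketch below follows the form of equation (49), with illustrative default parameter values that are our own, not necessarily GADGET's:

```python
def sph_timestep(h, div_v, c_snd, v_bulk, dt_collisionless,
                 alpha_cour=0.15, alpha_visc=1.0):
    """Courant-type timestep of equation (49), combined with the collisionless
    criterion by taking the smaller of the two."""
    dt_courant = alpha_cour * h / (
        h * abs(div_v) + max(c_snd, abs(v_bulk)) * (1.0 + 0.6 * alpha_visc))
    return min(dt_courant, dt_collisionless)
```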

As defined above, we evaluate a_gas and u̇ at the middle of the timestep, when the actual timestep ∆t of the particle that will be advanced is already set. Note that there is a term in the artificial viscosity that can cause a problem in this explicit integration scheme. The second term in equation (27) tries to prevent particle inter-penetration. If a particle happens to get relatively close to another SPH particle in the time ∆t/2 and the relative velocity of the approach is large, this term can suddenly lead to a very large repulsive acceleration a_visc, trying to prevent the particles from getting any closer. However, it is then too late to reduce the timestep. Instead, the velocity of the approaching particle will be changed by a_visc ∆t, possibly reversing the approach of the two particles. But the artificial viscosity should at most halt the approach of the particles. To guarantee this, we introduce an upper cut-off to the maximum acceleration induced by the artificial viscosity. If v_ij · r_ij < 0, we replace equation (26) with

Π_ij = (f_i + f_j)/2 · min[ Π_ij , v_ij·r_ij / ((m_i + m_j) W̄_ij ∆t) ], (50)

where W̄_ij = r_ij ∇_i [W(r_ij; h_i) + W(r_ij; h_j)] / 2. With this change, the integration scheme still works reasonably well in regimes with strong shocks under conditions of relatively coarse timestepping. Of course, a small enough value of the Courant factor will prevent this situation from occurring to begin with.
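The cut-off can be sketched as follows. This is an illustration under assumed sign conventions (v_ij·r_ij < 0 for approaching pairs, and W̄_ij < 0 for the usual kernels, so the capping term is positive); the function and argument names are our own:

```python
def capped_viscosity(Pi_ij, f_i, f_j, v_dot_r, m_i, m_j, Wbar_ij, dt):
    """Limit the viscous tensor as in equation (50): the artificial viscosity
    may at most halt the approach of the particle pair within one timestep.
    Applied only for approaching pairs (v_dot_r < 0)."""
    if v_dot_r >= 0.0:
        return 0.0                                    # receding pair: no viscosity
    cap = v_dot_r / ((m_i + m_j) * Wbar_ij * dt)      # positive, since Wbar_ij < 0
    return 0.5 * (f_i + f_j) * min(Pi_ij, cap)
```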

Since we use the gravitational tree of the SPH particles for the neighbour search, another subtlety arises in the context of dynamic tree updates, where the full tree is not necessarily reconstructed every single timestep. The range searching technique relies on the current values of the maximum SPH smoothing length in each node, and also expects that all particles of a node are still inside the boundaries set by the side-length of a node. To guarantee that the neighbour search will always give correct results, we perform a special update of the SPH-tree every timestep. It involves a loop over every SPH particle that checks whether the particle's smoothing length is larger than h_max stored in its parent node, or if it falls outside the extension of the parent node. If either of these is the case, the properties of the parent node are updated accordingly, and the tree is further followed 'backwards' along the parent nodes, until each node is again fully 'contained' in its parent node. While this update routine is very fast in general, it does add some overhead, proportional to the number of SPH particles, and thus breaks in principle the ideal scaling (proportional to M) obtained for purely collisionless simulations.

5.5. Implementation of cooling

When radiative cooling is included in simulations of galaxy formation or galaxy interaction, additional numerical problems arise. In regions of strong gas cooling, the cooling times can become so short that extremely small timesteps would be required to follow the internal energy accurately with the simple explicit integration scheme used so far.

To remedy this problem, we treat the cooling semi-implicitly in an isochoric approximation. At any given timestep, we first compute the rate u̇_ad of change of the internal energy due to the ordinary adiabatic gas physics. In an isochoric approximation, we then solve implicitly for a new internal energy predicted at the end of the timestep, i.e.

u_i^(n+1) = u_i^(n) + u̇_ad ∆t − Λ[ρ_i^(n), u_i^(n+1)] ∆t / ρ_i^(n). (51)

The implicit computation of the cooling rate guarantees stability. Based on this estimate, we compute an effective rate of change of the internal energy, which we then take as

u̇_i = [ u_i^(n+1) − u_i^(n) ] / ∆t. (52)

We use this last step because the integration scheme requires the possibility to predict the internal energy at arbitrary times. With the above procedure, u̇_i is always a continuous function of time, and the prediction of u_i may be done for times in between the application of the isochoric cooling/heating. Still, there can be a problem with predicted internal energies in cases when the cooling time is very small. Then a particle can lose a large fraction of its energy in a single timestep. While the implicit solution will still give a correct result for the temperature at the end of the timestep, the predicted energy in the middle of the next timestep could then become very small or even negative because of the large negative value of u̇. We therefore restrict the maximum cooling rate such that a particle is only allowed to lose at most half its internal energy in a given timestep, preventing the predicted energies from 'overshooting'. Katz & Gunn (1991) have used a similar method to damp the cooling rate.
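The semi-implicit update of equations (51) and (52), together with the half-energy limiter, can be sketched as below. The bisection solver and the linear cooling function used in the test are our own illustrative choices; any root finder for the implicit equation would do, provided Λ ≥ 0 increases with u:

```python
def cooling_step(u, rho, udot_ad, dt, Lambda, itmax=100):
    """Solve u_new = u + udot_ad*dt - Lambda(rho, u_new)*dt/rho implicitly by
    bisection, limit the loss to half the internal energy, and return the
    effective rate of equation (52)."""
    lo, hi = 0.0, max(u + udot_ad * dt, u)     # bracket for u_new
    for _ in range(itmax):
        mid = 0.5 * (lo + hi)
        # residual of the implicit equation; monotonic if Lambda grows with u
        if mid - (u + udot_ad * dt - Lambda(rho, mid) * dt / rho) > 0.0:
            hi = mid
        else:
            lo = mid
    u_new = max(0.5 * (lo + hi), 0.5 * u)      # lose at most half the energy
    return (u_new - u) / dt                    # effective du/dt for predictions
```

The returned rate, not the implicit solution itself, is what the prediction step (47) consumes, which keeps u̇ continuous between cooling applications.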

5.6. Integration in comoving coordinates

For simulations in a cosmological context, the expansion of the universe has to be taken into account. Let x denote comoving coordinates, and a be the dimensionless scale factor (a = 1 at the present epoch). Then the Newtonian equation of motion becomes

ẍ + 2 (ȧ/a) ẋ = −G ∫ δρ(x′) (x − x′) / |x − x′|³ d³x′. (53)

Here the function δρ(x) = ρ(x) − ρ̄ denotes the (proper) density fluctuation field.

In an N-body simulation with periodic boundary conditions, the volume integral of equation (53) is carried out over all space. As a consequence, the homogeneous contribution arising from ρ̄ drops out around every point. Then the equation of motion of particle i becomes

ẍ_i + 2 (ȧ/a) ẋ_i = −(G/a³) Σ_{j≠i}^{periodic} m_j (x_i − x_j) / |x_i − x_j|³, (54)

where the summation includes all periodic images of the particles j.

However, one may also employ vacuum boundary conditions if one simulates a spherical region of radius R around the origin, and neglects density fluctuations outside this region. In this case, the background density ρ̄ gives rise to an additional term, viz.

ẍ_i + 2 (ȧ/a) ẋ_i = (1/a³) [ −G Σ_{j≠i} m_j x_ij / |x_ij|³ + ½ Ω₀ H₀² x_i ]. (55)

GADGET supports both periodic and vacuum boundary conditions. We implement the former by means of the Ewald summation technique (Hernquist et al. 1991).

For this purpose, we modify the tree walk such that each node is mapped to the position of its nearest periodic image with respect to the coordinate under consideration. If the multipole expansion of the node can be used according to the cell opening criterion, its partial force is computed in the usual way. However, we also need to add the force exerted by all the other periodic images of the node. The slowly converging sum over these contributions can be evaluated with the Ewald technique. If x is the coordinate of the point of force evaluation relative to a node of mass M, the resulting additional acceleration is given by

a_c(x) = M { x/|x|³ − Σ_n (x − nL)/|x − nL|³ [ erfc(α|x − nL|) + (2α|x − nL|/√π) exp(−α²|x − nL|²) ] − (2/L²) Σ_{h≠0} (h/|h|²) exp(−π²|h|²/(α²L²)) sin(2π h·x/L) }. (56)


Here n and h are integer triplets, L is the box size, and α is an arbitrary number (Hernquist et al. 1991). Good convergence is achieved for α = 2/L, where we sum over the range |n| < 5 and |h| < 5. Similarly, the additional potential due to the periodic replications of the node is given by

φ_c(x) = M { 1/|x| + π/(α²L³) − Σ_n erfc(α|x − nL|)/|x − nL| − (1/L) Σ_{h≠0} (1/(π|h|²)) exp(−π²|h|²/(α²L²)) cos(2π h·x/L) }. (57)

We follow Hernquist et al. (1991) and tabulate the correction fields a_c(x)/M and φ_c(x)/M for one octant of the simulation box, and obtain the result of the Ewald summation during the tree walk from trilinear interpolation off this grid. It should be noted, however, that periodic boundaries have a strong impact on the speed of the tree algorithm. The number of floating point operations required to interpolate the correction forces from the grid takes a significant toll on the raw force speed and can slow it down by almost a factor of two.
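A direct (slow) evaluation of the correction acceleration of equation (56) for M = 1, with α = 2/L and the sums truncated at |n|, |h| < 5 as quoted above, might look like the following sketch; in practice this routine would only be called once to fill the lookup grid, never per interaction:

```python
import math
from itertools import product

def ewald_correction(x, L):
    """Correction acceleration a_c(x)/M of equation (56), by direct summation
    with alpha = 2/L over the ranges |n| <= 4 and |h| <= 4."""
    alpha = 2.0 / L
    r = math.sqrt(sum(c * c for c in x))
    acc = [c / r ** 3 for c in x]                      # leading x/|x|^3 term
    for n in product(range(-4, 5), repeat=3):          # real-space sum
        d = [x[i] - n[i] * L for i in range(3)]
        dn = math.sqrt(sum(c * c for c in d))
        f = (math.erfc(alpha * dn) +
             2.0 * alpha * dn / math.sqrt(math.pi) * math.exp(-(alpha * dn) ** 2))
        for i in range(3):
            acc[i] -= d[i] * f / dn ** 3
    for h in product(range(-4, 5), repeat=3):          # Fourier-space sum
        h2 = sum(c * c for c in h)
        if h2 == 0:
            continue
        g = (2.0 / L ** 2 / h2 * math.exp(-math.pi ** 2 * h2 / (alpha * L) ** 2) *
             math.sin(2.0 * math.pi * sum(h[i] * x[i] for i in range(3)) / L))
        for i in range(3):
            acc[i] -= h[i] * g
    return acc
```

A useful sanity check: at x = (L/2, L/2, L/2) the periodic sums cancel by symmetry, so the correction reduces to the bare x/|x|³ term; the field is also odd, a_c(−x) = −a_c(x).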

In linear theory, it can be shown that the kinetic energy

T = ½ Σ_i m_i v_i² (58)

in peculiar motion grows proportional to a, at least at early times. This implies that Σ_i m_i ẋ² ∝ 1/a, hence the comoving velocities ẋ = v/a actually diverge for a → 0. Since cosmological simulations are usually started at redshift z ≃ 30−100, one therefore needs to follow a rapid deceleration of ẋ at high redshift. So it is numerically unfavourable to solve the equations of motion in the variable ẋ.

To remedy this problem, we use an alternative velocity variable

w ≡ a^{1/2} ẋ, (59)

and we employ the expansion factor itself as the time variable. Then the equations of motion become

dw/da = −(3/2)(w/a) + (1/(a² S(a))) [ −G Σ_{j≠i} m_j x_ij/|x_ij|³ + ½ Ω₀ H₀² x_i ], (60)

dx/da = w/S(a), (61)

with S(a) = a^{3/2} H(a) given by

S(a) = H₀ [ Ω₀ + a (1 − Ω₀ − Ω_Λ) + a³ Ω_Λ ]^{1/2}. (62)

Note that for periodic boundaries the second term in the square bracket of equation (60) is absent; instead, the summation extends over all periodic images of the particles.

Using the Zel'dovich approximation, one sees that w remains constant in the linear regime. Strictly speaking this holds at all times only for an Einstein-de Sitter universe; however, it is also true for other cosmologies at early times. Hence equations (59) to (62) in principle solve linear theory for arbitrarily large steps in a. This allows the linear regime to be traversed with maximum computational efficiency. Furthermore, equations (59) to (62) represent a convenient formulation for general cosmologies, and for our variable-timestep integrator. Since w does not vary in the linear regime, predicted particle positions based on x̃_i = x_i + w_i (a_p − a_i)/S(a_p) are quite accurate. Also, the acceleration entering the timestep criterion may now be identified with dw/da, and the timestep (37) becomes

∆a = α_tol σ |dw/da|⁻¹. (63)
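Equations (59)–(62) translate into only a few lines of code; the sketch below (with our own function names and H₀ in arbitrary units) implements S(a) of equation (62) and the drift prediction quoted above. Note that for an Einstein-de Sitter universe (Ω₀ = 1, Ω_Λ = 0) one has S(a) = H₀ for all a, consistent with w being strictly constant in the linear regime there:

```python
import math

def S(a, H0=1.0, Omega0=0.3, OmegaL=0.7):
    """S(a) = a^{3/2} H(a), equation (62)."""
    return H0 * math.sqrt(Omega0 + a * (1.0 - Omega0 - OmegaL) + a ** 3 * OmegaL)

def predict_x(x, w, a_i, a_p, **cosmo):
    """Drift prediction x_i + w_i (a_p - a_i)/S(a_p) for inactive particles."""
    return x + w * (a_p - a_i) / S(a_p, **cosmo)
```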

The above equations only treat the gravity part of the dynamical equations. However, it is straightforward to express the hydrodynamical equations in the variables (x, w, a) as well. For gas particles, equation (60) receives an additional contribution due to hydrodynamical forces, viz.

(dw/da)_hydro = −(1/(a S(a))) ∇_x P / ρ. (64)

For the energy equation, one obtains

du/da = −(3/a) (P/ρ) − (1/S(a)) (P/ρ) ∇_x · w. (65)

Here the first term on the right-hand side describes the adiabatic cooling of gas due to the expansion of the universe.

6. Parallelization

Massively parallel computer systems with distributed memory have become increasingly popular recently. They can be thought of as a collection of workstations, connected by a fast communication network. This architecture promises large scalability for reasonable cost. Current state-of-the-art machines of this type include the Cray T3E and IBM SP/2. It is an interesting development that 'Beowulf'-type systems based on commodity hardware have started to offer floating point performance comparable to these supercomputers, but at a much lower price.

However, an efficient use of parallel distributed-memory machines often requires substantial changes of existing algorithms, or the development of completely new ones. Conceptually, parallel programming involves two major difficulties in addition to the task of solving the numerical problem in a serial code. First, there is the difficulty of how to divide the work and data evenly among the processors, and second, an efficient communication scheme between the processors needs to be devised.

In recent years, a number of groups have developed parallel N-body codes, all of them with different parallelization strategies, and different strengths and weaknesses. Early versions of parallel codes include those of Barnes (1986), Makino & Hut (1989) and Theuns & Rathsack (1993). Later, Warren et al. (1992) parallelized the BH-tree code on massively parallel machines with distributed memory. Dubinski (1996) presented the first parallel tree code based on MPI. Dikaiakos & Stadel (1996) have developed a parallel simulation code (PKDGRAV) that works with a balanced binary tree. More recently, parallel tree-SPH codes have been introduced by Dave et al. (1997) and Lia & Carraro (2000), and a PVM implementation of a gravity-only tree code has been described by Viturro & Carpintero (2000).

Figure 2: Schematic representation of the domain decomposition in two dimensions, and for four processors. Here, the first split occurs along the y-axis, separating the processors into two groups. They then independently carry out a second split along the x-axis. After completion of the domain decomposition, each processor element (PE) can construct its own BH tree just for the particles in its part of the computational domain.

We here report on our newly developed parallel version of GADGET, where we use a parallelization strategy that differs from that of previous workers. It also implements individual particle timesteps for the first time on distributed-memory, massively parallel computers. We have used the Message Passing Interface (MPI) (Snir et al. 1995; Pacheco 1997), which is an explicit communication scheme, i.e. it is entirely up to the user to control the communication. Messages containing data can be sent between processors, both in synchronous and asynchronous modes. A particular advantage of MPI is its flexibility and portability. Our simulation code uses only standard C and standard MPI, and should therefore run on a variety of platforms. We have confirmed this so far on Cray T3E and IBM SP/2 systems, and on Linux PC clusters.

6.1. Domain decomposition

The typical size of problems attacked on parallel computers is usually much too large to fit into the memory of individual computational nodes, or into ordinary workstations. This fact alone (but of course also the desire to distribute the work among the processors) requires a partitioning of the problem onto the individual processors.

For our N-body/SPH code we have implemented a spatial domain decomposition, using the orthogonal recursive bisection (ORB) algorithm (Dubinski 1996). In the first step, a split is found along one spatial direction, e.g. the x-axis, and the collection of processors is grouped into two halves, one for each side of the split. These processors then exchange particles such that they end up hosting only particles lying on their side of the split. In the simplest possible approach, the position of the split is chosen such that there are an equal number of particles on both sides. However, for an efficient simulation code the split should try to balance the work done in the force computation on the two sides. This aspect will be discussed further below.

In a second step, each group of processors finds a new split along a different spatial axis, e.g. the y-axis. This splitting process is repeated recursively until the final groups consist of just one processor, which then hosts a rectangular piece of the computational volume. Note that this algorithm constrains the number of processors that may be used to a power of two. Other algorithms for the domain decomposition, for example Voronoi tessellations (Yahagi et al. 1999), are free of this restriction.

Figure 3: Schematic illustration of the parallelization scheme of GADGET for the force computation. In the first step, each PE identifies the active particles, and puts their coordinates in a communication buffer. In a communication phase, a single and identical list of all these coordinates is then established on all processors. Then each PE walks its local tree for this list, thereby obtaining a list of partial forces. These are then communicated in a collective process back to the original PE that hosts the corresponding particle coordinate. Each processor then sums up the incoming force contributions, and finally arrives at the required total forces for its active particles.

A two-dimensional schematic illustration of the ORB is shown in Figure 2. Each processor can construct a local BH tree for its domain, and this tree may be used to compute the force exerted by the processors' particles on arbitrary test particles in space.
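A stripped-down version of the ORB splitting might look as follows. For simplicity this sketch splits at the particle-count median; as discussed above, the production code instead places the split so as to balance the estimated work on the two sides:

```python
def orb(points, n_proc, axis=0):
    """Orthogonal recursive bisection: recursively split the point set at the
    median along cycling axes until n_proc groups remain (n_proc must be a
    power of two, as the text notes)."""
    if n_proc == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    half = len(pts) // 2
    nxt = (axis + 1) % len(pts[0])             # cycle through the spatial axes
    return orb(pts[:half], n_proc // 2, nxt) + orb(pts[half:], n_proc // 2, nxt)
```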

6.2. Parallel computation of the gravitational force

GADGET's algorithm for parallel force computation differs from that of Dubinski (1996), who introduced the notion of locally essential trees. These are trees that are sufficiently detailed to allow the full force computation for any particle local to a processor, without further need for information from other processors. The locally essential trees can be constructed from the local trees by pruning and exporting parts of these trees to other processors, and attaching these parts as new branches to the local trees. To determine which parts of the trees need to be exported, special tree walks are required.

A difficulty with this technique occurs in the context of dynamic tree updates. While the additional time required to promote local trees to locally essential trees should not be an issue for an integration scheme with a global timestep, it can become a significant source of overhead in individual timestep schemes. Here, often only one per cent or less of all particles require a force update at one of the (small) system timesteps. Even if a dynamic tree update scheme is used to avoid having to reconstruct the full tree every timestep, the locally essential trees are still confronted with subtle synchronization issues for the nodes and particles that have been imported from other processor domains. Imported particles in particular may have received force computations since the last 'full' reconstruction of the locally essential tree occurred, and hence need to be re-imported. The local domain will also lack sufficient information to be able to update imported nodes on its own if this is needed. So some additional communication needs to occur to properly synchronize the locally essential trees on each timestep. 'On-demand' schemes, involving asynchronous communication, may be the best way to accomplish this in practice, but they will still add some overhead and are probably quite complicated to implement. Also note that the construction of locally essential trees depends on the opening criterion. If the latter is not purely geometric but depends on the particle for which the force is desired, it can be difficult to generate a fully sufficient locally essential tree. For these reasons we chose a different parallelization scheme that scales linearly with the number of particles that need a force computation.

Our strategy starts from the observation that each of the local processor trees is able to provide the force exerted by its particles for any location in space. The full force might thus be obtained by adding up all the partial forces from the local trees. As long as the number of these trees is less than the number of typical particle-node interactions, this computational scheme is practically not more expensive than a tree walk of the corresponding locally essential tree.

A force computation therefore requires a communication of the desired coordinates to all processors. These then walk their local trees, and send partial forces back to the original processor that sent out the corresponding coordinate. The total force is then obtained by summing up the incoming contributions.

In practice, a force computation for a single particle would be badly work-imbalanced in such a scheme, since some of the processors could stop their tree walk already at the root node, while others would have to evaluate several hundred particle-node interactions. However, the time integration scheme always advances at a given timestep a group of M particles of size about 0.5-5 per cent of the total number of particles. This group represents a representative mix of the various clustering states of matter in the simulation. Each processor contributes some of its particle positions to this mix, but the total list of coordinates is the same for all processors. If the domain decomposition is done well, one can arrange that the cumulative time to walk the local tree for all coordinates in this list is the same for all processors, resulting in good work-load balance. In the time integration scheme outlined above, the size M of the group of active particles is always roughly the same from step to step, and it also always represents the same statistical mix of particles and work requirements. This means that the same domain decomposition is appropriate for each of a series of consecutive steps. On the other hand, in a block step scheme with binary hierarchy, a step where all particles are synchronized may be followed by a step where only a very small fraction of particles are active. In general, one cannot expect that the same domain decomposition will balance the work for both of these steps.

Our force computation scheme proceeds therefore as sketched schematically in Figure 3. Each processor identifies the particles that are to be advanced in the current timestep, and puts their coordinates in a communication buffer. Next, an all-to-all communication phase is used to establish the same list of coordinates on all processors. This communication is done in a collective fashion: for N_p processors, the communication involves N_p − 1 cycles. In each cycle, the processors are arranged in N_p/2 pairs. Each pair exchanges their original list of active coordinates. While the amount of data that needs to be communicated scales as O[M(N_p − 1)] ≃ O(M N_p), the wall-clock time required scales only as O(M + cN_p) because the communication is done fully in parallel. The term cN_p describes losses due to message latency and overhead due to the message envelopes. In practice, additional losses can occur on certain network topologies due to message collisions, or if the particle numbers contributed to M by the individual processors are significantly different, resulting in communication imbalance. On the T3E, the communication bandwidth is large enough that only a very small fraction of the overall simulation time is spent in this phase, even if processor partitions as large as 512 are used. On Beowulf-class networks of workstations, we find that typically less than about 10-20% of the time is lost due to communication overhead if the computers are connected by a switched network with a speed of 100 Mbit s⁻¹.
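One standard way to realise such a pairwise exchange in N_p − 1 cycles (for N_p a power of two) is a hypercube-style pairing; the text does not specify the exact pairing GADGET uses, so the following is only an illustration:

```python
def exchange_schedule(n_proc):
    """Return n_proc - 1 communication cycles. In cycle c, processor i is
    paired with partner i XOR c, so each cycle is a perfect matching of the
    n_proc/2 pairs, and every processor pair exchanges data exactly once."""
    return [[(i, i ^ c) for i in range(n_proc) if i < (i ^ c)]
            for c in range(1, n_proc)]
```

Because all pairs of a cycle communicate simultaneously, the wall-clock time per cycle is set by the largest message, which is the origin of the O(M + cN_p) scaling quoted above.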

In the next step, all processors walk their local trees and replace the coordinates with the corresponding force contribution. Note that this is the most time-consuming step of a collisionless simulation (as it should be), hence work-load balancing is most crucial here. After that, the force contributions are communicated in a similar way as above between the processor pairs. The processor that hosted a particular coordinate adds up the incoming force contributions and finally ends up with the full force for that location. These forces can then be used to advance its locally active particles, and to determine new timesteps for them. In these phases of the N-body algorithm, as well as in the tree construction, no further information from other processors is required.

6.3. Work-load balancing

Due to the high communication bandwidth of parallel supercomputers like the T3E or the SP/2, the time required for force computation is dominated by the tree walks, and this is also the dominating part of the simulation as a whole. It is therefore important that this part of the computation parallelizes well. In the context of our parallelization scheme, this means that the domain decomposition should be done such that the time spent in the tree walks of each step is the same for all processors.

It is helpful to note that the list of coordinates for the tree walks is independent of the domain decomposition. We can then think of each patch of space, represented by its particles, as causing some cost in the tree-walk process. A good measure for this cost is the number of particle-node interactions originating from this region of space. To balance the work, the domain decomposition should therefore try to make this cost equal on the two sides of each domain split.

In practice, we try to reach this goal by letting each tree node carry a counter for the number of node-particle interactions it participated in since the last domain decomposition occurred. Before a new domain decomposition starts, we then assign this cost to individual particles in order to obtain a weight factor reflecting the cost they on average incur in the gravity computation. For this purpose, we walk the tree backwards from a leaf (i.e. a single particle) to the root node. In this walk, the particle collects its total cost by adding up its share of the cost from all its parent nodes. The computation of these cost factors differs somewhat from the method of Dubinski (1996), but the general idea of such a work-load balancing scheme is similar.

Note that an optimum work-load balance can often result in substantial memory imbalance. Tree codes consume plenty of memory, so that the feasible problem size can become memory rather than CPU-time limited. For example, a single node with 128 Mbyte on the Garching T3E is already filled to roughly 65 per cent with 4.0×10⁵ particles by the present code, including all memory for the tree structures and the time integration scheme. In this example, the remaining free memory can already be insufficient to allow an optimum work-load balance in strongly clustered simulations. Unfortunately, such a situation is not untypical in practice, since one usually strives for large N in N-body work, so one is always tempted to fill up most of the available memory with particles, without leaving much room to balance the work-load. Of course, GADGET can only try to balance the work within the constraints set by the available memory. The code will also strictly observe a memory limit in all intermediate steps of the domain decomposition, because some machines (e.g. T3E) do not have virtual memory.

6.4. Parallelization of SPH

Hydrodynamics can be seen as a more natural candidate for parallelization than gravity, because it is a local interaction. In contrast to this, the gravitational N-body problem has the unpleasant property that at all times each particle interacts with every other particle, making gravity intrinsically difficult to parallelize on distributed-memory machines.

It therefore comes as no large surprise that the parallelization of SPH can be handled by invoking the same strategy we employed for the gravitational part, with only minor adjustments that take into account that most SPH particles (those entirely 'inside' a local domain) do not rely on information from other processors.

In particular, as in the purely gravitational case, we do not import neighbouring particles from other processors. Rather, we export the particle coordinates (and other information, if needed) to other processors, which then deliver partial hydrodynamical forces or partial densities contributed by their particle population. For example, the density computation of SPH can be handled in much the same way as that of the gravitational forces. In a collective communication, the processors establish a common list of all coordinates of active particles together with their smoothing lengths. Each processor then computes partial densities by finding those of its particles that lie within each smoothing region, and by doing the usual kernel weighting for them. These partial densities are then delivered back to the processor that holds the particle corresponding to a certain entry in the list of active coordinates. This processor then adds up the partial contributions, and obtains the full density estimate for the active particle.

The locality of the SPH interactions brings an important simplification to this scheme. If the smoothing region of a particle lies entirely inside a local domain, the particle coordinate does not have to be exported at all, allowing a large reduction in the length of the common list of coordinates each processor has to work on, and reducing the necessary amount of communication involved by a large factor.
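The export decision itself reduces to a containment test of the smoothing sphere against the local domain box; a sketch with our own function and argument names:

```python
def must_export(pos, h, dom_lo, dom_hi):
    """An SPH particle's coordinate has to be exported only if its smoothing
    sphere (radius h) sticks out of the local domain [dom_lo, dom_hi]."""
    return any(p - h < lo or p + h > hi
               for p, lo, hi in zip(pos, dom_lo, dom_hi))
```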

In the first phase of the SPH computation, we find in this way the density and the total number of neighbours inside the smoothing length, and we evaluate the velocity divergence and curl. In the second phase, the hydrodynamical forces are then evaluated in an analogous fashion.

Notice that the computational cost and the communication overhead of this scheme again scale linearly with the number M of active particles, a feature that is essential for good adaptivity of the code. We also note that in the second phase of the SPH computation, the particles 'see' the updated density values automatically. When SPH particles themselves are imported to form a locally essential SPH domain, they have to be exchanged a second time if they are active particles themselves, in order to have correctly updated density values for the hydrodynamical force computation (Dave et al. 1997).

An interesting complication arises when the domain decomposition is not repeated every timestep. In practice, this means that the boundaries of a domain may become ‘diffuse’ when some particles start to move out of the original domain. If the spatial region that we classified as the interior of a domain is not adjusted, we risk missing interactions, because we might not export a particle that starts to interact with a particle that has entered the local domain boundaries but is still hosted by another processor. Note that our method of updating the neighbour-tree on every single timestep already guarantees that the correct neighbouring particles are always found on a local processor – if such a search is conducted at all. So in the case of soft domain boundaries we only need to make sure that all particles that can possibly interact with particles on other processors are really exported.

We achieve this by ‘shrinking’ the interior of the local domain. During the neighbour-tree update, we find the maximum spatial extent of all SPH particles on the local domain. These rectangular regions are then gathered from all other processors and intersected with the local extent. If an overlap occurs, the local ‘interior’ is reduced accordingly, resulting in a region that is guaranteed to be free of any particle hosted by another processor. In this way, the SPH algorithm will produce correct results even if the domain decomposition is repeated only rarely, or never again. In the case of periodic boundaries, special care has to be taken in treating cuts of the current interior with periodic translations of the other domain extents.
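The box-cutting logic can be sketched as follows. This is illustrative Python only: it uses axis-aligned boxes, cuts along a single direction, and ignores periodic images, so it captures the essence of the shrinking step rather than the full scheme.

```python
def overlaps(a, b):
    # axis-aligned boxes given as a sequence of (lo, hi) intervals per axis
    return all(al < bh and bl < ah for (al, ah), (bl, bh) in zip(a, b))

def shrink_interior(local_box, remote_boxes):
    """Conservatively shrink the local box so that it cannot contain any
    particle hosted by another processor. For simplicity this sketch cuts
    along the first axis only; the real scheme must also handle cuts with
    periodic translations of the remote extents."""
    box = [list(iv) for iv in local_box]
    for rb in remote_boxes:
        if overlaps(box, rb):
            rlo, rhi = rb[0]
            lo, hi = box[0]
            if rlo > lo:                 # remote box intrudes from the high side
                box[0][1] = min(hi, rlo)
            else:                        # remote box intrudes from the low side
                box[0][0] = max(lo, rhi)
    return box

local = [(0.0, 10.0), (0.0, 10.0)]
remote = [[(8.0, 12.0), (2.0, 5.0)]]     # extent gathered from another processor
interior = shrink_interior(local, remote)
```

After the cut, any particle inside `interior` can only interact with locally hosted particles, so it never needs to be exported.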

7. Results and tests



Figure 4: NFW density profiles used in the test of time integration criteria. The solid line gives the shape of the normal (unsoftened) NFW profile; dashed lines are softened according to ε = 0.0004, 0.004, and 0.04, respectively (see equation 66).

7.1. Tests of timestep criteria

Orbit integration in the collisionless systems of cosmological simulations is much less challenging than in highly non-linear systems like star clusters. As a consequence, all the simple timestep criteria discussed in Section 5.1 are capable of producing accurate results for sufficiently small settings of their tolerance parameters. However, some of these criteria may require a substantially higher computational effort to achieve a desired level of integration accuracy than others, and the systematic deviations under conditions of coarse timestepping may be more severe for certain criteria than for others.

In order to investigate this issue, we have followed the orbits of the particle distribution of a typical NFW halo (Navarro et al. 1996, 1997) for one Hubble time. The initial radial particle distribution and the initial velocity distribution were taken from a halo of circular velocity 1350 km s−1 identified in a cosmological N-body simulation and well fit by the NFW profile. However, we now evolved the particle distribution using a fixed softened potential, corresponding to the mass profile

ρ(r) = δc ρcrit / [ (r/rs + ε) (1 + r/rs)² ] ,    (66)

fitted to the initial particle distribution. We use a static potential in order to be able to isolate properties of the time integration only. As a side-effect, this also allows a quick integration of all orbits for a Hubble time, and the analytic formulae for the density and potential can also be used to check energy conservation for each of the particles individually. Note that we introduced a softening parameter ε in the profile which is not present in the normal NFW parameterization. The halo had concentration c = 8 with rs = 170 h−1 kpc, and contained about 140000 particles inside the virial radius. We followed the orbits of all particles inside a sphere of radius 3000 h−1 kpc around the halo center (about 200000 in total) using GADGET’s integration scheme for single particles, in haloes of three different softening lengths, ε = 0.0004, 0.004, and 0.04. The corresponding density profiles are plotted in Figure 4.
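For reference, the softened profile of equation (66) is straightforward to evaluate. In this Python sketch the normalization δc ρcrit is simply set to unity, whereas in the test it is fitted to the halo.

```python
def softened_nfw_density(r, r_s, eps, delta_c_rho_crit=1.0):
    """Softened NFW profile of equation (66). The normalization
    delta_c * rho_crit defaults to 1 for illustration."""
    x = r / r_s
    return delta_c_rho_crit / ((x + eps) * (1.0 + x) ** 2)
```

For ε → 0 the standard 1/r NFW cusp is recovered, while for ε > 0 the central density saturates at δc ρcrit / ε, producing the shallow cores visible in Figure 4.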

In Figure 5, we give the radii of shells enclosing fixed numbers of particles as a function of the mean number of force evaluations necessary to carry out the integration for one Hubble time. The Figure thus shows the convergence of these radii as more and more computational effort is spent.

The results show several interesting trends. In general, for poor integration accuracy (on the left), all radii are much larger than the converged values, i.e. the density profile shows severe signs of relaxation even at large distances from the center. When an integrator with fixed timestep is used (solid lines), the radii decrease monotonically as more force evaluations, i.e. smaller timesteps, are taken, until convergence to asymptotic values is reached. The other timestep criteria converge to these asymptotic values much more quickly. Also, for small softening values, the timestep criterion ∆t ∝ |a|−1 (dashed lines) performs noticeably better than the one with ∆t ∝ |a|−1/2 (dotted lines), i.e. it is usually closer to the converged values at a given computational cost. On the other hand, the criterion based on the local dynamical time (dot-dashed lines) clearly does the best job. It stays close to the converged values even at a very low number of timesteps.

This impression is corroborated when the softening length is increased. The timestep based on the local dynamical time performs better than the other two simple criteria, and than the fixed timestep. Interestingly, the criteria based on the acceleration develop a non-monotonic behaviour when the potential is softened. Already at ε = 0.004 this is visible at the three lowest radii, but more clearly so for ε = 0.04. Before the radii increase when poorer integration settings are chosen, they actually first decrease. There is thus a regime where both of these criteria can lead to a net loss of energy for particles close to the shallow core induced by the softening, thereby steepening it. We think that this is related to the fact that the timesteps actually increase towards the center for the criteria (37) and (38) as soon as the logarithmic slope of the density profile becomes shallower than −1. On the other hand, the density increases monotonically towards the center, and hence a timestep based on the local dynamical time will decrease. If one uses the local velocity dispersion in the criterion (37), a better behaviour can be obtained,


Figure 5: Radii enclosing a fixed mass (7.1×10−5, 3.6×10−4, 1.8×10−3, 7.1×10−3, 3.6×10−2, 2.1×10−1, and 7.1×10−1 in units of the virial mass) after integrating the particle population of an NFW halo in a fixed potential for one Hubble time with various timestep criteria. Solid lines are for fixed timesteps for all particles, dashed lines for timesteps based on ∆t ∝ |a|−1, dotted lines for ∆t ∝ |a|−1/2, and dot-dashed lines for timesteps based on the local dynamical time, i.e. ∆t ∝ ρ−1/2. The three panels are for softened forms of the NFW potential according to equation (66), with ε = 0.0004, 0.004, and 0.04 from left to right. Note that the fluctuations of the lowest radial shell are caused by the low sampling of the halo in the core and are not significant. The horizontal axis gives the mean number of force evaluations per particle required to advance the particle set for one Hubble time.

because the velocity dispersion declines towards the center in the core region of dark matter haloes. However, this is not enough to completely suppress the non-monotonic behaviour, as we have found in further sets of test calculations.

We thus think that, in particular for the cores of large haloes, the local dynamical time provides the best of the timestep criteria tested. It leads to hardly any secular evolution of the density profile down to rather coarse timestepping. On the other hand, as Quinn et al. (1997) have pointed out, estimating the density in a real simulation poses additional problems. Since kernel estimates are carried out over a number of neighbours, the density in small haloes can be underestimated, resulting in ‘evaporation’ of haloes. These haloes are better treated with a criterion based on the local acceleration, which is much more accurately known in this case.

We thus think that a conservative and accurate way to choose timesteps in cosmological simulations is obtained by taking the minimum of (37) and (39). This requires a computation of the local matter density and the local velocity dispersion of the collisionless material. Both of these quantities can be obtained for the dark matter just as is done for the SPH particles. Of course, this requires additional work to find neighbours and to do the appropriate summations and kernel estimates, which is, however, typically not more than ∼ 30% of the cost of the gravitational force computation. However, as Figure 5 shows, this is in general worth investing, because the alternative timestep criteria require more timesteps, and hence larger computational cost, by a larger factor in order to achieve the same accuracy.
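The combined criterion can be sketched as follows. The exact forms of criteria (37) and (39) are given earlier in the paper; the forms αtol σ/|a| and ∝ (Gρ)−1/2 used in this illustrative Python fragment are assumptions in their spirit, with arbitrary code units.

```python
import math

G = 1.0  # gravitational constant in code units (assumed for this sketch)

def conservative_timestep(sigma, accel, rho, alpha_tol=0.05, alpha_dyn=0.05):
    """Sketch of the combined criterion: the minimum of an acceleration-
    based timestep (assumed here of the form alpha_tol * sigma / |a|, in
    the spirit of eq. 37) and one based on the local dynamical time,
    dt ~ alpha_dyn / sqrt(G * rho) (in the spirit of eq. 39)."""
    dt_acc = alpha_tol * sigma / accel
    dt_dyn = alpha_dyn / math.sqrt(G * rho)
    return min(dt_acc, dt_dyn)
```

In a dense halo core the dynamical-time term dominates and shrinks the step, while in poorly sampled low-density regions the acceleration-based term takes over, which is the behaviour argued for above.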



Figure 6: Cumulative distribution of the relative force error δa/a as a function of the tolerance parameter for three opening criteria, and for two different particle distributions: the BH opening criterion with θ = 0.4, 0.6, 0.8, 1.0, and 1.2, the new criterion with α = 0.0002 to 0.02, and the optimal opening with δ = 0.0001 to 0.01. The panels in the left column show results for the particle distribution in the initial conditions of a 32³ simulation at z = 25, while the panels on the right give the force errors for the evolved and clustered distribution at z = 0.



Figure 7: Computational efficiency of the three different cell-opening criteria (BH, new, and optimal opening), parameterized by their tolerance parameters. The horizontal axis is proportional to the computational cost (interactions per particle), and the vertical axis gives the 95% percentile of the cumulative distribution of relative force errors in a 32³ cosmological simulation. The left panel shows results for the initial conditions at z = 25, the right panel for the clustered distribution at redshift z = 0.

7.2. Force accuracy and opening criterion

In Section 3.3, we have outlined that standard values of the BH opening criterion can result in very high force errors for the initial conditions of cosmological N-body simulations. Here, the density field is very close to being homogeneous, so that small peculiar accelerations arise from a cancellation of relatively large partial forces. We now investigate this problem further using a cosmological simulation with 32³ particles in a periodic box. We consider the particle distribution of the initial conditions at redshift z = 25, and the clustered one of the evolved simulation at z = 0. Romeel Dave has kindly made these particle dumps available to us, together with exact forces he calculated for all particles with a direct summation technique, properly including the periodic images of the particles. In the following, we will compare his forces (which we take to be the exact result) to the ones computed by the parallel version of GADGET using two processors on a Linux PC.

We first examine the distribution of the relative force error as a function of the tolerance parameter used in the two different cell-opening criteria (8) and (18). These results are shown in Figure 6, where we also contrast these distributions with the ones obtained with an optimum cell-opening strategy. The latter may be operationally defined as follows:

Any complete tree walk results in an interaction list that contains a number of internal tree nodes (whose multipole expansions are used) and a number of single particles (which give rise to exact partial forces). Obviously, the shortest possible interaction list is that of just one entry, the root node of the tree itself. Suppose we start with this interaction list, and then always open the one node of the current interaction list that gives rise to the largest absolute force error. This is the node with the largest difference between multipole expansion and exact force, as resulting from direct summation over all the particles represented by the node. With such an opening strategy, the shortest possible interaction list can be found for any desired level of accuracy. The latter may be set by requiring that the largest force error of any node in the interaction list has fallen below a fraction δ|a| of the true force field.

Of course, this optimum strategy is not practical in a real simulation; after all, it requires knowledge of the exact forces for all internal nodes of the tree. If these were known, we wouldn’t have to bother with trees to begin with. However, it is very interesting to find out what such a fiducial optimum opening strategy would ultimately deliver. Any other analytic or heuristic cell-opening criterion is bound to be worse. Knowing how much worse such criteria are will thus inform us about the maximum performance improvement possible by adopting alternative cell-opening criteria. We therefore computed for each particle the exact direct summation forces exerted by each of the internal nodes, hence allowing us to perform an optimum tree walk in the above way. The resulting cumulative error distributions as a function of δ are shown in Figure 6.
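The greedy procedure just described maps naturally onto a priority queue. In this illustrative Python sketch, per-node errors are assumed to be known in advance (as in this test), single particles carry zero error, and the node dictionaries are a stand-in for the actual tree data structure.

```python
import heapq

def optimal_interaction_list(root, tol):
    """Greedy 'optimal' opening: repeatedly open the node of the current
    interaction list with the largest known absolute force error, until
    every entry's error is below tol. Nodes are illustrative dicts with
    'err' (exact multipole error; 0.0 for single particles) and
    'children' (empty for single particles)."""
    heap = [(-root['err'], id(root), root)]
    while heap and -heap[0][0] > tol:
        _, _, node = heapq.heappop(heap)      # open the worst offender
        for child in node['children']:
            heapq.heappush(heap, (-child['err'], id(child), child))
    return [entry[2] for entry in heap]

# tiny example tree: a root cell, one bad sub-cell, one acceptable sub-cell
p1 = {'err': 0.0, 'children': []}             # single particle: exact force
p2 = {'err': 0.0, 'children': []}
node_a = {'err': 0.3, 'children': [p1, p2]}   # must be opened at tol = 0.1
node_b = {'err': 0.05, 'children': []}        # kept as a multipole
root = {'err': 1.0, 'children': [node_a, node_b]}
ilist = optimal_interaction_list(root, tol=0.1)
```

The resulting list is the shortest one whose largest per-node error is below the requested bound, which is exactly the benchmark the heuristic criteria are compared against.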

Looking at this Figure, it is clear that the BH error distribution has a more pronounced tail of large force errors compared to the opening criterion (18). What is most striking, however, is that at high redshift the BH force errors can become very large for values of the opening angle θ that tend to


give perfectly acceptable accuracy for a clustered mass distribution. This is clearly caused by its purely geometrical nature, which does not know about the smallness of the net forces, and thus does not invest sufficient effort to evaluate partial forces to high enough accuracy.

The error distributions alone do not yet tell us about the computational efficiency of the cell-opening strategies. To assess this, we define the error for a given cell-opening criterion as the 95% percentile of the error distribution, and we plot this versus the mean length of the interaction list per particle. This length is essentially linearly related to the computational cost of the force evaluation.
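The error measure can be computed as a simple nearest-rank percentile. This Python sketch is not necessarily the authors' exact estimator, only a standard way to extract the 95% point of an error distribution.

```python
import math

def error_percentile(rel_errors, q=0.95):
    """Nearest-rank percentile of a distribution of relative force
    errors |a_tree - a_exact| / |a_exact| (illustrative estimator)."""
    s = sorted(rel_errors)
    k = max(0, math.ceil(q * len(s)) - 1)   # index of the nearest rank
    return s[k]
```

Applied to the per-particle errors of a whole tree walk, this yields one point of the curves shown in Figure 7 for each tolerance setting.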

In Figure 7, we compare the performance of the cell-opening criteria in this way. At high redshift, we see that the large errors of the BH criterion arise mainly because the mean length of the interaction list remains small and does not adapt to the more demanding requirements of the force computation in this regime. Our ‘new’ cell-opening criterion does much better in this respect; its errors remain well controlled without having to adjust the tolerance parameter.

Another advantage of the new criterion is that it is in fact more efficient than the BH criterion, i.e. it achieves higher force accuracy for a given computational expense. As Figure 7 shows, for a clustered particle distribution in a cosmological simulation, the implied saving can easily reach a factor of 2-3, speeding up the simulation by the same factor. Similar performance improvements are obtained for isolated galaxies, as the example in Figure 8 demonstrates. This can be understood as follows: the geometrical BH criterion does not take the dynamical significance of the mass distribution into account. For example, it will invest a large number of cell-particle interactions to compute the force exerted by a large void to high relative accuracy, even though this force is of small absolute size, and it would be better to concentrate on those regions that provide most of the force on the current particle. The new opening criterion follows this latter strategy, improving the force accuracy at a given number of particle-cell interactions.
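The contrast between the two rules can be made concrete. In this illustrative Python fragment, the BH rule is the geometric opening-angle test of equation (8); the form G M l⁴/r⁶ > α |a_old| for the new criterion is an assumption based on the description of equation (18) in Section 3.3, written in arbitrary units.

```python
def bh_open(l, r, theta=0.8):
    # geometric Barnes & Hut rule (eq. 8): open a cell of side length l
    # at distance r if it subtends too large an angle
    return l / r > theta

def new_open(M, l, r, a_old, alpha=0.02, G=1.0):
    # dynamical rule in the spirit of eq. (18): open the cell if its
    # estimated force error is not small compared with the particle's
    # acceleration on the previous timestep (exact form assumed here)
    return G * M * l**4 / r**6 > alpha * abs(a_old)
```

In near-homogeneous initial conditions |a_old| is tiny, so the second rule keeps opening distant cells that the purely geometric test would happily accept, which is exactly the adaptivity seen at z = 25 in Figure 7.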

We note that the improvement obtained by criterion (18) brings us about halfway to what might ultimately be possible with an optimum criterion. It thus appears that there is still room for further improvement of the cell-opening criterion, even though it is clear that the optimum will likely not be reachable in practice.

7.3. Colliding disk galaxies

As a test of the performance and accuracy of the integration of collisionless systems, we here consider a pair of merging disk galaxies. Each galaxy has a massive dark halo consisting of 30000 particles, and an embedded stellar disk represented by 20000 particles. The dark halo is modeled according to the NFW profile, adiabatically modified by the central exponential disk, which contributes 5% of the total mass. The halo


Figure 8: Force errors for an isolated halo/disk galaxy with the BH criterion (boxes, θ = 0.6 to 1.2), and the new opening criterion (triangles, α = 0.0025, 0.01, and 0.04). The dark halo of the galaxy is modeled with an NFW profile and is truncated at the virial radius. The plot shows the 90% percentile of the error distribution, i.e. 90% of the particles have force errors below the cited values. The horizontal axis measures the computational expense.

has a circular velocity v200 = 160 km s−1, a concentration c = 5, and spin parameter λ = 0.05. The radial exponential scale length of the disk is Rd = 4.5 h−1 kpc, while the vertical structure is that of an isothermal sheet with thickness z0 = 0.2 Rd. The gravitational softening of the halo particles is set to 0.4 h−1 kpc, and that of the disk to 0.1 h−1 kpc.

Initially, the two galaxies are set up on a parabolic orbit, with a separation such that their dark haloes just touch. Both of the galaxies have a prograde orientation, but are inclined with respect to the orbital plane. In fact, the test considered here corresponds exactly to the simulation ‘C1’ computed by Springel (2000), where more information about the construction of the initial conditions can be found (see also Springel & White 1999).

We first consider a run of this model with a set of parameters equal to the coarsest values we would typically employ for a simulation of this kind. For the time integration, we used the parameter αtol σ = 25 km s−1, and for the force computation with the tree algorithm, the new opening criterion with α = 0.04. The simulation was then run from t = 0 to t = 2.8 in internal units of time (corresponding to 2.85 h−1 Gyr). During this time, the galaxies have their first close encounter at around t ≃ 1.0, where tidal tails are ejected out of the stellar disks. Due to braking by dynamical friction, the galaxies eventually fall together a second time, after which they quickly merge and violently relax to form a single merger remnant. At t = 2.8, the inner parts of the merger remnant are well relaxed.

This simulation required a total of 4684 steps and 27795733 force computations, i.e. a computationally equally expensive computation with fixed timesteps could make just 280 full timesteps. The relative error in the total energy was



Figure 9: Spherically averaged density profile of the stellar component in the merger remnant of two colliding disk galaxies, in units of h² M⊙ kpc−3. Triangles show the results obtained using our variable timestep integration, while boxes and diamonds are for fixed timestep integrations with ∆t = 0.01 (10 h−1 Myr) and ∆t = 0.0025 (2.5 h−1 Myr), respectively. Note that the simulation using the adaptive timestep is about as expensive as the one with ∆t = 0.01. In each case, the center of the remnant was defined as the position of the particle with the minimum gravitational potential.

3.0×10−3, and a Sun Ultrasparc-II workstation (sparcv9 processor, 296 MHz clock speed) completed the simulation in 4 hours (using the serial code). The raw force speed was ∼ 2800 force computations per second; older workstations will achieve somewhat smaller values, of course. Also, higher force accuracy settings will slow down the code somewhat.
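The quoted step count is easy to verify from the figures above (plain Python arithmetic; both galaxies together contain 2 × (30000 + 20000) particles).

```python
# Consistency check of the quoted cost of the merger run: the number of
# equivalent 'full' timesteps is the total force count divided by the
# total particle number.
n_particles = 2 * (30000 + 20000)          # 100000 particles in total
force_computations = 27795733
equivalent_full_steps = force_computations / n_particles   # ~278
```

This reproduces the quoted figure of roughly 280 full timesteps for a fixed-step run of equal cost.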

We now consider a simulation of the same system using a fixed timestep. For ∆t = 0.01, the run needs 280 full steps, i.e. it consumes the same amount of CPU time as above. However, in this simulation the error in the total energy is 2.2%, substantially larger than before. There are also differences in the density profiles of the merger remnants. In Figure 9, we compare the inner density profile of the simulation with adaptive timesteps (triangles) to the one with a fixed timestep of ∆t = 0.01 (boxes). We here show the spherically averaged profile of the stellar component, with the center of the remnants defined as the position of the particle with the minimum gravitational potential. It can be seen that in the innermost ∼ 1 h−1 kpc, the density obtained with the fixed timestep falls short of the adaptive timestep integration.

To get an idea of how small the fixed timestep has to be to achieve similar accuracy as with the adaptive timestep, we have run this test a second time, with a fixed timestep of ∆t = 0.0025. We also show the corresponding profile (diamonds) in Figure 9. It can be seen that for smaller ∆t, the agreement with the variable timestep result improves. At ∼ 2 × 0.4 h−1 kpc, the softening of the dark matter starts to become important. For agreement down to this scale, the fixed timestep needs to be slightly smaller than ∆t = 0.0025, so the overall saving due to the use of individual particle timesteps is at least a factor of 4-5 in this example.

7.4. Collapse of a cold gas sphere

The self-gravitating collapse of an initially isothermal, cool gas sphere has been a common test problem for SPH codes (Evrard 1988; Hernquist & Katz 1989; Steinmetz & Müller 1993; Thacker et al. 2000; Carraro et al. 1998, and others). Following these authors, we consider a spherically symmetric gas cloud of total mass M, radius R, and initial density profile

ρ(r) = M / (2π R² r) .    (67)

We take the gas to be of constant temperature initially, with an internal energy per unit mass of

u = 0.05 GM/R .    (68)

At the start of the simulation, the gas particles are at rest. We obtain their initial coordinates from a distorted regular grid that reproduces the density profile (67), and we use a system of units with G = M = R = 1.
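In these units the initial conditions are fully specified by equations (67) and (68); a short Python check confirms that the profile encloses the total mass M = 1.

```python
import math

def density(r, M=1.0, R=1.0):
    # initial density profile of equation (67), valid for 0 < r <= R
    return M / (2.0 * math.pi * R**2 * r)

def initial_internal_energy(M=1.0, R=1.0, G=1.0):
    # initial internal energy per unit mass, equation (68)
    return 0.05 * G * M / R

# sanity check: integrating 4*pi*r^2 * rho(r) over 0 < r < R recovers M
n = 10000
dr = 1.0 / n
mass = sum(4.0 * math.pi * ((i + 0.5) * dr) ** 2 * density((i + 0.5) * dr) * dr
           for i in range(n))
```

The integrand 4π r² ρ(r) = 2 M r / R² is linear in r, so the midpoint sum reproduces the analytic result M essentially exactly.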

In Figure 10, we show the evolution of the potential, the thermal, and the kinetic energy of the system from the start of the simulation at t = 0 to t = 3. We plot results for two simulations, one with 30976 particles (solid), and one with 4224 particles (dashed). During the central bounce between t ≈ 0.8 and t ≈ 1.2, most of the kinetic energy is converted into heat, and a strong shock wave travels outward. Later, the system slowly settles into virial equilibrium.

For these runs, Ns was set to 40, the gravitational softening to ε = 0.02, and the time integration was controlled by the parameters αtol σ = 0.05 and αcour = 0.12, resulting in very good energy conservation. The absolute error in the total energy is less than 1.1×10−3 at all times during the simulation, translating into a relative error of 0.23%. Since we use a time integration scheme with individual timesteps of arbitrary size, this small error is reassuring. The total number of small steps taken by the 4224 particle simulation was 3855, with a total of 2192421 force computations, i.e. the equivalent number of ‘full’ timesteps was 519. A Sun Ultrasparc-II workstation needed 2300 seconds for the simulation. The larger 30976 particle run took 10668 steps, with an equivalent of 1086 ‘full’ steps, and 12 hours of CPU time. Note that when the time integration accuracy is reduced by a factor of 2, with a corresponding reduction of the CPU time consumption by the same factor, the results practically do not change; however, the maximum error in the energy goes up to 1.2% in this case.

² Note that our definition of the smoothing length h differs by a factor of 2 from most previous SPH implementations. As a consequence, corresponding values of αcour are different by a factor of 2, too.



Figure 10: Time evolution of the thermal, kinetic, potential, and total energy for the collapse of an initially isothermal gas sphere. Solid lines show results for a simulation with 30976 particles; dashed lines are for a 4224 particle run.

The results of Figure 10 agree very well with those of Steinmetz & Müller (1993), and with Thacker et al. (2000) if we compare to their ‘best’ implementation of artificial viscosity (their version 12). Steinmetz & Müller (1993) have also computed a solution of this problem with a very accurate, one-dimensional, spherically symmetric piecewise parabolic method (PPM). For particle numbers above 10000, our SPH results come very close to this finite difference result. However, even for very small particle numbers, SPH is capable of reproducing the general features of the solution very well. We also expect that a three-dimensional PPM calculation of the collapse would likely require at least the same amount of CPU time as our SPH calculation.

7.5. Performance and scalability of the parallel gravity

We here show a simple test of the performance of the parallel version of the code under conditions relevant for real target applications. For this test, we have used a ‘stripped-down’ version of the initial conditions originally constructed for a high-resolution simulation of a cluster of galaxies. The original set of initial conditions was set up to follow the cosmological evolution of a large spherical region with comoving radius 70 h−1 Mpc, within a ΛCDM cosmogony corresponding to Ω0 = 0.3, ΩΛ = 0.7, zstart = 50, and h = 0.7. In the center of the simulation sphere, 2 million high-resolution particles were placed in the somewhat enlarged Lagrangian region of the cluster. The rest of the volume was filled with an extended shell of boundary particles of larger mass and larger softening; they are needed for a proper representation of the gravitational tidal field.

To keep our test simple, we have cut a centred sphere of comoving radius 12.25 h−1 Mpc from these initial conditions, and we only simulated the 500000 high-resolution particles with mass mp = 1.36×10⁹ h−1 M⊙ found within this region. Such a simulation will not be useful for direct scientific analysis because it does not model the tidal field properly.


Table 1: Consumption of CPU time in various parts of the code for a cosmological run from z = 50 to z = 4.3. The table gives timings for runs with 4, 8 and 16 processors on the Garching Cray T3E. The computation of the gravitational forces is by far the dominant computational task. We have further split up that time into the actual tree-walk time, the tree-construction time, the time for communication and summation of force contributions, and the time lost to work-load imbalance. The potential computation is done only once in this test (it can optionally be done at regular intervals to check energy conservation of the code). ‘Miscellaneous’ refers to time spent in advancing and predicting particles, and in managing the binary tree for the timeline. I/O time for writing a snapshot file (groups of processors can write in parallel) is only 1-2 seconds, and is therefore not listed here.

                                     4 processors        8 processors        16 processors
                                    time [sec]  rel.    time [sec]  rel.    time [sec]  rel.
total                                 8269.0  100.0 %     4074.0  100.0 %     2887.5  100.0 %
gravity                               7574.6   91.6 %     3643.0   89.4 %     2530.3   87.6 %
  tree walks                          5086.3   61.5 %     2258.4   55.4 %     1322.4   45.8 %
  tree construction                   1518.4   18.4 %      773.7   19.0 %      588.4   20.4 %
  communication & summation             24.4    0.3 %       35.4    0.9 %       54.1    1.9 %
  work-load imbalance                  901.5   10.9 %      535.1   13.1 %      537.4   18.6 %
domain decomposition                   209.9    2.5 %      158.1    3.9 %      172.2    6.0 %
potential computation (optional)        46.3    0.2 %       18.1    0.4 %       11.4    0.4 %
miscellaneous                          438.3    5.3 %      254.8    6.3 %      173.6    6.0 %


Figure 11: Code performance and scalability for a cosmological integration with vacuum boundaries. The left panel shows the speed of the gravitational force computation as a function of processor number Np (in particles per second). This is based on the tree-walk time alone. In the simulation, additional time is needed for tree construction, work-load imbalance, communication, domain decomposition, prediction of particles, the timeline, etc. This reduces the ‘effective’ speed of the code, as shown in the right panel. This effective speed gives the number of particles advanced by one timestep per second. Note that the only significant source of work-load imbalance in the code occurs in the gravity computation, where some small fraction of time is lost when processors idly wait for others to finish their tree-walks.

However, this test will show realistic clustering and time-stepping behaviour, and thus allows a reasonable assessment of the expected computational cost and scaling of the full problem.

We have run the test problem with GADGET on the Garching T3E from redshift z = 50 to redshift z = 4.3. We repeated the identical run on partitions of size 4, 8, and 16 processors. In this test, we included quadrupole moments in the tree computation, we used a BH opening criterion with θ = 1.0, and a gravitational softening length of 15 h⁻¹kpc.

In Table 1 we list in detail the elapsed wall-clock time for

various parts of the code for the three simulations. The dominant sink of CPU time is the computation of gravitational forces for the particles. To advance the test simulation from z = 50 to z = 4.3, GADGET needed 30.0 × 10⁶ force computations in a total of 3350 timesteps. Note that on average only 1.8% of the particles are advanced in a single timestep. Under these conditions it is challenging to eliminate sources of overhead incurred by the time-stepping and to maintain work-load balancing. GADGET solves this task satisfactorily. If we used a fixed timestep, the work-load balancing would be


essentially perfect. Note that the use of our adaptive timestep integrator results in a saving of about a factor of 3-5 compared to a fixed timestep scheme with the same accuracy.

We think that the overall performance of GADGET is good in this test. The raw gravitational speed is high, and the algorithm used to parallelize the force computation scales well, as is seen in the left panel of Figure 11. Note that the force speed of the Np = 8 run is even higher than that of the Np = 4 run. This is because the domain decomposition does exactly one split in the x-, y-, and z-directions in the Np = 8 case. The domains are then close to cubes, which reduces the depth of the tree and speeds up the tree-walks.

Also, the force communication does not involve a significant communication overhead, and the time spent in miscellaneous tasks of the simulation code scales closely with processor number. Most losses in GADGET occur due to work-load imbalance in the force computation. While we think these losses are acceptable in the above test, one should keep in mind that we here kept the problem size fixed, and just increased the processor number. If we also scale up the problem size, work-load balancing will be significantly easier to achieve, and the efficiency of GADGET will be nearly as high as for small processor numbers.

Nevertheless, the computational speed of GADGET may seem disappointing when compared with the theoretical peak performance of modern microprocessors. For example, the processors of the T3E used for the timings have a nominal peak performance of 600 MFlops, but GADGET falls far short of reaching this. However, the peak performance can only be reached under the most favourable of circumstances, and typical codes operate in a range where they are a factor of 5-10 slower. GADGET is no exception here. While we think that the code does a reasonable job in avoiding unnecessary floating point operations, there is certainly room for further tuning measures, including processor-specific ones which haven't been tried at all so far. Also note that our algorithm of individual tree walks produces essentially random access of memory locations, a situation that could hardly be worse for the cache pipeline of current microprocessors.

7.6. Parallel SPH in a periodic volume

As a further test of the scaling and performance of the parallel version of GADGET in typical cosmological applications, we consider a simulation of structure formation in a periodic box of size (50 h⁻¹Mpc)³, including adiabatic gas physics. We use 32³ dark matter particles and 32³ SPH particles in a ΛCDM cosmology with Ω₀ = 0.3, Ω_Λ = 0.7 and h = 0.7, normalized to σ₈ = 0.9. For simplicity, initial conditions are constructed by displacing the particles from a grid, with the SPH particles placed on top of the dark matter particles.

We have evolved these initial conditions from z = 10 to z = 1, using 2, 4, 8, 16, and 32 processors on the Cray T3E in Garching. The final particle distributions of all these runs


Figure 12: Speed of the code in a gasdynamical simulation of a periodic ΛCDM universe, run from z = 10 to z = 1, as a function of the number of T3E processors employed. The top panel shows the speed of the gravitational force computation (including tree construction times). We define the speed here as the number of force computations per elapsed wall-clock time. The middle panel gives the speed in the computation of hydrodynamical forces, while the bottom panel shows the resulting overall code speed in terms of particles advanced by one timestep per second. This effective code speed includes all other code overhead, which is less than 5% of the total CPU time in all runs.

are in excellent agreement with each other. In the bottom panel of Figure 12, we show the code speed


as a function of the number of processors employed. We here define the speed as the total number of force computations divided by the total elapsed wall-clock time during this test. Note that the scaling of the code is almost perfectly linear in this example, even better than for the set-up used in the previous section. In fact, this is largely caused by the simpler geometry of the periodic box as compared to the spherical volume used earlier. The very absence of such boundary effects makes the periodic box easier to domain-decompose and to work-balance.

The top and middle panels of Figure 12 show the speed of the gravitational force computation and that of the SPH part separately. We find it encouraging that the SPH algorithm scales very well, which is promising for future large-scale applications.

It is interesting that in this test (where Ns = 40 SPH neighbours have been used) the hydrodynamical part actually consumes only ∼ 25% of the overall CPU time. Partly this is due to the slower gravitational speed in this test compared to the results shown in Figure 11, which in turn is caused by the Ewald summation needed to treat the periodic boundary conditions, and by longer interaction lists in the present test (we here used our new opening criterion). Also note that only half of the particles in this simulation are SPH particles.

We remark that the gravitational force computation will usually be more expensive at higher redshift than at lower redshift, while the SPH part does not have such a dependence. The fraction of time consumed by the SPH part thus tends to increase when the material becomes more clustered. In dissipative simulations one will typically form clumps of cold gas with very high density, objects that we think will form stars and turn into galaxies. Such cold knots of gas can slow down the computation substantially, because they require small hydrodynamical timesteps, and if a lower spatial resolution cut-off for SPH is imposed, the hydrodynamical smoothing may start to involve more neighbours than Ns.

8. Discussion

We have presented the numerical algorithms of our code GADGET, designed as a flexible tool to study a wide range of problems in cosmology. Typical applications of GADGET include interacting and colliding galaxies, star formation and feedback in the interstellar medium, the formation of clusters of galaxies, and the formation of large-scale structure in the universe.

In fact, GADGET has already been used successfully in all of these areas. Using our code, Springel & White (1999) have studied the formation of tidal tails in colliding galaxies, and Springel (2000) has modeled star formation and feedback in isolated and colliding gas-rich spirals. For these simulations, the serial version of the code was employed, both with and without support by the GRAPE special-purpose hardware.

The parallel version of GADGET has been used to compute high-resolution N-body simulations of clusters of galaxies (Springel 1999; Springel et al. 2000; Yoshida et al. 2000a,b). In the largest simulation of this kind, 69 million particles have been employed, with 20 million of them ending up in the virialized region of a single object. The particle mass in the high-resolution zone was just ∼ 10⁻¹⁰ of the total simulated mass, and the gravitational softening length was 0.7 h⁻¹kpc in a simulation volume of diameter 140 h⁻¹Mpc, translating to an impressive spatial dynamic range of 2 × 10⁵ in three dimensions.

We have also successfully employed GADGET for two 'constrained-realization' (CR) simulations of the Local Universe (Mathis et al. 2000, in preparation). In these simulations, the observed density field as seen by IRAS galaxies has been used to constrain the phases of the waves of the initial fluctuation spectrum. For each of the two CR simulations, we employed ∼ 75 million particles in total, with 53 million high-resolution particles of mass 3.6 × 10⁹ h⁻¹M⊙ (ΛCDM) or 1.2 × 10¹⁰ h⁻¹M⊙ (τCDM) in the low-density and critical-density models, respectively.

The main technical features of GADGET are as follows. Gravitational forces are computed with a Barnes & Hut oct-tree, using multipole expansions up to quadrupole order. Periodic boundary conditions can optionally be used and are implemented by means of Ewald summation. The cell-opening criterion may be chosen either as the standard BH criterion, or as a new criterion which we have shown to be computationally more efficient and better suited to cosmological simulations starting at high redshift. As an alternative to the tree algorithm, the serial code can use the special-purpose hardware GRAPE both to compute gravitational forces and to search for SPH neighbours.
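The core decision in any such tree-walk is whether a cell may be accepted as a multipole or must be opened. As a minimal illustration of the standard BH criterion mentioned above (monopole terms only; GADGET additionally carries quadrupole moments, and all names here are our own, not GADGET's):

```python
import numpy as np

class Node:
    """A cell of a BH oct-tree: side length, centre of mass, total mass,
    and child cells (an empty tuple marks a leaf particle)."""
    def __init__(self, size, com, mass, children=()):
        self.size = size
        self.com = np.asarray(com, dtype=float)
        self.mass = mass
        self.children = children

def tree_accel(node, pos, theta=1.0, G=1.0, eps=1e-2):
    """Monopole-only BH walk: accept a cell if l < theta * d,
    otherwise open it and descend into the children."""
    d_vec = node.com - pos
    d = np.sqrt(d_vec @ d_vec + eps**2)   # softened distance
    if not node.children or node.size < theta * d:
        return G * node.mass * d_vec / d**3
    return sum(tree_accel(c, pos, theta, G, eps) for c in node.children)
```

With a small theta every cell is opened and the walk reduces to direct summation; larger theta trades accuracy for speed by accepting distant cells as single pseudo-particles.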

In our SPH implementation, the number of smoothing neighbours is kept exactly constant in the serial code, and is allowed to fluctuate in a small band in the parallel code. Force symmetry is achieved by using the kernel-averaging technique, and a suitable neighbour-searching algorithm is used to guarantee that all interacting pairs of SPH particles are always found. We use a shear-reduced artificial viscosity that has emerged as a good parameterization in recent systematic studies comparing several alternative formulations (Thacker et al. 2000; Lombardi et al. 1999).
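The kernel-averaging idea can be sketched as follows: the kernel gradient entering the pairwise pressure force is the mean of the gradients evaluated with the two particles' smoothing lengths. This is an illustrative sketch with our own function names, not GADGET's implementation, and it omits the artificial viscosity term:

```python
import numpy as np

def dW_dr(r, h):
    """Radial derivative of the Monaghan-Lattanzio spline kernel
    W(r; h) of equation (69), defined on [0, h]."""
    u = r / h
    norm = 8.0 / (np.pi * h**4)
    if u <= 0.5:
        return norm * (-12.0 * u + 18.0 * u**2)
    if u <= 1.0:
        return norm * (-6.0 * (1.0 - u)**2)
    return 0.0

def pair_accel(m_j, P_i, P_j, rho_i, rho_j, r_vec, h_i, h_j):
    """Acceleration of particle i due to j, with force symmetry enforced
    by averaging the kernel gradients over the two smoothing lengths."""
    r = np.linalg.norm(r_vec)
    grad_W = 0.5 * (dW_dr(r, h_i) + dW_dr(r, h_j)) * r_vec / r
    return -m_j * (P_i / rho_i**2 + P_j / rho_j**2) * grad_W
```

Because the pressure bracket is symmetric in i and j while the averaged gradient is antisymmetric under exchange of the pair, m_i a_i = -m_j a_j holds pairwise, so linear momentum is conserved exactly.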

Parallelization of the code for massively parallel supercomputers is achieved in an explicit message-passing approach, using the MPI standard communication library. The simulation volume is spatially split using a recursive orthogonal bisection, and each of the resulting domains is mapped onto one processor. Dynamic work-load balancing is achieved by measuring the computational expense incurred by each particle, and balancing the sum of these weights in the domain decomposition.
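A minimal sketch of such a weighted recursive orthogonal bisection (illustrative only; the names are our own, and GADGET's actual decomposition differs in detail, e.g. in how cut axes are chosen):

```python
import numpy as np

def orb_domains(idx, pos, weight, nproc, axis=0):
    """Recursively bisect a particle set along alternating coordinate
    axes, cutting where the cumulative work weight is split in half,
    until one domain per processor remains."""
    if nproc == 1:
        return [idx]
    order = idx[np.argsort(pos[idx, axis])]          # sort along current axis
    csum = np.cumsum(weight[order])
    cut = int(np.searchsorted(csum, 0.5 * csum[-1])) + 1
    left, right = order[:cut], order[cut:]
    nxt = (axis + 1) % pos.shape[1]                  # cycle x -> y -> z
    return (orb_domains(left,  pos, weight, nproc // 2, nxt) +
            orb_domains(right, pos, weight, nproc // 2, nxt))

# Example: 4096 particles of unit work weight, decomposed for 8 processors.
rng = np.random.default_rng(42)
pos = rng.random((4096, 3))
weight = np.ones(4096)
domains = orb_domains(np.arange(4096), pos, weight, 8)
```

Replacing the unit weights by per-particle cost measurements (as described above) makes the same cut rule balance work rather than particle number.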

The code allows fully adaptive, individual particle timesteps, both for collisionless particles and for SPH particles. The speed-up obtained by the use of individual timesteps depends on the dynamic range of the time scales present in the problem, and on the relative population of these time scales with particles. For a collisionless cosmological simulation with a gravitational softening length larger than ∼ 30 h⁻¹kpc, the overall saving is typically a factor of 3-5. However, if smaller softening lengths are desired, the use of individual particle timesteps results in larger savings. In the hydrodynamical part, the savings can be still larger, especially if dissipative physics is included. In this case, adaptive timesteps may be required to make a simulation feasible to begin with. GADGET can be used to run simulations both in physical and in comoving coordinates. The latter are used for cosmological simulations only. Here, the code employs an integration scheme that can deal with arbitrary cosmological background models, and which is exact in linear theory, i.e. the linear regime can be traversed with maximum efficiency.

GADGET is an intrinsically Lagrangian code. Both the gravity and the hydrodynamical parts impose no restriction on the geometry of the problem, nor any hard limit on the allowable dynamic range. Current and future simulations of structure formation that aim to resolve galaxies in their correct cosmological setting will have to resolve length scales of size 0.1-1 h⁻¹kpc in volumes of size ∼ 100 h⁻¹Mpc. This range of scales is accompanied by a similarly large dynamic range in mass and time scales. Our new code is essentially free to adapt to these scales naturally, and it invests computational work only where it is needed. It is therefore a tool that should be well suited to work on these problems.

Since GADGET is written in standard ANSI C, and the parallelization for massively parallel supercomputers is achieved with the standard MPI library, the code runs on a large variety of platforms without requiring any change. Having eliminated the dependence on proprietary compiler software and operating systems, we hope that the code will remain usable for the foreseeable future. We release the parallel and the serial versions of GADGET publicly in the hope that they will be useful for others as a scientific tool and as a basis for further numerical developments.

Acknowledgements

We are grateful to Barbara Lanzoni, Bepi Tormen, and Simone Marri for their patience in working with earlier versions of GADGET. We thank Lars Hernquist, Martin White, Charles Coldwell, Jasjeet Bagla and Matthias Steinmetz for many helpful discussions on various algorithmic and numerical issues. We also thank Romeel Dave for making some of his test particle configurations available to us. We are indebted to the Rechenzentrum of the Max-Planck-Society in Garching for providing excellent support for their T3E, on which a large part of the computations of this work have been


Figure 13: Comparison of the spline-softened (solid) and Plummer-softened (dotted) potential of a point mass with the Newtonian potential (dashed). Here h = 1.0, and ε = h/2.8.

carried out. We want to thank the referee, Junichiro Makino, for very insightful comments on the manuscript.

Appendix: Softened tree nodes

The smoothing kernel we use for SPH calculations is a spline of the form (Monaghan & Lattanzio 1985)

W(r; h) = \frac{8}{\pi h^3}
\begin{cases}
1 - 6\left(\frac{r}{h}\right)^2 + 6\left(\frac{r}{h}\right)^3, & 0 \le \frac{r}{h} \le \frac{1}{2}, \\
2\left(1 - \frac{r}{h}\right)^3, & \frac{1}{2} < \frac{r}{h} \le 1, \\
0, & \frac{r}{h} > 1.
\end{cases}    (69)

Note that we define the smoothing kernel on the interval [0, h] and not on [0, 2h], as is frequently done in other SPH calculations.
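As written, the kernel of equation (69) is normalized to unity over the sphere of radius h. A quick numerical check of this (a sketch for illustration, not part of GADGET):

```python
import numpy as np

def W(r, h):
    """Monaghan-Lattanzio spline kernel of equation (69), defined on [0, h]."""
    u = r / h
    norm = 8.0 / (np.pi * h**3)
    if u <= 0.5:
        return norm * (1.0 - 6.0 * u**2 + 6.0 * u**3)
    if u <= 1.0:
        return norm * 2.0 * (1.0 - u)**3
    return 0.0

h = 1.7
r = np.linspace(0.0, h, 20001)
f = 4.0 * np.pi * r**2 * np.array([W(ri, h) for ri in r])
total = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(r))  # trapezoidal rule
```

The integral of 4π r² W(r; h) over [0, h] evaluates to 1 for any h, confirming the normalization constant 8/(π h³).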

We derive the spline-softened gravitational force from this kernel by taking the force from a point mass m to be the one resulting from a density distribution ρ(r) = m W(r; h). This leads to a potential

\Phi(r) = \frac{G m}{h}\, W_2\!\left(\frac{r}{h}\right)    (70)

with a kernel

W_2(u) =
\begin{cases}
\frac{16}{3} u^2 - \frac{48}{5} u^4 + \frac{32}{5} u^5 - \frac{14}{5}, & 0 \le u < \frac{1}{2}, \\
\frac{1}{15 u} + \frac{32}{3} u^2 - 16 u^3 + \frac{48}{5} u^4 - \frac{32}{15} u^5 - \frac{16}{5}, & \frac{1}{2} \le u < 1, \\
-\frac{1}{u}, & u \ge 1.
\end{cases}    (71)

Figure 14: Comparison of the spline-softened (solid) and Plummer-softened (dotted) force law with Newton's law (dashed). Here h = 1.0, and ε = h/2.8.

The multipole expansion of a group of particles is discussed in Section 3.1. It results in a potential and force given by equations (11) and (15), respectively. The functions appearing in equation (15) are defined as

g_1(y) = \frac{g'(y)}{y},    (72)

g_2(y) = \frac{g''(y)}{y^2} - \frac{g'(y)}{y^3},    (73)

g_3(y) = \frac{g_2'(y)}{y},    (74)

g_4(y) = \frac{g_1'(y)}{y}.    (75)

Writing u = y/h, the explicit forms of these functions are

g_1(y) = \frac{1}{h^3}
\begin{cases}
-\frac{32}{3} + \frac{192}{5} u^2 - 32 u^3, & u \le \frac{1}{2}, \\
\frac{1}{15 u^3} - \frac{64}{3} + 48 u - \frac{192}{5} u^2 + \frac{32}{3} u^3, & \frac{1}{2} < u < 1, \\
-\frac{1}{u^3}, & u > 1,
\end{cases}    (76)

g_2(y) = \frac{1}{h^5}
\begin{cases}
\frac{384}{5} - 96 u, & u \le \frac{1}{2}, \\
-\frac{384}{5} - \frac{1}{5 u^5} + \frac{48}{u} + 32 u, & \frac{1}{2} < u < 1, \\
\frac{3}{u^5}, & u > 1,
\end{cases}    (77)

g_3(y) = \frac{1}{h^7}
\begin{cases}
-\frac{96}{u}, & u \le \frac{1}{2}, \\
\frac{32}{u} + \frac{1}{u^7} - \frac{48}{u^3}, & \frac{1}{2} < u < 1, \\
-\frac{15}{u^7}, & u > 1,
\end{cases}    (78)

g_4(y) = \frac{1}{h^5}
\begin{cases}
-\frac{96}{5}\left(5 u - 4\right), & u \le \frac{1}{2}, \\
\frac{48}{u} - \frac{1}{5 u^5} - \frac{384}{5} + 32 u, & \frac{1}{2} < u < 1, \\
\frac{3}{u^5}, & u > 1.
\end{cases}    (79)

In Figures 13 and 14, we show the spline-softened and Plummer-softened force and potential of a point mass. For a given spline softening length h, we define the 'equivalent' Plummer softening length as ε = h/2.8. For this choice, the minimum of the potential at u = 0 has the same depth.
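The equivalence can be checked directly from equation (71): the spline potential at the centre is Φ(0) = -(14/5) Gm/h, while a Plummer sphere gives Φ(0) = -Gm/ε, and the two depths agree exactly for ε = 5h/14 = h/2.8. A small numerical sketch (illustrative names only):

```python
def w2(u):
    """Potential kernel W2(u) of equation (71)."""
    if u < 0.5:
        return 16.0/3.0*u**2 - 48.0/5.0*u**4 + 32.0/5.0*u**5 - 14.0/5.0
    if u < 1.0:
        return (1.0/(15.0*u) + 32.0/3.0*u**2 - 16.0*u**3
                + 48.0/5.0*u**4 - 32.0/15.0*u**5 - 16.0/5.0)
    return -1.0/u

G, m, h = 1.0, 1.0, 1.0
eps = 5.0 * h / 14.0                 # the 'equivalent' Plummer softening h/2.8
phi_spline_0 = G * m / h * w2(0.0)   # central depth of the spline potential
phi_plummer_0 = -G * m / eps         # central depth of the Plummer potential
```

The same function also verifies that W2 joins continuously at the break points u = 1/2 and u = 1, and reduces to the exact Newtonian -1/u outside the softening length.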

References

Aarseth S. J., 1963, MNRAS, 126, 223
Aarseth S. J., Turner E. L., Gott J. R., 1979, ApJ, 228, 664
Appel A. W., 1981, Undergraduate Thesis, Princeton University
Appel A. W., 1985, SIAM J. Sci. Stat. Comput., 6, 85
Athanassoula E., Bosma A., Lambert J. C., Makino J., 1998, MNRAS, 293, 369
Athanassoula E., Fady E., Lambert J. C., Bosma A., 2000, MNRAS, 314, 475
Bagla J. S., 1999, preprint, astro-ph/9911025
Balsara D. W., 1995, J. Comp. Phys., 121, 357
Barnes J., 1986, in The Use of Supercomputers in Stellar Dynamics, edited by P. Hut, S. L. J. McMillan, 175, Springer, New York
Barnes J., Hut P., 1986, Nature, 324, 446
Barnes J. E., 1990, J. Comp. Phys., 87, 161
Barnes J. E., Hut P., 1989, ApJS, 70, 389
Bertschinger E., Gelb J. M., 1991, Computers in Physics, 5, 164
Bode P., Ostriker J. P., Xu G., 2000, ApJS, 128, 561
Brieu P. P., Summers F. J., Ostriker J. P., 1995, ApJ, 453, 566
Carraro G., Lia C., Chiosi C., 1998, MNRAS, 297, 1021
Couchman H. M. P., 1991, ApJ, 368, 23
Couchman H. M. P., Thomas P., Pearce F., 1995, ApJ, 452, 797
Dave R., Dubinski J., Hernquist L., 1997, New Astronomy, 2, 277
Dikaiakos M. D., Stadel J., 1996, ICS conference proceedings
Dubinski J., 1996, New Astronomy, 1, 133
Dubinski J., Mihos J. C., Hernquist L., 1996, ApJ, 462, 576
Eastwood J. W., Hockney R. W., 1974, J. Comp. Phys., 16, 342
Ebisuzaki T., Makino J., Fukushige T., et al., 1993, PASJ, 45, 269
Efstathiou G., Davis M., Frenk C. S., White S. D. M., 1985, ApJS, 57, 241
Evrard A. E., 1988, MNRAS, 235, 911
Frenk C. S., White S. D. M., Bode P., et al., 1999, ApJ, 525, 554
Fukushige T., Ito T., Makino J., Ebisuzaki T., Sugimoto D., Umemura M., 1991, PASJ, 43, 841
Fukushige T., Taiji M., Makino J., Ebisuzaki T., Sugimoto D., 1996, ApJ, 468, 51
Gingold R. A., Monaghan J. J., 1977, MNRAS, 181, 375
Greengard L., Rokhlin V., 1987, J. Comp. Phys., 73, 325
Groom W., 1997, Ph.D. thesis, University of Cambridge
Hernquist L., Bouchet F. R., Suto Y., 1991, ApJS, 75, 231
Hernquist L., Katz N., 1989, ApJS, 70, 419
Hiotelis N., Voglis N., 1991, A&A, 243, 333
Hockney R. W., Eastwood J. W., 1981, Computer Simulation Using Particles, McGraw-Hill, New York
Hohl F., 1978, AJ, 83, 768
Holmberg E., 1941, ApJ, 94, 385
Hultman J., Kallander D., 1997, A&A, 324, 534
Hut P., Makino J., 1999, Science, 283, 501
Hut P., Makino J., McMillan S., 1995, ApJ, 443, 93
Ito T., Ebisuzaki T., Makino J., Sugimoto D., 1991, PASJ, 43, 547
Jernigan J. G., Porter D. H., 1989, ApJS, 71, 871
Kang H., Ostriker J. P., Cen R., et al., 1994, ApJ, 430, 83
Katz N., Gunn J. E., 1991, ApJ, 377, 365
Katz N., Weinberg D. H., Hernquist L., 1996, ApJS, 105, 19
Kawai A., Fukushige T., Makino J., Taiji M., 2000, PASJ, 52, 659
Klein R. I., Fisher R. T., McKee C. F., Truelove J. K., 1998, in Proceedings of the International Conference on Numerical Astrophysics 1998 (NAP98), edited by S. M. Miyama, K. Tomisaka, T. Hanawa, p. 131, Kluwer Academic, Boston, Mass.
Kravtsov A. V., Klypin A. A., Khokhlov A. M. J., 1997, ApJS, 111, 73
Lia C., Carraro G., 2000, MNRAS, 314, 145
Lombardi J. C., Sills A., Rasio F. A., Shapiro S. L., 1999, J. Comp. Phys., 152, 687
Lucy L. B., 1977, AJ, 82, 1013
MacFarland T., Couchman H. M. P., Pearce F. R., Pichlmeier J., 1998, New Astronomy, 3, 687
Makino J., 1990, PASJ, 42, 717
Makino J., 1991a, PASJ, 43, 621
Makino J., 1991b, ApJ, 369, 200
Makino J., 2000, in Dynamics of Star Clusters and the Milky Way, edited by S. Deiters, B. Fuchs, A. Just, R. Spurzem, R. Wielen, astro-ph/0007084
Makino J., Aarseth S. J., 1992, PASJ, 44, 141
Makino J., Funato Y., 1993, PASJ, 45, 279
Makino J., Hut P., 1989, Computer Physics Reports, 9, 199
Makino J., Taiji M., Ebisuzaki T., Sugimoto D., 1997, ApJ, 480, 432
McMillan S. L. W., Aarseth S. J., 1993, ApJ, 414, 200
Monaghan J. J., 1992, Ann. Rev. Astron. Astrophys., 30, 543
Monaghan J. J., Gingold R. A., 1983, J. Comp. Phys., 52, 374
Monaghan J. J., Lattanzio J. C., 1985, A&A, 149, 135
Navarro J. F., Frenk C. S., White S. D. M., 1996, ApJ, 462, 563
Navarro J. F., Frenk C. S., White S. D. M., 1997, ApJ, 490, 493
Navarro J. F., White S. D. M., 1993, MNRAS, 265, 271
Nelson R. P., Papaloizou J. C. B., 1994, MNRAS, 270, 1
Norman M. L., Bryan G. L., 1998, in Proceedings of the International Conference on Numerical Astrophysics 1998 (NAP98), edited by S. M. Miyama, K. Tomisaka, T. Hanawa, p. 19, Kluwer Academic, Boston, Mass.
Okumura S. K., Makino J., Ebisuzaki T., et al., 1993, PASJ, 45, 329
Pacheco P. S., 1997, Parallel Programming with MPI, Morgan Kaufmann Publishers, San Francisco
Pearce F. R., Couchman H. M. P., 1997, New Astronomy, 2, 411
Peebles P. J. E., 1970, AJ, 75, 13
Press W. H., 1986, in The Use of Supercomputers in Stellar Dynamics, edited by P. Hut, S. McMillan, 184, Springer-Verlag, Berlin Heidelberg New York
Press W. H., Schechter P., 1974, ApJ, 187, 425
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1995, Numerical Recipes in C, Cambridge University Press, Cambridge
Quinn T., Katz N., Stadel J., Lake G., 1997, preprint, astro-ph/9710043
Romeo A. B., 1998, A&A, 335, 922
Salmon J. K., Warren M. S., 1994, J. Comp. Phys., 111, 136
Snir M., Otto S. W., Huss-Lederman S., Walker D. W., Dongarra J., 1995, MPI: The Complete Reference, MIT Press, Cambridge
Splinter R. J., Melott A. L., Shandarin S. F., Suto Y., 1998, ApJ, 497, 38
Springel V., 1999, Ph.D. thesis, Ludwig-Maximilians-Universität München
Springel V., 2000, MNRAS, 312, 859
Springel V., White S. D. M., 1999, MNRAS, 307, 162
Springel V., White S. D. M., Tormen G., Kauffmann G., 2000, preprint, astro-ph/0012055
Steinmetz M., 1996, MNRAS, 278, 1005
Steinmetz M., Müller E., 1993, A&A, 268, 391
Thacker R. J., Tittley E. R., Pearce F. R., Couchman H. M. P., Thomas P. A., 2000, MNRAS, 319, 619
Theuns T., Rathsack M. E., 1993, Comp. Phys. Comm., 76, 141
Viturro H. R., Carpintero D. D., 2000, A&AS, 142, 157
Warren M. S., Quinn P. J., Salmon J. K., Zurek W. H., 1992, ApJ, 399, 405
White S. D. M., 1976, MNRAS, 177, 717
White S. D. M., 1978, MNRAS, 184, 185
Wirth N., 1986, Algorithms and Data Structures, Prentice-Hall, London
Xu G., 1995, ApJS, 98, 355
Yahagi H., Mori M., Yoshii Y., 1999, ApJS, 124, 1
Yoshida N., Springel V., White S. D. M., Tormen G., 2000a, ApJ, 535, L103
Yoshida N., Springel V., White S. D. M., Tormen G., 2000b, ApJ, 544, L87
