1 mpj express bryan carpenter omii, university of southampton southampton so17 1bj, uk february 7...

48
1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 [email protected] and Lessons from Java Gadget

Upload: egbert-lawson

Post on 15-Jan-2016

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

11

MPJ Express

Bryan Carpenter

OMII, University of SouthamptonSouthampton SO17 1BJ, UK

February 7 2007

[email protected]

and Lessons from Java Gadget

Page 2: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

22

Contents MPJ Express overview The Gadget 2 code Making a Java version of Gadget

Page 3: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

33

Acknowledgements

Work done in collaboration with Mark Baker and Aamir Shafir.

Most work done by latter!

Page 4: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

44

MPJ Express

Page 5: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

55

Java for High Performance Computing?

Various people have argued that Java provides a good programming platform for large scale computation (e.g. Java Grande, 1997 - ~2003).• Parallel computing has a long history of novel language

concepts to support parallelism, multithreading etc.

• Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language.

Java encourages better software engineering and is more portable than, say, Fortran or C.• There is a large body of supporting software for Java.

Page 6: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

66

Performance Modern JVMs perform compilation from byte code to

native machine code, as the Java program is executed.• Typical JVMs implement many of the optimizations used by

compilers for “traditional” programming languages.• Adaptive compilation may even allow optimizations that are

impractical for static compilers, because run-time information is available?

• Quality of compilation comparable to many C or Fortran compilers.

But Java has safety features that may limit performance.• Difficulty of managing memory explicitly, exact exception

processing, …

Page 7: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

77

MPI Background MPI was introduced in 1994 as a standard message

passing API for parallel scientific computing:• Language bindings for C, C++, and Fortran.

Original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors.• Only interaction is by message exchange ― no sharing of

memory between processors (c.f. CSP). Introduced idea of communicator as basic

communication abstraction (c.f. channels). A communicator is shared by a group of processors.• Supports library development and collective operations.

Page 8: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

88

Communicators

C.Send(…, 2)

Proc 0Proc 1

C.Recv(…, 0)

Proc 2

Communicator C

Page 9: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

99

Collective Communications

C.Bcast(…, 0)

Proc 0Proc 1

C.Bcast(…, 0)

Proc 2

Communicator C

C.Bcast(…, 0)

Page 10: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1010

MPJ Background Java Grande Message Passing Workgroup defined Java

bindings for MPI in 1998. Previous efforts follow two approaches: • Pure Java approaches:

Remote Method Invocation (RMI), Java Sockets,

• Java Native Interface (JNI) approaches: mpiJava – Java wrapper to platform native MPI.

Page 11: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1111

Minimal mpiJava Programimport mpi.*

class Hello { static public void main(String[] args) {

MPI.Init(args) ;

int myrank = MPI.COMM_WORLD.Rank() ;

if(myrank == 0) { char[] message = “Hello, there”.toCharArray() ; MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99) ; } else { char[] message = new char [20] ; MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99) ;

System.out.println(“received:” + new String(message) + “:”) ; }

MPI.Finalize() ; }}

Page 12: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1212

MPJ Express Goals Modular design: unify earlier approaches by

pluggable communication devices.

Support high-level functionality of MPI-1.2 in pure Java code, as far as possible:• Point to point communications,• Collective communications,• Groups, communicators, and contexts, • Derived datatypes – buffering API.

Thread safety ― support Java threads for parallelism within multicore nodes.

Open source distribution: http://mpj-express.org

Page 13: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1313

MPJ Design

MPJ point to point communications (Base level)

mpjdev (MPJ Device level)

MPJ collective Communications (High level)

Hardware (NIC, Memory etc)

MPJ API

JNI Java NIO

Java Virtual Machine (JVM)

JNI

Native MPI

gmdev

ThreadsAPI

smpdev

xdev

niodev

Page 14: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1414

MPJ Demos There were a few existing demos for MPJ, inherited

from the earlier mpiJava project, e.g.• A 2d Fluid Flow example.

• Some Monte Carlo simulations of 2d spin systems from condensed matter physics.

These have nice GUIs, but aren’t very representative of real codes used by computational scientists.

Page 15: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1515

MPJ demo: CFD – inviscid flow

Page 16: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1616

MPJ demo: Q-state Potts model

Page 17: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1717

Gadget

Page 18: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1818

Gadget-2 Gadget-2 is a free-software, production code for

cosmological N-body (and hydrodynamic) computations.

http://www.mpa-garching.mpg.de/gadget• Written by Volker Springel, of the Max Plank Institute for

Astrophysics, Garching. It is written in the C language – already parallelized

using MPI. Versions have been used in various research papers in

astrophysics literature, including the Millennium Simulation.

Page 19: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

1919

Millennium Simulation See the paper Simulating the Joint Evolution of Quasars,

Galaxies and their Large-Scale Distribution by Springel et al on Gadget home page.

Follows evolution of 1010 dark matter “particles” from early Universe (z = 127) to current day.

Performed on 512 nodes of IBM p690.• Used 1TB of distributed memory.

• 350,000 CPU hours – 28 days elapsed time.

• Floating point performance around 0.2 TFLOPS. Around 20Tb total data generated.

Page 20: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2020

Dynamics in Gadget Gadget is “just” simulating the movement of (a lot

of) representative particles under the influence of Newton’s law of gravity.• Plus some hydrodynamic forces, but these don’t affect

dominant Dark Matter. Co-moving coordinates take account of expansion of

Universe according to General Relativity, but otherwise basically classical mechanics.

Classical N-body problem.

Page 21: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2121

Gadget Main Loop This is a slightly simplified view of the Gadget code:

… Initialize … while (not done) { move_particles() ; // update positions

domain_Decomposition() ; compute_accelerations() ; advance_and_find_timesteps() ; // update velocities

}

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).

Page 22: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2222

Computing Forces The function compute_accelerations() must

compute forces experienced by all particles.• In particular must compute gravitational force.

Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N2).• with N ≈ 1010, this is quite infeasible.

Need some kind of approximation.• Intuitive approach: when computing force on particle i,

group together particles in selected regions distant from i, and treat groups as single particles, located at their centres of mass.

Page 23: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2323

Barnes Hut Tree First divide cubical region of 3d space into 23 = 8

regions, halving each dimension. For every sub-region that contains any particles, divide

again recursively to make an oct-tree, until “leaf” regions have at most one particle.

Page 24: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2424

Two Dimensional Example

Picture borrowed from

www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html

Page 25: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2525

Barnes Hut Force Computation To compute the force on a particle i, traverse tree

starting from root:• if a node n is “distant from” i, just add contribution to force

on i from centre of mass of n – no need to visit children of n;

• if node n is “close to” i, visit children of n and recurse. Hinges on definition of distant from/close to.• Basic idea is that a node representing some region of space is

distant from a particle i if the angle it subtends is smaller than a threshold opening angle:

i

n

θ

Page 26: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2626

Complexity

On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm.

A huge win when N ≈ 1010.

Page 27: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2727

Domain Decomposition Need to divide space and/or particle set into domains,

where each domain is handled by a single processor. Problem is that we can’t just divide space evenly,

because some regions will have many more particle than others – poor load balancing.

Conversely, can’t just divide particles evenly, because particles move throughout space, and want to maintain physically close particles on the same processor, as far as practical – communication problem.

Page 28: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2828

Peano-Hilbert Curve Warren and Salmon originally suggested using a space-

filling curve:

Picture borrowed from

http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

Page 29: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

2929

Peano-Hilbert Key Gadget applies the recursion 20 times, logically

dividing space into up to 220 × 220 × 220 cells on the Peano-Hilbert curve.

Then can label each cell by its location along the Peano-Hilbert curve – 260 possible locations comfortably fit into a 64-bit word.

Page 30: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3030

Decomposition based on P-H Curve Because number of cells > > number of particles,

segments of linear Peano-Hilbert curve sparsely populated.

Sort particles by their Peano-Hilbert key, then divide evenly into P domains.• Intuitively – stretch out the P-H curve with particles dotted

along it; segment it into P parts where each part has the same number of particles.

Characteristics of this decomposition:• Good load balancing.• Domains simply connected and quite “compact” in real space,

because particles that are close along P-H curve are close in real space (converse often but not always true).

• Domains have relatively simple mapping to BH octtree nodes.

Page 31: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3131

Distribution of BH Tree in Gadget

Ibid.

Page 32: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3232

Distributed Representation of Tree Every processor holds a copy of the root nodes, and a

copy of all child nodes down to the point where all particles in of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles.

To compute the force on a single local target particle, traverse tree from root as usual, and accumulate force contributions from locally held particles.

Build an export list containing target particle and hosts of pseudo-particles encountered in walk.

Page 33: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3333

Communication After computation for all locally held target particles,

process export list and send list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.

All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.

These contributions are returned to the original processor, and added into the accelerations for the target particles.

Page 34: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3434

Other Communications Notably:• The Domain Decomposition itself requires (in principle) a

distributed sort of the particle list.

• Gadget only approximates this sort, but still fairly intricate communications.

In Gadget all communication is using the standard MPI library. It makes fairly extensive use of collective communication routines.

Page 35: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3535

Java Gadget

Page 36: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3636

Timeline MPJ Express released 2005. Realized we needed a realistic exemplar code: both as a

demo, and to drive further improvements of our software.

First thoughts that an N-body simulation code would make sense – Summer 2005.

Learned about Gadget in December that year. Effort to make a Java version of Gadget started

seriously early February 2006 – expecting many months work.

Non-trivial simulations running in March.

Page 37: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3737

Gadget Code Statistics Public distribution is around 17,000 lines of C,

distributed over about 30 source files. Dependencies on:• MPI library for parallelization.

• GNU scientific library (but only a handful of functions)

• FFTW – library for parallel Fourier transforms. Comes with a set of initial conditions files that we can

use to test our code, including “colliding galaxies” and “cluster formation”.

Page 38: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3838

Translation of Gadget Code Manually translated, but in first cut, deliberately keep

the Java code as similar to C as possible.• e.g. nearly all of Java Gadget is currently implemented as

static methods on one enormous class called Main.• Small supporting classes corresponding to structs of C code.

Currently not supporting a number of features, including:• PMTree algorithm (an optimization using Fourier Transforms

on a mesh to compute long-range part of gravitational force),• Multiple Initial Conditions files,• COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

Page 39: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

3939

Handling Dependencies Replace MPI calls with MPJ calls (!) Use of FFTW avoided by disabling the PMTree

algorithm (but may slow down some simulations a lot). Few GNU Scientific Library functions that were needed

(numeric quadrature, etc) were hand translated to Java.

Page 40: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4040

Test Cases Restrictions aside, we successfully ran Colliding

Galaxies and Cluster Formation example simulations varying numbers of processors.• These use pure Dark Matter – hydrodynamics code not yet

tested. Results are indistinguishable from running the original

Gadget.

Page 41: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4141

Colliding Galaxies (Java Gadget)

Page 42: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4242

Cluster Formation (Java Gadget)

Page 43: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4343

Optimizing Java Gadget Initial Java version of Gadget was about 3 times slower

than C. Optimizations:• Replace communication of Object arrays (corresponding to C

structs) with communication of component primitive elements (floats, ints, etc).

• Still wasn’t enough – final version actually expunges large Object arrays from program – flattens all struct arrays into arrays of primitive types (see later comments).

• Extended the MPJ Express API to manipulate Java ByteBuffers directly.

Page 44: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4444

Flattening Object Arrays To preserve memory locality, we decided to “flatten”

large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing object component fields, for example.

This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed.

Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).

Page 45: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4545

Benchmarks Graphs taken from Aamir Shafi’s thesis – you will find

many more results there. Some benchmarks are on up to 8 processors of Starbug

cluster in Portsmouth Distributed Systems Group. Other benchmarks are on up to 32 processors from

NW-GRID cluster at Daresbury Laboratory.

Page 46: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4646

Starbug: Colliding Galaxies with 60,000 Particles

Page 47: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4747

NW-Grid: Structure Formation with two million Particles

Page 48: 1 MPJ Express Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk and Lessons from Java Gadget

4848

Conclusions Possible to do large scale numerical simulations in Java Speed is OK. Memory consumption still an issue (Java Gadget needs

perhaps three times more memory than C Gadget). Made various improvements to MPJ Express

implementation and API along the way. As things stand, the Java Gadget code is not very

maintainable – C-like, and then optimized some. Needs more work to learn how to write maintainable

code in Java that is also fast – e.g. how to preserve memory locality in arrays of objects.