

Parallel DTFE Surface Density Field Reconstruction

Esteban Rangel∗†, Nan Li†, Salman Habib†, Tom Peterka†, Ankit Agrawal∗, Wei-keng Liao∗, Alok Choudhary∗

∗Electrical Engineering and Computer Science Department, Northwestern University, Evanston, IL USA

†Argonne National Laboratory, Lemont, IL USA

Abstract—We improve the interpolation accuracy and efficiency of the Delaunay tessellation field estimator (DTFE) for surface density field reconstruction by proposing an algorithm that takes advantage of the adaptive triangular mesh for line-of-sight integration. The costly computation of an intermediate 3D grid is completely avoided by our method and only optimally chosen interpolation points are computed; thus, the overall computational cost is often significantly reduced.

The algorithm is implemented as a parallel shared-memory kernel for large-scale grid rendered field reconstructions in our distributed-memory framework designed for N-body gravitational lensing simulations in large volumes. We also introduce a load balancing scheme to optimize the efficiency of processing a large number of field reconstructions. Our results show our kernel outperforms existing software packages for volume weighted density field reconstruction, achieving ∼10x speedup, and our load balancing algorithm gains an additional ∼3.6x speedup at scales of ∼16k processes, with ∼1.5 relative standard deviation in wall time.

Keywords-parallel surface density; Delaunay tessellation field estimator;

I. INTRODUCTION

Reconstructing a continuous field from a discrete point set is a fundamental operation in scientific data analysis. Given a set of irregularly distributed points in 3D, the task is to create a regular grid from some property defined on the points. The property can be any locally describing quantity [13], e.g., mass density, with the usual assumption being that the points are acting as tracers of an underlying continuous field. The grid representation of irregularly distributed points is often preferred for some tasks, such as visualization or applying certain mathematical operations, e.g., the Fourier transform. For many applications in astrophysics and cosmology [7], only the 2D projected field is required; however, practically all methods for creating a 2D grid compute a costly 3D grid as an intermediate representation [20].

Our work is motivated by a gravitational lensing [21] simulation where accurate surface density¹ estimation is a critical step [6], but the techniques developed generalize to

¹Surface density is volumetric density integrated along a path, i.e., it is the mass per unit area.

Figure 1: An example of a typical surface density field computed during a strong lensing study from an N-body particle simulation. The DTFE method was used to generate this 2048x2048 grid representing ∼1.5 million particles within a sub-volume of (4 Mpc h⁻¹)³. This is the largest object in the final snapshot of a 1 billion particle N-body simulation with a volume of (256 Mpc h⁻¹)³.

any application requiring high performance grid interpolation with low noise properties. For many types of analysis that involve gathering meaningful statistics, it is typical for multiple grids – representing sub-volumes – to be needed from a much larger domain volume. The configuration of these sub-volumes may represent regions of interest, such as high particle concentrations, or an arrangement in some particular geometry, such as co-location along a line of sight [22]; depending on the application, sub-volumes may be completely disjoint or have significant overlap. The number of particles within each sub-volume may also have a significant amount of variation. Together, the common non-uniform spatial distribution of interesting sub-volumes and their varying number of particles often cause a significant load imbalance on the computing resources needed for a typical data-parallel approach to large volume N-body simulations. While there are a few freely available parallel


density estimation software packages [2], [20], they tend to focus on generating a single continuous grid from the input.

We propose a hybrid shared/distributed-memory algorithm that uses the DTFE method [1] for creating grid representations of specified sub-volumes with low noise properties. Our distributed memory parallelization strategy uses a uniform volume decomposition of the data with ghost zones that replicate particles in adjacent volumes. The ghost zones are large enough such that any region of interest can be computed independently by the process owning the volume where the region is located, using our shared memory kernel. The density kernel was algorithmically optimized for computing surface density by identifying the mathematically optimal points for interpolation using the DTFE method. In cases where there is an expected computational load imbalance across processes, we propose an algorithm for determining an a priori communication pattern for work sharing that is based on modeling the total computation.

Our load balancing results show experiments on two large science quality N-body simulation datasets. The first is a high mass resolution simulation with 1024³ particles and different distributions for the (1024 × 1024) grids that need to be computed. We show that with a high mass resolution simulation where work imbalance becomes problematic, we can regain much of the performance loss with our load balancing algorithm, showing a speedup of ∼2.8 with 240 MPI processes (6 OpenMP threads/process) on an InfiniBand analysis/visualization cluster. The second, larger volume simulation with less mass resolution has 3200³ particles. We demonstrate scaling up to 16,384 MPI processes (4 OpenMP threads/process) on an IBM BG/Q supercomputer and show a load balancing speedup of ∼3.6. We also evaluated our method against existing parallel software without enabling our load balancing functionality. As mentioned, since these software packages construct a single large grid, the datasets are restricted to a much smaller size on the order of 10⁶ particles. We ran our implementation in single-process-multiple-thread mode and show that our grid rendering kernel is more efficient for computing surface density by ∼10x when compared to available shared memory parallel software. To compare against distributed-memory parallel software, we ran in multiple-process-single-thread mode and decomposed the single large grid into multiple sub-grids, showing ∼8x improvement in execution time.

The rest of the paper is organized as follows. In Section II we describe existing density estimation software and the parallelization strategy used. In Section III we provide essential background on the DTFE method and point location, performing a brief analysis of point location strategies. We then discuss surface density and the typical approach for grid rendering continuous fields using DTFE. In Section IV, we describe our kernel algorithm for grid rendering projected fields using DTFE and our load balancing method. Section V describes the experiments and performance improvements we achieved. Finally, in Section VI we discuss future work and provide concluding remarks.

II. RELATED WORK

The DTFE public software [2] is a freely available C++ code for rendering grid fields from point sets. While the software does not directly compute the surface density field, it can be used to produce 3D fields, which can be converted to surface densities as described in Section III. The code was designed to implement the DTFE method and uses the CGAL [12] computational geometry library for constructing the Delaunay triangulation. Grid rendering is done using the walking algorithm described in Section III. The DTFE public software uses OpenMP to parallelize computation in shared memory architectures, implementing an equal volume decomposition with ghost particle regions. Computation on the sub-volumes is performed by individual threads and the resulting density fields are recombined to produce the final result. No attempt is made to balance workloads because the intention of the software is for large volume analyses in cosmology, where the distribution of particles becomes mostly homogeneous at the scales of interest. Significant thread imbalance is noticed when small volumes are parallelized, as is needed by high mass resolution N-body simulations, since parallelization is limited to the memory of a single computing node.

The TESS Density Estimator [20] is a freely available C++ MPI software for computing surface density. Like the DTFE public software, the TESS Density Estimator runs in two stages: constructing a Voronoi tessellation with the Qhull library [5], followed by estimating density at grid points covered by Voronoi cells. In the two codes, both stages are executed in parallel; however, the primary difference is that TESS constructs a global tessellation [9] rather than overlapping triangulations for interpolating the density grid. The advantage of this method is that ghost zones do not need to be specified by a user, but additional interprocess communication is required. Another fundamental difference is the interpolation method; TESS uses a zero-order interpolation rather than a first-order linear interpolation. The software allows the data partitioning to be specified, and we will use this feature to make a comparison with our implementation.

A. Complexity Analysis

We can describe the computational complexity of both software packages together because the two stages they both implement are so similar. The CGAL library uses a flip algorithm for constructing a Delaunay triangulation with overall complexity of O(N²), and Qhull computes the Voronoi tessellation by computing the d + 1 dimension convex hull, also with overall complexity of O(N²). Both software packages either incrementally or fully construct the 3D density grid and require O(N_g³) operations. Thus, the total computational complexity for both methods is O(N² + N_g³).

III. BACKGROUND

A. DTFE Method

DTFE is a first order multi-dimensional interpolation method that relies on a Delaunay triangulation of the input to obtain the local gradients of a function. First proposed by Bernardeau and van de Weygaert [13] for producing volume-weighted velocity fields, the gradient within a tetrahedron is assumed constant, and discontinuous at its boundaries. The field value for a d-dimensional point x is interpolated using the Delaunay vertices x_0, x_1, ..., x_d of the containing tetrahedron by

f(x) = f(x_0) + ∇f|_Del · (x − x_0)    (1)

where ∇f|_Del is the estimated constant field gradient using the known field values f(x_0), f(x_1), ..., f(x_d). For density field reconstruction, the on-site density values are estimated by the inverse volume of the contiguous Voronoi cell, thereby ensuring mass conservation. The estimated density for each input, x_i, is given by

ρ(x_i) = (d + 1) m / ∑_{j=1}^{N_T,i} V(T_j,i)    (2)

where N_T,i denotes the number of tetrahedra T_j,i having x_i as a vertex, m is the mass, and (d + 1) is the tetrahedral volume normalization factor.
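The estimator in equation 2, together with the linear interpolation of equation 1, is straightforward to express in code. The sketch below is our illustration, not the authors' implementation: it assumes a Delaunay triangulation has already been computed and that the volumes of each vertex's incident tetrahedra are available; it then evaluates equation 2 for a vertex and equation 1 inside one tetrahedron by solving a 3x3 system for the constant gradient. The type and helper names (Vec3, tetVolume, etc.) are hypothetical.

```cpp
// Sketch only: assumes a precomputed Delaunay triangulation.
#include <array>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Volume of a tetrahedron with vertices v0..v3.
double tetVolume(const Vec3& v0, const Vec3& v1, const Vec3& v2, const Vec3& v3) {
  return std::fabs(dot(sub(v1, v0), cross(sub(v2, v0), sub(v3, v0)))) / 6.0;
}

// Equation 2 with d = 3: rho_i = (d+1) m / (sum of incident tetrahedron volumes).
double vertexDensity(double particleMass, const std::vector<double>& incidentTetVolumes) {
  double sum = 0.0;
  for (double v : incidentTetVolumes) sum += v;
  return 4.0 * particleMass / sum;
}

// Equation 1: evaluate f at point p inside the tetrahedron (v0..v3, f0..f3) by
// solving (v_k - v_0) . grad = f_k - f_0 for k = 1..3. With A having rows
// e1, e2, e3, the columns of A^{-1} (times det) are e2 x e3, e3 x e1, e1 x e2.
double interpolateInTet(const std::array<Vec3, 4>& v, const std::array<double, 4>& f, const Vec3& p) {
  Vec3 e1 = sub(v[1], v[0]), e2 = sub(v[2], v[0]), e3 = sub(v[3], v[0]);
  double det = dot(e1, cross(e2, e3));   // 6 x signed volume, nonzero for a valid tet
  Vec3 b = {f[1] - f[0], f[2] - f[0], f[3] - f[0]};
  Vec3 c1 = cross(e2, e3), c2 = cross(e3, e1), c3 = cross(e1, e2);
  Vec3 grad = {(b.x * c1.x + b.y * c2.x + b.z * c3.x) / det,
               (b.x * c1.y + b.y * c2.y + b.z * c3.y) / det,
               (b.x * c1.z + b.y * c2.z + b.z * c3.z) / det};
  return f[0] + dot(grad, sub(p, v[0]));
}
```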

B. Surface Mass Density

The projected 3D density, known as the surface mass density, is used by the thin lens approximation in gravitational lensing. The equation for the surface density is given as

Σ(ξ) = ∫ ρ(ξ, z) dz    (3)

C. Grid Rendering Continuous Fields

It follows that a regular uniform 2D grid can be computed from a regular uniform 3D grid in a straightforward manner. The density value for a grid cell having dimensions (∆x, ∆y) with representative point ξ in the 2D grid rendered density field is approximated by

Σ_i,j(ξ) = ∑_{k=1}^{N_z} ρ_i,j,k(ξ, z_k) ∆z    (4)

where N_z is the number of 3D grid cells in the z dimension, and (ξ, z_k) is the representative point for the 3D cell with dimensions (∆x, ∆y, ∆z). Using this method, it is necessary to determine the containing tetrahedra for all the representative points of the 3D grid to interpolate their field values. This is usually done incrementally using the walking method (equation 6) by locating representative points for adjacent grid cells to minimize the walk length. The computational complexity is therefore O(N_cell), where N_cell is the number of 3D grid cells. Note that when ∆x, ∆y, ∆z are fixed, as in the case for a uniform grid, there is a tendency to under-sample interpolation points in regions where the particle spacing is smaller than the grid spacing. In practice, care must be taken to avoid this issue, a common approach being to estimate the density of the 3D cell volume using a Monte Carlo method:

Σ_i,j(ξ) = ∑_{k=1}^{N_z} ⟨ ρ_i,j,k(ξ, z_k) ⟩ ∆z    (5)

While this helps with under-sampling, it does not alleviate over-sampling in low density regions.
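As a concrete illustration of equation 4 (not code from either software package), the conventional approach amounts to collapsing an intermediate 3D density grid along the integration axis; the 3D grid itself must first be filled by interpolating at every 3D cell, which is exactly the cost our kernel avoids.

```cpp
#include <vector>

// Sketch of equation 4: collapse an Nx x Ny x Nz density grid (row-major,
// index = (i*Ny + j)*Nz + k) into an Nx x Ny surface density grid.
std::vector<double> collapseAlongZ(const std::vector<double>& rho3d,
                                   int Nx, int Ny, int Nz, double dz) {
  std::vector<double> sigma(static_cast<size_t>(Nx) * Ny, 0.0);
  for (int i = 0; i < Nx; ++i)
    for (int j = 0; j < Ny; ++j) {
      double sum = 0.0;
      for (int k = 0; k < Nz; ++k)
        sum += rho3d[(static_cast<size_t>(i) * Ny + j) * Nz + k];
      sigma[static_cast<size_t>(i) * Ny + j] = sum * dz;
    }
  return sigma;
}
```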

1) Point Location by Walking: In order to interpolate the field value for a query point, q, the tetrahedron containing q must be identified. The location of an arbitrary point inside a triangulation is a classic problem in computational geometry. It has been very well studied, with several algorithms providing optimal solutions in the literature [10]. In practice, however, algorithms that 'walk through' the triangulation have been widely adopted for their simplicity and lack of need for preprocessing and additional storage. Walking algorithms only require an additional adjacency matrix for storing neighbor simplices, i.e., simplices that share a common facet. The general approach begins with choosing an initial simplex in the triangulation and moving towards q by moving to a neighbor simplex in the direction of q until the simplex containing q is identified. A robust technique for walking in three dimensions, Sambridge et al. [14], tests if q is on the interior side of a tetrahedral face by the following test

[(x_i − x_j) · [(x_k − x_j) × (x_l − x_j)]] · [(q − x_j) · [(x_k − x_j) × (x_l − x_j)]] ≥ 0,    (6)

moving to the neighbor tetrahedron of a face that fails, and locating the tetrahedron if all faces pass. This method is guaranteed to converge for a Delaunay triangulation, but may go into an infinite loop with an arbitrary triangulation. This is because walking moves in the general direction of q, which is not always a straight line. The performance of walking can be greatly improved by choosing an initial tetrahedron that is close to q, usually done by randomly sampling tetrahedron vertices and selecting the tetrahedron with the vertex that is nearest. For a regular grid, the tetrahedron containing the query point of an adjacent grid cell is clearly a good initial choice; however, the orientation tests from equation 6 must still be computed.
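A minimal sketch of the face test in equation 6 (our illustration, reusing the hypothetical Vec3 helpers from the earlier sketch): q passes the test for a face when it lies on the same side of the face plane as the tetrahedron vertex opposite that face.

```cpp
// Equation 6: q passes the test for face (x_j, x_k, x_l) of a tetrahedron if
// q and the opposite vertex x_i lie on the same side of the face plane.
// Reuses the Vec3/sub/dot/cross helpers defined in the earlier sketch.
bool sameSideAsOppositeVertex(const Vec3& xi, const Vec3& xj,
                              const Vec3& xk, const Vec3& xl, const Vec3& q) {
  Vec3 n = cross(sub(xk, xj), sub(xl, xj));   // face normal
  return dot(sub(xi, xj), n) * dot(sub(q, xj), n) >= 0.0;
}

// Walking step: if any face fails, move to the neighbor across that face;
// if all four faces pass, the current tetrahedron contains q.
```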

2) Ray-Tetrahedron Intersection: Similar to walking, Mücke et al. [15] proposed the idea of 'marching', where a line, L, is used to traverse the triangulation by identifying simplices that intersect L. The endpoints of L are the query point, q, and some initial vertex p. Algorithms such as the Möller and Trumbore ray-triangle algorithm [16] are fast and


efficient, but usually do not perform well in practice because of floating point round-off error. An algorithm proposed by Platis et al. [17] using Plücker coordinates has better results, and we use it for the ray-tetrahedron intersection testing in our algorithm. The algorithm tests each face individually by checking the relative orientation of two lines, namely the ray and a triangle edge. For some three-dimensional ray r, determined by a point x with direction l, the Plücker coordinates representation is a six-dimensional vector of the form

π_r = {l : l × x} = {u_r : v_r}.    (7)

The relative orientation of two rays, r and s, is determined by computing the sign of the permuted inner product:

π_r ⊙ π_s = u_r · v_s + u_s · v_r    (8)

The ray orientation test is performed for each edge of a tetrahedron face, but shared edge calculations can be reused. When all the permuted inner products for a face are greater than zero, r enters the face; if all are less than zero, r leaves the face; otherwise one of the degeneracy cases² has been encountered. For any intersecting face, the barycentric coordinates of the intersection are computed by

w_i = π_r ⊙ π_{e_i},   h_i = w_i / ∑_{i=1}^{3} w_i    (9)

where π_{e_i} is the Plücker vector for an edge. The Cartesian coordinates follow as

x_intersect = h_0 x_0 + h_1 x_1 + h_2 x_2    (10)

where x_0, x_1, and x_2 are the vertices of the face.
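A sketch of the face test of equations 7-10 for a single triangular face (our illustration of the Platis-Theoharis style test, not code from the paper; it reuses the hypothetical Vec3 helpers from the first sketch). The association of each weight with the vertex opposite its edge is our assumption about the indexing intended in equation 10.

```cpp
// Sketch of the Plucker-coordinate face test (equations 7-9).
struct Plucker { Vec3 u, v; };   // {direction : direction x point}

Plucker pluckerOfRay(const Vec3& point, const Vec3& dir) {
  return {dir, cross(dir, point)};                      // equation 7
}
Plucker pluckerOfEdge(const Vec3& from, const Vec3& to) {
  return pluckerOfRay(from, sub(to, from));             // edge treated as a ray
}
double permutedInner(const Plucker& r, const Plucker& s) {
  return dot(r.u, s.v) + dot(s.u, r.v);                 // equation 8
}

// Returns true and the intersection point if ray r crosses face (v0, v1, v2).
// Degenerate cases (some w_i == 0) are left to the caller, which in the kernel
// triggers the Perturb routine of Figure 2.
bool rayCrossesFace(const Plucker& r, const Vec3& v0, const Vec3& v1, const Vec3& v2,
                    Vec3* hit) {
  double w0 = permutedInner(r, pluckerOfEdge(v0, v1));  // assumed weight of v2
  double w1 = permutedInner(r, pluckerOfEdge(v1, v2));  // assumed weight of v0
  double w2 = permutedInner(r, pluckerOfEdge(v2, v0));  // assumed weight of v1
  bool enters = (w0 > 0 && w1 > 0 && w2 > 0);
  bool leaves = (w0 < 0 && w1 < 0 && w2 < 0);
  if (!enters && !leaves) return false;                 // misses the face or degenerate
  double sum = w0 + w1 + w2;                            // equation 9
  double h2 = w0 / sum, h0 = w1 / sum, h1 = w2 / sum;
  *hit = {h0 * v0.x + h1 * v1.x + h2 * v2.x,            // equation 10
          h0 * v0.y + h1 * v1.y + h2 * v2.y,
          h0 * v0.z + h1 * v1.z + h2 * v2.z};
  return true;
}
```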

IV. DESIGN AND IMPLEMENTATION

Our approach follows the principles of data locality and replication to minimize communication and synchronization by creating particle ghost zones. Our load balancing scheme models the computational phases of a small set of test problems at runtime – drawn from the problem domain randomly – to estimate the remaining work and determine an a priori work sharing schedule. The distributed-memory algorithm consists of four phases, namely: (1) data partitioning and redistribution, (2) workload modeling, (3) work sharing scheduling, and (4) execution and communication. These phases are described in detail after presenting our surface density kernel.

A. Shared Memory DTFE Kernel

In this section, we present the algorithm we designed for computing a 2D grid rendered surface mass density field using the DTFE method. The algorithm computes the 2D grid by marching through the triangulation (ray-tetrahedron intersection) in the direction of integration and interpolating values from tetrahedra intersecting the line(s) of sight of the 3D projection, L, onto a 2D grid cell point, ξ.

²Degeneracy cases occur when the ray intersects a vertex, an edge, or is co-planar to a face.

1) Optimal Line-of-Sight Interpolation: To determine the optimal points for interpolation along a line of sight, L, we compute the points of intersection using equations 9 and 10 for each tetrahedron intersecting L. Substituting the 3D density ρ(ξ, z) with the interpolation function in equation 1, the integral in equation 3 becomes

Σ_T_i(ξ) = ∫_a^b [ ρ(x_0) + ∇ρ|_Del · ((ξ_x, ξ_y, z) − x_0) ] dz    (11)

where ξ is the 2D projection of L, T_i is a tetrahedron intersecting L, x_0 is an arbitrarily chosen vertex of T_i taken to compute the gradient, and a, b are the z components of the points of intersection of L and T_i. Solving the integral in equation 11 gives

Σ_T_i(ξ) = [ ρ(x_0) + ∇ρ|_Del · ((ξ_x, ξ_y, (a + b)/2) − x_0) ] (b − a)    (12)

showing that for any tetrahedron intersecting L, equation 3 is solved exactly when using linear interpolation by interpolating at the midpoint of the intersection interval. Equations 11 and 12 are defined on individual tetrahedra. The surface density for a 2D grid cell is then the sum over all tetrahedra intersecting L, given by

Σ_i,j(ξ) = ∑_{k=1}^{N_T} Σ_T_k(ξ)    (13)

where N_T is the number of tetrahedra intersecting L. There is potential for under-sampling in the x and y dimensions, as we described in Section III; however, Monte Carlo approximations can also be computed to obtain the mean surface density for the 2D grid cell, with the advantage over 3D grid rendering methods being one fewer degree of freedom in the error of the estimator:

Σ_i,j(ξ) = ⟨ ∑_{k=1}^{N_T} Σ_T_k(ξ) ⟩    (14)
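The closed form in equation 12 is what makes the kernel cheap: each intersected tetrahedron contributes one interpolation at the midpoint of its intersection interval, scaled by the interval length. A minimal sketch of that accumulation follows (our illustration only; the gradient would be the per-tetrahedron DTFE gradient from the first sketch, and a, b come from the ray-face intersections above).

```cpp
// Sketch of equations 12 and 13; reuses the hypothetical Vec3 helpers.
// rho0 and grad are the DTFE density and constant gradient of one tetrahedron,
// x0 is the vertex the gradient is anchored to, (xi_x, xi_y) is the 2D grid
// point, and [a, b] is the z-interval of the line-of-sight overlap with the tet.
double tetContribution(double rho0, const Vec3& grad, const Vec3& x0,
                       double xi_x, double xi_y, double a, double b) {
  Vec3 mid = {xi_x, xi_y, 0.5 * (a + b)};               // midpoint of the interval
  return (rho0 + dot(grad, sub(mid, x0))) * (b - a);    // equation 12
}

// Equation 13: the grid cell's surface density is the sum of the contributions
// of every tetrahedron the line of sight passes through while marching:
//   sigma = 0;  ... sigma += tetContribution(...);
```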

2) Algorithm: Our convention is to integrate along the z dimension to make calculations simpler; however, in principle any arbitrary direction can be chosen by a simple rotation of the triangulation. In order to initiate the march, the algorithm must determine the first tetrahedron intersecting L. This is done by constructing a 2D triangulation from the 3D Delaunay triangulation's convex hull by projecting hull facets facing the opposite direction of integration,

n_hull · ẑ < 0    (15)

where n_hull is the surface normal of a hull face, and ẑ is the z unit vector. Determining the first tetrahedron intersecting L (Fig. 3 line 5) is done by locating its projection, ξ, in the 2D triangulation, where any point location method can be used.


With high resolution grids and a large number of particles per volume, it is inevitable that degeneracy cases will be encountered. The RayTetra subroutine computes equation 8 and returns a degeneracy status, Err, along with the points of intersection, a, b, and the neighbor of the exit face, Neighbor. To resolve degeneracy cases, we perturb L with the subroutine Perturb.

Input: Line L, Tetrahedron T, ε
Output: Line L

1: ξ ← xyProjection(L)
2: v ← getRandomVertex(T)
3: δ ← xyProjection(v) − ξ
4: if ‖δ‖ > ε then
5:   δ ← δ / ‖δ‖
6:   δ ← δ · ε
7: end if
8: L ← L + δ

Figure 2: Perturb

Input: Discrete point set X
Output: 2D grid ρ

1: D ← Delaunay(X)
2: for all i such that 0 ≤ i < N_cell do
3:   ρ_i ← 0
4:   L ← GridLineOfSight(i)
5:   T ← HullTetrahedron(D, L)
6:   while T do
7:     (Err, a, b, Neighbor) ← RayTetra(T, L)
8:     if Err then
9:       Perturb(L, T, ε)
10:    else
11:      ρ_i ← ρ_i + Interpolate(T, L, a, b) · (b − a)/2
12:      T ← Neighbor
13:    end if
14:  end while
15: end for

Figure 3: Kernel

3) Complexity Analysis: Constructing a Delaunay triangulation takes O(N²) operations and contains O(N²) tetrahedra in 3D. For each of the N_g² 2D grid cells, a march through the triangulation is needed, making the worst case complexity O(N² + N_g²N²) and best case complexity O(N² + N_g²(N²)^{1/3}), assuming a cube sub-volume.

B. Data Partitioning and Redistribution

Data partitioning is done using a spatial volume decomposition where each sub-volume is of equal size and not guaranteed to have an equal number of particles. The particle imbalance is strongly dependent on the size of the sub-volumes – with more imbalance in highly clustered situations (typically occurring at later simulation times) and smaller sub-volume scales. Abiding by the principles of data locality and replication, we create particle ghost zones large enough such that any surface density tile subproblem within the active region of a process can be completed without communication with any other process. More formally, if a surface density tile, T, has physical length l_T, ghost regions will contain duplicate particles beyond the sub-volume boundaries within a distance of l_T/2.

The input particle dataset we used was generated from an N-body simulation run on a supercomputer, usually using a greater number of processes, where the data was written to several files, with offsets within each file for an individual process's data. The simulation is spatially decomposed into equal size sub-volumes; thus, on disk the data block written by a process represents its contiguous sub-volume. We use MPI-IO to perform a parallel read of the data using metadata information about the simulation sub-volumes on disk. Finally, all processes perform a neighbor-to-neighbor exchange to fill the ghost regions. The set of locations for computing surface densities is generally much smaller, which can be read by a single process and broadcast; each process discards locations outside of its local sub-volume.
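As a small illustration of the replication rule (our sketch, not the framework's code): a particle must also be copied into a neighboring sub-volume's ghost zone whenever it lies within half a tile length of that neighbor's boundary; the framework applies such a test per axis during the neighbor-to-neighbor exchange.

```cpp
// Sketch: decide whether a particle with 1D coordinate p, owned by the
// sub-volume [lo, hi), must be replicated to the low- or high-side neighbor,
// given tile length l_T (ghost zone width l_T / 2). The full exchange applies
// this per axis for every neighboring sub-volume.
struct GhostFlags { bool toLow, toHigh; };

GhostFlags ghostReplication(double p, double lo, double hi, double tileLength) {
  double width = 0.5 * tileLength;
  return {p - lo < width, hi - p < width};
}
```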

C. Modeling Workload

We assume all surface density fields to be of the same size and resolution, which is not unreasonable for many types of analysis. This allows the input of where to compute the fields to be specified as a set of points, with additional parameters for the field volume size and grid resolution. Input points for computing density fields within the sub-volume of a process are considered its local work items. All processes estimate the total amount of time needed to complete the local work by performing the following steps:

1) Count the number of particles, n_i, needed to complete each local work item.
2) Time the execution of a random local work item, t_r.
3) Exchange results, (n_r, t_r), with all processes.
4) Fit the global timing data to a predictive model, f_pre.
5) Estimate the remaining work by f_pre(n_i).

1) Counting particles: The modeling phase begins with each process counting the number of particles, n_i, needed for computing each of its surface density tiles, T_i. This is done by centering a cube, with dimensions specified by an input parameter, on the input point representing the field location. The number of particles in the cube is n_i.

2) Timing: A randomly chosen tile is computed by each process, recording the triangulation time, t_del, and interpolation time, t_interp.

3) Data Exchange: The result of each process's test problem timing, (n_r, t_del,r, t_interp,r), is shared with all processes. This is implemented with MPI using the MPI_Allgather function, providing implicit synchronization.
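A minimal sketch of that exchange (our illustration; the framework's actual data layout is not shown in the paper): each process contributes its triple (n_r, t_del, t_interp) and receives the triples of all ranks.

```cpp
#include <mpi.h>
#include <vector>

// Sketch: all-to-all exchange of each rank's test-problem measurements.
// sample = {n_r, t_del, t_interp} measured locally; returns one triple per rank.
std::vector<double> exchangeSamples(const double sample[3], MPI_Comm comm) {
  int nranks = 0;
  MPI_Comm_size(comm, &nranks);
  std::vector<double> all(3 * static_cast<size_t>(nranks));
  // Three doubles per rank; the collective also provides the implicit
  // synchronization mentioned in the text.
  MPI_Allgather(sample, 3, MPI_DOUBLE, all.data(), 3, MPI_DOUBLE, comm);
  return all;
}
```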


4) Fitting: Each process independently fits two models, one for computing the Delaunay triangulation and one for grid interpolation, both as a function of n_i. The Delaunay model is simply taken from the quickhull algorithm average case complexity analysis,

f_tri(n) = c · n · log₂(n)    (16)

where c is the parameter fit for capturing environmental characteristics. Fitting is performed using ordinary least squares,

c = (XᵀX)⁻¹ Xᵀ t_del    (17)

The interpolation model is a power law function,

f_interp(n) = α n^β    (18)

To fit the function we use the non-linear least squares Gauss-Newton method. The initial guess is taken from linearly fitting the log of the data and the log of the function.
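A sketch of the two fits (our illustration; variable names are ours): with a single parameter, the least-squares solution of equation 17 reduces to a ratio of sums, and a log-log line fit gives the starting values of α and β for equation 18, which a Gauss-Newton refinement would then polish.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Equations 16/17: fit c in t ~ c * n * log2(n) by ordinary least squares.
// With a single column, (X^T X)^{-1} X^T t collapses to sum(x*t) / sum(x*x).
double fitTriangulationModel(const std::vector<double>& n, const std::vector<double>& t) {
  double xt = 0.0, xx = 0.0;
  for (size_t i = 0; i < n.size(); ++i) {
    double x = n[i] * std::log2(n[i]);
    xt += x * t[i];
    xx += x * x;
  }
  return xt / xx;
}

// Equation 18: initial guess for t ~ alpha * n^beta from a linear fit of
// log(t) = log(alpha) + beta * log(n); a Gauss-Newton pass would refine it.
std::pair<double, double> powerLawInitialGuess(const std::vector<double>& n,
                                               const std::vector<double>& t) {
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  const size_t m = n.size();
  for (size_t i = 0; i < m; ++i) {
    double x = std::log(n[i]), y = std::log(t[i]);
    sx += x; sy += y; sxx += x * x; sxy += x * y;
  }
  double beta = (m * sxy - sx * sy) / (m * sxx - sx * sx);
  double alpha = std::exp((sy - beta * sx) / m);
  return {alpha, beta};
}
```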

D. Work Sharing

Once the timing estimates for each work item have been completed, the total amount of local work time, t_i, can be computed and shared with all processes, implemented with MPI_Allgather. Each process computes ⟨t⟩ to determine the senders and receivers in the work sharing schedule. We construct a list of pairs to hold the process ID and total time for all processes, P_i = (id_i, t_i).

[Figure: Rank ID vs. Time (s); processes above the average time ⟨t⟩ are senders, those below are receivers.]

Figure 4: Each process independently computes its communication list, determining whether it is a sender or receiver based on the average total time of all processes.

After executing the CreateCommunicationList routine, the RecvList contains information on which processes are going to send work, and in what order messages will be received. The SendList contains information on which process, when, and how much work a sender process can send to a receiver process. Senders then need to determine at what point in the computation of their local work they

Input: Process data P
Output: SendList, RecvList

1: myProcID ← getMyProcID()
2: lr ← 0                          ▷ number of senders
3: Ps ← SortByTimeDescending(P)
4: for all t in Ps.t do
5:   if t > ⟨t⟩ then
6:     lr ← lr + 1
7:   else
8:     break
9:   end if
10: end for
11: cr ← size(Ps) − 1              ▷ current receiver (smallest time)
12: for all i such that 0 ≤ i < lr do
13:   while cr ≥ lr and Ps[i].t > ⟨t⟩ do
14:     if (Ps[i].t − ⟨t⟩) > (⟨t⟩ − Ps[cr].t) then
15:       if myProcID = Ps[i].id then
16:         SendList.add(Ps[cr], ⟨t⟩ − Ps[cr].t)
17:       else if myProcID = Ps[cr].id then
18:         RecvList.add(Ps[i].id)
19:       end if
20:       Ps[i].t ← Ps[i].t − (⟨t⟩ − Ps[cr].t)
21:       Ps[cr].t ← ⟨t⟩
22:       cr ← cr − 1
23:     else
24:       if myProcID = Ps[i].id then
25:         SendList.add(Ps[cr], Ps[i].t − ⟨t⟩)
26:       else if myProcID = Ps[cr].id then
27:         RecvList.add(Ps[i].id)
28:       end if
29:       Ps[cr].t ← Ps[cr].t + Ps[i].t − ⟨t⟩
30:       Ps[i].t ← ⟨t⟩
31:     end if
32:   end while
33: end for

Figure 5: CreateCommunicationList

should pause and send work to a free receiver in their list. To do this, senders sort their SendList by send time (Ps.t) in ascending order. By taking the time difference between subsequent list items, senders determine which local work items can be computed to fill the time between sending work sharing messages. We view the time to fill between messages as work bins, where the objective is to fill them with local work items as best as possible. To take this concept a step further, the amount of work to send to a receiver can also be considered a work bin. We combine the two work bin lists to simultaneously solve both problems using a greedy first-fit approximation [23] to the variable bin packing problem, where we sort the work items in descending order and sort the bins in ascending order. The result is a permutation of the local work items that minimizes delays in communication. Note that this work sharing scheme is greedy in minimizing the overall number of communication messages, that is, the


senders with the most work to share send to receivers with the largest ability to receive.
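A sketch of that greedy assignment (our illustration of a first-fit heuristic under the ordering described above; it is not the framework's scheduler): work items are considered largest first and each is placed into the first bin, taken smallest first, that still has room.

```cpp
#include <algorithm>
#include <vector>

// Sketch: first-fit approximation to variable-size bin packing.
// items = predicted times of local work items; bins = capacities, i.e., the
// gaps between send events plus the per-receiver work budgets.
// Returns the bin index chosen for each item (-1 if nothing fits).
std::vector<int> firstFitSchedule(const std::vector<double>& items,
                                  const std::vector<double>& bins) {
  std::vector<int> itemOrder(items.size()), binOrder(bins.size());
  for (size_t i = 0; i < items.size(); ++i) itemOrder[i] = static_cast<int>(i);
  for (size_t i = 0; i < bins.size(); ++i) binOrder[i] = static_cast<int>(i);
  // Work items in descending order, bins in ascending order of capacity.
  std::sort(itemOrder.begin(), itemOrder.end(),
            [&](int a, int b) { return items[a] > items[b]; });
  std::sort(binOrder.begin(), binOrder.end(),
            [&](int a, int b) { return bins[a] < bins[b]; });
  std::vector<double> remaining = bins;
  std::vector<int> assignment(items.size(), -1);
  for (int it : itemOrder) {
    for (int b : binOrder) {
      if (remaining[b] >= items[it]) {   // first bin with enough room
        remaining[b] -= items[it];
        assignment[it] = b;
        break;
      }
    }
  }
  return assignment;
}
```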

E. Execution and Communication

Receivers simply execute all their local work and listen for a message from the next sender in their list. This is implemented by calling MPI_Recv. When a work sharing message is received, they receive a copy of the sender's particle set and the density field positions that need to be computed. This process is repeated until there are no more receive messages in their list. Senders execute their local work items and call MPI_Send after iterations determined by the optimization algorithm.
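A minimal sketch of the receiver side (our illustration only; the message tags, datatypes, and particle layout are assumptions, not the framework's protocol):

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical tags for the two-part work sharing message.
constexpr int TAG_PARTICLES = 1;
constexpr int TAG_POSITIONS = 2;

// Sketch: after finishing local work, a receiver services each sender in its
// RecvList, accepting the sender's particles and the field positions to compute.
void serviceSenders(const std::vector<int>& recvList, MPI_Comm comm) {
  for (int sender : recvList) {
    MPI_Status status;
    int count = 0;

    // Particle coordinates, assumed packed as 3 doubles per particle.
    MPI_Probe(sender, TAG_PARTICLES, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);
    std::vector<double> particles(count);
    MPI_Recv(particles.data(), count, MPI_DOUBLE, sender, TAG_PARTICLES, comm, MPI_STATUS_IGNORE);

    // Field positions, assumed packed as 3 doubles per requested tile.
    MPI_Probe(sender, TAG_POSITIONS, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);
    std::vector<double> positions(count);
    MPI_Recv(positions.data(), count, MPI_DOUBLE, sender, TAG_POSITIONS, comm, MPI_STATUS_IGNORE);

    // ... run the shared-memory DTFE kernel on the received work here ...
  }
}
```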

V. EXPERIMENTAL RESULTS

The DTFE surface density grid rendering kernel described in this paper is implemented in C using OpenMP to parallelize the loop over the individual grid cell computations. Computing multiple fields is implemented in C++ and parallelized using MPI for all interprocess communication. The kernel comparison with the DTFE software was performed on a workstation with two Intel Xeon processors (E5-2620; 15M cache, 2.00 GHz), 12 physical cores supporting 2 hardware threads per core. The distributed-memory experiment was carried out on Cooley, an InfiniBand data analysis/visualization cluster at Argonne National Laboratory with two six-core 2.4 GHz Intel Haswell processors and 384 GB memory/node. The distributed-memory comparison with the TESS density estimator [20] was performed on Cooley without OpenMP support. Cooley was also used for our first shared/distributed-memory experiments on a high resolution N-body simulation. The second shared/distributed-memory experiment was carried out on Mira, an IBM Blue Gene/Q supercomputer with a PowerPC A2 1600 MHz processor containing 16 cores and 16 GB memory/node. For creating Delaunay triangulations we use the Qhull software library [5]. In all our experiments, triangulations are constructed without any threading, as this was not the focus of this work. Our code makes no special use of triangulation library data structures, so substituting the triangulation library can be performed with little refactoring.

1) Shared-Memory Comparison: For an empirical comparison with walking based software, we compared our implementation with the publicly available DTFE software [2] using the provided demo dataset from a publicly available N-body simulation software called Gadget [24]. The results are displayed in Figure 6 for a 1024³ grid. The dataset has 650,466 particles in a volume of (100 Mpc/h)³. The timing comparison is restricted only to the time it takes to interpolate the grid and not to generating the triangulation. The software task is to compute a regular grid with a single point for computing the density at each grid cell. For clarity, our algorithm did not run using dynamic grid spacing, but rather an equally spaced grid in all three dimensions. In this way, both approaches are locating and interpolating exactly the same number of grid cells. Overall, as shown in Figure 6, our method is approximately an order of magnitude faster and threads have a much more evenly distributed execution time. Both codes were run using 24 threads. The difference in overall performance is attributed to the pre-factor in the time complexity of the algorithms, which in this case is O(N² + N_g³). The difference comes from how the two methods perform point location for determining the tetrahedra for interpolation; the DTFE public software uses the 'walking' method as opposed to our 'marching' method.

[Figure: Time (Seconds) vs. Thread id; series: DTFE 1.1.1, Our method, Mean times.]

Figure 6: Comparison of the thread timing from the publicly available DTFE software [2] and using the kernel algorithm described here for a single 1024³ grid.

[Figure panels: (a) Execution time and (b) Speedups vs. number of MPI processes; series: Interpolation, Triangulation, DENSE, TESS (and Linear in (b)).]

Figure 7: Execution time and speedups for TESS, DENSE, and the corresponding steps of our algorithm.

2) Distributed-Memory Comparison: The next experiment is another comparison with the distributed-memory TESS Density Estimator [20], a freely available C++ MPI software for computing surface density. The dataset for this experiment is a 32 Mpc h⁻¹ sub-volume with 1,711,563 particles taken from the large 3200³ particle simulation we use in this paper. The TESS Density Estimator runs in two stages: constructing a Voronoi tessellation, TESS, followed by estimating density at grid points covered by Voronoi cells, DENSE. Figure 7 presents the execution and speedup of the corresponding stages of TESS and DENSE with our algorithm. To compare, we ran our implementation in multiple-process-single-thread mode and decomposed the single large grid into multiple sub-grids. The results show ∼8x improvement in execution time. In this experiment we generated a 4096x4096 surface density grid from the entire dataset. Shown in Figure 8 are the outputs from single process runs of the codes.

[Figure panels: (a) DTFE Method, (b) TESS/DENSE, (c) Relative Difference Map, (d) Difference Histogram of (DTFE−DENSE)/DENSE.]

Figure 8: The 4096x4096 grids generated by our DTFE implementation and TESS/DENSE. We show the relative difference map and the difference histogram to understand any bias between the estimators.

From Figure 8c, we see the density maps produced by the two different methods are more in agreement in the high density regions. This difference is largely attributed to the interpolation scheme of TESS being of a lower order. We see from Figure 8d that, relative to each other, there is a bias towards higher density with DTFE in the low density regions when compared to TESS.

3) Shared/Distributed-Memory with Load Balancing: Our shared/distributed-memory experiments were performed on N-body simulation datasets generated by HACC (Hardware/Hybrid Accelerated Cosmology Code) [4], [18]. We use two science quality datasets to demonstrate the overall performance and load balancing capability of our method.

The first dataset is a simulation called Planck and has 1024³ particles in a 256 Mpc h⁻¹ volume. We perform two experiments on this dataset where we computed thousands of density fields in different spatial configurations.

Galaxy-Galaxy Lensing Experiment: The next set of figures were generated by computing 7,209 density fields centered on the positions of simulated galaxies. Galaxy positions are assigned to the most dense regions in the simulation volume by a model for the galaxy distribution. This configuration of fields is particularly challenging computationally, not only because fields are required in the most highly concentrated particle regions, but also because the decomposition sub-volumes become more imbalanced the more we decompose to achieve higher performance.

[Figure panels: (a) Speedup Compute (1024³ Particles); (b) 7,209 Fields (1024³ Particles), time in minutes vs. number of MPI processes; series: Ideal, Total, Partition, Model, Compute, Work Sharing / Partition, Model, Triangulate, Grid Render, Work Sharing.]

Figure 9: Speedup on up to 240 processes showing the overall performance scaling to be near linear until 64 processes, at which point the cost of partitioning and the overhead of modeling and work sharing become apparent. This experiment computes density fields in the most highly concentrated regions of particles.

Figure 9 presents the speedup and timing of our parallel algorithm. To illustrate reasons for the performance drop off at higher numbers of processes, we include the speedup of all phases. We can see that as the number of processes increases, the partitioning phase is nearly flat, showing we become bound by the IO operation. Similarly, the overhead of modeling becomes apparent since it is nearly constant due to each process computing a random test problem calculation. The early improvements are due to estimating the time for fewer local work items, but this flattens as computing the test problem dominates the phase. Figure 10 demonstrates how timing becomes more imbalanced as sub-volumes become smaller due to particles becoming more imbalanced. The problem is two-fold since there are more work items in sub-volumes with higher particle concentrations, and the work items themselves are more costly.

Model Prediction: We characterize the modeling predictive accuracy in Figure 11 to support our speedup and imbalance claims as shown in Figure 10. The error distributions are symmetric with mean centered near zero.

Multiplane Lensing Experiment: We repeated the experiment on the same dataset using a different configuration for the density fields. Here we are creating density fields in adjacent sub-volumes along the entire line of sight of the complete volume. We construct 700 such line-of-sight field configurations for a total of 9,061 density fields.


[Figure: std. workload (normalized) vs. number of processes; panel: Workload (1024³ Particles); series: Balanced, Unbalanced.]

Figure 10: Workload imbalance measured by calculating the standard deviation in compute time (unbalanced estimated by model prediction), showing the trend for more imbalance as sub-volumes become smaller. Balanced compute time was obtained from execution wall time.

[Figure panels: (a) Triangulation Model Error, (b) Interpolation Model Error; count vs. time.]

Figure 11: Histograms of the error made by work prediction models fit with 32 test samples compared with actual wall timing from the galaxy-galaxy lensing experiment. (a) Triangulation model, µ = 5.0925; (b) Interpolation model, µ = 12.1209.

[Figure panels: (a) 9,061 Fields (1024³ Particles), time in minutes vs. number of MPI processes; (b) Speedup Compute (1024³ Particles).]

Figure 12: Speedup on up to 220 processes showing the overall performance scaling to be near linear with only small deviation. This experiment computes density fields in both high and low regions of particle concentration.

These fields are a mixture of high and low density sub-volumes. Interestingly, we see better overall scalability performance in this experiment. Looking at the work sharing speedup in Figure 12, we conclude that there are more small work items to complete and the variable bin size optimizer can be more efficient, resulting in less waiting time for blocking messages to complete. The explanation is rooted in the fact that more work items were performed while work sharing message sizes remained the same.

Large Scale Experiment: The second dataset is a simulation called MiraU, a large volume (1491 Mpc h⁻¹)³, 3200³ (∼32 billion) particle simulation where we computed 1024x1024 surface density grids centered on the 233,230 most massive objects found by a density based clustering algorithm. This is similar to the galaxy-galaxy lensing experiment we performed on the Planck dataset, but at a much larger scale.

[Figure panels: (a) 233,230 Fields (3200³ Particles), time in minutes vs. number of MPI processes; (b) Speedup Compute (3200³ Particles).]

Figure 13: Speedup on up to 16,384 processes showing the overall performance scaling to be near linear until 16,384 processes, at which point a few degenerate point configurations make the model predicted runtime inaccurate. This experiment computes density fields in the most highly concentrated regions of particles.

From Figure 13 we can see that the speedup is near linear until 16,384 processes, at which point the compute speedup drops significantly. A closer look indicates that a small number of degenerate point configurations on a few MPI processes made the model predicted runtime inaccurate and delayed sending work to idle processes. This is clearly indicated by the corresponding drop in work sharing speedup and points out a drawback to our a priori model for work sharing.

VI. CONCLUSION

Performing high throughput analysis tasks that involve computing high quality surface density fields in state of the art N-body cosmological simulations poses computational challenges primarily related to load distribution. We address the issue by proposing a fast and accurate grid rendering kernel for the DTFE method that is easily parallelizable and computationally efficient, achieving well balanced multithread performance when compared to existing software packages. We leverage the computational predictability of the kernel to achieve well balanced workloads in our distributed memory framework by accurately modeling computation to determine an a priori work sharing schedule. Future work will involve kernel optimizations for emerging hardware such as many-core CPUs and extending the work sharing algorithm to a more general framework.


REFERENCES

[1] Schaap, W. E. "DTFE: The Delaunay Tessellation Field Estimator", Ph.D. Dissertation, University of Groningen, The Netherlands, 2007.

[2] Cautun, Marius C., and Rien van de Weygaert. "The DTFE public software: The Delaunay Tessellation Field Estimator code." arXiv preprint arXiv:1105.0370 (2011).

[3] Ellis, R. S. "Gravitational Lensing: A Unique Probe of Dark Matter and Dark Energy". Philosophical Transactions of the Royal Society A 368 (2010): 967-987.

[4] Habib, Salman, et al. "HACC: Simulating Sky Surveys on State-of-the-Art Supercomputing Architectures". arXiv:1410.2805 [astro-ph.IM].

[5] Barber, C. B., Dobkin, D. P., and Huhdanpaa, H. T. "The Quickhull Algorithm for Convex Hulls", ACM Trans. on Mathematical Software, 22(4):469-483, Dec 1996, http://www.qhull.org.

[6] Seitz, Stella, Peter Schneider, and Jürgen Ehlers. "Light propagation in arbitrary spacetimes and the gravitational lens approximation." Classical and Quantum Gravity 11.9 (1994): 2345.

[7] Bradac, M., et al. "The signature of substructure on gravitational lensing in the ΛCDM cosmological model." Astronomy & Astrophysics 423.3 (2004): 797-809.

[8] Rau, Stefan, Simona Vegetti, and Simon D. M. White. "The effect of particle noise in N-body simulations of gravitational lensing." Monthly Notices of the Royal Astronomical Society 430.3 (2013): 2232-2248.

[9] Peterka, T., Morozov, D., Phillips, C. "High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation," In Proceedings of SC14, New Orleans, LA, 2014.

[10] J. Snoeyink, "Point location," in: J. E. Goodman and J. O'Rourke (Eds.), Handbook of Discrete and Computational Geometry, CRC Press, Boca Raton, FL, 1997, pp. 559-574.

[11] Batista, Vicente H. F., et al. "Parallel Multi-Core Geometric Algorithms in CGAL."

[12] CGAL, Computational Geometry Algorithms Library, http://www.cgal.org.

[13] Bernardeau, Francis, and Rien van de Weygaert. "A new method for accurate estimation of velocity field statistics". Monthly Notices of the Royal Astronomical Society 279.2 (1996): 693-711.

[14] Sambridge, Malcolm, Jean Braun, and Herbert McQueen. "Geophysical parametrization and interpolation of irregular data using natural neighbours". Geophysical Journal International 122.3 (1995): 837-857.

[15] Mücke, Ernst P., Isaac Saias, and Binhai Zhu. "Fast randomized point location without preprocessing in two- and three-dimensional Delaunay triangulations". Computational Geometry 12.1 (1999): 63-83.

[16] Möller, Tomas, and Ben Trumbore. "Fast, minimum storage ray-triangle intersection". Journal of Graphics Tools 2.1 (1997): 21-28.

[17] Platis, Nikos, and Theoharis Theoharis. "Fast Ray-Tetrahedron Intersection Using Plücker Coordinates." Journal of Graphics Tools 8.4 (2003): 37-48.

[18] Habib, Salman, et al. "The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 2012.

[19] Pelupessy, F. Inti, W. E. Schaap, and R. Van De Weygaert. "Density estimators in particle hydrodynamics: DTFE versus regular SPH." Astronomy & Astrophysics 403.2 (2003): 389-398.

[20] Peterka, T., et al. "Self-Adaptive Density Estimation of Particle Data." Submitted to SIAM Journal on Scientific Computing, SISC Special Section on CSE15: Software and Big Data (2015).

[21] Schneider, Peter. Gravitational lensing statistics. Springer Berlin Heidelberg, 1992.

[22] Petkova, Margarita, R. Benton Metcalf, and Carlo Giocoli. "glamer - II. Multiple-plane gravitational lensing." Monthly Notices of the Royal Astronomical Society 445.2 (2014): 1954-1966.

[23] Kang, Jangha, and Sungsoo Park. "Algorithms for the variable sized bin packing problem." European Journal of Operational Research 147.2 (2003): 365-372.

[24] Springel, Volker. "The cosmological simulation code GADGET-2." Monthly Notices of the Royal Astronomical Society 364.4 (2005): 1105-1134.