
The Scalable Data Management, Analysis, and Visualization Institute http://sdav-scidac.org

Large-Scale Parallel Analysis of Applications with the DIY Library

SDAV technologies help scientists analyze data ranging from nanometers to gigaparsecs and femtoseconds to billions of years. DIY is a library that enables parallelization of numerous data analysis methods at extreme scale.

Tom Peterka (ANL), Dmitriy Morozov (LBNL), Carolyn Phillips (ANL), Youssef Nashed (ANL), Han-Wei Shen (OSU), Boonthanome Nouanesengsy (LANL), Kewei Lu (OSU)

Lagrangian Coherent Structures Based on FTLE Computation in Climate Science

Left: Tessellation of halo 6606356352 shows substructures inside the halo.

Parallel Ptychographic Image Reconstruction

[Courtesy Dmitriy Morozov, LBL] Dmitriy Morozov and Patrick O’Neil (LBL) developed tools for segmentation and connectivity analysis of granular and porous media using diy2.

Three representations of the same halo. From left to right: original raw particle data, Voronoi tessellation, and regular grid density sampling.

[Courtesy Youssef Nashed, ANL] An illustration of multi-GPU phase retrieval. The diffraction patterns are subdivided and distributed among the available GPUs, then merged together using DIY’s merge-based reduction. Strong scaling efficiency on synthetic data is 55% on 21,000 256×256 images.

A gold Siemens star test pattern with a 30 nm smallest feature size was raster scanned through a 26×26 grid with a step size of 40 nm, using a 5.2 keV X-ray beam.
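The merge step can be pictured as a rounds-based reduction in which partners combine their partial results pairwise up a tree. The following is a minimal sketch of that idea in plain MPI, not DIY’s actual API; the flat float image and the element-wise sum used as the merge operator are assumptions for illustration.

#include <mpi.h>
#include <vector>

// Sketch only: combine per-process partial images up a binomial tree.
// The "merge" here is an assumed element-wise sum of flat float buffers.
void tree_merge(std::vector<float>& img, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    std::vector<float> incoming(img.size());
    int n = static_cast<int>(img.size());

    for (int step = 1; step < size; step *= 2)
    {
        if (rank % (2 * step) == 0)               // this rank receives and merges
        {
            int src = rank + step;
            if (src < size)
            {
                MPI_Recv(incoming.data(), n, MPI_FLOAT, src, 0, comm, MPI_STATUS_IGNORE);
                for (int i = 0; i < n; ++i)
                    img[i] += incoming[i];        // merge operator (assumed)
            }
        }
        else                                      // this rank sends its result and exits
        {
            MPI_Send(img.data(), n, MPI_FLOAT, rank - step, 0, comm);
            break;
        }
    }
    // Rank 0 now holds the fully merged image.
}

DIY’s merge-based reduction generalizes this pattern to configurable group sizes per round and operates on blocks rather than MPI ranks.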

Image Segmentation in Porous Media

Left: 3D image of a granular material (flexible sandstone) acquired at ALS by Michael Manga and Dula Parkinson (data: 2560 × 2560 × 1276). Right: Watershed segmentation of the material identifies individual grains (run on Edison @ NERSC) [courtesy Morozov, O’Neil (LBL)].

[Courtesy of Carolyn Phillips, ANL] In simulations of soft matter systems, molecules self-organize to form two or more domains. These domains can form complicated geometries such as the double gyroid. Summing the Voronoi cells over 300 time steps of an A-B-A triblock copolymer system suggests that the B domain dilates relative to the A domain.

Tessellations in Molecular Dynamics

Voronoi tessellation of 1,000 A-B-C “telechelics” composed of two nanospheres connected by polymer tether beads in a double gyroid morphology. Only the Voronoi cells associated with the A species are shown. Such surfaces are usually constructed with isosurface methods, which require averaging over many time steps; with the tessellation, they can be constructed for every time step.

Density Estimation in Cosmology and Astrophysics

[Courtesy Boonthanome Nouanesengsy, OSU] Left: Particle tracing of 288 million particles over 36 time steps in a 3600×2400×40 eddy-resolving dataset. Right: 131 million particles over 48 time steps in a 500×500×100 simulation of Hurricane Isabel. Time includes I/O.
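At the core of such particle tracing is the numerical integration of particle trajectories through the velocity field. Below is a minimal sketch of one fourth-order Runge-Kutta step; the analytic velocity_at field is a stand-in for interpolation of the actual simulation data, and the hand-off of particles that leave the local block (which the distributed tracer must do) is omitted.

#include <array>

using Vec3 = std::array<double, 3>;

// Stand-in velocity field; a real tracer interpolates the simulation's vector data.
Vec3 velocity_at(const Vec3& p, double /*t*/)
{
    return Vec3{-p[1], p[0], 0.1};                // simple rotation about the z-axis
}

// Advance one particle by a single RK4 step of size dt.
Vec3 rk4_step(const Vec3& p, double t, double dt)
{
    auto offset = [](const Vec3& a, const Vec3& b, double s) {
        return Vec3{a[0] + s * b[0], a[1] + s * b[1], a[2] + s * b[2]};
    };
    Vec3 k1 = velocity_at(p, t);
    Vec3 k2 = velocity_at(offset(p, k1, 0.5 * dt), t + 0.5 * dt);
    Vec3 k3 = velocity_at(offset(p, k2, 0.5 * dt), t + 0.5 * dt);
    Vec3 k4 = velocity_at(offset(p, k3, dt),       t + dt);

    Vec3 next;
    for (int i = 0; i < 3; ++i)
        next[i] = p[i] + dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    return next;   // a particle leaving the local block would be enqueued to a neighbor
}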

Streamlines, Pathlines, and Stream Surfaces in Nuclear Engineering

[Courtesy Kewei Lu, OSU] 64 surfaces, each with 2K seeds, in a 2K × 2K × 2K Nek5000 thermal hydraulics simulation. Time excludes I/O. Left: Strong scaling. Right: Percentage of time in the three stages of our algorithm.

Voronoi and Delaunay Tessellations in Cosmology

Information Entropy in Astrophysics

Computational Topology in Combustion

[Courtesy Abon Chaudhuri, OSU] Computation of information entropy in a solar plume dataset.
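A minimal sketch of the underlying computation, assuming a fixed-width 64-bin histogram over one block’s scalar values (the bin count and range handling are illustrative choices, not taken from the poster):

#include <cmath>
#include <vector>

// Shannon entropy H = -sum_i p_i log2(p_i) of one block's scalar values,
// estimated from a fixed-width histogram over [lo, hi].
double block_entropy(const std::vector<float>& values, float lo, float hi, int bins = 64)
{
    std::vector<double> hist(bins, 0.0);
    for (float v : values)
    {
        int b = static_cast<int>((v - lo) / (hi - lo) * bins);
        if (b < 0)      b = 0;                  // clamp out-of-range samples
        if (b >= bins)  b = bins - 1;
        hist[b] += 1.0;
    }
    double H = 0.0;
    for (double count : hist)
    {
        if (count == 0.0) continue;
        double p = count / static_cast<double>(values.size());
        H -= p * std::log2(p);
    }
    return H;
}

For a global result (as in the Global Histogram benchmark later on this poster), the per-block histograms would first be summed across processes with a reduction before the entropy is evaluated.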

[Courtesy Attila Gyulassy, SCI] Computation of the Morse-Smale complex in a jet mixture fraction data set.

Proxy Applications for Exascale Codesign

[Figure: Strong Scaling For Various Data Sizes. Time (s) vs. number of processes (128 to 16,384); data sizes 512^3, 1024^3, and 2048^3.]

Particle tracing of thermal hydraulics data is plotted in log-log scale. The top panel shows 400 particles tracing streamlines in this flow field. In the center panel, 128K particles are traced in three data sizes: 512^3 (134 million cells), 1024^3 (1 billion cells), and 2048^3 (8 billion cells). End-to-end results are shown, including I/O (reading the vector dataset from storage and writing the output particle traces).

[Figure: Swap No-op Performance. Time (s) vs. number of processes (8 to 16,384); message sizes 512 KB to 128 MB; MPI_Reduce_scatter vs. DIY_Swap.]

[Figure: Swap Single-thread Performance. Time (s) vs. number of processes (8 to 16,384); message sizes 512 KB to 128 MB; MPI_Reduce_scatter vs. DIY_Swap.]

# Items   Bytes/Item   Total Bytes   # Procs   Exchange Time (s)
64        20           1 K           512       0.003
256       20           5 K           512       0.010
1 K       20           20 K          512       0.040
4 K       20           80 K          512       0.148
16 K      20           320 K         512       0.627
64 K      20           1 M           512       2.516
256 K     20           5 M           512       10.144
1 M       20           20 M          512       40.629

[Figure: Strong and Weak Scaling with CGAL. Time (s) vs. number of processes (16 to 64K); 128^3 to 2048^3 particles; perfect strong and weak scaling reference lines.]

[Figure: Strong Scaling of HACC Data on Mira. Time (s) vs. number of processes (512 to 16K); time steps 68, 247, and 499; perfect strong scaling reference line.]

[Figure: Strong Scaling. Time (s) vs. number of processes (256 to 8192); grid sizes 1024^3 to 8192^3; perfect strong scaling reference line.]

Left and center: Cross sections of CIC and TESS-based density estimators. Right: Strong scaling for grid sizes ranging from 1024^3 to 8192^3.
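As a point of reference, a cloud-in-cell (CIC) deposit spreads each particle’s mass over the eight surrounding grid cells with trilinear weights. The sketch below assumes unit particle mass, a periodic cube of side box, and row-major grid storage; it illustrates only the CIC estimator, not the tessellation-based (TESS) one.

#include <array>
#include <cmath>
#include <vector>

// Deposit unit-mass particles onto an n x n x n periodic grid (row-major,
// density.size() == n*n*n) using trilinear cloud-in-cell weights.
void cic_deposit(const std::vector<std::array<double, 3>>& particles,
                 std::vector<double>& density, int n, double box)
{
    const double cell = box / n;
    for (const auto& p : particles)
    {
        int    i[3];
        double w[3];
        for (int d = 0; d < 3; ++d)
        {
            double x = p[d] / cell - 0.5;          // cell-centered coordinate
            i[d] = static_cast<int>(std::floor(x));
            w[d] = x - i[d];                       // fractional offset in [0,1)
        }
        for (int dz = 0; dz < 2; ++dz)
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx)
                {
                    double weight = (dx ? w[0] : 1.0 - w[0]) *
                                    (dy ? w[1] : 1.0 - w[1]) *
                                    (dz ? w[2] : 1.0 - w[2]);
                    int ix = ((i[0] + dx) % n + n) % n;   // periodic wrap
                    int iy = ((i[1] + dy) % n + n) % n;
                    int iz = ((i[2] + dz) % n + n) % n;
                    density[(static_cast<size_t>(iz) * n + iy) * n + ix] += weight;
                }
    }
}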

[Figure: Total & Component Time For Jet Mixture Fraction. Time (s) vs. number of processes (32 to 8192); total, compute, merge, input, and output time; perfect scaling reference line.]

[Figure: Global Histogram. Time (s) vs. number of processes (32 to 512); Plume(S), Plume(V), Nek(S), and Nek(V).]

Left: Communication time only for our merge algorithm compared with MPI’s reduction algorithm. Center: Our swap algorithm compared with MPI’s reduce-scatter algorithm. Right: Performance of the neighbor exchange algorithm at high counts of small messages is a stress test of point-to-point communication; diy2 shows linear complexity in total data size.
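For context, the MPI baselines in the left and center panels are the standard collectives shown below. DIY’s own merge and swap reductions operate on blocks rather than flat per-rank buffers, so this is only a sketch of what is being compared against, with an arbitrary buffer size.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                        // elements per process (arbitrary;
    std::vector<float> local(n, 1.0f);            //  assumes size divides n evenly)
    std::vector<float> reduced(n);
    std::vector<float> piece(n / size);

    // Baseline for the merge benchmark: the full reduction lands on rank 0.
    MPI_Reduce(local.data(), reduced.data(), n, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    // Baseline for the swap benchmark: the reduced result stays distributed,
    // n/size elements per rank, analogous to a swap-based reduction's output.
    MPI_Reduce_scatter_block(local.data(), piece.data(), n / size,
                             MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}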

Right: Delaunay tessellation of 128^3 dark matter tracer particles. Left: Strong and weak scaling (excluding I/O time) from 16 to 128K processes. Right: Strong scaling (excluding I/O time) for three time steps of HACC data with 1024^3 particles.
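The serial kernel behind these tessellations can be sketched with CGAL, which the scaling study above uses; what follows is only the per-block step, with the exchange of particles near block boundaries (handled by the distributed algorithm) omitted.

#include <CGAL/Delaunay_triangulation_3.h>
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef CGAL::Delaunay_triangulation_3<Kernel>               Delaunay;
typedef Kernel::Point_3                                      Point;

// Tessellate one block's particles; the range insert uses CGAL's spatial
// sorting internally. Cells and vertices can then be iterated to extract
// Voronoi volumes or density estimates.
Delaunay tessellate_block(const std::vector<Point>& particles)
{
    Delaunay dt;
    dt.insert(particles.begin(), particles.end());
    return dt;
}

Because the Voronoi diagram is the dual of the Delaunay triangulation, either view of the same structure can be extracted from dt.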