Parallel Visualization of Large-Scale Datasets for the Earth Simulator
Li Chen Issei Fujishiro Kengo Nakajima
Research Organization for Information Science and Technology (RIST)
Japan
3rd ACES Workshop, May 5-10, 2002, Maui, Hawaii.
Visualization
Basic Design & Parallel/SMP/Vector Algorithm
Background
Role of Visualization Subsystem
[Diagram: GeoFEM on the Earth Simulator, showing mesh generation, solver, application analysis, and the Visualization Subsystem as software running on the Earth Simulator hardware]
Tools for: (1) Post-processing (2) Data mining, etc.
Background: Requirements
Target 1: Powerful visualization functions
Translate data from numerical forms to visual forms.
Provide the researchers with immense assistance in the process of understanding their computational results.
Target 2: Suitable for large-scale datasets
High parallel performance
Target 3: Available for unstructured datasets
Complicated grids
Target 4: SMP cluster architecture oriented
Effective on the SMP cluster architecture
We have developed many visualization techniques in GeoFEM, for scalar, vector and tensor data fields, to reveal data distribution from many aspects.
Our modules have been parallelized and achieve high parallel performance.
All of our modules are based on unstructured datasets and can be extended to hybrid grids.
A three-level hybrid parallel programming model is adopted in our modules.
Work since the 2nd ACES Workshop (Oct. 2000):
Developed more visualization techniques for GeoFEM; improved parallel performance.
Please visit our poster for details!
Overview
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Parallel Visualization: File Version or "Debugging" Version
[Diagram: file-version pipeline. Mesh files (mesh#0 ... mesh#n-1) -> FEM analysis (FEM-#0 ... FEM-#n-1, each doing I/O, solver, I/O) -> result files (result#0 ... result#n-1) -> visualization processes (VIS-#0 ... VIS-#n-1) -> visualization result files (UCD etc.) and images -> viewer (AVS etc.) on the client. The client-side step includes simplification, combination, etc. Arrows mark input/output and communication.]
Large-Scale Data in GeoFEM
1 km x 1 km x 1 km mesh for a 1000 km x 1000 km x 100 km "local" region
1000 x 1000 x 100 = 10^8 grid points
1 GB/variable/time step, i.e. ~10 GB/time step for 10 variables
Huge: TB scale for 100 steps!
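As a rough consistency check on these figures (assuming 8-byte double-precision values per grid point, an assumption not stated on the slide):

$$10^{8}\ \text{points} \times 8\ \text{B} \approx 0.8\ \text{GB/variable/step},\qquad 10\ \text{variables} \approx 8\text{-}10\ \text{GB/step},\qquad 100\ \text{steps} \approx 1\ \text{TB}.$$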
Parallel Visualization: Memory/Concurrent Version
[Diagram: memory/concurrent pipeline. Mesh files (mesh#0 ... mesh#n-1) -> FEM + Visualization on the GeoFEM platform (FEM-#0 ... FEM-#n-1, each doing I/O, solver, I/O, paired with VIS-#0 ... VIS-#n-1) -> visualization result files (UCD etc.) and images -> viewer (AVS etc.) on the client. Arrows mark input/output and communication.]
Dangerous if the detailed physics is not clear
Parallel Visualization Techniques in GeoFEM
Scalar Field: cross-sectioning, isosurface-fitting, surface-fitting, interval volume-fitting, volume rendering
Vector Field: streamlines, particle tracking, topological map, LIC, volume rendering
Tensor Field: hyperstreamlines
In the following, we take the Parallel Volume Rendering module as an example to demonstrate our strategies for improving parallel performance.
Available June 2002: http://geofem.tokyo.rist.or.jp/
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Design of Visualization Methods
Principles:
  Taking account of parallel performance
  Taking account of huge data size
  Taking account of unstructured grids
Classification of current volume rendering methods:
  Traversal approach: image-order volume rendering (ray casting), object-order volume rendering (cell projection), hybrid-order volume rendering
  Grid type: regular, curvilinear, unstructured
  Projection: parallel, perspective
  Composition approach: from front to back, from back to front
Design of Visualization Methods
Principle:
  Taking account of running concurrently with the computational process
Classification of Parallelism
Object-space parallelism
Partition object space and each PE gets a portion of the dataset. Each PE calculates an image of the sub-volume.
Image-space parallelism
Partition image space and each PE calculates a portion of the whole image.
Time-space parallelism
Partition time space and each PE calculates the images of several timesteps.
Large storage requirement
Slows down the volume rendering process
Design for Parallel Volume Rendering
Grid choices: unstructured / locally refined / octree (hierarchical)
Why not unstructured grid?
  Hard to build a hierarchical structure
  Connectivity information must be found beforehand
  Unstructured grids make image composition and load balancing difficult
  Irregular shape makes sampling slower
Why not regular grid?
One solution: parallel transformation from unstructured to hierarchical
  FEM data -> resampling -> hierarchical data -> ray-casting PVR -> volume-rendered image
[Figure: original GeoFEM meshes distributed over PE#0-PE#17, the background cells overlaid on them, and the resulting voxels]
Accelerated Ray-casting PVR
Inputs: VR parameters, hierarchical datasets
1. Determine sampling and mapping parameters
2. Build branch-on-need octree
3. Generate subimages on each PE:
   for each subvolume
     for j = startj to endj
       for i = starti to endi
         quickly find the voxels intersected by ray (i, j)
         compute (r, g, b) at each intersected voxel based on the volume illumination model and transfer functions
         compute (r, g, b) for pixel (i, j) by front-to-back composition
4. Build topological structure of subvolumes on all PEs
5. Composite subimages from front to back
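To make the per-pixel step concrete, here is a minimal sketch of front-to-back composition along one ray. It is illustrative only, not the authors' implementation; `composite_ray`, `Voxel`, `RGBA`, and the pre-shaded `hits` array are hypothetical stand-ins for the voxel traversal, illumination model, and transfer functions described above.

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } RGBA;   /* color + opacity of one sample */
typedef struct { RGBA sample; } Voxel;       /* hypothetical: pre-shaded intersected voxel */

/* Front-to-back composition along one ray: accumulate color until the ray
 * becomes (nearly) opaque, then stop (early ray termination). */
static RGBA composite_ray(const Voxel *hits, size_t n_hits)
{
    RGBA pix = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t k = 0; k < n_hits && pix.a < 0.99f; ++k) {
        const RGBA s = hits[k].sample;        /* from illumination model + transfer functions */
        const float w = (1.0f - pix.a) * s.a; /* remaining transparency times sample opacity */
        pix.r += w * s.r;
        pix.g += w * s.g;
        pix.b += w * s.b;
        pix.a += w;
    }
    return pix;
}
```

Each PE would run such a loop over the pixels (i, j) of its subimage; the resulting subimages are then composited across PEs in the same front-to-back order.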
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
SMP Cluster Type Architectures
  Earth Simulator
  ASCI hardware
Various types of communication and parallelism: inter-SMP node, intra-SMP node, individual PE
[Figure: SMP cluster, several SMP nodes, each containing multiple PEs that share one memory]
Candidate programming models per level (each PE / intra-node / inter-node): F90 + directives with OpenMP and MPI; MPI + F90; MPI + HPF; HPF
Optimum programming models for the Earth Simulator?
Three-Level Hybrid Parallelization vs. Flat MPI Parallelization
  Flat MPI parallelization: each PE is independent
  Hybrid parallel programming model: based on the memory hierarchy
    • Inter-SMP node: MPI
    • Intra-SMP node: OpenMP for parallelization
    • Individual PE: compiler directives for vectorization/pseudo-vectorization
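A minimal sketch of how the three levels combine in one loop nest. This is illustrative only, not GeoFEM code (GeoFEM itself is written in F90; the C shown here, and the hypothetical function `scale_field`, just mirror the structure): MPI ranks map to SMP nodes, OpenMP threads map to PEs within a node, and the innermost loop is kept long and dependency-free so the compiler can (pseudo-)vectorize it.

```c
#include <mpi.h>
#include <omp.h>

/* Illustrative three-level hybrid pattern:
 *   level 1: one MPI process per SMP node (inter-node communication)
 *   level 2: OpenMP threads over the PEs inside the node
 *   level 3: a long, contiguous innermost loop the compiler can vectorize */
void scale_field(double *v, long n, double factor, MPI_Comm comm)
{
    #pragma omp parallel for          /* intra-node: one chunk per PE */
    for (long i = 0; i < n; ++i)      /* per-PE: long vectorizable loop */
        v[i] *= factor;

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; ++i)
        local += v[i];

    /* inter-node: MPI reduction across SMP nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    (void)global;
}
```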
Flat MPI vs. OpenMP/MPI Hybrid
[Figure: Hybrid, a hierarchical structure with PEs grouped under a shared memory per node, vs. Flat MPI, where each PE is treated as an independent process]
Three-Level Hybrid parallelization
Previous work on hybrid parallelization:
  R. Falgout and J. Jones, "Multigrid on Massively Parallel Architectures", 1999.
  F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks", 2000.
  K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using Directive/MPI Hybrid Programming Model for GeoFEM Platform on SMP Cluster Architectures", 2001.
All of these are in the computational research area; no visualization papers were found on this topic.
Previous parallel visualization methods (classification by platform):
• Shared-memory machines: J. Nieh and M. Levoy, 1992; P. Lacroute, 1996
• Distributed-memory machines: U. Neumann, 1993; C. M. Wittenbrink and A. K. Somani, 1997
• SMP cluster machines: almost no papers found
SMP Cluster Architecture
[Figure: four SMP nodes (Node-0 ... Node-3), each with eight PEs sharing one memory; the data domain is partitioned across the nodes]
The Earth Simulator
640 SMP nodes, with 8 vector processors in each SMP node
Three-Level Hybrid Parallelization
Criteria to achieve high parallel performance:
• Local operation and no global dependency
• Continuous memory access
• Sufficiently long loops
Vectorization for Each PE: Constructing Vectorizable Loops
• Combine short loops into one long loop by reordering
• Exchange the innermost and outer loops to make the innermost loop longer
• Avoid tree and single/double linked-list data structures, especially in inner loops

Loop combination example:
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++) { p[i][j] = ...; }
becomes
  for (i = 0; i < MAX_N_VERTEX * 3; i++) { p[i/3][i%3] = ...; }

Loop exchange example:
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++) { p[i][j] = ...; }
becomes
  for (j = 0; j < 3; j++)
    for (i = 0; i < MAX_N_VERTEX; i++) { p[i][j] = ...; }

Avoid: single/double linked-list structures, tree structures
Intra-SMP Node Parallelization: OpenMP (http://www.openmp.org)
Multi-coloring for removing the data race [Nakajima et al., 2001]
Example: gradient computation in PVR
  #pragma omp parallel
  for (i = 0; i < num_element; i++) {
    compute Jacobian matrix of the shape function;
    for (j = 0; j < 8; j++)
      for (k = 0; k < 8; k++)
        accumulate the gradient value of vertex j contributed by vertex k;
  }
[Figure: mesh elements on PE#0-PE#3 grouped into four colors (1-4) in a repeating pattern for multi-color ordering]
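A hedged sketch of the multi-coloring idea (illustrative, not the GeoFEM routine, and simplified to one contribution per element vertex): elements are grouped by color so that no two elements of the same color touch the same vertex, letting each color's loop run under OpenMP without a data race on the per-vertex accumulation. The names `accumulate_gradients`, `color_index`, `elem_vertex`, `elem_grad`, and `vertex_grad` are hypothetical.

```c
#include <omp.h>

/* Hypothetical layout: elements are pre-sorted by color;
 * color_index[c] .. color_index[c+1]-1 are the elements of color c.
 * Elements of one color share no vertices, so the accumulation below is race-free. */
void accumulate_gradients(int n_colors, const int *color_index,
                          const int (*elem_vertex)[8],   /* 8 vertices per hexahedral element */
                          const double (*elem_grad)[8],  /* per-element vertex contributions  */
                          double *vertex_grad)           /* global per-vertex gradient        */
{
    for (int c = 0; c < n_colors; ++c) {                 /* colors processed one after another */
        #pragma omp parallel for
        for (int e = color_index[c]; e < color_index[c + 1]; ++e) {
            for (int j = 0; j < 8; ++j)
                vertex_grad[elem_vertex[e][j]] += elem_grad[e][j];  /* no race within a color */
        }
    }
}
```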
Inter-SMP Node Parallelization: MPI
Parallel data structure in GeoFEM
[Figure: partitioned domains with internal nodes, external nodes, and communication between neighboring domains]
Overlapped elements are used to reduce communication among SMP nodes
Overlap removal is necessary for the final results
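Below is a minimal sketch of the external-node (overlap) exchange that such a data structure implies. It is not GeoFEM's actual API; `exchange_external_nodes` and the neighbor/index/buffer names are hypothetical. Each domain sends the values of its internal nodes that neighbors list as external nodes, and receives its own external-node values in return.

```c
#include <mpi.h>

/* Hypothetical halo exchange for one nodal variable:
 * send_index[p] lists internal nodes wanted by neighbor p,
 * recv_index[p] lists this domain's external nodes owned by neighbor p. */
void exchange_external_nodes(double *val,
                             int n_neighbors, const int *neighbor_rank,
                             int *const *send_index, const int *n_send,
                             int *const *recv_index, const int *n_recv,
                             double **send_buf, double **recv_buf,
                             MPI_Comm comm)
{
    MPI_Request reqs[2 * 64];                   /* assumes at most 64 neighbors, for brevity */
    int nr = 0;

    for (int p = 0; p < n_neighbors; ++p) {
        for (int k = 0; k < n_send[p]; ++k)     /* pack internal-node values */
            send_buf[p][k] = val[send_index[p][k]];
        MPI_Isend(send_buf[p], n_send[p], MPI_DOUBLE, neighbor_rank[p], 0, comm, &reqs[nr++]);
        MPI_Irecv(recv_buf[p], n_recv[p], MPI_DOUBLE, neighbor_rank[p], 0, comm, &reqs[nr++]);
    }
    MPI_Waitall(nr, reqs, MPI_STATUSES_IGNORE);

    for (int p = 0; p < n_neighbors; ++p)
        for (int k = 0; k < n_recv[p]; ++k)     /* unpack into external nodes */
            val[recv_index[p][k]] = recv_buf[p][k];
}
```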
Dynamic Load Repartition
Why?
  The initial partition on each PE is the same as for the analysis computation, but the PVR load on each PE differs: rendered voxels often accumulate in small portions of the field during visualization.
  The number of rendered voxels depends on the number of non-empty voxels, the opacity transfer functions, and the viewpoint, all of which are dynamic.
Goal of load balance during PVR: keep an almost equal number of rendered voxels on each PE.
Dynamic Load Repartition
Most previous methods: scattered decomposition [K.-L. Ma et al., 1997]
  Advantage: very good load balance is obtained easily
  Disadvantages:
    A large amount of intermediate results has to be stored
    Large extra memory
    Large extra communication
Our method:
  Assign several contiguous subvolumes to each PE
  Count the number of rendered voxels during the grid transformation process
  Move a subvolume from a PE with a larger number of rendered voxels to a PE with a smaller one
Dynamic Load Repartition
  Assign several contiguous subvolumes to each PE
  Count the number of rendered voxels during the grid transformation process
  Move a subvolume from a PE with a larger number of rendered voxels to a PE with a smaller one
[Figure: subvolume distribution over PE0-PE3, initial partition vs. repartition]
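A hedged sketch of the repartition step described above (an illustrative greedy heuristic, not necessarily the authors' exact algorithm; `repartition`, `counts`, and `owner` are hypothetical names): after counting rendered voxels per subvolume, subvolumes are moved one at a time from the most loaded PE to the least loaded PE until a move would no longer reduce the imbalance.

```c
/* counts[s] = rendered voxels in subvolume s (gathered during grid transformation)
 * owner[s]  = PE currently holding subvolume s
 * Greedy repartition: repeatedly shift one subvolume from the heaviest to the lightest PE. */
void repartition(int n_sub, const long *counts, int *owner, int n_pe)
{
    long load[256] = {0};                        /* assumes n_pe <= 256, for brevity */
    for (int s = 0; s < n_sub; ++s) load[owner[s]] += counts[s];

    for (int iter = 0; iter < n_sub; ++iter) {
        int hi = 0, lo = 0;
        for (int p = 1; p < n_pe; ++p) {
            if (load[p] > load[hi]) hi = p;
            if (load[p] < load[lo]) lo = p;
        }
        /* pick the smallest subvolume on the heaviest PE */
        int best = -1;
        for (int s = 0; s < n_sub; ++s)
            if (owner[s] == hi && (best < 0 || counts[s] < counts[best]))
                best = s;
        if (best < 0 || load[hi] - load[lo] <= counts[best]) break;  /* move would not help */
        owner[best] = lo;
        load[hi] -= counts[best];
        load[lo] += counts[best];
    }
}
```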
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Speedup Test 1
Demonstrate the effect of three-level hybrid parallelization
Dataset: Pin Grid Array (PGA) dataset
  Simulates the von Mises stress distribution on the pin grid board by linear elastostatic analysis
  Data size: 7,869,771 nodes and 7,649,024 elements
  (Data courtesy of H. Okuda and S. Ezure)
Running environment: Hitachi SR8000
  Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
Speedup Test 1
[Figure: top and bottom views] Volume-rendered images showing the equivalent scalar value of stress from the linear elastostatic analysis for a PGA dataset with 7,869,771 nodes and 7,649,024 elements (data courtesy of H. Okuda and S. Ezure).
Speedup Test 1
[Chart: speedup (0-140) vs. number of PEs (0-150), comparing MPI Parallel and Hybrid Parallel]
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel volume rendering module.
Speedup for 1 PE from the original (MPI) version to the vector version (Hybrid): 4.301283. Uniform cubes are used for PVR.
Speedup Test 2
Demonstrate the effect of three-level hybrid parallelization
Test dataset: Core dataset
  Simulates thermal convection in a rotating spherical shell (data courtesy of H. Matsui, GeoFEM)
  Data size: 257,414 nodes and 253,440 elements
Test module: Parallel Surface Rendering module
Running environment: Hitachi SR8000
  Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
Speedup Test 2
Pressure isosurfaces and temperature cross-sections for a core dataset with 257,414 nodes and 253,440 elements. The speedup of our three-level parallel method is 231.7 on 8 nodes (64 PEs) of the SR8000.
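A hedged consistency check (assuming the 231.7 figure is measured against the original single-PE flat-MPI run, the same baseline used for the per-PE vector speedup of 4.00 quoted below): the ideal speedup would then be 64 PEs times 4.00, and

$$\frac{231.7}{64 \times 4.00} \approx 0.91,$$

i.e. roughly 90% efficiency relative to an ideal of 256.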
Speedup Test 2
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel surface rendering module.
[Chart: speedup (0-400) vs. number of PEs (0-140), comparing MPI Parallel and Hybrid Parallel]
Speedup for 1 PE from the original (MPI) version to the vector version (Hybrid): 4.00.
Speedup Test 3
Demonstrate the effect of dynamic load repartition
Dataset: underground water dataset
  Simulates groundwater flow and convection/diffusion transport through heterogeneous porous media
  200 x 100 x 100 region, different water conductivity, with 16,000 / 128,000 / 1,024,000 meshes (Δh = 5.00 / 2.50 / 1.25), 100 timesteps
Running environment: Compaq Alpha 21164 cluster machine (8 PEs, 600 MHz/PE, 512 MB RAM/PE)
Result for mesh 3 (about 10 million cubes and 100 timesteps):
  Without dynamic load repartition: 8.15 seconds per time step on average
  With dynamic load repartition: 3.37 seconds per time step on average
[Figure: effects of convection & diffusion for different mesh sizes (Δh = 5.00, 2.50, 1.25); groundwater flow channel]
Speedup Test 3
Application (2): flow/transport
• 50 x 50 x 50 region
• Different water conductivity for each (Δh = 5)^3 cube
• d/dx = 0.01, = 0 @ x_max
• 100^3 meshes (Δh = 0.50)
• 64 PEs: Hitachi SR2201
Parallel performance, convection & diffusion:
• 13,280 steps for 200 time units
• 10^6 meshes, 1,030,301 nodes
• 3,984 sec elapsed time including communication on Hitachi SR2201 / 64 PEs
  – 3,934 sec real CPU time
  – 98.7% parallel performance
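As a quick check, the quoted parallel performance matches the ratio of real CPU time to elapsed time:

$$\frac{3934}{3984} \approx 0.987 = 98.7\%.$$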
Conclusions and Future Work
Future Work
  Improve the parallel performance of the Visualization Subsystem in GeoFEM:
  ● Improve the parallel performance of the visualization algorithms
  ● Three-level hybrid parallelization based on the SMP cluster architecture
    • Inter-SMP node: MPI
    • Intra-SMP node: OpenMP for parallelization
    • Individual PE: compiler directives for vectorization/pseudo-vectorization
  ● Dynamic load balancing
  Tests on the Earth Simulator
http://www.es.jamstec.go.jp/