Parallel Visualization of Large-Scale Datasets for the Earth Simulator
Li Chen Issei Fujishiro Kengo Nakajima
Research Organization for Information Science and Technology (RIST)
Japan
3rd ACES Workshop, May 5-10, 2002, Maui, Hawaii.
Visualization
Basic Design & Parallel/SMP/Vector Algorithm
Background
Role of Visualization Subsystem
[Diagram: GeoFEM on the Earth Simulator, showing mesh generation, solver, application analysis, and the Visualization Subsystem as software running on the Earth Simulator hardware]
Tools for: (1) Post-processing (2) Data mining, etc.
Background: Requirements
Target 1: Powerful visualization functions
Translate data from numerical forms to visual forms.
Provide the researchers with immense assistance in the process of understanding their computational results.
Target 2: Suitable for large-scale datasets
High parallel performance
Target 3: Available for unstructured datasets
Complicated grids
Target 4: SMP cluster architecture oriented
Effective on the SMP cluster architecture
We have developed many visualization techniques in GeoFEM, for scalar, vector and tensor data fields, to reveal data distribution from many aspects.
Our modules have been parallelized and achieve high parallel performance.
All of our modules are based on unstructured datasets and can be extended to hybrid grids.
A three-level hybrid parallel programming model is adopted in our modules.
Work since the 2nd ACES Workshop (Oct. 2000):
Developed more visualization techniques for GeoFEM; improved parallel performance.
Please visit our poster for details!
Overview
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Parallel Visualization: File Version or "Debugging" Version
[Diagram: file-version pipeline. Mesh files (mesh#0 ... mesh#n-1) -> FEM analysis (FEM-#0 ... FEM-#n-1, each doing I/O, solver, I/O) -> result files (result#0 ... result#n-1) -> visualization processes (VIS-#0 ... VIS-#n-1) -> visualization result files (UCD etc.) and images -> viewer (AVS etc.) on the client. The client-side step includes simplification, combination, etc. Arrows mark input/output and communication.]
Large-Scale Data in GeoFEM
1 km x 1 km x 1 km mesh for a 1000 km x 1000 km x 100 km "local" region
1000 x 1000 x 100 = 10^8 grid points
1 GB/variable/time step, i.e. ~10 GB/time step for 10 variables
Huge: TB scale for 100 steps!
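As a rough consistency check on these figures (assuming 8-byte double-precision values per grid point, an assumption not stated on the slide):

$$10^{8}\ \text{points} \times 8\ \text{B} \approx 0.8\ \text{GB/variable/step},\qquad 10\ \text{variables} \approx 8\text{-}10\ \text{GB/step},\qquad 100\ \text{steps} \approx 1\ \text{TB}.$$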
Parallel Visualization: Memory/Concurrent Version
[Diagram: memory/concurrent pipeline. Mesh files (mesh#0 ... mesh#n-1) -> FEM + Visualization on the GeoFEM platform (FEM-#0 ... FEM-#n-1, each doing I/O, solver, I/O, paired with VIS-#0 ... VIS-#n-1) -> visualization result files (UCD etc.) and images -> viewer (AVS etc.) on the client. Arrows mark input/output and communication.]
Dangerous if the detailed physics is not clear
Parallel Visualization Techniques in GeoFEM
Scalar Field: cross-sectioning, isosurface-fitting, surface-fitting, interval volume-fitting, volume rendering
Vector Field: streamlines, particle tracking, topological map, LIC, volume rendering
Tensor Field: hyperstreamlines
In the following, we take the Parallel Volume Rendering module as an example to demonstrate our strategies for improving parallel performance.
Available June 2002: http://geofem.tokyo.rist.or.jp/
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Design of Visualization Methods
Principles:
  Taking account of parallel performance
  Taking account of huge data size
  Taking account of unstructured grids
Classification of current volume rendering methods:
  Traversal approach: image-order volume rendering (ray casting), object-order volume rendering (cell projection), hybrid-order volume rendering
  Grid type: regular, curvilinear, unstructured
  Projection: parallel, perspective
  Composition approach: from front to back, from back to front
Design of Visualization Methods
Principle:
  Taking account of running concurrently with the computational process
Classification of Parallelism
Object-space parallelism
Partition object space and each PE gets a portion of the dataset. Each PE calculates an image of the sub-volume.
Image-space parallelism
Partition image space and each PE calculates a portion of the whole image.
Time-space parallelism
Partition time space and each PE calculates the images of several timesteps.
Large storage requirement
Slows down the volume rendering process
Design for Parallel Volume Rendering
Grid choices: unstructured / locally refined / octree (hierarchical)
Why not unstructured grid?
  Hard to build a hierarchical structure
  Connectivity information must be found beforehand
  Unstructured grids make image composition and load balancing difficult
  Irregular shape makes sampling slower
Why not regular grid?
One solution: parallel transformation from unstructured to hierarchical
  FEM data -> resampling -> hierarchical data -> ray-casting PVR -> volume-rendered image
[Figure: original GeoFEM meshes distributed over PE#0-PE#17, the background cells overlaid on them, and the resulting voxels]
Accelerated Ray-casting PVR
Inputs: VR parameters, hierarchical datasets
1. Determine sampling and mapping parameters
2. Build branch-on-need octree
3. Generate subimages on each PE:
   for each subvolume
     for j = startj to endj
       for i = starti to endi
         quickly find the voxels intersected by ray (i, j)
         compute (r, g, b) at each intersected voxel based on the volume illumination model and transfer functions
         compute (r, g, b) for pixel (i, j) by front-to-back composition
4. Build topological structure of subvolumes on all PEs
5. Composite subimages from front to back
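To make the per-pixel step concrete, here is a minimal sketch of front-to-back composition along one ray. It is illustrative only, not the authors' implementation; `composite_ray`, `Voxel`, `RGBA`, and the pre-shaded `hits` array are hypothetical stand-ins for the voxel traversal, illumination model, and transfer functions described above.

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } RGBA;   /* color + opacity of one sample */
typedef struct { RGBA sample; } Voxel;       /* hypothetical: pre-shaded intersected voxel */

/* Front-to-back composition along one ray: accumulate color until the ray
 * becomes (nearly) opaque, then stop (early ray termination). */
static RGBA composite_ray(const Voxel *hits, size_t n_hits)
{
    RGBA pix = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t k = 0; k < n_hits && pix.a < 0.99f; ++k) {
        const RGBA s = hits[k].sample;        /* from illumination model + transfer functions */
        const float w = (1.0f - pix.a) * s.a; /* remaining transparency times sample opacity */
        pix.r += w * s.r;
        pix.g += w * s.g;
        pix.b += w * s.b;
        pix.a += w;
    }
    return pix;
}
```

Each PE would run such a loop over the pixels (i, j) of its subimage; the resulting subimages are then composited across PEs in the same front-to-back order.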
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
SMP Cluster Type Architectures
  Earth Simulator
  ASCI hardware
Various types of communication and parallelism: inter-SMP node, intra-SMP node, individual PE
[Figure: SMP cluster, several SMP nodes, each containing multiple PEs that share one memory]
Candidate programming models per level (each PE / intra-node / inter-node): F90 + directives with OpenMP and MPI; MPI + F90; MPI + HPF; HPF
Optimum programming models for the Earth Simulator?
Three-Level Hybrid Parallelization vs. Flat MPI Parallelization
  Flat MPI parallelization: each PE is independent
  Hybrid parallel programming model: based on the memory hierarchy
    • Inter-SMP node: MPI
    • Intra-SMP node: OpenMP for parallelization
    • Individual PE: compiler directives for vectorization/pseudo-vectorization
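A minimal sketch of how the three levels combine in one loop nest. This is illustrative only, not GeoFEM code (GeoFEM itself is written in F90; the C shown here, and the hypothetical function `scale_field`, just mirror the structure): MPI ranks map to SMP nodes, OpenMP threads map to PEs within a node, and the innermost loop is kept long and dependency-free so the compiler can (pseudo-)vectorize it.

```c
#include <mpi.h>
#include <omp.h>

/* Illustrative three-level hybrid pattern:
 *   level 1: one MPI process per SMP node (inter-node communication)
 *   level 2: OpenMP threads over the PEs inside the node
 *   level 3: a long, contiguous innermost loop the compiler can vectorize */
void scale_field(double *v, long n, double factor, MPI_Comm comm)
{
    #pragma omp parallel for          /* intra-node: one chunk per PE */
    for (long i = 0; i < n; ++i)      /* per-PE: long vectorizable loop */
        v[i] *= factor;

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; ++i)
        local += v[i];

    /* inter-node: MPI reduction across SMP nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    (void)global;
}
```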
Flat MPI vs. OpenMP/MPI Hybrid
[Figure: Hybrid, a hierarchical structure with PEs grouped under a shared memory per node, vs. Flat MPI, where each PE is treated as an independent process]
Three-Level Hybrid parallelization
Previous work on hybrid parallelization:
  R. Falgout and J. Jones, "Multigrid on Massively Parallel Architectures", 1999.
  F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks", 2000.
  K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using Directive/MPI Hybrid Programming Model for GeoFEM Platform on SMP Cluster Architectures", 2001.
All of these are in the computational research area; no visualization papers were found on this topic.
Previous parallel visualization methods (classification by platform):
• Shared-memory machines: J. Nieh and M. Levoy, 1992; P. Lacroute, 1996
• Distributed-memory machines: U. Neumann, 1993; C. M. Wittenbrink and A. K. Somani, 1997
• SMP cluster machines: almost no papers found
SMP Cluster Architecture
[Figure: four SMP nodes (Node-0 ... Node-3), each with eight PEs sharing one memory; the data domain is partitioned across the nodes]
The Earth Simulator
640 SMP nodes, with 8 vector processors in each SMP node
Three-Level Hybrid Parallelization
Criteria to achieve high parallel performance:
• Local operation and no global dependency
• Continuous memory access
• Sufficiently long loops
Vectorization for Each PE: Constructing Vectorizable Loops
• Combine short loops into one long loop by reordering
• Exchange the innermost and outer loops to make the innermost loop longer
• Avoid tree and single/double linked-list data structures, especially in inner loops

Loop combination example:
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++) { p[i][j] = ...; }
becomes
  for (i = 0; i < MAX_N_VERTEX * 3; i++) { p[i/3][i%3] = ...; }

Loop exchange example:
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++) { p[i][j] = ...; }
becomes
  for (j = 0; j < 3; j++)
    for (i = 0; i < MAX_N_VERTEX; i++) { p[i][j] = ...; }

Avoid: single/double linked-list structures, tree structures
Intra-SMP Node Parallelization: OpenMP (http://www.openmp.org)
Multi-coloring for removing the data race [Nakajima et al., 2001]
Example: gradient computation in PVR
  #pragma omp parallel
  for (i = 0; i < num_element; i++) {
    compute Jacobian matrix of the shape function;
    for (j = 0; j < 8; j++)
      for (k = 0; k < 8; k++)
        accumulate the gradient value of vertex j contributed by vertex k;
  }
[Figure: mesh elements on PE#0-PE#3 grouped into four colors (1-4) in a repeating pattern for multi-color ordering]
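A hedged sketch of the multi-coloring idea (illustrative, not the GeoFEM routine, and simplified to one contribution per element vertex): elements are grouped by color so that no two elements of the same color touch the same vertex, letting each color's loop run under OpenMP without a data race on the per-vertex accumulation. The names `accumulate_gradients`, `color_index`, `elem_vertex`, `elem_grad`, and `vertex_grad` are hypothetical.

```c
#include <omp.h>

/* Hypothetical layout: elements are pre-sorted by color;
 * color_index[c] .. color_index[c+1]-1 are the elements of color c.
 * Elements of one color share no vertices, so the accumulation below is race-free. */
void accumulate_gradients(int n_colors, const int *color_index,
                          const int (*elem_vertex)[8],   /* 8 vertices per hexahedral element */
                          const double (*elem_grad)[8],  /* per-element vertex contributions  */
                          double *vertex_grad)           /* global per-vertex gradient        */
{
    for (int c = 0; c < n_colors; ++c) {                 /* colors processed one after another */
        #pragma omp parallel for
        for (int e = color_index[c]; e < color_index[c + 1]; ++e) {
            for (int j = 0; j < 8; ++j)
                vertex_grad[elem_vertex[e][j]] += elem_grad[e][j];  /* no race within a color */
        }
    }
}
```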
Inter-SMP Node Parallelization: MPI
Parallel data structure in GeoFEM
[Figure: partitioned domains with internal nodes, external nodes, and communication between neighboring domains]
Overlapped elements are used to reduce communication among SMP nodes
Overlap removal is necessary for the final results
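Below is a minimal sketch of the external-node (overlap) exchange that such a data structure implies. It is not GeoFEM's actual API; `exchange_external_nodes` and the neighbor/index/buffer names are hypothetical. Each domain sends the values of its internal nodes that neighbors list as external nodes, and receives its own external-node values in return.

```c
#include <mpi.h>

/* Hypothetical halo exchange for one nodal variable:
 * send_index[p] lists internal nodes wanted by neighbor p,
 * recv_index[p] lists this domain's external nodes owned by neighbor p. */
void exchange_external_nodes(double *val,
                             int n_neighbors, const int *neighbor_rank,
                             int *const *send_index, const int *n_send,
                             int *const *recv_index, const int *n_recv,
                             double **send_buf, double **recv_buf,
                             MPI_Comm comm)
{
    MPI_Request reqs[2 * 64];                   /* assumes at most 64 neighbors, for brevity */
    int nr = 0;

    for (int p = 0; p < n_neighbors; ++p) {
        for (int k = 0; k < n_send[p]; ++k)     /* pack internal-node values */
            send_buf[p][k] = val[send_index[p][k]];
        MPI_Isend(send_buf[p], n_send[p], MPI_DOUBLE, neighbor_rank[p], 0, comm, &reqs[nr++]);
        MPI_Irecv(recv_buf[p], n_recv[p], MPI_DOUBLE, neighbor_rank[p], 0, comm, &reqs[nr++]);
    }
    MPI_Waitall(nr, reqs, MPI_STATUSES_IGNORE);

    for (int p = 0; p < n_neighbors; ++p)
        for (int k = 0; k < n_recv[p]; ++k)     /* unpack into external nodes */
            val[recv_index[p][k]] = recv_buf[p][k];
}
```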
Dynamic Load Repartition
Why?
  The initial partition on each PE is the same as for the analysis computation, but the PVR load on each PE differs: rendered voxels often accumulate in small portions of the field during visualization.
  The number of rendered voxels depends on the number of non-empty voxels, the opacity transfer functions, and the viewpoint, all of which are dynamic.
Goal of load balance during PVR: keep an almost equal number of rendered voxels on each PE.
Dynamic Load Repartition
Most previous methods: scattered decomposition [K.-L. Ma et al., 1997]
  Advantage: very good load balance is obtained easily
  Disadvantages:
    A large amount of intermediate results has to be stored
    Large extra memory
    Large extra communication
Our method:
  Assign several contiguous subvolumes to each PE
  Count the number of rendered voxels during the grid transformation process
  Move a subvolume from a PE with a larger number of rendered voxels to a PE with a smaller one
Dynamic Load Repartition
  Assign several contiguous subvolumes to each PE
  Count the number of rendered voxels during the grid transformation process
  Move a subvolume from a PE with a larger number of rendered voxels to a PE with a smaller one
[Figure: subvolume distribution over PE0-PE3, initial partition vs. repartition]
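A hedged sketch of the repartition step described above (an illustrative greedy heuristic, not necessarily the authors' exact algorithm; `repartition`, `counts`, and `owner` are hypothetical names): after counting rendered voxels per subvolume, subvolumes are moved one at a time from the most loaded PE to the least loaded PE until a move would no longer reduce the imbalance.

```c
/* counts[s] = rendered voxels in subvolume s (gathered during grid transformation)
 * owner[s]  = PE currently holding subvolume s
 * Greedy repartition: repeatedly shift one subvolume from the heaviest to the lightest PE. */
void repartition(int n_sub, const long *counts, int *owner, int n_pe)
{
    long load[256] = {0};                        /* assumes n_pe <= 256, for brevity */
    for (int s = 0; s < n_sub; ++s) load[owner[s]] += counts[s];

    for (int iter = 0; iter < n_sub; ++iter) {
        int hi = 0, lo = 0;
        for (int p = 1; p < n_pe; ++p) {
            if (load[p] > load[hi]) hi = p;
            if (load[p] < load[lo]) lo = p;
        }
        /* pick the smallest subvolume on the heaviest PE */
        int best = -1;
        for (int s = 0; s < n_sub; ++s)
            if (owner[s] == hi && (best < 0 || counts[s] < counts[best]))
                best = s;
        if (best < 0 || load[hi] - load[lo] <= counts[best]) break;  /* move would not help */
        owner[best] = lo;
        load[hi] -= counts[best];
        load[lo] += counts[best];
    }
}
```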
• Visualization Subsystem in GeoFEM
• Newly Developed Parallel Volume Rendering (PVR)
  – Algorithm
  – Parallel/Vector Efficiency
• Examples
• Future Work
Speedup Test 1
Demonstrate the effect of three-level hybrid parallelization
Dataset: Pin Grid Array (PGA) dataset
  Simulates the von Mises stress distribution on the pin grid board by linear elastostatic analysis
  Data size: 7,869,771 nodes and 7,649,024 elements
  (Data courtesy of H. Okuda and S. Ezure)
Running environment: Hitachi SR8000
  Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
Speedup Test 1
[Figure: top and bottom views] Volume-rendered images showing the equivalent scalar value of stress from the linear elastostatic analysis for a PGA dataset with 7,869,771 nodes and 7,649,024 elements (data courtesy of H. Okuda and S. Ezure).
Speedup Test 1
[Chart: speedup (0-140) vs. number of PEs (0-150), comparing MPI Parallel and Hybrid Parallel]
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel volume rendering module.
Speedup for 1 PE from the original (MPI) version to the vector version (Hybrid): 4.301283. Uniform cubes are used for PVR.
Speedup Test 2
Demonstrate the effect of three-level hybrid parallelization
Test dataset: Core dataset
  Simulates thermal convection in a rotating spherical shell (data courtesy of H. Matsui, GeoFEM)
  Data size: 257,414 nodes and 253,440 elements
Test module: Parallel Surface Rendering module
Running environment: Hitachi SR8000
  Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
Speedup Test 2
Pressure isosurfaces and temperature cross-sections for a core dataset with 257,414 nodes and 253,440 elements. The speedup of our three-level parallel method is 231.7 on 8 nodes (64 PEs) of the SR8000.
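A hedged consistency check (assuming the 231.7 figure is measured against the original single-PE flat-MPI run, the same baseline used for the per-PE vector speedup of 4.00 quoted below): the ideal speedup would then be 64 PEs times 4.00, and

$$\frac{231.7}{64 \times 4.00} \approx 0.91,$$

i.e. roughly 90% efficiency relative to an ideal of 256.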
Speedup Test 2
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel surface rendering module.
[Chart: speedup (0-400) vs. number of PEs (0-140), comparing MPI Parallel and Hybrid Parallel]
Speedup for 1 PE from the original (MPI) version to the vector version (Hybrid): 4.00.
Speedup Test 3
Demonstrate the effect of dynamic load repartition
Dataset: underground water dataset
  Simulates groundwater flow and convection/diffusion transport through heterogeneous porous media
  200 x 100 x 100 region, different water conductivity, with 16,000 / 128,000 / 1,024,000 meshes (Δh = 5.00 / 2.50 / 1.25), 100 timesteps
Running environment: Compaq Alpha 21164 cluster machine (8 PEs, 600 MHz/PE, 512 MB RAM/PE)
Result for mesh 3 (about 10 million cubes and 100 timesteps):
  Without dynamic load repartition: 8.15 seconds per time step on average
  With dynamic load repartition: 3.37 seconds per time step on average
[Figure: effects of convection & diffusion for different mesh sizes (Δh = 5.00, 2.50, 1.25); groundwater flow channel]
Speedup Test 3
Application (2): flow/transport
• 50 x 50 x 50 region
• Different water conductivity for each (Δh = 5)^3 cube
• d/dx = 0.01, = 0 @ x_max
• 100^3 meshes (Δh = 0.50)
• 64 PEs: Hitachi SR2201
Parallel performance, convection & diffusion:
• 13,280 steps for 200 time units
• 10^6 meshes, 1,030,301 nodes
• 3,984 sec elapsed time including communication on Hitachi SR2201 / 64 PEs
  – 3,934 sec real CPU time
  – 98.7% parallel performance
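As a quick check, the quoted parallel performance matches the ratio of real CPU time to elapsed time:

$$\frac{3934}{3984} \approx 0.987 = 98.7\%.$$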
Conclusions and Future Work
Future Work
  Improve the parallel performance of the Visualization Subsystem in GeoFEM:
  ● Improve the parallel performance of the visualization algorithms
  ● Three-level hybrid parallelization based on the SMP cluster architecture
    • Inter-SMP node: MPI
    • Intra-SMP node: OpenMP for parallelization
    • Individual PE: compiler directives for vectorization/pseudo-vectorization
  ● Dynamic load balancing
  Tests on the Earth Simulator
http://www.es.jamstec.go.jp/