Applying Automated Memory Analysis to Improve the Iterative Solver in the Parallel Ocean Program

John M. Dennis: dennis@ucar.edu
Elizabeth R. Jessup: [email protected]
April 5, 2006
April 5, 2006 Petascale Computation for the Geosciences Workshop
Motivation

- Outgrowth of PhD thesis: memory-efficient iterative solvers
  - Data movement is expensive
  - Developed techniques to improve memory efficiency
- Apply automated memory analysis to POP
- Parallel Ocean Program (POP) solver
  - Large % of execution time
  - Scalability issues
Outline:

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
Automated Memory Analysis?

- Analyze an algorithm written in Matlab
- Predicts the data movement if the algorithm were written in C/C++ or Fortran -> the minimum required
- Predictions allow us to:
  - Evaluate design choices
  - Guide performance tuning
POP using 20x24 blocks (gx1v3)

- POP data structure: flexible block structure with land-'block' elimination
- Small blocks:
  - Better load balance and land-block elimination
  - Larger halo overhead
- Larger blocks:
  - Smaller halo overhead
  - Load imbalance
  - No land-block elimination
- Grid resolutions: test (128x192), gx1v3 (320x384)
Alternate Data Structure

2D data structure
- Advantages:
  - Regular stride-1 access
  - Compact form of the stencil operator
- Disadvantages:
  - Includes land points
  - Problem-specific data structure

1D data structure
- Advantages:
  - No more land points
  - General data structure
- Disadvantages:
  - Indirect addressing
  - Larger stencil operator
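To make the trade-off concrete, here is a minimal Python sketch (illustrative, not POP code: the 5-point operator, the zero-padded "land" slot, and the neighbor table are assumptions for the example). The 2D form enjoys regular stride-1 access but computes land points too; the 1D form stores only ocean points and reaches neighbors through an index table, i.e. indirect addressing.

```python
import numpy as np

def stencil_2d(x, a):
    """2D form: regular stride-1 access, but land points are computed too."""
    y = np.zeros_like(x)
    y[1:-1, 1:-1] = (a * x[1:-1, 1:-1]
                     + x[:-2, 1:-1] + x[2:, 1:-1]
                     + x[1:-1, :-2] + x[1:-1, 2:])
    return y

def stencil_1d(xo, nbr, a):
    """1D form: only ocean points stored; neighbors via indirect addressing.

    xo  : values at ocean points (1D); the last slot is a zero pad that
          stands in for land/halo neighbors (an assumption of this sketch)
    nbr : (n_ocean, 4) array of neighbor indices into xo
    """
    return a * xo + xo[nbr].sum(axis=1)
```

The index table `nbr` is built once from the land mask; after that, every solver iteration touches only ocean points.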
Outline:

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
Data movement

- Working set load size (WSL): data loaded from main memory into L1 cache
- Measure using PAPI (WSL_M)
- Compute platforms:
  - Sun Ultra II (400 MHz)
  - IBM POWER4 (1.3 GHz)
  - SGI R14K (500 MHz)
- Compare with prediction (WSL_P)
Predicting Data Movement

| solver w/2D (Matlab) | solver w/1D (Matlab) |
|----------------------|----------------------|
| 4902 Kbytes          | 3218 Kbytes          |

The predicted WSL_P shows the 1D data structure gives a 34% reduction in data movement.
Measured versus Predicted data movement

| Solver     | WSL_P | Ultra II WSL_M | err | POWER4 WSL_M | err | R14K WSL_M | err |
|------------|-------|----------------|-----|--------------|-----|------------|-----|
| PCG2+2D v1 | 4902  | 5163           | 5%  | 5068         | 3%  | 5728       | 17% |
| PCG2+2D v2 | 4902  | 4905           | 0%  | 4865         | -1% | 4854       | -1% |
| PCG2+1D    | 3218  | 3164           | -2% | 3335         | 4%  | 3473       | 8%  |
PCG2+2D v1 shows excessive data movement: its measured WSL_M exceeds the prediction on every platform, by up to 17% on the R14K.
Two blocks of source code

PCG2+2D v1 (w0 array accessed again after the loop!):

```fortran
do i=1,nblocks
   p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
   q(:,:,i) = A*p(:,:,i)
   w0(:,:,i) = q(:,:,i)*p(:,:,i)
enddo
delta = gsum(w0,lmask)
```

PCG2+2D v2 (extra access of w0 eliminated):

```fortran
ldelta = 0
do i=1,nblocks
   p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
   q(:,:,i) = A*p(:,:,i)
   w0 = q(:,:,i)*p(:,:,i)
   ldelta = ldelta + lsum(w0,lmask)
enddo
delta = gsum(ldelta)
```
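The difference between the two versions can be sketched in Python (a stand-in for the Fortran, with NumPy arrays in place of POP's block arrays and the land mask omitted for brevity): v1 writes block products into a full-size w0 and then re-reads all of it in the global sum, while v2 fuses the reduction into the loop so only a block-sized scratch array is touched.

```python
import numpy as np

def dot_v1(q, p):
    """q, p: (ny, nx, nblocks). Stores all products, then re-reads w0."""
    w0 = np.empty_like(q)
    for i in range(q.shape[2]):
        w0[:, :, i] = q[:, :, i] * p[:, :, i]
    return w0.sum()            # second traversal of the whole w0 array

def dot_v2(q, p):
    """Fused form: reduce each block immediately; no large temporary."""
    delta = 0.0
    for i in range(q.shape[2]):
        w0 = q[:, :, i] * p[:, :, i]   # block-sized scratch stays in cache
        delta += w0.sum()
    return delta
```

Both return the same inner product; only the data movement differs.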
After the fix, PCG2+2D v2's measured data movement matches the prediction: 4905 vs 4902 Kbytes on the Ultra II, and within 1% on all three platforms.
Outline:

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
Using 1D data structures in POP2 solver (serial)

- Replace solvers.F90
- Measure execution time on cache-based microprocessors
- Examine two CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test (128x192 grid points) with 16x16 blocks
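For reference, the two-inner-product variant is ordinary preconditioned conjugate gradients with a diagonal preconditioner. A minimal Python sketch follows (a textbook PCG, not POP's solvers.F90; PCG1 restructures the recurrences so that only one inner product, and hence one global reduction, is needed per iteration, which is not shown here):

```python
import numpy as np

def pcg2(A, b, M_diag, tol=1e-10, maxit=200):
    """Two-inner-product PCG with a diagonal (Jacobi) preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = r / M_diag                  # apply diagonal preconditioner
    p = z.copy()
    rz = r @ z                      # inner product 1
    for _ in range(maxit):
        q = A @ p
        alpha = rz / (p @ q)        # inner product 2
        x += alpha * p
        r -= alpha * q
        if np.linalg.norm(r) < tol:
            break
        z = r / M_diag
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x
```

Each of the two inner products implies a global sum, which is why reducing their number matters once the solver runs in parallel.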
Serial execution time on IBM POWER4 (test)

[Bar chart: seconds for 20 timesteps on the POWER4 (1.3 GHz) for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D]

56% reduction in cost/iteration
Outline:

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
Using 1D data structure in POP2 solver (parallel)

- New parallel halo update
- Examine several CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL)
  http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioners: diagonal
  - Hypre integration -> work in progress
Solver execution time for POP2 (20x24) on BG/L (gx1v3)

[Bar chart: seconds for 200 timesteps on 64 processors for PCG2+2D, PCG1+2D, PCG2+1D, PCG1+1D, and Hypre (PCG+Diag); annotations: 48% cost/iteration, 27% cost/iteration]
64 processors != PetaScale
Outline:

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
0.1 degree POP

- Global eddy-resolving ocean model
- Computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalance and poor scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluate using benchmark: 1 day / internal grid / 7-minute timestep
Partitioning with Space-filling Curves

- Map 2D -> 1D
- Curves come in a variety of sizes (Nb blocks on a side):
  - Hilbert (Nb=2^n)
  - Peano (Nb=3^m)
  - Cinco (Nb=5^p) [New]
  - Hilbert-Peano (Nb=2^n 3^m)
  - Hilbert-Peano-Cinco (Nb=2^n 3^m 5^p) [New]
- Partition the resulting 1D array
Partitioning with SFC

Partition for 3 processors
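A hedged sketch of the idea (illustrative Python, not POP's partitioner): map each block's (x, y) position to a distance along a Hilbert curve, then cut the resulting 1D ordering into contiguous chunks, one per processor. Because consecutive curve positions are grid neighbors, each processor's chunk stays spatially compact.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve of side 2**order to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant as needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def sfc_partition(order, nprocs):
    """Walk blocks in curve order; split into nprocs contiguous chunks."""
    n = (1 << order) ** 2
    chunk = -(-n // nprocs)         # ceiling division
    return {hilbert_d2xy(order, d): d // chunk for d in range(n)}
```

In POP the walk would additionally skip land blocks before the split, which is what makes the resulting partition load-balanced.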
POP using 20x24 blocks (gx1v3)
POP (gx1v3) + Space-filling curve
Space-filling curve (Hilbert Nb=24)
Remove Land blocks
Space-filling curve partition for 8 processors
POP 0.1 degree benchmark on Blue Gene/L
POP 0.1 degree benchmark
Courtesy of Y. Yoshida, M. Taylor, P. Worley
Conclusions

- 1D data structures in the Barotropic Solver:
  - No more land points
  - Reduced execution time vs the 2D data structure:
    - 48% reduction in solver time (64 procs, BG/L)
    - 9.5% reduction in total time (64 procs, POWER4)
  - Allows use of solver/preconditioner packages; implementation quality is critical
- Automated Memory Analysis (SLAMM):
  - Evaluate design choices
  - Guide performance tuning
Conclusions (cont'd)

- Good scalability to 32K processors on BG/L
- Increased simulation rate by 2x on 32K processors:
  - SFC partitioning
  - 1D data structure in solver
  - Modified 7 source files
- Future work:
  - Improve scalability (55% efficiency from 1K to 32K); better preconditioners
  - Improve load balance: different block sizes, improved partitioning algorithm
Acknowledgements/Questions?

Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)

Blue Gene/L time:
- NSF MRI Grant
- NCAR
- University of Colorado
- IBM (SUR) program
- BGW Consortium Days
- IBM Research (Watson)
Serial execution time on multiple platforms (test)

[Bar chart: seconds for 20 timesteps for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D on IBM POWER4 (1.3 GHz), IBM POWER5 (1.9 GHz), IBM PPC 440 (700 MHz), AMD Opteron (2.2 GHz), and Intel P4 (2.0 GHz)]
Total execution time for POP2 (40x48) on POWER4 (gx1v3)

[Bar chart: seconds for 200 timesteps on 64 processors for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D]

9.5% reduction: eliminates the need for ~216,000 CPU hours per year @ NCAR
POP 0.1 degree

| blocksize | Nb  | Nb^2  | Max \|\| |
|-----------|-----|-------|----------|
| 36x24     | 100 | 10000 | 7545     |
| 30x20     | 120 | 14400 | 10705    |
| 24x16     | 150 | 22500 | 16528    |
| 18x12     | 200 | 40000 | 28972    |
| 15x10     | 240 | 57600 | 41352    |
| 12x8      | 300 | 90000 | 64074    |

Increasing parallelism -> decreasing overhead
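The Nb and Nb^2 columns follow directly from dividing the 3600 x 2400 grid by the block size; a quick sketch of that arithmetic (the Max-parallelism column reflects land-block elimination and cannot be derived from the grid dimensions alone):

```python
# Block-count arithmetic for the 0.1-degree grid (3600 x 2400 points).
# Nb here is blocks per dimension (equal in x and y for the sizes in the
# table); the product is the total block count before land elimination.
def block_counts(nx, ny, bx, by):
    """Return (blocks across x, blocks across y, total blocks)."""
    assert nx % bx == 0 and ny % by == 0, "block size must divide the grid"
    return nx // bx, ny // by, (nx // bx) * (ny // by)
```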