Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)

Dan Schaffer, NOAA Forecast Systems Laboratory (FSL)
August 2001


Page 1: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)

Dan Schaffer

NOAA Forecast Systems Laboratory (FSL)

August 2001

Page 2: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion

Page 3: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Who we are

• Mark Govett

• Leslie Hart

• Tom Henderson

• Jacques Middlecoff

• Dan Schaffer

• Developing SMS for 20+ man years

Page 4: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Intro to SMS

• Overview
  – Directive based
    • FORTRAN comments
    • Enables single source parallelization
  – Distributed or shared memory machines
  – Performance portability
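
As a minimal illustration of the single-source idea (the variables u, v, and im here are hypothetical, and CSMS$EXCHANGE is one of the directives listed later in the talk): an SMS directive is just a FORTRAN comment, so a build that skips the SMS pre-processor compiles the lines below as ordinary serial code, while an SMS build turns the directive into a halo update.

C     To a plain FORTRAN compiler the next line is only a comment;
C     under SMS it exchanges u's halo points before they are read below.
CSMS$EXCHANGE(u)
      do i = 2, im-1
        v(i) = u(i-1) + u(i+1)
      end do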

Page 5: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Distributed Memory Parallelism

Page 6: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Code Parallelization using SMS

[Flow diagram: Original Serial Code → (Add SMS Directives) → SMS Serial Code → PPP (Parallel Pre-Processor) → SMS Parallel Code → Parallel Executable; the SMS Serial Code can also be compiled directly into a Serial Executable.]

Page 7: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Low-Level SMS

[Layer diagram: SMS Parallel Code sits on top of the SMS low-level libraries (NNT, SRS, SST, FDA Library, Spectral Library, Parallel I/O), which are in turn built on MPI, SHMEM, etc.]

Page 8: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Intro to SMS (contd)

– Support for all of F77 plus much of F90, including:
  • Dynamic memory allocation
  • Modules (partially supported)
  • User-defined types
– Supported Machines:
  • COMPAQ Alpha-Linux Cluster (FSL “Jet”)
  • PC-Linux Cluster
  • SUN Sparcstation
  • SGI Origin 2000
  • IBM SP-2

Page 9: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Intro to SMS (contd)

• Models Parallelized
  – Ocean: ROMS, HYCOM, POM
  – Mesoscale Weather: FSL RUC, FSL QNH, NWS Eta, Taiwan TFS (Nested)
  – Global Weather: Taiwan GFS (Spectral)
  – Atmospheric Chemistry: NOAA Aeronomy Lab

Page 10: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Key SMS Directives

• Data Decomposition
  – csms$declare_decomp
  – csms$create_decomp
  – csms$distribute
• Communication
  – csms$exchange
  – csms$reduce
• Index Translation
  – csms$parallel
• Incremental Parallelization
  – csms$serial
• Performance Tuning
  – csms$flush_output
• Debugging Support
  – csms$reduce (bitwise exact)
  – csms$compare_var
  – csms$check_halo

Page 11: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS Serial Code

      program DYNAMIC_MEMORY_EXAMPLE
      parameter(IM = 15)
CSMS$DECLARE_DECOMP(my_dh)
CSMS$DISTRIBUTE(my_dh, 1) BEGIN
      real, allocatable :: x(:)
      real, allocatable :: y(:)
      real xsum
CSMS$DISTRIBUTE END
CSMS$CREATE_DECOMP(my_dh, <IM>, <2>)
      allocate(x(im))
      allocate(y(im))
      open (10, file = 'x_in.dat', form='unformatted')
      read (10) x
CSMS$PARALLEL(my_dh, <i>) BEGIN
      do 100 i = 3, 13
        y(i) = x(i) - x(i-1) - x(i+1) - x(i-2) - x(i+2)
 100  continue
CSMS$EXCHANGE(y)
      do 200 i = 3, 13
        x(i) = y(i) + y(i-1) + y(i+1) + y(i-2) + y(i+2)
 200  continue
      xsum = 0.0
      do 300 i = 1, 15
        xsum = xsum + x(i)
 300  continue
CSMS$REDUCE(xsum, SUM)
CSMS$PARALLEL END
      print *,'xsum = ',xsum
      end
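
In this example, the decomposition my_dh distributes the 15-point arrays x and y across processes; the <2> in CSMS$CREATE_DECOMP evidently declares a halo of width 2 to match the i±2 stencil. CSMS$EXCHANGE(y) refreshes each process's halo points before the second loop reads y at neighboring indices, CSMS$PARALLEL translates the global loop bounds into each process's local index range, and CSMS$REDUCE(xsum, SUM) combines the per-process partial sums into the global total that is printed.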

Page 12: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Advanced Features

• Nesting
• Incremental Parallelization
• Debugging Support (run-time configurable); see the sketch below
  – CSMS$REDUCE
    • Enables bit-wise exact reductions
  – CSMS$CHECK_HALO
    • Verifies a halo region is up to date
  – CSMS$COMPARE_VAR
    • Compares variables between simultaneous runs with different numbers of processors
• HYCOM 1-D decomposition parallelized in 9 days
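
A hypothetical sketch of where the debugging directives might be placed; the argument forms shown, and the variables u, v, istr, and iend, are assumptions rather than the documented SMS syntax:

C     Assumed usage: verify u's halo is current before the stencil reads it.
CSMS$CHECK_HALO(u)
      do i = istr, iend
        v(i) = 0.5 * (u(i-1) + u(i+1))
      end do
C     Assumed usage: compare v between two concurrent runs started with
C     different numbers of processors.
CSMS$COMPARE_VAR(v)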

Page 13: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Incremental Parallelization

[Diagram: CSMS$SERIAL brackets a call to un-parallelized code. Decomposed ("local") arrays are gathered into "global" arrays, CALL NOT_PARALLEL(...) runs serially, and the "global" results are scattered back to the "local" arrays.]

SMS Directive: CSMS$SERIAL
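
A minimal sketch of the directive in use, assuming CSMS$SERIAL follows the same BEGIN/END form as the directives in the slide 11 example; NOT_PARALLEL and its arguments are placeholders taken from the diagram above:

CSMS$SERIAL BEGIN
C     Inside this region SMS gathers the decomposed ("local") arrays into
C     "global" copies, runs the enclosed serial code, and scatters the
C     results back so the surrounding code stays parallel.
      call NOT_PARALLEL(x, y)
CSMS$SERIAL END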

Page 14: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Advanced Features (contd)

• Overlapping Output with Computations (FORTRAN Style I/O only)

• Run-time Process Configuration
  – Specify
    • number of processors per decomposed dim, or
    • number of grid points per processor
    • 15% performance boost for HYCOM
  – Support for irregular grids coming soon

Page 15: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS Performance (Eta)

• Eta model run in production at NCEP for use in National Weather Service Forecasts

• 16000 Lines of Code (excluding comments)

• 198 SMS Directives added to the code

Page 16: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Eta Performance

• Performance measured on NCEP SP2
• I/O excluded
• Resolution: 223x365x45
• 88-PE run-time beats NCEP hand-coded MPI by 1%
• 88-PE exchange time beats hand-coded MPI by 17%

Processors Time (sec.) Efficiency

4 406 1.00

16 103 0.99

64 29.3 0.86

88 23.9 0.80

Page 17: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS Performance (HYCOM)

• 4500 Lines of Code (excluding comments)

• 108 OpenMP directives included in the code

• 143 SMS Directives added to the code

Page 18: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


HYCOM Performance

• Performance measured on O2K
• Resolution: 135x256x14
• Serial code runs in 136 seconds

Procs   OpenMP Time (sec.)   Efficiency   SMS Time (sec.)   Efficiency
  1          142                0.96           127              1.07
  8           22.6              0.75            14.5            1.17
 16           12.9              0.66             7.60           1.18

Page 19: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Intro to SMS (contd)

– Extensive documentation available on the web

– New development aided by:
  • Regression test suite
  • Web-based bug tracking system

Page 20: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion

Page 21: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS ROMS Implementation

• Used awk and cpp to convert the code to dynamic memory, simplifying SMS parallelization
• Leveraged existing shared-memory parallelism (loops of the form do I = ISTR, IEND)
• Directives added to handle the NEP scenario
• 13,000 lines of code, 132 SMS directives
• Handled netCDF I/O with CSMS$SERIAL (see the sketch below)
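
A hypothetical illustration of the two points above, following the directive forms from the slide 11 example; the decomposition name roms_dh, the two-dimensional <i>, <j> argument form, and the array and netCDF identifiers are assumptions:

CSMS$PARALLEL(roms_dh, <i>, <j>) BEGIN
C     The existing shared-memory loop bounds are reused; SMS maps the
C     global I/J ranges onto each process's local tile.
      do J = JSTR, JEND
        do I = ISTR, IEND
          rho(I,J) = rho(I,J) + dt * rhs(I,J)
        end do
      end do
CSMS$PARALLEL END

CSMS$SERIAL BEGIN
C     netCDF I/O stays serial: SMS gathers the decomposed array, one
C     process performs the write, and parallel execution then resumes.
      status = nf_put_var_real(ncid, rho_id, rho)
CSMS$SERIAL END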

Page 22: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Results and Performance

• Runs and produces correct answer on all supported SMS machines

• Low Resolution (128x128x30)
  – “Jet”, O2K scaling
  – Run-times for main loop (21 time steps), excluding I/O
• High Resolution (210x550x30)
  – PMEL is using it in production
  – 97% efficiency between 8 and 16 processors on “Jet”

Page 23: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS Low Res ROMS “Jet” Performance

Processors          Time (sec.)   Efficiency
 1 (serial code)     153           1.00
 4                    41.3         0.93
 8                    21.6         0.89
16                    12.6         0.76
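
The Efficiency column is evidently serial time divided by (processors × parallel time); for example, the 4-processor entry is 153 / (4 × 41.3) ≈ 0.93.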

Page 24: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


SMS Low Res ROMS O2K Performance

Processors          Time (sec.)   Efficiency
 1 (serial code)     298           1.00
 8                    41.6         0.90
16                    22.4         0.83

Page 25: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion

Page 26: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Ongoing Work (funding dependent)

• Full F90 Support

• Support for parallel netCDF

• T3E port

• SHMEM implementation on T3E, O2K

• Parallelize other ROMS scenarios

• Implement SMS nested ROMS

• Implement SMS coupled ROMS/COAMPS

Page 27: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Conclusion

• SMS is a high level directive-based tool

• Simple single source parallelization

• Performance optimizations provided

• Strong debugging support included

• Performance beats hand-coded MPI

• SMS is performance portable

Page 28: Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)


Web-Site

www-ad.fsl.noaa.gov/ac/sms.html