DMI UpdateWWW.DMI.DK
Leif Laursen ( [email protected] )
Jan Boerhout ( [email protected] )
CAS2K3, September 7-11, 2003
Annecy, France
Danish Meteorological Institute
• DMI is the national weather service for Denmark, Greenland and the Faeroes.
• Weather forecasting, Oceanography, Climate Research and Environmental studies
• Use of numerical models in all areas
• Increased used of automatic products
• Demanding high availability of systems
32 Kbyte/s
10 Mbyte/s
24 processor
SGI
ORIGIN 200
data processing
graphics
verification
operational database
NEC-SX6
preprocessing
analysis
initialisation
forecast
postprocessing
Mass storage device
GTS-observations
ECMWF boundary files
00Z ECMWFboundaries
Valid time start end Model Forecast
length Valid time start end
00 0140 0200 DMI-HIRLAM-G 60 12 1340 1400 00 0143 0205 DMI-HIRLAM-E 48 12 1343 1405 00 0230 0245 DMI-HIRLAM-D 36 12 1430 1445 00 0255 0310 DMI-HIRLAM-N 36 12 1455 1505 06 0737 0800 DMI-HIRLAM-G 60 18 1937 2000 06 0743 0805 DMI-HIRLAM-E 48 18 1943 2005 00 1100 1105 DMI-HIRLAM-G 3 12 2245 2250 03 1105 1115 DMI-HIRLAM-G 3 15 2250 2300 06 1115 1125 DMI-HIRLAM-G 3 18 2300 2310 09 1125 1135 DMI-HIRLAM-G 3 21 2310 2320 03 1135 1140 DMI-HIRLAM-E 3 15 2320 2325 06 1140 1145 DMI-HIRLAM-E 3 18 2325 2330 09 1145 1150 DMI-HIRLAM-E 3 21 2330 2335 03 1147 1149 DMI-HIRLAM-D 3 15 2335 2337 06 1149 1151 DMI-HIRLAM-D 3 18 2337 2339 09 1151 1153 DMI-HIRLAM-D 3 21 2339 2341 03 1153 1155 DMI-HIRLAM-N 3 15 2341 2343 06 1155 1157 DMI-HIRLAM-N 3 18 2343 2345 09 1157 1159 DMI-HIRLAM-N 3 21 2345 2345
12Z ECMWFboundaries
06Z ECMWFboundaries
18Z ECMWFboundaries
Evolution in RMS for MSLP
Quality of 24h forecasts of 10m wind speeds >= 8 m/s
Weibull distributions for 24 hour forecasts E, D, ECMWF and UKMO is also shown as well as curve for the observations.
The new NEC-SX6 computer at DMI
April 2002 March 2003
16+2 60+2
Memory 96 Gbyte 320 Gbyte
Peak 128 Gflops+ 16 480 Gflops+ 16
Increase 4 15
Disc 1 Tbyte 4 Tbyte
Front end2 AzusA systems with 4 cpu’s and 4 Gbyte each
PLUS 2 AsamA systems with 4 cpu’s and 8 Gbyte each
Total peak 25.6 Gflops 25.6+51.2 Gflops
SXSX--66/8/864GB64GB
AsAmA AsAmA 44 CPU CPU 88GBGBAsAmA AsAmA 44 CPU CPU 88GBGB
80 x 80FC Switch
8*4
X 8
1.1TB
1TB
1TB 1.1TB1.1TB
1TB1TB
1TBX 8
X 8
SIOXHP L-class
DMI Phase2 ConfigurationDMI Phase2 ConfigurationSXSX--66//60M860M8, , 320320GB GB
SXSX--66/8/83232GBGB
SXSX--66/8/864GB64GB
SXSX--66/8/83232GBGB
SXSX--66/8/83232GBGB
SXSX--66/8/83232GBGB
SXSX--66/8/83232GBGB
SXSX--66/8/83232GBGB
AsAmA AsAmA 44 CPU CPU 88GBGBAsAmA AsAmA 44 CPU CPU 88GBGB
1.1TB
1TB
1TB 1.1TB1.1TB
1TB1TB
1TB
X 8
AAzuszusA A 44 CPU CPU 88GBGBAAzuszusA A 44 CPU CPU 88GBGB
AAzuszusA A 44 CPU CPU 88GBGBAAzuszusA A 44 CPU CPU 88GBGB
X 4 X 4
GE
FEGE
FE
FE
GE
IXS
Some events during the migration to the NEC-SX6
• Oct. 01: Signature of contract between NEC and DMI• April 02: Upgrade (advection scheme for q, CW and TKE) • May 02: Installation of phase 1 of SX6• May 02: Parallel system on SX6 • June 02: DMI-HIRLAM-I (0.014 degree, 602x600 grid) on SX-6 • July 02: Stability test passed • Sep. 02: Operational suite on SX6, later removal of SX4• Sep. 02: Testing of new developments (diff. and convection)• Dec. 02: Upgrade: 40 levels, reduced time step, AMSU-A data• Jan. 03: Revised contract between NEC and DMI • Mar. 03: Installation of phase 2 of SX6• July 03: Stability test passed. • Sep. 03: Improvement in data-assimilation (FGAT, QuikScat etc.)• Early 04: New operational HIRLAM set-up using 6 nodes
HIRLAM Scalability Optimization
• Methods
• Implementation
• Performance
Optimization Focus
• Data transposition– from 2D to FFT distribution and reverse– from FFT to TRI distribution and reverse
• Exchange of halo points– between north and south– between east and west
• GRIB File I/O• Statistics
Approach
• First attempt: straight-forward conversion from SHMEM to MPI-2 put/get calls– it works, but:– too much overhead due to fine granularity
• Redesign of transposition and halo swap routinesless and larger messagesindependent message passing process groups
2D Sub Grids
• HIRLAM sub grid definition in TWOD data distribution
• Processors:
0 1 2
3 4 5
6 7 8
9 10 11
nprocynprocxnproc
lati
tud
e
longitude
levels
Original FFT Sub Grids
• HIRLAM sub grid definition in FFT data distribution
• Each processor handles slabs of full longitude lines
lati
tud
e
longitude
levels
2D↔FFT Redistribution
Sub grid data to be distributed to all processors:
send-receive pairs
4
2nproc
lati
tud
e
longitude
levels
3 4 5
2D↔FFT Redistribution
• Sub grids in east-west direction form full longitude lines
• nprocy independent sets of nprocx2 send-receive pairs, or:
send-receive pairs
• nprocy x less messages
nprocy
nproc 2
lati
tud
e
longitude
levels
5
4
3
Transpositions 2D↔FFT↔TRI
0 1 2
3 4 5
6 7 8
9 10 11
2
1
0
5
4
3
8
7
6
11
10
9
2 5 8 111 4 7 10
0 3 6 9
2D FFT TRI
lati
tud
e
longitude
levels
MPI Methods
• Transfer Methods– Remote Memory Access: mpi_put, mpi_get– Async Point-to-Point: mpi_isend, mpi_irecv– All-to-All: mpi_alltoallv,
mpi_alltoallw
• Buffering vs. direct– Explicit buffering– MPI derived types
(Method selection by environment variables)
Test grid Details
Parameter Value Notes
Longitudes 602
Latitudes 568
Levels 60
NSTOP 40 steps
Initialization none
Time step 180 seconds
Performance
Parallel Speedup on NEC SX-6
• Cluster of 8 NEC SX-6 nodes at DMI
• Up to 60 processors:
7 nodes with 8 processors per node
1 node with 4 processors
• Parallel efficiency 78% on 60 processors
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
0 10 20 30 40 50 60
Processors
Sp
eed
up
ideal original optimized
Performance - Observations
• New data redistribution method much more efficient (78% vs. 45% on 60 processors)
• No performance advantage with RMA (one-sided MP) or All-to-All over plain Point-to-Point method
• Elegant code with MPI derived types, but:• Explicit buffering faster
Questions?
• Thank you!