Post on 22-Jan-2016

Page 1: Priority Project Performance On Massively Parallel Architectures (POMPA) Nice to meet you!

Priority Project

Performance On Massively Parallel Architectures (POMPA)

Nice to meet you!

COSMO GM10, Moscow

Page 2

Overview

• Motivation

• COSMO code (as seen by computer engineer)

• Important Bottlenecks
    • Memory bandwidth
    • Scaling
    • I/O

• POMPA overview

Page 3

Motivation

• What can you do with more computational power?

    • Number of EPS members (x 2)

    • Resolution (x 1.25)

    • Lead time (x 2)

    • Model complexity (x 2)

Page 4

Motivation

• How to increase computational power?

Efficiency

Algorithm

Computer

POMPA

Page 5

Motivation

• Moore’s law has held since the 1970s and will probably continue to hold

• Up to now we didn’t need to worry too much about adapting our codes. Why should we worry now?

Page 6

Current HPC Platforms

• Research system: Cray XT5 – “Rosa”
    • 3688 AMD hexa-core Opteron @ 2.4 GHz (212 TF)
    • 28.8 TB DDR2 RAM
    • 9.6 GB/s interconnect bandwidth

• Operational system: Cray XT4 – “Buin”
    • 264 AMD quad-core Opteron @ 2.6 GHz (4.6 TF)
    • 2.1 TB DDR RAM
    • 7.6 GB/s interconnect bandwidth

• Old system: Cray XT3 – “Palu”
    • 416 AMD dual-core Opteron @ 2.6 GHz (5.7 TF)
    • 0.83 TB DDR RAM
    • 7.6 GB/s interconnect bandwidth

Source: CSCS

Page 7

The Thermal Wall

• Power ~ Voltage² × Frequency ~ Frequency³

• Clock frequency will not follow Moore’s Law!

Source: Intel
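
The cubic relation on this slide can be made concrete with a small worked example. This is only a sketch: it assumes, as the proportionality on the slide implies, that supply voltage has to scale roughly linearly with clock frequency. The capacitance symbol C and the 20% frequency reduction are illustrative choices, not figures from the presentation.

```latex
\begin{align*}
P &\propto C\,V^{2} f, \qquad V \propto f \;\Rightarrow\; P \propto f^{3} \\
\frac{P(0.8 f)}{P(f)} &= 0.8^{3} \approx 0.51 \\
\text{two cores at } 0.8 f:&\quad \text{throughput} \approx 2 \times 0.8 = 1.6,
\qquad \text{power} \approx 2 \times 0.51 \approx 1.02
\end{align*}
```

Under these assumptions, and only if the code parallelizes perfectly, two slower cores deliver about 1.6 times the throughput for roughly the same power. That is the trade-off pushing hardware towards more cores rather than higher clock rates.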

Page 8

Moore’s Law Reinterpreted

• Number of cores doubles every year while clock speed decreases (not increases)

Source: Wikipedia

Page 9

What are transistors used for?

• AMD Opteron (single-core)

Source: Advanced Micro Devices Inc.

memory (latency avoidance)

load/store/control (latency tolerance)

memory and I/O interface

Page 10

The Memory Gap

• Memory speed only doubles every 6 years!

Source: Hennessy and Patterson, 2006

Page 11

“Brutal Facts of HPC”

• Massive concurrency – increase in number of cores, stagnant or decreasing clock frequency

• Less and “slower” memory per thread – memory bandwidth per instruction/second and per thread will decrease, and memory hierarchies will become more complex

• Only slow improvements of inter-processor and inter-thread communication – interconnect bandwidth will improve only slowly

• Stagnant I/O sub-systems – technology for long-term data storage will stagnate compared to compute performance

• Resilience and fault tolerance – the mean time to failure of a massively parallel system may be short compared to the time to solution of a simulation, so fault-tolerant software layers are needed

We will have to adapt our codes to exploit the power of future HPC architectures!

Source: HP2C

Page 12

Why a new Priority Project?

• Efficient codes may enable new science and save money for operations

• We need to adapt our codes to efficiently run on current / future massively parallel architectures!

• Great opportunity to profit from the momentum and know-how generated by the HP2C or G8 projects and to use synergies (e.g. ICON).

• Consistent with goals of the COSMO Science Plan and similar activities in other consortia.

Page 13

COSMO Code

• How would a computer engineer look at the COSMO code?

Page 14

COSMO Code

• 227’389 lines of Fortran 90 code

[Charts: % code lines and % runtime (C-2 forecast); legend entry: active]

Page 15

Dynamics

Page 16

Key Algorithmic Motifs

• Stencil computations

    do k=1,ke
      do j=1,je
        do i=1,ie
          a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
        end do
      end do
    end do

• Tridiagonal solver (vertical, Thomas algorithm)

    do j=1,je
      ! Modify coefficients
      do k=2,ke
        do i=1,ie
          c(i,j,k) = 1.0 / ( b(i,j,k) - c(i,j,k-1) * a(i,j,k) )
          d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k)
        end do
      end do
      ! Back substitution
      do k=ke-1,1,-1
        do i=1,ie
          x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
        end do
      end do
    end do
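
For readers who want to run the tridiagonal motif in isolation, the following is a minimal sketch of the textbook Thomas algorithm for a single vertical column. It is not the COSMO implementation: the program name, the work arrays `cw` and `dw`, the test matrix and the size `ke = 6` are all assumptions made for illustration, and unlike the slide fragment it includes the first-level and last-level start-up steps.

```fortran
program thomas_demo
  implicit none
  integer, parameter :: wp = kind(1.0d0)   ! working precision
  integer, parameter :: ke = 6             ! number of vertical levels (hypothetical)
  real(wp) :: a(ke), b(ke), c(ke), d(ke)   ! sub-, main and super-diagonal, right-hand side
  real(wp) :: cw(ke), dw(ke), x(ke), x_ref(ke)
  real(wp) :: denom, max_err
  integer  :: k

  ! Build a diagonally dominant test system with a known solution x_ref.
  a = -1.0_wp
  b =  4.0_wp
  c = -1.0_wp
  a(1)  = 0.0_wp                           ! no neighbour below the first level
  c(ke) = 0.0_wp                           ! no neighbour above the last level
  do k = 1, ke
    x_ref(k) = real(k, wp)
  end do
  do k = 1, ke
    d(k) = b(k) * x_ref(k)
    if (k > 1)  d(k) = d(k) + a(k) * x_ref(k-1)
    if (k < ke) d(k) = d(k) + c(k) * x_ref(k+1)
  end do

  ! Forward sweep: eliminate the sub-diagonal.
  cw(1) = c(1) / b(1)
  dw(1) = d(1) / b(1)
  do k = 2, ke
    denom = b(k) - a(k) * cw(k-1)
    cw(k) = c(k) / denom
    dw(k) = (d(k) - a(k) * dw(k-1)) / denom
  end do

  ! Back substitution, starting from the last level.
  x(ke) = dw(ke)
  do k = ke-1, 1, -1
    x(k) = dw(k) - cw(k) * x(k+1)
  end do

  max_err = maxval(abs(x - x_ref))
  print '(a,es10.3)', 'max |x - x_ref| = ', max_err
end program thomas_demo
```

Both sweeps carry a dependency in k, so the vertical direction cannot be vectorized. In a 3D code the parallelism has to come from the horizontal indices, which is why the slide fragment keeps the i loop innermost.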

Page 17

Code / Data Structures

• field(ie,je,ke,nt) [in Fortran the first index varies fastest]

• Optimized for minimal computation (precalculations)

• Optimized for vector machines

• Often repeatedly sweeps over the complete grid (bad cache usage)

• A lot of copy-paste for handling different configurations (difficult to maintain)

• Metric terms and different averaging positions make code complex
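
The remark about index order can be written out as the standard column-major offset formula. The sketch below assumes 1-based indices and ignores halo points; the 24 x 16 x 60 block is the per-core size from the COSMO-2 example elsewhere in the deck and is used here only for illustration.

```latex
\begin{align*}
\mathrm{offset}(i,j,k,n) &= (i-1) + ie\,\bigl[(j-1) + je\,\bigl((k-1) + ke\,(n-1)\bigr)\bigr] \\
\text{stride in } i &= 1, \quad
\text{stride in } j = ie, \quad
\text{stride in } k = ie \cdot je, \quad
\text{stride in } n = ie \cdot je \cdot ke \\
ie = 24,\ je = 16,\ ke = 60 &:\quad
\text{stride in } j = 24, \quad
\text{stride in } k = 384, \quad
\text{stride in } n = 23040
\end{align*}
```

Neighbouring levels in k are therefore hundreds of elements apart in memory, so a vertical sweep over a single column touches a new cache line at nearly every step.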

Page 18

Parallelization Strategy

• How to distribute work onto O(1000) cores?

• 2D-domain decomposition using MPI library calls

• Example: operational COSMO-2

Total: 520 x 350 x 60 gridpoints

Per core: 24 x 16 x 60 gridpoints

Exchange information with MPI

halo/comp = 0.75
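
The slide does not say how the halo-to-compute ratio was obtained. One plausible reconstruction, given here as an assumption rather than as the author's calculation, counts the halo points around a 24 x 16 subdomain with a halo width h. The value h = 3 is hypothetical.

```latex
\begin{align*}
\frac{\text{halo}}{\text{comp}}
  &= \frac{(n_x + 2h)(n_y + 2h) - n_x n_y}{n_x n_y} \\
n_x = 24,\ n_y = 16,\ h = 3 &:\quad
\frac{30 \cdot 22 - 24 \cdot 16}{24 \cdot 16} = \frac{276}{384} \approx 0.72
\end{align*}
```

With these numbers the estimate is of the same size as the 0.75 quoted above. The general point holds either way: at this subdomain size each core handles nearly as many halo points as interior points.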

Page 19

Bottlenecks?

• What are/will be the main bottlenecks of the COSMO code on current/future massively parallel architectures?

• Memory bandwidth

• Scalability

• I/O

Page 20

Memory scaling

• Problem size 102 x 102 x 60 gridpoints (60 cores, similar to COSMO-2)

• Keep number of cores constant, vary number of cores/node used

[Plot: relative runtime (4 cores = 100%)]

Page 21

HP2C: Feasibility Study

• Goal: Investigate how COSMO would have to be implemented in order to reach optimal performance on modern processors

• Tasks
    • understand the code
    • performance model
    • prototype software
    • new software design proposal

• Company: http://www.scs.ch/

• Duration: 4 months (3 months of work)

Page 22

Feasibility Study: Idea

• Focus only on dynamical core (fast wave solver) as it…
    • dominates profiles (30% of runtime)
    • contains the key algorithmic motifs (stencils, tridiagonal solver)
    • is of manageable size (14’000 lines)
    • can be run stand-alone in a meaningful way
    • allows the correctness of the prototype to be verified

Page 23

Feasibility Study: Results

Prototype vs. Original

Page 24

Key Ingredients

• Reduce number of memory accesses (less precalculation)

• Change index order from (i,j,k) to (2,k,i/2,j) or (2,k,j/2,i)
    • cache efficiency in tridiagonal solver
    • don’t load halo into cache

• Use iterators instead of on the fly array position computations

• Merge loops in order to reduce the number of sweeps over full domain

• Vectorize as much of the code as possible
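
The loop-merging ingredient can be illustrated with a small, self-contained sketch. It is not taken from the prototype: the grid size, the temporary field `t` and the second operation (`2.0 * t + b`) are invented for the example. The first version sweeps the domain twice and stores an intermediate 3D field; the fused version does the same arithmetic in one sweep with a scalar temporary.

```fortran
program fuse_loops
  implicit none
  integer, parameter :: ie = 8, je = 6, ke = 4   ! hypothetical local grid size
  real, parameter :: w1 = 0.25, w2 = 0.5, w3 = 0.25
  real :: b(ie,je,ke), t(ie,je,ke), a_split(ie,je,ke), a_fused(ie,je,ke)
  real :: tmp, max_diff
  integer :: i, j, k

  call random_number(b)
  a_split = 0.0
  a_fused = 0.0

  ! Two sweeps: the intermediate field t is written to memory and read back.
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        t(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
      end do
    end do
  end do
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        a_split(i,j,k) = 2.0 * t(i,j,k) + b(i,j,k)
      end do
    end do
  end do

  ! One fused sweep: the intermediate value stays in a scalar.
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        tmp = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
        a_fused(i,j,k) = 2.0 * tmp + b(i,j,k)
      end do
    end do
  end do

  max_diff = maxval(abs(a_fused - a_split))
  print *, 'max difference between split and fused result:', max_diff
end program fuse_loops
```

The fused version reads `b` once per sweep and never touches `t`, so it moves less data through the memory system. For a memory-bandwidth-bound code that is the quantity that matters.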

Page 25

GPUs have O(10) higher bandwidth!

Source: Prof. Aoki, Tokyo Tech

Page 26

Bottlenecks?

• What are the main bottlenecks of the COSMO code on current/future massively parallel architectures?

• Memory bandwidth

• Scalability

• I/O

Page 27

“Weak” scaling

• Problem size 1142 x 765 x 90 gridpoints (dt = 8s)

“COSMO-2”

Matt Cordery, CSCS

Page 28

Strong scaling (small problem)

• Problem size 102 x 102 x 60 gridpoints (dt = 20s)

“COSMO-2”

Page 29

Improve Scalability?

• Several approaches can be followed...

• Improve MPI parallelization

• Hybrid parallelization (loop level)

• Hybrid parallelization (restructure code)

• ...

Page 30

Hybrid Motivation

• NUMA = Non-Uniform Memory Access

• Nodes views…

Reality

Page 31

Hybrid Pros / Cons

• Pros
    • Eliminates domain decomposition at node
    • Automatic memory coherency at node
    • Lower (memory) latency and faster data movement within node
    • Can synchronize on memory instead of barrier
    • Easier on-node load balancing

• Cons
    • Benefit for memory bound codes questionable
    • Can be hard to maintain

Page 32

Hybrid: First Results

• OpenMP on loop level (> 600 directives)

Matt Cordery, CSCS

linear speedup
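
As an illustration of what a loop-level OpenMP directive looks like, the sketch below threads the stencil motif over the vertical index. It is a hypothetical example, not one of the directives added to COSMO: the grid size and the choice to parallelize over k are assumptions. Compiled without OpenMP support, the directives are treated as comments and the program runs serially.

```fortran
program omp_stencil
  implicit none
  integer, parameter :: ie = 64, je = 48, ke = 60   ! hypothetical local grid size
  real, parameter :: w1 = 0.25, w2 = 0.5, w3 = 0.25
  real, allocatable :: a(:,:,:), b(:,:,:)
  integer :: i, j, k

  allocate(a(ie,je,ke), b(ie,je,ke))
  b = 1.0
  a = 0.0

  ! The k iterations are divided among the threads; the inner loop
  ! indices i and j are private to each thread.
  !$omp parallel do private(i, j) shared(a, b)
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
      end do
    end do
  end do
  !$omp end parallel do

  print *, 'a(2,1,1) =', a(2,1,1)
end program omp_stencil
```

Each directive of this kind opens and closes a parallel region around a single loop nest, which is simple to add but pays a fork/join cost at every loop. That overhead is one reason the deck lists restructuring the code as a separate hybrid approach.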

Page 33

Bottlenecks?

• What are the main bottlenecks of the COSMO code on current/future massively parallel architectures?

• Memory bandwidth

• Scalability

• I/O

Page 34

The I/O Bottleneck

• NetCDF I/O is serial and synchronous

• grib1 output is asynchronous (and probably not in an ideal way)

• No parallel output exists!

• Example: Operational COSMO-2 run

                     REF (s)   NO OUTPUT (s)   DIFF (s)
    TOTAL             1889        1676         -212 (-11%)
    MPI                571         387         -184
    USER              1317        1289          -28
    MPI_gather         178           1         -177
    cal_conv_ind        22           0          -22
    organize_output      3           0           -3
    tautsp2d             1           0           -1
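
The timings above can be turned into shares of the output cost with a little arithmetic. The numbers are taken directly from the table; only the grouping into ratios is added here.

```latex
\begin{align*}
\frac{\text{DIFF}_{\text{TOTAL}}}{\text{REF}_{\text{TOTAL}}} &= \frac{212}{1889} \approx 11\% \\
\frac{\text{DIFF}_{\text{MPI}}}{\text{DIFF}_{\text{TOTAL}}} &= \frac{184}{212} \approx 87\% \\
\frac{\text{DIFF}_{\texttt{MPI\_gather}}}{\text{DIFF}_{\text{TOTAL}}} &= \frac{177}{212} \approx 83\%
\end{align*}
```

Most of the time saved by switching output off is therefore time spent in `MPI_gather`, that is, in collecting the distributed fields before they are written, rather than in the user-code output routines.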

Page 35

PP-POMPA

• Performance On Massively Parallel Architectures

• Goal: Prepare COSMO code for emerging massively parallel architectures

• Timeframe: 3 years (Sep. 2010 – Sep. 2013)

• Status: Draft of project plan has been sent around. STC has approved the project.

• Next step: Kickoff meeting and detailed planning of activities with all participants.

Page 36

Tasks

① Performance analysis

② Redesign memory layout

③ Improving scalability (MPI, hybrid)

④ Massively parallel I/O

⑤ Adapt physical parametrizations

⑥ Redesign dynamical core

⑦ Explore GPU acceleration

⑧ Update documentation

Current COSMO code base

New code and programming models

See project plan!

Page 37

Who is POMPA?

• DWD (Ulrich Schättler, …)

• ARPA-SIMC, USAM & CASPUR (Davide Cesari, Stefano Zampini, David Palella, Piero Lancura, Alessandro Cheloni, Pier Francesco Coppola, …)

• MeteoSwiss, CSCS & SCS (Oliver Fuhrer, Will Sawyer, Thomas Schulthess, Matt Cordery, Xavier Lapillonne, Neil Stringfellow, Tobias Gysi, …)

• And you?

Page 38

Questions?

Coming to a supercomputer near you soon!