
Page 1:

The Need for Speed: Parallelization in GEM

Michel Desgagné, Recherche en Prévision Numérique, Environment Canada - MSC/RPN

Many thanks to Michel Valin

Page 2:

Single processor limitations

● Processor clock speed is limited

● Physical size of the processor limits speed because signal speed cannot exceed the speed of light

● Single processor speed is limited by integrated-circuit feature size (propagation delays and thermal problems)

● Memory (size and speed - especially latency)

● The amount of logic on a processor chip is limited by real estate considerations (die size / transistor size)

● Algorithm limitations

The reason behind parallel programming

Page 3:

Parallel computing: a solution

● Increase parallelism within processor (multi operand functional units like vector units)

● Increase parallelism on chip (multiple processors on chip)

● Multi-processor computers

● Multi-computer systems using a communication network (latency and bandwidth considerations)

Page 4:

Parallel computing paradigms

● Memory taxonomy:

  ● SMP: Shared Memory Parallelism

    ● One processor can “see” another's memory
    ● Cray X-MP, single-node NEC SX-3/4/5/6

  ● DMP: Distributed Memory Parallelism

    ● Processors exchange “messages”
    ● Cray T3D, IBM SP, ES-40, ASCI machines

● Hardware taxonomy:

SISD (Single Instruction Single Data)

SIMD (Single Instruction Multiple Data)

MISD (Multiple Instruction Single Data)

MIMD (Multiple Instruction Multiple Data)

● Programmer taxonomy:

SPMD : Single Program Multiple Data

MPMD : Multiple Program Multiple Data

Page 5:

SMP architectures

[Figure: two SMP node layouts: several CPUs sharing memory modules over a bus topology, or several CPUs and memory modules connected through a network / crossbar; each layout forms one NODE]

Page 6:

SMP: OpenMP (microtasking / autotasking)

OpenMP works at small granularity, often at the loop level: multiple CPUs (threads) execute the same code in a shared memory space.

OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.

OpenMP uses the fork-join model of parallel execution:

Page 7:

OpenMP Basic features (FORTRAN “comments”)

      PROGRAM VEC_ADD_SECTIONS
      INTEGER ni, I, n
      PARAMETER (ni=1000)
      REAL A(ni), B(ni), C(ni)

! Some initializations
      n = 4
      DO I = 1, ni
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
! At the Fortran level:
      call omp_set_num_threads(n)

! Parallel region
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$omp do
      DO I = 1, ni
         C(I) = A(I) + B(I)
      ENDDO
!$omp enddo
!$OMP END PARALLEL

      END

Two ways to set the number of threads: at the shell level (export OMP_NUM_THREADS=4) or at the Fortran level (call omp_set_num_threads(n), as above).

Page 8:

OpenMP

!$omp parallel
!$omp do
      do n = 1, omp_get_max_threads()
         call itf_phy_slb ( n, F_stepno, obusval, cobusval,
     $                      pvptr, cvptrp, cvptrm, ndim, chmt_ntr,
     $                      trp, trm, tdu, tdv, tdt, kmm, ktm,
     $                      LDIST_DIM, l_nk )
      enddo
!$omp enddo
!$omp end parallel

!$omp critical
      jdo = jdo + 1
!$omp end critical

!$omp single
      call vexp (expf_8, xmass_8, nij)
!$omp end single

Page 9:

SMP: General remarks

● Shared memory parallelism at the loop level can often be implemented after the fact if what is desired is a moderate level of parallelism

● It can also be done at the thread level in some cases, but reentrancy, data scope (thread-local vs. global) and race conditions can be a problem.

● Does NOT scale all that well

● Limited to the real estate of a node

Page 10:

DMP architecture

[Figure: DMP architecture: several nodes (each an SMP node of CPUs and memory) connected by a high-speed interconnect (network / crossbar)]

Page 11:

2D domain decomposition: regular horizontal block partitioning

[Figure: a Gni x Gnj global grid decomposed onto a PE topology npex=2, npey=2: PE matrix Pe(0,0), Pe(1,0), Pe(0,1), Pe(1,1) with ranks PE #0 to PE #3, oriented by the compass directions N/S/E/W. Each PE holds an Lni x Lnj subdomain; grid points carry both a global index (1..Gni, 1..Gnj) and a local index (1..Lni, 1..Lnj)]

Page 12:

High level operations

● Halo exchange

  ● What is a halo?
  ● Why and when is it necessary to exchange a halo?

● Data transpose

  ● What is a data transpose?
  ● Why and when is it necessary to transpose data?

● Collective and Reduction operations

Page 13:

2D array layout with halos

[Figure: 2D array layout with halos: the private data covers (1:Lni, 1:Lnj), and the array is dimensioned (Mini:Maxi, Minj:Maxj) so that it also holds an inner halo and an outer halo of width Halo x in x and Halo y in y on all four sides (N/S/E/W)]

Page 14:

Halo exchange: why and when?

● Need to access neighboring data in order to perform local computation

● In general, any stencil-type discrete operator, e.g. (see the MPI sketch below):

      dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))

● Halo width depends on the operator

[Figure: a local subdomain (1:Lni) flanked by its Halo x regions on the west and east sides]
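To make this concrete, here is a minimal, hypothetical sketch (plain MPI with made-up names, not GEM's RPN_COMM halo exchange) of a 1-point halo exchange in the north-south direction:

   ! Hypothetical sketch: exchange a 1-point halo in y with MPI_Sendrecv.
   ! f is dimensioned (lni, 0:lnj+1): rows 1 and lnj are interior rows that
   ! get sent; rows 0 and lnj+1 are halo rows that get filled. 'south' and
   ! 'north' are the neighbor ranks (MPI_PROC_NULL on a domain boundary,
   ! which turns the corresponding send/receive into a no-op).
   subroutine xch_halo_y(f, lni, lnj, south, north, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in)    :: lni, lnj, south, north, comm
      real,    intent(inout) :: f(lni, 0:lnj+1)
      integer :: ierr, status(MPI_STATUS_SIZE)

      ! send the northernmost interior row, receive the southern halo row
      call MPI_Sendrecv(f(1,lnj), lni, MPI_REAL, north, 1, &
                        f(1,0),   lni, MPI_REAL, south, 1, &
                        comm, status, ierr)
      ! send the southernmost interior row, receive the northern halo row
      call MPI_Sendrecv(f(1,1),     lni, MPI_REAL, south, 2, &
                        f(1,lnj+1), lni, MPI_REAL, north, 2, &
                        comm, status, ierr)
   end subroutine xch_halo_y

The east-west exchange works the same way with the west/east neighbors, and a wider halo simply sends more rows or columns per message.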

Page 15:

Halo exchange

How many neighbor PEs must the local PE exchange data with in order to get the data in the shaded area (its outer halo)?

[Figure: PE topology npex=3, npey=3: the local PE is surrounded by its North, South, East and West neighbors plus the four corner PEs (North-West, North-East, South-West, South-East), i.e. eight neighbors in all]

Page 16:

Data Transposition

PE topology: npex=4, npey=4

[Figure: a 3D field with axes X, Y and Z distributed over the npex x npey PE grid, shown before and after transposes T1 and T2, which redistribute the X, Y and Z axes of the data across the PE grid]
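As an illustration only (not GEM's actual transpose code), the building block of a data transpose is an all-to-all exchange within one row (or column) of the PE grid: every PE sends an equal-sized chunk to every other PE of that row and receives the matching chunks back. A minimal sketch, with MPI_COMM_WORLD standing in for one PE row and an assumed chunk size:

   ! Hypothetical sketch: the all-to-all exchange underlying a transpose.
   program transpose_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: nwords = 4          ! assumed chunk size per PE pair
      integer :: ierr, rank, npes
      real, allocatable :: sendbuf(:), recvbuf(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
      allocate(sendbuf(nwords*npes), recvbuf(nwords*npes))
      sendbuf = real(rank)                      ! dummy data to redistribute

      ! chunk k of sendbuf goes to PE k; chunk k of recvbuf comes from PE k
      call MPI_Alltoall(sendbuf, nwords, MPI_REAL, &
                        recvbuf, nwords, MPI_REAL, MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)
   end program transpose_sketch

In a real transpose the chunks are the slices of the axis being redistributed, and some local index shuffling before and after the call makes the newly gathered axis contiguous in memory.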

Page 17:

What is MPI ?

● A Message Passing Interface

● Communications through messages can be

● Cooperative send / receive (democratic)

● One sided get / put (autocratic)

● Bindings defined for FORTRAN, C, C++

● For parallel computers, clusters, heterogeneous networks

● Full featured (but can be used in simple fashion)

[Figure: message transfer time vs. message length: a fixed startup (latency) cost plus a per-word cost Tw = cost / word]

● Include 'mpif.h'
● Call MPI_INIT(ierr)
● Call MPI_FINALIZE(ierr)
● Call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
● Call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
● Call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
● Call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)

Collective operations: MPI_gather, MPI_allgather, MPI_scatter, MPI_alltoall, MPI_bcast, MPI_reduce, MPI_allreduce

Reduction operators: mpi_sum, mpi_min, mpi_max
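A minimal runnable example stringing the calls above together (two processes; rank 0 sends ten reals to rank 1; the buffer name is illustrative only):

   program mpi_minimal
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, npes, status(MPI_STATUS_SIZE)
      real    :: buf(10)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

      if (rank == 0) then
         buf = 1.0
         call MPI_Send(buf, 10, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_Recv(buf, 10, MPI_REAL, 0, 99, MPI_COMM_WORLD, status, ierr)
      end if

      call MPI_Finalize(ierr)
   end program mpi_minimal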

Page 18:

The RPN_COMM toolkit (Michel Valin)

● NO INCLUDE FILE NEEDED (like mpif.h)

● Higher level of abstraction

● Initialization / termination of communications

● Topology determination

● Point to point operations

● Halo exchange

  ● (Direct message to N/S/W/E neighbors)

● Collective operations

  ● Transpose
  ● Gather / distribute
  ● Data reduction

● Equivalent calls to most frequently used MPI routines

  ● MPI_[something] => RPN_COMM_[something]

Page 19:

Partitioning Global Data

Gni=62, Gnj=25; PE topology: npex=4, npey=3

[Figure: the 62 x 25 global grid partitioned into subdomains with lni between 14 and 16 and lnj between 7 and 9, comparing the Valin rule, lni = (Gni + npex - 1) / npex, with the Thomas rule]

Dimensions of the largest subdomain are NOT affected

checktopo -gni 62 -gnj 25 -gnk 58 -npx 4 -npy 2 -pil 7 -hblen 10
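A small sketch of the "Valin" rule quoted above, under the assumption (consistent with the subdomain sizes in the figure) that every PE column gets (Gni + npex - 1) / npex points except the last one, which gets whatever is left; the Thomas rule is not detailed on the slide:

   ! Hypothetical sketch of the 'Valin' block partitioning rule.
   program partition_sketch
      implicit none
      integer, parameter :: Gni = 62, npex = 4   ! values from the slide
      integer :: base, px, lni

      base = (Gni + npex - 1) / npex             ! = 16 for Gni=62, npex=4
      do px = 0, npex - 1
         lni = min(base, Gni - px*base)          ! last PE column gets the remainder (14)
         print *, 'PE column', px, ': lni =', lni
      end do
   end program partition_sketch

With Gni=62 and npex=4 this gives 16, 16, 16, 14; applied to Gnj=25 and npey=3 it gives 9, 9, 7. The figure also shows intermediate sizes (lni=15, lnj=8), presumably from the Thomas rule, but in both cases the largest subdomain is 16 x 9, which is the point of the slide.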

Page 20:

DMP Scalability

● Scaling up with an optimum subdomain dimension (weak scaling):

  ● Size: 500 x 50 on vector processor systems
  ● Size: 100 x 50 on cache systems
  ● Time to solution should remain the same

● Scaling up on a fixed-size problem (strong scaling):

  ● Time to solution should decrease linearly with the # of CPUs

Page 21:

MC2 Performance on NEC SX4 and Fujitsu VPP700

[Figure: flop rate per PE (MFlops/sec, axis 700-900) vs. number of PEs (axis 0-35) for SX4 (1 node), SX4 (2 nodes) and VPP700; SX4: npx=2, VPP700: npx=1; grid: 513 x 433 x 41]

Page 22:

IFS Performance on NEC SX4 and Fujitsu VPP700

[Figure: forecast days / day (axis 100-1000) vs. number of PEs (axis 10-100) for VPP700 and SX-4]

Amdahl's law for parallel programming

The speedup factor is influenced very much by the residual serial (non-parallelizable) work. As the number of processors grows, so does the damage caused by the non-parallelizable work.
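The slide states Amdahl's law in words only; in the standard formula, with serial fraction s and N processors, the speedup is

   S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s}

For example, with s = 0.01 the speedup can never exceed 100 no matter how many processors are added, which is the "damage" referred to above.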

Page 23:

Scalability: limiting factors

• Any algorithm requiring global communications

  – One should THINK LOCAL

• SL transport on a global-configuration lat-lon grid-point model – numerical poles (GEM)

• 2-time-level fully implicit discretization leading to an elliptic problem: the direct solver requires a data transpose

• Any algorithm producing inherent load imbalance

Page 24:

DMP - General remarks

● More difficult but more powerful programming paradigm

● Easily combined with SMP (on all MPI processes)

● Distributed memory parallelism does not happen, it must be DESIGNED.

● One does not parallelize a code: the code must be rebuilt (and often redesigned), taking into account the constraints imposed on the dataflow by message passing. Array dimensioning and loop indexing are likely to be VERY HEAVILY IMPACTED.

● One may get lucky and HPF or an automatic parallelizing compiler will solve the problem (if one believes in miracles, Santa Claus, the tooth fairy or all of them).

Page 25:

Web sites and Books

● http://pollux.cmc.ec.gc.ca/~armnmfv/MPI_workshop

● http://www.llnl.gov/ , OpenMP, threads, MPI, ...

● http://hpcf.nersc.gov/

● http://www.idris.fr/ , en français, OpenMP, MPI, F90

● Using MPI, Gropp et al, ISBN 0-262-57204-8

● MPI, The Compl. Ref., Snir et al, ISBN 0-262-69184-1

● MPI, The Compl. Ref. vol 2, Gropp et al, ISBN 0-262-57123-4