Accelerating CICE on the GPU
Rob T. Aulwes, CCS-7
March 19, 2015
LA-UR-15-21044

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

Slide 2: Acknowledgements
§  Mat Colgrove
§  Jeff Larkin
§  Jiri Kraus
§  Carl Ponder
§  Justin Luitjens
§  Tony Scuderio

Slide 3: Outline
§  CICE model
§  Strategy for acceleration
   –  Profiling
   –  OpenACC
   –  GPUDirect
§  Results
§  Path forward

Slide 4: Los Alamos Sea Ice Model - CICE
§  Global model of sea ice for climate and forecasting
   –  Used by many climate and forecast groups
§  Coupled to atmospheric-ice-ocean-land global climate models
   –  http://oceans11.lanl.gov/trac/CICE

Slide 5: Los Alamos Sea Ice Model - CICE
§  Ice thickness distribution (ITD)
   –  Multiple discrete thickness bins
   –  All tracers, etc. exist in each thickness class
§  Transport
   –  Tracer conservation, horizontal transport, ITD
   –  Incremental remap: efficient for many tracers
§  Dynamics
   –  Momentum: includes forcing, grav
   –  Stress: Elastic-Viscous-Plastic (EVP), EAP (anisotropic) rheology, stress tensor
§  Thermo/salinity/column physics
   –  Melt/freeze, radiation, fluxes at top/bottom, melt ponds
   –  T, S through a few vertical levels

Slide 6: OpenACC gotchas
§  Must delete device memory before freeing host memory
§  Passing a non-contiguous array slice creates temporary arrays
   –  Means the temporary array is not in device memory even if the original array was
§  Use assumed-shape array declarations
   –  Unless the lower bound is not 1:
      •  real, dimension(:,0:), intent(in) :: foo
   –  Bonus: assumed shapes improved CPU performance
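
A minimal sketch of the slicing gotcha and the assumed-shape fix (the program, routine, and array names below are hypothetical, not CICE code): passing a non-contiguous slice to a routine that requires contiguous storage makes the compiler build a host temporary, and that temporary has no device copy; passing the whole array to an assumed-shape dummy keeps the device data visible through present().

program slice_gotcha
   implicit none
   real, allocatable :: arr(:,:,:)

   allocate(arr(4, 8, 8))
   arr = 1.0

   !$acc enter data copyin(arr)

   ! If the dummy were explicit-shape (legacy style), passing a non-contiguous
   ! slice such as arr(:, 1:8:2, :) would force a contiguous host temporary,
   ! and that temporary would not be present on the device.
   call scale_field(arr)

   !$acc exit data delete(arr)   ! delete device memory before freeing host memory
   deallocate(arr)

contains

   subroutine scale_field(a)
      ! assumed-shape dummy; declare an explicit lower bound only when it is not 1,
      ! e.g. real, dimension(:,0:,:), intent(inout) :: a
      real, dimension(:,:,:), intent(inout) :: a
      integer :: i, j, k
      !$acc parallel loop collapse(3) present(a)
      do k = 1, size(a, 3)
         do j = 1, size(a, 2)
            do i = 1, size(a, 1)
               a(i,j,k) = 2.0 * a(i,j,k)
            end do
         end do
      end do
   end subroutine scale_field

end program slice_gotcha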

Slide 7: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 8: gprof + gprof2dot.py
(profiling call-graph figure)

Slide 9: gprof + gprof2dot.py
(profiling call-graph figure)

Slide 10: Challenges for Accelerating Dynamics
§  Halo updates between computations
§  Many arrays to move
   –  Move all computations to the GPU
   –  Hide latencies where possible

Slide 11: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 12: Strategy – Minimize Data Movement
§  Use Fortran pointers into large memory chunks
   –  Reduces data movement: one transfer of the chunk covers every field that points into it

real, target, allocatable :: mem_chunk(:,:,:)   ! target so the field pointers can alias it
real, pointer :: v(:,:), w(:,:)

allocate( mem_chunk(nx, ny, 2) )
v => mem_chunk(:,:,1)
w => mem_chunk(:,:,2)
!$acc enter data create(mem_chunk)

!$acc update device(mem_chunk)   ! one update moves both fields
!$acc data present(v, w)         ! pointers resolve to memory inside mem_chunk
!$acc parallel loop collapse(2)
do j = 1, ny
   do i = 1, nx
      v(i,j) = alpha * w(i,j)
   enddo
enddo
!$acc end data

Slide 13: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 14: Strategy – Use CUDA Streams
§  Use CUDA streams
   –  If loops are data independent, launch them on separate streams, along with data updates to/from the device (see the sketch after the loops below)

!$acc parallel loop collapse(2) async(1)
do i = 1, n
   do j = 1, m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

!$acc parallel loop collapse(2) async(2)
do i = 1, n
   do j = 1, m
      b(i,j) = b(i,j) + alpha * t(i,j)
   enddo
enddo
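
The same async queues can also overlap data traffic with computation; a minimal sketch using the array names from the loops above (illustrative only, not taken from the CICE source):

!$acc parallel loop collapse(2) async(1)
do i = 1, n
   do j = 1, m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

! queue a device-to-host copy of an array the kernel above does not touch
!$acc update host(b) async(2)

!$acc wait   ! block until both streams have completed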

Slide 15: Strategy – Use CUDA Streams
§  Invoke subroutine calls using streams

Before:
do cat = 1, ncat
   call construct_fields(mx, my)
enddo

After:
do cat = 1, ncat
   ! In construct_fields, use 'cat' as the
   ! async stream value
   call construct_fields(cat, mx(:,:,cat), &
                         my(:,:,cat))
enddo
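
A hypothetical sketch of what a stream-aware construct_fields could look like inside (the routine body and names below are placeholders, not the actual CICE code): the category index selects the async queue, so kernels launched for different categories can overlap on the GPU.

subroutine construct_fields(cat, mx_cat, my_cat)
   integer, intent(in) :: cat
   real, dimension(:,:), intent(inout) :: mx_cat, my_cat   ! slices assumed already on the device
   integer :: i, j

   ! placeholder computation; each call's kernel runs on async queue 'cat'
   !$acc parallel loop collapse(2) present(mx_cat, my_cat) async(cat)
   do j = 1, size(mx_cat, 2)
      do i = 1, size(mx_cat, 1)
         mx_cat(i,j) = mx_cat(i,j) + my_cat(i,j)
      end do
   end do
end subroutine construct_fields

A !$acc wait after the category loop (or wait(cat) per queue) is still required before the results are consumed on the host or on another stream.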

Slide 16: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 17: GPUDirect
§  Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
   –  Used PGI 14.7 as the CICE compiler
§  Titan has Cray's CUDA-enabled version of MPICH
   –  Also used PGI 14.7
   –  However, the XK7 hardware doesn't support RDMA, so MPI still goes through the CPU
   –  But coding to GPUDirect now prepares for the upcoming Summit cluster

Slide 18: GPUDirect

Before (host-staged halo updates):
call ice_haloUpdate(dpx, halo_info)
call ice_haloUpdate(dpy, halo_info)
call ice_haloUpdate(mx, halo_info)

After (device-side halo updates):
call ice_haloBegin(halo_info, 3, updateInfo)
call ice_devHaloUpdate(halo_info, updateInfo, dpx)
call ice_devHaloUpdate(halo_info, updateInfo, dpy)
call ice_devHaloUpdate(halo_info, updateInfo, mx)
call ice_haloEnd(halo_info, updateInfo)
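
For reference, a heavily simplified sketch of how a device-side halo exchange can hand GPU buffers directly to a CUDA-aware MPI (the routine name, argument list, and buffer handling here are placeholders, not the actual ice_devHaloUpdate interface): OpenACC's host_data region exposes the device addresses of the packed buffers, so MPI can move them without staging through host memory when GPUDirect is available.

subroutine dev_halo_exchange(sendbuf, recvbuf, neighbor, tag, comm)
   use mpi
   implicit none
   real, dimension(:), intent(in)    :: sendbuf   ! halo data packed on the device
   real, dimension(:), intent(inout) :: recvbuf   ! halo data unpacked on the device
   integer, intent(in) :: neighbor, tag, comm
   integer :: ierr, stat(MPI_STATUS_SIZE)

   ! host_data passes the *device* addresses of the buffers to MPI;
   ! with a CUDA-aware MPI build this enables GPUDirect transfers.
   !$acc host_data use_device(sendbuf, recvbuf)
   call MPI_Sendrecv(sendbuf, size(sendbuf), MPI_REAL, neighbor, tag, &
                     recvbuf, size(recvbuf), MPI_REAL, neighbor, tag, &
                     comm, stat, ierr)
   !$acc end host_data
end subroutine dev_halo_exchange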

Slide 19: Results

Slide 20: Test Cases
§  Used two test problems: gx1 and tp4
   –  gx1: 16 procs, grid size 320x384
   –  tp4: 60 procs, grid size 900x600
§  Ran with longitudinal blocks
   –  Reduces load imbalance

Slide 21: Test Platforms
§  LANL Moonlight
   –  PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
   –  Intel Xeon (Sandy Bridge), 16 cores/node
   –  2 Nvidia M2090 per node
§  ORNL Titan
   –  PGI 14.7, Cray's MPICH, CUDA 5.5
   –  AMD Interlagos, 16 cores/node
   –  1 Nvidia K20x per node

Slide 22: Test Platforms
§  Nvidia's PSG cluster
   –  Ivy Bridge E5-2690, dual socket, 10 cores/socket
   –  6 x K40 per node
   –  OpenMPI 1.8.5

Slide 23: Test Cases – tp4
4 nodes, 15 procs per node

Runtime (secs)        Moonlight   Titan   PSG
Baseline                     96     173    78
OpenACC + GPUDirect          99     194    91

Slide 24: tp4 Rank Distribution

Runtime (secs)       4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight               73               69               67
GPU Moonlight                    96               83               96
Baseline Titan                  173              169              160
GPU Titan                       194              182              163
Baseline PSG                     78               74               69
GPU PSG                          91               81               80

Slide 25: tp4 Scaling – Titan
Fixed at 2 MPI procs per node

Runtime (secs)      10 procs/5 nodes   20 procs/10 nodes   40 procs/20 nodes
Baseline Titan                   337                 187                 201
GPU + GPUDirect                  325                 180                 193

Slide 26: tp4 Scaling – PSG
Fixed at 5 MPI procs per node

Runtime (secs)      10 procs/2 nodes   20 procs/4 nodes   40 procs/5 nodes
Baseline PSG                     161                 87                 91
GPU + GPUDirect                  197                107                109

Slide 27: Topology on PSG

nvidia-smi topo -m

        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0    X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1    PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2    PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3    PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4    SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5    SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0  PHB    PHB    PHB    PHB    SOC    SOC    X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Slide 28: Conclusion - Path Forward
§  Focus on dynamics/transport
§  Improve use of GPUDirect
   –  Get rid of the aggregation device buffer
   –  Restructure code to get better communication/computation overlap
§  Can we find task parallelism?
   –  Fuse kernels (not enough work per kernel)
   –  Spawn computation in a stream while performing halo updates
   –  OpenACC + OpenMP (see the sketch below)
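
A purely illustrative sketch of the OpenACC + OpenMP idea (not CICE code; array names and loop bounds are placeholders): one OpenMP host thread drives a halo update while another launches independent interior work on an async OpenACC queue, so communication and computation overlap.

!$omp parallel sections num_threads(2)
!$omp section
   call ice_haloUpdate(mx, halo_info)          ! communication on one host thread
!$omp section
   !$acc parallel loop collapse(2) async(1)    ! independent interior work on the GPU
   do j = 1, ny
      do i = 1, nx
         a(i,j) = a(i,j) * w(i,j)
      enddo
   enddo
!$omp end parallel sections
!$acc wait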