Accelerating CICE on the GPU
Rob T. Aulwes, CCS-7
March 19, 2015
LA-UR-15-21044

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

Slide 2: Acknowledgements
§  Mat Colgrove
§  Jeff Larkin
§  Jiri Kraus
§  Carl Ponder
§  Justin Luitjens
§  Tony Scuderio

Slide 3: Outline
§  CICE model
§  Strategy for acceleration
   –  Profiling
   –  OpenACC
   –  GPUDirect
§  Results
§  Path forward

Slide 4: Los Alamos Sea Ice Model - CICE
§  Global model of sea ice for climate and forecasting
   –  Used by many climate and forecast groups
§  Coupled to atmospheric-ice-ocean-land global climate models
   –  http://oceans11.lanl.gov/trac/CICE

Slide 5: Los Alamos Sea Ice Model - CICE
§  Ice thickness distribution (ITD)
   –  Multiple discrete thickness bins
   –  All tracers, etc. exist in each thickness class
§  Transport
   –  Tracer conservation, horizontal transport, ITD
   –  Incremental remap: efficient for many tracers
§  Dynamics
   –  Momentum: includes forcing, grav
   –  Stress: Elastic-Viscous-Plastic (EVP), EAP (anisotropic) rheology, stress tensor
§  Thermo/salinity/column physics
   –  Melt/freeze, radiation, fluxes at top/bottom, melt ponds
   –  T, S through a few vertical levels

Slide 6: OpenACC gotchas
§  Must delete device memory before freeing host memory
§  Passing a non-contiguous array slice creates temporary arrays
   –  Means the temporary array is not in device memory even if the original array was
§  Use assumed-shape array declarations
   –  Unless the lower bound is not 1:
      •  real, dimension(:,0:), intent(in) :: foo
   –  Bonus: assumed shapes improved CPU performance
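
A minimal sketch of the slicing gotcha and the assumed-shape fix (the program, routine, and array names below are hypothetical, not CICE code): passing a non-contiguous slice to a routine that requires contiguous storage makes the compiler build a host temporary, and that temporary has no device copy; passing the whole array to an assumed-shape dummy keeps the device data visible through present().

program slice_gotcha
   implicit none
   real, allocatable :: arr(:,:,:)

   allocate(arr(4, 8, 8))
   arr = 1.0

   !$acc enter data copyin(arr)

   ! If the dummy were explicit-shape (legacy style), passing a non-contiguous
   ! slice such as arr(:, 1:8:2, :) would force a contiguous host temporary,
   ! and that temporary would not be present on the device.
   call scale_field(arr)

   !$acc exit data delete(arr)   ! delete device memory before freeing host memory
   deallocate(arr)

contains

   subroutine scale_field(a)
      ! assumed-shape dummy; declare an explicit lower bound only when it is not 1,
      ! e.g. real, dimension(:,0:,:), intent(inout) :: a
      real, dimension(:,:,:), intent(inout) :: a
      integer :: i, j, k
      !$acc parallel loop collapse(3) present(a)
      do k = 1, size(a, 3)
         do j = 1, size(a, 2)
            do i = 1, size(a, 1)
               a(i,j,k) = 2.0 * a(i,j,k)
            end do
         end do
      end do
   end subroutine scale_field

end program slice_gotcha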

Slide 7: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 8: gprof + gprof2dot.py
(profiling call-graph figure)

Slide 9: gprof + gprof2dot.py
(profiling call-graph figure)

Slide 10: Challenges for Accelerating Dynamics
§  Halo updates between computations
§  Many arrays to move
   –  Move all computations to the GPU
   –  Hide latencies where possible

Slide 11: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 12: Strategy – Minimize Data Movement
§  Use Fortran pointers into large memory chunks
   –  Reduces data movement: one transfer of the chunk covers every field that points into it

real, target, allocatable :: mem_chunk(:,:,:)   ! target so the field pointers can alias it
real, pointer :: v(:,:), w(:,:)

allocate( mem_chunk(nx, ny, 2) )
v => mem_chunk(:,:,1)
w => mem_chunk(:,:,2)
!$acc enter data create(mem_chunk)

!$acc update device(mem_chunk)   ! one update moves both fields
!$acc data present(v, w)         ! pointers resolve to memory inside mem_chunk
!$acc parallel loop collapse(2)
do j = 1, ny
   do i = 1, nx
      v(i,j) = alpha * w(i,j)
   enddo
enddo
!$acc end data

Slide 13: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 14: Strategy – Use CUDA Streams
§  Use CUDA streams
   –  If loops are data independent, launch them on separate streams, along with data updates to/from the device (see the sketch after the loops below)

!$acc parallel loop collapse(2) async(1)
do i = 1, n
   do j = 1, m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

!$acc parallel loop collapse(2) async(2)
do i = 1, n
   do j = 1, m
      b(i,j) = b(i,j) + alpha * t(i,j)
   enddo
enddo
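
The same async queues can also overlap data traffic with computation; a minimal sketch using the array names from the loops above (illustrative only, not taken from the CICE source):

!$acc parallel loop collapse(2) async(1)
do i = 1, n
   do j = 1, m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

! queue a device-to-host copy of an array the kernel above does not touch
!$acc update host(b) async(2)

!$acc wait   ! block until both streams have completed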

Slide 15: Strategy – Use CUDA Streams
§  Invoke subroutine calls using streams

Before:
do cat = 1, ncat
   call construct_fields(mx, my)
enddo

After:
do cat = 1, ncat
   ! In construct_fields, use 'cat' as the
   ! async stream value
   call construct_fields(cat, mx(:,:,cat), &
                         my(:,:,cat))
enddo
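
A hypothetical sketch of what a stream-aware construct_fields could look like inside (the routine body and names below are placeholders, not the actual CICE code): the category index selects the async queue, so kernels launched for different categories can overlap on the GPU.

subroutine construct_fields(cat, mx_cat, my_cat)
   integer, intent(in) :: cat
   real, dimension(:,:), intent(inout) :: mx_cat, my_cat   ! slices assumed already on the device
   integer :: i, j

   ! placeholder computation; each call's kernel runs on async queue 'cat'
   !$acc parallel loop collapse(2) present(mx_cat, my_cat) async(cat)
   do j = 1, size(mx_cat, 2)
      do i = 1, size(mx_cat, 1)
         mx_cat(i,j) = mx_cat(i,j) + my_cat(i,j)
      end do
   end do
end subroutine construct_fields

A !$acc wait after the category loop (or wait(cat) per queue) is still required before the results are consumed on the host or on another stream.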

Slide 16: Strategy
§  Profile
§  Minimize data movement
§  Exploit CUDA streams
§  Use GPUDirect for MPI between devices

Slide 17: GPUDirect
§  Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
   –  Used PGI 14.7 as the CICE compiler
§  Titan has Cray's CUDA-enabled version of MPICH
   –  Also used PGI 14.7
   –  However, the XK7 hardware doesn't support RDMA, so MPI still goes through the CPU
   –  But coding to GPUDirect now prepares for the upcoming Summit cluster

Slide 18: GPUDirect

Before (host-staged halo updates):
call ice_haloUpdate(dpx, halo_info)
call ice_haloUpdate(dpy, halo_info)
call ice_haloUpdate(mx, halo_info)

After (device-side halo updates):
call ice_haloBegin(halo_info, 3, updateInfo)
call ice_devHaloUpdate(halo_info, updateInfo, dpx)
call ice_devHaloUpdate(halo_info, updateInfo, dpy)
call ice_devHaloUpdate(halo_info, updateInfo, mx)
call ice_haloEnd(halo_info, updateInfo)
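
For reference, a heavily simplified sketch of how a device-side halo exchange can hand GPU buffers directly to a CUDA-aware MPI (the routine name, argument list, and buffer handling here are placeholders, not the actual ice_devHaloUpdate interface): OpenACC's host_data region exposes the device addresses of the packed buffers, so MPI can move them without staging through host memory when GPUDirect is available.

subroutine dev_halo_exchange(sendbuf, recvbuf, neighbor, tag, comm)
   use mpi
   implicit none
   real, dimension(:), intent(in)    :: sendbuf   ! halo data packed on the device
   real, dimension(:), intent(inout) :: recvbuf   ! halo data unpacked on the device
   integer, intent(in) :: neighbor, tag, comm
   integer :: ierr, stat(MPI_STATUS_SIZE)

   ! host_data passes the *device* addresses of the buffers to MPI;
   ! with a CUDA-aware MPI build this enables GPUDirect transfers.
   !$acc host_data use_device(sendbuf, recvbuf)
   call MPI_Sendrecv(sendbuf, size(sendbuf), MPI_REAL, neighbor, tag, &
                     recvbuf, size(recvbuf), MPI_REAL, neighbor, tag, &
                     comm, stat, ierr)
   !$acc end host_data
end subroutine dev_halo_exchange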

Slide 19: Results

Slide 20: Test Cases
§  Used two test problems: gx1 and tp4
   –  gx1: 16 procs, grid size 320x384
   –  tp4: 60 procs, grid size 900x600
§  Ran with longitudinal blocks
   –  Reduces load imbalance

Slide 21: Test Platforms
§  LANL Moonlight
   –  PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
   –  Intel Xeon (Sandy Bridge), 16 cores/node
   –  2 Nvidia M2090 per node
§  ORNL Titan
   –  PGI 14.7, Cray's MPICH, CUDA 5.5
   –  AMD Interlagos, 16 cores/node
   –  1 Nvidia K20x per node

Slide 22: Test Platforms
§  Nvidia's PSG cluster
   –  Ivy Bridge E5-2690, dual socket, 10 cores/socket
   –  6 x K40 per node
   –  OpenMPI 1.8.5

Slide 23: Test Cases – tp4
4 nodes, 15 procs per node

Runtime (secs)        Moonlight   Titan   PSG
Baseline                     96     173    78
OpenACC + GPUDirect          99     194    91

Slide 24: tp4 Rank Distribution

Runtime (secs)       4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight               73               69               67
GPU Moonlight                    96               83               96
Baseline Titan                  173              169              160
GPU Titan                       194              182              163
Baseline PSG                     78               74               69
GPU PSG                          91               81               80

Slide 25: tp4 Scaling – Titan
Fixed at 2 MPI procs per node

Runtime (secs)      10 procs/5 nodes   20 procs/10 nodes   40 procs/20 nodes
Baseline Titan                   337                 187                 201
GPU + GPUDirect                  325                 180                 193

Slide 26: tp4 Scaling – PSG
Fixed at 5 MPI procs per node

Runtime (secs)      10 procs/2 nodes   20 procs/4 nodes   40 procs/5 nodes
Baseline PSG                     161                 87                 91
GPU + GPUDirect                  197                107                109

Slide 27: Topology on PSG

nvidia-smi topo -m

        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0    X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1    PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2    PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3    PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4    SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5    SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0  PHB    PHB    PHB    PHB    SOC    SOC    X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Slide 28: Conclusion - Path Forward
§  Focus on dynamics/transport
§  Improve use of GPUDirect
   –  Get rid of the aggregation device buffer
   –  Restructure code to get better communication/computation overlap
§  Can we find task parallelism?
   –  Fuse kernels (not enough work per kernel)
   –  Spawn computation in a stream while performing halo updates
   –  OpenACC + OpenMP (see the sketch below)
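
A purely illustrative sketch of the OpenACC + OpenMP idea (not CICE code; array names and loop bounds are placeholders): one OpenMP host thread drives a halo update while another launches independent interior work on an async OpenACC queue, so communication and computation overlap.

!$omp parallel sections num_threads(2)
!$omp section
   call ice_haloUpdate(mx, halo_info)          ! communication on one host thread
!$omp section
   !$acc parallel loop collapse(2) async(1)    ! independent interior work on the GPU
   do j = 1, ny
      do i = 1, nx
         a(i,j) = a(i,j) * w(i,j)
      enddo
   enddo
!$omp end parallel sections
!$acc wait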