Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED
Accelerating CICE on the GPU
Rob T. Aulwes, CCS-7
March 19, 2015
LA-UR-15-21044

Acknowledgements
§ Mat Colgrove
§ Jeff Larkin
§ Jiri Kraus
§ Carl Ponder
§ Justin Luitjens
§ Tony Scuderio

Outline
§ CICE model
§ Acceleration strategy
  – Profiling
  – OpenACC
  – GPUDirect
§ Results
§ Path forward

Los Alamos Sea Ice Model - CICE
§ Global sea ice model for climate and forecasting
  – Used by many climate and forecasting groups
§ Coupled to atmospheric-ice-ocean-land global climate models
  – http://oceans11.lanl.gov/trac/CICE

Los Alamos Sea Ice Model - CICE
§ Ice thickness distribution (ITD)
  – Multiple discrete thickness bins
  – All tracers, etc. exist in each thickness class
§ Transport
  – Tracer conservation, horizontal transport, ITD
  – Incremental remap: efficient for many tracers
§ Dynamics
  – Momentum: includes forcing, gravity
  – Stress: Elastic-Viscous-Plastic, EAP (anisotropic) rheology, stress tensor
§ Thermo/salinity/column physics
  – Melt/freeze, radiation, fluxes at top/bottom, melt ponds
  – Temperature (T) and salinity (S) through a few vertical levels

OpenACC gotchas
§ Must delete device memory before host memory
§ Passing a non-contiguous array slice creates a temporary array
  – The temporary array is not in device memory, even if the original array was
§ Use assumed shape for array declarations
  – Unless a lower bound is not 1; then state it explicitly
    • real, dimension(:,0:), intent(in) :: foo
  – Bonus: assumed shapes improved CPU performance
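
The gotchas above can be sketched in a few lines. This is a minimal illustration, not code from CICE: the program name, the array `field`, and the routine `scale_field` are hypothetical. It shows the delete-before-deallocate ordering and an assumed-shape dummy argument whose second dimension starts at 0. The directives are plain comments to a compiler without OpenACC, so the program should also build and run on the CPU alone.

```fortran
program acc_gotchas
  implicit none
  real, allocatable :: field(:,:)

  ! Second dimension starts at 0, as in the slide's declaration example.
  allocate(field(4, 0:3))
  field = 1.0

  !$acc enter data copyin(field)

  ! The whole array is passed, so the argument is contiguous and
  ! no temporary copy is involved.
  call scale_field(field, 2.0)

  !$acc update host(field)

  if (all(field == 2.0)) then
    print *, 'OK'
  else
    print *, 'unexpected result'
    stop 1
  end if

  ! Release the device copy first, then the host array.
  !$acc exit data delete(field)
  deallocate(field)

contains

  subroutine scale_field(foo, alpha)
    ! Assumed shape with an explicit lower bound of 0 in dimension 2.
    real, dimension(:,0:), intent(inout) :: foo
    real, intent(in) :: alpha
    integer :: i, j

    !$acc parallel loop collapse(2) present(foo)
    do j = lbound(foo, 2), ubound(foo, 2)
      do i = 1, size(foo, 1)
        foo(i, j) = alpha * foo(i, j)
      end do
    end do
  end subroutine scale_field

end program acc_gotchas
```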

Strategy
§ Profile
§ Minimize data movement
§ Exploit CUDA streams
§ Use GPUDirect for MPI between devices

gprof + gprof2dot.py

Challenges for Accelerating Dynamics
§ Halo updates between computations
§ Many arrays to move
  – Move all computations to GPU
  – Hide latencies where possible

Strategy – Minimize Data Movement
§ Use Fortran pointers into large memory chunks
  – Reduces amount of data movement

    allocate( mem_chunk(nx, ny, 2) )
    v => mem_chunk(:,:,1)
    w => mem_chunk(:,:,2)
    !$acc enter data create(mem_chunk)

    !$acc update device(mem_chunk)
    !$acc data present(v,w)
    !$acc parallel loop collapse(2)
    do j = 1,ny
      do i = 1,nx
        v(i,j) = alpha * w(i,j)
      enddo
    enddo
    !$acc end data
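
A self-contained version of the pointer-into-chunk pattern might look like the sketch below. The declarations, sizes, and initial values are assumptions added for illustration; only `mem_chunk`, `v`, `w`, and `alpha` come from the slide. It relies on the compiler resolving the pointers `v` and `w` to the device copy of `mem_chunk`, as the slide does, and that behavior may vary between compilers.

```fortran
program chunk_demo
  implicit none
  integer, parameter :: nx = 8, ny = 4
  real, parameter :: alpha = 0.5
  ! The chunk needs the target attribute so pointers can alias it.
  real, allocatable, target :: mem_chunk(:,:,:)
  real, pointer :: v(:,:), w(:,:)
  integer :: i, j

  allocate(mem_chunk(nx, ny, 2))
  v => mem_chunk(:,:,1)
  w => mem_chunk(:,:,2)
  v = 0.0
  w = 2.0

  ! One transfer moves both fields to the device.
  !$acc enter data copyin(mem_chunk)

  !$acc data present(v, w)
  !$acc parallel loop collapse(2)
  do j = 1, ny
    do i = 1, nx
      v(i, j) = alpha * w(i, j)
    end do
  end do
  !$acc end data

  ! One transfer brings both fields back.
  !$acc update host(mem_chunk)

  if (all(v == 1.0)) then
    print *, 'OK'
  else
    print *, 'unexpected result'
    stop 1
  end if

  ! Device memory is released before the host allocation.
  !$acc exit data delete(mem_chunk)
  deallocate(mem_chunk)
end program chunk_demo
```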

Strategy – Use CUDA Streams
§ Use CUDA streams
  – If loops are data independent, launch with separate streams along with data updates to host/device

    !$acc parallel loop collapse(2) async(1)
    do i = 1,n
      do j = 1,m
        a(i,j) = a(i,j) * w(i,j)
      enddo
    enddo

    !$acc parallel loop collapse(2) async(2)
    do i = 1,n
      do j = 1,m
        b(i,j) = b(i,j) + alpha * t(i,j)
      enddo
    enddo
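
The two independent loops can be combined with asynchronous host updates and a final synchronization, roughly as follows. This is a sketch under assumed array sizes and initial values; the program name and the checks at the end are illustrative additions. Each `update host` is queued behind the loop on the same async queue, and `!$acc wait` blocks until both queues have drained.

```fortran
program async_demo
  implicit none
  integer, parameter :: n = 64, m = 32
  real, parameter :: alpha = 3.0
  real :: a(n, m), b(n, m), w(n, m), t(n, m)
  integer :: i, j

  a = 2.0
  w = 4.0
  b = 1.0
  t = 2.0

  !$acc enter data copyin(a, b, w, t)

  ! Queue 1: update a, then copy it back on the same queue.
  !$acc parallel loop collapse(2) present(a, w) async(1)
  do j = 1, m
    do i = 1, n
      a(i, j) = a(i, j) * w(i, j)
    end do
  end do
  !$acc update host(a) async(1)

  ! Queue 2: independent work on b can overlap with queue 1.
  !$acc parallel loop collapse(2) present(b, t) async(2)
  do j = 1, m
    do i = 1, n
      b(i, j) = b(i, j) + alpha * t(i, j)
    end do
  end do
  !$acc update host(b) async(2)

  ! Neither a nor b is safe to read on the host until both queues finish.
  !$acc wait

  !$acc exit data delete(a, b, w, t)

  if (all(a == 8.0) .and. all(b == 7.0)) then
    print *, 'OK'
  else
    print *, 'unexpected result'
    stop 1
  end if
end program async_demo
```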

Strategy – Use CUDA Streams
§ Invoke subroutine calls using streams

  Before:

    do cat = 1,ncat
      call construct_fields(mx, my)
    enddo

  With streams:

    do cat = 1,ncat
      ! In construct_fields, use 'cat' as the
      ! async stream value
      call construct_fields(cat, mx(:,:,cat), &
                            my(:,:,cat))
    enddo
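
One way the callee could use the category index as its async queue is sketched below. The body of `construct_fields` here is invented for illustration and is not the CICE routine; only the call pattern and the names `mx`, `my`, `cat`, and `ncat` come from the slide. Each category slice is contiguous, so passing it does not create a temporary.

```fortran
program stream_calls
  implicit none
  integer, parameter :: nx = 16, ny = 8, ncat = 5
  real :: mx(nx, ny, ncat), my(nx, ny, ncat)
  integer :: cat
  logical :: ok

  mx = 1.0
  my = 0.0

  !$acc data copyin(mx) copy(my)
  do cat = 1, ncat
    ! Each call returns as soon as its kernel is queued.
    call construct_fields(cat, mx(:,:,cat), my(:,:,cat))
  end do
  ! Wait for every queue before the data region copies my back.
  !$acc wait
  !$acc end data

  ok = .true.
  do cat = 1, ncat
    ok = ok .and. all(my(:,:,cat) == 1.0 + real(cat))
  end do
  if (ok) then
    print *, 'OK'
  else
    print *, 'unexpected result'
    stop 1
  end if

contains

  ! Hypothetical stand-in for the real routine: the first argument
  ! selects the async queue used by the kernel.
  subroutine construct_fields(stream, fx, fy)
    integer, intent(in) :: stream
    real, dimension(:,:), intent(in) :: fx
    real, dimension(:,:), intent(inout) :: fy
    integer :: i, j

    !$acc parallel loop collapse(2) present(fx, fy) async(stream)
    do j = 1, size(fy, 2)
      do i = 1, size(fy, 1)
        fy(i, j) = fx(i, j) + real(stream)
      end do
    end do
  end subroutine construct_fields

end program stream_calls
```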

GPUDirect
§ Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
  – Compiled CICE with PGI 14.7
§ Titan has Cray's CUDA-enabled version of MPICH
  – Also used PGI 14.7
  – However, the XK7 hardware doesn't support RDMA, so MPI still goes through the CPU
  – But coding to GPUDirect now prepares for the upcoming Summit cluster

GPUDirect

  Before:

    call ice_haloUpdate(dpx, halo_info)
    call ice_haloUpdate(dpy, halo_info)
    call ice_haloUpdate(mx, halo_info)

  With GPUDirect:

    call ice_haloBegin(halo_info, 3, updateInfo)
    call ice_devHaloUpdate(halo_info, updateInfo, dpx)
    call ice_devHaloUpdate(halo_info, updateInfo, dpy)
    call ice_devHaloUpdate(halo_info, updateInfo, mx)
    call ice_haloEnd(halo_info, updateInfo)
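
The slides do not show what happens inside `ice_devHaloUpdate`. As a generic illustration of handing device buffers to a CUDA-aware MPI, the sketch below uses OpenACC's `host_data use_device` around a standard `MPI_Sendrecv` in a ring exchange. The program, buffer names, and message size are all assumptions. When built with OpenACC it requires a CUDA-aware MPI library; built without OpenACC, the same source performs an ordinary host-to-host exchange.

```fortran
program gpudirect_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024
  real(kind=8) :: sendbuf(n), recvbuf(n)
  integer :: rank, nprocs, left, right, ierr, i
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Ring neighbors; with one rank, the process exchanges with itself.
  right = mod(rank + 1, nprocs)
  left  = mod(rank - 1 + nprocs, nprocs)

  recvbuf = -1.0d0

  !$acc data create(sendbuf) copy(recvbuf)

  ! Fill the send buffer on the device.
  !$acc parallel loop present(sendbuf)
  do i = 1, n
    sendbuf(i) = real(rank, kind=8)
  end do

  ! Inside host_data, the buffer names refer to device addresses,
  ! which a CUDA-aware MPI can use directly.
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  !$acc end host_data

  !$acc end data

  if (all(recvbuf == real(left, kind=8))) then
    if (rank == 0) print *, 'OK'
  else
    print *, 'unexpected result on rank', rank
  end if

  call MPI_Finalize(ierr)
end program gpudirect_sketch
```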

Results

Test Cases
§ Used two test problems: gx1 and tp4
  – gx1: 16 procs, grid size 320x384
  – tp4: 60 procs, grid size 900x600
§ Ran with longitudinal blocks
  – Reduces load imbalance

Test Platforms
§ LANL Moonlight
  – PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
  – Intel Xeon (SandyBridge) 16 cores/node
  – 2 Nvidia M2090/node
§ ORNL Titan
  – PGI 14.7, Cray's MPICH, CUDA 5.5
  – AMD Interlagos 16 cores/node
  – 1 Nvidia K20x/node

Test Platforms
§ Nvidia's PSG cluster
  – Ivy Bridge E5-2690, dual socket 10 cores/socket, 6 x K40
  – OpenMPI 1.8.5

Test Cases – tp4
4 nodes, 15 procs per node

Runtime (secs)         Moonlight   Titan   PSG
Baseline               96          173     78
OpenACC + GPUDirect    99          194     91

tp4 Rank Distribution

Runtime (secs)        4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight    73               69               67
GPU Moonlight         96               83               96
Baseline Titan        173              169              160
GPU Titan             194              182              163
Baseline PSG          78               74               69
GPU PSG               91               81               80

tp4 Scaling
Fixed at 2 MPI procs per node

Runtime (secs)    10 procs/5 nodes   20 procs/10 nodes   40 procs/20 nodes
Baseline Titan    337                187                 201
GPU+GPUDirect     325                180                 193

tp4 Scaling
Fixed at 5 MPI procs per node

Runtime (secs)    10 procs/2 nodes   20 procs/4 nodes   40 procs/5 nodes
Baseline PSG      161                87                  91
GPU+GPUDirect     197                107                 109

Topology on PSG

nvidia-smi topo -m

          GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0      X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1      PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2      PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3      PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4      SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5      SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0    PHB    PHB    PHB    PHB    SOC    SOC    X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Conclusion - Path Forward
§ Focus on dynamics/transport
§ Improve use of GPUDirect
  – Get rid of aggregation device buffer
  – Restructure code in order to get better communication/computation overlap
§ Can we find task parallelism?
  – Fuse kernels (not enough work)
  – Spawn computation in stream while performing halo updates
  – OpenACC + OpenMP