

AceCAST – High Performance CUDA-based Weather Research and Forecasting (WRF) Model

Allen HUANG ([email protected]), Tempo Quest, Inc., and scientists from the Space Science and Engineering Center, University of Wisconsin–Madison

Introduction

WRF System Components

Code Validation

GPU Technology Conference, Silicon Valley, April 2016

Summary

WRF physics options (figure credit: Jimy Dudhia)

Figures: potential temperature [K]; difference between potential temperatures on CPU and GPU [K]

Performance Profile of WRF

Jan. 2000 30km workload profiling by John Michalakes, “Code restructuring to improve performance in WRF model physics on Intel Xeon Phi”, Workshop on Programming weather, climate, and earth-system models on heterogeneous multi-core platforms, September 20, 2013

Performance Comparison

• ~35% of the widely used WRF code base has been ported to CUDA-C
• A highly skilled core GPU programming team has been established
• High-performance WRF has been demonstrated in ~70 papers and independently validated by NVIDIA (CUDA-based modules achieved speedups ranging from 105x to 1311x)
• Lessons learned from the CUDA-C work can be readily applied to OpenACC 2.0/OpenMP 4.0 optimization of WRF modules for GPU/Intel MIC
• Currently funding, not science or technology, is the only barrier to commercialization (time-to-solution is only ~12 months)

The WRF physics components are microphysics, cumulus parametrization, planetary boundary layer (PBL), land-surface model, and shortwave/longwave radiation.

• AceCAST is a proprietary version of WRF, a mesoscale and global Weather Research and Forecasting model
• Designed for both operational forecasters and atmospheric researchers, and widely used by commercial, government, and institutional users around the world, in >150 countries
• WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers
• Increases in computational power enable:
  - Increased vertical as well as horizontal resolution
  - More timely delivery of forecasts
  - Probabilistic forecasts based on ensemble methods, with much improved forecast accuracy
• Why accelerate AceCAST?
  - High-resolution accuracy and cost performance
  - Need for strong scaling
  - Greatly improved profits for weather-sensitive industries

GPU Speedups

Parallel Execution of WRF on GPU

Single-threaded, non-vectorized CPU code is compiled with gfortran 4.4.6

WRF model integration procedure

• Fused multiply-add was turned off (--fmad=false)
• The GNU C math library was used on the GPU, i.e. powf(), expf(), sqrt() and logf() were replaced by routines from the GNU C library → bit-exact output compared with the gfortran-compiled CPU code (a hedged build/kernel sketch follows this list)
• Small output differences with -fast-math (see the potential temperature difference figure)
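The following is a minimal sketch of this bit-reproducibility setup, not the AceCAST source: the kernel name, data layout, and formula are illustrative placeholders, and only the --fmad=false flag comes from the poster. In the poster's build, expf()/powf()/sqrt()/logf() are further replaced by routines ported from the GNU C library; the sketch simply avoids the fast-math intrinsics.

#include <cuda_runtime.h>

// Compile with fused multiply-add disabled so GPU results can be compared
// bit-for-bit with the gfortran CPU reference, e.g.:
//   nvcc --fmad=false -c repro_sketch.cu
__global__ void saturation_vapor_pressure(const float *t, float *es, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Plain expf() rather than the __expf() intrinsic; the constants are
        // illustrative (a Bolton-type saturation vapour pressure over water).
        es[i] = 611.2f * expf(17.67f * (t[i] - 273.15f) / (t[i] - 29.65f));
    }
}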

Equations describing the YSU PBL scheme are executed in one thread for each grid point (see the column-mapping sketch below)

Mapping of the CONUS domain onto one GPU thread-block-grid domain
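A minimal sketch of this thread-per-column mapping, assuming i and j index the horizontal CONUS grid and k the vertical levels; the kernel name, array names, and the trivial column operation are placeholders rather than AceCAST code.

#include <cuda_runtime.h>

// Each thread owns one (i, j) grid column and loops over the nz vertical
// levels, mirroring the one-thread-per-grid-point execution of the YSU PBL
// equations. A 2-D grid of 2-D thread blocks tiles the horizontal domain.
__global__ void ysu_column_sketch(const float *theta_in, float *theta_out,
                                  int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // west-east index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // south-north index
    if (i >= nx || j >= ny) return;

    for (int k = 0; k < nz; ++k) {
        // i is fastest-varying, so neighbouring threads touch adjacent memory
        int idx = (k * ny + j) * nx + i;
        theta_out[idx] = theta_in[idx];              // placeholder column work
    }
}

// Illustrative launch over an nx x ny horizontal domain:
//   dim3 block(32, 8);
//   dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
//   ysu_column_sketch<<<grid, block>>>(d_in, d_out, nx, ny, nz);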

Implementation of YSU PBL on GPUs with CUDA

Program         CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms     -             -
Non-coalesced   -             50.0 ms       36.0x
Coalesced       -             48.0 ms       37.5x
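The non-coalesced and coalesced rows differ only in how the ported code addresses global memory. The hedged sketch below contrasts the two access patterns; the kernels, array names, and layouts are illustrative and are not taken from the YSU PBL module.

// Non-coalesced: with i slowest-varying, consecutive threads access memory
// with a stride of nz elements, so each warp touches many memory segments.
__global__ void copy_noncoalesced(const float *a, float *b, int nx, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nx) return;
    for (int k = 0; k < nz; ++k)
        b[i * nz + k] = a[i * nz + k];   // stride-nz accesses within a warp
}

// Coalesced: with i fastest-varying, the 32 threads of a warp read and write
// one contiguous memory segment per level.
__global__ void copy_coalesced(const float *a, float *b, int nx, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nx) return;
    for (int k = 0; k < nz; ++k)
        b[k * nx + i] = a[k * nx + i];   // unit-stride accesses within a warp
}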

Improvement of YSU PBL in GPU-Based Parallelism

Three configurations of the on-chip split between shared memory and L1 cache are available:
(1) 48 KB shared memory, 16 KB L1 cache (default)
(2) 32 KB shared memory, 32 KB L1 cache
(3) 16 KB shared memory, 48 KB L1 cache

Configuration (3) is selected by applying "cudaFuncCachePreferL1" (see the sketch after the table). After increasing the L1 cache this way, the GPU runtime decreases and the speedup increases:

Program         CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms     -             -
Non-coalesced   -             48.0 ms       37.5x
Coalesced       -             45.0 ms       40.0x
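A minimal sketch of selecting configuration (3) via the CUDA runtime API; ysu_pbl_kernel is a placeholder name for the ported kernel and its body is a stand-in, not AceCAST code.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void ysu_pbl_kernel(float *theta, int n)    // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) theta[i] += 0.0f;                        // stand-in for PBL work
}

int main()
{
    // Request the 16 KB shared / 48 KB L1 split (configuration (3) above)
    // for this kernel before any launch.
    cudaError_t err = cudaFuncSetCacheConfig(ysu_pbl_kernel,
                                             cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        fprintf(stderr, "cache config failed: %s\n", cudaGetErrorString(err));
    return 0;
}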

By scalarization, the temporary arrays are reduced from 68 down to 14, which greatly reduces global memory traffic (a sketch of the idea follows the table):

Program         CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms     -             -
Non-coalesced   -             39.0 ms       46.2x
Coalesced       -             35.0 ms       51.4x
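A hedged before/after sketch of the scalarization step: a per-point temporary that previously lived in a global scratch array becomes a register-resident scalar, removing one global store and one global load per grid point. The kernel names, arrays, and arithmetic are illustrative only.

// Before: the intermediate value is staged in a global scratch array "tmp".
__global__ void before_scalarization(const float *a, float *tmp,
                                     float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    tmp[i] = a[i] * 2.0f;        // extra global store
    out[i] = tmp[i] + 1.0f;      // extra global load
}

// After: the temporary is a per-thread scalar held in a register; applied
// across the module, 68 temporary arrays were reduced to 14.
__global__ void after_scalarization(const float *a, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float tmp = a[i] * 2.0f;     // register-resident temporary
    out[i] = tmp + 1.0f;
}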

GPU Runtime & Speedups with Multi-GPU Implementations for YSU PBL Module

TQI AceCAST's high performance has been validated by peer-reviewed publications and independently verified by NVIDIA:
• Peer review of the software implementation and optimization by multiple anonymous experts to validate TQI's claimed performance
• Independent evaluation and implementation of the source codes (7 targeted CUDA-C modules achieve the published GPU performance within a full WRF model run)
• Official presentations of TQI performance results at international professional and scientific conferences
  - WRF model physics performance comparison between Intel Xeon Phi and NVIDIA Kepler K20/K40 (with/without boost mode)
  - Results from the NVIDIA PSG cluster (HQ, USA): >10x speedups
  - NVIDIA independent evaluation of WRF GPU CUDA-C acceleration: WSM6, 64x speedup

• All runs were made on the NVIDIA PSG cluster with a user-provided WRF namelist
• CPU-only and CPU+GPU hybrid results are based on standard NCAR WRF 3.6.1 plus CUDA WRF 3.6.1 modules developed by SSEC
• CPU-only results were run on 2 x IVB CPUs, a total of 20 cores
• CPU+GPU results were run on 2 x IVB CPUs (20 cores total), using only 1 of the 2 GPUs on a Tesla K80
• System software included CUDA 7.0 and PGI 15.7, running on CentOS 6

Contact: Valeriu Codreanu ([email protected])

Poster: P6338

Category: Earth System Modelling - ESM01
