Early Experience with OpenACC and OpenMP4 on
π, the supercomputer of SJTU
James Lin
Vice Director, Center for High Performance Computing, Shanghai Jiao Tong University
Visiting Associate, Satoshi MATSUOKA Laboratory, Tokyo Institute of Technology
HPC Advisory Council China Workshop 2013 @ Guilin
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
π, the supercomputer of SJTU
• No. 1 among China MOE universities; No. 158 on the June 2013 TOP500 list
– Plan to upgrade some of the accelerators in 2015, with a V2.0 around 2017
• Specification
– Type: INSPUR TS10000
– Performance: 830 Intel Sandy Bridge E5-2670 CPUs + 100 NVIDIA Kepler K20 GPUs + 10 Intel Xeon Phi 5110P coprocessors = 196.2 / 262.6 TFlops (Rmax / Rpeak)
– Nodes: Intel EPSD JP/WP
– Interconnect: Mellanox InfiniBand FDR 56 Gbps, 648 ports
– Parallel file storage: DDN SFA12K, 720 TB with Lustre
– SSD: 80 Intel 400 GB SSDs
– Cooling system: Rittal liquid cooling
• Applications
– Open Source: Gromacs, LAMMPS…
– In-house: PIC, NS3D, DSMC…
Milestones in Year 2013
– Apr: Assembled
– June: Early Access Program; LINPACK submitted to TOP500
– Oct: Production
Why did we build a GPU-Phi cluster?
• Mostly based on academic considerations, and quite confident the GPUs or Phis can be fully used
– Large-scale GPU supercomputers: Titan, TSUBAME 2.5
– Large-scale Xeon Phi supercomputers: Tianhe-2, Stampede
• Accelerators could be a path to Exascale
• More applications are ready for GPU in this generation
• We will keep an open mind about the next generation: Maxwell and Knights Landing
Single source code base for GPU and Xeon Phi?
• Low level: OpenCL
• High level: directive-based programming
• Just like Java for Windows and Linux in the early days (see the sketch below)
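To make the single-source idea concrete, here is a minimal sketch (an illustrative saxpy loop, not from the original slides): the same loop body carries either an OpenACC or an OpenMP 4.0 directive, and the compiler retargets it to the GPU or the Xeon Phi.

/* Illustrative example: one source, two directive standards.
   The loop itself is identical; only the pragma changes. */

/* OpenACC: offload to an attached accelerator (e.g., GPU) */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4.0: same loop, targeting a GPU or Xeon Phi */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}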
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Staff
Adjuncts
User Committee
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Current Research Topics on π
• Directive-based Programming on Accelerators
– The Ninja Gap [1]
• Application Performance Evaluation and Optimization on Accelerators
– Particle in Cell
– Molecular Dynamics
[1] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, "Can traditional programming bridge the Ninja performance gap for parallel computing applications?," ISCA 2012
Why Directive-based Programming?
• We may have many reasons to use directive-based programming, but for me, it keeps the code readable and maintainable from the application developers' point of view
[Diagram: application developers evolve Version 1 into a Version 2 based on their own code, while CUDA experts port Version 1 to a CUDA Version 1 that is unmaintainable to the application developers]
Directive-based Programming for Accelerators[1]
• Standards
– OpenACC
– OpenMP >= 4.0
• Products
– PGI Accelerator
– HMPP
• Research projects
– R-Stream
– HiCUDA
– OpenMPC / OpenMP for accelerators
[1] S. Lee and J. S. Vetter, "Early evaluation of directive-based GPU programming models for productive exascale computing," SC12: International Conference for High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1–11.
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
OpenACC
• The OpenACC standard evolves much faster than OpenMP and MPI: a new version roughly every 1.5 years
– Maybe too fast for compiler vendors and application developers to catch up
• NVIDIA proposes a "Fork and Merge" model for OpenACC with respect to OpenMP

Standard          v1.0       v2.0       v3.0       v4.0
OpenACC           Nov 2011   July 2013  --         --
OpenMP (Fortran)  Oct 1997   Nov 2000   May 2008   July 2013
MPI               June 1994  July 1997  Sept 2012  --
EPCC OpenACC Benchmark Suite
• Version 1.0 was released this summer
– Developed by Nick Johnson of EPCC
– Source code is available on GitHub
• Divided into 3 sections
– Level A benchmarks mainly measure the speed of operation of various OpenACC functions (a rough sketch of this style of microbenchmark appears below)
– Level B benchmarks measure the performance of some BLAS-type kernels
– Level C benchmarks are real application codes:
• Himeno
• 27stencil
• Le2d
http://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openacc-benchmark-suite
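A rough sketch of a Level A-style microbenchmark (not the suite's actual code; the repetition count and timer are illustrative): it times an empty OpenACC parallel region to estimate per-launch construct overhead.

/* Sketch of a Level A-style microbenchmark: time a bare OpenACC
   parallel region to estimate construct/launch overhead. */
#include <stdio.h>
#include <sys/time.h>

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int reps = 1000;
    double t0 = wtime();
    for (int r = 0; r < reps; r++) {
        #pragma acc parallel
        { /* empty region: only the construct overhead is measured */ }
    }
    double t1 = wtime();
    printf("OpenACC parallel construct overhead: %.3f us/launch\n",
           (t1 - t0) / reps * 1e6);
    return 0;
}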
Preliminary Results
1. In most cases, CUDA on the K20 runs faster than OpenCL on the MIC
2. The exceptions are Gaussian and Himeno
Roofline Model [1]
• We apply the Roofline model here to explore the relationship between arithmetic intensity and performance portability
• i.e., what kind of application can achieve good performance portability
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
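For reference, the standard Roofline bound, evaluated here as a worked example (not from the original slides) with the device numbers from the hardware table later in this talk:

P(I) = min(P_peak, I × BW),   ridge point I_ridge = P_peak / BW

Tesla K20m (SP):     min(3520 GFlops, I × 208 GB/s)  =>  I_ridge ≈ 16.9 flops/byte
Xeon Phi 5110P (SP): min(2022 GFlops, I × 320 GB/s)  =>  I_ridge ≈ 6.3 flops/byte
Example: SpMV (I = 0.06) is memory-bound on both: ≈ 12.5 GFlops (K20m) vs ≈ 19.2 GFlops (Phi)

Kernels whose intensity falls below a device's ridge point are bandwidth-bound there, which is what the intensity tables below help identify.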
Arithmetic Intensity of the Benchmarks (flops/byte)

Level A:  SpMV 0.06 | GEMM 4.41~122.04 | GESUMM 0.24 | 2MM 3.28~62.35 | 3MM 6.10~101.67 | ATAX 0.49 | BICG 0.49 | MVT 0.49
Level B:  SYRK 5.28~135.96 | SYR2K 4.97~192.02 | 2DConv 0.708 | 3DConv 1.244 | COR 2.77~50.87 | COV 2.00~51.15 | Pathfinder 0.09
Level C:  Hotspot 0.50 | Gaussian 0.24 | 27Stencil 1.47 | Himeno 1.78 | Le2d 1.63
Roofline Model Analysis
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
OpenMP 4.0 for Accelerators
• Released in July 2013, it supports directive-based programming on accelerators such as GPUs and Xeon Phi
• Directives for
– Parallelism: target/parallel
– Data: target data
– Two levels of parallelism: teams/distribute
• A standard already supported by Intel, and to be supported by NVIDIA
Code Example: Jacobi Kernel with OpenMP 4.0
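A minimal sketch of such a Jacobi kernel, assuming the usual five-point stencil (the array names, grid size n, and iteration count are illustrative; this is not necessarily the original slide's code):

/* Jacobi iteration sketch with OpenMP 4.0 accelerator directives.
   u and unew are n x n row-major grids. */
void jacobi(float *u, float *unew, int n, int iters)
{
    /* target data keeps both grids resident on the device
       across all iterations */
    #pragma omp target data map(tofrom: u[0:n*n]) map(alloc: unew[0:n*n])
    {
        for (int it = 0; it < iters; it++) {
            /* two levels of parallelism: teams + threads over the grid */
            #pragma omp target teams distribute parallel for collapse(2)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    unew[i*n + j] = 0.25f * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                           + u[i*n + j-1]  + u[i*n + j+1]);

            /* copy unew back into u on the device for the next sweep */
            #pragma omp target teams distribute parallel for collapse(2)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    u[i*n + j] = unew[i*n + j];
        }
    }
}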
HOMP, an OpenMP Compiler for CUDA
• Developed at LLNL, it is built on ROSE [1], a source-to-source compiler
• It is an early research implementation [2] of OpenMP 4.0 on GPUs, targeting CUDA 5.0
• It currently supports C/C++ only
[1] http://rosecompiler.org
[2] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early Experiences with the OpenMP Accelerator Model," IWOMP 2013
Missing Features of OpenMP 4.0
• Multiple Device Support
• Combined Constructs (separate, no-middle-sync)
• Array Sections
• Global Barrier
• Mapping Nested Loops
• Mapped Data Reuse
C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early Experiences with the OpenMP Accelerator Model," IWOMP 2013
Experiment Methodology
[Diagram: OpenMP 4.0 source compiled by HOMP-ROSE for the Kepler K20m and by the Intel Compiler for the MIC 5110P; test cases: AXPY, Jacobi, MM]
Test cases
• AXPY
• Jacobi
• Matrix Multiplication
Software used in π
• HOMP
• CUDA 5.0
• Intel Compiler
Hardware used in π

                    Xeon E5-2670   Xeon Phi 5110P      Tesla K20m
Performance (SP)    333 GFlops     2022 GFlops         3520 GFlops
Memory Bandwidth    51.2 GB/s      320 GB/s (ECC off)  208 GB/s (ECC off)
Memory Size         ---            8 GB                5 GB
Number of Cores     8              60/61               2496
Clock Speed         2.6 GHz        1.053 GHz           0.706 GHz
Anatomy of a GPU/Phi Node in π
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Summary
• π, the supercomputer of SJTU, is a GPU-Phi hybrid cluster
• Directive-based programming approaches, such as OpenACC and OpenMP 4.0, show potential for running a single source code base on both the GPU and the Xeon Phi
Contact Information
Website: http://hpc.sjtu.edu.cn
Blog: http://ccoe.sjtu.edu.cn
Weibo: http://e.weibo.com/sjtuhpc
Email: [email protected]
http://icsc2014.sjtu.edu.cn/
May 7-9, 2014 @SJTU
Pre-conference GPU tutorial, May 5-6, 2014
• Parallel Programming tutorial, V.Morozov
• GPU and Beyond, L.Grinberg
• Hybrid/GPU Programming, TH Tang