Early Experience with OpenACC and OpenMP4 on
π, the supercomputer of SJTU
James Lin
Vice Director, Center for High Performance Computing, Shanghai Jiao Tong University
Visiting Associate, Satoshi MATSUOKA Laboratory, Tokyo Institute of Technology
HPC Advisory Council China Workshop 2013 @ Guilin
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
π, the supercomputer of SJTU
• No. 1 among China MOE universities; No. 158 on the June 2013 TOP500 list
– Plan to upgrade some of the accelerators in 2015, with a V2.0 around 2017
• Specification
– Type: INSPUR TS10000
– Performance: 830 Intel Sandy Bridge E5-2670 CPUs + 100 NVIDIA Kepler K20 GPUs + 10 Intel Xeon Phi 5110P coprocessors = 196.2 / 262.6 TFlops (Rmax / Rpeak)
– Nodes: Intel EPSD JP/WP
– Interconnect: Mellanox InfiniBand FDR 56 Gbps, 648 ports
– Parallel file storage: DDN SFA12K, 720 TB with Lustre
– SSD: 80 Intel 400 GB SSDs
– Cooling system: Rittal liquid cooling
• Applications
– Open Source: Gromacs, LAMMPS…
– In-house: PIC, NS3D, DSMC…
Milestones in Year 2013
– Apr: Assembled
– June: Early Access Program; LINPACK submitted to TOP500
– Oct: Production
Why did we build a GPU-Phi cluster?
• Mostly based on academic considerations, and quite confident the GPUs or Phis can be fully used
– Large-scale GPU supercomputers: Titan, TSUBAME 2.5
– Large-scale Xeon Phi supercomputers: Tianhe-2, Stampede
• Accelerators could be a path to Exascale
• More applications are ready for GPU in this generation
• We will keep an open mind about the next generation: Maxwell and Knights Landing
Single source code base for GPU and Xeon Phi?
• Low level: OpenCL
• High level: directive-based programming
• Just like Java for Windows and Linux in the early days (see the sketch below)
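To make the single-source idea concrete, here is a minimal sketch (an illustrative saxpy loop, not from the original slides): the same loop body carries either an OpenACC or an OpenMP 4.0 directive, and the compiler retargets it to the GPU or the Xeon Phi.

/* Illustrative example: one source, two directive standards.
   The loop itself is identical; only the pragma changes. */

/* OpenACC: offload to an attached accelerator (e.g., GPU) */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4.0: same loop, targeting a GPU or Xeon Phi */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}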
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Staff
Adjuncts
User Committee
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Current Research Topics on π
• Directive-based Programming on Accelerators
– The Ninja Gap [1]
• Application Performance Evaluation and Optimization on Accelerators
– Particle in Cell
– Molecular Dynamics
[1] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, "Can traditional programming bridge the Ninja performance gap for parallel computing applications?," ISCA 2012
Why Directive-based Programming?
• We may have many reasons to use directive-based programming, but for me, it keeps the code readable and maintainable from the application developers' point of view
[Diagram: application developers evolve Version 1 into a Version 2 based on their own code, while CUDA experts port Version 1 to a CUDA Version 1 that is unmaintainable to the application developers]
Directive-based Programming for Accelerators[1]
• Standards
– OpenACC
– OpenMP >= 4.0
• Products
– PGI Accelerator
– HMPP
• Research projects
– R-Stream
– HiCUDA
– OpenMPC / OpenMP for accelerators
[1] S. Lee and J. S. Vetter, "Early evaluation of directive-based GPU programming models for productive exascale computing," SC12: International Conference for High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1–11.
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
OpenACC
• The OpenACC standard evolves much faster than OpenMP and MPI: a new version roughly every 1.5 years
– Maybe too fast for compiler vendors and application developers to catch up
• NVIDIA proposes a "Fork and Merge" model for OpenACC with respect to OpenMP

Standard          v1.0       v2.0       v3.0       v4.0
OpenACC           Nov 2011   July 2013  --         --
OpenMP (Fortran)  Oct 1997   Nov 2000   May 2008   July 2013
MPI               June 1994  July 1997  Sept 2012  --
EPCC OpenACC Benchmark Suite
• Version 1.0 was released this summer
– Developed by Nick Johnson of EPCC
– Source code is available on GitHub
• Divided into 3 sections
– Level A benchmarks mainly measure the speed of operation of various OpenACC functions (a rough sketch of this style of microbenchmark appears below)
– Level B benchmarks measure the performance of some BLAS-type kernels
– Level C benchmarks are real application codes:
• Himeno
• 27stencil
• Le2d
http://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openacc-benchmark-suite
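A rough sketch of a Level A-style microbenchmark (not the suite's actual code; the repetition count and timer are illustrative): it times an empty OpenACC parallel region to estimate per-launch construct overhead.

/* Sketch of a Level A-style microbenchmark: time a bare OpenACC
   parallel region to estimate construct/launch overhead. */
#include <stdio.h>
#include <sys/time.h>

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int reps = 1000;
    double t0 = wtime();
    for (int r = 0; r < reps; r++) {
        #pragma acc parallel
        { /* empty region: only the construct overhead is measured */ }
    }
    double t1 = wtime();
    printf("OpenACC parallel construct overhead: %.3f us/launch\n",
           (t1 - t0) / reps * 1e6);
    return 0;
}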
Preliminary Results
1. In most cases, CUDA on the K20 runs faster than OpenCL on the MIC
2. The exceptions are Gaussian and Himeno
Roofline Model [1]
• We apply the Roofline model here to explore the relationship between arithmetic intensity and performance portability
• i.e., what kind of application can achieve good performance portability
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
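For reference, the standard Roofline bound, evaluated here as a worked example (not from the original slides) with the device numbers from the hardware table later in this talk:

P(I) = min(P_peak, I × BW),   ridge point I_ridge = P_peak / BW

Tesla K20m (SP):     min(3520 GFlops, I × 208 GB/s)  =>  I_ridge ≈ 16.9 flops/byte
Xeon Phi 5110P (SP): min(2022 GFlops, I × 320 GB/s)  =>  I_ridge ≈ 6.3 flops/byte
Example: SpMV (I = 0.06) is memory-bound on both: ≈ 12.5 GFlops (K20m) vs ≈ 19.2 GFlops (Phi)

Kernels whose intensity falls below a device's ridge point are bandwidth-bound there, which is what the intensity tables below help identify.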
Arithmetic Intensity of the Benchmarks (flops/byte)

Level A:  SpMV 0.06 | GEMM 4.41~122.04 | GESUMM 0.24 | 2MM 3.28~62.35 | 3MM 6.10~101.67 | ATAX 0.49 | BICG 0.49 | MVT 0.49
Level B:  SYRK 5.28~135.96 | SYR2K 4.97~192.02 | 2DConv 0.708 | 3DConv 1.244 | COR 2.77~50.87 | COV 2.00~51.15 | Pathfinder 0.09
Level C:  Hotspot 0.50 | Gaussian 0.24 | 27Stencil 1.47 | Himeno 1.78 | Le2d 1.63
Roofline Model Analysis
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
OpenMP 4.0 for Accelerators
• Released in July 2013, it supports directive-based programming on accelerators such as GPUs and Xeon Phi
• Directives for
– Parallelism: target/parallel
– Data: target data
– Two levels of parallelism: teams/distribute
• A standard already supported by Intel, and to be supported by NVIDIA
Code Example: Jacobi Kernel with OpenMP 4.0
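A minimal sketch of such a Jacobi kernel, assuming the usual five-point stencil (the array names, grid size n, and iteration count are illustrative; this is not necessarily the original slide's code):

/* Jacobi iteration sketch with OpenMP 4.0 accelerator directives.
   u and unew are n x n row-major grids. */
void jacobi(float *u, float *unew, int n, int iters)
{
    /* target data keeps both grids resident on the device
       across all iterations */
    #pragma omp target data map(tofrom: u[0:n*n]) map(alloc: unew[0:n*n])
    {
        for (int it = 0; it < iters; it++) {
            /* two levels of parallelism: teams + threads over the grid */
            #pragma omp target teams distribute parallel for collapse(2)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    unew[i*n + j] = 0.25f * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                           + u[i*n + j-1]  + u[i*n + j+1]);

            /* copy unew back into u on the device for the next sweep */
            #pragma omp target teams distribute parallel for collapse(2)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    u[i*n + j] = unew[i*n + j];
        }
    }
}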
HOMP, an OpenMP Compiler for CUDA
• Developed at LLNL, it is built on ROSE [1], a source-to-source compiler
• It is an early research implementation [2] of OpenMP 4.0 on GPUs, targeting CUDA 5.0
• It currently supports C/C++ only
[1] http://rosecompiler.org
[2] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early Experiences with the OpenMP Accelerator Model," IWOMP 2013
Missing Features of OpenMP 4.0
• Multiple Device Support
• Combined Constructs (separate, no-middle-sync)
• Array Sections
• Global Barrier
• Mapping Nested Loops
• Mapped Data Reuse
C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early Experiences with the OpenMP Accelerator Model," IWOMP 2013
Experiment Methodology
[Diagram: OpenMP 4.0 source compiled by HOMP-ROSE for the Kepler K20m and by the Intel Compiler for the MIC 5110P; test cases: AXPY, Jacobi, MM]
Test cases
• AXPY
• Jacobi
• Matrix Multiplication
Software used in π
• HOMP
• CUDA 5.0
• Intel Compiler
Hardware used in π

                    Xeon E5-2670   Xeon Phi 5110P      Tesla K20m
Performance (SP)    333 GFlops     2022 GFlops         3520 GFlops
Memory Bandwidth    51.2 GB/s      320 GB/s (ECC off)  208 GB/s (ECC off)
Memory Size         ---            8 GB                5 GB
Number of Cores     8              60/61               2496
Clock Speed         2.6 GHz        1.053 GHz           0.706 GHz
Anatomy of a GPU/Phi Node in π
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
Summary
• π, the supercomputer of SJTU, is a GPU-Phi hybrid cluster
• Directive-based programming approaches, such as OpenACC and OpenMP 4.0, show potential for running a single source code base on both the GPU and the Xeon Phi
Contact Information
Website: http://hpc.sjtu.edu.cn
Blog: http://ccoe.sjtu.edu.cn
Weibo: http://e.weibo.com/sjtuhpc
Email: [email protected]
http://icsc2014.sjtu.edu.cn/
May 7-9, 2014 @SJTU
Pre-conference GPU tutorial, May 5-6, 2014
• Parallel Programming tutorial, V.Morozov
• GPU and Beyond, L.Grinberg
• Hybrid/GPU Programming, TH Tang