james lin - hpc advisory council › events › 2013 › china-workshop › ...openmp accelerator...
TRANSCRIPT
![Page 1: James Lin - HPC Advisory Council › events › 2013 › china-workshop › ...OpenMP Accelerator Model,” IWOMP13 Missing Features of OpenMP4.0 • Multiple Device Support • Combined](https://reader034.vdocument.in/reader034/viewer/2022042402/5f146c9b14eaba3ed7738599/html5/thumbnails/1.jpg)
Early Experience with OpenACC and OpenMP4 on
π, the supercomputer of SJTU
Visiting Associate
Satoshi MATSUOKA Laboratory
Tokyo Institute of Technology
James Lin
Vice Director
Center for High Performance Computing
Shanghai Jiao Tong University
HPC Advisory 2013@Guilin
Outline
1. π, the supercomputer of SJTU
2. People behind π
3. Research on π
– Background
– OpenACC
– OpenMP for Accelerators
4. Summary
π, the supercomputer of SJTU
π, the supercomputer of SJTU
• No. 1 among China MOE universities, No. 158 on the TOP500 list of June 2013
– Plan to upgrade some accelerators in 2015 and move to V2.0 around 2017
• Specification
– Type: INSPUR TS10000
– Performance: 830 Intel SNB E5-2670 + 100 NVIDIA Kepler K20 + 10 Intel Xeon Phi 5110P = 196.2/262.6 TFlops (Linpack/peak)
– Nodes: Intel EPSD JP/WP
– Interconnect: Mellanox InfiniBand FDR 56 Gbps, 648 ports
– Parallel File Storage: DDN SFA12K 720 TB with Lustre
– SSD: 80 Intel SSDs, 400 GB each
– Cooling System: Rittal liquid cooling
• Applications
– Open Source: Gromacs, LAMMPS…
– In-house: PIC, NS3D, DSMC…
Milestones in Year 2013
[Timeline figure. Months marked: Apr, June, Oct. Events: Assembled; Early Access Program; Submit LINPACK to TOP500; Production]
Why did we build a GPU-Phi cluster?
• Mostly based on academic considerations, and we are quite confident that the GPUs and Phis can be fully used
– Large-scale GPU supercomputers: Titan, TSUBAME 2.5
– Large-scale Xeon Phi supercomputers: Tianhe-2, Stampede
• Accelerators could be a path to Exascale
• More applications are ready for GPU in this generation
• We will keep an open mind about the next generation, Maxwell and Knights Landing
Single source code base for GPU and Xeon Phi?
• Low Level: OpenCL
• High Level: Directive-based Programming
• Just like Java for Windows and Linux in the early days
Staff
Adjuncts
User Committee
Current Research Topics on π
• Directive-based Programming on Accelerators
– Ninja gap [1]
• Application Performance Evaluation and Optimization on Accelerators
– Particle in Cell
– Molecular Dynamics
[1] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, “Can traditional programming bridge the Ninja performance gap for parallel computing applications?,” ISCA 2012
Why Directive-based Programming?
• There may be many reasons to use directive-based programming, but for me the main one is that it keeps the code readable and maintainable from the application developers' point of view
[Diagram: application developers write Version 1; CUDA experts port it to CUDA Version 1, which is unmaintainable to the application developers; Version 2 is then based on the developers' own version]
Directive-based Programming for Accelerators [1]
• Standards
– OpenACC
– OpenMP >= 4.0
• Products
– PGI Accelerator
– HMPP
• Research Projects
– R-Stream
– HiCUDA
– OpenMPC/OpenMP for accelerators
[1] S. Lee and J. S. Vetter, “Early evaluation of directive-based GPU programming models for productive exascale computing,” International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp. 1–11.
OpenACC
• The standard evolves much faster than OpenMP and MPI: about 1.5 years between versions
– Maybe too fast for compiler vendors and application developers to keep up
• NVIDIA proposes a "Fork and Merge" model for OpenACC to OpenMP
Standard / Version | 1.0 | 2.0 | 3.0 | 4.0
OpenACC | Nov 2011 | July 2013 | -- | --
OpenMP-Fortran | Oct 1997 | Nov 2000 | May 2008 | July 2013
MPI | June 1994 | July 1997 | Sept 2012 | --
EPCC OpenACC Benchmark Suite
• Version 1.0 was released this summer
– Developed by Nick Johnson of EPCC
– Source code is available on GitHub
• Divided into 3 sections
– Level A benchmarks mainly measure the speed of various OpenACC operations
– Level B benchmarks measure the performance of some BLAS-type kernels
– Level C benchmarks are real application codes
• Himeno
• 27stencil
• Le2d
http://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openacc-benchmark-suite
Preliminary Results
1. In most cases, CUDA on K20 runs faster than OpenCL on MIC
2. The exceptions are Gaussian and Himeno
Roofline Model [1]
[1] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
• We apply the Roofline model here to explore whether arithmetic intensity is related to performance portability
• Which kinds of applications can achieve good performance portability?
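The Roofline bound itself fits in a few lines. The sketch below assumes the usual formulation from [1]: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The names `roofline_gflops` and `print_roofline` and their parameters are illustrative and do not come from the talk.

```c
#include <stdio.h>

/* Roofline bound: attainable GFlop/s for a kernel with the given
 * arithmetic intensity (flops per byte of memory traffic).
 * peak_gflops and bandwidth_gbs describe the device; all names here
 * are illustrative assumptions. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity)
{
    double memory_bound = intensity * bandwidth_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Print the bound for one kernel on one device. */
void print_roofline(const char *device, double peak_gflops,
                    double bandwidth_gbs, double intensity)
{
    printf("%-12s intensity %6.2f -> at most %8.1f GFlop/s\n",
           device, intensity,
           roofline_gflops(peak_gflops, bandwidth_gbs, intensity));
}
```

For example, with a peak of 3520 GFlop/s and 208 GB/s of memory bandwidth, `roofline_gflops(3520.0, 208.0, 0.5)` is limited by memory to 104 GFlop/s, while an intensity of 100 reaches the 3520 GFlop/s compute ceiling.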
Arithmetic Intensity of the Benchmarks
Level A | SpMV | GEMM | GESUMM | 2MM | 3MM | ATAX | BICG | MVT
Intensity | 0.06 | 4.41~122.04 | 0.24 | 3.28~62.35 | 6.10~101.67 | 0.49 | 0.49 | 0.49
Level B | SYRK | SYR2K | 2DConv | 3DConv | COR | COV | Pathfinder
Intensity | 5.28~135.96 | 4.97~192.02 | 0.708 | 1.244 | 2.77~50.87 | 2.00~51.15 | 0.09
Level C | Hotspot | Gaussian | 27Stencil | Himeno | Le2d
Intensity | 0.50 | 0.24 | 1.47 | 1.78 | 1.63
Roofline Model Analysis
OpenMP 4.0 for Accelerators
• Released in July 2013; it supports directive-based programming on accelerators such as GPUs and Xeon Phi
• Directives for
– Parallelism: target/parallel
– Data: target data
– Two levels of parallelism: teams/distribute
• A standard that is supported by Intel and will be supported by NVIDIA
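The directives listed above combine as in the following minimal AXPY sketch. It is not code from the talk; the function name `axpy_repeat` and the `reps` parameter are illustrative. Without an OpenMP 4.0 compiler or an attached device, the pragmas are ignored or fall back to the host, and the loop computes the same result.

```c
/* Apply y = a*x + y `reps` times on the default device.
 * Illustrative sketch, not code from the talk. */
void axpy_repeat(int n, float a, const float *x, float *y, int reps)
{
    /* Data: keep x and y resident on the device across all kernels. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        for (int r = 0; r < reps; ++r) {
            /* Parallelism: offload the loop (target), spread iterations
             * over teams (teams/distribute), then over the threads of
             * each team (parallel for). The data is already present on
             * the device, so these map clauses should not trigger
             * further copies. */
            #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(tofrom: y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

Repeating the map clauses on the inner construct is a deliberate choice: under OpenMP 4.0 an unmapped pointer referenced in a target region is not automatically associated with the enclosing `target data` region.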
Code Example: Jacobi Kernel with OpenMP4.0
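A minimal sketch of such a kernel, assuming a five-point Jacobi sweep for the 2D Laplace equation on a row-major n x m grid. It is not the listing shown in the talk; the function name `jacobi_sweep` and its parameters are illustrative.

```c
/* One Jacobi sweep: every interior point of unew becomes the average
 * of its four neighbours in u. Boundary points of unew are left
 * untouched. Illustrative sketch, not the code from the talk. */
void jacobi_sweep(int n, int m, const float *u, float *unew)
{
    /* collapse(2) exposes both loops to the teams/threads hierarchy;
     * unew is mapped tofrom so its boundary values survive the copy. */
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: u[0:n*m]) map(tofrom: unew[0:n*m])
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < m - 1; ++j)
            unew[i*m + j] = 0.25f * (u[(i-1)*m + j] + u[(i+1)*m + j]
                                   + u[i*m + j - 1] + u[i*m + j + 1]);
}
```

A full solver would call this in a loop, swapping `u` and `unew` between sweeps and wrapping the loop in a `target data` region so the grids stay on the device.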
HOMP, OpenMP compiler for CUDA
• Developed by LLNL; it is built on ROSE [1], a source-to-source compiler
• It is an early research implementation [2] of OpenMP 4.0 on GPUs, targeting CUDA 5.0
• Currently supports C/C++ only
[1] http://rosecompiler.org
[2] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early Experiences with the OpenMP Accelerator Model,” IWOMP13
Missing Features of OpenMP4.0
• Multiple Device Support
• Combined Constructs Separate
• No-Middle-Sync
• Array Sections
• Global Barrier
• Mapping Nested Loops
• Mapped Data Reuse
C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early Experiences with the OpenMP Accelerator Model,” IWOMP13
Experiment Methodology
[Diagram: OpenMP 4.0 versions of AXPY, Jacobi and MM are built with HOMP-ROSE for the Kepler K20m and with the Intel Compiler for the MIC 5110P]
Test cases
• AXPY
• Jacobi
• Matrix Multiplication
Software used in π
• HOMP
• CUDA 5.0
• Intel Compiler
Hardware used in π
Device | Xeon E5-2670 | Xeon Phi 5110P | Tesla K20m
Performance (SP) | 333 GFlops | 2022 GFlops | 3520 GFlops
Memory Bandwidth | 51.2 GB/s | 320 GB/s (ECC off) | 208 GB/s (ECC off)
Memory Size | --- | 8 GB | 5 GB
Number of cores | 8 | 60/61 | 2496
Clock Speed | 2.6 GHz | 1.053 GHz | 0.706 GHz
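The single-precision peaks in this table can be reproduced from the core counts and clock speeds, and dividing each peak by its memory bandwidth gives the Roofline ridge point. The sketch below assumes per-core, per-cycle SP throughput of 16 flops (Sandy Bridge AVX), 32 flops (Xeon Phi 512-bit FMA) and 2 flops (one FMA per CUDA core), and uses 60 cores for the Phi. These factors and all names in the code are assumptions, not figures from the slides.

```c
#include <stdio.h>

/* Device description; the flops_per_cycle values used below are
 * assumptions about the microarchitectures, not figures from the talk. */
struct device {
    const char *name;
    double cores;
    double clock_ghz;
    double flops_per_cycle;   /* SP flops per core per cycle */
    double bandwidth_gbs;     /* memory bandwidth from the table */
};

/* Theoretical SP peak in GFlop/s: cores x clock x flops per cycle. */
double peak_sp_gflops(const struct device *d)
{
    return d->cores * d->clock_ghz * d->flops_per_cycle;
}

/* Ridge point in flops/byte: the intensity above which the device is
 * compute-bound rather than memory-bound in the Roofline model. */
double ridge_point(const struct device *d)
{
    return peak_sp_gflops(d) / d->bandwidth_gbs;
}

/* Print peak and ridge point for the three devices in the table. */
void print_devices(void)
{
    const struct device devs[] = {
        { "Xeon E5-2670",      8.0, 2.6,   16.0,  51.2 },
        { "Xeon Phi 5110P",   60.0, 1.053, 32.0, 320.0 },
        { "Tesla K20m",     2496.0, 0.706,  2.0, 208.0 },
    };
    for (int i = 0; i < 3; ++i)
        printf("%-15s peak %7.1f GFlop/s, ridge %5.2f flops/byte\n",
               devs[i].name, peak_sp_gflops(&devs[i]),
               ridge_point(&devs[i]));
}
```

Under these assumptions the peaks come out at about 332.8, 2021.8 and 3524.4 GFlop/s, close to the 333, 2022 and 3520 GFlops listed, and the ridge points are roughly 6.5, 6.3 and 16.9 flops/byte.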
Anatomy of a GPU/Phi Node in π
Summary
• π, the supercomputer of SJTU, is a GPU-Phi hybrid cluster
• Directive-based programming approaches, such as OpenACC and OpenMP 4.0, have the potential to let a single source run on both GPU and Xeon Phi
Contact Information
Website http://hpc.sjtu.edu.cn
Blog http://ccoe.sjtu.edu.cn
Weibo http://e.weibo.com/sjtuhpc
Email [email protected]
http://icsc2014.sjtu.edu.cn/
May 7-9, 2014 @SJTU
Pre-conference GPU tutorial, May 5-6, 2014
• Parallel Programming tutorial, V.Morozov
• GPU and Beyond, L.Grinberg
• Hybrid/GPU Programming, TH Tang