Amrita Mathuriya (Intel): Optimizations of B-spline based SPO evaluations in QMC for Multi/many-core Shared Memory Processors


TRANSCRIPT

Pages 1-2:

Amrita Mathuriya
HPC Application Engineer, DCG/Intel Corporation
November 2016

Optimizations of B-spline based SPO evaluations in QMC for Multi/many-core Shared Memory Processors

In collaboration with Jeongnim Kim (Intel), Victor Lee (Intel), Ye Luo, Anouar Benali (Argonne National Laboratory), and Luke Shulenburger (Sandia National Laboratories).

Page 3:

Presenter: Amrita Mathuriya

§  HPC Application Engineer on the HPC Ecosystem Application Engineering Team, working on code modernization and optimization for Intel® Xeon® and Xeon Phi™ processors.

§  Working at Intel for the past 8 years.

–  Expert in algorithms and optimizations for IA architectures.

–  Worked on HPC applications in the areas of computational geometry, optical proximity correction (OPC), electromagnetics, computational biology, and quantum Monte Carlo.

–  Working on code modernization for Intel® Xeon® and Xeon Phi™ architectures.

§  MS in Computer Science with a specialization in Computational Science and Engineering from Georgia Tech, USA, under the guidance of Professor David Bader.

§  B.Tech. in Computer Science from the Indian Institute of Technology (IIT) Roorkee, India.

Page 4:

Systems

§  KNC: Intel® Xeon Phi™ coprocessor 7120P
•  61 cores @ 1.238 GHz, 4-way Intel® Hyper-Threading Technology, 15872 MB memory
•  Intel® Many-core Platform Software Stack version 3.6.1
•  OS version: 3.10.0-229.el7.x86_64

§  KNL: Intel® Xeon Phi™ 7250P (code-named Knights Landing), 68 cores @ 1.4 GHz with 16 GB MCDRAM; cluster mode = Quad, memory mode = Flat, Turbo enabled.

§  BDW: Intel® Xeon® E5-2697 v4 node, single socket, 18 cores with Hyper-Threading enabled @ 2.3 GHz, 145 W, with 128 GB DDR4-2400 RAM (8 x 16 GB DIMMs).

§  BG/Q: Blue Gene/Q processor from the Mira supercomputer at the Argonne National Laboratory facility.

§  Compilers, MPI, and math libraries:
•  icc version 16.0.2 (gcc version 4.8.3 compatibility)
•  Intel® MPI Library for Linux* OS, version 5.1.3, build 20160120 (build id: 14053)

Page 5:

Agenda

§  KNL overview and motivation

§  Intro to quantum Monte Carlo and QMCPACK

§  Current status of QMCPACK

§  Analysis of CORAL graphite benchmark

§  Optimizations to B-spline based SPO evaluations for QMC

§  Summary


Page 6:

Page 7:

Important Characteristics of KNL

§  Increasing core count per node on both Intel® Xeon® and Xeon Phi™ processors.

§  Large SIMD units – AVX-512 operates on 16 single-precision floating-point values simultaneously.

§  Two-level cache hierarchy (L1/L2) and high memory bandwidth.

Page 8:

How to gain performance?

§  Scalability
–  Enable data sharing with hybrid parallelism using MPI + threading.
–  Design and implement scalable algorithms.

§  SIMD parallelism – adapt data layouts to enable efficient vectorization.

§  Efficiently utilize caches and memory bandwidth with tiling (cache blocking).

Page 9:

Roofline Performance Analysis on KNL

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

[Figure: VGH roofline performance model for N = 2048. Circles denote GFLOPS at the cache-aware arithmetic intensity (AI). The current AoS version achieves ~7% of the scalar-add peak GFLOPS, at an AI of 0.22 flops/byte.]

Page 10:

Performance Portable on Intel® Xeon®, Intel® Xeon Phi™, and BG/Q Processors

§  Optimizations for efficiently utilizing SIMD units and caches.
–  SoA data layout transformation.

–  Tiling or AoSoA data layout transformation.

–  Nested thread parallelization to reduce time-to-solution and memory usage.

§  Optimization work done on KNC.

§  Later ported to KNL – works out of the box.

§  Optimizations result in significant performance improvement on BG/Q.

Page 11:

QMCPACK
An open-source US-DOE flagship many-body ab initio quantum Monte Carlo (QMC) code for computing the electronic structure of atoms, molecules, and solids. http://qmcpack.org/

[Figure: parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform.]

[Figure: (a) DMC charge density of AB-stacked graphite and (b) the ball-and-stick rendering with the 4-carbon unit cell in blue.]

J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, “Hybrid algorithms in quantum monte carlo,” Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008

Page 12:

Diffusion Monte Carlo Schematics

Ensemble evolves according to:
•  Diffusion
•  Drift
•  Branching

[Figure: schematic of a DMC step. Old configurations random-walk to possible new configurations, and branching with weights (e.g., w = 0.8, 1.6, 2.4, 0.3) produces the new ensemble.]

Page 13:

How is QMCPACK parallelized?

QMCPACK utilizes OpenMP to optimize memory usage and to take advantage of the growing number of cores per SMP node.

§  Walkers within an MPI task are distributed among the cores of the CPU.

§  Large common data, such as the wave-function coefficients, is shared by all walkers (a sketch follows at the end of this page).

§  Frequency stops increasing.
§  Node count stops growing.
§  Nodes are getting more powerful but require applications to expose more concurrency.

The free lunch is over; on-node performance is challenging.
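As a rough illustration of this walker-level scheme, here is a minimal OpenMP sketch; the class and function names are invented for this example and are not QMCPACK's actual API. One read-only coefficient table per MPI task is shared by all threads, and each thread advances its own walkers.

```cpp
// Hedged sketch of walker-level parallelism within one MPI task
// (illustrative names, not QMCPACK's actual classes).
#include <cstddef>
#include <vector>

struct Walker {                            // minimal stand-in for a QMC walker
  void advance(const std::vector<float>& coefs) { (void)coefs; /* one MC step */ }
};

// One read-only B-spline coefficient table per MPI task is shared by all
// OpenMP threads; each thread advances its own subset of walkers against it.
void run_mc_block(const std::vector<float>& shared_coefs,
                  std::vector<Walker>& walkers, int steps) {
  #pragma omp parallel for
  for (std::ptrdiff_t w = 0; w < static_cast<std::ptrdiff_t>(walkers.size()); ++w)
    for (int s = 0; s < steps; ++s)
      walkers[w].advance(shared_coefs);    // all threads read the same table
}
```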

Page 14:

QMCPACK status

§  Excellent MPI & OpenMP parallel efficiency at the walker level

§  All in double precision except the 3D cubic B-spline.
–  Mixed precision implemented recently; speeds up by 1.2-1.5x.

§  SIMD efficiency low

§  Basically scalar performance, with a few exceptions:
–  B-spline – SSE/SSE2/QPX
–  Distance tables with QPX

§  Array-of-Structures (AoS) for D-dimensional N-particle attributes, e.g., R (N,3), gradients (N,3), Hessian matrices (N,9).

Pretty good, and we can even do better!

Page 15:

CORAL Benchmark – KNL Profiling

[Figure: pie chart of the CORAL benchmark profile on KNL. The run time splits among the Einspline, Distance Table, and Jastrow kernels and Others, with segment shares of 34%, 29%, 18%, and 18%.]

Workload: 4x4x1 AB-stacked graphite, 64 carbon atoms, 256 electrons.

The three compute kernels account for 80% of the run time in QMCPACK on KNL.

Page 16:

QMC: Single-particle orbital (SPO) representation with a B-spline basis set

One-dimensional cubic B-spline function (shown as an equation in the original slide).

Precomputed coefficients: a 4D read-only array stored in SoA format, P[nx][ny][nz][N], provided by DFT or HF computations using Quantum ESPRESSO.

Tensor products in each Cartesian direction give the representation of the 3D orbital; a sketch of this form is given below.
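The 1D spline and tensor-product equations appear only as images in the original slides. For reference, this is the standard tricubic B-spline form they describe; the notation (a_i, b_j, c_k for the per-direction cubic weights and (i0, j0, k0) for the grid cell containing the particle) is generic rather than copied from the deck:

```latex
\phi_n(x,y,z) \;=\; \sum_{i=0}^{3}\sum_{j=0}^{3}\sum_{k=0}^{3}
  a_i(x)\, b_j(y)\, c_k(z)\; P[i_0+i][\,j_0+j][\,k_0+k][\,n],
  \qquad n = 1,\dots,N
```

Each orbital therefore touches a 4x4x4 neighborhood of coefficients, i.e., 64 values per orbital, which is why later slides refer to reductions over 64N values.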

Page 17:

Simplified miniQMC

§  Only contains B-spline evaluation routines.

§  Mimics the computational and data access patterns of B-spline SPO evaluations in QMC.

[Figure: miniQMC driver – random position generation feeding the B-spline SPO evaluation kernels.]

Page 18:

Data Layout – Performance Considerations

Array-of-Structs (AoS)

§  Pros: Logical for expression of physical abstractions in 3D or higher dimensions.

Struct-of-Arrays (SoA)

§  Pros: Contiguous loads/stores for efficient vectorization.

Hybrid (AoSoA)

§  Pros: Potentially useful for increasing cache locality. Also supports efficient vectorization.

[Figure: memory layouts of particle coordinates – AoS interleaves x, y, z per particle; SoA keeps separate contiguous x, y, and z arrays; AoSoA stores small fixed-width x/y/z tiles.] A minimal code sketch of the three layouts follows.
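A minimal C++ sketch of the three layouts for N particle positions; the type names are illustrative, not QMCPACK's actual classes.

```cpp
// Hedged sketch (not QMCPACK's actual classes) of the three layouts for N
// particle positions discussed on this slide.
#include <array>
#include <cstddef>
#include <vector>

// AoS: one struct per particle; natural for physics code, but x/y/z loads are strided.
struct ParticleAoS { double x, y, z; };
using PositionsAoS = std::vector<ParticleAoS>;

// SoA: one contiguous array per component; unit-stride loads/stores for vectorization.
struct PositionsSoA {
  std::vector<double> x, y, z;
  explicit PositionsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// AoSoA: fixed-width SoA tiles; a tile's x/y/z stay together and cache-resident.
template <std::size_t TileSize>
struct PositionTileAoSoA { std::array<double, TileSize> x, y, z; };
template <std::size_t TileSize>
using PositionsAoSoA = std::vector<PositionTileAoSoA<TileSize>>;
```

The tile width in the AoSoA case is a tuning parameter; the later tiling slides sweep the analogous tile size for the spline arrays.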

Page 19:

Pseudocode - VGH: computes the value, gradient, and Hessian at a random (x, y, z)

[Figure: data access pattern of the read-only B-spline coefficients P at a random position (x, y, z), with j0 = floor(y/dy), etc.; the outermost x dimension is not shown.]

Strided access for the output arrays (a sketch of the access pattern follows below).
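The VGH pseudocode itself is an image in the deck. The following hedged reconstruction (illustrative names, value accumulation only) shows the access pattern described here: for each of the 4x4x4 coefficient blocks around the particle's grid cell, the read of P is unit-stride in the innermost spline index n, while in the AoS version the per-orbital value/gradient/Hessian outputs are interleaved and therefore stored with a stride.

```cpp
// Hedged sketch of the VGH coefficient access pattern (not QMCPACK's code).
// P is the read-only coefficient table laid out as P[nx][ny][nz][N];
// a, b, c hold the four cubic B-spline weights along x, y, z for this position.
#include <cstddef>

void evaluate_v(const float* P, int ny, int nz, int N,
                int i0, int j0, int k0,
                const float a[4], const float b[4], const float c[4],
                float* val /* N values */) {
  for (int n = 0; n < N; ++n) val[n] = 0.0f;
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      for (int k = 0; k < 4; ++k) {
        const float w = a[i] * b[j] * c[k];
        // Unit-stride read of N coefficients for this (i,j,k) block.
        const float* p =
            P + ((static_cast<std::size_t>(i0 + i) * ny + (j0 + j)) * nz + (k0 + k)) * N;
        for (int n = 0; n < N; ++n)   // vectorizable inner loop over splines
          val[n] += w * p[n];
      }
  // Gradient and Hessian accumulations follow the same pattern with the
  // derivative weights; in the AoS version their outputs are interleaved
  // per orbital, which is what makes the output stores strided.
}
```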

Page 20:

SoA transformation for output arrays

Output arrays in SoA (Structure of arrays) format

[Figure: output arrays stored as separate contiguous x, y, and z arrays (SoA).]

Page 21:

How to evaluate performance of QMC

§  Rate of Monte Carlo sample generation (throughput) per resource.

§  For the miniapp:
Throughput = (number of evaluations) / T
Evaluations = (number of walkers) x (number of iterations) x (number of splines)
T = time per call of a function (such as VGH)

§  Throughput represents the work done on a node.

§  Ideally, it should stay constant across problem sizes. (A small helper sketch of the metric follows below.)
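A small helper expressing this metric; the example numbers in the comment are hypothetical and only illustrate the bookkeeping.

```cpp
// Hedged helper for the throughput metric defined on this slide.
double throughput(double n_walkers, double n_iterations, double n_splines,
                  double t_per_call_seconds) {
  const double evaluations = n_walkers * n_iterations * n_splines;
  return evaluations / t_per_call_seconds;  // evaluations per unit of kernel time
}
// Example (hypothetical values): throughput(64, 100, 2048, 0.5);
// ideally this number stays roughly constant as the problem size N grows.
```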

Page 22:

VGH throughput with the AoS-to-SoA transformation (higher is better)

2x-4x performance improvement for small to medium problem sizes.

Page 23:

Pseudocode - VGH: computes the value, gradient, and Hessian at a random (x, y, z)

[Figure: data access pattern of the read-only B-spline coefficients P at a random position (x, y, z), with j0 = floor(y/dy), etc.; the outermost x dimension is not shown.]

Strided access for output arrays.

Page 24:

Why low performance for large N?

•  AoS-to-SoA improves SIMD efficiency.
•  But caches can be utilized better:
•  Reduction on the output arrays G and H of size N.
•  Streaming access in 4x4x4 blocks.
•  Pressure on resources with large N, e.g., the TLB.
•  How to keep the write data in L1/L2?
•  How to maximize LLC sharing?


Reduction of output arrays over 64N values

Page 25:

AoSoA Data Layout Transformation

[Figure: data access pattern of the read-only B-spline table, (a) current vs. (b) tiled, with tiled input and output arrays split along the innermost dimension.]

Efficient cache utilization by tiling both the input and output arrays along the innermost dimension; a code sketch of the tiled layout follows below.
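A hedged sketch of the AoSoA idea for the spline arrays (tile width and names are illustrative): each tile owns a slab of the coefficient table and its own output buffers, so both stay cache-resident while the 4x4x4 neighborhood is accumulated.

```cpp
// Hedged sketch of the AoSoA/tiled spline layout (illustrative, not QMCPACK code).
#include <vector>

constexpr int TILE = 128;   // tile width along the spline index; tuned per CPU

struct SplineTile {
  std::vector<float> P;     // [nx][ny][nz][TILE] slab of the coefficient table
  std::vector<float> val;   // TILE value outputs
  std::vector<float> grad;  // 3*TILE gradient outputs (SoA)
  std::vector<float> hess;  // 9*TILE Hessian outputs (SoA)
};

void evaluate_vgh_aosoa(std::vector<SplineTile>& tiles) {
  for (SplineTile& t : tiles) {
    // Same kernel structure as the SoA version, but the innermost loop runs
    // over TILE splines instead of all N, so the outputs and the touched
    // slice of P stay in L1/L2 while the 64 coefficient blocks are accumulated.
    (void)t;  // per-tile kernel body omitted for brevity
  }
}
```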

Page 26:

Performance gain with tiling/AoSoA (higher is better)

AoSoA helps achieve sustained throughput across problem sizes for all architectures.

VGH performance with the SoA-to-AoSoA transformation (tiling).

Page 27:

VGH throughput with tiling (higher is better)

Tiling improves performance for all three processors.

Performance of VGH at N = 2048 with respect to tile size:
§  BDW – peak at tile size 64: the tiled input array fits in the L3 cache.
§  KNC and KNL – peak at tile size 512: for tile sizes > 512, the output arrays fall out of the caches.

Page 28:

Hybrid OpenMP/MPI Parallelism in QMCPACK

§  Current parallelism over walkers (Nw).

§  Working set size in QMCPACK grows with number of walkers.

§  Parallelizing each walker update.
§  Specifically for Intel® Xeon Phi™, with its large number of cores/threads, this next level of parallelism becomes essential for strong scaling.

[Figure: parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform.]

J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, “Hybrid algorithms in quantum monte carlo,” Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008

Page 29:

Parallelism within a walker – nested threading

#pragma omp parallel

Strong scaling: independent execution of tiles on different threads.

•  Reduces the memory requirement and time to solution on a node by reducing the number of walkers per node.

•  miniQMC replaces OpenMP nested threading with manual assignment of work (sketched below).
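A hedged sketch of the scheme described above, using manual assignment of work within a flat OpenMP parallel region; the names and the round-robin split are illustrative, and the actual miniQMC implementation may differ in detail.

```cpp
// Hedged sketch of threading within a walker via manual work assignment
// (illustrative names, not the actual miniQMC code).
#include <cstddef>
#include <omp.h>
#include <vector>

struct TiledWalker {                       // minimal stand-in for one walker
  int num_tiles = 16;
  void evaluate_tile(int /*tile*/) { /* per-tile VGH work */ }
};

void advance(std::vector<TiledWalker>& walkers, int threads_per_walker) {
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    const std::size_t w = static_cast<std::size_t>(tid / threads_per_walker); // owning walker
    const int member   = tid % threads_per_walker;   // rank within that walker's team
    if (w < walkers.size())
      for (int t = member; t < walkers[w].num_tiles; t += threads_per_walker)
        walkers[w].evaluate_tile(t);       // tiles of one walker split round-robin over its team
  }
}
```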

Page 30:


Strong Scaling Results on KNL


Reduces time to solution by ~14x with 16 threads per walker

Speedup on KNL w.r.t. number of walkers per thread.

Page 31:


Strong Scaling Results on KNL


Reduces time to solution by ~14x with 16 threads per walker

Speedup on KNL w.r.t. number of walkers per thread.

Performance of VGH at N = 2048 with respect to tile size.

Page 32:

Roofline Performance Analysis on KNL

§  SoA data layout conversion
–  Increases the cache-aware AI from 0.22 to 0.32.
–  ~7% of the achievable peak.
–  1.5x speedup w.r.t. the AoS version.

[Figure: VGH roofline performance model for N = 2048. Circles denote GFLOPS at the cache-aware AI, and X marks the best performance (AoSoA) on DDR. Annotations: AI 0.22 → 0.32; SoA at ~7% of peak GFLOPS.]

Page 33:

Roofline Performance Analysis on KNL

§  AoSoA version increases cache reuse with the same AI.
–  Better cache utilization.
–  ~2.25x gain in performance.

[Figure: VGH roofline performance model for N = 2048, as above. Annotations: AI 0.22 → 0.32; AoSoA at ~11% of peak GFLOPS.]

Page 34:

Roofline Performance Analysis on KNL

§  The AoSoA version with MCDRAM is ~3.3x faster than with DDR.

[Figure: VGH roofline performance model for N = 2048, as above. Annotation: AoSoA with MCDRAM, 3.3x speedup.]

Page 35:

Roofline Performance Analysis on BDW

Performance improved to ~50% of peak GFLOPS with the AoSoA version.

[Figure: VGH roofline performance model for N = 2048 on BDW. Circles denote GFLOPS at the cache-aware AI. The SP vector FMA peak is 660 GFLOPS; AoSoA reaches ~50% of the achievable GFLOPS.]

Page 36:

Performance Summary

§  The improvements are portable to 4 types of CPUs, even from different vendors.

§  Significant speedups even on BG/Q.

Speedups on the VGH routine:

                                          BG/Q     BDW      KNC      KNL
SoA and basic                             1.9x     1.7x     2.6x     1.7x
AoSoA/tiling                              2.7x     3.7x     5.2x     2.3x
Strong scaling                            5.2x     6.4x     35.2x    33.1x
Threads per walker (optimal tile size)    2 (32)   2 (32)   8 (256)  16 (128)

Page 37:

Symmetric distance table computation – AoS-to-SoA transformation of particle positions

[Figure: speedup vs. problem size (higher is better), relative to the BDW baseline, for 256, 512, and 800 electrons. Series: KNL baseline (256 TH), BDW optimized (2 MPI / 36 TH), KNL optimized (256 TH). KNL is ~50x faster with the SoA data layout.]

•  KNL used in Quad/Cache mode for these experiments.
•  Here, TH = threads.
•  BDW has 2 sockets for these experiments.

Page 38:

Results

§  The array-of-structures (AoS) to structure-of-arrays (SoA) transformation helps achieve efficient vectorization.

§  Tiling for better memory access helps achieve approximately constant throughput across problem sizes.

§  Nested parallelism over the AoSoA objects on KNL reduces the time to solution by ~14x with 16 threads per walker.

§  Optimizations result in significant performance gain on all three distinct cache-coherent architectures.


Page 39:

Ways we increased the performance!

§  SIMD parallelism
–  SoA data layout adaptation.

§  Efficient cache utilization
–  Tiling/cache blocking.

§  Scalability
–  Next level of threading to reduce time to solution.
–  Takes advantage of the reduced working-set size.

Page 40:

Reference

Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, and Jeongnim Kim, "Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors," arXiv:1611.02665.

Page 41:

Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.  NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.  EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death.  SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice.  Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined".  Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.  The information here is subject to change without notice.  Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.  Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user

Intel, Look Inside, Xeon, Intel Xeon Phi, Pentium, Cilk, VTune and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2016 Intel Corporation.


Page 42:

Legal Disclaimers – Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Page 43:

Legal Disclaimers

§  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

§  Estimated Results Benchmark Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

§  Software Source Code Disclaimer: Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

§  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

§  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Page 44:

Thank you for your time

Amrita Mathuriya

[email protected]

www.intel.com/hpcdevcon

Page 45:

Backup

Page 46:

SymmetricDTD::moveonsphere – Code Snippet

§  For efficient auto-vectorization with the compiler:

§  Three separate arrays for X, Y, and Z instead of a single array of structs with (x, y, z) as data members.

§  A similar SoA (structure-of-arrays) data layout for the output array.

[Code listings (images in the original slide): the AoS and SoA variants of the routine; a hedged sketch follows below.]
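The AoS and SoA listings on this slide are images in the original deck. Below is a hedged, generic illustration of the layout change they describe; the displacement loop is a stand-in for the actual moveonsphere body, and all names are invented for this example.

```cpp
// Hedged illustration of the AoS-to-SoA transformation of particle positions
// (illustrative stand-in, not the actual SymmetricDTD::moveonsphere code).
#include <cstddef>
#include <vector>

// AoS: the compiler sees strided accesses through pos[i].x and often
// fails to auto-vectorize this loop efficiently.
struct Pos { double x, y, z; };
void displacements_aos(const std::vector<Pos>& pos, const Pos& origin,
                       std::vector<Pos>& out) {
  for (std::size_t i = 0; i < pos.size(); ++i) {
    out[i].x = pos[i].x - origin.x;
    out[i].y = pos[i].y - origin.y;
    out[i].z = pos[i].z - origin.z;
  }
}

// SoA: three separate arrays for X, Y, Z (and for the outputs) give
// unit-stride loads/stores the compiler can auto-vectorize.
void displacements_soa(const std::vector<double>& X, const std::vector<double>& Y,
                       const std::vector<double>& Z, double ox, double oy, double oz,
                       std::vector<double>& dX, std::vector<double>& dY,
                       std::vector<double>& dZ) {
  for (std::size_t i = 0; i < X.size(); ++i) {  // contiguous, auto-vectorizable
    dX[i] = X[i] - ox;
    dY[i] = Y[i] - oy;
    dZ[i] = Z[i] - oz;
  }
}
```

With the SoA form, the compiler sees three unit-stride streams per loop and can vectorize without gather/scatter instructions.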