Intel® Cluster Poisson Solver Library, a research project for heterogeneous clusters

Alexander Kalinkin, Ilya Krjukov, Intel Corporation

Introduction

• This research explores the Intel® Cluster Poisson Solver Library, which implements a direct method for solving a grid Laplace problem in a 3D parallelepiped domain on a cluster of Intel® Xeon® processors. The method is based on a novel approach to data decomposition and transfer, which improves performance on large-scale clusters.

• Elliptic boundary value problems with separable variables can be solved in a fast, direct manner. Problems of this type usually assume a single computational domain (a rectangle or a circle) and constant coefficients [1], [2]. Such solvers can be used to generate preconditioners for iterative solvers that handle far more complex problems. For example, high-accuracy models for atmospheric and oceanic flow simulation, such as those used in numerical weather simulation, can be solved iteratively using a constant-coefficient Helmholtz solver as a preconditioner. Because the preconditioner is applied at every iteration, the Helmholtz solver's performance is critical to the overall computation time of the iterative solver. On a cluster, the size of the initial grid and the data distribution determine the number of data transfers among the computing processes, as well as the amount of computation the Helmholtz solver performs; both can significantly affect its performance.

• This work studies the implementation of a Helmholtz solver on clusters using a 2D memory decomposition, with the objective of minimizing data-transfer and synchronization overhead. It continues a series of works on Helmholtz solvers for shared- and distributed-memory machines. Paper [3] compared the performance of a Poisson solver from the Intel® Math Kernel Library (Intel® MKL) [6] with the NETLIB* Fishpack solver, and also presented an implementation of the Intel® Cluster Poisson Solver Library. Paper [4] demonstrated the performance of the Intel® MKL Poisson Solvers with support for periodic boundary conditions.

Algorithm

The 3D Helmholtz problem is to find an approximate solution of the Helmholtz equation:

$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - \frac{\partial^2 u}{\partial z^2} + q\,u = f(x, y, z), \qquad q = \mathrm{const}$$

Problems in a parallelepiped domain with Neumann, Dirichlet, or periodic boundary conditions can be solved using the standard seven-point finite difference approximation on the mesh.
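On a uniform mesh with steps $h_x$, $h_y$, $h_z$, this approximation at an interior node $(i, j, k)$ takes the standard form (the index convention below is added here for reference):

$$\frac{2u_{i,j,k}-u_{i-1,j,k}-u_{i+1,j,k}}{h_x^2}+\frac{2u_{i,j,k}-u_{i,j-1,k}-u_{i,j+1,k}}{h_y^2}+\frac{2u_{i,j,k}-u_{i,j,k-1}-u_{i,j,k+1}}{h_z^2}+q\,u_{i,j,k}=f_{i,j,k}$$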

• At a mesh point (x_i, y_j, z_k), if the values of the right-hand side f(x, y, z) are given and the values of the appropriate boundary functions at the mesh points are known, then on a shared-memory computer the equation can be solved by a sequence of five steps. Each step works along one dimension of the data, applying either an FFT or an LU factorization of a tridiagonal matrix. On a distributed-memory cluster this algorithm still applies, but the problem of data distribution arises. Depending on how the mesh is distributed among the computing processes, the number of data transfers between these processes varies and has a significant impact on performance. To minimize the total number of data transfers, we propose the initial data distribution depicted in Figure 1:

transpose the mesh at the beginning of each step so that all processes can run in parallel on independent data. With this approach, the total number of data transfers is 4√(n_proc), where n_proc is the number of MPI processes. Compared to the algorithm in [3], where the total number of data transfers is 2·n_proc, this approach is more efficient when the number of MPI processes is large.
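To make the five-step structure concrete, below is a minimal shared-memory sketch. It assumes periodic boundary conditions in x and y (so a plain DFT diagonalizes the second difference) and homogeneous Dirichlet conditions in z; the function name, grid conventions, and use of NumPy/SciPy are illustrative and not the library's API.

```python
import numpy as np
from scipy.linalg import solve_banded

def helmholtz_direct(f, hx, hy, hz, q):
    """Sketch: seven-point Helmholtz solve via FFT, FFT, tridiagonal, iFFT, iFFT."""
    nx, ny, nz = f.shape
    # Steps 1-2: forward FFT along x, then along y (periodic BCs assumed).
    fh = np.fft.fft(np.fft.fft(f, axis=0), axis=1)
    # Eigenvalues of the periodic 1D second-difference operator.
    lx = (4.0 / hx**2) * np.sin(np.pi * np.arange(nx) / nx) ** 2
    ly = (4.0 / hy**2) * np.sin(np.pi * np.arange(ny) / ny) ** 2
    uh = np.empty_like(fh)
    off = -1.0 / hz**2
    # Step 3: one tridiagonal solve along z per (kx, ky) mode.
    # (Production code would batch these; the loop keeps the sketch readable.)
    for i in range(nx):
        for j in range(ny):
            ab = np.zeros((3, nz), dtype=complex)
            ab[0, 1:] = off                              # superdiagonal
            ab[1, :] = lx[i] + ly[j] + q + 2.0 / hz**2   # main diagonal
            ab[2, :-1] = off                             # subdiagonal
            uh[i, j, :] = solve_banded((1, 1), ab, fh[i, j, :])
    # Steps 4-5: inverse FFT along y, then along x.
    return np.fft.ifft(np.fft.ifft(uh, axis=1), axis=0).real
```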

Elements of the same color along the x-axis are stored on the same process, and they can be processed independently of the elements on other processes. The mesh is then transposed as shown in Figure 2: after the transposition, elements of the same color along the y-axis are stored on the same process, and they can again be processed independently. Following this scheme, we alternate local computation along one axis with a mesh transposition before each subsequent step.
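The transposition itself maps naturally onto MPI_Alltoall. Below is a hedged sketch for the simpler 1D (slab) distribution of a 2D array; the library's 2D decomposition performs the analogous exchange within rows or columns of the process grid. The sizes and names are illustrative, not the library's interface.

```python
# A toy global n x n array is row-distributed: rank r owns rows
# [r*b, (r+1)*b) with b = n // nproc. After the all-to-all exchange,
# rank r owns rows [r*b, (r+1)*b) of the TRANSPOSED array, so the
# next solver step can work on locally contiguous data.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nproc, rank = comm.Get_size(), comm.Get_rank()
n = 4 * nproc                       # illustrative size, divisible by nproc
b = n // nproc
# Fill the local slab with globally meaningful values: A[i, j] = i*n + j.
rows = np.arange(rank * b, (rank + 1) * b)
local = (rows[:, None] * n + np.arange(n)[None, :]).astype(np.float64)

# Cut the slab into nproc (b x b) blocks along the column axis; block p
# goes to rank p, which needs columns [p*b, (p+1)*b) for its output rows.
send = np.ascontiguousarray(local.reshape(b, nproc, b).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)

# Received block p holds A[p*b:(p+1)*b, rank*b:(rank+1)*b]; stacking the
# blocks gives the column strip A[:, rank*b:(rank+1)*b], and transposing
# it yields this rank's rows of A^T.
transposed = recv.reshape(n, b).T.copy()
assert np.allclose(transposed, np.arange(rank * b, (rank + 1) * b)[:, None]
                   + n * np.arange(n)[None, :])
```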

Experiments

All experiments have been performed on a cluster with InfiniBand* interconnect, consisting of 128 computational nodes, each containing two Intel® Xeon® E5-2670 processors and 64 GB of RAM. We used Intel® MKL version 11.0.1 [6] and Intel® MPI version 4.1.

For the first set of tests we chose a grid problem with 0.81×10⁹ unknowns (the "small" problem). The second ("medium") test has about 3×10⁹ unknowns, and the last ("large") test contains more than 45×10⁹ unknowns. The table below shows the time results for our algorithm as a function of the number of cores used in the computation. All results are measured in seconds.

References

1. A. A. Samarskii and E. S. Nikolaev, Methods of Solution of Grid Problems, Nauka, Moscow, 1978 (in Russian).

2. R. W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. Assoc. Comput. Mach., vol. 12, 1965, pp. 95-113.

3. A. Kalinkin, Y. M. Laevsky, S. V. Gololobov, 2D Fast Poisson Solver for High-Performance Computing, Parallel Computing Technologies, Lecture Notes in Computer Science, vol. 5698, 2009.

4. A. Kalinkin, A. Kuzmin, Intel® MKL Poisson Library for scalable and efficient solution of elliptic problems with separable variables, Collection of Works, International Scientific Conference Parallel Computing Technologies 2012, pp. 336-341.

5. PALM - A PArallelized LES Model, http://palm.muk.uni-hannover.de

6. Intel® Math Kernel Library, http://software.intel.com/en-us/intel-mkl

Cores                          64     128    256    512    1024   2048   4096
Small  (0.81×10⁹ unknowns)     2.87   1.56   0.907  0.627  X      X      X
Medium (3×10⁹ unknowns)        X      X      X      7.87   1.80   1.34   X
Large  (45×10⁹ unknowns)       X      X      X      X      X      X      4.13

Time in seconds; X marks configurations that were not measured.

• Performance scales almost linearly up to a certain number of processes for each problem size. For instance, the small problem runs 2.87/0.627 ≈ 4.6× faster on 512 cores than on 64, about 57% parallel efficiency for an 8× increase in cores (see the short check below).

• Larger problems can efficiently use a larger number of processes.
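As a quick sanity check of the scaling claim, the parallel efficiency of the small problem can be computed directly from the table (times copied from above):

```python
# Speedup and parallel efficiency of the "small" problem, relative to
# the 64-core measurement, using the times from the table above.
times = {64: 2.87, 128: 1.56, 256: 0.907, 512: 0.627}
base_cores = 64
for cores, t in sorted(times.items()):
    speedup = times[base_cores] / t
    efficiency = speedup / (cores / base_cores)
    # Prints efficiencies of 100%, 92%, 79%, and 57%, respectively.
    print(f"{cores:4d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```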

Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.0.5, Intel® Compiler 13.0; Hardware: Intel® Xeon® Processor E5-268 ; Benchmark Source: Intel Corporation.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to http://www.intel.com/content/www/us/en/benchmarks/resources-benchmark-limitations.html. Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: http://software.intel.com/en-ru/articles/optimization-notice/

*Other brands and names are the property of their respective owners.