Intel Cluster Poisson Solver Library
Intel® Cluster Poisson Solver Library, a research project for heterogeneous clusters
Alexander Kalinkin, Ilya Krjukov, Intel Corporation
Introduction
• This research explores the Intel® Cluster Poisson Solver Library, which implements a direct method for solving a grid Laplace problem in a 3D parallelepiped domain on a cluster of Intel® Xeon® processors. The method is based on a novel approach to data decomposition and transportation, which leads to performance improvement on large-scale clusters.
• Elliptic boundary value problems with separable variables can be solved in a fast and direct manner. Problems of this type usually presume a single computational domain (rectangle or circle) and constant coefficients [1], [2]. Solvers for them can be used to generate preconditioners for iterative solvers that handle far more complex problems. For example, high-accuracy models for atmospheric and oceanic flow simulation, such as those used in numerical weather simulations, can be solved iteratively using a Helmholtz solver with constant coefficients as a preconditioner. Because the preconditioner is applied in every iteration step, the performance of the Helmholtz solver is critical to the overall computation time of the iterative solver. On a cluster, the size of the initial grid and the data distribution determine the number of data transfers among computing processes, as well as the amount of computation needed by the Helmholtz solver. Both can significantly affect its performance.
• This work studies the implementation of a Helmholtz solver on clusters using 2D memory decomposition, with the objective of minimizing data transfer and synchronization overhead. It continues a series of works on Helmholtz solvers for shared and distributed memory machines. Paper [3] compared the performance of a Poisson solver from the Intel® Math Kernel Library (Intel® MKL) [6] with the NETLIB* Fishpack solver. It also presented an implementation of the Intel® Cluster Poisson Solver Library. Paper [4] demonstrated the performance of the Intel® MKL Poisson Solvers with support for periodic boundary conditions.
Algorithm
The 3D Helmholtz problem is to find an approximate solution of the Helmholtz equation:
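$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - \frac{\partial^2 u}{\partial z^2} + q u = f(x, y, z), \qquad q = \mathrm{const}$$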
Problems in a parallelepiped domain with Neumann, Dirichlet or periodic boundary conditions can be solved using the standard seven-point finite difference approximation on the mesh.
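For reference, the seven-point approximation of the equation at an interior mesh point can be written as below. The uniform mesh steps $h_x$, $h_y$, $h_z$ and the index notation $u_{i,j,k} \approx u(x_i, y_j, z_k)$, $f_{i,j,k} = f(x_i, y_j, z_k)$ are notational assumptions introduced for this illustration.

$$-\frac{u_{i-1,j,k} - 2u_{i,j,k} + u_{i+1,j,k}}{h_x^2} - \frac{u_{i,j-1,k} - 2u_{i,j,k} + u_{i,j+1,k}}{h_y^2} - \frac{u_{i,j,k-1} - 2u_{i,j,k} + u_{i,j,k+1}}{h_z^2} + q\,u_{i,j,k} = f_{i,j,k}$$

Each unknown is coupled only to its six nearest neighbors, one pair per coordinate direction, which is what allows the problem to be treated one dimension at a time.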
• If the values of the right-hand side f(x, y, z) at the mesh points (x_i, y_j, z_k) and the values of the appropriate boundary functions are known, then on a shared-memory computer the equation can be solved in a sequence of 5 steps. Each step works with one dimension of the data, applying either an FFT or an LU decomposition of a tridiagonal matrix. On a distributed-memory cluster this algorithm still applies, but the problem of data distribution arises. Depending on how the mesh is distributed among the computing processes, the number of data transfers between these processes varies and has a significant impact on performance. To minimize the total number of data transfers, we propose the following initial data distribution, as depicted in Figure 1:
Elements of the same color along the x-axis are stored on the same process. They can be processed independently of the elements on other processes. Then the mesh is transposed as shown in Figure 2. After the transposition, elements of the same color along the y-axis are stored on the same process, and they can be processed independently. Following this scheme, we transpose the mesh at the beginning of each step so that all processes can run in parallel on independent data. With this approach, the total number of data transfers is $4 \times \sqrt{nproc}$, where $nproc$ is the number of MPI processes. Compared to the algorithm in [3], where the total number of data transfers is $2 \times nproc$, this approach is more efficient when the number of MPI processes is large.
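The five-step sequence can be sketched on a single process as follows. This is a minimal illustration under stated assumptions, not the library's implementation: it assumes homogeneous Dirichlet conditions on all faces (so the FFT step becomes a type-I sine transform), uniform mesh steps, and NumPy as the only dependency. The names `dst1`, `eigenvalues` and `solve_helmholtz_dirichlet` are hypothetical. On a cluster, the mesh would be transposed before each step so that every process holds complete lines along the active axis; that communication is omitted here.

```python
import numpy as np


def dst1(a, axis):
    """Orthonormal type-I discrete sine transform along `axis`, via an FFT.

    With this scaling the transform is its own inverse.
    """
    a = np.moveaxis(a, axis, -1)
    n = a.shape[-1]
    zeros = np.zeros(a.shape[:-1] + (1,))
    # Odd extension of length 2(n+1); its FFT is purely imaginary.
    odd = np.concatenate([zeros, a, zeros, -a[..., ::-1]], axis=-1)
    spectrum = np.fft.fft(odd, axis=-1)
    out = -spectrum[..., 1:n + 1].imag * np.sqrt(0.5 / (n + 1))
    return np.moveaxis(out, -1, axis)


def eigenvalues(n, h):
    """Eigenvalues of the 1D Dirichlet operator (-u[i-1] + 2u[i] - u[i+1]) / h^2."""
    m = np.arange(1, n + 1)
    return (2.0 / h * np.sin(np.pi * m / (2 * (n + 1)))) ** 2


def solve_helmholtz_dirichlet(f, q, hx, hy, hz):
    """Solve the seven-point discretization of -Laplace(u) + q*u = f.

    `f` holds right-hand-side values at interior mesh points, shape (nx, ny, nz);
    u = 0 is assumed on the whole boundary (an illustrative choice).
    """
    nx, ny, nz = f.shape

    # Steps 1-2: sine transforms along x, then along y.
    g = dst1(dst1(f, axis=0), axis=1)

    # Step 3: one tridiagonal system along z per (x, y) mode (Thomas algorithm,
    # i.e. LU decomposition of a tridiagonal matrix without pivoting).
    diag = (eigenvalues(nx, hx)[:, None] + eigenvalues(ny, hy)[None, :]
            + q + 2.0 / hz**2)
    off = -1.0 / hz**2
    c = np.empty_like(g)  # modified super-diagonal
    d = np.empty_like(g)  # modified right-hand side
    c[:, :, 0] = off / diag
    d[:, :, 0] = g[:, :, 0] / diag
    for k in range(1, nz):
        denom = diag - off * c[:, :, k - 1]
        c[:, :, k] = off / denom
        d[:, :, k] = (g[:, :, k] - off * d[:, :, k - 1]) / denom
    w = np.empty_like(g)
    w[:, :, -1] = d[:, :, -1]
    for k in range(nz - 2, -1, -1):
        w[:, :, k] = d[:, :, k] - c[:, :, k] * w[:, :, k + 1]

    # Steps 4-5: inverse sine transforms along y, then along x.
    return dst1(dst1(w, axis=1), axis=0)
```

A quick check is to apply the seven-point operator to a known mesh function, pass the result as `f`, and confirm that the solver returns the original function to rounding error.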
Experiments
All experiments were performed on a cluster with InfiniBand* interconnect consisting of 128 computational nodes, each with two Intel® Xeon® E5-2670 processors and 64 GB of RAM. We used Intel® MKL version 11.0.1 [6] and Intel® MPI version 4.1.
For the first set of tests we chose a grid problem with 0.81×10^9 unknowns (the "small" problem). The second ("medium") test has about 3×10^9 unknowns and, finally, the last ("large") test contains more than 45×10^9 unknowns. The table below shows the run times of our algorithm as a function of the number of cores used in the computation. All results are measured in seconds.
References
1. A.A.Samarskii and E.S.Nikolaev, Methods of Solution of Grid Problems, Nauka, Moscow, (1978) (in Russian).
2. R. W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. Assoc. Comput. Mach., vol. 12, 1965, pp. 95-113.
3. A. Kalinkin, Y.M. Laevsky, S.V. Gololobov, 2D Fast Poisson Solver for High-Performance Computing, Parallel Computing Technologies, Lecture Notes in Computer Science, Vol. 5698, 2009.
4. A. Kalinkin, A. Kuzmin, Intel® MKL Poisson Library for scalable and efficient solution of elliptic problems with separable variables, Collection of Works International Scientific Conference Parallel Computing Technologies 2012, pp. 336-341.
5. PALM - A PArallelized LES Model http://palm.muk.uni-hannover.de
6. Intel® Math Kernel Library http://software.intel.com/en-us/intel-mkl
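| Problem \ Cores | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|---|---|
| Small (0.81×10^9 unknowns) | 2.87 | 1.56 | 0.907 | 0.627 | X | X | X |
| Medium (3×10^9 unknowns) | X | X | X | 7.87 | 1.80 | 1.34 | X |
| Large (45×10^9 unknowns) | X | X | X | X | X | X | 4.13 |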
• Performance scales almost linearly up to a certain number of processes for each problem size.
• Larger problems can efficiently use a larger number of processes.
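The scaling behavior can be quantified from the reported timings. The short sketch below computes speedup and parallel efficiency for the "small" problem relative to its 64-core run; the helper name `parallel_efficiency` and the choice of the 64-core run as the baseline are assumptions made for illustration.

```python
# Timings in seconds for the "small" problem, copied from the table above.
cores = [64, 128, 256, 512]
seconds = [2.87, 1.56, 0.907, 0.627]


def parallel_efficiency(cores, seconds):
    """Return (cores, speedup, efficiency) relative to the first run."""
    base_cores, base_time = cores[0], seconds[0]
    rows = []
    for p, t in zip(cores, seconds):
        speedup = base_time / t
        efficiency = speedup / (p / base_cores)
        rows.append((p, speedup, efficiency))
    return rows


for p, s, e in parallel_efficiency(cores, seconds):
    print(f"{p:5d} cores: speedup {s:.2f}x, efficiency {e:.0%}")
```

With these numbers, efficiency relative to 64 cores is about 92% at 128 cores and falls to about 57% at 512 cores, which is consistent with near-linear scaling only up to a certain process count for a fixed problem size.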
Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.0.5, Intel® Compiler 13.0, Hardware: Intel® Xeon® Processor E5-268 ; Benchmark Source: Intel Corporation.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to http://www.intel.com/content/www/us/en/benchmarks/resources-benchmark-limitations.html
Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: http://software.intel.com/en-ru/articles/optimization-notice/
*Other brands and names are the property of their respective owners.