A High Performance Computing Course Guided by the LU Factorization

Gregorio Bernabé, Javier Cuenca, Domingo Giménez, Luis P. García and Sergio Rivas

Universidad de Murcia/Universidad Politécnica de Cartagena
Scientific Computing and Parallel Programming Group

International Conference on Computational Science
June 10-12, 2014

Bernabé et al. (SCPPG) HPC course guided by the LU WTCS, June 10-12, 2014 1 / 24

Outline

1 General organization of the course

2 The LU factorization

3 Development of the course

4 Evaluating Teaching

1 General organization of the course

Course description

Parallel Programming and High Performance Computing

Master in New Technologies in Computer Science

Specialization of High Performance Architectures and Supercomputing

Small class ⇒ high level students, interested in the subject

Initiation to research ⇒ techniques for the Master’s Thesis

Guided by the LU factorization

Syllabus

Parallel programming environments: OpenMP, MPI, CUDA

Matrix computation: sequential algorithms, algorithms by blocks, out-of-core algorithms, parallel algorithms

Numerical libraries: BLAS, LAPACK, MKL, PLASMA, MAGMA, ScaLAPACK

Proposed problem

LU factorization of large matrices in today’s heterogeneous computational systems

Students use the LU factorization to develop their own implementations based on:

Efficient use of optimized libraries

Use of different parallel programming paradigms

Out-of-core techniques for large matrices

Combination of the different approaches for clusters of multicore+GPU

2 The LU factorization

LU factorization by blocks

A basic version of the LU factorization by blocks is explained to the students as the starting point for their work. This version is based on four steps:

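\[
\begin{pmatrix}
A_{00} & A_{01} & A_{02} \\
A_{10} & A_{11} & A_{12} \\
A_{20} & A_{21} & A_{22}
\end{pmatrix}
=
\begin{pmatrix}
L_{00} &        &        \\
L_{10} & L_{11} &        \\
L_{20} & L_{21} & L_{22}
\end{pmatrix}
\begin{pmatrix}
U_{00} & U_{01} & U_{02} \\
       & U_{11} & U_{12} \\
       &        & U_{22}
\end{pmatrix}
\]

Step 1: $A_{00} = L_{00} U_{00}$ (unblocked LU factorization of the diagonal block)

Step 2: $A_{0i} = L_{00} U_{0i}$ for $i > 0$ (multiple lower triangular systems, solved for $U_{0i}$)

Step 3: $A_{i0} = L_{i0} U_{00}$ for $i > 0$ (multiple upper triangular systems, solved for $L_{i0}$)

Step 4: $A_{ij} \leftarrow A_{ij} - L_{i0} U_{0j}$ for $i, j > 0$ (update of the south-east blocks)

The four steps are then repeated on the updated south-east submatrix.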
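The four steps can be sketched in plain Python as follows. This is only an illustrative sketch, not the code used in the course: it assumes a square matrix stored as a list of lists, it does no pivoting (so it is only safe for matrices such as diagonally dominant ones), and the names `lu_blocked` and `rebuild_product` are hypothetical. The factors overwrite the input, with the unit diagonal of L left implicit.

```python
def lu_blocked(a, b):
    """In-place blocked LU without pivoting; b is the block size."""
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # the diagonal block covers rows/columns k..e-1
        # Step 1: unblocked LU of the diagonal block.
        for p in range(k, e):
            for i in range(p + 1, e):
                a[i][p] /= a[p][p]
                for j in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 2: lower triangular solves, giving the blocks of U to the right.
        for j in range(e, n):
            for p in range(k, e):
                for i in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 3: upper triangular solves, giving the blocks of L below.
        for i in range(e, n):
            for p in range(k, e):
                a[i][p] /= a[p][p]
                for j in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 4: update of the south-east blocks.
        for i in range(e, n):
            for j in range(e, n):
                s = 0.0
                for p in range(k, e):
                    s += a[i][p] * a[p][j]
                a[i][j] -= s
    return a


def rebuild_product(lu):
    """Multiply the packed factors back together (L has a unit diagonal)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]
                s += l_ip * lu[p][j]
            out[i][j] = s
    return out
```

Calling `lu_blocked` on a copy of a matrix and passing the result to `rebuild_product` should reproduce the original matrix up to rounding. With `b = 1` the routine reduces to the version without blocks.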

Implementations

Different implementations based on the structure by blocks:

Shared-memory: assignment of the work on the blocks to different threads; use of multithreaded libraries

Message-passing: distribution of blocks to the processes; communication of the blocks needed for local computation

GPU: use of libraries for GPU; assignment of blocks to CPU and GPU

Out-of-core: blocks stored in secondary memory and brought to main memory for computation

Heterogeneous systems: balanced assignment of blocks to the computational components
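As an illustration of the last item, one simple way to balance the blocks is to give each computational component a share proportional to its relative speed. The sketch below is an assumption made for illustration, not a method taken from the slides: the function name `balanced_block_counts` and the speed values are hypothetical, and in practice the speeds would have to be measured.

```python
def balanced_block_counts(num_blocks, speeds):
    """Split num_blocks among components in proportion to their speeds.

    speeds holds hypothetical relative speeds (for example 1 for a CPU
    socket and 3 for a GPU). Blocks left over after rounding down go to
    the components with the largest fractional share.
    """
    total = sum(speeds)
    shares = [num_blocks * s / total for s in speeds]
    counts = [int(x) for x in shares]
    leftover = num_blocks - sum(counts)
    by_fraction = sorted(range(len(speeds)),
                         key=lambda i: shares[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

This is a static distribution. Since the south-east submatrix shrinks at every step of the factorization, a real implementation would probably need to recompute the assignment as the factorization advances.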

3 Development of the course

Organization and methodology

Students with different knowledge: from different universities, degrees and specializations

and with different interests: students from companies; the course taken as an optional subject; HPC used in their Master’s Thesis; Master’s Thesis on HPC

⇒ Problem-based learning, which favors autonomous work and individual supervision.

Initial sessions

Presentation: presents the course, its organization, the problem to work with and the tasks to be done by the students

OpenMP and MPI: two sessions are organized outside the general course timetable for students without knowledge of parallel programming

Matrix algorithms

Basic concepts of sparse and dense linear algebra routines.

Column-major and row-major storage schemes, concept of leading dimension.

Algorithms by blocks.

Basic routines.

LU factorization, versions without blocks and by blocks.

Precision issues.
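For the storage schemes and the leading dimension mentioned above, a small sketch may help. It assumes a matrix kept in a flat buffer, as BLAS and LAPACK routines expect. The helper names are hypothetical. The point is that a submatrix can be addressed inside the same buffer by changing only the starting offset while keeping the leading dimension of the full matrix.

```python
def col_major_index(i, j, ld):
    """Position of entry (i, j) in a column-major buffer; ld = stored rows."""
    return i + j * ld


def row_major_index(i, j, ld):
    """Position of entry (i, j) in a row-major buffer; ld = stored columns."""
    return i * ld + j


def submatrix_entry(buf, offset, ld, i, j):
    """Entry (i, j) of a submatrix that starts at offset in a column-major
    buffer and shares the leading dimension ld of the enclosing matrix."""
    return buf[offset + col_major_index(i, j, ld)]


# A 4-by-3 matrix with entry (i, j) equal to 10*i + j, stored by columns.
rows, cols = 4, 3
buf = [0] * (rows * cols)
for i in range(rows):
    for j in range(cols):
        buf[col_major_index(i, j, rows)] = 10 * i + j

# The submatrix whose top-left corner is entry (1, 1) of the full matrix.
offset = col_major_index(1, 1, rows)
corner = submatrix_entry(buf, offset, rows, 0, 0)
```

This is why algorithms by blocks can work on blocks of a larger matrix without copying them: each block is described by a pointer (here an offset) and the leading dimension.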

Numerical libraries

General structure of numerical libraries

Centered on dense linear algebra libraries:

Basic routines: structure of BLAS; multithreaded implementations (MKL, GotoBLAS, ATLAS); auto-tuning (ATLAS)

Higher level routines: structure of LAPACK; multithreaded implementations (MKL); alternative approaches (PLAPACK); recent optimization efforts for multicore (PLASMA)

Practical on basic algorithms and multithreaded libraries

Compare the execution time of versions of the LU:

Sequential without and with blocks

Blocks with matrix multiplication from different basic libraries (MKL, GotoBLAS and ATLAS)

Direct calls to LU in MKL and PLASMA

Figure: speed-up of different versions of the LU factorization with respect to the sequential implementation, on a NUMA system with 4 hexa-cores.

GPU

Basic concepts of GPU programming with CUDA

In the second semester there is a course on Advanced Programming of Multicore Architectures

No implementations of LU for GPU

Use of linear algebra libraries for GPU (CULA, CUBLAS, MAGMA)

Load balancing CPU-GPU

Cost of data transfers

Shared-memory algorithms

OpenMP versions reusing the ideas from block algorithms

Multilevel parallelism: two-level OpenMP routines; OpenMP + multithreaded libraries; different numbers of threads at BLAS level and higher level in MKL routines

In the practical, study of the optimal number of OpenMP threads and library threads.

Figure: comparison of the execution time of different OpenMP+MKL versions.
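The idea of assigning block work to threads can be sketched with Python's standard thread pool instead of OpenMP. This is only an illustration of the task structure under that substitution: the function names are hypothetical, and because of Python's global interpreter lock no speed-up should be expected from this pure Python version. Each task updates a different south-east block (step 4 of the block algorithm), so the tasks write to disjoint entries and need no locks.

```python
from concurrent.futures import ThreadPoolExecutor


def update_block(a, i0, j0, k0, b):
    """Block (i0, j0) -= block (i0, k0) * block (k0, j0), all of size b."""
    n = len(a)
    for i in range(i0, min(i0 + b, n)):
        for j in range(j0, min(j0 + b, n)):
            s = 0.0
            for p in range(k0, min(k0 + b, n)):
                s += a[i][p] * a[p][j]
            a[i][j] -= s


def parallel_trailing_update(a, k0, b, num_threads):
    """Run step 4 for the panel starting at k0, one task per block."""
    n = len(a)
    starts = range(k0 + b, n, b)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(update_block, a, i0, j0, k0, b)
                   for i0 in starts for j0 in starts]
        for f in futures:
            f.result()  # re-raises any exception from a worker thread
```

The `num_threads` argument plays the role of the number of OpenMP threads studied in the practical. In the two-level setting, each task would in turn call a multithreaded library routine with its own number of threads.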

Out-of-core algorithms

Scientific problems with large memory requirements

Out-of-core linear algebra libraries

Input/output libraries

Algorithms for out-of-core LU factorization

In the practical, out-of-core implementations and their combination with OpenMP.

Figure: comparison of the execution time of different out-of-core versions.
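A minimal sketch of the out-of-core idea is given below: the blocks live in files and only the blocks needed for one update are brought to main memory. It is an assumption made for illustration, not one of the versions compared in the figure. The class `BlockStore`, the file naming and the one-file-per-block layout are hypothetical choices, and only the south-east update (step 4) of the block algorithm is shown.

```python
import os
import tempfile
from array import array


class BlockStore:
    """Keeps b-by-b blocks of doubles on disk, one file per block."""

    def __init__(self, directory, b):
        self.directory = directory
        self.b = b

    def _path(self, i, j):
        return os.path.join(self.directory, f"block_{i}_{j}.bin")

    def store(self, i, j, block):
        flat = array("d", (x for row in block for x in row))
        with open(self._path(i, j), "wb") as f:
            flat.tofile(f)

    def load(self, i, j):
        b = self.b
        flat = array("d")
        with open(self._path(i, j), "rb") as f:
            flat.fromfile(f, b * b)
        return [list(flat[r * b:(r + 1) * b]) for r in range(b)]


def update_trailing_blocks(store, nb):
    """Block (i, j) -= block (i, 0) * block (0, j) for i, j >= 1,
    holding at most three blocks in main memory at any time."""
    b = store.b
    for i in range(1, nb):
        left = store.load(i, 0)       # plays the role of L_i0
        for j in range(1, nb):
            top = store.load(0, j)    # plays the role of U_0j
            target = store.load(i, j)
            for r in range(b):
                for c in range(b):
                    target[r][c] -= sum(left[r][p] * top[p][c]
                                        for p in range(b))
            store.store(i, j, target)


def demo():
    """Update a 2-by-2 grid of 2-by-2 blocks kept in a temporary directory."""
    with tempfile.TemporaryDirectory() as directory:
        store = BlockStore(directory, 2)
        store.store(0, 1, [[1.0, 2.0], [3.0, 4.0]])
        store.store(1, 0, [[1.0, 0.0], [0.0, 1.0]])
        store.store(1, 1, [[10.0, 10.0], [10.0, 10.0]])
        update_trailing_blocks(store, 2)
        return store.load(1, 1)
```

The block size determines how much main memory is needed and how many disk accesses are made, which is one of the parameters that can be explored when comparing out-of-core versions.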

Message-passing algorithms

Some basic ideas for the development of CPU+GPU and message-passing versions of the LU are discussed

Message-passing linear algebra routines

Libraries for distributed systems (ScaLAPACK)

Distributed memory LU factorization
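For the distribution of blocks to processes, ScaLAPACK uses a two-dimensional block-cyclic layout over a grid of processes. The slides do not say which distribution the students use, so the sketch below only illustrates that layout. The function names are hypothetical and no MPI calls are made: it only computes which process owns which block.

```python
def block_owner(i, j, p_rows, p_cols):
    """Grid coordinates of the process owning block (i, j)
    in a 2D block-cyclic layout over a p_rows-by-p_cols grid."""
    return (i % p_rows, j % p_cols)


def local_blocks(proc_row, proc_col, nb, p_rows, p_cols):
    """Blocks of an nb-by-nb block matrix stored by one process."""
    return [(i, j)
            for i in range(proc_row, nb, p_rows)
            for j in range(proc_col, nb, p_cols)]
```

With this layout every process keeps blocks from all parts of the matrix, so the processes still have work to do as the factorization advances and the active south-east submatrix shrinks.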

In the practical, combination of the paradigms studied with MPI to implement LU for large matrices in a heterogeneous cluster with 52 cores and 10 GPUs:

One quad-core + 1 GeForce GPU with 112 cores.

One NUMA with 4 hexa-cores + 1 Kepler GPU with 2048 cores.

Two hexa-cores, each with 1 GeForce GPU with 512 cores.

One node with 2 hexa-cores + 4 GeForce GPUs with 512 cores each + 2 Tesla GPUs with 448 cores each.

There are many possible combinations. The students decide which to explore, depending on their interest and the possible application to their work for the Master’s Thesis.

Final sessions

There are additional sessions:

Research in the SCPP group, so that students can identify some directions to apply in their Master’s Thesis:

Research on optimization and auto-tuning of parallel linear algebra routines.

Linear algebra routines for more recent systems (Kepler, MIC).

Application of efficient linear algebra routines to large scientific problems (molecule simulation, electromagnetism...).

Other scientific applications of HPC (parallel metaheuristics, statistical models...).

Control sessions: to discuss the approaches and problems of the students when working on the practicals and to guide their work.

4 Evaluating Teaching

Test to evaluate whether the teaching objectives were fulfilled

Conclusions

The problem-based learning approach has proved positive for a small group of students with different knowledge and interests.

The autonomous work has contributed to the understanding of the different issues and to learning how to tackle practical aspects.

The students are allowed to center some of the practicals on the issues they are more interested in.

This allowed us to connect the course with their work for the Master’s Thesis.

The experience is positive, but in successive courses we will try to center the course more on the subject of the Master’s Thesis. This is difficult because not all the students have decided the subject of their Thesis at the beginning of the course, and because not all the subjects can be easily related to High Performance Computing.
