High Performance Linear Algebra with Intel Xeon Phi


Page 1: High Performance Linear Algebra with Intel Xeon Phi

High Performance Linear Algebra with Intel Xeon Phi Coprocessors

University of Tennessee, Knoxville

Stan Tomov, Azzam Haidar, Khairul Kabir, J. Dongarra, M. Gates, Y. Jia, P. Luszczek

ORNL Seminar, TN, September 9, 2013

Page 2: High Performance Linear Algebra with Intel Xeon Phi

MAGMA: LAPACK for hybrid systems

•  MAGMA
   •  A new generation of high-performance linear algebra libraries
   •  Provides LAPACK/ScaLAPACK functionality on hybrid architectures
   •  http://icl.cs.utk.edu/magma/

•  MAGMA MIC 1.0
   •  For hybrid, shared-memory systems featuring Intel Xeon Phi coprocessors
   •  Includes the main dense linear algebra (DLA) algorithms
      •  one- and two-sided factorizations and solvers

•  Open source software (http://icl.cs.utk.edu/magma)

•  MAGMA developers & collaborators
   •  UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
   •  Community effort, similar to LAPACK/ScaLAPACK

Page 3: High Performance Linear Algebra with Intel Xeon Phi

Key aspects of MAGMA

Page 4: High Performance Linear Algebra with Intel Xeon Phi

A New Generation of DLA Software

Software/algorithms follow hardware evolution in time:

•  LINPACK (70's): vector operations; relies on Level-1 BLAS operations
•  LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
•  ScaLAPACK (90's): distributed memory; relies on PBLAS and message passing
•  PLASMA (00's): new, many-core-friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
•  MAGMA: hybrid algorithms (heterogeneity friendly); rely on a hybrid scheduler and hybrid kernels

Page 5: High Performance Linear Algebra with Intel Xeon Phi

MAGMA Libraries & Software Stack

[Figure: MAGMA software stack spanning single, multi, and distributed systems, across CPU, GPU/coprocessor, and hybrid layers]
•  MAGMA 1.4, MAGMA MIC 1.0, clMAGMA 1.0 (hybrid LAPACK & 2-stage eigensolvers)
•  MAGMA 1.4 hybrid tile algorithms through MORSE (StarPU run-time system, PLASMA / QUARK)
•  MAGMA SPARSE (kernels, data formats, iterative & direct solvers)
•  Hybrid LAPACK/ScaLAPACK & tile algorithms
•  Built on BLAS, LAPACK, MAGMA BLAS (panels, auxiliary), and CUDA / OpenCL / MIC
•  Support: Linux, Windows, Mac OS X; C/C++, Fortran; Matlab, Python

Page 6: High Performance Linear Algebra with Intel Xeon Phi

MAGMA Functionality

•  80+ hybrid algorithms have been developed (320+ routines in total)
•  Every algorithm is provided in 4 precisions (s/c/d/z)
•  There are 3 mixed-precision algorithms (zc & ds), sketched below
•  These are hybrid algorithms, expressed in terms of BLAS
•  MAGMA MIC provides support for Intel Xeon Phi coprocessors
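The mixed-precision (zc/ds) routines exploit the fact that the O(n^3) factorization can be done in the lower precision, with full accuracy recovered by a few O(n^2) iterative-refinement steps. A minimal sketch of that idea, assuming hypothetical helpers (lu_factor_single, lu_solve_single, residual_double) rather than the actual MAGMA routines:

/* Sketch of the mixed-precision idea behind the zc/ds routines:
 * factor once in single precision (fast), then recover double-precision
 * accuracy with a few cheap iterative-refinement steps.
 * lu_factor_single, lu_solve_single, residual_double are hypothetical placeholders. */
#include <stdlib.h>

void lu_factor_single(int n, float *A, int *ipiv);                   /* O(n^3), single precision */
void lu_solve_single(int n, const float *LU, const int *ipiv, float *rhs);
void residual_double(int n, const double *A, const double *x,
                     const double *b, double *r);                    /* r = b - A*x, in double   */

void mixed_precision_solve(int n, const double *A, const double *b,
                           double *x, int refine_steps)
{
    float  *As   = malloc(sizeof(*As) * (size_t)n * n);
    float  *work = malloc(sizeof(*work) * n);
    double *r    = malloc(sizeof(*r) * n);
    int    *ipiv = malloc(sizeof(*ipiv) * n);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];   /* demote A to single */
    lu_factor_single(n, As, ipiv);                         /* expensive part, done in single */

    for (int i = 0; i < n; i++) work[i] = (float)b[i];
    lu_solve_single(n, As, ipiv, work);                    /* initial solution */
    for (int i = 0; i < n; i++) x[i] = work[i];

    for (int k = 0; k < refine_steps; k++) {               /* each step is only O(n^2) */
        residual_double(n, A, x, b, r);                    /* residual computed in double */
        for (int i = 0; i < n; i++) work[i] = (float)r[i];
        lu_solve_single(n, As, ipiv, work);                /* correction from the single LU */
        for (int i = 0; i < n; i++) x[i] += work[i];
    }
    free(As); free(work); free(r); free(ipiv);
}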

Page 7: High Performance Linear Algebra with Intel Xeon Phi

Methodology overview

•  MAGMA MIC uses a hybridization methodology based on
   •  Representing linear algebra algorithms as collections of tasks and data dependencies among them
   •  Properly scheduling the tasks' execution over multicore CPUs and manycore coprocessors

•  Successfully applied to fundamental linear algebra algorithms
   •  One- and two-sided factorizations and solvers
   •  Iterative linear solvers and eigensolvers

•  Productivity: 1) high level; 2) leverages prior developments; 3) exceeds the performance of homogeneous solutions

A methodology to use all available resources: hybrid CPU+MIC algorithms (small tasks for the multicore CPUs and large tasks for the MICs)

[Figure: tasks distributed across the CPU and multiple MICs]

Page 8: High Performance Linear Algebra with Intel Xeon Phi

Hybrid Algorithms

•  Hybridization
   •  Panels (Level 2 BLAS) are factored on the CPU using LAPACK
   •  Trailing matrix updates (Level 3 BLAS) are done on the MIC using "look-ahead" (see the sketch after the figure below)

One-Sided Factorizations (LU, QR, and Cholesky)

[Figure: matrix partitioning for a one-sided factorization, showing panel i, the next panel i+1, and trailing submatrix i]
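A minimal sketch of the look-ahead pattern described above, with hypothetical helper names standing in for the actual MAGMA routines: the panel that will be factored next is updated and sent to the CPU first, so the CPU factors panel i+1 while the MIC applies panel i to the rest of the trailing matrix.

/* Sketch of a hybrid one-sided factorization with look-ahead.
 * All helpers are hypothetical placeholders, not the MAGMA API. */
void factor_panel_cpu(int i);          /* LAPACK panel factorization (Level 2 BLAS)  */
void send_panel_to_mic(int i);         /* asynchronous host -> MIC copy of panel i   */
void get_panel_from_mic(int i);        /* asynchronous MIC -> host copy of panel i   */
void update_next_panel_on_mic(int i);  /* apply panel i to block column i+1 only     */
void update_rest_on_mic(int i);        /* apply panel i to the remaining trailing matrix (Level 3 BLAS) */
void wait_for_panel(int i);            /* wait for the copy of panel i to finish     */

void hybrid_one_sided(int npanels)
{
    factor_panel_cpu(0);               /* first panel on the CPU */
    send_panel_to_mic(0);

    for (int i = 0; i < npanels; i++) {
        if (i + 1 < npanels) {
            update_next_panel_on_mic(i);   /* look-ahead: update only the next panel first */
            get_panel_from_mic(i + 1);
        }
        update_rest_on_mic(i);             /* large, asynchronous Level 3 BLAS update on the MIC */
        if (i + 1 < npanels) {
            wait_for_panel(i + 1);
            factor_panel_cpu(i + 1);       /* overlapped with update_rest_on_mic(i) */
            send_panel_to_mic(i + 1);
        }
    }
}

The same structure applies to LU, QR, and Cholesky; only the panel and update kernels change.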

Page 9: High Performance Linear Algebra with Intel Xeon Phi

Programming LA on Hybrid Systems

•  Algorithms expressed in terms of BLAS
   •  Use vendor-optimized BLAS
•  Algorithms expressed in terms of tasks
   •  Use a scheduling/run-time system

Page 10: High Performance Linear Algebra with Intel Xeon Phi

Intel Xeon Phi specific considerations

•  Intel Xeon Phi coprocessors (vs. GPUs) are less dependent on the host [one can log in to the coprocessor, develop, and run programs in native mode]

•  There is OpenCL support (as of the Intel SDK XE 2013 Beta) facilitating the Intel Xeon Phi's use from the host

•  There is a pragma-based API, but it may be too high-level for high-performance numerical libraries

•  We used the Intel Xeon Phi's Low Level API (LLAPI) to develop the MAGMA API [allows us to uniformly handle hybrid systems]

Page 11: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC programming model

•  The Intel Xeon Phi acts as a coprocessor
•  On the Intel Xeon Phi, MAGMA runs a "server"
•  Communications are implemented using LLAPI

[Figure: the host's main() communicates over PCIe, via LLAPI, with a server() running on the Intel Xeon Phi]

Page 12: High Performance Linear Algebra with Intel Xeon Phi

A Hybrid Algorithm Example

•  Left-looking hybrid Cholesky factorization in MAGMA
•  The difference from LAPACK: the 4 additional lines shown in red on the slide's code listing
•  Line 8 (done on the CPU) is overlapped with work on the MIC (launched at line 6)
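The code listing on the slide did not survive transcription, so the following is only a sketch of a left-looking hybrid Cholesky in the same spirit, using hypothetical helper names rather than the MAGMA/BLAS API. The small CPU factorization is issued after the large asynchronous update on the MIC, so the two overlap; in this sketch, the lines with no LAPACK counterpart are the copies and the wait.

/* Sketch of a left-looking hybrid Cholesky factorization (lower triangular).
 * Hypothetical helpers, not the MAGMA API; j indexes block columns of width nb. */
void dsyrk_on_mic(int j);            /* A(j,j) -= A(j,0:j-1) * A(j,0:j-1)^T        (on the MIC) */
void dgemm_on_mic(int j);            /* A(j+1:,j) -= A(j+1:,0:j-1) * A(j,0:j-1)^T  (on the MIC) */
void copy_diag_to_cpu_async(int j);  /* asynchronous MIC -> CPU copy of the diagonal block */
void copy_diag_to_mic_async(int j);  /* asynchronous CPU -> MIC copy of the factored block */
void wait_copy(int j);               /* wait for the asynchronous copy of block j          */
void dpotrf_on_cpu(int j);           /* LAPACK Cholesky of the nb x nb diagonal block      */
void dtrsm_on_mic(int j);            /* triangular solve for the blocks below A(j,j)       */

void hybrid_cholesky(int nblocks)
{
    for (int j = 0; j < nblocks; j++) {
        dsyrk_on_mic(j);               /* update the diagonal block with previous columns */
        copy_diag_to_cpu_async(j);     /* extra vs. LAPACK                                */
        dgemm_on_mic(j);               /* large Level 3 BLAS update of the blocks below   */
        wait_copy(j);                  /* extra vs. LAPACK                                */
        dpotrf_on_cpu(j);              /* small CPU factorization, overlapped with dgemm_on_mic(j) */
        copy_diag_to_mic_async(j);     /* extra vs. LAPACK                                */
        dtrsm_on_mic(j);               /* triangular solves on the MIC                    */
    }
}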

Page 13: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC programming model

[Figure: the host program goes through the Intel Xeon Phi interface layer to send asynchronous requests to the MIC; the requests are queued and executed on the MIC]
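To make the queued, asynchronous request model concrete, here is a purely hypothetical sketch of the coprocessor-side server loop; none of these names are MAGMA or LLAPI identifiers, and the transport (recv_request, send_completion) is left abstract.

/* Hypothetical sketch of the coprocessor-side "server" loop.
 * Requests arrive over the host<->MIC link, are queued, and executed in order. */
#include <stddef.h>

typedef enum { REQ_COPY_TO_MIC, REQ_COPY_TO_HOST, REQ_RUN_KERNEL, REQ_SHUTDOWN } req_kind;

typedef struct {
    req_kind kind;
    size_t   arg_bytes;
    char     args[256];           /* marshalled arguments (sizes, buffer handles, ...) */
} request;

int  recv_request(request *r);            /* blocking receive from the host (abstract transport) */
void send_completion(const request *r);   /* notify the host that the request finished           */
void execute_request(const request *r);   /* dispatch to BLAS/LAPACK kernels on the MIC           */

void server_main(void)
{
    request r;
    for (;;) {
        if (!recv_request(&r))    /* host closed the connection */
            break;
        if (r.kind == REQ_SHUTDOWN)
            break;
        execute_request(&r);      /* requests are executed in the order received */
        send_completion(&r);      /* lets the host synchronize when it needs to  */
    }
}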

Page 14: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (QR)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEQRF and CPU_DGEQRF]

Page 15: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Cholesky)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DPOTRF and CPU_DPOTRF]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 16: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (LU)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGETRF and CPU_DGETRF]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 17: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Two-Sided Factorizations

[Figure: CPU/GPU data flow in the two-sided panel factorization (labels: Work, dWork, dY, dV, A_j): 1. copy the panel dP to the CPU; 2. copy v to the GPU; 3. copy y to the CPU; 4. copy the result to the CPU]

•  Panels are also hybrid, using both the CPU and the MIC (vs. just the CPU as in the one-sided factorizations)

•  Fast Level 2 BLAS is needed; we use the MIC's high bandwidth (see the sketch below)
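A rough sketch of one column of such a hybrid panel, with hypothetical helper names (not the MAGMA API): the small, latency-sensitive reflector generation stays on the CPU, while the bandwidth-bound Level 2 BLAS product y = A*v runs on the MIC.

/* Sketch of a hybrid two-sided panel factorization (one panel of width nb).
 * Hypothetical helpers, not the MAGMA API; V and Y accumulate the block reflector. */
void generate_reflector_on_cpu(int i);   /* Householder vector v_i from column i (CPU)         */
void send_v_to_mic_async(int i);         /* copy v_i to the coprocessor                        */
void gemv_on_mic(int i);                 /* y_i = A * v_i: Level 2 BLAS, bandwidth bound (MIC) */
void get_y_from_mic(int i);              /* copy y_i back to the CPU and wait for it           */
void update_next_column_on_cpu(int i);   /* apply the accumulated (V, Y) to column i+1 of the panel */

void hybrid_two_sided_panel(int nb)
{
    for (int i = 0; i < nb; i++) {
        generate_reflector_on_cpu(i);
        send_v_to_mic_async(i);
        gemv_on_mic(i);                  /* the dominant cost: uses the MIC's memory bandwidth */
        get_y_from_mic(i);
        if (i + 1 < nb)
            update_next_column_on_cpu(i);
    }
    /* Afterwards the accumulated block reflector (V, Y) is applied to the trailing
       matrix, which is a Level 3 BLAS update as in the one-sided case. */
}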

Page 18: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Hessenberg)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEHRD and CPU_DGEHRD]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 19: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Bidiagonal)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEBRD and CPU_DGEBRD]

Page 20: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Tridiagonal)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DSYTRD and CPU_DSYTRD]

Page 21: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

[Chart: Gflop/s vs. matrix size (2k to 30k) for DSYTRD with 1, 2, and 3 GPUs and DSYTRD_MKL]

Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

Page 22: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

Characteristics
•  Stage 1: BLAS-3, increased computational intensity
•  Stage 2: BLAS-1.5, new cache-friendly kernel
•  4x/12x faster than the standard approach
•  Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation

Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

[Chart: Gflop/s vs. matrix size (2k to 30k) comparing DSYTRD_2stages with 1, 2, and 3 GPUs against standard DSYTRD with 1, 2, and 3 GPUs and DSYTRD_MKL. Acceleration with 3 GPUs: 15x vs. 12 Intel cores]

[Figure: sparsity plots of the matrix after the first stage (reduction to band form) and the second stage (reduction to tridiagonal form)]

Page 23: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

[Chart: Gflop/s vs. matrix size (1k to 26k) comparing the new two-stage MAGMA_DSYTRD_2stages_16SB algorithm (with MIC, Kepler, and Fermi) against standard MAGMA_DSYTRD_16SB (with Kepler and MIC) and MKL_DSYTRD_16SB]

Characteristics
•  Stage 1: BLAS-3, increased computational intensity
•  Stage 2: BLAS-1.5, new cache-friendly kernel
•  4x/12x faster than the standard approach
•  Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

[Figure: sparsity plots of the matrix after the first stage (band form) and the second stage (tridiagonal form)]

Page 24: High Performance Linear Algebra with Intel Xeon Phi

From Single to Multi-MIC Support

•  Data distribution
   •  1-D block-cyclic distribution (see the sketch after the figure below)

•  Algorithm
   •  The MIC holding the current panel sends it to the CPU
   •  All updates are done in parallel on the MICs
   •  Look-ahead is done by the MIC holding the next panel

[Figure: 1-D block-cyclic distribution of block columns of width nb over MIC 0, MIC 1, MIC 2, MIC 0, ...]
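A minimal sketch of the 1-D block-cyclic mapping referenced above, assuming hypothetical names (nb for the block-column width, num_mics for the number of coprocessors): block column j is owned by MIC j mod num_mics, so consecutive panels rotate over the devices, and the owner of panel j+1 is the one that performs the look-ahead.

#include <stdio.h>

/* Sketch of the 1-D block-cyclic distribution used for multi-MIC support.
 * nb is the block-column width, num_mics the number of coprocessors (hypothetical names). */
static int owner_of_column(int col, int nb, int num_mics)
{
    int block = col / nb;        /* which block column the matrix column falls into       */
    return block % num_mics;     /* block columns are dealt out cyclically to the MICs    */
}

int main(void)
{
    const int nb = 4, num_mics = 3, n = 24;
    for (int col = 0; col < n; col += nb)
        printf("block column %2d (cols %2d-%2d) -> MIC %d\n",
               col / nb, col, col + nb - 1, owner_of_column(col, nb, num_mics));
    /* The MIC that owns panel j+1 performs the look-ahead update for that panel. */
    return 0;
}

With nb = 4 and num_mics = 3 this prints block columns 0, 1, 2 going to MIC 0, 1, 2, and block column 3 wrapping back to MIC 0.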

Page 25: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1 MIC]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 26: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1 and 2 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 27: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, and 3 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 28: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 29: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: QR Factorization Performance in DP

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

[Chart: MAGMA DGEQRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Page 30: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: Cholesky Factorization Performance in DP

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

[Chart: MAGMA DPOTRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Page 31: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

•  Explore the use of lower-granularity tasks on the MIC [GPU tasks in general are large and data parallel]

•  Reduce synchronizations [fewer fork-join synchronizations]

•  Schedule less parallel tasks on the MIC [on GPUs, these tasks are typically offloaded to the CPUs] [to reduce CPU-MIC communication]

•  Develop more algorithms, porting the newest developments in linear algebra, while discovering further MIC-specific optimizations

Page 32: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

•  Synchronization-avoiding algorithms using dynamic runtime systems

[Figure: 48 cores running POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000 and the tile size is 200 x 200]

Page 33: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

Page 34: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution

for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}
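For illustration only, a small, purely hypothetical sketch of how such a runtime can discover the DAG from the inserted tasks: each task records which tiles it reads and writes, and a new task is ordered after the last writer of every tile it touches. The names below (insert_task, add_access, ...) are not StarPU or QUARK identifiers, and reader tracking (write-after-read hazards) is omitted for brevity.

/* Hypothetical sketch of dependency discovery in a dynamic runtime:
 * only read-after-write and write-after-write dependencies are derived here,
 * via the last writer of each tile. */
#include <stdio.h>

#define MAX_TILES 64

typedef enum { ACCESS_READ, ACCESS_WRITE } access_mode;

static int last_writer[MAX_TILES];   /* task id of the last task that wrote each tile */
static int next_task_id = 0;

/* Register one tile access of a new task and print the implied dependency. */
static void add_access(int task_id, int tile, access_mode mode)
{
    if (last_writer[tile] >= 0)
        printf("task %d depends on task %d (tile %d)\n",
               task_id, last_writer[tile], tile);
    if (mode == ACCESS_WRITE)
        last_writer[tile] = task_id;
}

static int insert_task(const char *name, int tile_in, int tile_out)
{
    int id = next_task_id++;
    printf("insert %-8s as task %d\n", name, id);
    add_access(id, tile_in,  ACCESS_READ);    /* read dependency  */
    add_access(id, tile_out, ACCESS_WRITE);   /* write dependency */
    return id;
}

int main(void)
{
    for (int t = 0; t < MAX_TILES; t++) last_writer[t] = -1;
    /* Tiny example mirroring the first tile-QR steps: tile 0 is A[k][k], tile 1 is A[k][n]. */
    insert_task("zgeqrt", 0, 0);   /* factors A[k][k] in place       */
    insert_task("zunmqr", 0, 1);   /* reads A[k][k], updates A[k][n] */
    return 0;
}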

Page 35: High Performance Linear Algebra with Intel Xeon Phi

Collaborators and Support

Contact: magma-[email protected], Innovative Computing Laboratory, University of Tennessee, Knoxville

MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France (StarPU team); KAUST, Saudi Arabia

Support: [sponsor logos]