High Performance Linear Algebra with Intel Xeon Phi


Page 1: High Performance Linear Algebra with Intel Xeon Phi

High Performance Linear Algebra with Intel Xeon Phi Coprocessors

University of Tennessee, Knoxville

Stan Tomov, Azzam Haidar, Khairul Kabir, J. Dongarra, M. Gates, Y. Jia, P. Luszczek

ORNL Seminar, TN, September 9, 2013

Page 2: High Performance Linear Algebra with Intel Xeon Phi

MAGMA: LAPACK for hybrid systems

•  MAGMA
   •  A new generation of high-performance linear algebra libraries
   •  Provides LAPACK/ScaLAPACK functionality on hybrid architectures
   •  http://icl.cs.utk.edu/magma/

•  MAGMA MIC 1.0
   •  For hybrid, shared-memory systems featuring Intel Xeon Phi coprocessors
   •  Includes the main dense linear algebra (DLA) algorithms
      •  one- and two-sided factorizations and solvers

•  Open source software (http://icl.cs.utk.edu/magma)

•  MAGMA developers & collaborators
   •  UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
   •  Community effort, similar to LAPACK/ScaLAPACK

Page 3: High Performance Linear Algebra with Intel Xeon Phi

Key aspects of MAGMA

Page 4: High Performance Linear Algebra with Intel Xeon Phi

A New Generation of DLA Software

Software/algorithms follow hardware evolution in time:

•  LINPACK (70's): vector operations; relies on Level-1 BLAS operations
•  LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
•  ScaLAPACK (90's): distributed memory; relies on PBLAS and message passing
•  PLASMA (00's): new, many-core-friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
•  MAGMA: hybrid algorithms (heterogeneity friendly); rely on a hybrid scheduler and hybrid kernels

Page 5: High Performance Linear Algebra with Intel Xeon Phi

MAGMA Libraries & Software Stack

[Figure: MAGMA software stack spanning single, multi, and distributed systems, across CPU, GPU/coprocessor, and hybrid layers]
•  MAGMA 1.4, MAGMA MIC 1.0, clMAGMA 1.0 (hybrid LAPACK & 2-stage eigensolvers)
•  MAGMA 1.4 hybrid tile algorithms through MORSE (StarPU run-time system, PLASMA / QUARK)
•  MAGMA SPARSE (kernels, data formats, iterative & direct solvers)
•  Hybrid LAPACK/ScaLAPACK & tile algorithms
•  Built on BLAS, LAPACK, MAGMA BLAS (panels, auxiliary), and CUDA / OpenCL / MIC
•  Support: Linux, Windows, Mac OS X; C/C++, Fortran; Matlab, Python

Page 6: High Performance Linear Algebra with Intel Xeon Phi

MAGMA Functionality

•  80+ hybrid algorithms have been developed (320+ routines in total)
•  Every algorithm is provided in 4 precisions (s/c/d/z)
•  There are 3 mixed-precision algorithms (zc & ds), sketched below
•  These are hybrid algorithms, expressed in terms of BLAS
•  MAGMA MIC provides support for Intel Xeon Phi coprocessors
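The mixed-precision (zc/ds) routines exploit the fact that the O(n^3) factorization can be done in the lower precision, with full accuracy recovered by a few O(n^2) iterative-refinement steps. A minimal sketch of that idea, assuming hypothetical helpers (lu_factor_single, lu_solve_single, residual_double) rather than the actual MAGMA routines:

/* Sketch of the mixed-precision idea behind the zc/ds routines:
 * factor once in single precision (fast), then recover double-precision
 * accuracy with a few cheap iterative-refinement steps.
 * lu_factor_single, lu_solve_single, residual_double are hypothetical placeholders. */
#include <stdlib.h>

void lu_factor_single(int n, float *A, int *ipiv);                   /* O(n^3), single precision */
void lu_solve_single(int n, const float *LU, const int *ipiv, float *rhs);
void residual_double(int n, const double *A, const double *x,
                     const double *b, double *r);                    /* r = b - A*x, in double   */

void mixed_precision_solve(int n, const double *A, const double *b,
                           double *x, int refine_steps)
{
    float  *As   = malloc(sizeof(*As) * (size_t)n * n);
    float  *work = malloc(sizeof(*work) * n);
    double *r    = malloc(sizeof(*r) * n);
    int    *ipiv = malloc(sizeof(*ipiv) * n);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];   /* demote A to single */
    lu_factor_single(n, As, ipiv);                         /* expensive part, done in single */

    for (int i = 0; i < n; i++) work[i] = (float)b[i];
    lu_solve_single(n, As, ipiv, work);                    /* initial solution */
    for (int i = 0; i < n; i++) x[i] = work[i];

    for (int k = 0; k < refine_steps; k++) {               /* each step is only O(n^2) */
        residual_double(n, A, x, b, r);                    /* residual computed in double */
        for (int i = 0; i < n; i++) work[i] = (float)r[i];
        lu_solve_single(n, As, ipiv, work);                /* correction from the single LU */
        for (int i = 0; i < n; i++) x[i] += work[i];
    }
    free(As); free(work); free(r); free(ipiv);
}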

Page 7: High Performance Linear Algebra with Intel Xeon Phi

Methodology overview

•  MAGMA MIC uses a hybridization methodology based on
   •  Representing linear algebra algorithms as collections of tasks and data dependencies among them
   •  Properly scheduling the tasks' execution over multicore CPUs and manycore coprocessors

•  Successfully applied to fundamental linear algebra algorithms
   •  One- and two-sided factorizations and solvers
   •  Iterative linear solvers and eigensolvers

•  Productivity: 1) high level; 2) leverages prior developments; 3) exceeds the performance of homogeneous solutions

A methodology to use all available resources: hybrid CPU+MIC algorithms (small tasks for the multicore CPUs and large tasks for the MICs)

[Figure: tasks distributed across the CPU and multiple MICs]

Page 8: High Performance Linear Algebra with Intel Xeon Phi

Hybrid Algorithms

•  Hybridization
   •  Panels (Level 2 BLAS) are factored on the CPU using LAPACK
   •  Trailing matrix updates (Level 3 BLAS) are done on the MIC using "look-ahead" (see the sketch after the figure below)

One-Sided Factorizations (LU, QR, and Cholesky)

[Figure: matrix partitioning for a one-sided factorization, showing panel i, the next panel i+1, and trailing submatrix i]
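A minimal sketch of the look-ahead pattern described above, with hypothetical helper names standing in for the actual MAGMA routines: the panel that will be factored next is updated and sent to the CPU first, so the CPU factors panel i+1 while the MIC applies panel i to the rest of the trailing matrix.

/* Sketch of a hybrid one-sided factorization with look-ahead.
 * All helpers are hypothetical placeholders, not the MAGMA API. */
void factor_panel_cpu(int i);          /* LAPACK panel factorization (Level 2 BLAS)  */
void send_panel_to_mic(int i);         /* asynchronous host -> MIC copy of panel i   */
void get_panel_from_mic(int i);        /* asynchronous MIC -> host copy of panel i   */
void update_next_panel_on_mic(int i);  /* apply panel i to block column i+1 only     */
void update_rest_on_mic(int i);        /* apply panel i to the remaining trailing matrix (Level 3 BLAS) */
void wait_for_panel(int i);            /* wait for the copy of panel i to finish     */

void hybrid_one_sided(int npanels)
{
    factor_panel_cpu(0);               /* first panel on the CPU */
    send_panel_to_mic(0);

    for (int i = 0; i < npanels; i++) {
        if (i + 1 < npanels) {
            update_next_panel_on_mic(i);   /* look-ahead: update only the next panel first */
            get_panel_from_mic(i + 1);
        }
        update_rest_on_mic(i);             /* large, asynchronous Level 3 BLAS update on the MIC */
        if (i + 1 < npanels) {
            wait_for_panel(i + 1);
            factor_panel_cpu(i + 1);       /* overlapped with update_rest_on_mic(i) */
            send_panel_to_mic(i + 1);
        }
    }
}

The same structure applies to LU, QR, and Cholesky; only the panel and update kernels change.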

Page 9: High Performance Linear Algebra with Intel Xeon Phi

Programming LA on Hybrid Systems

•  Algorithms expressed in terms of BLAS
   •  Use vendor-optimized BLAS
•  Algorithms expressed in terms of tasks
   •  Use a scheduling/run-time system

Page 10: High Performance Linear Algebra with Intel Xeon Phi

Intel Xeon Phi specific considerations

•  Intel Xeon Phi coprocessors (vs. GPUs) are less dependent on the host [one can log in to the coprocessor, develop, and run programs in native mode]

•  There is OpenCL support (as of the Intel SDK XE 2013 Beta) facilitating the Intel Xeon Phi's use from the host

•  There is a pragma-based API, but it may be too high-level for high-performance numerical libraries

•  We used the Intel Xeon Phi's Low Level API (LLAPI) to develop the MAGMA API [allows us to uniformly handle hybrid systems]

Page 11: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC programming model

•  The Intel Xeon Phi acts as a coprocessor
•  On the Intel Xeon Phi, MAGMA runs a "server"
•  Communications are implemented using LLAPI

[Figure: the host's main() communicates over PCIe, via LLAPI, with a server() running on the Intel Xeon Phi]

Page 12: High Performance Linear Algebra with Intel Xeon Phi

A Hybrid Algorithm Example

•  Left-looking hybrid Cholesky factorization in MAGMA
•  The difference from LAPACK: the 4 additional lines shown in red on the slide's code listing
•  Line 8 (done on the CPU) is overlapped with work on the MIC (launched at line 6)
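The code listing on the slide did not survive transcription, so the following is only a sketch of a left-looking hybrid Cholesky in the same spirit, using hypothetical helper names rather than the MAGMA/BLAS API. The small CPU factorization is issued after the large asynchronous update on the MIC, so the two overlap; in this sketch, the lines with no LAPACK counterpart are the copies and the wait.

/* Sketch of a left-looking hybrid Cholesky factorization (lower triangular).
 * Hypothetical helpers, not the MAGMA API; j indexes block columns of width nb. */
void dsyrk_on_mic(int j);            /* A(j,j) -= A(j,0:j-1) * A(j,0:j-1)^T        (on the MIC) */
void dgemm_on_mic(int j);            /* A(j+1:,j) -= A(j+1:,0:j-1) * A(j,0:j-1)^T  (on the MIC) */
void copy_diag_to_cpu_async(int j);  /* asynchronous MIC -> CPU copy of the diagonal block */
void copy_diag_to_mic_async(int j);  /* asynchronous CPU -> MIC copy of the factored block */
void wait_copy(int j);               /* wait for the asynchronous copy of block j          */
void dpotrf_on_cpu(int j);           /* LAPACK Cholesky of the nb x nb diagonal block      */
void dtrsm_on_mic(int j);            /* triangular solve for the blocks below A(j,j)       */

void hybrid_cholesky(int nblocks)
{
    for (int j = 0; j < nblocks; j++) {
        dsyrk_on_mic(j);               /* update the diagonal block with previous columns */
        copy_diag_to_cpu_async(j);     /* extra vs. LAPACK                                */
        dgemm_on_mic(j);               /* large Level 3 BLAS update of the blocks below   */
        wait_copy(j);                  /* extra vs. LAPACK                                */
        dpotrf_on_cpu(j);              /* small CPU factorization, overlapped with dgemm_on_mic(j) */
        copy_diag_to_mic_async(j);     /* extra vs. LAPACK                                */
        dtrsm_on_mic(j);               /* triangular solves on the MIC                    */
    }
}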

Page 13: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC programming model

[Figure: the host program goes through the Intel Xeon Phi interface layer to send asynchronous requests to the MIC; the requests are queued and executed on the MIC]
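To make the queued, asynchronous request model concrete, here is a purely hypothetical sketch of the coprocessor-side server loop; none of these names are MAGMA or LLAPI identifiers, and the transport (recv_request, send_completion) is left abstract.

/* Hypothetical sketch of the coprocessor-side "server" loop.
 * Requests arrive over the host<->MIC link, are queued, and executed in order. */
#include <stddef.h>

typedef enum { REQ_COPY_TO_MIC, REQ_COPY_TO_HOST, REQ_RUN_KERNEL, REQ_SHUTDOWN } req_kind;

typedef struct {
    req_kind kind;
    size_t   arg_bytes;
    char     args[256];           /* marshalled arguments (sizes, buffer handles, ...) */
} request;

int  recv_request(request *r);            /* blocking receive from the host (abstract transport) */
void send_completion(const request *r);   /* notify the host that the request finished           */
void execute_request(const request *r);   /* dispatch to BLAS/LAPACK kernels on the MIC           */

void server_main(void)
{
    request r;
    for (;;) {
        if (!recv_request(&r))    /* host closed the connection */
            break;
        if (r.kind == REQ_SHUTDOWN)
            break;
        execute_request(&r);      /* requests are executed in the order received */
        send_completion(&r);      /* lets the host synchronize when it needs to  */
    }
}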

Page 14: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (QR)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEQRF and CPU_DGEQRF]

Page 15: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Cholesky)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DPOTRF and CPU_DPOTRF]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 16: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (LU)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGETRF and CPU_DGETRF]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 17: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Two-Sided Factorizations

[Figure: CPU/GPU data flow in the two-sided panel factorization (labels: Work, dWork, dY, dV, A_j): 1. copy the panel dP to the CPU; 2. copy v to the GPU; 3. copy y to the CPU; 4. copy the result to the CPU]

•  Panels are also hybrid, using both the CPU and the MIC (vs. just the CPU as in the one-sided factorizations)

•  Fast Level 2 BLAS is needed; we use the MIC's high bandwidth (see the sketch below)
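A rough sketch of one column of such a hybrid panel, with hypothetical helper names (not the MAGMA API): the small, latency-sensitive reflector generation stays on the CPU, while the bandwidth-bound Level 2 BLAS product y = A*v runs on the MIC.

/* Sketch of a hybrid two-sided panel factorization (one panel of width nb).
 * Hypothetical helpers, not the MAGMA API; V and Y accumulate the block reflector. */
void generate_reflector_on_cpu(int i);   /* Householder vector v_i from column i (CPU)         */
void send_v_to_mic_async(int i);         /* copy v_i to the coprocessor                        */
void gemv_on_mic(int i);                 /* y_i = A * v_i: Level 2 BLAS, bandwidth bound (MIC) */
void get_y_from_mic(int i);              /* copy y_i back to the CPU and wait for it           */
void update_next_column_on_cpu(int i);   /* apply the accumulated (V, Y) to column i+1 of the panel */

void hybrid_two_sided_panel(int nb)
{
    for (int i = 0; i < nb; i++) {
        generate_reflector_on_cpu(i);
        send_v_to_mic_async(i);
        gemv_on_mic(i);                  /* the dominant cost: uses the MIC's memory bandwidth */
        get_y_from_mic(i);
        if (i + 1 < nb)
            update_next_column_on_cpu(i);
    }
    /* Afterwards the accumulated block reflector (V, Y) is applied to the trailing
       matrix, which is a Level 3 BLAS update as in the one-sided case. */
}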

Page 18: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Hessenberg)

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEHRD and CPU_DGEHRD]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

Page 19: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Bidiagonal)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DGEBRD and CPU_DGEBRD]

Page 20: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Performance (Tridiagonal)

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.23 GHz) DP Peak 1180 GFlop/s

System DP Peak 1512 GFlop/s MPSS 2.1.6720-15 compiler_xe_2013.4.183

[Chart: performance in GFlop/s vs. matrix size N x N (up to 25,000) for MAGMA_DSYTRD and CPU_DSYTRD]

Page 21: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

[Chart: Gflop/s vs. matrix size (2k to 30k) for DSYTRD with 1, 2, and 3 GPUs and DSYTRD_MKL]

Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

Page 22: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

Characteristics
•  Stage 1: BLAS-3, increased computational intensity
•  Stage 2: BLAS-1.5, new cache-friendly kernel
•  4x/12x faster than the standard approach
•  Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation

Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

[Chart: Gflop/s vs. matrix size (2k to 30k) comparing DSYTRD_2stages with 1, 2, and 3 GPUs against standard DSYTRD with 1, 2, and 3 GPUs and DSYTRD_MKL. Acceleration with 3 GPUs: 15x vs. 12 Intel cores]

[Figure: sparsity plots of the matrix after the first stage (reduction to band form) and the second stage (reduction to tridiagonal form)]

Page 23: High Performance Linear Algebra with Intel Xeon Phi

Toward a Fast Eigensolver

[Chart: Gflop/s vs. matrix size (1k to 26k) comparing the new two-stage MAGMA_DSYTRD_2stages_16SB algorithm (with MIC, Kepler, and Fermi) against standard MAGMA_DSYTRD_16SB (with Kepler and MIC) and MKL_DSYTRD_16SB]

Characteristics
•  Stage 1: BLAS-3, increased computational intensity
•  Stage 2: BLAS-1.5, new cache-friendly kernel
•  4x/12x faster than the standard approach
•  Bottleneck: if all eigenvectors are required, there is the extra cost of one back transformation

Flop/s formula: (4n^3/3) / time. Higher is faster.

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

[Figure: sparsity plots of the matrix after the first stage (band form) and the second stage (tridiagonal form)]

Page 24: High Performance Linear Algebra with Intel Xeon Phi

From Single to Multi-MIC Support

•  Data distribution
   •  1-D block-cyclic distribution (see the sketch after the figure below)

•  Algorithm
   •  The MIC holding the current panel sends it to the CPU
   •  All updates are done in parallel on the MICs
   •  Look-ahead is done by the MIC holding the next panel

[Figure: 1-D block-cyclic distribution of block columns of width nb over MIC 0, MIC 1, MIC 2, MIC 0, ...]
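A minimal sketch of the 1-D block-cyclic mapping referenced above, assuming hypothetical names (nb for the block-column width, num_mics for the number of coprocessors): block column j is owned by MIC j mod num_mics, so consecutive panels rotate over the devices, and the owner of panel j+1 is the one that performs the look-ahead.

#include <stdio.h>

/* Sketch of the 1-D block-cyclic distribution used for multi-MIC support.
 * nb is the block-column width, num_mics the number of coprocessors (hypothetical names). */
static int owner_of_column(int col, int nb, int num_mics)
{
    int block = col / nb;        /* which block column the matrix column falls into       */
    return block % num_mics;     /* block columns are dealt out cyclically to the MICs    */
}

int main(void)
{
    const int nb = 4, num_mics = 3, n = 24;
    for (int col = 0; col < n; col += nb)
        printf("block column %2d (cols %2d-%2d) -> MIC %d\n",
               col / nb, col, col + nb - 1, owner_of_column(col, nb, num_mics));
    /* The MIC that owns panel j+1 performs the look-ahead update for that panel. */
    return 0;
}

With nb = 4 and num_mics = 3 this prints block columns 0, 1, 2 going to MIC 0, 1, 2, and block column 3 wrapping back to MIC 0.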

Page 25: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1 MIC]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 26: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1 and 2 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 27: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, and 3 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 28: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: LU Factorization Performance in DP

[Chart: MAGMA DGETRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

Page 29: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: QR Factorization Performance in DP

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

[Chart: MAGMA DGEQRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Page 30: High Performance Linear Algebra with Intel Xeon Phi

MAGMA MIC Scalability: Cholesky Factorization Performance in DP

Host Sandy Bridge (2 x 8 @2.6 GHz) DP Peak 332 GFlop/s

Coprocessor Intel Xeon Phi ( 60 @ 1.09 GHz) DP Peak 1046 GFlop/s

System DP Peak 1378 GFlop/s MPSS 2.1.4346-16 compiler_xe_2013.1.117

[Chart: MAGMA DPOTRF performance (multiple cards) in GFlop/s vs. matrix size N x N (up to 40,000), using 1, 2, 3, and 4 MICs]

Page 31: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

•  Explore the use of lower-granularity tasks on the MIC [GPU tasks in general are large and data parallel]

•  Reduce synchronizations [fewer fork-join synchronizations]

•  Schedule less parallel tasks on the MIC [on GPUs, these tasks are typically offloaded to the CPUs] [to reduce CPU-MIC communication]

•  Develop more algorithms, porting the newest developments in linear algebra, while discovering further MIC-specific optimizations

Page 32: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

•  Synchronization-avoiding algorithms using dynamic runtime systems

[Figure: 48 cores running POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000 and the tile size is 200 x 200]

Page 33: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

Page 34: High Performance Linear Algebra with Intel Xeon Phi

Current and Future MIC-Specific Directions

High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution

for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}
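For illustration only, a small, purely hypothetical sketch of how such a runtime can discover the DAG from the inserted tasks: each task records which tiles it reads and writes, and a new task is ordered after the last writer of every tile it touches. The names below (insert_task, add_access, ...) are not StarPU or QUARK identifiers, and reader tracking (write-after-read hazards) is omitted for brevity.

/* Hypothetical sketch of dependency discovery in a dynamic runtime:
 * only read-after-write and write-after-write dependencies are derived here,
 * via the last writer of each tile. */
#include <stdio.h>

#define MAX_TILES 64

typedef enum { ACCESS_READ, ACCESS_WRITE } access_mode;

static int last_writer[MAX_TILES];   /* task id of the last task that wrote each tile */
static int next_task_id = 0;

/* Register one tile access of a new task and print the implied dependency. */
static void add_access(int task_id, int tile, access_mode mode)
{
    if (last_writer[tile] >= 0)
        printf("task %d depends on task %d (tile %d)\n",
               task_id, last_writer[tile], tile);
    if (mode == ACCESS_WRITE)
        last_writer[tile] = task_id;
}

static int insert_task(const char *name, int tile_in, int tile_out)
{
    int id = next_task_id++;
    printf("insert %-8s as task %d\n", name, id);
    add_access(id, tile_in,  ACCESS_READ);    /* read dependency  */
    add_access(id, tile_out, ACCESS_WRITE);   /* write dependency */
    return id;
}

int main(void)
{
    for (int t = 0; t < MAX_TILES; t++) last_writer[t] = -1;
    /* Tiny example mirroring the first tile-QR steps: tile 0 is A[k][k], tile 1 is A[k][n]. */
    insert_task("zgeqrt", 0, 0);   /* factors A[k][k] in place       */
    insert_task("zunmqr", 0, 1);   /* reads A[k][k], updates A[k][n] */
    return 0;
}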

Page 35: High Performance Linear Algebra with Intel Xeon Phi

Collaborators and Support

Contact: magma-[email protected], Innovative Computing Laboratory, University of Tennessee, Knoxville

MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France (StarPU team); KAUST, Saudi Arabia

Support: [sponsor logos]