universitÉ grenoble alpes ecole doctorale...

13
UNIVERSITÉ GRENOBLE ALPES Jun 16, 2015 Minh Quan HO DRAKKAR - LIG 1 Data optimization for linear algebra and stencil compact on Many-core processor Etudiant : Minh Quan HO Equipe : DRAKKAR Directeur de thèse : Bernard TOURANCHEAU Co-encadrant : Christian OBRECHT Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique Kalray MPPA-256

Upload: vuongdang

Post on 15-Sep-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

UNIVERSITÉ GRENOBLE ALPES

Jun 16, 2015 Minh Quan HODRAKKAR - LIG

1

Data optimization for linear algebra and stencil compact on Many-core processor

Etudiant : Minh Quan HOEquipe : DRAKKARDirecteur de thèse : Bernard TOURANCHEAUCo-encadrant : Christian OBRECHT

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

Kalray MPPA-256

Jun 16, 2015 2

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● Context● Applications● Work done this year● Conclusion and perspectives

Outline

Jun 16, 2015 3

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● Exascale (1018) by 2020 ● Power =< 20 MW● >= 50 Gflops/W

Context

Intel Xeon E5-2687W

Intel Xeon Phi co-processor

31S1P

Tesla K20X

Kalray MPPA-256

?

Gflops/W (FP64) 1.5 3.4 4.8 6.8 >= 50

Number of cores/node

8 57 2688 256 ?

Frequency (MHz) 3100 1100 732 400 ?

Jun 16, 2015 4

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

Applications

Linear algebra (BLAS)

Compact stencils PDE solvers(e.g Lattice Boltzmann Method)

● Vector & Matrix operations● Irregular access pattern

● Access to neighbor nodes● Non-contiguous memory access

● Cache coherence. How and Where ?● Hardware complexity & latency increases with # of cores● Incoherence-aware software design – Non-trivial

● Memory● Non-Uniform-Memory-Access (NUMA)● Difficult to do and maintain Global Memory + overhead● Bandwidth could hardly follow # of cores and clock

● Programming framework ?

X

Y

0 1 2 3 4

0

1

2

3

4

0

1

2

3

4

0

1

2

3

4

0

1

2

3

4

0

1

2

3

4

0 1 2 3 4

0 1 2 3 4

0 1 2 3 4

0 1 2 3 4

Jun 16, 2015 5

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

1. Many-core as Stand-alone node● MPI design and implementation● MPI optimizations comparison

2. Many-core as Accelerator● clBLAS● clMAGMA

3. Benchmark● High Performance Linkpack (HPL)

Work done

Jun 16, 2015 6

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● Issues : Isolated cores, limited memory ● Design : 16 + 1 MPI processes per Many-core

● MPI compute ranks : 16 x 2 MB on-chip memory● MPI I/O rank : 1 x 2 GB DDR

1. Many-core as Stand-alone node

MPI

MPI_NOC_API MPI implementation

DMA manager NoC barrier

send_post

recv_post

pending_isend

datatype

comm

pending_irecv

Callback Handler

Server (RM)

Local buffers

Control messages buffers

MPPAIPC

Network-on-Chip

Data

Control messages (outgoing)

Data

Control messages (incoming)

MPPAMPI components

Main thread (PE)

Jun 16, 2015 7

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● Optimizations : ● Eager send for short messages● Lazy send for medium messages

1. Many-core as Stand-alone node

Jun 16, 2015 8

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

CLBLAS

● Offloading computation from Host to Many-core● Static kernels for level 1-2● Dynamic code generation for level 3

● 2.6 Gflops (SGEMM) – 1.3 Gflops (DGEMM)

2. Many-core as Accelerator

A, B,...

C

do_GEMM(A, B, C,...); do_something(){

.

.

.}get_GEMM(C);

Jun 16, 2015 9

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

CLMAGMA

● LAPACK-based interface (QR, LU, CHOL … )● Auxiliary kernels written in OpenCL● Use clBLAS to offload computation on device(s)

2. Many-core as Accelerator

Jun 16, 2015 10

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

CLBLAS / CLMAGMA issues

● Block decomposition too small compared to page-size● Concurrent accesses to same memory block● Kernels written for GPU, non-optimal for Many-core

2. Many-core as Accelerator

Jun 16, 2015 11

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● HPL benchmark with 16 MPI ranks, BLAS / OpenBLAS ● Limited local memory space (16 x 2 MB)● Matrix size 1500 x 1500● MPI Eager send gains 10% performance

3. Benchmark

Jun 16, 2015 12

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

● Many-core as Stand-alone node● MPI mono-Many-core, BLAS/OpenBLAS and HPL● Next :

● MPI + DSM to increase memory space via I/O DDR + optimization modeling

● MPI multi-Many-core for scalability● Many-core as Accelerator

● clBLAS + clMAGMA● Next :

● Block decomposing tuning for Many-core● {Page|Cache}-size-aware kernels● LBM design & accelerating

[1] M. Q. Ho, B. Tourancheau, C. Obrecht et al., “MPI communication on MPPA Many-core NoC: design, modeling and performance issues”, PARCO, Edinburgh, Scotland, UK, 1-4 September 2015.

[2] Kalray Inc., “ManycoreLabs report”, vol. 6, pp. 39-41, 2015

Conclusion

Jun 16, 2015 13

UNIVERSITÉ GRENOBLE ALPES

Minh Quan HODRAKKAR - LIG

Ecole Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique

Question ?