tensor computations: e ciency or productivity?hpac.cs.umu.se/~pauldj/talks/siam-cse19.pdftensor...

Tensor Computations: Efficiency Or Productivity?

Paolo Bientinesi – Umea Universitet & RWTH Aachen UniversityRasmus Bro – University of Copenhagen

Edoardo Di Napoli – Julich Supercomputing Centre

Lars Karlsson – Umea Universitet

March 1, 2019SIAM Conference on Computational Science and Engineering

Spokane, Washington, USA

Part I

The HPC perspective

Disclaimer: Talk about awareness, not results

2 / 21

Typical HPC “workflow” – not limited to tensor computations

I 1. Target problemI Fixed problem size?I Target architecture? (parallel paradigm)

I 2. Identify building blocks, bottlenecks, hotspots, critical paths, . . .

I 3. Algorithmic & code optimizationsObjectives: speedups, scalability, portability, . . .

I : High exploitation of platform’s potential, time savings, green computing(publications, “more science per hour”, . . . )

I : However, ... often limited integration into actual scientific codes(little funding, not researchy/publishable effort, no credits, . . . )

⇒ Caveat: Gains in the building blocks... often lost at the higher levels

3 / 21

Examples (1/2) Genome-Wide Association Studies (GWAS)

I What is it? Data correlation analysis. 2D grid of generalized least squares problems (GLS)

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

4 / 21

Examples (1/2) Genome-Wide Association Studies (GWAS)

I How is it related to tensors? 1D×1D cartesian product of GLSs, 2D output = 4D data

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

4 / 21

Computing Petaflops over Terabytes of Data:The Case of Genome-Wide Association Studies.

ACM TOMS, 2014

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

I Algorithmic improvement: lower computational complexity

I HPC optimizations: asynch I/O, overlap, BLAS-3, parallelism, . . .

I 100x – 1000x speedups! Library available: OmicABEL

I But...I Interface: C vs. RI Data management: data formats, overwriting, multiple filesI Data manipulation: imputation, filtering, selectionI Workflow not as fixed as first understood: M vs no-M, . . .I “The pre-processing is slower than the analysis”

⇒ Performance is important, but not as much as we like to think

5 / 21

ACM TOMS, 2014

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

5 / 21

ACM TOMS, 2014

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

I But...

I Interface: C vs. RI Data management: data formats, overwriting, multiple filesI Data manipulation: imputation, filtering, selectionI Workflow not as fixed as first understood: M vs no-M, . . .I “The pre-processing is slower than the analysis”

5 / 21

ACM TOMS, 2014

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

5 / 21

ACM TOMS, 2014

t Ph-s

bγ,α

Mα Xγ

Mα yα

bγ,η

Mη Xγ

Mη yη bµ,η

Mη Xµ

Mη yη

5 / 21

Examples (2/2) High-Performance Tensor Kernels

Paul Springer

I Tensor Transpositions

I Summations

I Tensor Contractions

6 / 21

Examples (2/2) High-Performance Tensor Kernels

Paul Springer

Bi1i2...iN ← α · Aπ(i1i2...iN) + β · Bi1i2...iN

I Summations — linear summation over tensor transpositions

Bi0i1i2 ← 2Ai0i1i2 −Ai2i1i0 −Ai0i2i1

Bi0i1i2 ← 4Ai0i1i2 − 2Ai1i0i2 − 2Ai2i1i0 +Ai1i2i0 − 2Ai0i2i1 +Ai2i0i1

Bi0i1i2i3 ← 2Ai0i1i2i3 −Ai2i1i0i3 −Ai0i2i1i3 −Ai0i1i3i2

CπC(Im∪In) ← α · AπA(Im∪Ik ) × BπB(In∪Ik ) + β · CπC(Im∪In)

6 / 21

Examples (2/2) High-Performance Tensor KernelsPaul Springer

TTC: A high-performance Compiler for Tensor Transpositions. ACM TOMS, 2017

Compiler: https://github.com/HPAC/TTC Library: https://github.com/HPAC/hptt

I Summations — linear summation over tensor transpositions

Spin Summations: A High-Performance Perspective. ACM TOMS, 2019

Generator: https://github.com/springer13/spin-summations

Design of a high-performance GEMM-like Tensor-Tensor Multiplication. ACM TOMS, 2018

Compiler: https://github.com/HPAC/tccg Library: https://github.com/springer13/tcl

6 / 21

But...

I “Wrong” level of abstraction for domain scientists

I Kernels: good for developers – too low level for most end users

I Mismatch → mapping problem

7 / 21

But...

7 / 21

But...

7 / 21

But...

7 / 21

2D case: “Right” level of abstraction

Generalized Least Squares b := (XTM−1X )−1XTM−1y n > m; M ∈ Rn×n, SPD; X ∈ Rn×m; y ∈ Rn×1

Signal Processing x :=(A−TBTBA−1 + RTLR

)−1A−TBTBA−1y

Kalman Filter Kk := PbkH

T (HPbkH

T + R)−1; xak := xbk + Kk (zk − Hxbk ); Pak := (I − KKH)Pb

Ensemble Kalman Filter X a := X b +(B−1 + HTR−1H

)−1 (Y − HX b

)Image Restoration xk := (HTH + λσ2In)−1(HT y + λσ2(vk−1 − uk−1))

Rand. Matrix Inversion Xk+1 := S(STAS)−1ST + (In − S(STAS)−1STA)Xk (In − AS(STAS)−1ST )

Stochastic Newton Bk := kk−1

Bk−1(In − ATWk ((k − 1)Il + WTk ABk−1A

TWk )−1WTk ABk−1)

Optimization xf := WAT (AWAT )−1(b − Ax); xo := W (AT (AWAT )−1Ax − c)

Tikhonov Regularization x := (ATA + ΓT Γ)−1ATb A ∈ Rn×m; Γ ∈ Rm×m; b ∈ Rn×1

Gen. Tikhonov Reg. x := (ATPA + Q)−1(ATPb + Qx0) P ∈ Rn×n, SSPD; Q ∈ Rm×m, SSPD; x0 ∈ Rm×1

LMMSE estimator Kt+1 := CtAT (ACtAT + Cz )−1; xt+1 := xt + Kt+1(y − Axt); Ct+1 := (I − Kt+1A)Ct

8 / 21

x := A(BTB + ATRTΛRA)−1BTBA−1y

{C† := PCPT + Q

K := C†HT (HC†H

T )−1

E := Q−1U(I + UTQ−1U)−1UT . . .

MUL ADD MOV

MOVAPD

VFMADDPD . . .9 / 21

{C† := PCPT + Q

K := C†HT (HC†H

T )−1

E := Q−1U(I + UTQ−1U)−1UT . . .

y := αx + y LU = A · · · C := αAB + βC

X := A−1B C := ABT + BAT + C X := L−1ML−T QR = A

BLAS . . . LAPACK . . .

MUL ADD MOV

MOVAPD

VFMADDPD . . .

LINEARALGEBRAMAPPINGPROBLEM

(LAMP)

10 / 21

{C† := PCPT + Q

K := C†HT (HC†H

T )−1

E := Q−1U(I + UTQ−1U)−1UT . . .

y := αx + y LU = A · · · C := αAB + βC

MUL ADD MOV

MOVAPD

VFMADDPD . . .

(LAMP)

10 / 21

{C† := PCPT + Q

K := C†HT (HC†H

T )−1

E := Q−1U(I + UTQ−1U)−1UT . . .

y := αx + y LU = A · · · C := αAB + βC

MUL ADD MOV

MOVAPD

VFMADDPD . . .

(LAMP)

10 / 21

LAMP: A problem often ignoredLinnea: A compiler for linear algebra – Henrik Barthels, Christos Psarras

Linnea’s speedups

Jl: Julia, Arma: Armadillo, Eig: Eigen, Mat: Matlab. n/r: naive/recommended implementation11 / 21

Tensor App #1 Tensor App #2 . . . Tensor App #N

MUL ADD MOV

MOVAPD

VFMADDPD . . .

12 / 21

Tensor App #1 Tensor App #2 . . . Tensor App #N

Transposition Contraction · · · Alternating LS

Khatri-Rao SpMTTKRP · · · TTV, TTM

HPTT TCL . . . BLAS

MUL ADD MOV

MOVAPD

VFMADDPD . . .

13 / 21

nD case: Exemplary applicationsCoupled-Cluster methods

Finite Element 3D diffusion operator

credits to D. Matthews, E. Solomonik, J. Stanton, and J. Gauss

credits to A. Fisher – https://github.com/LLNL/acrotensor

14 / 21

nD case: Exemplary applicationsCoupled-Cluster methods Finite Element 3D diffusion operator

credits to D. Matthews, E. Solomonik, J. Stanton, and J. Gauss credits to A. Fisher – https://github.com/LLNL/acrotensor

14 / 21

Awareness

I Performance:Speedups in building blocks do not always translate to speedups in applications

I Research Problem: Mapping onto building blocks

I Solved by hand → loss of productivity (possibly loss of efficiency too)I Solved automatically → loss of efficiency

Beware: It’s challenging even for “simple” matrix computations!

15 / 21

Awareness

I Performance:Speedups in building blocks do not always translate to speedups in applications

I Research Problem: Mapping onto building blocks

I Solved by hand → loss of productivity (possibly loss of efficiency too)I Solved automatically → loss of efficiency

15 / 21

Part II

The computational scientists’ perspective

“The fastest FLOPS are those that are not executed.”– Lars

16 / 21

Chromatography-MS

PARAFAC Tucker PARAFAC2

Transposition Contraction · · · Alternating LS

Khatri-Rao SpMTTKRP · · · TTV, TTM

HPTT TCL . . . BLAS

MUL ADD MOV

MOVAPD

VFMADDPD . . .17 / 21

Example application: Untargeted chemical profilingChromatography with mass spectrometry detection

I Problem: Identify components in a sample

Jessica Torres – Bitesize Bio

18 / 21

I 3-way data: Mass-spectrum × elution time × sample

Rasmus Bro

18 / 21

I 3-way tensor → Individual components

Rasmus Bro

18 / 21

Workflow

I 1. (Pre-processing: alignment)

I 2. Choose time intervals across samples

I 3. (Data preparation: remove background)

I 4. Fit model: 1–15 components; if needed: non-negativity constraintsPARAFAC — PARAFAC2 — TUCKER

I 5. Determine whether or not one of the models is “right”

I : Determine which of the components represent chemical information

I : Start over; add/change constraints, change model

19 / 21

Workflow

19 / 21

Workflow

19 / 21

Workflow

19 / 21

Awareness & ConclusionsI Performance:

Speedups in building blocks do not always translate to speedups in applications

I Research Problem: Mapping onto building blocksI Solved by hand → loss of productivity (possibly loss of efficiency too)I Solved automatically → loss of efficiency

I Application as a wholeI Massive redundancy! Careful with black-box librariesI Man-in-the-middle workflow. Manual “check & decide”

I Gap: Domain scientists’ needs ↔ Computer scientists’ artifacts

I Problem: How to maximize scientific output?I Speedups in algorithms and building blocksI Efficient mappingI Reduction in computation in the application

Thank you

Questions?

20 / 21

Thank you

Questions?

20 / 21

Thank you

Questions?

20 / 21

Thank you

Questions?

20 / 21

Thank you

Questions?

20 / 21

I Spin Summations: A High-Performance Perspective,P. Springer, D. Matthews and P. Bientinesi.ACM Transactions on Mathematical Software, 2018. Accepted.

I Design of a High-Performance GEMM-like Tensor-Tensor Multiplication,P. Springer and P. Bientinesi.ACM Transactions on Mathematical Software, Vol. 44(3), 28:1–28:29, January 2018.

I The Generalized Matrix Chain Algorithm,H. Barthels, M. Copik and P. Bientinesi.Proceedings of the International Symposium on Code Generation and Optimization, February 2018.

I TTC: A High-Performance Compiler for Tensor Transpositions,P. Springer, J. Hammond and P. Bientinesi.ACM Transactions on Mathematical Software, Vol. 44(2), 15:1–15:21, August 2017.

I Gas chromatography – mass spectrometry data processing made easy,L. Johnsena, P. Skoub, B. Khakimovb, R. Bro.Journal of Chromatography A, 1503 (2017) 57–64.

I Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies,D. Fabregat-Traver and P. Bientinesi.ACM Transactions on Mathematical Software, Vol. 40(4), 27:1–27:22, June 2014.

I Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions,E. Di Napoli, D. Fabregat-Traver, G. Quintana-Orti and P. Bientinesi.Applied Mathematics and Computation, Vol. 235, 454–468, May 2014.

I Solving GC-MS problems with PARAFAC2,J. Amigo, T. Skov, J. Coello, S. Maspoch, R. Bro.Trends in Analytical Chemistry, Vol. 27, No. 8, 2008.

21 / 21

tensor computations: e ciency or productivity?hpac.cs.umu.se/~pauldj/talks/siam-cse19.pdftensor...

Documents