an m-script api for parallel out-of-core ... - hpca.uji.es file{rubio,quintana}@icc.uji.es high...

29
An M-Script API for Parallel Out-of-Core Linear Algebra Operations Rafael Rubio Enrique S. Quintana-Ort´ ı {rubio,quintana}@icc.uji.es High Performance Computing & Architectures Group Universidad Jaime I de Castell´ on (Spain) API for Parallel OOC Linear Algebra 1 PARA 2008

Upload: vankhue

Post on 06-Apr-2019

221 views

Category:

Documents


0 download

TRANSCRIPT

An M-Script API forParallel Out-of-Core Linear Algebra Operations

Rafael Rubio Enrique S. Quintana-Ortı{rubio,quintana}@icc.uji.es

High Performance Computing & Architectures GroupUniversidad Jaime I de Castellon (Spain)

API for Parallel OOC Linear Algebra 1 PARA 2008

Motivation

Very large-scale dense linear algebra problems occur in practice!

Space geodesy (e.g., computation of the gravity field)

Electromagnetism (as arising from BEM)

. . .

Problems are too large to fit in-core and use of virtual memory isinefficient

API for Parallel OOC Linear Algebra 2 PARA 2008

Motivation

As a result, message-passing dense linear algebra libraries exist:

SOLAR

OOC-ScaLAPACK

POOCLAPACK

Problems?

Message-passing may not be the best solution for multicore(many-core?)

Users (scientists and engineers) like high-level environments asmatlab (or clones Scilab, Octave), Labview, etc.

API for Parallel OOC Linear Algebra 3 PARA 2008

Motivation

Our solution:

Use FLAME@lab, the M-script API to FLAME

Extend FLASH to OOC

Extract parallelism from multithreaded BLAS (for now)

API for Parallel OOC Linear Algebra 4 PARA 2008

Outline

1 Motivation

2 FLAME@lab

3 FLASH for OOC

4 Experimental results

5 Concluding remarks

API for Parallel OOC Linear Algebra 5 PARA 2008

FLAME@lab

Outline

1 Motivation

2 FLAME@lab

Cholesky factorization (Overview of FLAME)

3 FLASH for OOC

4 Experimental results

5 Concluding remarks

API for Parallel OOC Linear Algebra 6 PARA 2008

FLAME@lab

The Cholesky Factorization

Definition

Given A → n× n symmetric positive definite, compute

A = L · LT ,

with L → n× n lower triangular

API for Parallel OOC Linear Algebra 7 PARA 2008

FLAME@lab

The Cholesky Factorization: Whiteboard Presentation

done

done

done

A(partially

updated)

?

α11 ?

a21 A22

?

α11:=√α11

?

a21:=a21/α11

A22:=

A22−a21aT21

?

done

done

done

A(partially

updated)

API for Parallel OOC Linear Algebra 8 PARA 2008

FLAME@lab

FLAME Notation

done

done

done

A(partially

updated)

?

α11 aT12

a21 A22

Repartition„ATL ATR

ABL ABR

«

0BB@A00 a01 A02

aT10 α11 aT

12

A20 a21 A22

1CCAwhere α11 is a scalar

API for Parallel OOC Linear Algebra 9 PARA 2008

FLAME@lab

FLAME Notation

Algorithm: [A] := Chol unb(A)

Partition A→“

AT L AT R

ABL ABR

”where ATL is 0× 0

while n(ABR) 6= 0 do

Repartition„ATL ATR

ABL ABR

«→

0@ A00 a01 A02

aT10 α11 aT

12

A20 a21 A22

1Awhere α11 is a scalar

α11 :=√

α11

a21 := a21/α11

A22 := A22 − a21aT21

Continue with„ATL ATR

ABL ABR

«←

0@ A00 a01 A02

aT10 α11 aT

12

A20 a21 A22

1Aendwhile

API for Parallel OOC Linear Algebra 10 PARA 2008

FLAME@lab

FLAME@lab Code

From algorithm to code. . .

FLAME notationRepartition„

ATL ATR

ABL ABR

«→

0@ A00 a01 A02

aT10 α11 aT

12

A20 a21 A22

1Awhere α11 is a scalar

FLAME@lab code[ A00, a01, A02, ...

a10t, alpha11, a12t, ...

A20, a21, A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR, ...

ABL, ABR, ...

1, 1, ’FLA_BR’ );

API for Parallel OOC Linear Algebra 11 PARA 2008

FLAME@lab

(Unblocked) FLAME@lab Code

function [ A_out ] = FLA_Cholesky_unb_var1( A )

% ... FLA_Part_2x2( ); ...

while ( size( ABR, 2 ) ~= 0 )

[ A00, a01, A02, ...

a10t, alpha11, a12t, ...

A20, a21, A22 ]

= FLA_Repart_2x2_to_3x3( ATL, ATR, ...

ABL, ABR, ...

1, 1, ’FLA_BR’ );

%----------------------------------------------------%

alpha11 = sqrt( alpha11 );

a21 = a21 / alpha11;

A22 = A22 - tril(a21 * a21’);

%----------------------------------------------------%

% ... FLA_Cont_with_3x3_to_2x2( ); ...

end

A_out = ATL;

return

API for Parallel OOC Linear Algebra 12 PARA 2008

FLAME@lab

Blocked FLAME@lab Code

function [ A_out ] = FLA_Cholesky_blk_var1( A, nb_alg )

% ... FLA_Part_2x2( ); ...

while ( size( ATL, 2 ) ~= 0 )

b = min( size( ABR, 1 ), nb_alg );

[ A00, A01, A02, ...

A10, A11, A12, ...

A20, A21, A22 ]

= FLA_Repart_2x2_to_3x3( ATL, ATR, ...

ABL, ABR, ...

b, b, ’FLA_BR’ );

%----------------------------------------------------%

FLA_Cholesky_unb( A11 );

A21 = A21 * inv( tril( A11 ) )’;

A22 = A22 - tril( A21 * A21’ );

%----------------------------------------------------%

% ... FLA_Cont_with_3x3_to_2x2( ); ...

end

A_out = ATL;

return

API for Parallel OOC Linear Algebra 13 PARA 2008

FLAME@lab

FLAME Code

Visit http://www.cs.utexas.edu/users/flame/Spark/. . .

M-script code formatlab: FLAME@lab

C code: FLAME/C

Other APIs:

FLATEXFortran-77LabViewMessage-passingparallel:PLAPACKFLAG: GPUs

API for Parallel OOC Linear Algebra 14 PARA 2008

FLASH

Outline

1 Motivation

2 FLAME@lab

3 FLASH for OOC

Overview of FLASHTiled OOC algorithm

4 Experimental results

5 Concluding remarks

API for Parallel OOC Linear Algebra 15 PARA 2008

FLASH

FLASH

Storage-by-blocks:

In general, storage-by-blocks enhaces data locality and,therefore, performance. . .

at the cost of more complex programming

API for Parallel OOC Linear Algebra 16 PARA 2008

FLASH

FLASH

Extending FLAME for hierarchical storage:

Natural extension: elements of matrices can be themselvesmatrices (hierarchical/recursive structure)

FLAME code is independent of actual storage (details hiddenin the objects)

OOC data structures are just one level in this hierarchy, withdata stored on disk

API for Parallel OOC Linear Algebra 17 PARA 2008

FLASH

OOC API

Driver:

% Create matrix

A = FLAOOC_Obj_create( ’FLA_DOUBLE’, rows, cols, tile_size,

file_name );

% Fill-in: A( i:i+m-1, j:j+n-1 ) = alpha*B+A( i:i+m-1, j:j+n-1 );

% B is m x n

FLAOOC_Axpy_matrix_to_object( alpha, B, A, i, j );

% Cholesky factorization

A = FLAOOC_Cholesky_var2( A );

% Free matrix

FLAOOC_Obj_free( A );

OOC to in-core and vice-versa:FLAOOC OOC to INC

FLAOOC INC to OOC

API for Parallel OOC Linear Algebra 18 PARA 2008

FLASH

OOC API

function [ A_out ] = FLA_Cholesky_blk_var2( A, nb_alg )

% ... FLA_Part_2x2( ); ...

while ( size( ABR, 2 ) ~= 0 )

b = min( size( ABR, 1 ), nb_alg );

[ A00, A01, A02, ...

A10, A11, A12, ...

A20, A21, A22 ]

= FLA_Repart_2x2_to_3x3( ATL, ATR, ...

ABL, ABR, ...

b, b, ’FLA_BR’ );

%----------------------------------------------------%

A10 = A10 \ tril( A00 )’;

A11 = A11 - tril( A10 * A10’ );

A11 = FLA_Cholesky_unb( A11 );

%----------------------------------------------------%

% ... FLA_Cont_with_3x3_to_2x2( ); ...

end

A_out = ATL;

return

API for Parallel OOC Linear Algebra 19 PARA 2008

FLASH

OOC API

function [ A_out ] = FLAOOC_Cholesky_var2( A )

% ... FLAOOC_Part_2x2( ); ...

while ( FLAOOC_Obj_width( ABR ) ~= 0 )

% ... FLAOOC_Repart_2x2_to_3x3( );

%----------------------------------------------------%

A10 = FLAOOC_Trsm( % Mode arguments

1, A00,

A10 ); % A10 := A10 * TRIL(A00)^-T

AINC = FLAOOC_OOC_to_INC( A11 );% Copy A11 in-core

AINC = FLAOOC_Syrk( % Mode arguments

-1, A10,

1, AINC ); % A11 := A11 - A10 * A10^T

AINC = chol(AINC)’; % A11 := chol( A11 )

FLAOOC_INC_to_OOC( AINC, A11 ); % Store A11 out-of-core

%----------------------------------------------------%

% ... FLAOOC_Cont_with_3x3_to_2x2( ); ...

end

A_out = ATL;

return

API for Parallel OOC Linear Algebra 20 PARA 2008

FLASH

Parallelization on Multithreaded Architectures

Traditional approach

Link with multithreaded BLAS

Advanced approach (future)

Extract parallelism at higher level: SuperMatrix runtimesystem

API for Parallel OOC Linear Algebra 21 PARA 2008

FLASH

Outline

1 Motivation

2 FLAME@lab

3 FLASH

4 OOC API

5 Experimental results

6 Concluding remarks

API for Parallel OOC Linear Algebra 22 PARA 2008

Results

Experimental Results

General

Platform Specs.

matserv SMP with two Intel [email protected] GHzand 1 GB of RAM

cat SMP with four Intel [email protected] GHzand 4 GB of RAM

Implementations

Matlab/Octave routine chol linked to multithreaded MKL(7.0 or higher)

FLA Cholesky blk linked to multithreaded MKL

API for Parallel OOC Linear Algebra 23 PARA 2008

Results

Experimental Results: matserv

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

GF

LOP

s

Matrix dimension

Cholesky factorization on 1 processor

FLAOOC_Chol + MKLMATLAB chol + MKL

API for Parallel OOC Linear Algebra 24 PARA 2008

Results

Experimental Results: cat

0

1

2

3

4

5

6

0 10000 20000 30000 40000 50000 60000

GF

LOP

s

Matrix dimension

Cholesky factorization on 1 processor

FLAOOC_Chol + MKLOCTAVE chol + MKL

API for Parallel OOC Linear Algebra 25 PARA 2008

Results

Experimental Results: matserv

0

1

2

3

4

5

6

7

8

9

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

GF

LOP

s

Matrix dimension

Cholesky factorization on 2 processors

FLAOOC_Chol + MKLMATLAB chol + MKL

API for Parallel OOC Linear Algebra 26 PARA 2008

Results

Experimental Results: cat

0

5

10

15

20

0 10000 20000 30000 40000 50000 60000

GF

LOP

s

Matrix dimension

Cholesky factorization on 4 processors

FLAOOC_Chol + MKLOCTAVE Chol + MKL

API for Parallel OOC Linear Algebra 27 PARA 2008

Remarks

Concluding Remarks

FLAOOC@lab is an easy-to-use tool to develop OOC codes fordense linear algebra operations

The new API allows the solution of large linear systems fromMatlab/Octave

Focus is on programmability but performance will come with asimilar C API

API for Parallel OOC Linear Algebra 28 PARA 2008

Remarks

Thanks for your attention!

For more information. . .

Visit http://www.cs.utexas.edu/users/flame

Support. . .

National Science Foundation awards CCF-0702714 andCCF-0540926 (ongoing till 2010)

Spanish CICYT project TIN2005-09037-C02-02

API for Parallel OOC Linear Algebra 29 PARA 2008