
CR07 – Sparse Matrix Computations / Cours de Matrices Creuses (2010-2011)

Jean-Yves L'Excellent (INRIA) and Bora Uçar (CNRS)

LIP-ENS Lyon, Team: ROMA


Fridays 10h15-12h15

prepared in collaboration with P. Amestoy (ENSEEIHT-IRIT)


Motivations

- Applications with ever-increasing computing-power needs:
  - modeling,
  - simulation (rather than experimentation),
  - numerical optimization.
- Typically: continuous problem ⇒ discretization (mesh) ⇒ numerical solution algorithm (according to the physical laws) ⇒ matrix problem (Ax = b, ...).
- Needs:
  - More and more accurate models
  - More and more complex problems
  - Applications with critical response times
  - Minimization of the cost of the computations

⇒ High-performance computers, parallelism

⇒ Algorithms that take the best advantage of these computers and of the properties of the matrices at hand


Example of preprocessing for Ax = b

Original (A = lhr01) and preprocessed matrix (A′(lhr01)):

[Figure: sparsity patterns of the original matrix A and of the preprocessed matrix A′, both with nz = 18427.]

Modified problem: A′x′ = b′ with A′ = P_n P D_r A Q D_c P^t


A few examples in the field of scientific computing

- Duration constraints: weather forecast


A few examples in the field of scientific computing

- Cost constraints: wind tunnels, crash simulation, ...


Scale constraints

- Large scale: climate modelling, pollution, astrophysics
- Tiny scale: combustion, quantum chemistry


Contents of the course

- Introduction, reminders on graph theory and on numerical linear algebra:
  - Introduction to sparse matrices
  - Graphs and hypergraphs, trees, some classical algorithms
  - Gaussian elimination, LU factorization, fill-in
  - Conditioning/sensitivity of a problem and error analysis


Contents of the course

- Sparse Gaussian elimination (matrix ↔ graph correspondence):
  - Sparse matrices and graphs
  - Gaussian elimination of sparse matrices
  - Ordering and permuting sparse matrices


Contents of the course

- Maximum (weighted) matching algorithms and their use in sparse linear algebra
- Efficient factorization methods:
  - Implementation of sparse direct solvers
  - Parallel sparse solvers
  - Scheduling of computations to optimize memory usage and/or performance
- Graph and hypergraph partitioning
- Iterative methods
- Current research activities


Tentative outline

A. INTRODUCTION
   I.    Sparse matrices
   II.   Graph theory and algorithms
   III.  Linear algebra basics
B. SPARSE GAUSSIAN ELIMINATION
   IV.   Elimination tree and structure prediction
   V.    Fill-reducing ordering methods
   VI.   Matching in bipartite graphs
   VII.  Factorization: Methods
   VIII. Factorization: Parallelization aspects
C. SOME OTHER ESSENTIAL SPARSE MATRIX ALGORITHMS
   IX.   Graph and hypergraph partitioning
   X.    Iterative methods
D. CLOSING
   XI.   Current research activities
   XII.  Presentations


Tentative organization of the course

- References and teaching material will be made available at http://graal.ens-lyon.fr/~bucar/CR07/
- 2+ hours dedicated to exercises (manipulation of sparse matrices and graphs)
- Evaluation: mainly based on the study of research articles related to the contents of the course; the students will write a report and give an oral presentation.


Outline

Introduction to Sparse Matrix Computations
- Motivation and main issues
- Sparse matrices
- Gaussian elimination
- Parallel and high performance computing
- Numerical simulation and sparse matrices
- Direct vs iterative methods
- Conclusion


A selection of references

- Books
  - Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
  - Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
  - Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
  - Saad, Iterative Methods for Sparse Linear Systems, 2nd edition, SIAM, 2003.
- Articles
  - Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
  - Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
  - Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 1991.



Motivations

- Solution of linear systems of equations → key algorithmic kernel

      Continuous problem
             ↓
      Discretization
             ↓
      Solution of a linear system Ax = b

- Main parameters:
  - Numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  - Size and structure:
    - Large (> 1 000 000 × 1 000 000?), square/rectangular
    - Dense or sparse (structured / unstructured)
  - Target computer (sequential/parallel/multicore)

→ Algorithmic choices are critical


Motivations for designing efficient algorithms

- Time-critical applications
- Solve larger problems
- Decrease elapsed time (parallelism?)
- Minimize cost of computations (time, memory)


Difficulties

- Access to data:
  - Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  - Sparse matrix: large, irregular, dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)
- Efficiency (time and memory):
  - The number of operations and the memory usage depend very much on the algorithm used and on the numerical and structural properties of the problem.
  - The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), multicore).
  → Algorithmic choices are critical



Sparse matrices

Example: the system

    3x1 + 2x2        = 5
          2x2 − 5x3  = 1
    2x1        + 3x3 = 0

can be represented as Ax = b, where

$A = \begin{pmatrix} 3 & 2 & 0 \\ 0 & 2 & -5 \\ 2 & 0 & 3 \end{pmatrix}$, $x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$, and $b = \begin{pmatrix} 5 \\ 1 \\ 0 \end{pmatrix}$

Sparse matrix: only nonzeros are stored.


Sparse matrix ?

[Figure: sparsity pattern, nz = 5104.]

Matrix dwt 592.rua (N = 592, NZ = 5104); structural analysis of a submarine.


Sparse matrix ?

Matrix from computational fluid dynamics (collaboration with Univ. Tel Aviv); a "saddle-point" problem.

[Figure: sparsity pattern, nz = 43105.]


Preprocessing sparse matrices

Original (A = lhr01) and preprocessed matrix (A′(lhr01)):

[Figure: sparsity patterns of the original matrix A and of the preprocessed matrix A′, both with nz = 18427.]

Modified problem: A′x′ = b′ with A′ = P_n P D_r A D_c Q P^t


Factorization process

Solution of Ax = b:

- A is unsymmetric:
  - A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix.
  - Forward-backward substitution: Ly = b, then Ux = y.
- A is symmetric:
  - A = LDL^T or LL^T
- A is rectangular, m × n with m ≥ n, and min_x ‖Ax − b‖_2 is sought:
  - A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is triangular.
  - Solve: y = Q^T b, then Rx = y.


Difficulties

- Only nonzero values are stored
- The factors L and U have far more nonzeros than A
- Data structures are complex
- Computations are only a small portion of the code (the rest is data manipulation)
- Memory size is a limiting factor (→ out-of-core solvers)


Key numbers:

1. Small sizes: 500 MB matrix; factors = 5 GB; flops = 100 Gflops.
2. Example of a 2D problem (Lab. Geosciences Azur, Valbonne):
   - complex 2D finite-difference matrix, n = 16×10^6, 150×10^6 nonzeros
   - storage (single precision): 2 GB (12 GB with the factors)
   - flops: 10 TeraFlops
3. Example of a 3D problem (EDF, Code Aster, structural engineering):
   - real finite-element matrix, n = 10^6, nz = 71×10^6 nonzeros
   - storage: 3.5×10^9 entries (28 GB) for the factors, 35 GB total
   - flops: 2.1×10^13
4. Typical performance (MUMPS):
   - PC Linux, 1 core (P4, 2 GHz): 1.0 GFlop/s
   - Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlop/s
   - AMD Opteron 8431, 24 cores @ 2.4 GHz: 50 GFlop/s (1 core: 7 GFlop/s)


Typical test problems:

BMW car body: 227,362 unknowns, 5,757,996 nonzeros (MSC.Software).
Size of factors: 51.1 million entries. Number of operations: 44.9×10^9.


Typical test problems:

BMW crankshaft: 148,770 unknowns, 5,396,386 nonzeros (MSC.Software).
Size of factors: 97.2 million entries. Number of operations: 127.9×10^9.


Sources of parallelism

Several levels of parallelism can be exploited:

- At problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)
- At matrix level: sparsity implies independence in the computations
- At submatrix level: within dense linear algebra computations (parallel BLAS, ...)


Data structure for sparse matrices

- The storage scheme depends on the pattern of the matrix and on the type of access required:
  - band or variable-band matrices
  - "block bordered" or block tridiagonal matrices
  - general matrix
  - row, column or diagonal access


Data formats for a general sparse matrix A

What needs to be represented:

- Assembled matrices: M×N matrix A with NNZ nonzeros.
- Elemental matrices (unassembled): M×N matrix A with NELT elements.
- Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes).
- Symmetric (or Hermitian) → store only part of the data.
- Distributed format?
- Duplicate entries and/or out-of-range values?


Classical Data Formats for Assembled Matrices

Example of a 3×3 matrix with NNZ = 5 nonzeros:

$A = \begin{pmatrix} a_{11} & & \\ & a_{22} & a_{23} \\ a_{31} & & a_{33} \end{pmatrix}$

- Coordinate format:

      IRN [1:NNZ] = 1   3   2   2   3
      JCN [1:NNZ] = 1   1   2   3   3
      VAL [1:NNZ] = a11 a31 a22 a23 a33

- Compressed Sparse Column (CSC) format:

      IRN    [1:NNZ] = 1   3   2   2   3
      VAL    [1:NNZ] = a11 a31 a22 a23 a33
      COLPTR [1:N+1] = 1   3   4   6

  Column J is stored in IRN/VAL locations COLPTR(J)...COLPTR(J+1)-1.

- Compressed Sparse Row (CSR) format: similar to CSC, but row by row.

- Diagonal format (M = N):

      NDIAG = 3
      IDIAG = -2  0  1
      VAL   = ( na  a11 0   )
              ( na  a22 a23 )
              ( a31 a33 na  )   (na: not accessed)

  VAL(i,j) corresponds to A(i, i+IDIAG(j)) (for 1 ≤ i+IDIAG(j) ≤ N).
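
The COLPTR array of the CSC format can be built from the coordinate arrays by a counting pass followed by a prefix sum. A minimal Fortran sketch of this COO-to-CSC conversion, on the 3×3 example above (the numerical values chosen for VAL are illustrative only):

    program coo_to_csc
      implicit none
      integer, parameter :: N = 3, NNZ = 5
      integer :: IRN(NNZ) = (/ 1, 3, 2, 2, 3 /)   ! coordinate format
      integer :: JCN(NNZ) = (/ 1, 1, 2, 3, 3 /)
      double precision :: VAL(NNZ) = (/ 11.d0, 31.d0, 22.d0, 23.d0, 33.d0 /)
      integer :: COLPTR(N+1), IRN2(NNZ), NEXT(N), j, k, p
      double precision :: VAL2(NNZ)
      COLPTR = 0                          ! count the entries of each column
      do k = 1, NNZ
         COLPTR(JCN(k)+1) = COLPTR(JCN(k)+1) + 1
      end do
      COLPTR(1) = 1                       ! prefix sum: column pointers
      do j = 1, N
         COLPTR(j+1) = COLPTR(j+1) + COLPTR(j)
      end do
      NEXT = COLPTR(1:N)                  ! next free position in each column
      do k = 1, NNZ                       ! scatter entries into their columns
         p = NEXT(JCN(k))
         IRN2(p) = IRN(k)
         VAL2(p) = VAL(k)
         NEXT(JCN(k)) = p + 1
      end do
      print *, 'COLPTR =', COLPTR         ! 1 3 4 6, as above
      print *, 'IRN    =', IRN2           ! 1 3 2 2 3
    end program coo_to_csc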


Sparse Matrix-vector products Y ← AX

The algorithm depends on the sparse matrix format:

- Coordinate format:

      Y(1:M) = 0
      DO k = 1, NNZ
         Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
      ENDDO

- CSC format:

      Y(1:M) = 0
      DO J = 1, N
         Xj = X(J)
         DO k = COLPTR(J), COLPTR(J+1) - 1
            Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
         ENDDO
      ENDDO

- CSR format:

      DO I = 1, M
         Yi = 0
         DO k = ROWPTR(I), ROWPTR(I+1) - 1
            Yi = Yi + VAL(k) * X(JCN(k))
         ENDDO
         Y(I) = Yi
      ENDDO

- Diagonal format (VAL(i,j) corresponds to A(i, i+IDIAG(j))):

      Y(1:N) = 0
      DO j = 1, NDIAG
         DO i = max(1, 1-IDIAG(j)), min(N, N-IDIAG(j))
            Y(i) = Y(i) + VAL(i,j) * X(i+IDIAG(j))
         END DO
      END DO


Jagged diagonal storage (JDS)

$A = \begin{pmatrix} a_{11} & & \\ & a_{22} & a_{23} \\ a_{31} & & a_{33} \end{pmatrix}$

1. Shift all elements left (similar to CSR) and keep the column indices:

       a11 (1)
       a22 (2)  a23 (3)
       a31 (1)  a33 (3)

2. Sort the rows in decreasing order of their number of nonzeros:

       a22 (2)  a23 (3)
       a31 (1)  a33 (3)
       a11 (1)

3. Store the corresponding row permutation: PERM = 2 3 1

4. Store the jagged diagonals (the columns of step 2):

       VAL     = a22 a31 a11 a23 a33
       COL_IND = 2   1   1   3   3
       COL_PTR = 1   4   6

Pros: one manipulates longer vectors than with CSR (interesting on vector computers or GPUs).
Cons: extra indirection due to the permutation array.
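
The jagged diagonals are laid out precisely so that Y ← AX runs over long contiguous vectors. A minimal sketch using the arrays built above, where COL_PTR(d) points to the start of jagged diagonal d and NJD denotes their number (NJD and the loop structure are our notation, not a fixed convention):

    ! Y <- A*X with A in jagged diagonal storage
    Y(1:N) = 0
    DO d = 1, NJD                             ! loop over the jagged diagonals
       DO i = 1, COL_PTR(d+1) - COL_PTR(d)    ! long, vectorizable inner loop
          k = COL_PTR(d) + i - 1
          Y(PERM(i)) = Y(PERM(i)) + VAL(k) * X(COL_IND(k))
       END DO
    END DO

On the example, the first pass updates rows 2, 3 and 1 (through PERM) with a22*x2, a31*x1 and a11*x1; the second pass finishes rows 2 and 3.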


Example of elemental matrix format

$A = \begin{pmatrix} -1 & 2 & 3 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 \\ 1 & 1 & 3 & -1 & 3 \\ 0 & 0 & 1 & 2 & -1 \\ 0 & 0 & 3 & 2 & 1 \end{pmatrix} = A_1 + A_2$

$A_1 = \begin{pmatrix} -1 & 2 & 3 \\ 2 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$ (variables 1, 2, 3),  $A_2 = \begin{pmatrix} 2 & -1 & 3 \\ 1 & 2 & -1 \\ 3 & 2 & 1 \end{pmatrix}$ (variables 3, 4, 5)


Example of elemental matrix format

$A_1$ (variables 1, 2, 3) and $A_2$ (variables 3, 4, 5) as on the previous slide.

- N = 5, NELT = 2, NVAR = 6, $A = \sum_{i=1}^{NELT} A_i$

      ELTPTR [1:NELT+1] = 1 4 7
      ELTVAR [1:NVAR]   = 1 2 3 3 4 5
      ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1

- Remarks:
  - NVAR = ELTPTR(NELT+1) - 1
  - NVAL = $\sum S_i^2$ (unsymmetric) or $\sum S_i(S_i+1)/2$ (symmetric), with $S_i$ = ELTPTR(i+1) - ELTPTR(i)
  - storage of elements in ELTVAL: by columns
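
To make the column-by-column layout of ELTVAL concrete, here is a minimal Fortran sketch that assembles the two element matrices into a dense array A (the dense target is for illustration only; a real solver would assemble into a sparse structure):

    program assemble_elements
      implicit none
      integer, parameter :: N = 5, NELT = 2, NVAR = 6, NVAL = 18
      integer :: ELTPTR(NELT+1) = (/ 1, 4, 7 /)
      integer :: ELTVAR(NVAR) = (/ 1, 2, 3, 3, 4, 5 /)
      double precision :: ELTVAL(NVAL) = (/ -1.d0, 2.d0, 1.d0,  2.d0, 1.d0, 1.d0, &
                                             3.d0, 1.d0, 1.d0,  2.d0, 1.d0, 3.d0, &
                                            -1.d0, 2.d0, 2.d0,  3.d0,-1.d0, 1.d0 /)
      double precision :: A(N,N)
      integer :: e, s, ii, jj, i, j, p
      A = 0.d0
      p = 1                                 ! current position in ELTVAL
      do e = 1, NELT
         s = ELTPTR(e+1) - ELTPTR(e)        ! size S_e of element e
         do jj = 1, s                       ! elements are stored by columns
            j = ELTVAR(ELTPTR(e) + jj - 1)  ! global column index
            do ii = 1, s
               i = ELTVAR(ELTPTR(e) + ii - 1)
               A(i,j) = A(i,j) + ELTVAL(p)  ! overlapping contributions are summed
               p = p + 1
            end do
         end do
      end do
      do i = 1, N
         print '(5f6.1)', A(i,:)            ! reproduces the matrix A above
      end do
    end program assemble_elements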


File storage: Rutherford-Boeing

- Standard ASCII format for files
- Header + data (CSC format). Key xyz:
  - x = [rcp] (real, complex, pattern)
  - y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric, rectangular)
  - z = [ae] (assembled, elemental)
  - e.g.: M_T1.RSA, SHIP003.RSE
- Supplementary files: right-hand sides, solution, permutations, ...
- A canonical format was introduced to guarantee a unique representation (order of entries in each column, no duplicates).


File storage: Rutherford-Boeing

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1

1733710 9758 492558 1231394 0

rsa 97578 97578 4925574 0

(10I8) (10I8) (3e26.16)

1 49 96 142 187 231 274 346 417 487

556 624 691 763 834 904 973 1041 1108 1180

1251 1321 1390 1458 1525 1573 1620 1666 1711 1755

1798 1870 1941 2011 2080 2148 2215 2287 2358 2428

2497 2565 2632 2704 2775 2845 2914 2982 3049 3115

...

1 2 3 4 5 6 7 8 9 10

11 12 49 50 51 52 53 54 55 56

57 58 59 60 67 68 69 70 71 72

223 224 225 226 227 228 229 230 231 232

233 234 433 434 435 436 437 438 2 3

4 5 6 7 8 9 10 11 12 49

50 51 52 53 54 55 56 57 58 59

...

-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11

0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08

0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10

0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07

0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09

0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07

0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08

0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08


File storage: Matrix-market

- Example:

      %%MatrixMarket matrix coordinate real general
      % Comments
      5 5 8
      1 1  1.000e+00
      2 2  1.050e+01
      3 3  1.500e-02
      1 4  6.000e+00
      4 2  2.505e+02
      4 4 -2.800e+02
      4 5  3.332e+01
      5 5  1.200e+01
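
Reading this format takes only a banner line, optional comment lines, a size line, and one (i, j, value) triple per line. A minimal Fortran sketch for the real, general, coordinate case (the file name is hypothetical, and a robust reader would also parse the banner and handle symmetric or pattern matrices):

    program read_matrix_market
      implicit none
      character(len=256) :: line
      integer :: m, n, nnz, k
      integer, allocatable :: irn(:), jcn(:)
      double precision, allocatable :: val(:)
      open(10, file='example.mtx', status='old')
      read(10,'(a)') line                  ! banner: %%MatrixMarket matrix ...
      do                                   ! skip the '%' comment lines
         read(10,'(a)') line
         if (line(1:1) /= '%') exit
      end do
      read(line,*) m, n, nnz               ! size line: M N NNZ
      allocate(irn(nnz), jcn(nnz), val(nnz))
      do k = 1, nnz                        ! one entry per line
         read(10,*) irn(k), jcn(k), val(k)
      end do
      close(10)
      print *, 'read a', m, 'x', n, 'matrix with', nnz, 'entries'
    end program read_matrix_market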


Examples of sparse matrix collections

- The University of Florida Sparse Matrix Collection http://www.cise.ufl.edu/research/sparse/matrices/
- Matrix Market http://math.nist.gov/MatrixMarket/
- Rutherford-Boeing http://www.cerfacs.fr/algor/Softs/RB/index.html
- TLSE http://gridtlse.org/



Gaussian elimination

$A = A^{(1)}$, $b = b^{(1)}$, $A^{(1)}x = b^{(1)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$

    row 2 ← row 2 − row 1 × a21/a11
    row 3 ← row 3 − row 1 × a31/a11

$A^{(2)}x = b^{(2)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & a^{(2)}_{32} & a^{(2)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(2)}_3 \end{pmatrix}$

with $b^{(2)}_2 = b_2 - a_{21}b_1/a_{11}$, ..., $a^{(2)}_{32} = a_{32} - a_{31}a_{12}/a_{11}$, ...

Finally, $A^{(3)}x = b^{(3)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & 0 & a^{(3)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(3)}_3 \end{pmatrix}$

with $a^{(3)}_{33} = a^{(2)}_{33} - a^{(2)}_{32}a^{(2)}_{23}/a^{(2)}_{22}$, ...

Typical Gaussian elimination step k: $a^{(k+1)}_{ij} = a^{(k)}_{ij} - \dfrac{a^{(k)}_{ik}\,a^{(k)}_{kj}}{a^{(k)}_{kk}}$
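
This elimination step translates directly into the classical triple loop of an in-place dense LU factorization (no pivoting, for clarity); a textbook sketch, not the algorithm of any particular solver:

    ! in-place LU: on exit, U is in the upper triangle of A,
    ! and the multipliers l_ik (unit lower triangular L) below it
    do k = 1, n-1
       do i = k+1, n
          A(i,k) = A(i,k) / A(k,k)              ! l_ik = a_ik / a_kk
          do j = k+1, n
             A(i,j) = A(i,j) - A(i,k) * A(k,j)  ! a_ij <- a_ij - l_ik * a_kj
          end do
       end do
    end do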


Relation with A = LU factorization

- One step of Gaussian elimination can be written $A^{(k+1)} = L^{(k)} A^{(k)}$, with

  $L^{(k)} = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & -l_{k+1,k} & \ddots & \\ & & -l_{n,k} & & 1 \end{pmatrix}$ and $l_{ik} = \dfrac{a^{(k)}_{ik}}{a^{(k)}_{kk}}$.

- Then $A^{(n)} = U = L^{(n-1)} \cdots L^{(1)} A$, which gives $A = LU$, with

  $L = [L^{(1)}]^{-1} \cdots [L^{(n-1)}]^{-1} = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ l_{i,j} & & 1 \end{pmatrix}$,

  i.e. L is unit lower triangular and simply collects the multipliers $l_{i,j}$.

- In dense codes, the entries of L and U overwrite the entries of A.

- Furthermore, if A is symmetric, $A = LDL^T$ with $d_{kk} = a^{(k)}_{kk}$:
  $A = LU = A^T = U^T L^T$ implies $U(L^T)^{-1} = L^{-1}U^T = D$ diagonal, and $U = DL^T$, thus $A = L(DL^T) = LDL^T$.


Gaussian elimination and sparsity

Step k of the LU factorization (pivot $a_{kk}$):

- For i > k, compute $l_{ik} = a_{ik}/a_{kk}$ ($= a'_{ik}$).
- For i > k, j > k:
  $a'_{ij} = a_{ij} - \dfrac{a_{ik} \times a_{kj}}{a_{kk}}$  or  $a'_{ij} = a_{ij} - l_{ik} \times a_{kj}$
- If $a_{ik} \neq 0$ and $a_{kj} \neq 0$, then $a'_{ij} \neq 0$.
- If $a_{ij}$ was zero, its new nonzero value must be stored: fill-in.

[Figure: if entries (i,k), (k,k) and (k,j) are nonzero, the elimination of column k creates a nonzero in position (i,j): fill-in.]


- The same holds for Cholesky:
  - For i > k, compute $l_{ik} = a_{ik}/\sqrt{a_{kk}}$ ($= a'_{ik}$).
  - For i > k, j > k, j ≤ i (lower triangular part):
    $a'_{ij} = a_{ij} - \dfrac{a_{ik} \times a_{jk}}{a_{kk}}$  or  $a'_{ij} = a_{ij} - l_{ik} \times a'_{jk}$


Example

- Original matrix:

      x x x x x
      x x
      x   x
      x     x
      x       x

- The matrix is full after the first step of elimination.
- After reordering the matrix (1st row and column ↔ last row and column):


      x       x
        x     x
          x   x
            x x
      x x x x x

- No fill-in.
- Ordering the variables has a strong impact on
  - the fill-in,
  - the number of operations.
- Finding the best ordering is an NP-hard problem in general (Yannakakis, 1981).
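
The effect of this reordering can be checked by a symbolic elimination that only propagates the nonzero pattern (rule: $a'_{ij}$ becomes nonzero when $a_{ik}$ and $a_{kj}$ are). A minimal sketch on the two 5×5 arrow matrices above (all names are ours):

    program fill_demo
      implicit none
      logical :: s(5,5)
      integer :: fill
      call arrow(s, .true.)                 ! dense first row and column
      call symbolic(s, fill)
      print *, 'arrow first: fill-in =', fill   ! 12 entries created
      call arrow(s, .false.)                ! dense last row and column
      call symbolic(s, fill)
      print *, 'arrow last:  fill-in =', fill   ! 0 entries created
    contains
      subroutine arrow(s, first)
        logical, intent(out) :: s(5,5)
        logical, intent(in)  :: first
        integer :: i, d
        d = merge(1, 5, first)
        s = .false.
        do i = 1, 5
           s(i,i) = .true.; s(d,i) = .true.; s(i,d) = .true.
        end do
      end subroutine arrow
      subroutine symbolic(s, fill)          ! pattern-only Gaussian elimination
        logical, intent(inout) :: s(5,5)
        integer, intent(out) :: fill
        integer :: i, j, k
        fill = 0
        do k = 1, 4
           do i = k+1, 5
              if (.not. s(i,k)) cycle
              do j = k+1, 5
                 if (s(k,j) .and. .not. s(i,j)) then
                    s(i,j) = .true.         ! new nonzero: fill-in
                    fill = fill + 1
                 end if
              end do
           end do
        end do
      end subroutine symbolic
    end program fill_demo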


Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202.

[Figure: original matrix (nz = 5104) and factorized matrix (nz = 58202).]


Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

NZ(LU factors) = 16924.

[Figure: permuted matrix (RCM, nz = 5104) and factorized permuted matrix (nz = 16924).]


Table: benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al., 1991).

    Procedure                      Total storage   Flops         Time (sec.) on CRAY J90
    Full system                    4084 Kwords     5503 x 10^6   34.5
    Sparse system                    71 Kwords     1073 x 10^6    3.4
    Sparse system and reordering     14 Kwords       42 x 10^3    0.9


Control of numerical stability: numerical pivoting

- In dense linear algebra, partial pivoting is commonly used (at each step, the largest entry in the column is selected).
- In sparse linear algebra, flexibility to preserve sparsity is offered:
  - Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots = $\{\, r : |a^{(k)}_{rk}| \ge u \times \max_i |a^{(k)}_{ik}| \,\}$, where $0 < u \le 1$.
  - Then, among the eligible pivots, one that better preserves sparsity is selected.
  - u is called the threshold parameter (u = 1 → partial pivoting).
  - It restricts the maximum possible growth of $a_{ij} = a_{ij} - \frac{a_{ik} \times a_{kj}}{a_{kk}}$ to $1 + \frac{1}{u}$ per step, which is sufficient to preserve numerical stability.
  - u ≈ 0.1 is often chosen in practice.
- For symmetric indefinite problems, 2×2 pivots (with threshold) are also used, to preserve symmetry and sparsity.


Threshold pivoting and numerical accuracy

Table: effect of the variation of the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al., 1991).

    u        Nonzeros in LU factors   Error
    1.0      16767                    3 x 10^-9
    0.25     14249                    6 x 10^-10
    0.1      13660                    4 x 10^-9
    0.01     15045                    1 x 10^-5
    10^-4    16198                    1 x 10^2
    10^-10   16553                    3 x 10^23


Difficulty: numerical pivoting implies dynamic data structures that cannot be forecast symbolically.


Three-phase scheme to solve Ax = b

1. Analysis step
   - Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   - Build the dependency graph (elimination tree, eDAG, ...)
2. Factorization (A = LU, LDL^T, LL^T, QR), with numerical pivoting
3. Solution based on the factored matrices
   - Triangular solves: Ly = b, then Ux = y
   - Improvement of the solution (iterative refinement), error analysis


Efficient implementation of sparse algorithms

- Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

      do i = 1, m
         A( ind(i) ) = A( ind(i) ) + alpha * w( i )
      enddo

- Even if manufacturers provide hardware support to improve indirect addressing, it penalizes performance.
- Identify dense blocks, or switch to dense calculations as soon as the matrix is not sparse enough.


Effect of switch to dense calculations

Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al., 1991):

    Density for switch   Order of full   Millions   Time
    to full code         submatrix       of flops   (seconds)
    No switch               0                7       21.8
    1.00                   74                7       21.4
    0.80                  190                8       15.0
    0.60                  235               11       12.5
    0.40                  305               21        9.0
    0.20                  422               50        5.5
    0.10                  531              100        3.7
    0.005                1420             1908        6.1

The sparse structure should be exploited if the density is below 10%.



Main processor (r)evolutions

- pipelined functional units
- superscalar processors
- out-of-order execution (ILP)
- larger caches
- evolution of the instruction sets (CISC, RISC, EPIC, ...)
- multicores


Why parallel processing?

- Computing needs are not met in many disciplines (to solve significant problems)
- Uniprocessor performance is approaching the physical limits:
  cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle)
- A 40 TFlop/s computer ⇒ 5000 cores → massively parallel computers
- Not because it is the simplest way, but because it is necessary

Current power (June 2010, cf. http://www.top500.org):
Cray XT5, Oak Ridge National Lab: 1.7 PFlop/s, 300 TBytes of memory, 224256 cores


Some units for high-performance computing

Speed:

    1 MFlop/s   1 Megaflop/s   10^6  operations / second
    1 GFlop/s   1 Gigaflop/s   10^9  operations / second
    1 TFlop/s   1 Teraflop/s   10^12 operations / second
    1 PFlop/s   1 Petaflop/s   10^15 operations / second
    1 EFlop/s   1 Exaflop/s    10^18 operations / second

Memory:

    1 kB / 1 ko   1 kilobyte   10^3  bytes
    1 MB / 1 Mo   1 Megabyte   10^6  bytes
    1 GB / 1 Go   1 Gigabyte   10^9  bytes
    1 TB / 1 To   1 Terabyte   10^12 bytes
    1 PB / 1 Po   1 Petabyte   10^15 bytes


Performance measures

- Number of floating-point operations per second (not MIPS)
- Peak performance:
  - what appears in the vendors' advertisements
  - assumes that all functional units are active
  - you are guaranteed not to go faster:
    peak performance = #functional units / cycle time (sec.)
- Real performance:
  - usually much lower than the peak (unfortunately)


The ratio (real performance / peak performance) is often low!!
Let P be a program:

1. Sequential processor:
   - 1 scalar unit (1 GFlop/s)
   - execution time of P: 100 s
2. Parallel machine with 100 processors:
   - each processor: 1 GFlop/s
   - peak performance: 100 GFlop/s
3. If P consists of 10% sequential code + 90% parallelized code:
   - execution time of P: 10 + 0.9 = 10.9 s
   - real performance: 9.2 GFlop/s
4. Real performance / peak performance = 0.1


Moore’s law

- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of integrated circuits would double every 24 months.
- It has also served as a target for manufacturers.
- It has been distorted over time:
  - 24 → 18 months
  - number of transistors → performance


How to increase computing speed?

- Increase the frequency with faster technologies. We are approaching the limits:
  - chip design
  - power consumption and heat dissipation
  - cooling ⇒ space problems
- Further miniaturization is possible, but:
  - not indefinitely
  - the resistance of the conductors (R = ρ × l / s) increases, and resistance is responsible for energy dissipation (Joule effect)
  - capacitance effects are difficult to control

  Remark: 0.5 nanosecond = the time for a signal to travel along 15 cm of cable

- Cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle)


The only solution: parallelism

- Parallelism: simultaneous execution of several instructions within a program
- Inside a core:
  - micro-instructions
  - pipelined processing
  - overlap of instructions executed by distinct computing units
  → transparent for the programmer (handled by the compiler or at run time)
- Between distinct processors or cores:
  - different instruction streams are executed → synchronizations:
    - implicit (compiler, automatic parallelization, use of parallel libraries)
    - or explicit (message passing, multithreaded programming, critical sections)


The data access problem

- In practice, one often runs at 10% of the peak performance
- Faster processors → data must be accessed faster:
  - processor organization,
  - memory organization,
  - inter-processor communication
- More complex hardware: pipeline, technology, network, ...
- More complex software: compiler, operating system, programming languages, management of parallelism, debugging tools, ... and applications

It becomes more difficult to program efficiently.


Memory bandwidth problems

- Access to data is a crucial problem in modern computers:
  - processor performance: +60% per year
  - DRAM memory: +9% per year
  → the ratio (processor performance / memory access time) increases by about 50% per year!
  MFlop/s are easier to obtain than MB/s of memory bandwidth.
- The memory hierarchy is more and more complex (and latency increases)
- The way data are accessed becomes more and more critical:
  - minimize cache misses
  - minimize memory paging
  - locality: improve the ratio of references to local memories over references to remote memories
  - reuse, blocking: increase the flops / memory-accesses ratio
  - manage data transfers "by hand"? (Cell, GPU)


    Level            Size             Average access time (# cycles), hit/miss
    Registers                         < 1
    Cache level 1    1 - 128 KB       1-2 / 8-66
    Cache level 2    256 KB - 16 MB   6-15 / 30-200
    Main memory                       10 - 100
    Remote memory                     500 - 5000
    Disks            1 - 10 GB        700,000 / 6,000,000

Figure: example of a memory hierarchy.


Memory design for a large number of processors?

How can 500 processors access data stored in a shared memory (technology, interconnection, price?)
→ Solution with a reasonable cost: physically distributed memory (each processor or group of processors has its own local memory)

- Two approaches:
  - globally addressable local memories: virtual shared memory computers
  - explicit data transfers between processors using message passing
- Scalability requires:
  - a linear increase of the memory bandwidth with the processor speed
  - an increase of the communication bandwidth with the number of processors
- Cost/performance ratio → distributed memory and a good cost/performance ratio for the processors


Architecture of multiprocessors

A high number of processors → physically distributed memory.

    Logical        Physical organization
    organization   Shared (32 procs max)        Distributed
    ------------------------------------------------------------------------
    Shared         shared-memory                global address space (hard/soft)
                   multiprocessors              on top of message passing
                                                (virtual shared memory)
    Distributed    emulation of message         message passing
                   passing (buffers)

Table: organization of the processors.


Terminology

SMP (Symmetric Multi-Processor) architecture:

- Memory shared both physically and logically
- Uniform memory access time
- Similar, from the application point of view, to multicore architectures (1 core = 1 logical processor)
- But communications are much faster inside multicores (latency < 3 ns, bandwidth > 20 GB/s) than inside SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture:

- Memory physically distributed and logically shared
- Easier to increase the number of processors than with SMP
- The access time depends on the location of the data
- Local accesses are faster than remote accesses
- The hardware maintains cache coherence (ccNUMA)


Programming

Programming standards:

- Shared logical organization: POSIX threads, OpenMP directives
- Distributed logical organization: PVM, MPI, sockets (message passing)
- Hybrid programming: MPI + OpenMP
- Machines with 1 million cores? Emerging architectures such as GPGPUs? No standard yet!


Evolution of high-performance computing

- Rapid evolution of high-performance architectures (SMP, clusters, NUMA, multicores, Cell, GPUs, ...)
- Parallelism at several levels
- More and more complex memory hierarchies
⇒ Programming becomes more and more difficult, with software tools and standards that always lag behind.



Numerical simulation and sparse matrices

- General approach in scientific computing:
  1. Simulation problem (continuous problem)
  2. Application of physical laws (partial differential equations)
  3. Discretization, finite-dimensional equations
  4. Solution of linear systems (Ax = b)
  5. (Study of the results, possible revision of the model or of the method)
- The solution of linear systems is a fundamental algorithmic kernel. Parameters to take into account:
  - properties of the system (symmetry, positive definiteness, conditioning, over-determination, ...)
  - structure: dense or sparse,
  - size: several millions of equations?


Partial differential equations

- Modeling of a physical phenomenon
- Differential equations involving: forces, moments, temperatures, velocities, energies, time
- Analytical solutions are rarely available


Examples of partial differential equations

- Find the electric potential $\varphi$ for a given charge distribution f:
  $\nabla^2\varphi = f \Leftrightarrow \Delta\varphi = f$, i.e.
  $\frac{\partial^2\varphi}{\partial x^2}(x,y,z) + \frac{\partial^2\varphi}{\partial y^2}(x,y,z) + \frac{\partial^2\varphi}{\partial z^2}(x,y,z) = f(x,y,z)$
- Heat equation (Fourier's equation):
  $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = \frac{1}{\alpha}\frac{\partial u}{\partial t}$
  with
  - $u = u(x,y,z,t)$: temperature,
  - $\alpha$: thermal diffusivity of the medium.
- Wave propagation equations, Schrödinger equation, Navier-Stokes, ...


Discretization (the step that follows the physical modeling)

The numerical analyst's tasks:

- Construction of a mesh (regular, irregular)
- Choice of the solution methods and study of their behavior
- Study of the loss of information due to the passage to finite dimension

Main discretization techniques:

- Finite differences
- Finite elements
- Finite volumes


Discretization with finite differences (1D)

- Basic approximation (valid if h is small enough):
  $\frac{du}{dx}(x) \approx \frac{u(x+h) - u(x)}{h}$
- This results from Taylor's formula:
  $u(x+h) = u(x) + h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} + \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
- Replacing h by −h:
  $u(x-h) = u(x) - h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} - \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
- Thus, adding the two expansions:
  $\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$
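
The $O(h^2)$ behavior is easy to observe numerically. A small self-contained check on $u(x) = \sin(x)$, whose second derivative is $-\sin(x)$ (the test point and step sizes are arbitrary):

    program check_centered_difference
      implicit none
      double precision :: h, x, approx, err
      integer :: p
      x = 1.d0
      do p = 1, 5
         h = 10.d0**(-p)
         approx = (sin(x+h) - 2.d0*sin(x) + sin(x-h)) / h**2
         err = abs(approx + sin(x))        ! exact value is -sin(x)
         print '(a,es9.2,a,es12.4)', ' h =', h, '   error =', err
      end do
    end program check_centered_difference

Until rounding error takes over, dividing h by 10 divides the error by about 100.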


Discretization with finite differences (1D)

$\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$

3-point stencil for the centered difference approximation to the second-order derivative: weights (1, −2, 1).


Finite Differences for the Laplacian Operator (2D)

Assuming the same mesh refinement h in the x and y directions:

$\Delta u(x,y) \approx \frac{u(x-h,y) - 2u(x,y) + u(x+h,y)}{h^2} + \frac{u(x,y-h) - 2u(x,y) + u(x,y+h)}{h^2}$

$\Delta u(x,y) \approx \frac{1}{h^2}\left(u(x-h,y) + u(x+h,y) + u(x,y-h) + u(x,y+h) - 4u(x,y)\right)$

[Figure: 5-point stencils for the centered difference approximation to the Laplacian operator: (left) standard, (right) skewed. Center weight −4, neighbor weights 1.]


27-point stencil used for 3D geophysical applications (collaboration with Geoscience Azur, S. Operto and J. Virieux).


1D example

- Consider the problem
  $-u''(x) = f(x)$ for $x \in (0,1)$, with $u(0) = u(1) = 0$
- $x_i = i \times h$, $i = 0, \dots, n+1$, $f(x_i) = f_i$, $u(x_i) = u_i$, $h = 1/(n+1)$
- Centered difference approximation:
  $-u_{i-1} + 2u_i - u_{i+1} = h^2 f_i$ (with $u_0 = u_{n+1} = 0$)
- We obtain a linear system $Au = f$, or, for $n = 6$:

$\frac{1}{h^2}\begin{pmatrix} 2 & -1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 \\ 0 & 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \end{pmatrix}$
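
For this tridiagonal system, Gaussian elimination reduces to the Thomas algorithm (forward elimination plus back substitution, with no fill-in). A minimal sketch for n = 6 and the illustrative right-hand side f ≡ 1, for which the exact solution is u(x) = x(1−x)/2 and the scheme reproduces it exactly:

    program fd1d_thomas
      implicit none
      integer, parameter :: n = 6
      double precision :: h, d(n), u(n)
      integer :: i
      h = 1.d0 / (n+1)
      d = 2.d0                              ! diagonal; off-diagonals are -1
      u = h*h * 1.d0                        ! right-hand side h^2 * f_i, f = 1
      do i = 2, n                           ! forward elimination
         d(i) = d(i) - 1.d0 / d(i-1)        ! (-1)*(-1)/d(i-1)
         u(i) = u(i) + u(i-1) / d(i-1)
      end do
      u(n) = u(n) / d(n)                    ! back substitution
      do i = n-1, 1, -1
         u(i) = (u(i) + u(i+1)) / d(i)
      end do
      do i = 1, n                           ! compare with x(1-x)/2
         print '(2f12.6)', u(i), (i*h)*(1.d0 - i*h)/2.d0
      end do
    end program fd1d_thomas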


Slightly more complicated (2D)

Consider an elliptic PDE:

$-\frac{\partial}{\partial x}\!\left(a(x,y)\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\!\left(b(x,y)\frac{\partial u}{\partial y}\right) + c(x,y)\,u = g(x,y)$ on $\Omega$

$u(x,y) = 0$ on $\partial\Omega$

with $0 \le x, y \le 1$, $a(x,y) > 0$, $b(x,y) > 0$, $c(x,y) \ge 0$.


- Case of a regular 2D mesh: discretization step $h = \frac{1}{n+1}$, with $n = 4$.

  [Figure: unit square meshed by a regular grid; the interior unknowns are numbered 1, 2, 3, 4, 5, ... row by row.]

- 5-point finite difference scheme:

  $\frac{\partial\left(a\,\frac{\partial u}{\partial x}\right)_{ij}}{\partial x} = \frac{a_{i+\frac12,j}\,(u_{i+1,j} - u_{i,j})}{h^2} - \frac{a_{i-\frac12,j}\,(u_{i,j} - u_{i-1,j})}{h^2} + O(h^2)$

- Similarly:

  $\frac{\partial\left(b\,\frac{\partial u}{\partial y}\right)_{ij}}{\partial y} = \frac{b_{i,j+\frac12}\,(u_{i,j+1} - u_{i,j})}{h^2} - \frac{b_{i,j-\frac12}\,(u_{i,j} - u_{i,j-1})}{h^2} + O(h^2)$


- $a_{i+\frac12,j}$, $b_{i,j+\frac12}$, $c_{ij}$, ... are known.
- With the ordering of the unknowns of the example, we obtain a linear system of the form $Ax = b$, where
  $x_1 \leftrightarrow u_{1,1} = u(\frac{1}{n+1}, \frac{1}{n+1})$,
  $x_2 \leftrightarrow u_{2,1} = u(\frac{2}{n+1}, \frac{1}{n+1})$,
  $x_3 \leftrightarrow u_{3,1}$, $x_4 \leftrightarrow u_{4,1}$, $x_5 \leftrightarrow u_{1,2}$, ...
- A is $n^2 \times n^2$ and b is of size $n^2$, with the following structure:


(x = nonzero entry; 0 = zero entry inside the band, at the boundaries between mesh rows)

     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    |x x     x                      |  1  |g11|
    |x x x     x                    |  2  |g21|
    |  x x x     x                  |  3  |g31|
    |    x x 0     x                |  4  |g41|
    |x     0 x x     x              |  5  |g12|
    |  x     x x x     x            |  6  |g22|
    |    x     x x x     x          |  7  |g32|
A = |      x     x x 0     x        |  8  |g42| = b
    |        x     0 x x     x      |  9  |g13|
    |          x     x x x     x    | 10  |g23|
    |            x     x x x     x  | 11  |g33|
    |              x     x x 0     x| 12  |g43|
    |                x     0 x x    | 13  |g14|
    |                  x     x x x  | 14  |g24|
    |                    x     x x x| 15  |g34|
    |                      x     x x| 16  |g44|
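For constant coefficients (a = b = 1, c = 0) the matrix above is the 5-point Laplacian, and one convenient way to build it, a hedged sketch rather than the slides' construction, is as a Kronecker sum of 1D tridiagonal matrices:

  import scipy.sparse as sp

  n = 4
  T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
  I = sp.identity(n)
  A = sp.kron(I, T) + sp.kron(T, I)   # n^2 x n^2, same nonzero pattern as above
                                      # (values for a = b = 1, c = 0, times h^2)
  print(A.toarray().astype(int))      # 4 on the diagonal, -1 for mesh neighbours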


Solution of the linear system

Often the most costly part in a numerical simulation code

I Direct methods:
  I LU factorization followed by triangular substitutions
  I parallelism depends highly on the structure of the matrix

I Iterative methods:
  I usually rely on sparse matrix-vector products
  I algebraic preconditioners useful


Evolution in time of a complex phenomenon

I Examples:
  I climate modeling, evolution of radioactive waste, …

I heat equation:

  ∆u(x, y, z, t) = ∂u(x, y, z, t)/∂t
  u(x, y, z, t_0) = u_0(x, y, z)

I Discretization in both space and time (1D case):

  I Explicit approaches:

    (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^n − 2u_j^n + u_{j−1}^n) / h²

  I Implicit approaches:

    (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1}) / h²

I Implicit approaches are preferred (more stable, larger timesteps possible) but are more numerically intensive: a sparse linear system must be solved at each time step, as in the sketch below.
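A minimal Python/SciPy sketch (illustrative only; grid size, time step, and initial condition are made up) of backward Euler for the 1D heat equation: each step solves the sparse system (I + dt/h² T) u^{n+1} = u^n, and the factorization of that matrix can be reused at every step:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n, dt = 100, 1e-3
  h = 1.0 / (n + 1)
  T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
  A = sp.identity(n, format="csc") + (dt / h**2) * T

  x = np.arange(1, n + 1) * h
  u = np.sin(np.pi * x)               # initial condition u_0
  lu = spla.splu(A)                   # factor once ...
  for _ in range(10):
      u = lu.solve(u)                 # ... reuse the factors at every time step
  print(u.max())                      # decays, roughly like exp(-pi^2 t)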


Discretization with Finite elements

I Consider a partial differential equation of the form (Poisson equation):

  ∆u = ∂²u/∂x² + ∂²u/∂y² = f
  u = 0 on ∂Ω

I We can show (using Green's formula) that the previous problem is equivalent to:

  a(u, v) = −∫_Ω f v dx dy   for all v such that v = 0 on ∂Ω

  where a(u, v) = ∫_Ω (∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y) dx dy


Finite element scheme: 1D Poisson Equation

I ∆u = ∂²u/∂x² = f, u = 0 on ∂Ω

I Equivalent to:

  a(u, v) = g(v) for all v (v|∂Ω = 0)

  where a(u, v) = ∫_Ω (∂u/∂x)(∂v/∂x) and g(v) = −∫_Ω f(x) v(x) dx

  (1D: similar to integration by parts)

I Idea: we search u of the form u = Σ_k α_k Φ_k(x), with (Φ_k)_{k=1,n} a basis of functions such that Φ_k is linear on each E_i, and Φ_k(x_i) = δ_ik (= 1 if k = i, 0 otherwise).

(figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} on Ω, with Φ_k supported on the elements E_k and E_{k+1} around node x_k)


Finite Element Scheme: 1D Poisson Equation

(figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} on Ω, with Φ_k supported on the elements E_k and E_{k+1} around node x_k)

I We rewrite a(u, v) = g(v) for all Φ_k:

  a(u, Φ_k) = g(Φ_k) for all k  ⇔  Σ_i α_i a(Φ_i, Φ_k) = g(Φ_k)

  a(Φ_i, Φ_k) = ∫_Ω (∂Φ_i/∂x)(∂Φ_k/∂x) = 0 when |i − k| ≥ 2

I kth equation, associated with Φ_k:

  α_{k−1} a(Φ_{k−1}, Φ_k) + α_k a(Φ_k, Φ_k) + α_{k+1} a(Φ_{k+1}, Φ_k) = g(Φ_k)

I a(Φ_{k−1}, Φ_k) = ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x)

  a(Φ_{k+1}, Φ_k) = ∫_{E_{k+1}} (∂Φ_{k+1}/∂x)(∂Φ_k/∂x)

  a(Φ_k, Φ_k) = ∫_{E_k} (∂Φ_k/∂x)² + ∫_{E_{k+1}} (∂Φ_k/∂x)²


Finite Element Scheme: 1D Poisson Equation

From the point of view of E_k, we have a 2×2 contribution matrix:

  ( ∫_{E_k} (∂Φ_{k−1}/∂x)²           ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x) )     ( I_{E_k}(Φ_{k−1}, Φ_{k−1})   I_{E_k}(Φ_{k−1}, Φ_k) )
  ( ∫_{E_k} (∂Φ_k/∂x)(∂Φ_{k−1}/∂x)   ∫_{E_k} (∂Φ_k/∂x)²             )  =  ( I_{E_k}(Φ_k, Φ_{k−1})       I_{E_k}(Φ_k, Φ_k)     )

(figure: Ω split into elements E_1, E_2, E_3, E_4 with nodes 0, 1, 2, 3, 4 and hat functions Φ_1, Φ_2, Φ_3)

Assembling the element contributions gives, for the 3 interior nodes:

  [ I_{E_1}(Φ_1,Φ_1) + I_{E_2}(Φ_1,Φ_1)   I_{E_2}(Φ_1,Φ_2)                                  ] [α_1]   [g(Φ_1)]
  [ I_{E_2}(Φ_2,Φ_1)                      I_{E_2}(Φ_2,Φ_2) + I_{E_3}(Φ_2,Φ_2)   I_{E_3}(Φ_2,Φ_3)                    ] [α_2] = [g(Φ_2)]
  [                                       I_{E_3}(Φ_3,Φ_2)                      I_{E_3}(Φ_3,Φ_3) + I_{E_4}(Φ_3,Φ_3) ] [α_3]   [g(Φ_3)]
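A small Python sketch of exactly this assembly by element contributions for the 1D P1 stiffness matrix (illustrative only; real codes assemble directly into a sparse structure):

  import numpy as np
  import scipy.sparse as sp

  n_el = 4                          # elements E_1..E_4, nodes x_0..x_4 as above
  h = 1.0 / n_el
  C = np.array([[1.0, -1.0],
                [-1.0, 1.0]]) / h   # 2x2 element matrix: I_E(Φ_i, Φ_j) = ∫_E Φ_i' Φ_j'

  K = np.zeros((n_el + 1, n_el + 1))
  for e in range(n_el):             # element E_{e+1} couples nodes e and e+1
      K[e:e + 2, e:e + 2] += C

  K_int = sp.csr_matrix(K[1:-1, 1:-1])  # keep interior nodes (u = 0 at the ends)
  print(K_int.toarray() * h)            # (1/h) * tridiag(-1, 2, -1) pattern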


Finite Element Scheme in Higher Dimension

I Can be used for higher dimensions

I Mesh can be irregular

I Φi can be a higher degree polynomial

I Matrix pattern depends on mesh connectivity/ordering


Finite Element Scheme in Higher Dimension

I Set of elements (tetrahedra, triangles) to assemble:

  (figure: triangle T with vertices i, j, k)

         [ a^T_{i,i}  a^T_{i,j}  a^T_{i,k} ]
  C(T) = [ a^T_{j,i}  a^T_{j,j}  a^T_{j,k} ]
         [ a^T_{k,i}  a^T_{k,j}  a^T_{k,k} ]

Needs for the parallel case

I Assemble the sparse matrix A = Σ_k C(T_k): graph coloring algorithms

I Parallelization domain by domain: graph partitioning

I Solution of Ax = b: high performance matrix computation kernels


Other example: linear least squares

I mathematical model + approximate measurements ⇒ estimate parameters of the model

I m "experiments" + n parameters x_i: min ‖Ax − b‖ with:
  I A ∈ R^{m×n}, m ≥ n: data matrix
  I b ∈ R^m: vector of observations
  I x ∈ R^n: parameters of the model

I Solving the problem (see the sketch below):
  I Decompose A under the form A = QR, with Q orthogonal, R triangular
  I ‖Ax − b‖ = ‖Q^T Ax − Q^T b‖ = ‖Q^T QRx − Q^T b‖ = ‖Rx − Q^T b‖

I Problems can be large (meteorological data, …), sparse or not
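A hedged Python/NumPy-SciPy sketch (the data is random, for illustration only): solve min ‖Ax − b‖ either with a dense QR factorization, as outlined above, or with LSQR, an iterative solver suited to sparse A:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  m, n = 20, 5
  rng = np.random.default_rng(0)
  A = rng.standard_normal((m, n))       # data matrix: m experiments, n parameters
  b = rng.standard_normal(m)            # vector of observations

  Q, R = np.linalg.qr(A)                # A = QR, Q with orthonormal columns
  x_qr = np.linalg.solve(R, Q.T @ b)    # solve R x = Q^T b

  x_lsqr = spla.lsqr(sp.csr_matrix(A), b)[0]   # sparse-friendly alternative
  print(np.allclose(x_qr, x_lsqr, atol=1e-6))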


Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
  Parallel and high performance computing
  Numerical simulation and sparse matrices
  Direct vs iterative methods
  Conclusion


Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods

I Very general/robust
  I Numerical accuracy
  I Irregular/unstructured problems

I Factorization of matrix A
  I May be costly (flops/memory)
  I Factors can be reused for multiple right-hand sides b

I Computing issues
  I Good granularity of computations
  I Several levels of parallelism can be exploited

Iterative methods

I Efficiency depends on:
  I convergence – preconditioning
  I numerical properties / structure of A

I Rely on an efficient matrix-vector product
  I memory effective
  I successive right-hand sides are problematic

I Computing issues
  I Smaller granularity of computations
  I Often, only one level of parallelism
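The factor-reuse point on the direct side can be made concrete with a hedged Python/SciPy illustration (matrix and right-hand sides made up for the example): the direct solver factors A once and reuses the factors, while the iterative solver (here CG) redoes its iterations for each right-hand side:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 200
  A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
  rng = np.random.default_rng(0)
  B = rng.random((n, 3))                        # three successive right-hand sides

  lu = spla.splu(A)                             # direct: factor A once ...
  X_dir = np.column_stack([lu.solve(B[:, k]) for k in range(3)])

  # iterative: one full CG run per right-hand side (A is SPD here)
  X_it = np.column_stack([spla.cg(A, B[:, k])[0] for k in range(3)])
  print(np.max(np.abs(X_dir - X_it)))           # agree up to the CG tolerance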


Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
  Parallel and high performance computing
  Numerical simulation and sparse matrices
  Direct vs iterative methods
  Conclusion


Summary – sparse matrices

I Widely used in engineering and industry

I Irregular data structures

I Strong relations with graph theory

I Efficient algorithms are critical:
  I Ordering
  I Sparse Gaussian elimination
  I Sparse matrix-vector multiplication
  I Parallelization

I Challenges:
  I Modern applications leading to
    I bigger and bigger problems
    I different types of matrices and requirements

  I Dynamic data structures (numerical pivoting) → need for dynamic scheduling

I More and more parallelism (evolution of parallel architectures)


Suggested home reading

I Google PageRank: "The World's Largest Matrix Computation", Cleve Moler.

I "Architecture and Performance Characteristics of Modern High Performance Computers", Georg Hager and Gerhard Wellein, Lect. Notes in Physics.

I "Optimization Techniques for Modern High Performance Computers", Georg Hager and Gerhard Wellein, Lect. Notes in Physics.
