
Sparse Linear Algebra: Direct Methods, advanced features

P. Amestoy and A. Buttari (INPT(ENSEEIHT)-IRIT)

A. Guermouche (Univ. Bordeaux-LaBRI), J.-Y. L'Excellent and B. Ucar (INRIA-CNRS/LIP-ENS Lyon)

F.-H. Rouet (Lawrence Berkeley National Laboratory), C. Weisbecker (INPT(ENSEEIHT)-IRIT)

2013-2014

Sparse Matrix Factorizations

• A ∈ R^{m×m}, symmetric positive definite → LL^T = A

  Ax = b

• A ∈ R^{m×m}, symmetric → LDL^T = A

  Ax = b

• A ∈ R^{m×m}, unsymmetric → LU = A

  Ax = b

• A ∈ R^{m×n}, m ≠ n → QR = A

  min_x ‖Ax − b‖ if m > n

  min ‖x‖ such that Ax = b if n > m

Cholesky on a dense matrix

  [ a11             ]
  [ a21 a22         ]
  [ a31 a32 a33     ]
  [ a41 a42 a43 a44 ]

[Figure: left-looking vs. right-looking Cholesky; the shaded column is the one being modified, the other shaded entries are those used for the modification.]

left-looking Cholesky

for k = 1, ..., n do
  for i = k, ..., n do
    for j = 1, ..., k−1 do
      a_ik^(k) = a_ik^(k) − l_ij l_kj
    end for
  end for
  l_kk = sqrt( a_kk^(k−1) )
  for i = k+1, ..., n do
    l_ik = a_ik^(k−1) / l_kk
  end for
end for

right-looking Cholesky

for k = 1, ..., n do
  l_kk = sqrt( a_kk^(k−1) )
  for i = k+1, ..., n do
    l_ik = a_ik^(k−1) / l_kk
    for j = k+1, ..., i do
      a_ij^(k) = a_ij^(k) − l_ik l_jk
    end for
  end for
end for
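As an illustration, here is a minimal dense right-looking Cholesky in Python/NumPy; this is a sketch with illustrative names, not the slides' reference code:

import numpy as np

def cholesky_right_looking(A):
    """Dense right-looking Cholesky: returns lower-triangular L with A = L L^T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(n):
        L[k, k] = np.sqrt(A[k, k])
        L[k+1:, k] = A[k+1:, k] / L[k, k]
        # right-looking: immediately update the trailing submatrix with column k
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], L[k+1:, k])
    return L

A = np.array([[4., 2.], [2., 3.]])
L = cholesky_right_looking(A)
assert np.allclose(L @ L.T, A)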

Factorization of sparse matrices: problems

The factorization of a sparse matrix is problematic due to the presence of fill-in. The basic LU step:

a_ij^(k+1) = a_ij^(k) − a_ik^(k) a_kj^(k) / a_kk^(k)

Even if a_ij^(k) is zero, a_ij^(k+1) can be nonzero.

[Figures reproduced from J. W. H. Liu: Fig. 2.1, an example of matrix structures (the matrix A and its filled matrix F); Fig. 2.2, the corresponding graph structures G(A), G(F) and the elimination tree G(F^t) = T(A).]

• the factorization is more expensive than O(nz)
• a higher amount of memory is required than O(nz)
• more complicated algorithms are needed to achieve the factorization

Factorization of sparse matrices: problems

Because of the fill-in, the matrix factorization is commonly preceded by an analysis phase where:

• the fill-in is reduced through matrix permutations

• the fill-in is predicted through the use of (e.g.) elimination graphs

• computations are structured using elimination trees

The analysis must complete much faster than the actual factorization.

Analysis

Fill-reducing ordering methods

Three main classes of methods for minimizing fill-in during factorization:

Local approaches: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.

• The method is characterized by the way pivots are selected.

• Markowitz criterion (for a general matrix)
• Minimum degree (for symmetric matrices)

Global approaches: the matrix is permuted so as to confine the fill-in within certain parts of the permuted matrix.

• Cuthill-McKee, Reverse Cuthill-McKee
• Nested dissection

Hybrid approaches: first permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics.

The elimination process in the graphs

G0(V,E) ← undirected graph of A
for k = 1 : n−1 do
  V ← V − {k}                 (remove vertex k)
  E ← E − {(k,ℓ) : ℓ ∈ adj(k)} ∪ {(x,y) : x ∈ adj(k) and y ∈ adj(k)}
  Gk ← (V,E)
end for

The Gk are the so-called elimination graphs (Parter, '61).

[Figure: H0 = G0 and the successive elimination graphs Gk for a 6-vertex example graph.]
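A small Python sketch of the elimination-graph process above; the example graph and node numbering are illustrative, not the slides' figure:

def elimination_graphs(n, edges):
    """Build the elimination graphs G_1, ..., G_{n-1} of an undirected graph.
    Eliminating vertex k removes it and connects all its remaining neighbours."""
    adj = {v: set() for v in range(1, n + 1)}
    for (x, y) in edges:
        adj[x].add(y)
        adj[y].add(x)
    graphs = []
    for k in range(1, n):
        nbrs = adj.pop(k)
        for v in nbrs:
            adj[v].discard(k)
            adj[v] |= (nbrs - {v})   # clique of the neighbours of k (fill edges)
        graphs.append({v: set(nb) for v, nb in adj.items()})
    return graphs

# 6-vertex example (edges are illustrative)
G = elimination_graphs(6, [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6)])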

The Elimination Tree

The elimination tree is a graph that expresses all the dependencies in the factorization in a compact way.

Elimination tree: basic idea

If i depends on j (i.e., l_ij ≠ 0, i > j) and k depends on i (i.e., l_ki ≠ 0, k > i), then k depends on j by transitivity; therefore, the direct dependency expressed by l_kj ≠ 0, if it exists, is redundant.

For each column j < n of L, remove all the nonzeros in column j except the first one below the diagonal. Let L_t denote the remaining structure and consider the matrix F_t = L_t + L_t^T. The graph G(F_t) is a tree called the elimination tree.

[Figures reproduced from J. W. H. Liu: Fig. 2.1, an example of matrix structures; Fig. 2.2, the corresponding graph structures G(A), G(F) and the elimination tree G(F^t) = T(A).]
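Since the parent of j in the elimination tree is the row index of the first subdiagonal nonzero of column j of L, the tree can be sketched from the structure of L as below; the toy structure is hypothetical, not the slides' example:

def elimination_tree_from_L(L_struct, n):
    """Parent pointers of the elimination tree from the structure of L:
    parent(j) = smallest i > j with l_ij nonzero (0 if j is a root).
    L_struct[j] is the set of row indices of the nonzeros in column j."""
    parent = {}
    for j in range(1, n + 1):
        below = [i for i in L_struct[j] if i > j]
        parent[j] = min(below) if below else 0
    return parent

# toy structure (illustrative): column j -> rows of nonzeros in L(:, j)
L_struct = {1: {1, 3}, 2: {2, 3}, 3: {3, 4}, 4: {4}}
print(elimination_tree_from_L(L_struct, 4))   # {1: 3, 2: 3, 3: 4, 4: 0}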

Uses of the elimination tree

The elimination tree has several uses in the factorization of a sparse matrix:

• it expresses the order in which variables can be eliminated: the elimination of a variable only affects (directly or indirectly) its ancestors and only depends on its descendants.

Therefore, any topological order of the elimination tree leads to a correct result and to the same fill-in.

• it expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel.

Matrix factorization

Cholesky on a sparse matrix

The Cholesky factorization of a sparse matrix can be achieved with a left-looking, right-looking or multifrontal method.

Reference case: regular 3 × 3 grid ordered by nested dissection. Nodes in the separators are ordered last (see the section on orderings).

  1 7 4
  3 8 6
  2 9 5

[Figure: the elimination tree of the 3 × 3 grid: leaves 1, 2 are children of 3, leaves 4, 5 are children of 6, and the separator variables 7, 8, 9 are at the top.]

Notation:

• cdiv(j): compute √A(j,j) and then A(j+1:n, j) / √A(j,j)

• cmod(j,k): A(j:n, j) ← A(j:n, j) − A(j,k) ∗ A(j:n, k)

• struct(L(1:k,j)): the structure of the L(1:k,j) submatrix

Sparse left-looking Cholesky

left-looking

for j=1 to n do
  for k in struct(L(j,1:j-1)) do
    cmod(j,k)
  end for
  cdiv(j)
end for

In the left-looking method, before variable j is eliminated, column j is updated with all the columns that have a nonzero on row j. In the example above, struct(L(7,1:6)) = {1, 3, 4, 6}.

• this corresponds to receiving updates from nodes lower in the subtree rooted at j

• the filled graph is necessary to determine the structure of each row

Sparse right-looking Cholesky

right-looking

for k=1 to n do
  cdiv(k)
  for j in struct(L(k+1:n,k)) do
    cmod(j,k)
  end for
end for

In the right-looking method, after variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9,3)) = {7, 8, 9}.

• this corresponds to sending updates to nodes higher in the elimination tree

• the filled graph is necessary to determine the structure of each column

The Multifrontal Method

Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and, at each node:

1. assembly of the front
2. partial factorization

[Figure: elimination tree of the 3 × 3 grid example; the fronts at nodes 4, 5 and 6 involve the variables {4,6,7}, {5,6,9} and {6,7,8,9} respectively.]

Front at node 4 (the partial factorization eliminates variable 4 and produces the contribution block b):

  [ a44 a46 a47 ]      [ l44         ]
  [ a64  0   0  ]  →   [ l64 b66 b67 ]
  [ a74  0   0  ]      [ l74 b76 b77 ]

Front at node 5 (produces the contribution block c):

  [ a55 a56 a59 ]      [ l55         ]
  [ a65  0   0  ]  →   [ l65 c66 c69 ]
  [ a95  0   0  ]      [ l95 c96 c99 ]

Front at node 6: assembly of the original entries with the contribution blocks b and c, then partial factorization:

  [ a66  0  a68  0 ]   [ b66 b67  0   0 ]   [ c66  0   0  c69 ]      [ l66             ]
  [  0   0   0   0 ] + [ b76 b77  0   0 ] + [  0   0   0   0  ]  →   [ l76 d77 d78 d79 ]
  [ a86  0   0   0 ]   [  0   0   0   0 ]   [  0   0   0   0  ]      [ l86 d87 d88 d89 ]
  [  0   0   0   0 ]   [  0   0   0   0 ]   [ c96  0   0  c99 ]      [ l96 d97 d98 d99 ]

The Multifrontal Method

A dense matrix, called the frontal matrix, is associated with each node of the elimination tree. The multifrontal method consists in a bottom-up traversal of the tree where, at each node, two operations are done:

• Assembly: nonzeros from the original matrix (the arrowhead) are assembled together with the contribution blocks (cb-1, ..., cb-n) from the children nodes into the frontal matrix.

• Elimination: a partial factorization of the frontal matrix is done. The variable associated with the node of the elimination tree (called fully assembled) can be eliminated. This step produces part of the final factors (L) and a Schur complement (contribution block, cb) that will be assembled into the father node.
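A minimal Python sketch of the assembly (extend-add) step, assuming the contribution blocks are dense matrices indexed by their global variables; values and variable lists below are illustrative:

import numpy as np

def assemble_front(front_vars, arrowhead, child_cbs):
    """Assembly (extend-add) of a frontal matrix, a sketch.
    front_vars: ordered list of global variables of the front;
    arrowhead: dict {(i, j): value} of original-matrix entries of the front;
    child_cbs: list of (vars, matrix) contribution blocks from the children."""
    pos = {v: k for k, v in enumerate(front_vars)}
    F = np.zeros((len(front_vars), len(front_vars)))
    for (i, j), val in arrowhead.items():
        F[pos[i], pos[j]] += val
    for cb_vars, cb in child_cbs:
        idx = [pos[v] for v in cb_vars]        # scatter the CB into the front
        F[np.ix_(idx, idx)] += cb
    return F

# front at node 6 of the example: variables {6, 7, 8, 9}, with CBs b and c
b = np.arange(4.0).reshape(2, 2)               # CB from node 4, on variables (6, 7)
c = np.arange(4.0).reshape(2, 2) + 10          # CB from node 5, on variables (6, 9)
F6 = assemble_front([6, 7, 8, 9],
                    {(6, 6): 1.0, (6, 8): 2.0, (8, 6): 2.0},
                    [((6, 7), b), ((6, 9), c)])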

The Multifrontal Method: example

[Figure: multifrontal factorization steps on a small elimination tree with nodes 1–6.]

Solve

Once the matrix is factorized, the problem can be solved against one or more right-hand sides:

A X = L L^T X = B,    A, L ∈ R^{n×n},  X ∈ R^{n×k},  B ∈ R^{n×k}

The solution of this problem can be achieved in two steps:

forward substitution:  L Z = B

backward substitution: L^T X = Z
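A minimal dense sketch of the two solve steps in Python/NumPy, assuming the Cholesky factor L is available; it ignores sparsity and the elimination tree:

import numpy as np

def solve_cholesky(L, B):
    """Solve A X = B given A = L L^T (sketch): forward then backward substitution."""
    n = L.shape[0]
    Z = np.zeros_like(B, dtype=float)
    X = np.zeros_like(B, dtype=float)
    for i in range(n):                       # forward: L Z = B
        Z[i] = (B[i] - L[i, :i] @ Z[:i]) / L[i, i]
    for i in reversed(range(n)):             # backward: L^T X = Z
        X[i] = (Z[i] - L[i+1:, i] @ X[i+1:]) / L[i, i]
    return X

A = np.array([[4., 2.], [2., 3.]])
L = np.linalg.cholesky(A)
B = np.array([1., 2.])
assert np.allclose(A @ solve_cholesky(L, B), B)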

Solve: left-looking

[Figure: left-looking forward and backward substitution on the elimination tree of the 3 × 3 grid example.]

Solve: right-looking

[Figure: right-looking forward and backward substitution on the elimination tree of the 3 × 3 grid example.]

Direct solvers: summary

The solution of a sparse linear system can be achieved in three phases that include at least the following operations:

Analysis

• Fill-reducing permutation
• Symbolic factorization
• Elimination tree computation

Factorization

• The actual matrix factorization

Solve

• Forward substitution
• Backward substitution

These phases, especially the analysis, can include many other operations. Some of them are presented next.

Parallelism

Parallelization: two levels of parallelism

tree parallelism: arising from sparsity, it is formalized by the fact that nodes in separate subtrees of the elimination tree can be eliminated at the same time

node parallelism: within each node, parallel dense LU factorization (BLAS)

[Figure: going up the elimination tree, node parallelism increases while tree parallelism decreases.]

Exploiting the second level of parallelism is crucial

Multifrontal factorization

Computer        #procs    (1) MFlops (speed-up)    (2) MFlops (speed-up)
Alliant FX/80      8            15 (1.9)                  34 (4.3)
IBM 3090J/6VF      6           126 (2.1)                 227 (3.8)
CRAY-2             4           316 (1.8)                 404 (2.3)
CRAY Y-MP          6           529 (2.3)                1119 (4.8)

Performance summary of the multifrontal factorization on matrix BCSSTK15. In column (1), we exploit only parallelism from the tree. In column (2), we combine the two levels of parallelism.

Task mapping and scheduling

Task mapping and scheduling: objective

Organize work to achieve a goal like makespan (total execution time) minimization, memory minimization or a mixture of the two:

• Mapping: where to execute a task?

• Scheduling: when and where to execute a task?

• Several approaches are possible:

static: build the schedule before the execution and follow it at run-time
  • Advantage: very efficient since it has a global view of the system
  • Drawback: requires a very good performance model for the platform

dynamic: take scheduling decisions dynamically at run-time
  • Advantage: reactive to the evolution of the platform and easy to use on several platforms
  • Drawback: decisions are taken with local criteria (a decision which seems to be good at time t can have very bad consequences at time t+1)

hybrid: try to combine the advantages of static and dynamic

Task mapping and scheduling

The mapping/scheduling is commonly guided by a number of parameters:

• First of all, a mapping/scheduling which is good for concurrency is commonly not good in terms of memory consumption.

[Figure: an example assembly tree with 18 nodes, illustrating the trade-off between concurrency and memory.]

• Especially in distributed-memory environments, data transfers have to be limited as much as possible.

Dynamic scheduling on shared memory computers

The data can be shared between processors without any communication∗, therefore a dynamic scheduling is relatively simple to implement and efficient:

• The dynamic scheduling can be done through a pool of "ready" tasks (a task can be, for example, seen as the assembly and factorization of a front)

• as soon as a task becomes ready (i.e., when all the tasks associated with the child nodes are done), it is queued in the pool

• processes pick up ready tasks from the pool and execute them. As soon as one process finishes a task, it goes back to the pool to pick up a new one, and this continues until all the nodes of the tree are done

• the difficulty lies in choosing one task among all those in the pool, because the order can influence both performance and memory consumption

Decomposition of the tree into levels

• Determination of Level L0 based on subtree cost.

[Figure: the level L0 of subtree roots; the processors {1, ..., 8} are assigned to the top of the tree and subsets of them to the subtrees rooted at L0.]

• Scheduling of the top of the tree can be dynamic.

Simplifying the mapping/scheduling problem

The tree commonly contains many more nodes than the available processes. Objective: find a layer L0 such that the subtrees of L0 can be mapped onto the processors with a good balance.

[Figure: steps A, B and C of the construction of the initial level L0.]

Construction and mapping of the initial level L0

Let L0 ← roots of the assembly tree
repeat
  Find the node q in L0 whose subtree has the largest computational cost
  Set L0 ← (L0 \ {q}) ∪ {children of q}
  Greedy mapping of the nodes of L0 onto the processors
  Estimate the load unbalance
until load unbalance < threshold
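A Python sketch of this construction using the stated greedy mapping; subtree costs, children lists and the threshold below are illustrative assumptions:

def build_level_L0(subtree_cost, children, roots, nprocs, threshold=0.05):
    """Greedy construction of the initial layer L0 (a sketch).
    subtree_cost[n]: cost of the complete subtree rooted at n;
    children[n]: list of children of node n; roots: roots of the assembly tree."""
    L0 = list(roots)
    while True:
        # greedy mapping: largest subtrees first, each onto the least-loaded processor
        load = [0.0] * nprocs
        for n in sorted(L0, key=lambda n: -subtree_cost[n]):
            load[load.index(min(load))] += subtree_cost[n]
        unbalance = (max(load) - min(load)) / max(load)
        if unbalance < threshold:
            return L0, load
        q = max(L0, key=lambda n: subtree_cost[n])   # most expensive subtree in L0
        if not children[q]:
            return L0, load                          # cannot be refined any further
        L0.remove(q)
        L0.extend(children[q])                       # replace q by its children

# illustrative 7-node tree: 7 is the root, with children 3 and 6
children = {1: [], 2: [], 3: [1, 2], 4: [], 5: [], 6: [4, 5], 7: [3, 6]}
cost = {1: 4, 2: 5, 3: 12, 4: 6, 5: 3, 6: 15, 7: 40}
print(build_level_L0(cost, children, roots=[7], nprocs=2))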

Static scheduling on distributed memory computers

Main objective: reduce the volume of communication between processors, because data is transferred through the (slow) network.

Proportional mapping

• initially assigns all processes to the root node.

• performs a top-down traversal of the tree where the processes assigned to a node are subdivided among its children in a way that is proportional to their relative weight.

[Figure: proportional mapping of an 18-node example tree onto 5 processors, e.g. {1,2,3,4,5} at the root split into {1,2,3} and {4,5} on its children, and so on down to single processors at the leaves.]

Mapping of the tasks onto the 5 processors: good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.

Assumptions and Notations

• Assumptions:
  We assume that each column of L / each node of the tree is assigned to a single processor.
  Each processor is in charge of computing cdiv(j) for the columns j that it owns.

• Notation:
  mycols(p) is the set of columns owned by processor p.
  map(j) gives the processor owning column j (or task j).
  procs(L(:,k)) = { map(j) | j ∈ struct(L(:,k)) }
  (only processors in procs(L(:,k)) require updates from column k; they correspond to ancestors of k in the tree).
  father(j) is the father of node j in the elimination tree.

Computational strategies for parallel direct solvers

The parallel algorithm is characterized by:

• Computational graph dependency

• Communication graph

There are three classical approaches to distributed memory parallelism:

1. Fan-in: the fan-in algorithm is very similar to the left-looking approach and is demand-driven: the data required are aggregated update columns computed by the sending processor.

2. Fan-out: the fan-out algorithm is very similar to the right-looking approach and is data-driven: data is sent as soon as it is produced.

3. Multifrontal: the communication pattern follows a bottom-up traversal of the tree. Messages are contribution blocks and are sent to the processor mapped on the father node.

Fan-in variant (similar to left looking)

fan-in

for j=1 to n
  u = 0
  for all k in ( struct(L(j,1:j-1)) ∩ mycols(p) )
    cmod(u,k)
  end for
  if map(j) != p
    send u to processor map(j)
  else
    incorporate u in column j
    receive all the updates on column j and incorporate them
    cdiv(j)
  end if
end for

Fan-in variant

[Figure: a subtree whose nodes are mapped on processors P0, P1, P2, P3 with the father mapped on P4; in the extreme case all the children are mapped on P0.]

if ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, (only) one message is sent by P0 → exploits the data locality of the proportional mapping.

Fan-out variant (similar to right-looking)

fan-out

for all leaf nodes j in mycols(p)
  cdiv(j)
  send column L(:,j) to procs(L(:,j))
  mycols(p) = mycols(p) - {j}
end for
while mycols(p) != ∅
  receive any column (say L(:,k))
  for j in struct(L(:,k)) ∩ mycols(p)
    cmod(j,k)
    if column j is completely updated
      cdiv(j)
      send column L(:,j) to procs(L(:,j))
      mycols(p) = mycols(p) - {j}
    end if
  end for
end while

Fan-out variant

[Figure: a subtree whose nodes are all mapped on processor P0 with the father mapped on P4.]

if ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.

Fan-out variant

Properties of fan-out:

• Historically the first implemented.

• Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of:
  • total number of messages
  • total volume

• Does not exploit the data locality of the proportional mapping.

• Improved algorithm (local aggregation): send aggregated update columns instead of individual factor columns for columns mapped on a single processor. This improves the exploitation of the data locality of the proportional mapping, but the memory increase to store the aggregates can be critical (as in fan-in).

Multifrontal

multifrontal

for all leaf nodes j in mycols(p)
  assemble front j
  partially factorize front j
  send the Schur complement to procs(father(j))
  mycols(p) = mycols(p) - {j}
end for
while mycols(p) != ∅
  receive any contribution block (say, for node j)
  assemble contribution block into front j
  if front j is completely assembled
    partially factorize front j
    send the Schur complement to procs(father(j))
    mycols(p) = mycols(p) - {j}
  end if
end while

Multifrontal variant

[Figure: comparison of the communication patterns of the fan-in, fan-out and multifrontal variants on a small tree mapped on processors P0, P1 and P2.]

Importance of the shape of the tree

Suppose that each node in the tree corresponds to a task that:

• consumes temporary data from the children,

• produces temporary data, that is passed to the parent node.

• Wide tree:
  • good parallelism
  • many temporary blocks to store
  • large memory usage

• Deep tree:
  • less parallelism
  • smaller memory usage

Impact of fill reduction on the shape of the tree

Reordering technique    Shape of the tree / observations

AMD     • Deep, well-balanced
        • Large frontal matrices on top

AMF     • Very deep, unbalanced
        • Small frontal matrices

PORD    • Deep, unbalanced
        • Small frontal matrices

SCOTCH  • Very wide, well-balanced
        • Large frontal matrices

METIS   • Wide, well-balanced
        • Smaller frontal matrices (than SCOTCH)

Memory consumption in the multifrontal method

Equivalent reorderings

A tree can be regarded as a dependency graph where a node produces data that is then processed by its parent. By this definition, a tree has to be traversed in topological order. But there are still many topological traversals of the same tree.

Definition

Two orderings P and Q are equivalent if the structures of the filled graphs of PAP^T and QAQ^T are the same (that is, they are isomorphic).

Equivalent orderings result in the same amount of fill-in and computation during factorization. To ease the notation, we discuss only one ordering w.r.t. A, i.e., P is an equivalent ordering of A if the filled graphs of A and of PAP^T are isomorphic.

Equivalent reorderings

All topological orderings of T(A) are equivalent

Let P be the permutation matrix corresponding to a topological ordering of T(A). Then G+(PAP^T) and G+(A) are isomorphic.

Let P be the permutation matrix corresponding to a topological ordering of T(A). Then the elimination trees T(PAP^T) and T(A) are isomorphic.

Because the fill-in won't change, we have the freedom to choose any specific topological order that provides other properties.

Tree traversal orders

Which specific topological order to choose? Postorder: why? Because data produced by the nodes is consumed by their parents in a LIFO order. In the multifrontal method, we can thus use a stack memory where contribution blocks are pushed as soon as they are produced by the elimination on a frontal matrix, and popped at the moment where the father node is assembled. This provides better data locality and makes the memory management easier.

The multifrontal method [Duff & Reid '83]

[Figure: a 5 × 5 example matrix A and its filled factors L + U − I, with the corresponding elimination tree on nodes 1–5; the animation steps through the tree in postorder.]

Storage is divided into two parts:

• Factors

• Active memory = active frontal matrix + stack of contribution blocks

Example 1: Processing a wide tree

[Figure: a wide tree with nodes 1–7 and the evolution of the memory (unused, stack, factor and non-free memory space) during its postorder traversal; many contribution blocks are stacked simultaneously, so the active memory grows.]

Example 2: Processing a deep tree

[Figure: a deep tree (1, 2 → 3 → 4) and the evolution of the memory (unused, stack, factor and non-free memory space) during its traversal; for node 3 the snapshots show the allocation, assembly, factorization and stack steps.]

Postorder traversals: memory

Postorder provides good data locality and better memory consumption than a general topological order, since father nodes are assembled as soon as their children have been processed. But there are still many postorders of the same tree. Which one to choose? The one that minimizes memory consumption.

[Figure: two postorders of the same tree, with leaves a, b, ..., h and root i; the best one (abcdefghi) and the worst one (hfdbacegi) lead to very different stack memory peaks.]

Problem model

• Mi: memory peak for the complete subtree rooted at i,

• tempi: temporary memory produced by node i,

• mparent: memory for storing the parent.

[Figure: a parent node with children subtrees of peaks M1, M2, M3 and temporary memories temp1, temp2, temp3.]

Mparent = max( max_{j=1..nbchildren} ( Mj + Σ_{k=1}^{j−1} tempk ),  mparent + Σ_{j=1}^{nbchildren} tempj )      (1)

Objective: order the children to minimize Mparent.

Memory-minimizing schedules

Theorem

[Liu, '86] The minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i.

Corollary

An optimal child sequence is obtained by rearranging the children nodes in decreasing order of Mi − tempi.

Interpretation: at each level of the tree, a child with a relatively large peak of memory in its subtree (Mi large with respect to tempi) should be processed first.

⇒ Apply on the complete tree starting from the leaves (or from the root with a recursive approach)

Optimal tree reordering

Objective: Minimize peak of stack memory

Tree_Reorder(T):
  for all i in the set of root nodes do
    Process_Node(i)
  end for

Process_Node(i):
  if i is a leaf then
    Mi = mi
  else
    for j = 1 to nbchildren do
      Process_Node(jth child)
    end for
    Reorder the children of i in decreasing order of (Mj − tempj)
    Compute Mparent at node i using Formula (1)
  end if
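A Python sketch of this reordering, applying the corollary and Formula (1) bottom-up; the data structures and the small example are illustrative:

def tree_reorder(children, m, temp, root):
    """Order children to minimize the stack peak (a sketch of Liu's reordering).
    children[i]: list of children of node i; m[i]: memory of the front at i;
    temp[i]: contribution block size of i. Returns the peak M[root]."""
    M = {}

    def process(i):
        if not children[i]:
            M[i] = m[i]
            return M[i]
        for c in children[i]:
            process(c)
        # decreasing order of (M_j - temp_j): children with large relative peaks first
        order = sorted(children[i], key=lambda c: M[c] - temp[c], reverse=True)
        children[i] = order
        stacked = 0
        peak = 0
        for c in order:
            peak = max(peak, M[c] + stacked)
            stacked += temp[c]
        M[i] = max(peak, m[i] + stacked)     # Formula (1)
        return M[i]

    return process(root)

children = {1: [], 2: [], 3: [1, 2]}
m = {1: 5, 2: 8, 3: 4}
temp = {1: 2, 2: 3, 3: 0}
print(tree_reorder(children, m, temp, root=3))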

Memory consumption in parallel

In parallel, multiple branches have to be traversed at the same time in order to feed all the processes. This means that a higher number of CBs will have to be stored in memory.

Metric: memory efficiency

e(p) = Sseq / (p × Smax(p))

We would like e(p) ≃ 1, i.e. Sseq/p on each processor.

Common mappings/schedulings provide a poor memory efficiency:

• Proportional mapping: lim_{p→∞} e(p) = 0 on regular problems.

Proportional mapping

Proportional mapping: top-down traversal of the tree, where the processors assigned to a node are distributed among its children proportionally to the weight of their respective subtrees.

[Figure: the 64 processors at the root are split into 26, 6 and 32 for children subtrees weighing 40%, 10% and 50% of the work, and so on recursively down the tree (13/13, 2/2/2, 11/10, ...).]

• Targets run time but poor memory efficiency.
• Usually a relaxed version is used: more memory-friendly but unreliable estimates.

Proportional mapping

Proportional mapping: assuming that the sequential peak is 5 GB,

Smax(p) > max( 4 GB / 26,  1 GB / 6,  5 GB / 32 ) = 0.16 GB   ⇒   e(p) ≤ 5 / (64 × 0.16) ≈ 0.5

[Figure: 64 processors at the root; subtrees with peaks of 4 GB, 1 GB and 5 GB are mapped on 26, 6 and 32 processors respectively.]

A more memory-friendly strategy...

All-to-all mapping: postorder traversal of the tree, where all the processors work at every node.

[Figure: all 64 processors are assigned to every node of the tree.]

Optimal memory scalability (Smax(p) = Sseq/p) but no tree parallelism and prohibitive amounts of communications.

“memory-aware” mappings

“Memory-aware” mapping (Agullo '08): aims at enforcing a given memory constraint (M0, maximum memory per processor):

1. Try to apply proportional mapping.
2. Enough memory for each subtree? If not, serialize them, and update M0: processors stack equal shares of the CBs from previous nodes.

[Figure: when the proportional mapping of the 64 processors violates the constraint on the subtrees (peaks 4 GB, 1 GB, 5 GB), the subtrees are serialized and each is given all 64 processors; deeper in the tree, proportional mapping is attempted again (32/32, 22/21/21, ...).]

• Ensures the given memory constraint and provides reliable estimates.
• Tends to assign many processors on nodes at the top of the tree ⇒ performance issues on parallel nodes.

“memory-aware” mappings

subroutine ma_mapping(r, p, M, S)

! r : root of the subtree

! p(r) : number of processes mapped on r

! M(r) : memory constraint on r

! S(:) : peak memory consumption for all nodes

call prop_mapping(r, p, M)

forall (i children of r)

! check if the constraint is respected

if( S(i)/p(i) > M(i) ) then

! reject PM and revert to tree serialization

call tree_serialization(r, p, M)

exit ! from forall block

end if

end forall

forall (i children of r)

! apply MA mapping recursively to all siblings

call ma_mapping(i, p, M, S)

end forall

end subroutine ma_mapping

“memory-aware” mappings

subroutine prop_mapping(r, p, M)

forall (i children of r)

p(i) = share of the p(r) processes from PM

M(i) = M(r)

end forall

end subroutine prop_mapping

subroutine tree_serialization(r, p, M)

stack_siblings = 0

forall (i children of r) ! in appropriate order

p(i) = p(r)

! update the memory constraint for the subtree

M(i) = M(r) - stack_siblings

stack_siblings = stack_siblings + cb(i)/p(r)

end forall

end subroutine tree_serialization

“memory-aware” mappings

A finer “memory-aware” mapping? Serializing all the children at once is very constraining: more tree parallelism can be found.

[Figure: instead of serializing all the children of the root, the 64 processors are assigned to groups of children that are serialized.]

Find groups on which proportional mapping works, and serialize these groups.

Heuristic: follow a given order (e.g. the serial postorder) and form groups as large as possible.

Experiments

• Matrix: finite-difference model of acoustic wave propagation, 27-point stencil, 192 × 192 × 192 grid; Seiscope consortium. N = 7 M, nnz = 189 M, factors = 152 GB (METIS). Sequential peak of active memory: 46 GB.

• Machine: 64 nodes with two quad-core Xeon X5560 per node. We use 256 MPI processes.

• Perfect memory scalability: 46 GB / 256 = 180 MB.

Experiments

                        Prop.     Memory-aware mapping
                        map.      M0 = 225 MB    M0 = 380 MB
                                                 w/o groups    w/ groups
Max stack peak (MB)     1932          227            330           381
Avg stack peak (MB)      626          122            177           228
Time (s)                1323         2690           2079          1779

• proportional mapping delivers the fastest factorization because it targets parallelism but, at the same time, it leads to a poor memory scaling, with e(256) = 0.09

• the memory-aware mapping, with and without groups, respects the imposed memory constraint, with a speed that depends on how tight the constraint is

• grouping clearly provides a benefit in terms of speed because it delivers more concurrency while still respecting the constraint

Low-Rank approximations

Low-rank matrices

Take a dense matrix B of size n × n and compute its SVD B = XSY:

B = X1 S1 Y1 + X2 S2 Y2   with   S1(k,k) = σ_k > ε,   S2(1,1) = σ_{k+1} ≤ ε

If B̃ = X1 S1 Y1, then ‖B − B̃‖2 = ‖X2 S2 Y2‖2 = σ_{k+1} ≤ ε.

If the singular values of B decay very fast (e.g. exponentially), then k ≪ n even for very small ε (e.g. 10^-14) ⇒ memory and CPU consumption can be reduced considerably, with a controlled loss of accuracy (≤ ε), if B̃ is used instead of B.
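A small Python/NumPy sketch of this truncated-SVD compression; the test matrix is an illustrative smooth kernel, not one of the slides' examples:

import numpy as np

def truncated_svd(B, eps):
    """Compress B into a low-rank product X @ Y with ||B - X @ Y||_2 <= eps (sketch)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > eps))          # keep the singular values above the threshold
    X = U[:, :k] * s[:k]              # n x k
    Y = Vt[:k, :]                     # k x n
    return X, Y

# a matrix with rapidly decaying singular values (illustrative)
n = 200
x = np.linspace(0, 1, n)
B = 1.0 / (1.0 + 50.0 * np.abs(x[:, None] - x[None, :] + 0.5))
X, Y = truncated_svd(B, eps=1e-10)
print(X.shape[1], np.linalg.norm(B - X @ Y, 2))   # rank k << n, error <= 1e-10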

Multifrontal BLR LU factorization

Frontal matrices are not low-rank but in many applications they exhibit low-rank blocks.

[Figure: a frontal matrix partitioned into blocks L_{i,j}, U_{i,j} on the fully-summed part and a contribution block CB.]

Because each row and column of a frontal matrix is associated with a point of the discretization mesh, blocking the frontal matrix amounts to grouping the corresponding points of the domain. The objective is thus to group (cluster) the points in such a way that the number of blocks with the low-rank property is maximized and their respective rank is minimized.

Clustering of points/variables

A block represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away, the interaction is weak ⇒ the rank is low.

[Figure: rank of a block of size 128 as a function of the distance between σ and τ; the rank decreases as the distance increases.]

Guidelines for computing the grouping of points:

• ∀σ, cmin ≤ |σ| ≤ cmax: we want many compressible blocks, but not too small, to maintain a good compression rate and a good efficiency of operations.

• ∀σ, minimize diam(σ): for a given cluster size, this gives well-shaped clusters and consequently maximizes mutual distances.

Clustering

For example, if the points associated with the rows/columns of a frontal matrix lie on a square surface, it is much better to cluster in a checkerboard fashion (right) rather than into slices (left).

In practice, and in an algebraic context (where the discretization mesh is not known), this can be achieved by extracting the subgraph associated with the front variables and by computing a k-way partitioning (with a tool like METIS or SCOTCH, for example).

Here is an example of variable clustering achieved with the x-way partitioning method in the SCOTCH package. Although irregular, the partitioning provides relatively well-shaped clusters.

Admissibility condition

Once the clustering of variables is defined, it is possible to compress all the blocks F(σ, τ) that satisfy a (strong) admissibility condition:

Adm_s(σ, τ): max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

For simplicity, we can use a relaxed version called the weak admissibility condition:

Adm_w(σ, τ): dist(σ, τ) > 0.

The far field of a cluster σ is defined as

F(σ) = { τ | Adm(σ, τ) is true }.

In practice it is easier to consider all blocks (except the diagonal ones) eligible, try to compress them, and eventually revert to the non-compressed form if the rank is too high.

Dense BLR LU factorization

[Figure: block LU factorization of a dense matrix, stepping through the block columns; at each step the diagonal block is factorized, the panel is solved, the off-diagonal blocks are compressed and the trailing submatrix is updated.]

Off-diagonal blocks are compressed in low-rank form right before updating the trailing submatrix. As a result, updating a block Aij of size b using blocks Lik, Ukj of rank r only takes O(br^2) operations instead of O(b^3).

For each block row/column k, four steps are performed:

• FACTOR:   Akk = Lkk Ukk

• SOLVE:    Ukj = Lkk^{-1} Akj,   Lik = Aik Ukk^{-1}

• COMPRESS: L*k = X*k Y*k^T,   Uk* = Xk* Yk*^T

• UPDATE:   Aij = Aij − Lik Ukj
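A Python/NumPy sketch of the low-rank UPDATE step, showing how grouping the products keeps the cost at O(br^2); sizes and data below are illustrative:

import numpy as np

def blr_update(A_ij, X_ik, Y_ik, X_kj, Y_kj):
    """UPDATE step with low-rank blocks (sketch): A_ij -= L_ik @ U_kj where
    L_ik = X_ik @ Y_ik.T and U_kj = X_kj @ Y_kj.T are stored in low-rank form.
    Grouping the products as shown costs O(b r^2) instead of O(b^3)."""
    core = Y_ik.T @ X_kj              # small r x r matrix: compute it first
    return A_ij - X_ik @ (core @ Y_kj.T)

b, r = 256, 8
rng = np.random.default_rng(0)
A_ij = rng.standard_normal((b, b))
X_ik, Y_ik = rng.standard_normal((b, r)), rng.standard_normal((b, r))
X_kj, Y_kj = rng.standard_normal((b, r)), rng.standard_normal((b, r))
ref = A_ij - (X_ik @ Y_ik.T) @ (X_kj @ Y_kj.T)     # full-rank reference, O(b^3)
assert np.allclose(blr_update(A_ij, X_ik, Y_ik, X_kj, Y_kj), ref)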

Experimental MF complexity

Setting:

1. Poisson: N^3 grid with a 7-point stencil, with u = 1 on the boundary ∂Ω:

   Δu = f

2. Helmholtz: N^3 grid with a 27-point stencil, where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):

   ( −Δ − ω^2 / v(x)^2 ) u(x, ω) = s(x, ω)

Experimental MF complexity: entries in factors

[Plot: number of entries in the factors vs. problem size N (64 to 224).
 Poisson: Full Rank O(n^1.3); BLR (ε = 10^-14 and 10^-10) fits ≈ 56 n^1.07 log(n) and 62 n^1.04 log(n).
 Helmholtz: Full Rank O(n^1.3); BLR (ε = 10^-4) fit ≈ 18 n^1.20 log(n).]

• ε only plays a role in the constant factor

• for Poisson, a factor ∼3 gain with almost no loss of accuracy

Experimental MF complexity: operations

[Plot: flop count vs. problem size N (64 to 224).
 Poisson: Full Rank O(n^2); BLR (ε = 10^-14 and 10^-10) fits ≈ 1065 n^1.50 and 1674 n^1.52.
 Helmholtz: Full Rank O(n^2); BLR (ε = 10^-4) fit ≈ 37 n^1.85.]

• ε only plays a role in the constant factor

• for Poisson, a factor ∼9 gain with almost no loss of accuracy

Experiments: Poisson

[Plot: Poisson equation, 7-point stencil on a 128^3 domain; % of flops and time w.r.t. full rank, and componentwise scaled residual (CSR, with and without iterative refinement) as functions of the BLR accuracy ε from 10^-14 to 10^-2.]

• Compression improves with higher ε values
• Time reduction is not as good as flops due to the lower efficiency of operations
• The Componentwise Scaled Residual is in good accordance with ε
• Iterative Refinement can be used to recover the lost accuracy
• For very high values of ε, the factorization can be used as a preconditioner

Experiments: Helmholtz

[Plot: Helmholtz equation, 27-point stencil on a 256^3 domain; % of flops and time w.r.t. full rank, and componentwise scaled residual (CSR, with and without iterative refinement) as functions of the BLR accuracy ε from 10^-14 to 10^-2.]

• Compression improves with higher ε values
• Time reduction is not as good as flops due to the lower efficiency of operations
• The Componentwise Scaled Residual is in good accordance with ε
• Iterative Refinement can be used to recover the lost accuracy
• For very high values of ε, the factorization can be used as a preconditioner

Sparse QR Factorization

Overdetermined Problems

In the case Ax = b with A ∈ R^{m×n}, m > n, the problem is overdetermined and the existence of a solution is not guaranteed (the solution only exists if b ∈ R(A)). In this case, the desired solution may be the one that minimizes the norm of the residual (least squares problem):

min_x ‖Ax − b‖2

[Figure: b, its projection Ax onto span(A), and the residual r.]

The figure above shows, intuitively, that the residual has minimum norm if it is orthogonal to R(A), i.e., Ax is the projection of b onto R(A).

Least Squares Problems: normal equations

min_x ‖Ax − b‖2  ⇔  min_x ‖Ax − b‖2^2

‖Ax − b‖2^2 = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 x^T A^T b + b^T b

The first derivative with respect to x is equal to zero for x = (A^T A)^{-1} A^T b. This is the solution of the n × n fully determined problem

A^T A x = A^T b

called the normal equations system:

• A^T A is symmetric, and positive definite if rank(A) = n (nonnegative definite otherwise)

• it can be solved with a Cholesky factorization; not necessarily backward stable because κ(A^T A) = κ(A)^2

• A^T (Ax − b) = A^T r = 0 means that the residual is orthogonal to R(A), which delivers the desired solution

Least Squares Problems: the QR factorization

The QR factorization of a matrix A ∈ R^{m×n} consists in finding an orthogonal matrix Q ∈ R^{m×m} and an upper triangular matrix R ∈ R^{n×n} such that

A = Q [ R ]
      [ 0 ]

Assuming Q = [Q1 Q2], where Q1 is made up of the first n columns of Q and Q2 of the remaining m − n, and

Q^T b = [ Q1^T ] b = [ c ]
        [ Q2^T ]     [ d ]

we have

‖Ax − b‖2^2 = ‖Q^T A x − Q^T b‖2^2 = ‖Rx − c‖2^2 + ‖d‖2^2.

In the case of a full-rank matrix A, this quantity is minimized if Rx = c.

Least Squares Problems: the QR factorization

Assume Q = [Q1 Q2] where Q1 contains the first n and Q2 the last m − n columns of Q:

• Q1 is an orthogonal basis for R(A) and Q2 is an orthogonal basis for N(A^T)

• P1 = Q1 Q1^T and P2 = Q2 Q2^T are projectors onto R(A) and N(A^T). Thus, Ax = P1 b and r = b − Ax = P2 b.

[Figure: b, its projection Ax onto span(A), and the residual r.]

Properties of the solution by QR factorization

• it is backward stable

• it requires 3n^2(m − n/3) flops (much more expensive than Cholesky)
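An illustrative Python/NumPy comparison of the two least-squares approaches above (normal equations vs. QR) on random data; this is a sketch, not a stability experiment:

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# normal equations: A^T A x = A^T b (Cholesky-friendly, but squares the condition number)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR: A = Q [R; 0], minimize ||Rx - c|| with c = Q1^T b
Q, R = np.linalg.qr(A)          # "reduced" QR: Q is m x n, R is n x n
x_qr = np.linalg.solve(R, Q.T @ b)

assert np.allclose(x_ne, x_qr)
# the residual is orthogonal to the range of A
assert np.allclose(A.T @ (b - A @ x_qr), 0, atol=1e-10)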

Underdetermined Problems

In the case Ax = b with A ∈ R^{m×n}, m < n, the problem is underdetermined and admits infinitely many solutions; if x_p is any solution of Ax = b, then the complete set of solutions is

{ x | Ax = b } = { x_p + z | z ∈ N(A) }

In this case, the desired solution may be the one with minimum 2-norm (minimum 2-norm solution problem):

minimize ‖x‖2   subject to   Ax = b

Minimum 2-norm solution Problems

The minimum 2-norm solution is

x_mn = A^T (A A^T)^{-1} b

In fact, take any other solution x; then A(x − x_mn) = 0 and

(x − x_mn)^T x_mn = (x − x_mn)^T A^T (A A^T)^{-1} b = (A(x − x_mn))^T (A A^T)^{-1} b = 0

which means (x − x_mn) ⊥ x_mn. Thus

‖x‖2^2 = ‖x_mn + x − x_mn‖2^2 = ‖x_mn‖2^2 + ‖x − x_mn‖2^2

i.e., x_mn has the smallest 2-norm of any solution. As for the normal equations system, this approach incurs instability due to κ(A A^T).

Minimum 2-norm solution Problems: the QR factorization

Assume

Q R = [Q1 Q2] [ R1 ] = A^T
              [  0 ]

is the QR factorization of A^T, where Q1 is made of the first m columns of Q and Q2 of the last n − m ones. Then

Ax = R^T Q^T x = [ R1^T  0 ] [ z1 ] = b,    where z = Q^T x.
                             [ z2 ]

The minimum 2-norm solution follows by setting z2 = 0. Note that Q2 is an orthogonal basis for N(A); thus, the minimum 2-norm solution is achieved by removing from any solution x_p all its components in N(A).

Householder QR

A matrix H of the form

H = I − 2 v v^T / (v^T v)

is called a Householder reflection and v a Householder vector. The transformation

Pv = v v^T / (v^T v)

is a projection onto the space spanned by v.

Householder reflections can be used in such a way to annihilate all the coefficients of a vector x except one, which corresponds to reflecting x onto one of the original axes. If v = x ± ‖x‖2 e1, then Hx has all coefficients equal to zero except the first.
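A small Python/NumPy sketch of a Householder reflection that annihilates all but the first entry of a vector, applied without forming H explicitly:

import numpy as np

def householder_vector(x):
    """Householder vector v with (I - 2 v v^T / v^T v) x = -sign(x1) ||x|| e1 (sketch)."""
    v = x.astype(float).copy()
    v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
    return v

def apply_householder(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x through products with v."""
    return x - 2.0 * v * (v @ x) / (v @ v)

x = np.array([3.0, 4.0, 0.0, 12.0])
v = householder_vector(x)
print(apply_householder(v, x))   # all entries (numerically) zero except the first, = -||x||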

Householder QR

Note that in practice the H matrix is never explicitly formed. Instead, it is applied through products with the vector v. Also, note that if x is sparse, the v vector (and thus the H matrix) follows its structure. Imagine a sparse vector x and a vector x̂ built by removing all the zeros in x:

  x = [ x1 ]        x̂ = [ x1 ]
      [ 0  ]             [ x2 ]
      [ x2 ]

If Ĥ is the Householder reflector that annihilates all the elements of x̂ but the first, then the equivalent transformation H for x is

  v̂ = [ v1 ]    Ĥ = [ H11 H12 ]    ⇒    v = [ v1 ]    H = [ H11  0  H12 ]
      [ v2 ]         [ H21 H22 ]              [ 0  ]        [  0   1   0  ]
                                              [ v2 ]        [ H21  0  H22 ]

Householder QR

The complete QR factorization of a matrix is achieved in n steps (m = 4, n = 3 in this example). At step k, Hk = I − 2 vk vk^T / (vk^T vk) acts on rows k, ..., m and annihilates the subdiagonal entries of column k:

       [ a11 a12 a13 ]   [ r11 r12    r13    ]
H1  ·  [ a21 a22 a23 ] = [  0  a(1)22 a(1)23 ]
       [ a31 a32 a33 ]   [  0  a(1)32 a(1)33 ]
       [ a41 a42 a43 ]   [  0  a(1)42 a(1)43 ]

       [ r11 r12    r13    ]   [ r11 r12 r13    ]
H2  ·  [  0  a(1)22 a(1)23 ] = [  0  r22 r23    ]
       [  0  a(1)32 a(1)33 ]   [  0   0  a(2)33 ]
       [  0  a(1)42 a(1)43 ]   [  0   0  a(2)43 ]

       [ r11 r12 r13    ]   [ r11 r12 r13 ]
H3  ·  [  0  r22 r23    ] = [  0  r22 r23 ]
       [  0   0  a(2)33 ]   [  0   0  r33 ]
       [  0   0  a(2)43 ]   [  0   0   0  ]

where H2 = diag(1, I − 2 v2 v2^T / (v2^T v2)) and H3 = diag(1, 1, I − 2 v3 v3^T / (v3^T v3)), i.e., Hk leaves the first k − 1 rows untouched.

The Householder vectors can be stored in the bottom-left corner of the original matrix where the zeros were introduced. With Q = H1 H2 H3, the original matrix is overwritten as

[ r11 r12 r13 ]
[ v21 r22 r23 ]
[ v31 v32 r33 ]
[ v41 v42 v43 ]
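A sketch of this n-step process in NumPy: reflectors are applied in place and the Householder vectors overwrite the entries they zero out (with the convention v_k[0] = 1, only the subdiagonal part needs to be stored). The function name is illustrative, and the leading entry of each reduced column is assumed nonzero.

import numpy as np

def householder_qr(A):
    A = A.astype(float).copy()
    m, n = A.shape
    for k in range(n):
        x = A[k:, k]
        v = x.copy()
        v[0] += np.sign(x[0]) * np.linalg.norm(x)
        v /= v[0]                                        # normalize so v[0] = 1
        beta = 2.0 / (v @ v)
        A[k:, k:] -= beta * np.outer(v, v @ A[k:, k:])   # apply H_k to the trailing part
        A[k + 1:, k] = v[1:]                             # store v_k below the diagonal
    return A                                             # R on/above the diagonal, V below

A = np.array([[4.0, 1.0, 2.0],
              [2.0, 3.0, 1.0],
              [1.0, 2.0, 5.0],
              [3.0, 1.0, 1.0]])
RV = householder_qr(A)
print(np.triu(RV[:3]))     # matches R from np.linalg.qr(A) up to row signs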

Householder QR

Depending on the position of the coefficients that have to be annihilated, the Householder transformation only touches certain regions of the matrix. Imagine a matrix

A = [ B D ]
    [ 0 E ]

with QB RB = B the QR factorization of B.

Transformations that modify non-overlapping regions of a matrix can be applied independently and thus in parallel.


Sparse Multifrontal QR

Assume a pseudo-sparse matrix A defined as below. Its QR factorization can be achieved in three steps:

A = [ B     F ]
    [ C     G ]
    [    D  H ]
    [    E  L ]
    [       M ]

Step-1

QB RB = [ B ]      A → A1 = [ RB     F ]
        [ C ]               [        G ]
                            [     D  H ]
                            [     E  L ]
                            [        M ]

Step-2

QD RD = [ D ]      A1 → A2 = [ RB     F ]       [ RB     F ]
        [ E ]                [        G ]  = P  [     RD H ]
                             [     RD H ]       [        M ]
                             [        L ]       [        G ]
                             [        M ]       [        L ]

Step-3

QM RM = [ M ]      A2 → R = [ RB     F  ]
        [ G ]               [     RD H  ]
        [ L ]               [        RM ]
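The three steps can be mimicked with dense blocks (a sketch, assuming the block structure reconstructed above; the blocks B, ..., M are arbitrary 2×2 random blocks, while a real multifrontal code works on sparse fronts):

import numpy as np

rng = np.random.default_rng(3)
blk = lambda: rng.standard_normal((2, 2))
B, C, D, E, F, G, H, L, M = (blk() for _ in range(9))
Z = np.zeros((2, 2))

A = np.block([[B, Z, F],
              [C, Z, G],
              [Z, D, H],
              [Z, E, L],
              [Z, Z, M]])

# Step-1: QR of [B; C]; Q_B^T is also applied to the coupled block [F; G]
QB, RB = np.linalg.qr(np.vstack([B, C]), mode='complete')
FG = QB.T @ np.vstack([F, G])        # top rows are final, bottom rows form a contribution block
# Step-2: QR of [D; E]; Q_D^T is also applied to [H; L]
QD, RD = np.linalg.qr(np.vstack([D, E]), mode='complete')
HL = QD.T @ np.vstack([H, L])
# Step-3: QR of the last column block, assembling M with the two contribution blocks
QM, RM = np.linalg.qr(np.vstack([M, FG[2:], HL[2:]]))

R = np.block([[RB[:2], Z, FG[:2]],
              [Z, RD[:2], HL[:2]],
              [Z, Z, RM]])
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))   # same R up to row signs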

Sparse Multifrontal QR

We just did a sparse, multifrontal QR factorization. The previous operation can be arranged in the following elimination tree: [figure]

Most of the theory related to the multifrontal QR factorization can be formalized thanks to structural properties of the normal equations matrix A^T A.

Sparse Multifrontal QR: fill-in

After applying a Householder reflection (or a set of reflections) that affects k rows of a sparse matrix, every column with at least one nonzero in those rows becomes completely filled, i.e., it gets a nonzero in each of those k rows:


The strong Hall property

Based on the relation A^T A = (QR)^T QR = R^T R, the R factor of a matrix A is equal to the Cholesky factor of the corresponding normal equations matrix A^T A. The structure of R can therefore be predicted by a symbolic Cholesky factorization of the A^T A matrix, provided A is a strong Hall matrix.

Strong Hall

A matrix A ∈ R^{m×n}, m ≥ n, is said to be strong Hall if, for every subset of 0 < k < m columns, the corresponding submatrix has nonzeros in at least k + 1 rows.
In this case, the symbolic Cholesky factorization of A^T A correctly predicts the structure of R.

The symbolic Cholesky factorization cannot predict the structural cancellations that happen when A is not strong Hall.

The strong Hall property: two examples

NON-Strong Hall

struct(A)        predicted struct(R)    actual struct(R)
× × × × ×        × × × × ×              × × × × ×
  ×                × × × ×                ×
    ×                  × × ×                  ×
      ×                    × ×                    ×
        ×                      ×                      ×

Strong Hall

struct(A)        predicted struct(R)    actual struct(R)
× × × ×          × × × ×                × × × ×
×                  × × ×                  × × ×
  ×                    × ×                    × ×
    ×                      ×                      ×
      ×
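A small numerical check of the non-strong-Hall example above (a sketch; the matrix values are arbitrary): A has a dense first row and is diagonal below it, so column 1 has a single nonzero and A is not strong Hall.

import numpy as np

A = np.zeros((5, 5))
A[0, :] = 1.0                                  # dense first row
A[np.arange(1, 5), np.arange(1, 5)] = 2.0      # diagonal below it

# A^T A is structurally full, so the symbolic Cholesky of A^T A predicts
# a completely full upper triangular R ...
print((A.T @ A != 0).astype(int))

# ... but the actual R has the same pattern as A: the predicted fill-in
# never appears (structural cancellation).
R = np.linalg.qr(A, mode='r')
print((np.abs(R) > 1e-12).astype(int))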

The strong Hall property

In the case of non-strong Hall matrices, two options are possible:

1. Reduce the matrix to Block-Triangular Form (BTF). In this case all the diagonal blocks have the strong Hall property and can be QR-factorized independently:

[Figures: coarse BTF and fine BTF]

2. Rediscover the structure of R during the factorization phase. The symbolic Cholesky factorization of A^T A overestimates the fill-in in R. This analysis can thus be taken as a starting point and refined during the numerical computation according to the actual fill-in.

Without loss of generality we can assume A is always strong Hall

Sparse Multifrontal QR: the analysis phase

The analysis phase of the sparse, multifrontal QR factorization proceeds in the same way as for the Cholesky or LU factorizations and includes the same operations:

• fill-reducing ordering: this operation aims at computing a column permutation that minimizes the amount of fill-in introduced in R during the factorization. This is achieved by computing a fill-reducing order for the Cholesky factor of A^T A (see the sketch after this list)

• symbolic factorization: this operation computes, symbolically, the structure of the Cholesky factor of A^T A

• elimination tree computation: computation of the elimination tree for the Cholesky factorization of A^T A

• Tree amalgamation

• Tree traversal order

• etc.
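A rough sketch of the ordering step with SciPy (here with reverse Cuthill-McKee, a global method, applied to the graph of A^T A; production QR codes typically use COLAMD or nested dissection instead, and the random test matrix is arbitrary):

import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(30, 20, density=0.1, format='csc', random_state=0)
AtA = (A.T @ A).tocsr()                           # graph of the normal equations matrix
perm = reverse_cuthill_mckee(AtA, symmetric_mode=True)

A_perm = A[:, perm]       # column-permuted matrix handed to the symbolic phase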

Sparse Multifrontal QR: the factorization phase

The numerical factorization phase also proceeds as in multifrontal LU or Cholesky (a bottom-up traversal of the tree), but the factorization and assembly of a frontal matrix are different:

assembly: in the QR factorization the contribution blocks from children nodes are not summed but simply appended to the coefficients from the original matrix. This can be done in two different ways depending on the factorization strategy (below)

factorization: the QR factorization of the frontal matrix can be either partial or full

Sparse Multifrontal QR: the factorization phase

Assume a frontal matrix of size m × n with npiv fully assembled variables. In the first strategy, only npiv out of min(m, n) Householder reflections are applied to the frontal matrix. This leads to a rectangular contribution block.

[Figures: factorization of the front into R, V and a rectangular contribution block (cb); at assembly, the contribution blocks cb1, ..., cb4 of the children are appended]

Sparse Multifrontal QR: the factorization phase

Assume a frontal matrix of size m × n with npiv fully assembled variables. In the second strategy, min(m, n) Householder reflections are applied to the frontal matrix, which leads to a triangular contribution block; this results in a relatively sparse frontal matrix which can be row-permuted in order to move most of the zeros to the bottom-left corner and spare the related computations.

[Figures: factorization of the front into R, V and a triangular contribution block (cb); at assembly, the contribution blocks cb1, cb2, cb3 of the children are appended]


• Strategy 1 is much easier to implement

• Strategy 2:
  - leads to smaller fronts
  - requires less storage
  - minimizes the overall number of operations
  - reduces communications

matrix    | Strat. | Max front | flops      | size Q  | facto time
medium2   |   1    |    6896   | 3197633920 | 4226923 |  1022.68
          |   2    |     689   |  281359552 | 1065149 |   254.34
Nfac70    |   1    |   14284   | 1312200576 | 4269607 |   428.38
          |   2    |     240   |   24316052 |  476636 |    60.34
WEST2021  |   1    |      26   |     409225 |   17796 |     7.78
          |   2    |      26   |     390033 |   22439 |     8.84

Sparse Multifrontal QR

Example: the flower 4 4 matrix


Sparse Direct LU/Cholesky: summary

1. Analysis: a preprocessing phase aiming at improving the structural properties of the problem and preparing the ground for the numerical factorization. Only symbolic operations are done in this phase:
   • fill-reducing ordering/permutations;
   • symbolic factorization;
   • elimination tree computation;
   • tree amalgamation;
   • tree traversal order (for memory minimization);
   • tree mapping (for parallel factorization).

2. Factorization: the actual numerical factorization is performed based on the result of the analysis. This can be done in parallel and, possibly, with pivoting to improve numerical stability (no pivoting in the symmetric positive definite case).

3. Solve: this step solves the problem against one or more right-hand sides using the forward and backward substitutions. This can also be done in parallel, following the same computational pattern as the factorization phase.
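As a rough illustration of the three phases, a sketch using SciPy's SuperLU wrapper on a toy tridiagonal matrix: splu performs the analysis (COLAMD ordering by default) and the numerical factorization; the solve phase is then cheap and can be repeated for many right-hand sides.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 100
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
lu = splu(A)                    # analysis + factorization

for _ in range(3):              # repeated solves reuse the factors
    b = np.random.rand(n)
    x = lu.solve(b)
    assert np.allclose(A @ x, b)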

Iterative vs Direct solvers

Iterative Solvers

+ lower memory consumption

+ fewer flops

+ better scalability (if the preconditioner is not too complex)

− less robust: may not converge or may converge too slowly

Direct Solvers

+ more robust

+ ideal for multiple solutions (the factorization can be reused)

+ works better with multiple right-hand sides

− requires more memory and flops (due to fill-in), especially in 3D problems

− less scalable
