ieee cluster talk

34
High Performance Dense Linear System Solver with Soft Error Resilience Peng Du, Piotr Luszczek, Jack Dongarra June 28, 2022

Upload: peng-du

Post on 12-Apr-2017

372 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Ieee cluster talk

High Performance Dense Linear System Solver with Soft Error Resilience

Peng Du, Piotr Luszczek, Jack Dongarra

May 3, 2023

Page 2: Ieee cluster talk

Agenda

• Soft error threat to the dense linear solver• LU factorization• Error propagation

• Error modeling

• Fault tolerant algorithm

• Performance Evaluation

May 3, 2023 2

Page 3: Ieee cluster talk

Soft error• Silent error due to radiation

• Alpha particle• High energy neutron• Thermal neutron

• Outbreaks• Commercial computing system from Sun Microsystem in 2000• ASC Q supercomputer at Los Alamos National Lab in 2003

May 3, 2023 3

0 0 1 1 0 10 1

1 0 1 1 0 10 1

Page 4: Ieee cluster talk

LU based linear solver

Page 5: Ieee cluster talk

Block LU factorization

GETF2 TRSM

GEMM GETF2 STRSM

Page 6: Ieee cluster talk

General work flow

(1) Generate checksum for the input matrix as additional columns

(2) Perform LU factorization WITH the additional checksum columns

(3) Solve Ax=b using LU from the factorization (even if soft error occurs during LU factorization)

(4) Check for soft error

(5) Correct solution x

Page 7: Ieee cluster talk

Why is soft error hard to handle?

• Soft error occurs silently

• Propagation

Page 8: Ieee cluster talk

Example: Error propagationError location (using matlab notation and 1-based index)

Error strikes right before panel factorization of (41:200, 41:60),

Case 1: Error at (35,10), in L area

Case 2: Error at (50,120), in A’ area

Note: Pivoting on the left of panel factorization is delayed to the end of error detection and recovery so that error in L area does not move

Page 9: Ieee cluster talk

Case 1: Non-propagating error

Error does not propagate in this case

Page 10: Ieee cluster talk

Case 2: Propagating error

Page 11: Ieee cluster talk

Soft error challenge

May 3, 2023 11

Page 12: Ieee cluster talk

Error modeling (for propagating error)

• When?• Answer: Doesn’t really matter

May 3, 2023 12

LU

Page 13: Ieee cluster talk

Error modeling (for “where”)

1 1 1t t t tA L P A

Define an initial erroneous initial matrix

Input matrix

One step of LU

If no soft error occurs

If soft error occurs at step t

Page 14: Ieee cluster talk

Locate Error

Column j

Page 15: Ieee cluster talk

Recover Ax=b

• Sherman Morison Formula

Page 16: Ieee cluster talk

Recover Ax=b

Given:

To Solve:Ax b

Page 17: Ieee cluster talk

Recover Ax=b

Page 18: Ieee cluster talk

Recover Ax=bRecall:

Therefore:

Page 19: Ieee cluster talk

Recover Ax=b

Sherman Morrison

Page 20: Ieee cluster talk

Recover Ax=b

Page 21: Ieee cluster talk

Recover Ax=b

Page 22: Ieee cluster talk

Recover Ax=b

Needs protection

Page 23: Ieee cluster talk

How to detect & recovery a soft error in L?

• The recovery of Ax=b requires a correct L

• L does not change once produced• Static checkpointing for L

• Delay pivoting on L to prevent checksum of L from being invalidated

LU

Page 24: Ieee cluster talk

• PDGEMM based checkpointing• Checkpointing time increases when scaled to more processes and

larger matrices

Checkpointing for L, idea 1

NOT SCALABLE

Page 25: Ieee cluster talk

Checkpointing for L, idea 2• Local Checkpointing• Each process checkpoints their local involved data• Constant checkpointing time

SCALABLE

Page 26: Ieee cluster talk

Encoding for L

• On each process, for a column of L

Page 27: Ieee cluster talk

Kraken Performance

Two 2.6 GHz six-core AMD Opteron processors per node

32x32 MPI processes, 6 threads/(process, core) 6,144 cores used in total

Page 28: Ieee cluster talk

Kraken Performance

Two 2.6 GHz six-core AMD Opteron processors per node

64x64 MPI processes, 6 threads/(process, core) 24,576 used cores in total

Page 29: Ieee cluster talk

May 3, 2023 29

Page 30: Ieee cluster talk

• Backup slides

May 3, 2023 30

Page 31: Ieee cluster talk

A

A e A w

cA

cPA LU

c v

L

U

Locate Error

U

U e

U w

Page 32: Ieee cluster talk

Locate Error

A

A e A w

cA

cPA LU

c v

L

U

U

U e

U w

Page 33: Ieee cluster talk

Locate Error

• Wj is the jth element of vector w in the generator matrix

• Component-wise division of s and r reveals wj

• Search wj in w reveals the initial soft error’s column

Page 34: Ieee cluster talk

Extra Storage

• For input matrix of size MxN on PxQ grid• A copy of the original matrix

• Not necessary when it’s easy to re-generate the required column of the original matrix

• 2 additional columns: 2 x M• Each process has 2 rows: , in total 2 N

Q 2P N

2 2

2 2 2N

extra storage M P Nrmatrix storage M N

P PN M M