![Page 1: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/1.jpg)
SIAM EX14 Workshop, July 7, Chicago - IL
Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers
Luc Giraud
joint work with E. Agullo, P. Salas, E. F. Yetkin, M. Zounon
funded by ANR RESCUE and G8-ECS
HiePACS Inria Project
Joint Inria-CERFACS lab
INRIA Bordeaux Sud-Ouest
![Page 2: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/2.jpg)
Context
L. Giraud - Resilient numerical linear algebra solvers 2/ 25
Resilience: ability to compute a correct output in the presence of faults
- Context: numerical linear algebra
- Goal: keep converging in the presence of faults
- Method: recover-restart strategy without checkpointing

- HPC systems are not fault-free
- A faulty component (node, core, memory) loses all its data
- Simulations at exascale have to be resilient
![Page 3: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/3.jpg)
Outline
Faults in HPC Systems
Sparse linear systems
Interpolation methods
Numerical experiments
Resilience in eigensolvers
Concluding remarks and perspectives
![Page 5: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/5.jpg)
Faults in HPC Systems
Framework
Forecast for extreme-scale systems
- Mean Time Between Failures (MTBF): less than one hour
- Checkpoint time might be larger than the MTBF

Objectives
- Explore fault-tolerant schemes with less or no overhead
- Numerical algorithms to deal with the overhead issue

Faults in this presentation
- Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)
![Page 9: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/9.jpg)
Sparse linear systems
Ax = b
We attempt to design fault-tolerant solvers for sparse linear systems.
Two classes of iterative methods
- Stationary methods (Jacobi, Gauss-Seidel, ...)
- Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)

Krylov methods have attractive potential for extreme scale.
![Page 11: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/11.jpg)
Interpolation methods
Block row distribution
[Figure: the linear system Ax = b split by block rows across four processes P1, P2, P3, P4. Legend: static data, dynamic data, lost data, interpolated data.]
We distinguish two categories of data:
- Static data
- Dynamic data

Let's assume that P1 fails
- Failed processor is replaced
- Static data are restored

Reset: set x1 to its initial value
Our algorithms aim at recovering x1 and restarting
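The data model on this slide can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: NumPy is used, the four processes are simulated inside a single program, A and b are taken to be the static data and the iterate x the dynamic data, and the helper names `block_rows`, `lose_block` and `reset_block` are hypothetical. The sketch shows the baseline Reset strategy, where the lost block of x is simply put back to the initial guess.

```python
import numpy as np

def block_rows(n, nprocs):
    """Split row indices 0..n-1 into nprocs contiguous blocks (block-row distribution)."""
    return np.array_split(np.arange(n), nprocs)

def lose_block(x, rows):
    """Simulate a fault: the dynamic data held by the failed process becomes unknown."""
    x_lost = x.copy()
    x_lost[rows] = np.nan
    return x_lost

def reset_block(x_lost, rows, x0):
    """Reset strategy: put the initial guess back into the lost entries."""
    x_new = x_lost.copy()
    x_new[rows] = x0[rows]
    return x_new

n, nprocs = 8, 4
blocks = block_rows(n, nprocs)      # blocks[0] plays the role of P1
x0 = np.zeros(n)                    # initial guess (assumed to be zero here)
x = np.arange(1.0, n + 1.0)         # stands in for the current iterate

x_after_fault = lose_block(x, blocks[0])            # P1 fails, x1 is lost
x_reset = reset_block(x_after_fault, blocks[0], x0) # Reset: x1 back to its initial value
print(x_reset)
```

Reset discards everything that was known about x1; the interpolation strategies instead rebuild x1 from the surviving blocks.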
![Page 20: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/20.jpg)
Interpolation methods
Overview of our fault tolerant algorithm
[Figure: timeline of processes P1, P2, P3, P4 along a time axis. Legend: fault, successful iteration, failed iteration, interpolation, restart.]
- Sequential simulations
- Simulation of parallel environment
- Generation of fault trace
- Realistic probability distribution
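As a rough sketch of how such a simulation could be organised, the code below generates a fault trace and replays it against a sequence of iterations. Everything specific here is an assumption made for illustration: the slides do not say which probability distribution was used, so exponential inter-arrival times with mean `mtbf` are a stand-in, each iteration is given a fixed duration, and the names `fault_trace` and `replay` are hypothetical.

```python
import random

def fault_trace(mtbf, horizon, nprocs, seed=0):
    """Return a time-ordered list of (time, failed_process) events before `horizon`.

    Inter-arrival times are exponential with mean `mtbf`. This is an assumption
    of the sketch, not the distribution used in the talk.
    """
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        t += rng.expovariate(1.0 / mtbf)
        if t >= horizon:
            return events
        events.append((t, rng.randrange(nprocs)))

def replay(events, n_iters, iter_time=1.0):
    """Label each iteration slot of a simulated run against the fault trace."""
    log, pending = [], list(events)
    for k in range(n_iters):
        start, end = k * iter_time, (k + 1) * iter_time
        hit = [p for (t, p) in pending if start <= t < end]
        pending = [(t, p) for (t, p) in pending if t >= end]
        if hit:
            # The iteration is lost: interpolate the lost blocks, then restart.
            log.append((k, "failed", hit))
            log.append((k, "interpolation+restart", hit))
        else:
            log.append((k, "successful", []))
    return log

events = fault_trace(mtbf=5.0, horizon=20.0, nprocs=4)
log = replay(events, n_iters=20)
for entry in log:
    print(entry)
```

Fixing the seed makes the trace reproducible, so the same sequence of faults can be replayed against different recovery strategies.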
![Page 31: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/31.jpg)
Interpolation methods
Fault in linear system
$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} ? \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
$$
How to recover $x_1$?
Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]
Solve $A_{11} x_1 = b_1 - A_{12} x_2$
Least Squares Interpolation (LSI)
$$
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1 +
\begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 =
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
$$
$$
x_1 = \arg\min_{x} \left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
- \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x
- \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 \right\|_2
$$
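The two recovery formulas translate almost directly into dense NumPy code. The sketch below is a minimal illustration, not the authors' implementation: the function names `li_recover` and `lsi_recover` are hypothetical, matrices are dense for brevity (a real solver would work on the distributed sparse blocks), and the 1D Laplacian test problem is an assumption chosen only because it is small and SPD.

```python
import numpy as np

def li_recover(A, b, x, lost):
    """Linear Interpolation: solve A11 x1 = b1 - A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11, b[lost] - A12 @ x[kept])
    return x_new

def lsi_recover(A, b, x, lost):
    """Least Squares Interpolation: minimise ||b - A[:, lost] x1 - A[:, kept] x2||_2 over x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    x1, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new

# Small SPD test problem: 1D Laplacian (an assumption for illustration).
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x_exact = np.linspace(1.0, 2.0, n)
b = A @ x_exact
lost = np.array([0, 1])            # entries owned by the failed process

x_faulty = x_exact.copy()
x_faulty[lost] = np.nan            # the lost block is unknown after the fault

x_li = li_recover(A, b, x_faulty, lost)
x_lsi = lsi_recover(A, b, x_faulty, lost)
print(np.abs(x_li - x_exact).max(), np.abs(x_lsi - x_exact).max())
```

In this toy case the surviving entries are exact, so both strategies reproduce the lost block up to rounding. LI only needs the rows owned by the failed process, while LSI uses the full block column and therefore remains defined when A11 is singular.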
![Page 35: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/35.jpg)
Interpolation methods
Main properties - basic linear algebra
Proposition
The initial guess generated by LI after a fault ensures that the A-norm of the forward error associated with the iterates computed by restarted CG or PCG is monotonically decreasing.
[LI might not be defined for non-SPD matrices, as diagonal blocks might be singular.]
Proposition
The initial guess generated by LSI after a fault ensures the monotonic decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MINRES after a restart due to a failure.
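Both propositions rest on a minimisation argument. For SPD A and fixed x2, the LI block x1 minimises the A-norm of the error, and the LSI block x1 minimises the residual 2-norm, so neither quantity can be larger after recovery than it was for the iterate held just before the fault. The numerical check below is a sketch under assumptions: a 1D Laplacian as the SPD matrix, a randomly perturbed exact solution standing in for the pre-fault iterate, and three lost entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD test matrix (assumption)
x_exact = rng.standard_normal(n)
b = A @ x_exact
x_k = x_exact + 0.1 * rng.standard_normal(n)   # stands in for the iterate before the fault

lost = np.arange(0, 3)     # entries owned by the failed process
kept = np.arange(3, n)

def a_norm(A, e):
    """A-norm of a vector e, sqrt(e^T A e), defined for SPD A."""
    return float(np.sqrt(e @ A @ e))

# LI: solve A11 x1 = b1 - A12 x2 using the surviving entries x2.
x_li = x_k.copy()
x_li[lost] = np.linalg.solve(A[np.ix_(lost, lost)],
                             b[lost] - A[np.ix_(lost, kept)] @ x_k[kept])

# LSI: least-squares fit of the lost block against the full block column.
x_lsi = x_k.copy()
x_lsi[lost] = np.linalg.lstsq(A[:, lost], b - A[:, kept] @ x_k[kept], rcond=None)[0]

err_before = a_norm(A, x_exact - x_k)
err_li = a_norm(A, x_exact - x_li)
res_before = float(np.linalg.norm(b - A @ x_k))
res_lsi = float(np.linalg.norm(b - A @ x_lsi))

print("A-norm error   before / after LI :", err_before, err_li)
print("residual norm  before / after LSI:", res_before, res_lsi)
```

The check covers only the recovery step itself; the monotonic decrease over the whole run additionally relies on the convergence behaviour of CG and of the minimal residual methods between faults.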
![Page 39: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/39.jpg)
Numerical experiments
Impact of fault rate
Preconditioned GMRES (Kim1 - 2 % data lost)
[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for Reset, LI, LSI, SC and REF. Four panels: 4 faults, 8 faults, 17 faults, 40 faults.]
![Page 43: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/43.jpg)
Numerical experiments
Impact of lost data volume
Preconditioned GMRES(100) (Averous/epb3 - 10 faults)
[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for Reset, LI, LSI, SC and REF. Four panels: 3 % data lost, 0.8 % data lost, 0.2 % data lost, 0.001 % data lost.]
![Page 47: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/47.jpg)
Numerical experiments
Penalty of restart strategy
- Recover-restart strategy
- When restarting, we lose the Krylov subspace built before the fault
- Consequence: delay of convergence due to restart
- Restarting mechanism is naturally implemented in GMRES to reduce the computational resource consumption
- CG does not need to be restarted
![Page 48: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/48.jpg)
Numerical experiments
Penalty of restart strategy on PCG
[Figure: A-norm of the error (log scale, 1e-13 to 1) versus iterations (0 to 830) for Reset, LI, LSI, SC and REF. PCG on a 7-point stencil 3D Poisson equation with 70 faults - 5 % data lost.]
![Page 51: Preliminary Investigations on Resilient Parallel Numerical Linear Algebra … · 2018. 5. 25. · SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel](https://reader035.vdocument.in/reader035/viewer/2022071218/60523379ac3ca25fe7271a1f/html5/thumbnails/51.jpg)
Resilience in eigensolvers
Recovery-restart for eigensolvers
Eigenproblem:
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
Fault in eigenproblem:
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} ? \\ x_2 \end{pmatrix} = \lambda \begin{pmatrix} ? \\ x_2 \end{pmatrix}$$
How to recover $x_1$?
Linear Interpolation (LI): solve the linear system
$$(A_{11} - \lambda I_1)\, x_1 = -A_{12}\, x_2$$
Least Squares Interpolation (LSI):
$$\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1 + \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 = \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
$$x_1 = \operatorname*{argmin}_{x} \left\| \begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} x + \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} x_2 \right\|_2$$
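The two interpolation formulas for a lost eigenvector block can be tried on a small dense example. The sketch below is an illustration under assumptions, not the authors' code: it uses a random real symmetric matrix, takes the current eigenvalue estimate $\lambda$ as known, and uses dense solvers from NumPy. The names `li_eig_recover` and `lsi_eig_recover` are hypothetical.

```python
import numpy as np

def li_eig_recover(A, lam, x, lost):
    """LI: solve (A11 - lam*I1) x1 = -A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x = x.copy()
    x[lost] = np.linalg.solve(A11 - lam * np.eye(len(lost)), -A12 @ x[kept])
    return x

def lsi_eig_recover(A, lam, x, lost):
    """LSI: minimize || (A - lam*I)[:, lost] x1 + (A - lam*I)[:, kept] x2 ||_2."""
    n = A.shape[0]
    kept = np.setdiff1d(np.arange(n), lost)
    shifted = A - lam * np.eye(n)
    # Rows stay in their original order; a row permutation does not
    # change the 2-norm, so this matches the block formula.
    sol, *_ = np.linalg.lstsq(shifted[:, lost], -shifted[:, kept] @ x[kept],
                              rcond=None)
    x = x.copy()
    x[lost] = sol
    return x

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = B + B.T                       # real symmetric test matrix (assumption)
w, V = np.linalg.eigh(A)
lam, x_exact = w[0], V[:, 0]      # exact eigenpair used as reference

lost = np.array([1, 4])
damaged = x_exact.copy()
damaged[lost] = np.nan            # simulate the lost entries

x_li = li_eig_recover(A, lam, damaged, lost)
x_lsi = lsi_eig_recover(A, lam, damaged, lost)
print(np.linalg.norm(x_li - x_exact), np.linalg.norm(x_lsi - x_exact))
```

With an exact eigenpair both formulas reproduce the lost block up to rounding. During an actual iteration, $\lambda$ and $x_2$ are only approximations, so the recovered $x_1$ is an interpolation rather than an exact restoration.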
If $Ax = \lambda x$ with $x \neq 0$, where $A \in \mathbb{C}^{n \times n}$, $x \in \mathbb{C}^n$, and $\lambda \in \mathbb{C}$, then
- $\lambda$: eigenvalue
- $x$: eigenvector
- $(\lambda, x)$: eigenpair
Two classes of methods
- Fixed Point Methods (Power Method, Subspace iteration)
- Subspace Methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)
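As a concrete instance of the fixed point class, the power method can be written in a few lines. This is a generic textbook sketch restricted to a real symmetric matrix (an assumption; the talk considers complex matrices), with a relative residual $\|Ax - \lambda x\| / |\lambda|$ as the stopping test. The function name `power_method` and its parameters are hypothetical.

```python
import numpy as np

def power_method(A, x0, tol=1e-10, max_iter=10000):
    """Power iteration for the dominant eigenpair of a real symmetric A."""
    x = x0 / np.linalg.norm(x0)
    lam = x @ (A @ x)
    for k in range(1, max_iter + 1):
        y = A @ x
        x = y / np.linalg.norm(y)     # fixed point step: x <- A x / ||A x||
        lam = x @ (A @ x)             # Rayleigh quotient estimate
        if np.linalg.norm(A @ x - lam * x) <= tol * abs(lam):
            return lam, x, k
    return lam, x, max_iter

A = np.diag([1.0, 2.0, 5.0])
lam, x, iters = power_method(A, np.ones(3))
print(lam, iters)
```

The only state carried between iterations is the current vector `x`, which is why an interpolated replacement for a lost block can be plugged in directly for this class of methods.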
Thermo-acoustic test example
(a few smallest eigenvalues)
Jacobi-Davidson method
[Plot: ||(Ax - lambda*x)||/||lambda|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240); curves: LSI, REF]
Figure: Jacobi-Davidson method with 5 faults, 1% lost data. Convergence history using LSI and Checkpoint of current iterate
Concluding remarks and perspectives
Concluding remarks
Summary
- We have designed techniques to interpolate meaningful values for lost data, based on simple linear algebra tools
- Our techniques preserve some of the key monotonicity properties of Krylov solvers, but LI lacks robustness for non-SPD problems
- The restarting effect remains reasonable within the GMRES context
- No fault, no overhead
- These techniques can be adapted to multiple faults
- What about silent soft errors? (CGPOP preliminary experiments)
Thank you for your attention. Questions?
https://team.inria.fr/hiepacs/