![Page 1: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/1.jpg)
Silent error resilience in numerical time-stepping schemes
Austin Benson, Stanford University
ICME Colloquium, Jan. 26, 2015
Joint work with Sven Schmit (Stanford) and Rob Schreiber (HP Labs)
code + data: http://stanford.edu/~arbenson/silent.html
paper: Intl. J. of High Performance Computing Applications, 2014
![Page 2: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/2.jpg)
- Computer systems are getting bigger and more complicated.
- Software systems are getting bigger and more complicated.
- Pushing energy limits.
- Things break.
![Page 3: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/3.jpg)
What breaks?
- Hardware wears out
- Bit flips from cosmic rays
- Data races and other software bugs
- Firmware bugs
Silent errors are errors in application state that have escaped low-level error detection.
![Page 4: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/4.jpg)
What can we do?
Checkpoint/restart: Occasionally save state of system. If error is detected, restart.
Does not scale. How to detect errors?
Other algorithm-based fault tolerance (ABFT): Clever algorithms that address these issues for particular computations.
This work: Error detection for iterative computation in general, numerical time-stepping schemes in particular.
![Page 5: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/5.jpg)
Spot the error!
![Page 6: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/6.jpg)
At time step 120, a single entry in the right-hand side of the Crank-Nicolson and Backward Euler linear solves was multiplied by 0.995.
![Page 7: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/7.jpg)
General algorithm:
- A “base method” generates a sequence B1, B2, …
- An “auxiliary method” generates a sequence A1, A2, …
- If Di = ||Bi – Ai|| is abnormal, an error may have occurred.
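The general algorithm above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the "ten times the running median" rule are hypothetical stand-ins for whatever notion of "abnormal" a real detector uses.

```python
import math
import statistics

def norm_diff(b, a):
    """Euclidean norm of B_i - A_i for two equal-length state vectors."""
    return math.sqrt(sum((bj - aj) ** 2 for bj, aj in zip(b, a)))

def flag_suspect_steps(base_seq, aux_seq, is_abnormal):
    """Return the indices i at which D_i = ||B_i - A_i|| looks abnormal."""
    history, flagged = [], []
    for i, (b, a) in enumerate(zip(base_seq, aux_seq)):
        d = norm_diff(b, a)
        if is_abnormal(d, history):
            flagged.append(i)
        else:
            history.append(d)  # only trusted values feed the baseline
    return flagged

def ten_times_median(d, history):
    # Hypothetical rule (an assumption for this sketch): flag D_i once it
    # exceeds ten times the median of the previously accepted differences.
    return len(history) >= 3 and d > 10.0 * statistics.median(history)

# Toy data: the two methods agree to about 0.01 except at step 4.
base = [[1.0, 1.0] for _ in range(6)]
aux = [[1.01, 1.0] for _ in range(6)]
aux[4] = [2.0, 1.0]
print(flag_suspect_steps(base, aux, ten_times_median))
```

Keeping the abnormality rule as a separate callable means the same loop works for any base/auxiliary pair.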
![Page 8: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/8.jpg)
Base method: high-order numerical integration scheme: Runge-Kutta 5
Auxiliary method: lower-order scheme: Runge-Kutta 4
Difference: Di = |Bi – Ai|
Re-purposing an old idea for step-size control [Fehlberg, 1969], [Dormand and Prince, 1980]
![Page 9: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/9.jpg)
Key idea: re-use data
RK 1/2 scheme for u’ = f(t, u)
Second-order scheme has local (per-step) error O(h^3)
No extra function evaluations. Provides O(h^2) check.
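As a rough illustration of the data re-use, the sketch below pairs Heun's second-order method with forward Euler for a scalar problem u' = f(t, u). The choice of this particular pair is an assumption (the slide's formulas did not survive extraction), and the function names are hypothetical. Both results are built from the same two evaluations of f, and their difference is h/2 · (k2 − k1), which scales like h^2.

```python
def heun_step_with_euler_check(f, t, u, h):
    """One step of a Runge-Kutta 1/2 pair; returns (new state, check value)."""
    k1 = f(t, u)
    k2 = f(t + h, u + h * k1)
    base = u + 0.5 * h * (k1 + k2)  # second-order (Heun) result, used to advance
    aux = u + h * k1                # first-order (forward Euler) result, re-uses k1
    return base, abs(base - aux)

calls = []  # records every evaluation of f

def decay(t, u):
    """Right-hand side of the test problem u' = -u."""
    calls.append(t)
    return -u

_, d_h = heun_step_with_euler_check(decay, 0.0, 1.0, 0.1)
_, d_half = heun_step_with_euler_check(decay, 0.0, 1.0, 0.05)

# Two evaluations of f per step, and halving h cuts the check by about 4x.
print(len(calls), d_h / d_half)
```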
![Page 10: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/10.jpg)
Key idea: re-use data
- Implicit solve that is stable (e.g., Backward Euler).
- Explicit solve checks (e.g., Forward Euler).

It is OK that the explicit solve may be unstable. (Why?)
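A small sketch of this pairing on the scalar test problem u' = λu, with a stiff λ chosen so that forward Euler is unstable at the given step size. The setup and names are assumptions made for illustration. One plausible reading of the "(Why?)" is visible in the code: the explicit step always starts from the current state of the stable base method and is discarded after one step, so its instability never has a chance to compound.

```python
def backward_euler_with_forward_check(lam, u0, h, n_steps):
    """Advance u' = lam * u with backward Euler; check each step with forward Euler."""
    u = u0
    diffs = []
    for _ in range(n_steps):
        base = u / (1.0 - h * lam)  # implicit (backward Euler) step
        aux = u + h * lam * u       # explicit (forward Euler) step from the SAME state
        diffs.append(abs(base - aux))
        u = base                    # only the stable base result is carried forward
    return u, diffs

# With lam = -300 and h = 0.01 the forward Euler growth factor is 1 + h*lam = -2,
# so forward Euler run on its own would blow up.
u_final, diffs = backward_euler_with_forward_check(-300.0, 1.0, 0.01, 5)
print(u_final, diffs)
```

In this toy problem the check values shrink smoothly along with the solution, which is the kind of regular behavior a jump detector needs.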
![Page 11: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/11.jpg)
Lots of these checks for numerical time-stepping algorithms…
- Backward/Forward Euler
- Richardson/Crank-Nicolson
- Runge-Kutta 1/2, 2/3, 4/5
- Adams-Bashforth linear multistep method 2/3, 4/5
- Explicit check on implicit scheme
- Extrapolation
![Page 12: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/12.jpg)
Exercise in step detection (change point detection). Algorithmic details are in the paper. Main parameters:
- Relative jump
- Variance change
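The paper has the actual detection algorithm. Purely to show how two parameters of this kind might interact, here is a hypothetical rule that flags a new difference value when both its relative jump from the previous value and the change in window variance exceed thresholds. The function names, the threshold values, and the decision to require both conditions are all assumptions.

```python
import math
import statistics

def ratio(new, old):
    """new / old, with a guard for a zero denominator."""
    if old > 0:
        return new / old
    return 1.0 if new == 0 else math.inf

def looks_like_step(window, d, max_rel_jump=1.0, max_var_ratio=25.0):
    """Hypothetical test: does the new difference d look like a step change?

    window holds recent accepted difference values, most recent last.
    """
    if len(window) < 3:
        return False  # not enough history to judge
    rel_jump = abs(ratio(d, window[-1]) - 1.0)  # |d - prev| / prev
    var_before = statistics.pvariance(window)
    var_after = statistics.pvariance(list(window) + [d])
    var_ratio = ratio(var_after, var_before)
    return rel_jump > max_rel_jump and var_ratio > max_var_ratio

recent = [1.0, 0.98, 0.96, 0.94]
print(looks_like_step(recent, 0.92), looks_like_step(recent, 5.0))
```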
![Page 13: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/13.jpg)
Experimental setup:
Solve heat equation for T time steps and artificially inject error at one time step.
Do this many times with different types of errors.
True positive rate: #(real errors detected) / #(trials)
False positive rate: #(non-errors “detected”) / #(time steps)
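The two rates can be computed from per-trial records as sketched below. The record layout is an assumption: each trial is a pair of the injected time step and the list of steps the detector flagged. The sketch also assumes that a detection only counts if it lands on the injection step itself, and that the false positive denominator is the total number of time steps over all trials.

```python
def detection_rates(trials, n_steps):
    """Return (true positive rate, false positive rate).

    trials: list of (injected_step, flagged_steps) pairs, one per run.
    n_steps: number of time steps T in each run.
    """
    true_pos = 0
    false_pos = 0
    for injected, flagged in trials:
        flagged = set(flagged)
        if injected in flagged:
            true_pos += 1                        # the real error was detected
        false_pos += len(flagged - {injected})   # flags at error-free steps
    tpr = true_pos / len(trials)
    fpr = false_pos / (n_steps * len(trials))
    return tpr, fpr

trials = [(120, [120]), (120, [50]), (120, []), (120, [120, 121])]
tpr, fpr = detection_rates(trials, n_steps=200)
print(tpr, fpr)
```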
![Page 14: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/14.jpg)
Are large errors easier to detect?
Local truncation error (LTE)-normalized error
Output when no fault is injected.
Output when fault is injected.
![Page 15: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/15.jpg)
Error injection: Multiply a single entry of the RHS in the linear solves by z ~ N(1, 5e-5) at a single time step
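A fault of this kind could be injected as in the sketch below. The slide does not say whether the second parameter of N(1, 5e-5) is a standard deviation or a variance, so the sketch exposes it as `sigma` and treats it as a standard deviation (an assumption). The function name and the random choice of entry are also hypothetical.

```python
import random

def inject_fault(rhs, sigma, rng):
    """Return a copy of rhs with one randomly chosen entry scaled by z ~ N(1, sigma).

    sigma is passed to random.gauss as the standard deviation (assumed reading).
    """
    faulty = list(rhs)
    idx = rng.randrange(len(faulty))
    z = rng.gauss(1.0, sigma)
    faulty[idx] *= z
    return faulty, idx, z

rng = random.Random(0)  # seeded so runs are repeatable
rhs = [1.0, 2.0, 3.0, 4.0]
faulty, idx, z = inject_fault(rhs, 5e-5, rng)
print(idx, z, faulty)
```

Returning the index and multiplier alongside the corrupted vector makes it possible to log exactly what was injected in each trial.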
![Page 16: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/16.jpg)
Error injection: Multiply q(x, t) at one discrete x by z ~ N(1, 0.1) at a single time step
![Page 17: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/17.jpg)
Takeaways
- We have a general framework for detecting silent errors.
- Numerical integration is our central application.
- We detect large errors more easily.
- Not too many false positives.
![Page 18: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/18.jpg)
Resilience: what do we need to discuss?
- How many silent errors are there? How worried should we be?
- Do we need systems solutions or algorithmic solutions? Both? “Defense in depth” is good.
- But how easy are ABFT methods to incorporate into existing solvers?
![Page 20: Silent error resilience in numerical time-stepping schemes](https://reader035.vdocument.in/reader035/viewer/2022062514/55baf8a6bb61eb31508b4614/html5/thumbnails/20.jpg)
Tardy error detection