![Page 1: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/1.jpg)
Fault tolerance, malleability and migration for divide-and-conquer
applications on the Grid
Gosia Wrzesińska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal
Vrije Universiteit, Ibis
![Page 2: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/2.jpg)
Distributed supercomputing
• Parallel processing on geographically distributed computing systems (grids)
• Needed:
– Fault-tolerance: survive node crashes (also entire clusters)
– Malleability: add or remove machines at runtime
– Migration: move a running application to another set of machines (comes for free with malleability)
• We focus on divide-and-conquer applications
[Figure: clusters in Leiden, Delft, Brno and Berlin connected via the Internet]
![Page 3: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/3.jpg)
Outline
• The Ibis grid programming environment
• Satin: a divide-and-conquer framework
• Fault-tolerance, malleability and migration in Satin
• Performance evaluation
![Page 4: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/4.jpg)
The Ibis system
• Java-centric => portability
– "write once, run anywhere"
• Efficient communication
– Efficient pure Java implementation
– Optimized solutions for special cases with native code
• High level programming models:
– Divide & Conquer (Satin)
– Remote Method Invocation (RMI)
– Replicated Method Invocation (RepMI)
– Group Method Invocation (GMI)
http://www.cs.vu.nl/ibis/
![Page 5: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/5.jpg)
Satin: divide-and-conquer on the Grid
• Effective paradigm for Grid applications (hierarchical)
• Satin: Grid-friendly load balancing (aware of cluster hierarchy)
• Missing support for
– Fault tolerance
– Malleability
– Migration
[Figure: call tree of fib(5) with subtrees assigned to cpu 1, cpu 2 and cpu 3]
![Page 6: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/6.jpg)
Example: Fibonacci
class Fib {
    int fib (int n) {
        if (n < 2) return n;
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }
}
[Figure: call tree of fib(5)]
Single-threaded Java
![Page 7: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/7.jpg)
Example: Fibonacci
public interface FibInter extends ibis.satin.Spawnable {
    public int fib (int n);
}

class Fib extends ibis.satin.SatinObject implements FibInter {
    public int fib (int n) {
        if (n < 2) return n;
        int x = fib(n-1); /*spawned*/
        int y = fib(n-2); /*spawned*/
        sync();
        return x + y;
    }
}
[Figure: clusters in Leiden, Delft, Brno and Berlin connected via the Internet]
![Page 8: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/8.jpg)
Compiling Satin programs
[Figure: source, Java compiler, bytecode, bytecode rewriter, bytecode, executed on several JVMs]
![Page 9: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/9.jpg)
Executing Satin programs
• Spawn: put work in work queue
• Sync:
– Run work from queue
– If empty: steal (load balancing)
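The spawn, sync and steal rules on this slide can be sketched with a double-ended queue. This is a minimal single-threaded illustration, not Satin's implementation: the class `WorkQueueSketch` and its method names are hypothetical. It also assumes the usual work-stealing convention (as in Cilk) that the owner runs its newest job while a thief takes the oldest one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a per-processor work queue; not Satin's actual code.
class WorkQueueSketch {
    private final Deque<String> jobs = new ArrayDeque<>();

    // Spawn: put work in the work queue.
    void spawn(String job) {
        jobs.addFirst(job);
    }

    // Sync, local part: run work from the own queue, newest job first.
    // Returns null when the queue is empty, which is the signal to steal.
    String takeLocal() {
        return jobs.pollFirst();
    }

    // Called on a victim by an idle processor: hand over the oldest job.
    // Assumption: the oldest job sits highest in the tree and carries most work.
    String steal() {
        return jobs.pollLast();
    }

    public static void main(String[] args) {
        WorkQueueSketch victim = new WorkQueueSketch();
        victim.spawn("fib(4)");
        victim.spawn("fib(3)");
        System.out.println("thief steals " + victim.steal());
        System.out.println("owner runs   " + victim.takeLocal());
    }
}
```

Taking from opposite ends keeps the owner and the thief from competing for the same job in most cases; a real implementation would additionally need synchronization.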
![Page 10: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/10.jpg)
Example application: Fibonacci
[Figure: call tree of Fib(5) distributed over processor 1 (master), processor 2 and processor 3]
![Page 11: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/11.jpg)
Satin: load balancing for Grids
• Random Stealing (RS)
– Pick a victim at random
– Provably optimal on a single cluster (Cilk)
– Problems on multiple clusters:
• With C clusters, a fraction (C-1)/C of the steal attempts goes over the WAN
• Synchronous protocol
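The (C-1)/C figure can be checked with a few lines of arithmetic. The sketch below assumes C equally sized clusters and a victim chosen uniformly at random (it ignores the small correction for a thief never picking itself); the class and method names are made up for illustration.

```java
// Hypothetical helper illustrating the (C-1)/C estimate; not part of Satin.
class WanStealFraction {
    // Fraction of random steal attempts that target another cluster,
    // assuming C equally sized clusters and a uniformly random victim.
    static double wanFraction(int clusters) {
        if (clusters < 1) {
            throw new IllegalArgumentException("need at least one cluster");
        }
        return (clusters - 1) / (double) clusters;
    }

    public static void main(String[] args) {
        for (int c = 1; c <= 4; c++) {
            System.out.printf("C=%d: %.0f%% of steal attempts cross the WAN%n",
                    c, 100 * wanFraction(c));
        }
    }
}
```

With four clusters, three out of four steal attempts already pay wide-area latency, and because the protocol is synchronous the thief sits idle for each of them.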
![Page 12: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/12.jpg)
Grid-aware load balancing
• Cluster-aware Random Stealing (CRS) [van Nieuwpoort et al., PPoPP 2001]
– When idle:
• Send asynchronous steal request to random node in different cluster
• In the meantime steal locally (synchronously)
• Only one wide-area steal request at a time
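The CRS rule for an idle node can be sketched as a small decision function. This is only an illustration under stated assumptions: the class `CrsSketch`, the string-encoded actions and the `onWideAreaReply` callback are hypothetical and stand in for real messaging, and nodes are identified by array index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the CRS decision rule for one node; not Satin's actual code.
class CrsSketch {
    private final int myId;
    private final int[] clusterOf;            // clusterOf[i] = cluster of node i
    private final Random random;
    private boolean wideAreaPending = false;  // only one wide-area request at a time

    CrsSketch(int myId, int[] clusterOf, long seed) {
        this.myId = myId;
        this.clusterOf = clusterOf;
        this.random = new Random(seed);
    }

    // Called whenever this node is idle; returns the steal actions to perform.
    List<String> onIdle() {
        List<Integer> local = new ArrayList<>();
        List<Integer> remote = new ArrayList<>();
        for (int i = 0; i < clusterOf.length; i++) {
            if (i == myId) continue;
            (clusterOf[i] == clusterOf[myId] ? local : remote).add(i);
        }
        List<String> actions = new ArrayList<>();
        // Asynchronous steal request to a random node in a different cluster.
        if (!wideAreaPending && !remote.isEmpty()) {
            wideAreaPending = true;
            actions.add("async-wan-steal:" + remote.get(random.nextInt(remote.size())));
        }
        // In the meantime, steal locally (synchronously).
        if (!local.isEmpty()) {
            actions.add("sync-local-steal:" + local.get(random.nextInt(local.size())));
        }
        return actions;
    }

    // Called when the wide-area reply (a job or "no work") arrives.
    void onWideAreaReply() {
        wideAreaPending = false;
    }
}
```

The pending flag is what keeps wide-area traffic bounded: however often the node goes idle, it has at most one request in flight across the WAN while local stealing continues.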
![Page 13: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/13.jpg)
Satin raytracer on a grid
• GridLab testbed: 5 cities in Europe
• 40 cpus
• Distance up to 2000 km
• Factor of 10 difference in CPU speeds
• Latencies:
– 0.2 – 210 ms daytime
– 0.2 – 66 ms night
• Bandwidth:
– 9 KB/s – 11 MB/s
• Three orders of magnitude difference in communication speeds
![Page 14: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/14.jpg)
Configuration
| Location | Type | OS | CPU | CPUs |
|---|---|---|---|---|
| Amsterdam, The Netherlands | Cluster | Linux | Pentium-3 | 8 x 1 |
| Amsterdam, The Netherlands | SMP | Solaris | Sparc | 1 x 2 |
| Brno, Czech Republic | Cluster | Linux | Xeon | 4 x 2 |
| Cardiff, Wales, UK | SMP | Linux | Pentium-3 | 1 x 2 |
| Lecce, Italy | SMP | Tru64 | Alpha | 1 x 4 |
| ZIB Berlin, Germany | SMP | Irix | MIPS | 1 x 16 |
![Page 15: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/15.jpg)
CRS performance on GridLab testbed
[Figure: bar chart of efficiency in % (axis 0 to 100) for single cluster, RS night, CRS night, RS daytime and CRS daytime]
![Page 16: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/16.jpg)
Satin summary
• Satin allows rapid development of parallel applications which are able to run with high efficiency in geographically distributed and highly heterogeneous environments
• Applications:
– Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack
– All master-worker algorithms
• Still missing: FT, malleability, migration
![Page 17: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/17.jpg)
Fault-tolerance, malleability, migration
• Can be implemented by handling processors joining or leaving the ongoing computation
• Processors may leave either unexpectedly (crash) or gracefully
• Handling joining processors is trivial:
– Let them start stealing jobs
• Handling leaving processors is harder:
– Recompute missing jobs
– Problems: orphan jobs, partial results from gracefully leaving processors
![Page 18: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/18.jpg)
Crashing processors
[Figure: job tree with jobs 1 to 15 distributed over processors 1, 2 and 3]
![Page 19: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/19.jpg)
Crashing processors
[Figure: job tree after processor 2 has crashed; jobs 2, 5, 10 and 11 are gone, processors 1 and 3 remain]
![Page 20: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/20.jpg)
Crashing processors
[Figure: job tree on processors 1 and 3; job 2 reappears in the tree]
![Page 21: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/21.jpg)
Crashing processors
[Figure: job tree on processors 1 and 3; job 2 is marked with a question mark]
Problem: orphan jobs – jobs stolen from crashed processors
![Page 22: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/22.jpg)
Crashing processors
[Figure: job tree on processors 1 and 3; job 2, marked with a question mark, now has children 4 and 5]
Problem: orphan jobs – jobs stolen from crashed processors
![Page 23: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/23.jpg)
Handling orphan jobs
• For each finished orphan, broadcast (jobID, processorID) tuple; abort the rest
• All processors store tuples in orphan tables
• Processors perform lookups in orphan tables for each recomputed job
• If successful: send a result request to the owner (async), put the job on a stolen jobs list
[Figure: processor 3 broadcasts (9, cpu3) and (15, cpu3)]
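The orphan-table bookkeeping on this slide can be sketched as a map from job ID to the processor holding the finished result. This is a minimal illustration rather than Satin's code: the class `OrphanTableSketch` and its methods are hypothetical, job IDs are plain integers for brevity, and the asynchronous result request is reduced to returning the owner's name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an orphan table kept by every processor; not Satin's actual code.
class OrphanTableSketch {
    // jobID -> processorID that holds the finished orphan result
    private final Map<Integer, String> orphanTable = new HashMap<>();
    // jobs whose result is awaited from another processor
    private final List<Integer> stolenJobs = new ArrayList<>();

    // Store a broadcast (jobID, processorID) tuple.
    void onBroadcast(int jobId, String processorId) {
        orphanTable.put(jobId, processorId);
    }

    // Lookup performed for each job that is about to be recomputed.
    // Returns the owner to ask for the result, or null if the job
    // has to be recomputed locally.
    String lookupBeforeRecompute(int jobId) {
        String owner = orphanTable.get(jobId);
        if (owner != null) {
            stolenJobs.add(jobId);  // wait for the result like for any stolen job
        }
        return owner;
    }

    List<Integer> stolenJobs() {
        return stolenJobs;
    }

    public static void main(String[] args) {
        OrphanTableSketch table = new OrphanTableSketch();
        table.onBroadcast(9, "cpu3");
        table.onBroadcast(15, "cpu3");
        System.out.println("job 9 owner: " + table.lookupBeforeRecompute(9));
        System.out.println("job 5 owner: " + table.lookupBeforeRecompute(5));
    }
}
```

A hit in the table turns a recomputation into a result request, which is how work done on an orphaned subtree is reconnected to the repaired tree.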
![Page 24: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/24.jpg)
Handling orphan jobs - example
[Figure: job tree with jobs 1 to 15 distributed over processors 1, 2 and 3]
![Page 25: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/25.jpg)
Handling orphan jobs - example
[Figure: job tree after processor 2 has crashed; jobs 2, 5, 10 and 11 are gone, processors 1 and 3 remain]
![Page 26: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/26.jpg)
Handling orphan jobs - example
[Figure: job tree on processors 1 and 3; job 2 reappears in the tree]
![Page 27: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/27.jpg)
Handling orphan jobs - example
[Figure: processor 3 broadcasts (9, cpu3) and (15, cpu3); the tuples are stored in an orphan table]
![Page 28: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/28.jpg)
Handling orphan jobs - example
[Figure: job tree on processors 1 and 3 with an orphan table whose entries point to cpu3; jobs 9 and 15 remain on processor 3]
![Page 29: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/29.jpg)
Handling orphan jobs - example
[Figure: job tree on processors 1 and 3, now containing jobs 2, 4 and 5; orphan table with entries pointing to cpu3]
![Page 30: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/30.jpg)
Handling orphan jobs - example
[Figure: job tree on processors 1 and 3, now containing jobs 2, 4 and 5; orphan table with entries pointing to cpu3]
![Page 31: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/31.jpg)
Processors leaving gracefully
[Figure: job tree with jobs 1 to 15 distributed over processors 1, 2 and 3]
![Page 32: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/32.jpg)
Processors leaving gracefully
[Figure: job tree with jobs 1 to 15 distributed over processors 1, 2 and 3]
Send results to another processor; treat those results as orphans
![Page 33: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/33.jpg)
Processors leaving gracefully
[Figure: job tree after processor 2 has left; job 11 is kept, processors 1 and 3 remain]
Send results to another processor; treat those results as orphans
![Page 34: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/34.jpg)
Processors leaving gracefully
[Figure: processor 3 broadcasts (11, cpu3), (9, cpu3) and (15, cpu3); the orphan table holds entries for jobs 11, 9 and 15, all pointing to cpu3]
![Page 35: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/35.jpg)
Processors leaving gracefully
[Figure: job tree on processors 1 and 3, now containing jobs 2, 4 and 5; orphan table with entries for jobs 11, 9 and 15 on cpu3]
![Page 36: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/36.jpg)
Processors leaving gracefully
[Figure: job tree on processors 1 and 3, now containing jobs 2, 4 and 5; orphan table with entries for jobs 11, 9 and 15 on cpu3]
![Page 37: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/37.jpg)
A crash of the master
• Master: the processor that started the computation by spawning the root job
• If master crashes:
– Elect a new master
– Execute normal crash recovery
– New master restarts the application
– In the new run, all results from the previous run are reused
![Page 38: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/38.jpg)
Some remarks about scalability
• Little data is broadcast (< 1% jobs, pointers)
• Message combining
• Lightweight broadcast: no need for reliability, synchronization, etc.
![Page 39: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/39.jpg)
Performance evaluation
• Leiden, Delft (DAS-2) + Berlin, Brno (GridLab)
• Bandwidth: 62 – 654 Mbit/s
• Latency: 2 – 21 ms
![Page 40: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/40.jpg)
Impact of saving partial results
[Figure: bar chart of runtime (sec.), axis 0 to 450, for two testbeds: wide-area DAS-2 (16 cpus Leiden, 16 cpus Delft) and GridLab (8 cpus Leiden, 8 cpus Delft, 4 cpus Berlin, 4 cpus Brno). Series: 1 cluster leaves unexpectedly (without saving orphans); 1 cluster leaves unexpectedly (with saving orphans); 1 cluster leaves gracefully; 1.5/3.5 clusters (no crashes)]
![Page 41: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/41.jpg)
Migration overhead
[Figure: bar chart of runtime (sec.), axis 0 to 350, with migration and without migration; 8 cpus Leiden, 4 cpus Berlin, 4 cpus Brno (Leiden cpus replaced by Delft)]
![Page 42: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/42.jpg)
Crash-free execution overhead
[Figure: bar chart of speedup on 32 cpus, axis 0 to 35, for Raytracer, TSP, SAT solver and Knapsack; plain Satin compared with malleable Satin]
Used: 32 cpus in Delft
![Page 43: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/43.jpg)
Summary
• Satin implements fault-tolerance, malleability and migration for divide-and-conquer applications
• Save partial results by repairing the execution tree
• Applications can adapt to changing numbers of cpus and migrate without loss of work (overhead < 10%)
• Outperform traditional approach by 25%
• No overhead during crash-free execution
![Page 44: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/44.jpg)
Further information
Publications and a software distribution available at:
http://www.cs.vu.nl/ibis/
![Page 45: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/45.jpg)
Additional slides
![Page 46: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/46.jpg)
Ibis design
![Page 47: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/47.jpg)
Partial results on leaving cpus
If processors leave gracefully:
• Send all finished jobs to another processor
• Treat those jobs as orphans = broadcast (jobID, processorID) tuples
• Execute the normal crash recovery procedure
![Page 48: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/48.jpg)
Job identifiers
• rootId = 1
• childId = parentId * branching_factor + child_no
• Problem: need to know maximal branching factor of the tree
• Solution: strings of bytes, one byte per tree level
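Both identifier schemes on this slide can be sketched in a few lines. The code is an illustration under assumptions, not Satin's implementation: the class `JobIdSketch`, the encoding of the root as the single byte 1 and the limit of 256 children per job are choices made here for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of the two job ID schemes; not Satin's actual code.
class JobIdSketch {
    // Numeric scheme: childId = parentId * branching_factor + child_no.
    // Needs the maximal branching factor of the whole tree in advance.
    static long numericChildId(long parentId, int branchingFactor, int childNo) {
        return parentId * branchingFactor + childNo;
    }

    // Byte-string scheme: one byte per tree level, so no global branching
    // factor is required. Assumption: child numbers fit in one byte (0..255).
    static byte[] byteChildId(byte[] parentId, int childNo) {
        if (childNo < 0 || childNo > 255) {
            throw new IllegalArgumentException("child number must fit in one byte");
        }
        byte[] child = Arrays.copyOf(parentId, parentId.length + 1);
        child[parentId.length] = (byte) childNo;
        return child;
    }

    public static void main(String[] args) {
        // Binary tree with rootId = 1: the children of the root get IDs 2 and 3.
        System.out.println(numericChildId(1, 2, 0));
        System.out.println(numericChildId(1, 2, 1));

        byte[] root = {1};
        byte[] secondChild = byteChildId(root, 1);
        System.out.println(Arrays.toString(byteChildId(secondChild, 0)));
    }
}
```

In both schemes the ID depends only on the job's position in the tree, so a recomputed job receives the same ID as the lost original and can be matched against orphan table entries.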
![Page 49: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/49.jpg)
Distributed ASCI Supercomputer (DAS) – 2
Node configuration
• Dual 1 GHz Pentium-III
• >= 1 GB memory
• 100 Mbit Ethernet + (Myrinet)
• Linux
Sites: VU (72 nodes), UvA (32), Leiden (32), Delft (32), Utrecht (32), connected by GigaPort (1-10 Gb)
![Page 50: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/50.jpg)
GridLab testbed
interface FibInter extends ibis.satin.Spawnable {
    public int fib (int n);
}

class Fib extends ibis.satin.SatinObject implements FibInter {
    public int fib (int n) {
        if (n < 2) return n;
        int x = fib (n - 1);
        int y = fib (n - 2);
        sync();
        return x + y;
    }
}
Java + divide&conquer
Example
![Page 51: Fault tolerance, malleability and migration for divide-and](https://reader030.vdocument.in/reader030/viewer/2022020621/61e9eb884283d954e1130770/html5/thumbnails/51.jpg)
Grid results
| Program | sites | CPUs | Efficiency |
|---|---|---|---|
| Raytracer | 5 | 40 | 81 % |
| SAT-solver | 5 | 28 | 88 % |
| Compression | 3 | 22 | 67 % |
• Efficiency based on normalization to single CPU type (1GHz P3)