![Page 1: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/1.jpg)
Using Application Structure to Handle Failures
and Improve Performance in a Migratory File Service
John Bent, Douglas Thain, Andrea Arpaci-Dusseau,
Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Project
14 April 2003
![Page 2: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/2.jpg)
Disclaimer
We have a lot of stuff to describe,
so hang in there until the end!
![Page 3: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/3.jpg)
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
![Page 4: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/4.jpg)
CPU Bound
• SETI@Home, Folding@Home, etc.
  – Excellent application of distributed computing.
  – KB of data, days of CPU time.
  – Efficient to do tiny I/O on demand.
• Supporting Systems:
  – Condor
  – BOINC
  – Google Toolbar
  – Custom software.
![Page 5: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/5.jpg)
I/O Bound
• D-Zero data analysis:
  – Excellent app for cluster computing.
  – GB of data, seconds of CPU time.
  – Efficient to compute whenever data is ready.
• Supporting Systems:
  – Fermi SAM
  – High-throughput document scanning
  – Custom software.
![Page 6: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/6.jpg)
Batch Pipelined Applications
[Figure: three parallel pipelines, each with stages a, b, and c (a1 b1 c1, a2 b2 c2, a3 b3 c3). Data items shown: "data" and "x y z". Labels: Pipeline, Batch Width, Pipeline Shared Data, Batch Shared Data.]
![Page 7: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/7.jpg)
Example: AMANDA
[Figure: the AMANDA pipeline of four stages: corsika, corama, mmc, amasim. Files shown:]
• corsika_input.txt (4 KB)
• NUCNUCCS, GLAUBTAR, EGSDATA3.3, QGSDATA4 (1 MB)
• DAT (23 MB)
• corama.out (26 MB)
• mmc_input.txt
• mmc_output.dat (126 MB)
• amasim_input.dat
• ice tables (3 files, 3 MB)
• experiment geometry (100s of files, 500 MB)
• amasim_output.txt (5 MB)
![Page 8: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/8.jpg)
Computing Environment
• Clusters dominate:
  – Similar configurations.
  – Fast interconnects.
  – Single administrative domain.
  – Underutilized commodity storage.
  – En masse, quite unreliable.
• Users wish to harness multiple clusters, but have jobs that are both I/O and CPU intensive.
![Page 9: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/9.jpg)
Ugly Solutions
• “FTP-Net”
  – User finds remote clusters.
  – Manually stages data in.
  – Submits jobs, deals with failures.
  – Pulls data out.
  – Lather, rinse, repeat.
• “Remote I/O”
  – Submit jobs to a remote batch system.
  – Let all I/O come back to the archive.
  – Return in several decades.
![Page 10: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/10.jpg)
What We Really Need
• Access resources outside my domain.
  – Assemble your own army.
• Automatic integration of CPU and I/O access.
  – Forget optimal: save administration costs.
  – Replacing remote with local always wins.
• Robustness to failures.
  – Can’t hire babysitters for New Year’s Eve.
![Page 11: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/11.jpg)
Hawk: A Migratory File Service
• Automatically deploys a “task force” across an existing distributed system.
• Manages applications from a high level, using knowledge of process interactions.
• Provides dependable performance through peer-to-peer techniques.
• Understands and reacts to failures using knowledge of the system and workloads.
![Page 12: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/12.jpg)
Philosophy of Hawk
“In allocating resources, strive to avoid disaster, rather than attempt to obtain an optimum.” - Butler Lampson
![Page 13: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/13.jpg)
Why not AFS+Make?
• Quick answer:
  – Distributed filesystems provide an unnecessarily strong abstraction that is unacceptably expensive to provide in the wide area.
• Better answer after we explain what Hawk is and how it works.
![Page 14: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/14.jpg)
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
![Page 15: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/15.jpg)
Workflow Language 1
job a a.sub
job b b.sub
job c c.sub
job d d.sub
parent a child c
parent b child d
[Figure: dependency graph in which a precedes c and b precedes d.]
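The declarations on this slide can be read as a small dependency graph. The following is a minimal sketch, not Hawk's implementation, of how such a file might be parsed and turned into a legal run order. The function names `parse_workflow` and `run_order` are hypothetical, and the grammar is assumed to consist only of the two statement forms shown on the slide.

```python
from collections import defaultdict, deque

# The workflow text from the slide, one statement per line.
WORKFLOW = """\
job a a.sub
job b b.sub
job c c.sub
job d d.sub
parent a child c
parent b child d
"""

def parse_workflow(text):
    """Return (jobs, children) from 'job' and 'parent X child Y' lines."""
    jobs = {}                      # job name -> submit file
    children = defaultdict(list)   # parent name -> list of child names
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "job" and len(words) == 3:
            jobs[words[1]] = words[2]
        elif words[0] == "parent" and len(words) == 4 and words[2] == "child":
            children[words[1]].append(words[3])
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return jobs, dict(children)

def run_order(jobs, children):
    """Kahn's algorithm: a job becomes ready once all its parents are done."""
    pending = {name: 0 for name in jobs}   # unfinished parents per job
    for kids in children.values():
        for kid in kids:
            pending[kid] += 1
    ready = deque(sorted(name for name, count in pending.items() if count == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for kid in children.get(job, []):
            pending[kid] -= 1
            if pending[kid] == 0:
                ready.append(kid)
    if len(order) != len(jobs):
        raise ValueError("workflow contains a cycle")
    return order

jobs, children = parse_workflow(WORKFLOW)
print(run_order(jobs, children))
```

Jobs a and b have no parents, so both are ready immediately; c and d become ready only after their respective parents finish.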
![Page 16: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/16.jpg)
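Workflow Language 2
volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
[Figure: volume v1 holds mydata from home storage and is mounted by jobs a and b; scratch volume v2 is shared by a and c, and scratch volume v3 by b and d.]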
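One way to read the volume and mount statements is as a private namespace per job. The sketch below builds those namespaces and reports which jobs share each volume. It assumes the statement forms are exactly `volume NAME SOURCE` and `mount VOLUME JOB PATH` as they appear on the slide; `build_namespaces` and `jobs_sharing` are hypothetical names, not part of Hawk.

```python
# Volume and mount statements from the slide, one per line.
SPEC = """\
volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
"""

def build_namespaces(text):
    """Return (volumes, namespaces) for the given statements."""
    volumes = {}      # volume name -> source ("scratch" or a URL)
    namespaces = {}   # job name -> {mount point: volume name}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "volume" and len(words) == 3:
            volumes[words[1]] = words[2]
        elif words[0] == "mount" and len(words) == 4:
            _, volume, job, path = words
            if volume not in volumes:
                raise ValueError(f"mount of undeclared volume: {volume}")
            namespaces.setdefault(job, {})[path] = volume
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return volumes, namespaces

def jobs_sharing(namespaces, volume):
    """Jobs that mount the given volume anywhere in their namespace."""
    return sorted(job for job, mounts in namespaces.items()
                  if volume in mounts.values())

volumes, namespaces = build_namespaces(SPEC)
for name in sorted(volumes):
    print(name, volumes[name], jobs_sharing(namespaces, name))
```

In this reading, v1 is batch-shared input (mounted by a and b), while v2 and v3 each carry pipeline data between one parent and its child.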
![Page 17: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/17.jpg)
Workflow Language 3
extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2
[Figure: file x in scratch volumes v2 and v3 is extracted to home storage as out.1 and out.2, alongside mydata.]
![Page 18: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/18.jpg)
Mapping Logical to Physical
• Abstract Jobs
  – Physical jobs in a batch system.
  – May run more than once!
• Logical “scratch” volumes
  – Temporary containers on a scratch disk.
  – May be created, replicated, and destroyed.
• Logical “read” volumes
  – Striped across cooperative proxy caches.
  – May be created, cached, and evicted.
![Page 19: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/19.jpg)
Starting System
[Figure: the starting system. Components shown: Archive, Workflow Manager, MatchMaker, Batch Queue, a PBS head node with its cluster nodes, and a Condor pool of nodes.]
![Page 20: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/20.jpg)
Gliding In
[Figure: the same system after glide-in jobs are submitted. Nodes in the PBS cluster and the Condor pool now each run a Master, a StartD, and a Proxy.]
![Page 21: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/21.jpg)
Hawk Architecture
[Figure: Hawk architecture. Components shown: three nodes, each with a StartD, a Proxy, and a Job with its Agent; cooperative caches (Coop Cache) among proxies; wide-area caching; Archive; MatchMaker; Batch Queue; Workflow Manager with App Flow and System Model.]
![Page 22: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/22.jpg)
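I/O Interactions
[Figure: I/O interactions on one node. Components shown: StartD, Job, Agent (POSIX library interface), Proxy with containers 119 and 120 and a cooperative block cache, other proxies on the local area network, Archive, MatchMaker, Batch Queue, Workflow Manager. Files shown: foo, bar, baz, outfile, tmpfile.]
Agent namespace:
/tmp -> container://host5/120
/data -> cache://host5/archive/data
Example calls:
creat(“/tmp/outfile”);
open(“/data/d15”);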
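The redirection on this slide can be sketched as a prefix lookup in a per-job mount table. This is an illustration only: the table is read off the slide's labels (/tmp mapped to container://host5/120 and /data mapped to cache://host5/archive/data), and the `translate` function and its longest-prefix rule are assumptions rather than the agent's actual logic.

```python
import posixpath

# Mount table as read from the slide (an interpretation of its labels).
MOUNTS = {
    "/tmp": "container://host5/120",
    "/data": "cache://host5/archive/data",
}

def translate(path, mounts=MOUNTS):
    """Map a job-visible path to a proxy URL, or None if it is not mounted."""
    path = posixpath.normpath(path)   # collapse "..", "//", trailing "/"
    # Try longer mount points first so nested mounts would win.
    for mount_point in sorted(mounts, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            return mounts[mount_point] + path[len(mount_point):]
    return None

print(translate("/tmp/outfile"))   # the creat() call on the slide
print(translate("/data/d15"))      # the open() call on the slide
print(translate("/etc/passwd"))    # outside every mount
```

Matching on `mount_point + "/"` rather than the bare prefix keeps a path such as /tmpfile from being mistaken for something under /tmp.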
![Page 23: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/23.jpg)
Cooperative Proxies
[Figure: proxies A, B, and C, each alongside a StartD and a Job with its Agent, plus the Archive, MatchMaker, Batch Queue, and Workflow Manager. The proxies discover one another, and each keeps a hash map from paths to proxies. Proxy list over time: t1: C; t2: B, C; t3: C, B, A; t4: C, B.]
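The slide says only that each proxy keeps a hash map from paths to proxies. As one possible illustration, the sketch below uses rendezvous (highest-random-weight) hashing, a technique chosen here because it moves few paths when a proxy joins or leaves. It is not claimed to be what Hawk does. The `owner` function is hypothetical, and the membership lists are one reading of the slide's t1 to t4 timeline.

```python
import hashlib

def owner(path, proxies):
    """Pick the proxy with the highest hash score for this path."""
    if not proxies:
        raise ValueError("no proxies known")

    def score(proxy):
        # Deterministic score for the (proxy, path) pair.
        digest = hashlib.sha1(f"{proxy}:{path}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(sorted(proxies), key=score)

# Membership as the proxies are discovered and then lost (illustrative).
timeline = {
    "t1": ["C"],
    "t2": ["B", "C"],
    "t3": ["A", "B", "C"],
    "t4": ["B", "C"],
}

for step, members in timeline.items():
    print(step, members, "->", owner("/data/d15", members))
```

With this scheme, when A disappears between t3 and t4, only the paths that A owned are reassigned; every other path keeps its owner.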
![Page 24: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/24.jpg)
Summary
• Archive
  – Sources input data, chooses coordinator.
• Glide-In
  – Deploy a “task force” of components.
• Cooperative Proxies
  – Provide dependable batch read-only data.
• Data Containers
  – Fault-isolated pipeline data.
• Workflow Manager
  – Directs the operation.
![Page 25: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/25.jpg)
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
![Page 26: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/26.jpg)
Performance Testbed
• Controlled testbed:
  – 32 dual-CPU 550 MHz cluster machines, 1 GB, SCSI disks, 100 Mb/s Ethernet.
  – Simulated WAN: restrict archive storage across router to 800 KB/s.
• Also some preliminary tests on uncontrolled systems:
  – MFS over PBS cluster at Los Alamos.
  – MFS over Condor system at INFN Italy.
![Page 27: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/27.jpg)
Synthetic Apps
[Figure: three synthetic two-stage applications, each with stages a and b: Pipe Intensive (10 MB pipe), Mixed (5 MB batch, 5 MB pipe), and Batch Intensive (10 MB batch). System configurations compared: Local, Co-Locate Data, Don’t Co-Locate, Remote.]
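A back-of-the-envelope calculation shows why data placement matters for these workloads. The sketch below estimates how long 5 MB and 10 MB take to move over the testbed's 800 KB/s archive link versus its 100 Mb/s Ethernet. It assumes decimal units (1 MB = 1000 KB) and ignores latency and protocol overhead, so the numbers are rough lower bounds, not measurements from the talk.

```python
def transfer_seconds(size_mb, rate_kb_per_s):
    """Idealized transfer time, assuming 1 MB = 1000 KB and no overhead."""
    return size_mb * 1000 / rate_kb_per_s

WAN_KB_PER_S = 800               # simulated WAN limit to the archive
LAN_KB_PER_S = 100 * 1000 / 8    # 100 Mb/s Ethernet = 12,500 KB/s

for size_mb in (5, 10):
    wan = transfer_seconds(size_mb, WAN_KB_PER_S)
    lan = transfer_seconds(size_mb, LAN_KB_PER_S)
    print(f"{size_mb} MB: WAN {wan:.2f} s, LAN {lan:.2f} s")
```

Under these assumptions the archive link is about 15 times slower than the local network, so every transfer kept inside the cluster saves most of its cost.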
![Page 28: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/28.jpg)
Pipeline Optimization
![Page 29: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/29.jpg)
Everything Together
![Page 30: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/30.jpg)
Network Consumption
![Page 31: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/31.jpg)
Failure Handling
![Page 32: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/32.jpg)
Real Applications
• BLAST
  – Search tool for proteins and nucleotides in genomic databases.
• CMS
  – Simulation of a high energy physics experiment to begin operation at CERN in 2006.
• H-F
  – Simulation of the nonrelativistic interactions between nuclei and electrons.
• AMANDA
  – Simulation of a neutrino detector buried in the ice of the South Pole.
![Page 33: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/33.jpg)
Application Throughput
| Name   | Stages | Remote | Hawk    |
|--------|--------|--------|---------|
| BLAST  | 1      | 4.67   | 747.40  |
| CMS    | 2      | 33.78  | 1273.96 |
| HF     | 3      | 40.96  | 3187.22 |
| AMANDA | 4      |        |         |
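The improvement is easier to judge as a ratio of the Hawk column to the Remote column. The short sketch below computes that ratio for the three applications that have values on the slide. The slide does not state units, which cancel in the ratio, and AMANDA is omitted because its row has no numbers.

```python
# (Remote, Hawk) throughput pairs copied from the slide's table.
THROUGHPUT = {
    "BLAST": (4.67, 747.40),
    "CMS": (33.78, 1273.96),
    "HF": (40.96, 3187.22),
}

def speedup(remote, hawk):
    """Ratio of Hawk throughput to remote-I/O throughput."""
    return hawk / remote

for name, (remote, hawk) in THROUGHPUT.items():
    print(f"{name}: {speedup(remote, hawk):.1f}x")
```

The ratios come out to roughly 160x for BLAST, 38x for CMS, and 78x for HF.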
![Page 34: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/34.jpg)
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
![Page 35: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/35.jpg)
Related Work
• Workflow management
• Dependency managers: TREC, make
• Private namespaces: UFO, db views
• Cooperative caching: no writes.
• P2P systems: wrong semantics.
• Filesystems: overly strong
![Page 36: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/36.jpg)
Why Not AFS+Make?
• Namespaces
  – Constructed per-process at submit-time.
• Consistency
  – Enforced at the workflow level.
• Selective Commit
  – Everything tossed unless explicitly saved.
• Fault Awareness
  – CPUs and data can be lost at any point.
• Practicality
  – No special permission required.
![Page 37: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/37.jpg)
Conclusions
• Traditional systems build from the bottom up: this disk must have five nines, or we’re in big trouble!
• MFS builds from the top down: application semantics drive system structure.
• By posing the right problem, we solve the traditional hard problems of file systems.
![Page 38: John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e3d550346895d93b60f/html5/thumbnails/38.jpg)
For More Info...
• Paper in progress...
• Application study:
  – “Pipeline and Batch Sharing in Grid Workloads”, to appear in HPDC-2003.
  – www.cs.wisc.edu/condor/doc/profiling.ps
• Talk to us!
• Questions now?