Real-World Barriers to Scaling Up Scientific Applications
Douglas Thain, University of Notre Dame. Trends in HPDC Workshop, Vrije University, March 2012. The Cooperative Computing Lab.
![Page 1: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/1.jpg)
Real-World Barriers to Scaling Up Scientific Applications
Douglas Thain, University of Notre Dame
Trends in HPDC Workshop, Vrije University, March 2012
![Page 2: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/2.jpg)
The Cooperative Computing Lab
We collaborate with people who have large-scale computing problems in science, engineering, and other fields.
We operate computer systems on the scale of O(1000) cores: clusters, clouds, grids.
We conduct computer science research in the context of real people and problems.
We release open source software for large-scale distributed computing.
http://www.nd.edu/~ccl
![Page 3: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/3.jpg)
Our Application Communities
Bioinformatics: I just ran a tissue sample through a sequencing device. I need to assemble 1M DNA strings into a genome, then compare it against a library of known human genomes to find the differences.
Biometrics: I invented a new way of matching iris images from surveillance video. I need to test it on 1M high-resolution images to see if it actually works.
Molecular Dynamics: I have a new method of energy sampling for ensemble techniques. I want to try it out on 100 different molecules at 1000 different temperatures for 10,000 random trials.
![Page 4: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/4.jpg)
The Good News: Computing is Plentiful!
![Page 6: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/6.jpg)
![Page 7: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/7.jpg)
greencloud.crc.nd.edu
![Page 8: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/8.jpg)
Superclusters by the Hour
http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
![Page 9: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/9.jpg)
The Bad News: It is inconvenient.
![Page 10: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/10.jpg)
I have a standard, debugged, trusted application that runs on my laptop.
A toy problem completes in one hour.
A real problem will take a month (I think).
Can I get a single result faster?
Can I get more results in the same time?
Last year, I heard about this grid thing.
This year, I heard about this cloud thing.
What do I do next?
![Page 11: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/11.jpg)
What users want.
What they get.
![Page 12: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/12.jpg)
The Traditional Application Model?
Every program attempts to grow until it can read mail.
- Jamie Zawinski
![Page 13: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/13.jpg)
What goes wrong? Everything!
Scaling up from 10 to 10,000 tasks violates ten different hard-coded limits in the kernel, the filesystem, the network, and the application.
Failures are everywhere! Exposing error messages is confusing, but hiding errors causes unbounded delays.
User didn’t know that program relies on 1TB of configuration files, all scattered around the home filesystem.
User discovers that the program only runs correctly on Blue Sock Linux 3.2.4.7.8.2.3.5.1!
User discovers that program generates different results when run on different machines.
![Page 14: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/14.jpg)
Abstractions for Scalable Apps
[Figure: three abstractions. All-Pairs (Regular Graph): a function F applied across inputs A1, A2, A3 and B1, B2, B3. Makeflow (Irregular Graph): a graph of nodes labeled A through E and 1 through 10. Work Queue (Dynamic Graph): the loop below.]

    while( more work to do ) {
        foreach work unit {
            t = create_task();
            submit_task(t);
        }
        t = wait_for_task();
        process_result(t);
    }
![Page 15: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/15.jpg)
Work Queue API

    wq_task_create( files and program );
    wq_task_submit( queue, task );
    wq_task_wait( queue ) -> task

C implementation + Python and Perl
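
The submit-and-wait pattern can be sketched in a few lines of Python. This is only an in-process stand-in that assumes a thread pool in place of remote workers; it does not use the Work Queue library, and `run_dynamic_workload`, `work_fn`, and `refine_fn` are hypothetical names chosen for illustration. The point it shows is that a finished task may create new tasks, which is what makes the graph dynamic.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_dynamic_workload(initial_units, work_fn, refine_fn, max_workers=4):
    """Submit work units, wait for results, and let results generate more work.

    work_fn(unit) computes one result; refine_fn(unit, result) returns a list
    of follow-up units (possibly empty). Both are hypothetical, caller-supplied.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # "submit_task": every known work unit goes into the queue.
        pending = {pool.submit(work_fn, u): u for u in initial_units}
        while pending:  # "while more work to do"
            # "wait_for_task": block until at least one task finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                unit = pending.pop(future)
                result = future.result()
                results[unit] = result  # "process_result"
                # A result may create new work, so the graph grows at run time.
                for new_unit in refine_fn(unit, result):
                    if new_unit not in results and new_unit not in pending.values():
                        pending[pool.submit(work_fn, new_unit)] = new_unit
    return results


if __name__ == "__main__":
    # Toy workload: square each number; numbers below 3 spawn a successor.
    out = run_dynamic_workload(
        [0],
        work_fn=lambda n: n * n,
        refine_fn=lambda n, r: [n + 1] if n < 3 else [],
    )
    print(sorted(out.items()))
```

Because the set of tasks depends on earlier results, the total amount of work is not known when the first task is submitted.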
![Page 16: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/16.jpg)
Work Queue Apps
[Figure: two applications built on Work Queue. Ensemble MD, shown at T=10K, T=20K, T=30K, and T=40K. SAND: raw sequence data (AGTCACACTGTACGTAGAAGTCACACTGTACGTAA…) passes through hundreds of Align tasks to produce a fully assembled genome.]
![Page 17: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/17.jpg)
An Old Idea: Make

    part1 part2 part3: input.data split.py
            ./split.py input.data

    out1: part1 mysim.exe
            ./mysim.exe part1 >out1

    out2: part2 mysim.exe
            ./mysim.exe part2 >out2

    out3: part3 mysim.exe
            ./mysim.exe part3 >out3

    result: out1 out2 out3 join.py
            ./join.py out1 out2 out3 > result
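
One way to see why these rules parallelize is to group the files into stages by dependency. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); it is an illustration of the idea, not how Makeflow itself is implemented, and `DEPENDENCIES` and `parallel_stages` are names invented here. Programs such as `split.py` are left out of the graph for brevity, and the grouping is by file, so the three parts appear separately even though one `split.py` command produces them.

```python
from graphlib import TopologicalSorter

# Each output file maps to the data files it depends on (from the rules above).
DEPENDENCIES = {
    "part1": {"input.data"},
    "part2": {"input.data"},
    "part3": {"input.data"},
    "out1": {"part1"},
    "out2": {"part2"},
    "out3": {"part3"},
    "result": {"out1", "out2", "out3"},
}


def parallel_stages(dependencies):
    """Group files into stages; files within one stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        # Everything returned here has all of its prerequisites satisfied.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(DEPENDENCIES)):
        print(number, stage)
```

The three `mysim.exe` runs land in the same stage, so they can be dispatched to three different machines at once.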
![Page 18: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/18.jpg)
Makeflow Applications
![Page 19: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/19.jpg)
Why Users Like Makeflow
Use existing applications without change.
Use an existing language everyone knows. (Some apps are already in Make.)
Via Workers, harness all available resources: desktop to cluster to cloud.
Transparent fault tolerance means you can harness unreliable resources.
Transparent data movement means no shared filesystem is required.
![Page 20: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/20.jpg)
[Figure: Work Queue Overlay. An application with local files and programs uses the Work Queue API to submit tasks to hundreds of workers (W) in a personal cloud. Workers are started with sge_submit_workers, condor_submit_workers, and ssh across a private cluster, a campus Condor pool, a public cloud provider, and a shared SGE cluster.]
![Page 21: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/21.jpg)
Elastic Application Stack
[Figure: All-Pairs, Wavefront, Makeflow, and custom apps sit on top of the Work Queue library, which drives hundreds of workers (W) in a personal cloud spanning a private cluster, a campus Condor pool, a public cloud provider, and a shared SGE cluster.]
![Page 22: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/22.jpg)
The Elastic Application Curse
Just because you can run on a thousand cores… doesn’t mean you should!
The user must make an informed decision about scale, cost, performance, and efficiency.
(Obligatory halting problem reference.)
Can the computing framework help the user to make an informed decision?
![Page 23: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/23.jpg)
[Figure: Input Files, Elastic App, and Service Provider; plots of Runtime and Cost versus Scale of System.]
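
To make the scale-versus-cost trade-off concrete, here is a deliberately simple model. Every part of it is an assumption made for illustration and none of it comes from the talk: tasks are equal-length and independent, they run in waves of `workers` at a time, there is a fixed startup overhead, and every worker is billed for the whole run. The function name `estimate` and the prices are hypothetical.

```python
import math


def estimate(num_tasks, task_seconds, workers, startup_seconds=0.0, price_per_core_hour=0.0):
    """Return (runtime_seconds, cost) for a bag of equal-length independent tasks.

    Assumes tasks run in waves of `workers` at a time, every worker is billed
    for the whole run, and startup_seconds is a fixed serial overhead.
    """
    waves = math.ceil(num_tasks / workers)
    runtime = startup_seconds + waves * task_seconds
    cost = workers * (runtime / 3600.0) * price_per_core_hour
    return runtime, cost


if __name__ == "__main__":
    # 1000 one-hour tasks, ten minutes of startup, a made-up price of $0.10 per core-hour.
    for workers in (1, 10, 100, 1000, 10000):
        runtime, cost = estimate(1000, 3600, workers, startup_seconds=600, price_per_core_hour=0.10)
        print(f"{workers:>6} workers: {runtime / 3600:8.2f} h, ${cost:9.2f}")
```

Under these assumptions, going from 1,000 to 10,000 workers does not shorten the run at all but multiplies the bill by ten, which is the kind of outcome a framework could warn about before the user commits.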
![Page 24: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/24.jpg)
Abstractions for Scalable Apps
All-Pairs (Regular Graph), Makeflow (Irregular Graph), Work Queue (Dynamic Graph)
![Page 25: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/25.jpg)
Different Techniques
For regular graphs (All-Pairs), almost everything is known and accurate predictions can be made.
For irregular graphs (Makeflow), input data and cardinality are known, but time and intermediate data are not.
For dynamic programs (Work Queue), prediction is not possible in the general case. (Runtime controls are still possible.)
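
The regular case is easy to illustrate: an All-Pairs workload applies one function to every pairing of two input sets, so the number of tasks is known before anything runs. The sketch below is a single-process illustration, not the CCL All-Pairs tool, and `all_pairs` and `predicted_task_count` are hypothetical helper names.

```python
from itertools import product


def all_pairs(set_a, set_b, compare):
    """Apply compare(a, b) to every pair, returning a dict keyed by (a, b)."""
    return {(a, b): compare(a, b) for a, b in product(set_a, set_b)}


def predicted_task_count(set_a, set_b):
    """The size of a regular graph is known in advance: |A| * |B|."""
    return len(set_a) * len(set_b)


if __name__ == "__main__":
    a_items = ["A1", "A2", "A3"]
    b_items = ["B1", "B2", "B3"]
    # Toy comparison function standing in for F.
    results = all_pairs(a_items, b_items, lambda a, b: a + b)
    print(predicted_task_count(a_items, b_items), len(results))
```

Given a measured time per comparison, the task count alone yields a runtime estimate, which is not possible when tasks are generated on the fly.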
![Page 26: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/26.jpg)
Is there a Chomsky Hierarchy for Distributed Computing?

| Language Construct | Required Machine |
| --- | --- |
| Regular Expressions | Finite State Machines |
| Context Free Grammar | Pushdown Automata |
| Context Sensitive Grammar | Turing Machine |
![Page 27: Real-World Barriers to Scaling Up Scientific Applications](https://reader035.vdocument.in/reader035/viewer/2022081515/568168a6550346895ddf3fda/html5/thumbnails/27.jpg)
Papers, Software, Manuals, …
http://www.nd.edu/~ccl
This work was supported by NSF Grants CCF-0621434, CNS-0643229, and CNS 08554087.