High Level Abstractions for Data-Intensive Computing
Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain
University of Notre Dame
Christopher Moretti – University of Notre Dame, 4/30/2008
Computing’s central challenge, “How not to make a mess of it,” has not yet been met.
- Edsger Dijkstra
Overview
Many systems today give end users access to hundreds or thousands of CPUs.
But it is far too easy for the naive user to create a big mess in the process.
Our Solution: Deploy high-level abstractions that describe both data and computation needs.
Some examples of current work:
All-Pairs: An abstraction for biometric workloads.
Distributed Ensemble Classification
DataLab: A system and language for data-parallel computation.
Distributed Computing is Hard!
How do I fit my workload into jobs?
Which resources?
What happens when things fail?
How many?
What is Condor?
What do I do with the results?
How can I measure job stats?
What about job input data?
How long will it take?
ARGH!
The All-Pairs Problem
All-Pairs(Set S1, Set S2, Function F) yields a matrix M:
M[i,j] = F(S1[i], S2[j])
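The definition above can be sketched in a few lines of Python. This is a minimal, sequential illustration of the abstraction only, not the production engine; the names `all_pairs` and `abs_diff` are made up for the example.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def all_pairs(s1: Sequence[A], s2: Sequence[B], f: Callable[[A, B], R]) -> List[List[R]]:
    """Return the matrix M with M[i][j] = f(s1[i], s2[j])."""
    return [[f(a, b) for b in s2] for a in s1]


def abs_diff(a: int, b: int) -> int:
    """Toy stand-in for a real comparison function such as a biometric matcher."""
    return abs(a - b)


# Two small sets of integers stand in for the two sets of files.
matrix = all_pairs([1, 5], [2, 4, 9], abs_diff)
```

A real workload would pass file names and a comparison program instead of integers, but the shape of the result is the same: one row per member of S1 and one column per member of S2.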
60K images of 20KB each: >1GB of input
60K x 60K = 3.6B comparisons
3.6B comparisons @ 50/s = 2.3 CPU-years
3.6B results x 8 bytes of output each = 29GB
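The figures above follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1KB = 1000 bytes, 1GB = 10^9 bytes) and a 365-day year; those unit choices are assumptions, not something the slide states.

```python
images = 60_000                 # number of images in the set
image_bytes = 20_000            # 20KB per image (decimal units assumed)
rate_per_second = 50            # comparisons per second on one CPU
result_bytes = 8                # bytes of output per comparison

input_gb = images * image_bytes / 1e9
comparisons = images * images   # every image against every image
seconds_per_year = 365 * 24 * 3600
cpu_years = comparisons / rate_per_second / seconds_per_year
output_gb = comparisons * result_bytes / 1e9

print(f"input: {input_gb:.1f} GB")
print(f"comparisons: {comparisons:,}")
print(f"CPU time: {cpu_years:.1f} years")
print(f"output: {output_gb:.1f} GB")
```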
[Figure: matrix M indexed by S1[1]..S1[7] and S2[1]..S2[7]]
Biometric All-Pairs Comparison
[Figure: upper-triangular matrix of scores produced by F, as extracted:]
1 .8 .1 0 0 .1
1 0 .1 .1 0
1 0 .1 .7
1 0
1 .1
1
Naïve Mistakes
Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager.
[Figure: a Batch System with four CPUs sharing one fileserver. Each CPU reads 10TB!]
For all $X:
    For all $Y:
        cmp $X to $Y
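The nested loop above is easy to write, and its cost is easy to miss. The sketch below counts file-server reads under two assumptions made purely for illustration: in the naive case every `cmp` fetches both operands from the shared file server with no caching, and in the staged case each node copies every file to local disk once. The function names are hypothetical.

```python
from itertools import product


def naive_fileserver_reads(xs, ys):
    """Count reads when each comparison fetches both files from the server."""
    reads = 0
    for _x, _y in product(xs, ys):
        reads += 2  # one read for $X, one for $Y, then cmp
    return reads


def staged_fileserver_reads(xs, ys, n_nodes):
    """Count reads when each node copies every file once, then works locally."""
    return (len(xs) + len(ys)) * n_nodes


files = [f"img{i}" for i in range(100)]
naive = naive_fileserver_reads(files, files)
staged = staged_fileserver_reads(files, files, n_nodes=4)
print(naive, staged)
```

Under these assumptions the naive count grows with the product of the set sizes, while the staged count grows with their sum, which is why the shared file server becomes the bottleneck first.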
Consequences of Naïve Mistakes
All-Pairs Abstraction
set S of files
binary function F
invocation: M = AllPairs(F,S)
All-Pairs Production System at Notre Dame
[Figure: Web Portal (F, G, H; S, T), All-Pairs Engine, and 300 active storage units (500 CPUs, 40TB disk) holding copies of F]
1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
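Step 3 distributes the data in O(log n) time by spanning tree. One way to see where the logarithm comes from is the sketch below, which assumes a doubling schedule: in each round, every node that already holds the data forwards it to one node that does not. The slide does not specify the tree shape, so this schedule and the function name are assumptions.

```python
def spanning_tree_schedule(nodes):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    nodes[0] is the source that holds the data at the start.
    """
    holders = [nodes[0]]
    waiting = list(nodes[1:])
    rounds = []
    while waiting:
        transfers = []
        # Every current holder sends to at most one waiting node this round.
        for sender in list(holders):
            if not waiting:
                break
            receiver = waiting.pop(0)
            transfers.append((sender, receiver))
        holders.extend(receiver for _, receiver in transfers)
        rounds.append(transfers)
    return rounds


small = spanning_tree_schedule(["portal", "n1", "n2", "n3", "n4", "n5"])
large = spanning_tree_schedule(["portal"] + [f"node{i}" for i in range(1, 301)])
print(len(small), len(large))
```

Because the number of holders doubles each round, 300 storage units are reached in 9 rounds under this schedule, instead of 300 sequential copies from the source.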
Returning the Result Matrix
[Figure: sample result values 4.37, 6.01, 2.22, 7.13, 8.94, 6.72, 1.34, …, 0.98]
Too many files. Hard to do prefetching.
Too large files. Must scan entire file.
Row/Column ordered. How can we build it?
Result Storage by Abstraction
chirp_array lets users create, manage, and modify large arrays without having to know the underlying storage layout.
Operations on chirp_array:
create a chirp_array
open a chirp_array
set value A[i,j]
get value A[i,j]
get row A[i]
get column A[j]
set row A[i]
set column A[j]
[Figure: three CPU/Disk nodes, with two X marks]
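The operations listed above can be illustrated with a small in-memory stand-in. This is not the chirp_array API: the class `ArraySketch` and its method names are hypothetical, the data lives in a single Python list in row-major order, and "open" is omitted because nothing is persisted. It only shows how a caller can work with values, rows, and columns without seeing the layout.

```python
class ArraySketch:
    """Hypothetical in-memory 2-D array that hides a row-major layout."""

    def __init__(self, rows, cols, fill=0.0):
        self.rows = rows
        self.cols = cols
        self._data = [fill] * (rows * cols)

    def _index(self, i, j):
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError(f"({i}, {j}) out of range")
        return i * self.cols + j

    def set(self, i, j, value):
        self._data[self._index(i, j)] = value

    def get(self, i, j):
        return self._data[self._index(i, j)]

    def get_row(self, i):
        start = self._index(i, 0)
        return self._data[start:start + self.cols]

    def get_col(self, j):
        self._index(0, j)  # bounds check only
        return self._data[j::self.cols]

    def set_row(self, i, values):
        if len(values) != self.cols:
            raise ValueError("row has the wrong length")
        start = self._index(i, 0)
        self._data[start:start + self.cols] = values

    def set_col(self, j, values):
        if len(values) != self.rows:
            raise ValueError("column has the wrong length")
        self._index(0, j)  # bounds check only
        self._data[j::self.cols] = values


a = ArraySketch(2, 3)
a.set(0, 1, 4.37)
a.set_row(1, [7.13, 8.94, 6.72])
a.set_col(0, [1.34, 0.98])
```

A distributed version could keep the same interface while splitting the data across many disks, which is the point of storing results behind an abstraction.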
Result Storage with chirp_array
[Figure: three CPU/Disk nodes]
chirp_array_get(i,j)
Data Mining on Large Data Sets
Problem: Supercomputers are expensive, and not all scientists have access to them for very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.
Data Mining Using Ensembles
(From Steinhaeuser and Chawla, 2007)
[Figure: training data -> partitioning/sampling (optional) -> algorithm 1 … algorithm n -> classifier 1 … classifier n; a test instance goes to each classifier -> voting -> classification]
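The pipeline in the figure can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the method from the cited work: the data are one-dimensional points with labels, every partition is trained with the same simple 1-nearest-neighbour rule, and the final answer is a plain majority vote. All function names are made up.

```python
from collections import Counter


def partition(data, n):
    """Split the training data into n interleaved partitions."""
    return [data[i::n] for i in range(n)]


def train_1nn(part):
    """Return a classifier that labels x like its nearest training example."""
    def classify(x):
        nearest = min(part, key=lambda example: abs(example[0] - x))
        return nearest[1]
    return classify


def ensemble_classify(classifiers, x):
    """Collect one vote per classifier and return the most common label."""
    votes = Counter(classifier(x) for classifier in classifiers)
    return votes.most_common(1)[0][0]


training = [(0.0, "a"), (0.2, "a"), (0.4, "a"),
            (1.0, "b"), (1.2, "b"), (1.4, "b")]
classifiers = [train_1nn(part) for part in partition(training, 3)]
prediction = ensemble_classify(classifiers, 0.65)
```

Each classifier sees only a third of the training data, so no single node needs to hold the full set in memory; the vote combines their local answers.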
Abstraction for Ensembles Using Natural Parallelism
Here are my algorithms. Here is my data set. Here is my test set.
Choose optimal partitioning and submit batch jobs.
Return local votes for tabulation and final prediction.
[Figure: Abstraction Engine dispatching to four CPUs, which return Local Votes]
DataLab Abstractions
[Figure: three panels.
file system: tcsh, emacs, perl over parrot; chirp servers over unix filesys.
distributed data structures: set S (A, B, C) and file F on chirp servers.
function evaluation: Y = F(X) on a chirp server, with job_start, job_commit, job_wait, job_remove.]
DataLab Language Syntax
apply F on S into T
[Figure: F runs on each chirp server holding a member of set S (A, B, C) and produces the matching member of set T (A, B, C)]
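The statement `apply F on S into T` can be read as a map over a named set. The sketch below is a local, sequential model of that reading, not DataLab itself: sets are modelled as dictionaries from member name to content, and `apply_on_set` is a hypothetical helper.

```python
def apply_on_set(f, s):
    """Evaluate f on every member of s and return a new set with the same names."""
    return {name: f(value) for name, value in s.items()}


# Set S with members A, B, C; str.upper stands in for the user's function F.
S = {"A": "acgt", "B": "ttga", "C": "ccaa"}
T = apply_on_set(str.upper, S)
print(T)
```

In the distributed setting shown on the slide, each evaluation would run on the server that already holds the member, so the data does not have to move to the function.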
For More Information
Christopher Moretti [email protected]
Douglas Thain [email protected]
Cooperative Computing Lab http://cse.nd.edu/~ccl