Cloud Computing Wrap-up - KTH
TRANSCRIPT
Cloud Computing Wrap-up
Åke Edlund, PhD
KTH PDC Center for High Performance Computing
KTH PDC Cloud Group
You’ve come a long way - now you can start using Amazon on your own
• Introduction to cloud computing
• Overview (hands-on testing)
• Review of the Amazon management console
• Launch of first instance (pre-defined image)
• Customizing images
• Cloning an image and relaunching a new instance using the image
• SNIC Cloud (added)
• MapReduce overview
• Amazon Simple Workflow (SWF) overview
• Galaxy on Amazon
• Q&A, more hands-on testing
MapReduce, SWF, PDC Cloud ... and much more...
• Depending on interest, we may in the future offer workshops on some of these more specific topics.
• Read more about MapReduce, e.g. at: http://hadoop.apache.org/mapreduce
MapReduce - a framework for processing easily parallelized problems across huge data sets ('Big Data') on a large number of nodes.
Hadoop - open source MapReduce
Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.
Here's what makes Hadoop especially useful:
• Scalable: Hadoop can reliably store and process petabytes.
• Economical: it distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
• Efficient: by distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
• Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
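The reliability bullet rests on block replication: each data block lives on several nodes, so one node failure loses nothing. A toy Python sketch of the idea (illustrative only - real HDFS placement also considers racks and cluster load; the function names here are made up, not a Hadoop API):

```python
# Toy sketch of HDFS-style block replication: each data block is stored on
# several distinct nodes, so losing one node loses no data.
# Illustrative only; not how Hadoop actually chooses placements.
REPLICATION = 3

def place_blocks(blocks, nodes):
    # Assign each block to REPLICATION distinct nodes, round-robin.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(REPLICATION)}
    return placement

def survives_failure(placement, failed_node):
    # Data survives if every block still has at least one live replica.
    return all(replicas - {failed_node} for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blockA", "blockB", "blockC"], nodes)
print(survives_failure(placement, "node2"))  # True
```

With three replicas per block, any single-node failure leaves at least two live copies, which is what lets Hadoop redeploy tasks instead of losing data.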
1. Map - divide the input into smaller subproblems and distribute them to worker nodes (this can be repeated over many levels).
2. Process - each worker node processes its smaller problem and passes the answer back to its master node (there can be many masters if this was done over many levels).
3. Reduce - the master node collects the answers to all the subproblems and combines them to produce the final answer.
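The three steps above can be simulated on a single machine; here is a minimal Python sketch using word counting as the example problem (the function names and structure are illustrative, not the API of Hadoop or any real framework):

```python
# Minimal single-machine simulation of map / process / reduce,
# using word counting as the example. Illustrative only.
from collections import defaultdict

def map_phase(chunk):
    # Map: turn one input chunk into intermediate (key, value) pairs.
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(key, values):
    # Reduce: combine all values that share a key into one answer.
    return sum(values)

def map_reduce(chunks):
    # "Worker nodes" process chunks independently; here we just loop.
    intermediate = defaultdict(list)
    for chunk in chunks:
        for key, value in map_phase(chunk):
            intermediate[key].append(value)
    # The "master" collects the partial answers and combines them.
    return {key: reduce_phase(key, values) for key, values in intermediate.items()}

chunks = ["the cloud is elastic", "the grid and the cloud"]
counts = map_reduce(chunks)
print(counts["the"])    # 3
print(counts["cloud"])  # 2
```

In a real deployment the chunks would live on different nodes and the intermediate pairs would be shuffled across the network, but the map and reduce functions keep exactly this shape.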
[Figure: computation patterns and their application domains - (a) Map Only: BLAST analysis, Smith-Waterman distances, parametric sweeps, PolarGrid Matlab data analysis; (b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval; (c) Iterative MapReduce: expectation-maximization clustering (e.g. K-means), linear algebra, multidimensional scaling, PageRank; (d) Loosely Synchronous: many MPI scientific applications such as solving differential equations and particle dynamics. Caption: the domain of MapReduce and iterative extensions vs. MPI.]
Twister (Geoffrey Fox, Indiana) - Iterative MapReduce: http://salsahpc.indiana.edu/tutorial/twister-intro.html
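To see why an iterative extension matters, consider K-means clustering, one of the iterative applications in the figure: every iteration is one map step (assign each point to its nearest centroid) and one reduce step (average the points in each group). A minimal 1-D Python sketch of that loop (illustrative only - not the Twister or Hadoop API):

```python
# Iterative MapReduce sketch: 1-D K-means. Each iteration runs a map step
# (assign every point to its nearest centroid) and a reduce step (average
# the points assigned to each centroid). Illustrative only.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Map: emit (nearest_centroid_index, point) pairs, grouped by key.
        groups = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Reduce: new centroid = mean of the points assigned to it.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in groups.items()]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(points, [0.0, 5.0]))  # [1.0, 9.0]
```

Classic MapReduce would reread the input from disk for every one of these iterations; systems like Twister (and Spark, below) keep the data resident between iterations, which is where their speedup comes from.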
[Figure: logistic regression performance - running time (s) vs. number of iterations (1 to 30). Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.]
Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Figure: two clusters of + and - points, with a random initial line and the target separating line.]
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D)  // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
Logistic regression applies the same MapReduce operation repeatedly to the same dataset, so it benefits greatly from caching the input data in RAM across iterations.
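The iteration structure of the Spark snippet can be mirrored in plain Python, without a cluster, to make it concrete. This sketch uses the same gradient expression; the data set, learning rate, and dimensions are made up for illustration:

```python
# Plain-Python version of the iterative logistic-regression loop:
# same gradient formula as the Spark snippet, over an in-memory list,
# with a learning rate added. The 2-D data set here is made up.
from math import exp

# Each point: (x, y) with feature vector x and label y in {-1, +1}.
points = [((1.0, 2.0), 1), ((2.0, 1.5), 1),
          ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w = [0.0, 0.0]  # current separating plane
ITERATIONS, LEARNING_RATE = 100, 0.1

for _ in range(ITERATIONS):
    # "Map" over the points, summing per-point gradient contributions.
    gradient = [0.0, 0.0]
    for x, y in points:
        dot = w[0] * x[0] + w[1] * x[1]
        scale = (1 / (1 + exp(-y * dot)) - 1) * y
        gradient[0] += scale * x[0]
        gradient[1] += scale * x[1]
    # Gradient step ("reduce" result applied by the master).
    w = [wi - LEARNING_RATE * gi for wi, gi in zip(w, gradient)]

# Check: does sign(w . x) match the label for every point?
print(all((w[0] * x[0] + w[1] * x[1] > 0) == (y > 0) for x, y in points))  # True
```

Note that `points` is reused unchanged in every iteration - exactly the access pattern that makes Spark's `cache()` pay off.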
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Spark: Fast, Interactive, Language-Integrated Cluster Computing
UC Berkeley, www.spark-project.org
SNIC Cloud
The SNIC Cloud project helps Swedish and Norwegian researchers use public cloud in their research - without their needing to know many technical cloud details. In the first phase we are focusing on Amazon Web Services; later we will expand by adding other public cloud offerings as well as private clouds.
Each SNIC center has:
• A budget of 250 kSEK for helping their users onto SNIC Cloud.
• A budget of 83 kSEK for running on SNIC Cloud (= Amazon costs, managed by PDC).
[Figure: SNIC Cloud architecture - a web portal giving access, via federation (SSO, x.509) and using zones, with instance isolation, to both Amazon and private clouds. (1) OpenNebula Sunstone, via federation, using oZones. (2) NEW: an engine pulling data from Amazon and the private clouds, plus a new monetizing engine tracking costs. Interfaces include OCCI, the EC2 API, and IAM.]
SNIC Cloud - Goal
Together with UNINETT SIGMA (Norway)
SNIC Cloud - Current
3 levels for now:
• Level 3 - AWS User (isolation)
• Level 2 - AWS User (no isolation)
• Level 1 - AWS Instance (ssh, isolation)
Stay updated!
SNIC Cloud & SNIC Cloud @ PDC
Åke Edlund, [email protected]
Zeeshan Ali Shah, [email protected]
Gholami, [email protected]
SNIC Cloud @ UPPMAXHans Karlsson, [email protected]
www.pdc.kth.se/resources/computers/swecloud
www.pdc.kth.se/resources/computers/pdc-cloud
www.pdc.kth.se/resources/computers/swecloud/cloud-computing-course-101
Slides, updates: