machine learning in the cloud with spark - cl.cam.ac.uk · machine learning in the cloud with spark...
TRANSCRIPT
![Page 1: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/1.jpg)
Machine Learning in the Cloud with Spark
R212 Data Centric Systems and NetworkingOpen Source Project Study
by Haikal Pribadi
![Page 2: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/2.jpg)
Machine Learning in the Cloud with Spark
Problem domain
Motivation and Contribution
Project Goal
Project Evaluation
![Page 3: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/3.jpg)
Converging Trends
![Page 4: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/4.jpg)
Converging Trends
![Page 5: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/5.jpg)
Converging Trends
![Page 6: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/6.jpg)
Converging Trends
![Page 7: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/7.jpg)
Converging Trends
![Page 8: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/8.jpg)
Motivation and Contribution
![Page 9: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/9.jpg)
Why Scale Up to the Cloud?
Input data size– Large training data instances
● e.g DryadLINQ and MapReduce
– High input dimensionality● e.g GraphLab and GraphChi
![Page 10: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/10.jpg)
Why Scale Up to the Cloud?
Complexity of data and computation– Data complexity brings algorithm comlexity
● e.g. PLANET (on top of MapReduce)
![Page 11: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/11.jpg)
Why Scale Up to the Cloud?
Time constraint and parameter tuning– Distribute system usage to increase throughput– Model and hyper-parameter selection are repetitive
and independent
![Page 12: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/12.jpg)
Why Scale Up to the Cloud?
Data Parallelism– MapReduce, GraphLab
Task Parallelism– Multicores, GPUs (CUDA), MPI
or perhaps Hybrid?– Spark, GraphLab, DryadLINQ
![Page 13: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/13.jpg)
Problem?
With the various option of distributed architectures, implementing different machine systems become very task-specific
Different architectures brings different benefits and constraints
![Page 14: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/14.jpg)
Spark + MLbase
Unified scalable machine learning
Suitable for many common Machine Learning problem
(project hypothesis)
![Page 15: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/15.jpg)
Project Goal
![Page 16: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/16.jpg)
Develop Mainstream ML
Evaluate Spark+MLbase on developing common Machine Learning problems
![Page 17: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/17.jpg)
Develop Mainstream ML
Classification– e.g. Bayesian classifier for Spam Filtering
Clustering– e.g. k-means clustering for market segmentation
Regression– e.g. Linear regression on weather forecasting
Collaborative Filtering– Alternating Least Square for recommendation systems
![Page 18: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/18.jpg)
Project Goal
Platform– Amazon EC2
Run time– Spark
Application– MLbase
![Page 19: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/19.jpg)
Project Evaluation
![Page 20: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/20.jpg)
Evaluation and Analysis
Parallelism– Granularity of data parallelism and task parallelism
Algorithm complexity– Complexity of customization
Programming paradigm– Learning curve, expressiveness and dataflow
Dataset distribution– Management of large dataset
![Page 21: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi](https://reader030.vdocument.in/reader030/viewer/2022040613/5f07f1977e708231d41f8abe/html5/thumbnails/21.jpg)
Performance Comparison
Learn a most suitable algorithm to be come a benchmark
Compare performance with [e.g.] GraphLab– Speed (sequential runs)– Scalability (efficiency increase with parallelism)– Throughput (t ime / input size)